elasticsearch_logoAfter setting up ElasticSearch server, it’s a primary requirement to create Index for all the data which is on MS SQL Server. ElasticSearch provides ‘ElasticSearch River’ for this purpose,  and we can create river between ElasticSearch and MS SQL Server for automatic data transfer.

A river between ElasticSearch Server and MS SQL Server can be created by following these steps –

1) Get JDBC river for Elasticsearch –

First of all you need to install ElasticSearch river for JDBC, which does not come in the default installation of ElasticSearch. You can get this river from GitHub Page here. Process to download and install this river is very easy.

Open Command Prompt in ElasticSearch installation folder and go to ‘/bin’ folder, here you will see plugin.bat file which we will use.

Run this command here  –

plugin -install river-jdbc -url http://bit.ly/I7kz9q

(here update this URL with the latest URL given on GitHub page.)

 2) Install JDBC Driver –

ElasticSearch is a Software created in JAVA, therefore river requires JDBC driver to connect to MS Sql Server. You can get Microsoft JDBC Driver from here.

After installation either Copy the ‘sqljdbc4.jar’ into the lib directory of ElasticSearch installation folder or Add JDBC Driver location to CLASSPATH variable of Computer (under environment variables setting).

 3) Create River –

Now comes the most interesting step, which is to create the actual river using command. Following is the sample command which can be used for River creation between ElasticSearch and SQL Server –

——————————————————————————————

PUT /_river/mytest_river/_meta
{
“type” : “jdbc”,
“jdbc” : {
“driver”:”com.microsoft.sqlserver.jdbc.SQLServerDriver”,
“url”:”jdbc:sqlserver://MYSQLSERVERNAME;databaseName=MYProductDatabase”,
“user”:”admin”,”password”:”Password”,
“sql”:”select ProductID as _id, CategoryID,ManufacturerID,MfName,ProductTitle,MfgPartNumber from MyProductsTable(nolock)”,
“poll”:”10m”,
“strategy” : “simple”,
“index” : “myinventory”,
“type” : “product”,

“bulk_size” : 100,

“max_retries”: 5,

“max_retries_wait”:”30s”,

“max_bulk_requests” : 5,

“bulk_flush_interval” : “5s”
}
}

——————————————————————————————-

Quick description about the items used here in script –

– ‘mytest_river’ is the name of river which we are creating, this name is important because you will need it for deleting the river.

– Type defines what type of river we are creating, because river can be set between different data sources such as JDBC, Twitter, CouchDB etc.

– The JDBC part is very important, here you specify the connection type, connection string and SQL Query which will be used to fetch the data.

– Index defines name of the index which will be used for data insertion.

– Here bulk_size is the batch size which will be used while fetching the data from SQL Server, so if you define it 500, river will fetch data in a batch of 500 till all data is fetched.

– In SQL Query two points are important, first you should define Primary Key column as _id (notice the first part of sample query – ‘select ProductID as _id’) otherwise if you do polling, every time river will create duplicate indexes for same data.

– Second you should use nolock with table name if you are doing only Select query, as this will not lock your SQL Table while reading data.

– Strategy used here is ‘Simple’ with Poll interval (here its 10 minutes). This will keep checking data for any changes after every 10 minutes. If you to fetch data only one time then define strategy as ‘oneshot’ and do not give any polling time.

Detailed knowledge of all parameters can be checked here on GitHub.

4) Delete a River –

If required a river can be deleted using following command, at beginning I deleted river and indexes multiple times before I got them right  –

DELETE /_river/ mytest_river

5) Delete Index –

ElasticSearch index can be deleted using following command –

DELETE /myinventory/

Once you setup the river, ElasticSearch will start indexing your SQL Server data by fetching it in batches. Once all indexes are created, ElasticSearch River will keep polling and updating index data on the specified interval.

However now rivers are going to be deprecated soon by ElasticSearch, so its advisable to not rely on them for syncing data between data source and ElasticSearch. Always create your own service to update ElasticSearch whenever main source data is updated. Use River for only one time pull of data.