Thursday, 6 November 2014

Easily compare AWS instance prices

Hi all,

It's been a while ...
I know, a lot of work. I'm currently working on data federation with JBoss Teiid as well as with Composite Software.
Will write some articles about that ...

For the moment, I want to share a very handy webpage that compares AWS instance prices. You can adjust the variables (location, memory, CPU, etc.) and the prices update accordingly.
Really handy.

Thursday, 29 May 2014

Eclipse Trader !

Hi all,

I’m currently looking at open financial / stock data feeds. There are a number of things available here ! Want to build a Reuters- or Bloomberg-like terminal ? Have a look at this Eclipse plugin !

Features are great :

  • Realtime Quotes
  • Intraday Charts
  • History Charts
  • Technical Analysis Indicators
  • Price Patterns Detection
  • Financial News
  • Level II (Book) Market Data
  • Trading Accounts Management
  • Integrated Trading




Hi all,

I love this site :

Great dataviz for financials. Realtime (or just about).


And some quick dataviz about data grabbed here : Sheryl Sandberg trading Facebook shares over first half of 2014…


Saturday, 10 May 2014

Load ElasticSearch with tweets from Twitter

Hi all,

Today, I’m gonna give you a quick overview of how to load an ElasticSearch index (and make it searchable !) with data coming from Twitter. That’s pretty easy with the ElasticSearch River ! You need to be familiar with ElasticSearch in order to fully understand what is coming. To learn more about the river, please read this.

As usual, a quick schema to explain how it works.



Pretty simple, as it comes as a plugin. Just type and run (for release 2.0.0) :

  • bin/plugin -install elasticsearch/elasticsearch-river-twitter/2.0.0

Then, if all went fine, you should have your plugin installed.

I highly recommend installing another plugin called Head. This plugin will allow you to fully use and manage your ElasticSearch system within a neat and comfortable GUI. Here is a pic :


Register to Twitter

Now you need to be registered on Twitter in order to be able to use the river. Please log in to : and follow the process. The outcome of this process is a set of four credentials needed to use the Twitter API :

  • Consumer key (aka API key),
  • Consumer secret (aka API secret),
  • Access token,
  • Access token secret.

These tokens will be used in the query when creating the river (see below).


A simple river

Ok, here we go. Just imagine I want to load, in near real time and continuously, a new index with all the tweets about …. Ukraine (since today, May 9 2014, the cold war is about to start again …).

That’s pretty simple, just connect to your shell, make sure ElasticSearch is running and that you have all the necessary credentials. Then send the PUT query below. Don’t forget to type your OAuth credentials (mine are under the form ****** below ).

curl -XPUT http://localhost:9200/_river/my_twitter_river/_meta -d '
{
    "type": "twitter",
    "twitter": {
        "oauth": {
            "consumer_key": "******************",
            "consumer_secret": "**************************************",
            "access_token": "**************************************",
            "access_token_secret": "********************************"
        },
        "filter": {
            "tracks": "ukrain",
            "language": "en"
        }
    },
    "index": {
        "index": "my_twitter_river",
        "type": "status",
        "bulk_size": 100,
        "flush_interval": "5s"
    }
}'

After submitting this query, you should see something like :

[2014-05-09 10:58:58,533][INFO ][cluster.metadata         ] [Rex Mundi] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-05-09 10:58:58,868][INFO ][cluster.metadata         ] [Rex Mundi] [_river] update_mapping [my_twitter_river] (dynamic)
[2014-05-09 10:58:59,000][INFO ][river.twitter            ] [Mogul of the Mystic Mountain] [twitter][my_twitter_river] creating twitter stream river
{"_index":"_river","_type":"my_twitter_river","_id":"_meta","_version":1,"created":true}
[2014-05-09 10:58:59,111][INFO ][cluster.metadata         ] [Rex Mundi] [my_twitter_river] creating index, cause [api], shards [5]/[1], mappings [status]
[2014-05-09 10:58:59,381][INFO ][river.twitter            ] [Mogul of the Mystic Mountain] [twitter][my_twitter_river] starting filter twitter stream
[2014-05-09 10:58:59,395][INFO ][twitter4j.TwitterStreamImpl] Establishing connection.
[2014-05-09 10:58:59,796][INFO ][cluster.metadata         ] [Rex Mundi] [_river] update_mapping [my_twitter_river] (dynamic)
[2014-05-09 10:59:31,221][INFO ][twitter4j.TwitterStreamImpl] Connection established.
[2014-05-09 10:59:31,221][INFO ][twitter4j.TwitterStreamImpl] Receiving status stream.

A quick explanation about the river creation query :

  • A river called “my_twitter_river” is registered in the “_river” index
  • 2 filters are used :
    • tracks : the keyword(s) to track
    • language : tweet language(s) to track
  • An index called “my_twitter_river” is created
  • A type (aka a table) called “status” is created
  • Tweets will be indexed :
    • once a bulk size of them has been accumulated (100, default)
      • OR
    • every flush interval period (5 seconds, default)
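
To make the “bulk size OR flush interval” rule concrete, here is a minimal Python sketch of that buffering logic. It is only an illustration, not the river’s actual code : the BulkBuffer class, its method names and the injectable clock are all made up for this example, and a real implementation would presumably flush from a timer rather than only when a new tweet arrives.

```python
import time


class BulkBuffer:
    """Accumulate items and flush on size or age, whichever comes first (hypothetical helper)."""

    def __init__(self, bulk_size=100, flush_interval=5.0, clock=time.monotonic):
        self.bulk_size = bulk_size
        self.flush_interval = flush_interval  # seconds
        self.clock = clock                    # injectable so the logic can be tested
        self.items = []
        self.last_flush = clock()
        self.flushed_batches = []             # stands in for the bulk requests sent to the index

    def add(self, item):
        self.items.append(item)
        self.maybe_flush()

    def maybe_flush(self):
        # Simplification: the age check only runs when add() is called.
        too_big = len(self.items) >= self.bulk_size
        too_old = self.clock() - self.last_flush >= self.flush_interval
        if self.items and (too_big or too_old):
            self.flushed_batches.append(self.items)
            self.items = []
            self.last_flush = self.clock()
```

With bulk_size=100 and flush_interval=5.0, a busy stream is flushed every 100 tweets, while a quiet one is still flushed once 5 seconds have passed and the next tweet arrives.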

By now, you should see the data coming into your newly created index.


A first query

Now, time to run a simple query about this “Ukrain” related data.

For now, I will only send a basic query based on a keyword, but stay tuned because I will soon demonstrate how to create analytics on these tweets …

Simple query to retrieve tweets containing the word “Putin” :

curl -XPOST http://localhost:9200/_search -d '
{
  "query": {
    "query_string": {
      "query": "Putin",
      "fields": [ … ]
    }
  }
}'

… and you’ll have something like :


Monday, 5 May 2014

A cool Big Data tools overview

Hi all,

Here is a cool Big Data tools overview. Found on :

Definitely superb ! I love it.

Don’t overuse it in your powerpointware ….


Tuesday, 1 April 2014

Deploy Mongodb replica set

Hi all,
Today, I’m going to summarize some easy steps to create a MongoDB replica set. This won’t be a very detailed article, only a quick reminder of how to do it. Nor will it be a tutorial on how to install MongoDB, since that is pretty easy and the web already has tons of tutorials about it.
This article is part of a series about scaling MongoDB. The series will cover all the typical steps I had to go through over the past year :
  • Step 1 : creating a replica set from a single machine server (dev / test environment)   
    • this current article
  • Step 2 : creating a replica set on 2 or more servers (small infra / integration …)
    • coming soon …
  • Step 3 : scaling mongo with sharding (high availability production environment)
    • to come a bit later …

Quick reminder

MongoDB ?
MongoDB is an open source document database, part of the NoSQL paradigm. The main key features are :
  • document oriented storage,
  • full index support
  • replication and high availability
  • auto sharding
  • map/reduce
  • gridFS
Document oriented storage ?
Data is not stored as rows with a fixed schema. Data is stored as documents (JSON documents), and each of these documents can have its own schema. We talk about schema-less documents.
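
As a quick illustration of what schema-less means, here is a small Python sketch with two documents that could sit side by side in the same collection. The field names and values are invented for the example, and plain dicts stand in for real MongoDB documents.

```python
import json

# Two hypothetical documents from the same collection: no shared, fixed schema
user_a = {"name": "Alice", "email": "alice@example.com"}
user_b = {"name": "Bob", "tags": ["admin", "beta"], "address": {"city": "Paris"}}

collection = [user_a, user_b]


def field_names(docs):
    """Return the sorted top-level field names of each document."""
    return [sorted(doc.keys()) for doc in docs]


for doc in collection:
    # Each document serializes to its own JSON shape
    print(json.dumps(doc, sort_keys=True))
```

Both documents share the “name” field, but each one carries extra fields (including a nested document) that the other does not have.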

Our scenario

Imagine : we have a single AWS/EC2 server running a single MongoDB instance. For any good reason, we want to deploy a replica set. For instance, to plug an ElasticSearch river (article to come soon !) in order to feed a search index !
Let’s go from a single MongoDB instance to a replica set made of a primary and a secondary instance.

Let’s go for it

First, I assume you have an up and running MongoDB instance on a linux box, running one mongodb process. Here is what we are going to do :
  • 1 - Stop the single running instance
  • 2 - Duplicate the configuration file
  • 3 - Update the configuration files
  • 4 - Prepare the filesystem for MongoDB secondary instance
  • 5 - Restart primary and secondary MongoDB instances
  • 6 - Set up and activate the replication
  • 7 – Play with it
1 - Let’s stop this currently running instance :
  • ps aux | grep mongod[b] : will show the mongodb pid ([pid])
    • Trick : using grep mongod[b] prevents the grep process itself from showing up in the output.
  • kill -2 [pid] : sends SIGINT to the mongodb process, which shuts it down cleanly
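
To see why the bracket trick in step 1 works, here is a small Python sketch that applies the same regular expression to two command lines as ps aux might list them. Both command lines are assumptions for the example (the first one reuses the start command from step 5).

```python
import re

# Same regular expression as in `grep mongod[b]`: "mongod" followed by a "b"
PATTERN = re.compile(r"mongod[b]")

# Hypothetical command lines, as `ps aux` might show them
mongo_cmd = "mongod --fork --rest --config /etc/mongodb.conf"
grep_cmd = "grep mongod[b]"


def matches(command_line):
    """Return True if the pattern is found anywhere in the command line."""
    return PATTERN.search(command_line) is not None


print(matches(mongo_cmd))  # the mongod process is listed
print(matches(grep_cmd))   # the grep process is filtered out
```

The grep command line contains the literal text “mongod[b]”, where “mongod” is followed by “[” and not by “b”, so it does not match its own pattern. One caveat : in a directory that contains a file named mongodb, the shell may expand the unquoted mongod[b] before grep sees it, so quoting the pattern is safer.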
2 - Duplicate the configuration file called mongodb.conf :
  • mongodb.conf is located here : /etc/mongodb.conf
  • cp mongodb.conf mongodb1.conf : will create a local copy, and allow you to have 2 mongod instances : mongodb (initial=primary) and mongodb1 (new=secondary)
  • Give the appropriate rights, depending on your installation (giving the same rights as the initial file is easy and quick for a simple deployment).
3 - Update the configuration files :
  • make the following change, in order to create a completely new mongodb ecosystem :
    • dbpath=[path to your directory where the data will be written]
    • logpath=[path to your directory where the logs will be written]
    • port = 27018 : the first instance is running on 27017, so adding 1 and using 27018 is easy for your second instance
    • replSet=rs1 : name for your replica set
    • nojournal=true : optional
Now you have 2 configuration files : mongodb.conf (original one) and mongodb1.conf (newly created). Don’t forget to add the following to your original mongodb.conf file :
  • replSet=rs1 : name for your replica set. This name should be the same in the two configuration files.
Let’s have a quick overview of the two configuration files (simplified) :
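
As a rough sketch of step 3, the Python snippet below derives a secondary configuration from a primary one by overriding the settings listed above. The derive_secondary_config helper, the content of the primary file and the log file names are assumptions made for the illustration; only the setting names, the port numbers, the replica set name and the /mnt/mongodb/mongodb1 directory come from this article.

```python
def derive_secondary_config(primary_text, overrides):
    """Return a copy of a key=value config with the given settings replaced or appended."""
    remaining = dict(overrides)
    lines = []
    for line in primary_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    # Settings that were absent from the primary file are appended at the end
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


# Hypothetical primary config: paths and file names are placeholders
primary = """dbpath=/mnt/mongodb/mongodb
logpath=/mnt/mongodb/mongodb/logs/mongodb.log
port=27017
replSet=rs1
"""

secondary = derive_secondary_config(primary, {
    "dbpath": "/mnt/mongodb/mongodb1",
    "logpath": "/mnt/mongodb/mongodb1/logs/mongodb1.log",
    "port": 27018,
    "replSet": "rs1",
    "nojournal": "true",
})
print(secondary)
```

The point to check in the result is that both files end up with the same replSet value while dbpath, logpath and port differ.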
4 – Prepare the filesystem for the new instance
Of course, you need to create the new directories for your secondary (new) instance. That means, according to the picture above, creating :
  • /mnt/mongodb/mongodb1
  • /mnt/mongodb/mongodb1/logs
5 - Restart primary and secondary MongoDB instances
Easy. Here is the simple command line to start each instance :
  • For primary : mongod --fork --rest --config /etc/mongodb.conf
  • For secondary : mongod --fork --rest --config /etc/mongodb1.conf
Using --fork will allow you to start mongo as a background task.
Then you should see your processes running from a simple top -c, like this (only one process here but you should have two) :
6 - Set up and activate the replication
Now it’s time to set up the replication process. For that purpose we will now connect to the primary mongo instance ! Here is the complete process.
{
    "_id" : "rs1",
    "version" : 1,
    "members" : [
        {
            "_id" : 0,
            "host" : ""
        }
    ]
}

{ "ok" : 1 }
  • mongo will start the mongo shell for the primary instance (default port is 27017). Note that if you want to connect to the secondary instance, you need to specify the port, like : mongo --port 27018
  • rs.initiate() will start the replica set configuration.
  • rs.conf() will print the replica set configuration. Here we can see we have a replica set named “rs1”, having only one member “_id” = 0 (which is the current primary instance).
  • rs.add(“”) will add a new member to the “rs1” replica set. You can see I named the new member with a string composed of the localhost IP address and the port. Sometimes, you may need to use the hostname instead of the IP, especially if you work on AWS.
    • As soon as the rs.add command has been processed, with success, the answer says “ok”:1
Now let’s check we have a replica set up and running. Let’s type the rs.conf() once again. It should display the output below. We can see we really have a replica set.

{
    "_id" : "rs1",
    "version" : 1,
    "members" : [
        {
            "_id" : 0,
            "host" : ""
        },
        {
            "_id" : 1,
            "host" : ""
        }
    ]
}

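To get a feel for the shape of that configuration document, here is a small Python sketch that builds a similar structure for a list of members. The replica_set_config helper and the two host strings are hypothetical; the sketch only mirrors the fields visible in the rs.conf() output and does not talk to MongoDB.

```python
import json


def replica_set_config(name, hosts, version=1):
    """Build a document shaped like the rs.conf() output for the given members."""
    return {
        "_id": name,
        "version": version,
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


# Hypothetical host strings ("IP:port"), one per mongod instance
config = replica_set_config("rs1", ["localhost:27017", "localhost:27018"])
print(json.dumps(config, indent=4))
```

Each member gets its own “_id” (0 for the first instance, 1 for the one added afterwards) and a “host” string made of an address and a port.
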
7 - Play with it
Now you can play with the replication. Just create a database, a collection and insert some data/documents. Then connect to the secondary instance and tadam … your data has been replicated. If you are working on AWS, don’t forget to whitelist your 27018 port in order to have access to the secondary instance.


This was a VERY simple way to create a Mongo replica set on a single machine. Of course, this is not suitable for production environments. Consider this setup for prototyping, training or small devs. This was just the beginning of the journey.

In the next chapters, I will explain how to :
  • Add an Arbiter : this is highly recommended !! The Arbiter holds no data but votes in the election of the primary instance for the replica set.
  • Adjust priority for a replica set member,
  • Create and use a replica set across several instances
  • I’m also writing some text about setting up an ElasticSearch River to feed ElasticSearch indexes from a Mongodb database.

Monday, 27 January 2014

ElasticSearch index migration

Hi all,
I’ve been working with ElasticSearch for one year now. It’s a great index and distributed search engine, based on Apache Lucene. Behind the scenes, ElasticSearch is a document-oriented datastore, with a schemaless model and a RESTful API, offering high availability on real-time data. More to read here for more info.

I’ve been loading data from relational or NoSQL (MongoDB) databases, using either the API or the bulk features. Another interesting way of loading data from a NoSQL database is the river. Loading data is fine, but you will probably face the challenge of synchronising / replicating data from one environment to another (from PROD to DEV for instance, in order to allow development on fresh and meaningful data).
I was working on that challenge when I finally found something really good : the ElasticSearch Exporter !

How to easily move / copy indexes from one cluster/machine to another ?

Ravi Gairola developed and released the ElasticSearch Exporter. This small script is available on github here.
ElasticSearch Exporter will allow you, with only a single line of shell, to :
  • Export to ElasticSearch or (compressed) flat files
  • Recreate mapping on target
  • Filter source data by query
  • Specify scope as type, index or whole cluster
  • Sync Index settings along with existing mappings
  • Run in test mode without modifying any data

Install and usage

ElasticSearch Exporter needs Node.js (v0.10 minimum) with the following modules : nomnom, colors. They can be installed with npm.
Let’s go, install node (ubuntu) :
sudo apt-get update
sudo apt-get install -y python-software-properties python g++ make
sudo add-apt-repository ppa:chris-lea/node.js
sudo apt-get update
sudo apt-get install nodejs

Let’s add some needed modules :

npm install colors
npm install nomnom

Download and unpack ElasticSearch Exporter (master from github) :

cd Elasticsearch-Exporter-master

Start migrating an index from machine A to machine B :

node exporter.js -a [machine A IP] -p 1201 -i source_idx_name -b [machine B IP] -q 1201 -j dest_idx_name
  • -a : Source IP (machine A)
  • -p : Source port (machine A)
  • -i : Source index (machine A)
  • -b : Destination IP (machine B)
  • -q : Destination port (machine B)
  • -j : Destination index (machine B)
You will see some progress lines (warn : could go deeeeep down on your shell window) and a summary at the end :

Processed 118100 of 119923 entries (98%)
Processed 118200 of 119923 entries (99%)
Processed 118300 of 119923 entries (99%)
Processed 118400 of 119923 entries (99%)
Processed 118500 of 119923 entries (99%)
Processed 118600 of 119923 entries (99%)
Processed 118700 of 119923 entries (99%)
Processed 118800 of 119923 entries (99%)
Processed 118900 of 119923 entries (99%)
Processed 119000 of 119923 entries (99%)
Number of calls:    2430
Fetched Entries:    119923 documents
Processed Entries:    119923 documents
Source DB Size:        119923 documents

Then, you are done : your index has moved from machine A to machine B.

Of course, there are plenty of other options, have a look at the github page.

Easy, simple, efficient and free. As we love it.

A big thanks to Ravi Gairola.