Friday, 4 October 2013

Loading MongoDB with Talend : easy way

Hi all,

Today, I’m going to summarize some easy steps to load MongoDB with Talend. This won’t be a very detailed article, only a quick reminder of how to do it, hoping to give you a leg up.

MongoDB ?

MongoDB is an open source document database, part of the NoSQL family. Its main feature, for our purposes :

Document oriented storage ?

Data is not stored as rows with a fixed schema. It is stored as documents (JSON documents), and each of these documents can have its own schema. We talk about schema-less documents.

Here is a simple document (snapshot from mongoDB site).


The data you see here is a row. Well … a document, let’s use the appropriate name. This document has a key (_id) and fields (name, birth …). Schema-less means that the field “death”, for instance, could be missing (had death not occurred, Turing would be 101 years old), or that an extra field “addictions” could be present if Turing had any known addictions (he had none, of course).

Please note that documents are JSON objects. That means a field can store a simple string or a more complex structure : another JSON object, an array (like “contribs” above) …
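To make the schema-less idea a bit more concrete, here is a minimal sketch in plain Java. It does not use the MongoDB driver : ordinary maps and lists stand in for documents, the class name, the _id values and the second document ("Jane Doe") are made up for illustration, and only the field names come from the example above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for two documents of the same collection (no MongoDB driver).
class SchemaLessSketch {

    /** Document with a nested object ("name") and an array ("contribs"). */
    static Map<String, Object> turing() {
        Map<String, Object> name = new LinkedHashMap<String, Object>();
        name.put("first", "Alan");
        name.put("last", "Turing");

        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("_id", 1);                 // illustrative key value
        doc.put("name", name);             // a field holding another object
        doc.put("birth", "1912-06-23");
        doc.put("death", "1954-06-07");
        doc.put("contribs", Arrays.asList("Turing machine", "Turing test")); // array field
        return doc;
    }

    /** Hypothetical second document : no "death" field, one extra field. */
    static Map<String, Object> someoneElse() {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("_id", 2);
        doc.put("name", "Jane Doe");       // a plain string this time
        doc.put("addictions", Arrays.asList("coffee"));
        return doc;
    }

    public static void main(String[] args) {
        // Both documents live side by side although their fields differ.
        List<Map<String, Object>> collection = Arrays.asList(turing(), someoneElse());
        for (Map<String, Object> doc : collection) {
            System.out.println(doc.get("_id") + " has death field : " + doc.containsKey("death"));
        }
    }
}
```

The point is only that nothing forces the two documents to share the same fields, or the same type for a given field.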

This is not the topic here, but I hope you can feel the power of schema-less document oriented databases.

Using Talend to load mongoDB : two ways

You have two options : the Talend native MongoDB components, or your own Java code built on the MongoDB Java driver.

My position, based on tests I ran recently, is that using your own Java routines (instead of the Talend native components) will give you :

  • more flexibility and control over your documents before inserting them
  • better insertion/update performance. I do mean BETTER : some like-for-like tests showed a 100x gap.

Nevertheless, I use Talend with great pleasure and will publish a post on how to use these dedicated components.

Giving you a leg up : a simple job to load/update json documents in mongoDB

Well, to give you a leg up, I created a very simple job. I have tons of other jobs with really heavy transformations, but the idea is the same.


First step :

Download the MongoDB driver from the official web site. You will find the drivers here : http://docs.mongodb.org/ecosystem/drivers/downloads/. In our example, we need the Java driver. Go here for the latest release : https://github.com/mongodb/mongo-java-driver/releases. Then, use a tLibraryLoad to point to the jar file on your hard drive, e.g. mongo-2.9.3.jar.

Second step :

You need to connect to MongoDB. This is quite simple with the Java driver API and a tJava component. By the way, here is a link about this API : http://api.mongodb.org/java/

The imports are like below (I have lots of imports here because of copy/paste from a custom job) :


The Java code is like below. Pretty self-explanatory :


Third step :

A basic relational source component. In my example, I use a tMysqlInput with a classic SELECT statement : select _id, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10 from source_table;

Fourth step : here is the magic !


First, retrieve the collection object previously stored in the globalMap. Then instantiate a query object and initialize it with the _id, here the _id column of the source row (see the MongoDB documentation to learn more about the _id field). Create a document object and initialize it by appending all the columns you want to load into MongoDB. Finish by calling the get() method, and you are done.

Now just call the update method with the appropriate arguments :

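  • query : the query created previously
  • doc : the document you just created
  • bool1 : upsert : if true, inserts the document when no document with this _id exists, otherwise updates the existing one
  • bool2 : multi : if true, updates all the documents matching the query instead of only the first one. A query on _id matches at most one document, so false is enough here.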
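In the 2.x Java driver, these four arguments correspond to DBCollection.update(DBObject q, DBObject o, boolean upsert, boolean multi). To show what the upsert flag changes without needing a server, here is a small in-memory model in plain Java. It is only a sketch of the semantics, not the driver : the class UpsertSketch and its methods are hypothetical, the query is reduced to a match on _id, and multi is left out because an _id matches at most one document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model of update(query on _id, doc, upsert); not the MongoDB driver.
class UpsertSketch {

    // Stand-in for a collection : documents indexed by their _id.
    private final Map<Object, Map<String, Object>> byId =
            new LinkedHashMap<Object, Map<String, Object>>();

    /**
     * Replaces the document whose _id equals id. If no such document exists,
     * inserts doc only when upsert is true. Returns true if something was written.
     */
    boolean update(Object id, Map<String, Object> doc, boolean upsert) {
        if (!byId.containsKey(id) && !upsert) {
            return false;                  // nothing matched and upsert is off : no-op
        }
        Map<String, Object> stored = new LinkedHashMap<String, Object>(doc);
        stored.put("_id", id);             // the _id itself never changes
        byId.put(id, stored);              // insert, or replace the existing document
        return true;
    }

    int count() {
        return byId.size();
    }

    Map<String, Object> find(Object id) {
        return byId.get(id);
    }

    public static void main(String[] args) {
        UpsertSketch coll = new UpsertSketch();
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("col1", "first load");

        System.out.println(coll.update(42, doc, false)); // no match, upsert off
        System.out.println(coll.update(42, doc, true));  // no match, upsert on
        doc.put("col1", "second load");
        coll.update(42, doc, true);                      // same _id : replaced, not duplicated
        System.out.println(coll.count() + " document(s), col1 = " + coll.find(42).get("col1"));
    }
}
```

This is why the same job can be rerun safely : with upsert set to true, loading a row whose _id already exists replaces the document instead of creating a second one.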

Adding more complex objects to the insert step

Of course, this example is quite simple : one field created for each column. But you can also map a complex JSON object to a field. Simply process your data and store the result in a DBObject. Later, you can map this DBObject to a Talend column and finally to a field with the .append method :

.append("json_object", input_row.json_object)    where input_row.json_object is a Talend column of type “Object”.
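As a rough sketch of the shape of the result, the plain-Java example below groups two flat values into a nested object and attaches it to a document under the key json_object. Ordinary maps stand in for the driver's DBObject, and the class name, method names and the choice of col2/col3 as the nested values are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain maps stand in for the driver's DBObject; names are illustrative only.
class NestedFieldSketch {

    /** Groups several flat values into one nested object. */
    static Map<String, Object> buildJsonObject(String col2, String col3) {
        Map<String, Object> nested = new LinkedHashMap<String, Object>();
        nested.put("col2", col2);
        nested.put("col3", col3);
        return nested;
    }

    /** Builds the document to load : one simple field plus the nested object. */
    static Map<String, Object> buildDocument(Object id, String col1,
                                             Map<String, Object> jsonObject) {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("_id", id);
        doc.put("col1", col1);
        doc.put("json_object", jsonObject); // the whole object becomes a single field
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> nested = buildJsonObject("b", "c");
        System.out.println(buildDocument(7, "a", nested));
    }
}
```

In the Talend job, the nested object would travel between components in a column of type Object, exactly like input_row.json_object above.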


Stay tuned for more articles about mongoDB !
