Hi all !
It's been a few days I'm thinking about feeding Amazon SimpleDB with an ETL tool like Kettle / PDI.
Well, it's done. I have a working prototype. It’s a “quick and dirty” prototype of course but it works. I hope we will soon have an official Kettle plugin for that.
Requirements
You have to be familiar with Amazon AWS, EC2 and SimpleDB. Of course you need a valid account on Amazon Web Services. If you want to learn more about SimpleDB, click HERE. You can play with SimpleDB with a graphical interface before starting hard stuff, click here for the ScratchPad (don’t forget to browser the javascript source code, a lot to learn here !).
You need to know how to use Kettle, the famous data integration tool from Pentaho. To learn more about Kettle, follow this link. To discover the full Pentaho BI solution, click here. I recommand you to discover Pentaho BI Suite Enterprise Edition.
The process
First you have to know how SimpleDB is organized and how it’s working.
For the developper, SimpleDB is not seen a traditional relational database like the ones he’s used to work with. Instead of thinking in terms of tables and columns, you have to face a different approach : data is organized within Domains, which are similar to an Excel tab. Then, inside a domain, data is stored with the couple : Attribute/Value. XML guys won’t be suprised with this storage method.
Let's first have a look at a typical relational table. Just a reminder ;)
Now, let’s see how your data will look like once store inside SimpleDB. A bit of XML now. As you can see, this extract represents the first line of the relational table show above. This row is composed of an item name (let’s say for convenience, but it’s false, it’s like the primary key) and attributes. These attributes are made of a Name and an associated Value.
See the difference ? That's the Amazon SimpleDB API. I'm pretty sure that data, at low level, is finally stored into a relational schema, somewhere... But for the developper, this is the way it’s must be done.
Okay, okay. But how to transform my relational structure into something that will be received and understood by the SimpleDB API ? We have two challenges here : transformation and sending. Ok, go for it.
The Mapping !
Here is my transformation, done with Kettle. Pretty simple, uh ? Let’s go in detail now…
First you have a CSV file input. This data input will be reformated to build Name/Values couples and then these couples will be concatenated into a valid URL. Once signed, this URL will be sent to Amazon API and the data will be inserted into the domain (previously created). You can see, on my transformation, a File Output : I use it, sometimes, for debugging. In our exemple, it was easy for me to see and analyse the generated URL into a notepad (here, the link is not activated).
In my example I will use a typical csv file as data source (based on the same relational table shown above). Here is my flat file, typical with ; as separators.
ID;Category;Subcat;Name;Color;Size;Make;Model;Year
Item_01;Clothes;Sweater;Cathair Sweater;Siamese;Small, Medium, Large;Nike;Swoosh;2003
Item_02;Clothes;Pants;Designer Jeans;Paisley Acid Wash;30x32, 32x32, 32x34;Trusardi;BigButt;2005
Item_03;Clothes;Pants;Sweatpants;Blue, Yellow, Pink;Large;Diesel;Steel;2006, 2007
Item_04;Car Parts;Engine;Turbos;Pink;Medium;Audi;S4;2000, 2001, 2002
Item_05;Car Parts;Emissions;O2 Sensor;Black;Small;Audi;S4;2000, 2001, 2002
The JScript Code !
I confess : I only wrote 5% of the JScript code. Let me explain. When you suscribe to Amazon SimpleDB, you can download the official API, written in Java, and use it to create, manage and populate your SimpleDB domain. Java is very usefull of course, but I was looking for Jscript in order to put everything into Kettle. Then I downloaded Amazon SimpleDB ScratchPad. This is a nice utility that allows you to play with SimpleDB without coding, just the mouse. When looking into this application directories, you can find all the Jscript source code needed ! Then my work consisted on porting the ScratchPad code into a Kettle Jscript step, with some adjustments.
This code is a bit long to be shown here, so click HERE to download it. Let’s have a hi level overview of the JScript layout.
The process if very simple : each row is cutted into Name/Values couples (URL building & URL formating routines), these couples are then concatenated into a valid URL (URL concatenation). This URL is then signed (SHA-1 hash algo) and sent to the HTTP client step.
Here is the basic code to create the URL :
var URL2POST = "https://sdb.amazonaws.com"
+ "?SignatureVersion=1&Action=" + "PutAttributes"
+ "&Version=" + encodeURIComponent("2009-04-15")
+ "&DomainName=" + encodeURIComponent('MyStore')
+ "&ItemName=" + encodeURIComponent(ID)
+ "&Attribute.1.Name=Category"
+ "&Attribute.1.Value=" + encodeURIComponent(Category)
+ "&Attribute.2.Name=Subcat"
+ "&Attribute.2.Value=" + encodeURIComponent(Subcat)
+ "&Attribute.3.Name=Subcat"
+ "&Attribute.3.Value=" + encodeURIComponent(Name)
+ "&Attribute.4.Name=Color"
+ "&Attribute.4.Value=" + encodeURIComponent(Color)
+ "&Attribute.5.Name=Size"
+ "&Attribute.5.Value=" + encodeURIComponent(Size)
+ "&Attribute.6.Name=Make"
+ "&Attribute.6.Value=" + encodeURIComponent(Make)
+ "&Attribute.7.Name=Model"
+ "&Attribute.7.Value=" + encodeURIComponent(Model)
+ "&Attribute.8.Name=Year"
+ "&Attribute.8.Value=" + encodeURIComponent(Year)
+ "&Timestamp=" + timestamp
+ "&AWSAccessKeyId=" + encodeURIComponent(accesskey);
Note that in my Jscript code, I didn’t make any loop to go though all the source columns. As it is a quick proof of concept, based on fixed length data structure, I used one line of code for each column in order to create Name/Value couples. If you look closely into Amazon Scratchpad code source, you will see a loop in the function “generateSignedURL”. This is how things have to be done of course !
The final URL looks like this one :
Let’s see it more in detail :
- The endpoint : https://sdb.amazonaws.com
- The SignatureVersion, always 1 for me.
- The Action needed, in our case PutAttributes, in order to load data into the domain.
- The Version, always 2009-04-15. Don’t know why …
- The DomainName : MyStore, in my case. You can create yours easily.
- The ItemName : Item_01 corresponding to my primary key.
- Then you have all the Name/Values couples : Attribute names and Attribute value.
- A timestamp : calculated by a Jscript function.
- Your AWS Access Key. Mine is obfuscated in the exemple above.
- Your Signature : this is you secret AWS Access key that will be signed by the SHA-1 hash algo, as seen above. Obfuscated here again.
Security
Let’s talk about these AWS Access Keys and Signature. In my proof of concept, these keys are stored in clear in my JScript. Of course, this is not recommanded. I let you imagine a more convenient way to be more secure (parameters, repository …).
Let’s send it to Amazon !
Pretty easy now, each row will be sent to a HTTP client step, using a Jscript variable called URL2POST. This step will send the URL to Amazon SimpleDB and the row will be inserted into your domain.
For the moment, I have no time to handle any return code from Amazon API but it’s very easy since Amazon sends you back an XML message like the one below, in case of success. In case of failure, the message is self explanatory.
The goodies !
You can find the Kettle transformation HERE.
You can find the Jscript HERE.
You can find my little flat file HERE.
How to be sure the data is in ?
Pretty easy. Start the Amazon Scratchpad utility, enter your access code and key, go to GetAttributes API drop down menu and fill in the Domain Name and one Item_Name. Have look here. Note : my keys are obfuscated here again.
Hit “Invoke Request” button, and see your data.
There is another way to check your data. You can write a SQL query in order to see the whole data stored in a given DomainName. Here again, with the ScratchPad, go to “Select” in the API drop down menu. Then enter “select * from MyStore” in the Select Expression field. Hit Invoke Request button, and you will see all your data.
The output will look like this one (continues for each Item …).
That’s nice, but what for ?
Imagine you have, like me, to think about storing emails or a call center knowledge base on the cloud. You have messages, and you have headers. Why not storing headers in SimpleDB and message bodies into S3 ? That’s a good solution. In that case, SimpleDB will handle a few attributes while the heavy data will be stored into S3, with the help of any third party database (open source or not). Of course, you have to manage the link between S3 data and SimpleDB headers, but that’s another story …
More to come
Please, give me a feedback for this article. I’m currently working on something more reliable and more professional. If I have time, I will try to write a Kettle plugin.

23 commentaires:
Hi!
Cool, way to go :)
Bonjour Vincent,
Vraiment super intéressant cet article, en effet le code Javascript est assez rude... ;-)
L'idée du plugin est très bonne !
Petite question : avez-vous pu faire quelques tests de l'étape "S3 CSV Input" qui a l'air également fort intéressante ?
Sylvain
http://www.osbi.fr
Hello Sylvain,
Still haven't time to test the S3 CSV step. I will do it quite soon since I have some business ideas behind that ...
Thanks for ur post.
Vincent
I have been visiting various blogs for my term papers writing research. I have found your blog to be quite useful. Keep updating your blog with valuable information... Regards
hi I have been using Kettle / PDI tool and I think it's good, but I want to try something else and I am very interesting in the tool you are developing, I hope to use it someday soon!
Generic Viagra Viagra
Genial fill someone in on and this post helped me alot in my college assignement. Say thank you you on your information.
Sorry for my bad english. Thank you so much for your good post. Your post helped me in my college assignment, If you can provide me more details please email me.
Just wondering, did you write a kettle plugin for this? If not, I am thinking about doing this.
Just wondering, did you write a kettle plugin for this? If not, I am thinking about doing this.
Vincent: Cool way to work with SimpleDB.
I tried downloading the transfos and the java script but when I click in the links it says file does not exist.
Can you please give a link where I can download it from?
Hi,
Seems the download is not available yet. I will fix this.
Vincent
Vincent: Can you please fix the download? So I can take advantage of what you have written?
Hi, the downloads are back !
Enjoy and feel free to contact me for anything else.
hi,
this is pretty interesting.and i also need how to import data from simpledb to pentaho.please provide me the information as soon as possible.
Thanks in advance
I love open-bi.blogspot.com! Here I always find a lot of helpful information for myself. Thanks you for your work.
Webmaster of http://loveepicentre.com and http://movieszone.eu
Best regards
dfgdfg
Hello
A great article to read :)
But is there a way to do it "the other" way.
Sending data from SimpleDB, that Pentaho can read from.
Pentaho dont support adding Simpledb as a DB connection yet) I see.
Is there a way to do it?
We've just released a free plugin to read from Amazon SimpleDB, have a look at http://www.cloud2land.com/ for more details, that should make things a little bit easier.
SDB Explorer has been made as an industry leading graphical user interface (GUI) to explore Amazon SimpleDB service thoroughly and in a very efficient and user friendly way.
http://www.sdbexplorer.com/
Please one more post about that.I wonder how you got so good. This is really a fascinating blog, lots of stuff that I can get into. One thing I just want to say is that your Blog is so perfect
Vincent: Can you please fix the download? So I can take advantage of what you have written?
Nice to be visiting your blog again, Generic Paxil has been months for me. Well this article that i've been waited for so long. Generic Ultracet need this article to complete my assignment in the college, and it has same topic with your article. Thanks, great share
what a style HAHAHAHA. Great and interesting post. Thanks for sharing the information.
Post a Comment