Thursday, 29 November 2012

Amazon SDB bulk extractor

Hi all,

I said it, I did it. After coding a SDB bulk loader a few days ago, based on the batch insert feature from the AWS SDK, I coded a bulk extractor.

                                                         You said bulk, or Hulk ?

What’s the theory ?

sdb_extractor (its name) was coded with the following features in mind :

  • multi query : you can ask multiple SQL orders (selects of course) to be executed and having the data extracted.
  • multi domain : you can query multiple domains at the same time, using the same program execution,
  • multi threaded : each query has its own thread !
  • multi output format : each query generates data that goes into data files. These data files can have different delimiters, different names, different output formats (including json),
  • query validation : each query is first tested and only the relevant ones (good syntax and existing data in the domain) will be executed (into its own thread as I said before).

How does it work ?

Simple as usual. A simple program needs a simple usage. Better than a long speech, let’s have a look at the –help command :

image

As you can see, nothing really complex. You have to provide the program with 2 parameters :

  • --aws-credentials : a file containing your aws credentials. Usual syntax from aws (access key and private key),
  • -- query-file : a file containing the queries you want to execute and grab the data coming from. Let’s have a close look to this file now.

 

imageThe general process 

The control file : the query file.

This file is necessary in order to give instructions to sdb_extractor. Its syntax is very simple, here is the data format (without brackets) :

[SQL query order];[output file];[delimiter];[sdb domain]

… that gives in real life :

select `itemName()`, `Att1` from `your_domain` limit 2000;sdb_extract1.txt;semicolon;sdb.eu-west-1.amazonaws.com
select * from `your_domain_2`;sdb_extract2.txt;pipe;sdb.eu-west-1.amazonaws.com
select * from `your_domain_3`;sdb_extract3.txt;json;sdb.eu-west-1.amazonaws.com

The first query is asking the key (itemName()) and one attribute (Att1), is giving a limit of 2000 (chunks of 2000 lines will be streamed to the data file until no more rows remain), is asking for a txt file output having a semicolon delimiter. This query has to be run upon the EU West SDB domain.

Please note :

  • The query file delimiter is : ;
  • You can put as many queries as you want
  • You can put as many lines as you want

The values for the delimiters are the following :

  • semicolon,
  • colon,
  • pipe,
  • json : in that case, a json syntax will be streamed to the export file.

Pretty simple, hum ?

Here is the program output after a full execution. As you can see, the execution is divided into 3 main parts :

  • Query validation, showing OK or NOK (here only OK),
  • Query running, with a counter. Note iterations with 2000 rows come from the “limit 2000” clause from the SQL query, while iterations with 100 rows are the default from SDB.
  • Final output, showing OK for any successfully created file.

image 

What happens when I write a bad SQL order ?

Very simple, the bad query will be trapped early and won’t generate any extraction thread. I managed to print some useful information from AWS. Here is an example of a failed query (I  intentionally remove a back tick from the domain name).

image

I want to try it !

No problem. Just download the package here. Includes the jar and a query file for example.

Performance

As usual, performance is a lot better when running the program from within Amazon EC2. In order to have and additional boost for your extraction, you could for instance create several SQL orders for the same domain/table but using different WHERE clause in order to “partition” your query. It works.

More to come.

I’m currently adding some data streaming to s3 features !! Sounds usefull for backup jobs …

As usual, please, give me feedback either positive or negative. I received a *lot* of feedback from the previous sdb_loader, thanks for that !! Please continue !

Tuesday, 27 November 2012

Amazon SDB bulk loader

Hi all,

I’ve spent a long time working with AWS and especially SDB. Loading mass data into SDB is not very frequent, but when needed it often comes to an end : there is no utility to do that and it’s a pain in the neck to use an ETL for that task (IMHO ETLs struggle to load key value stores and other NoSQL stores). Note : in the past, I developed a Kettle plugin to load sdb. You can find the post here.

Fortunately, AWS is providing a very powerful API. So one, two, three, Java … I decided to code my own SDB loader.

 
What is SDB ?

SDB is a NoSQL – non relational – database. You store data as key-value pairs and you can query data items via web services requests. For more informations, please go on AWS SDB page.

In order to understand SDB data model, we have to define 3 components :

  • Domains : you store data in domains. Domains are similar to tables, from the relational world.
  • Items : items are unique keys for attributes,
  • Attributes and values : attributes are similar to columns. You add values to attributes.

Better than a long speech, below is a representation of a domain. CustomerID is the item and “First name” to “Telephone” are attributes. (example taken from Amazon). To make a simple analogy, a domain is like an Excel spreadsheet.

CustomerID First name Last name Street address City State Zip Telephone
123 Bob Smith 123 Main St Springfield MO 65801 222-333-4444
456 James Johnson 456 Front St Seattle WA 98104 333-444-5555

 
SDB limitations

SDB has some limitations, the most obvious are :

  • 10 Gb per domain,
  • 100 domains,
  • 256 attributes per item,
  • 1024 attribute value length.

You can learn more about SDB limitations here.

 
What is sdb_loader ?

sdb_loader is built around the BatchPutAttributesRequest API method, allowing to insert “chunks” of 25 rows (no more). Usage is very simple. Running sdb_loader requires four files :

  • sdb_loader itself,
  • a data file containing the data you want to load
  • a control file, containing the sdb column names (attributes) and other parameter I will describe later on.
  • an AWS credentials file, containing your accessKey and secretKey .

These three files are working together like this :

image

sdb_loader.jar : the main program. Has to be started with java –jar sdb_loader. You have different parameters to set up, below is the –help output.

data file : any data file. No header row. Usual delimiters canl be used : “;”, “,”, “|” or “\t”. Delimiters detection is automatic. I will add more delimiters if needed.

control file : a simple txt file describing the SDB attributes and a flag specifying whether or not to replace the attribute/value pair or to add a new attribute/value pair. The default setting is false. Note : choosing false can have a significant impact on performance. The values are delimited with a “;”. You don’t have to specify any key (the itemName()) because the loader will automatically use the first column of the data file as the key. So, in the control file, you just have to describe the attributes.

Here is an example of the control file :

image

And here is an example of a completely silly data file. As you can see, no header and the key (Item_1) is present as first column.

Item_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1;Attribute_1
Item_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2;Attribute_2
Item_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3;Attribute_3
Item_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4;Attribute_4
Item_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5;Attribute_5

The command line is also very simple. Here is the –help command :

image

How to use sdb_loader ?

java -jar sdb_loader.jar --aws-credentials=aws_credentials.txt --control-filename=data_mapping.txt --data-filename=data.txt --domain-name=mass_load_domain --sdb-endpoint=sdb.eu-west-1.amazonaws.com –domainRecreate

Parameters explanation : 

  • --aws-credentials : the file having your AWS credentials,
  • --control-filename : the file having the attributes you want to load and the boolean flag specifying whether or not to replace the attribute/value pair or to add a new attribute/value pair,
  • --data-filename : the data file (without header),
  • --domain-name : the domain you want to load (domain = table),
  • --sdb-endpoint : the endpoint for your sdb instance (eu, us, asia …),
  • -domainRecreate : if present, the domain your specified in [—domain-name] will be dropped and re-built, otherwise the domain won’t be altered,
  • -help : the help output.

Here is the program output, while running :

image

At the end of processing, you should be able to see the whole story :

image

Performances

SDB query (load or select) performance has a lot to do with bandwidth : loading data from your workstation is a lot slower than loading from EC2. That’s why I recommend to use sdb_Loader from an EC2 instance. Loading from a workstation, using a broadband connection behind a corporate firewall, will give you a 20 to 40 rows/second.

Have a look at the graph below. This data was captured from an EC2 instance running Ubuntu when loading a 10 attributes domain, for +1 million rows. As you can see, we are starting slow then peak performance comes quick and seems to be around 200 rows/second and remain stable over time.

Note that I’ve already had better row/sec, like 270 to 300 row/sec. Depends on time (better in the morning, here in France).

sdb_graph

Increase performance with domain sharding

After the interesting reading of  “How to efficiently forklift 1 billion rows into SimpleDB”, from Sid Anand’s blog, I found an interesting idea : domain sharding.

You can create several domains (ex : 100), and then pump all your data into sdb domains with several parallel executions of sdb_loader.

image

Hey, sounds interesting, I want to try it !

No problem, just download the jar here and have fun. You will also find a typical control file and an AWS credential file. The jar is working with java 1.7.

What’s next now ?

If you like or dislike sdb_loader, please let me know. I love having feedback and sharing with people. You can also follow me on Twitter (@vincentteyssier). Of course, feel free to give me any feedback, feature request and/or bug warning.

After the loader, now, I’m starting an extractor. Work is in progress and I have in mind to create a multi threaded extractor (ability to extract from more than a query/domain at a time).

On top of that, I’m currently having a close look to DynamoDB and I will start a loader + extractor in the next coming weeks.

Friday, 16 November 2012

Call RESTful web services from MySQL

Hi,
I’m back after a long time without any post. Sorry for that, I was busy but I promise I will take time to write new posts and share precious things with you.
Recently, I went through a feature request : calling a RESTful web service from MySQL, using SQL. Why ? There are several reasons to do that, including :
  • calling a geolocation web service in order to enrich some PII data,
  • calling a data validation web service (bank, accounting, social, etc …),
  • calling a SMS or EMAIL broker in order to send messages to people, according you have their PII in database. In that scenario, you could use a trigger that fire the RESTful call each time you have an incoming data (ie, a contact) into a given table,
  • send a tweet …
After a quick review on internet, I discovered mysql-udf-http, which is a MySQL User Defined function for HTTP REST. You can find the package here on google code.
This UDF allows you to call any RESTful web service from SQL with HTTP methods : POST, GET, PUT, DELETE.
The process to use it is very simple, as you can see from the google code site.
1 – Install on Linux of course. Maybe you will have to change path below according to your setup.
ulimit -SHn 65535
wget
http://curl.haxx.se/download/curl-7.21.1.tar.gz tar zxvf curl-7.21.1.tar.gz
cd curl-7.21.1/
./configure --prefix=/usr
make && make install
cd ../

echo "/usr/local/webserver/mysql/lib/mysql/" > /etc/ld.so.conf.d/mysql.conf
/sbin/ldconfig
wget
http://mysql-udf-http.googlecode.com/files/mysql-udf-http-1.0.tar.gz tar zxvf mysql-udf-http-1.0.tar.gz
cd mysql-udf-http-1.0/
./configure --prefix=/usr/local/webserver/mysql --with-mysql=/usr/local/webserver/mysql/bin/mysql_config
make && make install
cd ../

2 – Create the UDF in MySQL.
create function http_get returns string soname 'mysql-udf-http.so';
create function http_post returns string soname 'mysql-udf-http.so';
create function http_put returns string soname 'mysql-udf-http.so';
create function http_delete returns string soname 'mysql-udf-http.so';
3 – Use it ! This is very simple.
SELECT http_get('<url>');
SELECT http_post('<url>', '<data>');
SELECT http_put('<url>', '<data>');
SELECT http_delete('<url>');
4 – Another example, using a trigger.
Here, the UDF is triggered each time a new row is added in the database and sends an email to the PII found in the row, given this row complies with some business rules.


CREATE TRIGGER `restful_trigger` BEFORE INSERT ON `customers` FOR EACH ROW BEGIN
DECLARE rtn_value varchar(1000) default ''; 
DECLARE dest_email varchar(100) default ''; 
SET @dest_email = NEW.EMAIL; 
SET @firstname = NEW.firstname; 
SET @message = 'Your message or your html template goes here'; 
-- Some business transformation rules here ...
CASE 
       WHEN NEW.xxx = 'xxx' THEN SET xxx = 'yyy';
-- etc ...
END CASE; 
-- Other business rules here to trigger or not the RESTful call ...
IF bla bla bla AND bla bla bla THEN
SELECT http_get(CONCAT ('http://REST_url_from_the_email_provider&NAME=',@firstname,'&MSG=',@message)) 
INTO @rtn_value; 
END IF;
-- Some rtn_value management code here ...
END;


Happy RESTful interactions with your MySQL … and don’t forget to have a visit on the UDF google code site for more infos and giving credits to this precious piece of software.

Friday, 10 February 2012

Postgresql – quickly check your data demographics

Hi all,
A few days ago, when starting go work on a new postgresql datawarehouse, I was scared by poor performance and poor exec plans while data volume was not so big…
Here is a quick way to check how your data demographics are, by using the pg_stat_user_tables view.

select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze  
from pg_stat_user_tables, 

order by relname


image


Then, like me, you can ask you dba why some tables do not have up to date data demographics … especially big / frequently queried tables. Note that you have a lot of interesting views / tables to learn more about your data demographics in postgresql :

  • pg_stat_activity,
  • pg_stat_database,
  • pg_stat_all_tables
    • pg_stat_sys_tables
    • pg_stat_user_tables

  • pg_stat_all_indexes,
    • pg_stat_sys_indexes
    • pg_stat_user_indexes

And others, using the same naming schema …

Tuesday, 26 July 2011

Query Twitter with Talend to see what people think about …

Hi all,

Almost on holidays after a very hard working year. I have some beach time.

This morning, I tried to query Twitter and to process the data. My purpose is to quickly build a data set showing what people are talking about … let’s say, Obama. Easy, Twitter is providing an interface to run queries and retrieve the results back with json format.

Well, my proposition is to implement a basic word frequency analysis. My tool is Talend, but this process is also easy to set up using java, php, python, shell, Kettle …

image

Here are some details :

  • fileInputJSON : just read the json from Twitter search engine. The syntax is : http://search.twitter.com/search.json?&q=YOUR_QUERY_HERE&rpp=10000, where YOUR_QUERY_HERE = the key word you want to search for. Don’t forget to distribute the answer into a string column :
    • answer ====> “$..text” : where “$..text” stands for the tweet message you want to read.
  • tmap : just to transform the text into lower case. I tried to do it on the fly in the json step, but it does not work.
  • tnormalize : transform all words from anwer into rows (separator is space).
  • tfilter : filter your data in order to avoid “|” (pipes) and other special chars + get rid of words having less than 4 letters.
  • taggregate : aggregate on the words and add a new column named nb that will store the distinct count of all words.
  • tsortrow : now you have words and counts, just sort the data from higher counts to lower (desc).
  • tsamplerow : as we just want to read the first top 20 words, we create a sample based on the sorted rows with range “1..20”.

Finally, I print everything into the console by using a tlogrow + “:” separator.

Here are the data for the keyword “obama”, the query was done on July 26 at 16h00.

image

Finally, I will use Jonathan Feinberg, wordle, to create a nice and sexy word map with the above results.

image

 

Setting up the process and testing it only took 20 minutes. Of course, this can be improved a lot by adding string cleaning, custom data filtering… or better :

  • Capture who created the tweet and who received it : you can create relationship networks with, for instance, networkx.
  • Send the data to a powerfull text analytics framework like NLTK, for better analytics.
  • Create the word map picture into Talend. I have some source code for this, will have to work on it …

I ll see how to add one of these features soon.

Tuesday, 19 July 2011

Interfacing Talend with Amazon SDB (AWS SDB) – quick way

Hi all,

I had the following challenge : read some ftp account informations (ftp server, username, password, target directory) stored in Amazon SDB and use it in a Talend transformation, published as a web service. You know about SDB I hope. For those who don’t, SDB is a key / value database provided by Amazon. So you can name SDB a noSQL database.
I played with the SDB API from Amazon, and succeeded after some coding and “Talending”. Here is how I did.
Below is only a small part of a much larger project, composed of a large webservice collection, created for my client.
This project does the following :
  • Query into a database using dynamic params given by the user at run time from a Flex portal (a query engine like business objects !),
  • Return a rowcount of the query, into the portal. I’m working in web marketing : counting people (segments) before creating a campaign is very important …
  • Generate an extract of the data, process this extract according to the params given by the user (separator, encoding, spliting, zipping …),
  • Send this data file to different “tubes” : router, ftp, AWS S3, local download …
Here we focus on the ftp sending part, using SDB to retrieve infos.

 

The process.

 

image

The job (partial).

image

 

Data structure in AWS SDB.

I’m using a very nice firefox plugin in order to have easy and quick access to my SDB ecosystem : sdbtool. My data structure is simple ( “dd” is of course not the true value …) :
  • Item : ftp
    • Attribute names : Address :
        • Attribute value : dd
      • Attribute names :Login
        • Attribute value : dd
      • Attribute names : PKey
        • Attribute value : dd
      • Attribute names : Password
        • Attribute value : dd
      • Attribute names : Port
        • Attribute value : 21
Here is a screencap of my sdbtool view :
image

Explanations.

First, we load all the needed libraries, using tlibraryload component.
  • aws-java-sdk-1.0.14.jar
  • commons-codec-1.3.jar
  • commons-httpclient-3.0.1.jar
  • commons-logging-1.1.1.jar
  • jackson-core-asl-1.4.3.jar
  • stax-api-1.0.1.jar
  • stax-1.2.0.jar
For the aws-java-sdk-1.0.14.jar import, I had to write some imports. These imports are required to be able to use the aws jdk.
image
Then we have a tRowGenerator in which I create the value for the variable myDomain, that will be used in SDB queries (see code below). You can avoid this step, I created it only for quick testing purpose.
Then, we have to code a little tJavarow. This java code will :
  • connect to AWS SDB. You must have an account for AWS SDB.
  • run several queries, using SQL, to retrieve ftp account informations :
    • ftp server address
    • ftp server login
    • ftp server pass
    • ftp server port
    • ftp server pkey, if needed
  • store the query results in output_row.[name] so they can be used in Talend process.
The code in the tJavaRow. First we create some credentials (use yours) and then create an endpoint with sdb address from AWS : https://sdb.eu-west-1.amazonaws.com. Be carefull to set a valid endpoint, using a valid country zone.
The code is finally simple : create a string containing your sql query, then call a getItems() function. An Item is sent back, simply call a getAttribute in order to retrieve the value you need.
I chose, for simplicity, to run a different query for each item I need from SDB. Of course, you can write it shortly.
  
        BasicAWSCredentials credentials = new BasicAWSCredentials("KL45LKJ4325MLKJ2345", "LKJ45LKJmlkjdlkjGRhjKLJSFSDG432534");    
        
    final String[] FTP_Items;
    
        

        AmazonSimpleDB sdb = new AmazonSimpleDBClient(credentials);
        sdb.setEndpoint("https://sdb.eu-west-1.amazonaws.com");
        try {
        int i = 0;
            String myDomain = "Clients"; 
            String selectExpression = "select FTP_Address from `" + myDomain + "`where code_client = '" + context.client_name +"'";
           
            SelectRequest selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Address = attribute.getValue().toString();
                }
            }         
            selectExpression = "select FTP_Login from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Login = attribute.getValue().toString();
                }
            }
            selectExpression = "select FTP_Pass from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Pass = attribute.getValue().toString();
                }
            }
            selectExpression = "select FTP_Port from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Port = Integer.valueOf(attribute.getValue());
                }
            }
            selectExpression = "select FTP_PKey from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_PKey = attribute.getValue().toString();
                }
            }
        } catch (AmazonServiceException ase) {
            System.out.println("AWSException");
            System.out.println("ErrorMsg:    " + ase.getMessage());
            System.out.println("HTTPStatcode: " + ase.getStatusCode());
            System.out.println("AWS Errcode:   " + ase.getErrorCode());
            System.out.println("Errortype:       " + ase.getErrorType());
            System.out.println("RequestID:       " + ase.getRequestId());
        } catch (AmazonClientException ace) {
            System.out.println("AWSClientException");
            System.out.println("Error Message: " + ace.getMessage());


Final.

After retrieving all the item I need for sending on ftp (server, username, pass, port or location for ssh key), I store all this into global variables. Then, these global variables are used as arguments into two very customized scripts (needed in my case) that will send the files : simple ftp or sftp when needed. Finally, I catch some usefull infos from the custom ftp scripts, process it into a tmap and send this information into a tBufferOutput step. That way, I can provide a soap feed back when calling this webservice.

This post is very consice, feel free to ask me for more infos about this process.

Links.

Tuesday, 21 June 2011

… about AWS cloud, Talend, Jaspersoft, Postgresql and typical EC2 internal addressing issues …

Hi all,

I’m terribly late with this article, initially scheduled for January 2011 … sorry. Maybe it is a bit outdated now, anyway, I publish it …
Let’s talk about EC2 cloud computing, Talend, Postgresql and JasperServer. Basic setup.
You already know all the pros and cons with cloud computing, I won’t talk about that. As to me, I love cloud computing and use it everyday, because of these particular advantages :
  • Scalablity : scale up or down any instance, according to your needs,
  • Flexibility : create your own instances, boot them, create quick sandboxes, replicate data …
  • Pay per use : you pay for what you use (cpu, storage, security …),
  • Opex, no capex !
Cloud computing is still something new, and it is not surprising to discover softwares that are not ready for it or not fully “cloud compliant”. I recently faced such an issue when implementing Postgresql, Talend and Jaspersoft, which remain my preferred open source BI tools.

First issue

Let’s imagine we have a single server, hosting Postgresql. No big deal with that as long as we use this instance in a simple way : I can start my instance, host data on a persistent EBS, connect to it and stop it whenever I want. By using elastic IPs, I can assign a “fixed” IP address to this server and can easily set up a connection string. Note on 16/12/2010 : Amazon is now offering a DNS service.
Now let’s imagine we need a typical BI architecture (tiers) : one ETL (Talend or Pentaho of course !), a Postgresql database in the middle and Jaspersoft for reporting.
That’s a bit more complex because we need our Postgresql server to allow connections from the ETL and from the reporting tool. On top of that, we want to fully leverage all cloud computing features : stop the servers when they are not used, boot them when the service is needed, maybe change their network properties ... eventually we want this to be fully automated and working without any human actions like changing the connection strings, starting/stopping the servers …
Let’s have a look to a little schema now. As you can see, we have now our architecture up and running. We are also using elastic IPs for each server, which is mandatory for the following demonstration. IPs are fake.
image

How to read Public DNS, Private DNS and Elastic IPs on AWS EC2 ?
Imagine we have an instance running. This instance has an Elastic IP which is 46.52.186.25 and the private IP address is 11.235.33.6.
The Private DNS name is : ip-11-235-33-6.eu-west-1.compute.internal
The Public DNS name is : ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
You see the relationship ?

Ok, now, how do you think we will configure Postgresql server to allow connexions from the ETL server and from the Reporting server ? Easy, here is one answer :
  1. By making the ETL Server and the reporting server point to Postgresql. For that, we will use this nice little Elastic IP we previously set up for Postgresql server because it’s soooo easy to do that way …
  2. By writing the ETL server Elastic IP and reporting server Elastic IP into Postgresql pg_hba.conf of course … because here again it is soooo easy natural to do so.
  3. Don’t forget to open the corresponding ports in your security groups (see picture above).
Ok, easy. Let’s go for it. We make Talend and Jasper point to Postgresql like this :
Jasper server connexion screen : Postgresql database <===> Jasperserver
image
Talend client connexion screen : your client <===> Talend server
image

Talend server connexion screen : Talend server <===> Postgresql database
image

And then we write down the Elastic IPs into the pg_hba file like this, in order to allow Talend server and JasperServer to connect to the postgresql database. This is a basic pg_hba.conf, I encourage you to add stronger authentication.
image
We are done. Don’t forget to adjust the security groups like this :
  • Talend Server : allow 8080, allow 22
  • Postgresql Server : allow 5432, allow 22
  • Jasperserver : allow 80 (or 443 if https), allow 22
Okay, this stuff is fully working, you can test it.
But wait … that’s not the good way to do ! By using the elastic IPs to set up communication between each server/node, we just created a weird monster that makes the traffic going OUT of the cloud and going BACK INTO the cloud. Don’t forget you are paying for that. Look at this schema.

image

First solution

The best practice is to avoid using elastic IPs in order to set up network traffic between servers that are hosted inside the EC2 cloud. Instead, use EC2 internal adresses.
Ok, but … wait a minute.
  • How do I do to retrieve the internal address from inside EC2 ? 
The solution rely on a poorly documented EC2 feature : when you query an ec2 public DNS server from inside EC2, you will be given back the corresponding internal IP address. Just what we need !!!!
For instance, if you query your ETL Server from your your Postgresql server, by using the famous host command, you will have :
image

You see what you have to do ? Replace all elastic IPs, except for your Talend client, by internal IPs. Like that, your internal data won’t leave the cloud, like below.

image
After using the internal addressing, the connexion screens will look like this :
Jasper server connexion screen : Postgresql database <===> Jasperserver

image

 

Talend server connexion screen : Talend server <===> Postgresql database
image


Second issue

Well, ok, we solved our first issue : using internal addresses between the ETL server and the Postgresql server. But, I can see two other issues :
  • Postgresql still does not accept DNS names in the pg_hba.conf ! Only IP addresses allowed. So We can’t ask Postgresql and pg_hba.conf to resolve the dns for us.
  • What if I decide to reboot the ETL server, or the Reporting server ? These internal adresses are nice but they are changing each time I reboot / restart server in EC2. Then, how to keep my Postgreqsl pg_hba.conf updated with frequently changing adresses ?
image…not allowed …

Second solution

No, there is still no support for DNS entries in the pg_hba.conf. I know this is a long awaited feature, at least by me. But, unless I’m wrong (tell me), writing down a DNS name in pg_hba.conf won’t work and the server won’t start.
We need to find a way to update the pg_hba.conf with the last / current ec2 internal addresses corresponding to the ETL server and the Reporting server. Easy, we will use a bit of shell code here. This script will retrieve the internal IP Address for each server (ETL and JasperServer) by using the command host and will update this address in the pg_hba.conf by using some sed or awk. Then, by using a sighup, Postgresql server will apply the new address configuration.
Nothing complex, but the success rely on a good timing.
image
Note here : I created an ORCHESTRATOR, a specialized instance in EC2, to monitor all my servers. This orchestrator will run this kind of script as soon as it detects any change in the internal addressing schema. This ORCHESTRATOR will be detailed in a future article (I made several public presentations, and a lot of people seem interested …).
And the shell script. This shell asks for the internal address, then updates the corresponding line. For that, you must  maintain your file in a tidy way : labels are needed.
################################ 
#                              # 

#      IP adress lookup        # 

#                              #  
################################ 
# POSTGRES (DATABASE) Server
# Public DNS : ec2-12-345-678-999.eu-west-1.compute.amazonaws.com 


# TALEND (ETL) Server     
ETL_SERVER=`host ec2-11-222-33-444.eu-west-1.compute.amazonaws.com | sed 's/.*has address //g'` 

# JASPER (BI & reports) Server    
JASPER_SERVER=`host ec2-22-33-444-555.eu-west-1.compute.amazonaws.com | sed 's/.*has address //g'` 



# Echoing all     
echo "" 

echo "################## EC2 Addresses Update ######################" 

echo "Will update EC2 Talend Server address with : " $ETL_SERVER 

echo "Will update EC2 Jasper Server address with : " $JASPER_SERVER 



echo ""

# Find and replace line Talend Server TALEND_NB=`grep -n "Talend server connexion" /mnt/postgres/data/pg_hba.conf | cut -d":" -f1` TALEND_NB=$((TALEND_NB+1)) sed -i "$TALEND_NB s%.*%host    all         all         $ETL_SERVER/32      md5%" /mnt/postgres/data/pg_hba.conf # Find and replace line Jasper Server JASPER_NB=`grep -n "JasperServer connexion" /mnt/postgres/data/pg_hba.conf | cut -d":" -f1` JASPER_NB=$((JASPER_NB+1)) sed -i "$JASPER_NB s%.*%host    all         all         $JASPER_SERVER/32      md5%" /mnt/postgres/data/pg_hba.conf



The end

Having a small (or even big) BI architecture up and running into EC2 is not a big deal. Having it properly set – in order not to pay extra fees – is something different and need some basic thinking before doing. The addressing issue which is technically simple, can have negative impact on your project if you don’t manage it from the start.

I will recommand any AWS / EC2 user (BI or not) to create their own admin tools and scripts, based on the various available APIs, in order to  :
  • reduce reaction time,
  • be fully independent,
  • spare time (graphical tools are nice but need clicks, clicks and clicks …)
Some usefull links about AWS / EC2 documentation :
    Feel free to contact me if this article is not clear enough.