Tuesday, 21 June 2011

… about AWS cloud, Talend, Jaspersoft, Postgresql and typical EC2 internal addressing issues …

Hi all,

I’m terribly late with this article, which was initially scheduled for January 2011 … sorry. It may be a bit outdated by now, but I’m publishing it anyway …
Let’s talk about EC2 cloud computing, Talend, Postgresql and JasperServer. Basic setup.
You already know all the pros and cons of cloud computing, so I won’t go over them. As for me, I love cloud computing and use it every day because of these particular advantages:
  • Scalability: scale any instance up or down according to your needs,
  • Flexibility: create your own instances, boot them, create quick sandboxes, replicate data …
  • Pay per use: you pay for what you use (CPU, storage, security …),
  • Opex, no capex!
Cloud computing is still something new, and it is not surprising to discover software that is not ready for it or not fully “cloud compliant”. I recently faced such an issue when implementing Postgresql, Talend and Jaspersoft, which remain my preferred open source BI tools.

First issue

Let’s imagine we have a single server hosting Postgresql. No big deal as long as we use this instance in a simple way: I can start my instance, host the data on a persistent EBS volume, connect to it and stop it whenever I want. By using Elastic IPs, I can assign a “fixed” IP address to this server and easily set up a connection string. Note on 16/12/2010: Amazon is now offering a DNS service.
Now let’s imagine we need a typical tiered BI architecture: one ETL (Talend or Pentaho of course!), a Postgresql database in the middle and Jaspersoft for reporting.
That’s a bit more complex because our Postgresql server must allow connections from the ETL and from the reporting tool. On top of that, we want to fully leverage cloud computing features: stop the servers when they are not used, boot them when the service is needed, maybe change their network properties ... Ultimately, we want this to be fully automated and to work without any human intervention such as changing the connection strings or starting/stopping the servers …
Let’s have a look at a small diagram now. As you can see, our architecture is now up and running. We are also using an Elastic IP for each server, which is mandatory for the following demonstration. The IPs are fake.
image

How to read Public DNS, Private DNS and Elastic IPs on AWS EC2 ?
Imagine we have an instance running. This instance has an Elastic IP which is 46.52.186.25 and the private IP address is 11.235.33.6.
The Private DNS name is : ip-11-235-33-6.eu-west-1.compute.internal
The Public DNS name is : ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
You see the relationship ?
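To make the relationship explicit, here is a minimal sketch of two helper functions that rebuild the DNS names from an IP address and a region. The function names are mine (hypothetical), and the naming pattern is simply the one visible in the eu-west-1 example above; other regions may use a different suffix.

```shell
# Hypothetical helpers: rebuild EC2 DNS names from an IPv4 address.
# The pattern is taken from the eu-west-1 example in this article only.

# private_dns_name IP REGION  ->  ip-A-B-C-D.REGION.compute.internal
private_dns_name() {
    printf 'ip-%s.%s.compute.internal\n' "$(printf '%s' "$1" | tr '.' '-')" "$2"
}

# public_dns_name IP REGION  ->  ec2-A-B-C-D.REGION.compute.amazonaws.com
public_dns_name() {
    printf 'ec2-%s.%s.compute.amazonaws.com\n' "$(printf '%s' "$1" | tr '.' '-')" "$2"
}

# The two (fake) addresses used in the example above
private_dns_name 11.235.33.6 eu-west-1
public_dns_name 46.52.186.25 eu-west-1
```

The dots of the address simply become dashes, prefixed by ip- for the private name and ec2- for the public one.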

Ok, now, how do you think we will configure the Postgresql server to allow connections from the ETL server and from the reporting server? Easy, here is one answer:
  1. By making the ETL server and the reporting server point to Postgresql. For that, we will use this nice little Elastic IP we previously set up for the Postgresql server, because it’s soooo easy to do it that way …
  2. By writing the ETL server Elastic IP and the reporting server Elastic IP into the Postgresql pg_hba.conf of course … because here again it is soooo easy and natural to do so.
  3. Don’t forget to open the corresponding ports in your security groups (see picture above).
Ok, easy. Let’s go for it. We make Talend and Jasper point to Postgresql like this:
Jasper server connection screen: Postgresql database <===> Jasperserver
image
Talend client connection screen: your client <===> Talend server
image

Talend server connection screen: Talend server <===> Postgresql database
image

And then we write the Elastic IPs into the pg_hba.conf file like this, in order to allow the Talend server and JasperServer to connect to the Postgresql database. This is a basic pg_hba.conf; I encourage you to add stronger authentication.
image
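In text form, a fragment along the following lines would match that description. This is only a sketch under assumptions: the two addresses are documentation placeholders (203.0.113.x), not real Elastic IPs, and the comment labels are the ones the update script later in this article searches for.

```shell
# Write a sample pg_hba.conf fragment to a temporary file and print it.
# Assumption: placeholder addresses stand in for the Elastic IPs of the
# Talend server and of JasperServer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# Talend server connexion
host    all         all         203.0.113.10/32      md5
# JasperServer connexion
host    all         all         203.0.113.20/32      md5
EOF
cat "$sample"
```

Each host line grants access to a single address (/32) with md5 password authentication, and the comment line just above it acts as a label.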
We are done. Don’t forget to adjust the security groups like this :
  • Talend Server : allow 8080, allow 22
  • Postgresql Server : allow 5432, allow 22
  • Jasperserver : allow 80 (or 443 if https), allow 22
Okay, this setup fully works; you can test it.
But wait … that’s not the right way to do it! By using the Elastic IPs to set up communication between the servers, we just created a weird monster that makes the traffic go OUT of the cloud and come BACK INTO the cloud. Don’t forget you are paying for that. Look at this diagram.

image

First solution

The best practice is to avoid using Elastic IPs for network traffic between servers that are hosted inside the EC2 cloud. Instead, use EC2 internal addresses.
Ok, but … wait a minute.
  • How do I retrieve the internal address from inside EC2?
The solution relies on a poorly documented EC2 feature: when you resolve an EC2 public DNS name from inside EC2, you get back the corresponding internal IP address. Just what we need!
For instance, if you look up your ETL server from your Postgresql server with the famous host command, you will get:
image
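If you want to reuse that lookup in a script, a small sketch like the one below extracts the address from the output of host. The function names are hypothetical, and the sample line only imitates the "has address" format that host prints, with a fake address.

```shell
# Read `host` output on stdin and print the first IPv4 address found.
# Other lines (aliases, errors) are ignored.
first_address() {
    awk '/ has address / { print $NF; exit }'
}

# Hypothetical wrapper around the real `host` command (needs network access,
# so it is only defined here, not called).
resolve_internal() {
    host "$1" | first_address
}

# Offline demonstration with a made-up line in the format `host` prints
printf '%s\n' \
    'ec2-11-222-33-444.eu-west-1.compute.amazonaws.com has address 10.0.0.12' \
    | first_address
```

Run from inside EC2, resolve_internal with a public DNS name should give you the internal address; run from outside, you would get the public one instead.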

You see what you have to do? Replace all Elastic IPs, except the one used by your Talend client, with internal addresses. That way, your internal traffic won’t leave the cloud, as shown below.

image
Once internal addressing is in place, the connection screens look like this:
Jasper server connection screen: Postgresql database <===> Jasperserver

image

 

Talend server connection screen: Talend server <===> Postgresql database
image


Second issue

Well, ok, we solved our first issue: using internal addresses between the ETL server and the Postgresql server. But I can see two other issues:
  • Postgresql still does not accept DNS names in pg_hba.conf! Only IP addresses are allowed, so we can’t ask Postgresql and pg_hba.conf to resolve the DNS names for us.
  • What if I decide to restart the ETL server or the reporting server? These internal addresses are nice, but they change each time an instance is stopped and started again in EC2. How, then, do I keep my Postgresql pg_hba.conf updated with frequently changing addresses?
image…not allowed …

Second solution

No, there is still no support for DNS entries in pg_hba.conf, at least up to PostgreSQL 9.0 (host name support only arrives with 9.1). I know this is a long-awaited feature, at least by me. But, unless I’m wrong (tell me), writing a DNS name in pg_hba.conf won’t work and the server won’t start.
We need to find a way to update pg_hba.conf with the current EC2 internal addresses of the ETL server and the reporting server. Easy, we will use a bit of shell code here. The script retrieves the internal IP address of each server (ETL and JasperServer) with the host command and updates that address in pg_hba.conf with some sed or awk. Then, on receiving a SIGHUP, the Postgresql server applies the new address configuration.
Nothing complex, but success relies on good timing.
image
Note here: I created an ORCHESTRATOR, a specialized instance in EC2, to monitor all my servers. This orchestrator runs this kind of script as soon as it detects any change in the internal addressing scheme. The ORCHESTRATOR will be detailed in a future article (I made several public presentations, and a lot of people seem interested …).
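The orchestrator itself is not shown in this article, so the following is not its code. It is only a minimal sketch of one way to detect an address change: keep the last known address in a state file and compare it with the freshly resolved one. The function name, the state file path and the update_pg_hba.sh script name are all hypothetical.

```shell
# address_changed STATE_FILE NEW_ADDRESS
# Returns 0 and records NEW_ADDRESS when it differs from the address stored
# in STATE_FILE (or when the file does not exist yet); returns 1 otherwise.
address_changed() {
    state_file=$1
    new_address=$2
    old_address=""
    if [ -f "$state_file" ]; then
        old_address=$(cat "$state_file")
    fi
    if [ "$new_address" != "$old_address" ]; then
        printf '%s\n' "$new_address" > "$state_file"
        return 0
    fi
    return 1
}

# Example polling loop (commented out: it needs network access and a
# hypothetical update script):
#   while sleep 60; do
#       ip=$(host "$ETL_DNS" | awk '/ has address / { print $NF; exit }')
#       [ -n "$ip" ] && address_changed /var/tmp/etl.addr "$ip" && ./update_pg_hba.sh
#   done
```

Polling like this is the simplest option; it trades a short delay after each restart for not having to hook into the instance start-up sequence.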
And here is the shell script. It asks for the internal address, then updates the corresponding line. For that to work, you must keep your pg_hba.conf tidy: labels are needed.
################################
#                              #
#      IP address lookup       #
#                              #
################################

# POSTGRES (DATABASE) Server
# Public DNS : ec2-12-345-678-999.eu-west-1.compute.amazonaws.com
PG_HBA=/mnt/postgres/data/pg_hba.conf

# TALEND (ETL) Server
# Resolved from inside EC2, the public DNS name gives the internal address
ETL_SERVER=$(host ec2-11-222-33-444.eu-west-1.compute.amazonaws.com | sed -n 's/.*has address //p' | head -n 1)

# JASPER (BI & reports) Server
JASPER_SERVER=$(host ec2-22-33-444-555.eu-west-1.compute.amazonaws.com | sed -n 's/.*has address //p' | head -n 1)

# Echoing all
echo ""
echo "################## EC2 Addresses Update ######################"
echo "Will update EC2 Talend Server address with : $ETL_SERVER"
echo "Will update EC2 Jasper Server address with : $JASPER_SERVER"
echo ""

# Find the "Talend server connexion" label and replace the line that follows it
TALEND_NB=$(grep -n "Talend server connexion" "$PG_HBA" | cut -d":" -f1)
TALEND_NB=$((TALEND_NB+1))
sed -i "$TALEND_NB s%.*%host    all         all         $ETL_SERVER/32      md5%" "$PG_HBA"

# Find the "JasperServer connexion" label and replace the line that follows it
JASPER_NB=$(grep -n "JasperServer connexion" "$PG_HBA" | cut -d":" -f1)
JASPER_NB=$((JASPER_NB+1))
sed -i "$JASPER_NB s%.*%host    all         all         $JASPER_SERVER/32      md5%" "$PG_HBA"

# Ask Postgresql to re-read pg_hba.conf (the SIGHUP step described above).
# Assumption: pg_ctl is on the PATH and the script runs as the postgres user.
pg_ctl reload -D /mnt/postgres/data
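One weak point of the find-and-replace step is that a failed lookup or a missing label would still rewrite a line of pg_hba.conf. The sketch below wraps the same grep and sed logic in a function that checks its inputs first. The function name is hypothetical, and sed -i is assumed to be GNU sed, as in the script above.

```shell
# update_hba_entry FILE LABEL ADDRESS
# Rewrites the line that follows the first line containing LABEL.
# Returns 1 without touching FILE if ADDRESS is empty or contains anything
# other than digits and dots, or if LABEL is not found.
update_hba_entry() {
    file=$1
    label=$2
    address=$3

    case $address in
        ''|*[!0-9.]*)
            echo "invalid address: '$address'" >&2
            return 1
            ;;
    esac

    line=$(grep -n -- "$label" "$file" | head -n 1 | cut -d: -f1)
    if [ -z "$line" ]; then
        echo "label not found: $label" >&2
        return 1
    fi

    sed -i "$((line + 1)) s%.*%host    all         all         $address/32      md5%" "$file"
}

# Usage, with the file and labels from the script above:
#   update_hba_entry /mnt/postgres/data/pg_hba.conf "Talend server connexion" "$ETL_SERVER"
#   update_hba_entry /mnt/postgres/data/pg_hba.conf "JasperServer connexion" "$JASPER_SERVER"
```

With these checks, a DNS lookup that returns nothing leaves the previous rule in place instead of writing a broken one.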



The end

Having a small (or even big) BI architecture up and running in EC2 is not a big deal. Having it properly set up, so that you don’t pay extra fees, is something different and needs some basic thinking beforehand. The addressing issue, which is technically simple, can have a negative impact on your project if you don’t manage it from the start.

I would recommend that any AWS / EC2 user (BI or not) create their own admin tools and scripts, based on the various available APIs, in order to:
  • reduce reaction time,
  • be fully independent,
  • save time (graphical tools are nice but need clicks, clicks and clicks …)
Feel free to contact me if this article is not clear enough.
