Hi all,
I’m terribly late with this article, which was initially scheduled for January 2011 … sorry. It may be a bit outdated by now, but I’m publishing it anyway …
Let’s talk about EC2 cloud computing, Talend, Postgresql and JasperServer. Basic setup.
You already know all the pros and cons of cloud computing, so I won’t go over them. As for me, I love cloud computing and use it every day because of these particular advantages :
- Scalability : scale any instance up or down according to your needs,
- Flexibility : create your own instances, boot them, create quick sandboxes, replicate data …
- Pay per use : you pay for what you use (cpu, storage, security …),
- Opex, no capex !
Cloud computing is still something new, and it is not surprising to discover software that is not ready for it or not fully “cloud compliant”. I recently faced such an issue when implementing Postgresql, Talend and Jaspersoft, which remain my preferred open source BI tools.
First issue
Let’s imagine we have a single server, hosting Postgresql. No big deal with that as long as we use this instance in a simple way : I can start my instance, host data on a persistent EBS, connect to it and stop it whenever I want. By using elastic IPs, I can assign a “fixed” IP address to this server and can easily set up a connection string.
Note on 16/12/2010 : Amazon is now offering a DNS service.
Now let’s imagine we need a typical BI architecture (tiers) : one ETL (Talend or Pentaho of course !), a Postgresql database in the middle and Jaspersoft for reporting.
That’s a bit more complex because we need our Postgresql server to allow connections from the ETL and from the reporting tool. On top of that, we want to fully leverage all cloud computing features : stop the servers when they are not used, boot them when the service is needed, maybe change their network properties ... In the end, we want this to be fully automated and working without any human action such as changing the connection strings or starting/stopping the servers …
Let’s have a look at a small diagram now. As you can see, our architecture is now up and running. We are also using elastic IPs for each server, which is mandatory for the following demonstration. IPs are fake.
How to read Public DNS, Private DNS and Elastic IPs on AWS EC2 ?
Imagine we have an instance running. This instance has an Elastic IP, which is 46.52.186.25, and its private IP address is 11.235.33.6.
The Private DNS name is : ip-11-235-33-6.eu-west-1.compute.internal
The Public DNS name is : ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
You see the relationship ?
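To make the relationship explicit, here is a minimal sketch that rebuilds both DNS names from the two fake addresses above. The function names are made up for this illustration, and the domain suffixes are simply copied from the eu-west-1 examples in this article ; other regions may use different suffixes.

```shell
#!/bin/bash
# Sketch: derive EC2-style DNS names from IP addresses, following the
# naming pattern of the example above. REGION and the suffixes are
# assumptions taken from the article's eu-west-1 examples.
REGION="eu-west-1"

# 11.235.33.6 -> ip-11-235-33-6.eu-west-1.compute.internal
private_dns_name() {
  echo "ip-$(echo "$1" | tr '.' '-').${REGION}.compute.internal"
}

# 46.52.186.25 -> ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
public_dns_name() {
  echo "ec2-$(echo "$1" | tr '.' '-').${REGION}.compute.amazonaws.com"
}

private_dns_name 11.235.33.6
public_dns_name 46.52.186.25
```

In short, each name is just the IP address with dots turned into dashes, plus a prefix and a regional suffix.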
Ok, now, how do you think we will configure the Postgresql server to allow connections from the ETL server and from the Reporting server ? Easy, here is one answer :
- By making the ETL server and the reporting server point to Postgresql. For that, we will use this nice little Elastic IP we previously set up for the Postgresql server because it’s soooo easy to do it that way …
- By writing the ETL server Elastic IP and the reporting server Elastic IP into the Postgresql pg_hba.conf of course … because here again it is soooo easy and natural to do so.
- Don’t forget to open the corresponding ports in your security groups (see picture above).
Ok, easy. Let’s go for it. We make Talend and Jasper point to Postgresql like this :
Jasper server connection screen : Postgresql database <===> Jasperserver
Talend client connection screen : your client <===> Talend server
Talend server connection screen : Talend server <===> Postgresql database
And then we write down the Elastic IPs into the pg_hba.conf file like this, in order to allow the Talend server and JasperServer to connect to the Postgresql database. This is a basic pg_hba.conf ; I encourage you to add stronger authentication.
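As an illustration only, such a file could look like the sketch below, written here as a small shell snippet that produces a sample fragment in a temporary file. The two addresses are placeholders from the 203.0.113.0/24 documentation range, not real Elastic IPs, and the comment labels are the ones the update script later in this article searches for.

```shell
#!/bin/bash
# Sketch only: writes an illustrative pg_hba.conf fragment to a temp file.
# The two addresses are placeholders (documentation range 203.0.113.0/24),
# NOT the real Elastic IPs of the servers.
SAMPLE_HBA="$(mktemp)"

cat > "$SAMPLE_HBA" <<'EOF'
# TYPE  DATABASE  USER  ADDRESS           METHOD
# Talend server connexion
host    all       all   203.0.113.10/32   md5
# JasperServer connexion
host    all       all   203.0.113.20/32   md5
EOF

cat "$SAMPLE_HBA"
```

Each host line allows one single address (/32) to connect to any database as any user with md5 password authentication, which is why stronger settings are worth considering.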

We are done. Don’t forget to adjust the security groups like this :
- Talend Server : allow 8080, allow 22
- Postgresql Server : allow 5432, allow 22
- Jasperserver : allow 80 (or 443 if https), allow 22
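If you prefer the command line to the console, the same rules could be sketched with the AWS command line tool as below. This is a dry run that only prints the commands. The group names and the ALLOWED_CIDR range are hypothetical placeholders, and --group-name assumes security groups that can be addressed by name (otherwise --group-id would be needed).

```shell
#!/bin/bash
# Dry-run sketch: prints the AWS CLI calls instead of executing them.
# Group names and ALLOWED_CIDR are hypothetical placeholders.
ALLOWED_CIDR="203.0.113.0/24"

# usage: open_port <group-name> <tcp-port>
open_port() {
  echo "aws ec2 authorize-security-group-ingress" \
       "--group-name $1 --protocol tcp --port $2 --cidr $ALLOWED_CIDR"
}

for port in 8080 22; do open_port talend-server "$port"; done
for port in 5432 22; do open_port postgresql-server "$port"; done
for port in 80 22;   do open_port jasperserver "$port"; done
```

In a real setup you would probably use a different source range per port, for instance restricting 5432 to the ETL and reporting servers only.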
Okay, this stuff is fully working, you can test it.
But wait … that’s not the right way to do it ! By using the elastic IPs to set up communication between each server/node, we just created a weird monster that makes the traffic go OUT of the cloud and come BACK INTO the cloud. Don’t forget you are paying for that. Look at this diagram.
First solution
The best practice is to avoid using elastic IPs to set up network traffic between servers that are hosted inside the EC2 cloud. Instead, use EC2 internal addresses.
Ok, but … wait a minute.
- How do I retrieve the internal address from inside EC2 ?
The solution relies on a poorly documented EC2 feature : when you resolve an EC2 public DNS name from inside EC2, you are given back the corresponding internal IP address. Just what we need !!!!
For instance, if you query your ETL server from your Postgresql server by using the famous host command, you will have :
You see what you have to do ? Replace all elastic IPs, except for your Talend client, with internal addresses. That way, your internal data won’t leave the cloud, like below.
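Here is a minimal sketch of that lookup. It assumes the host command answers with a line of the form "<name> has address <ip>". The function names are mine, and the sample line reuses the fake addresses from the beginning of this article so that the snippet also works outside EC2.

```shell
#!/bin/bash
# Extract the IPv4 address from `host` output read on stdin.
# Prints nothing if no "has address" line is present (failed lookup).
extract_address() {
  sed -n 's/.*has address //p' | head -n 1
}

# Resolve a public DNS name. Run from inside EC2, this is expected to
# return the internal address of the instance (not called here).
internal_ip() {
  host "$1" | extract_address
}

# Offline illustration with the fake addresses used in this article:
echo "ec2-46-52-186-25.eu-west-1.compute.amazonaws.com has address 11.235.33.6" | extract_address
```

Using sed -n with the p flag means a failed lookup gives an empty result instead of an error message, which is easier to test for in a script.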

After switching to internal addressing, the connection screens will look like this :
Jasper server connection screen : Postgresql database <===> Jasperserver

Talend server connection screen : Talend server <===> Postgresql database
Second issue
Well, ok, we solved our first issue : using internal addresses between the ETL server and the Postgresql server. But I can see two other issues :
- Postgresql still does not accept DNS names in pg_hba.conf ! Only IP addresses are allowed, so we can’t ask Postgresql and pg_hba.conf to resolve the DNS names for us.
- What if I decide to reboot the ETL server, or the Reporting server ? These internal addresses are nice, but they change each time I restart a server in EC2. So how do I keep my Postgresql pg_hba.conf updated with frequently changing addresses ?
Second solution
No, there is still no support for DNS entries in pg_hba.conf, at least before Postgresql 9.1, which introduced host names there. I know this is a long awaited feature, at least by me. On older versions, unless I’m wrong (tell me), writing down a DNS name in pg_hba.conf won’t work and the server won’t start.
We need to find a way to update pg_hba.conf with the latest / current EC2 internal addresses corresponding to the ETL server and the Reporting server. Easy, we will use a bit of shell code here. This script will retrieve the internal IP address of each server (ETL and JasperServer) by using the host command and will update this address in pg_hba.conf by using some sed or awk. Then, with a SIGHUP, the Postgresql server will apply the new address configuration.
Nothing complex, but success relies on good timing.
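For the SIGHUP part, a minimal sketch could be the following. It relies on pg_ctl reload, which sends SIGHUP to the Postgresql server process. The data directory is the one used by the script in this article, and DRY_RUN is a hypothetical switch added here so that the snippet only prints the command by default.

```shell
#!/bin/bash
# Sketch: ask Postgresql to re-read pg_hba.conf without restarting.
# PGDATA defaults to the path used in this article; DRY_RUN is a
# hypothetical switch that only prints the command when set to 1.
PGDATA="${PGDATA:-/mnt/postgres/data}"
DRY_RUN="${DRY_RUN:-1}"

reload_postgres() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "pg_ctl reload -D $PGDATA"
  else
    # pg_ctl reload sends SIGHUP to the server; run it as the postgres OS user
    pg_ctl reload -D "$PGDATA"
  fi
}

reload_postgres
```

A reload does not drop existing connections, so it can be run each time the addresses change.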

Note here : I created an ORCHESTRATOR, a specialized instance in EC2, to monitor all my servers.
This orchestrator will run this kind of script as soon as it detects any change in the internal addressing scheme. This ORCHESTRATOR will be detailed in a future article (I made several public presentations, and a lot of people seem interested …).
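The orchestrator itself is not described here, so the following is only a guess at how a change could be detected, not the actual implementation : remember the last address seen for each server in a small state file and compare it with the current one. STATE_DIR and the function name are hypothetical.

```shell
#!/bin/bash
# Sketch of change detection: remember the last address seen for a server
# and report whether it changed. STATE_DIR is a hypothetical location.
STATE_DIR="${STATE_DIR:-/tmp/ec2-addresses}"

# usage: address_changed <server-label> <current-ip>
# Returns 0 (true) if the address differs from the stored one, and stores
# the new address. Returns 1 if nothing changed.
address_changed() {
  local state_file="$STATE_DIR/$1.last"
  mkdir -p "$STATE_DIR"
  if [ -f "$state_file" ] && [ "$(cat "$state_file")" = "$2" ]; then
    return 1
  fi
  echo "$2" > "$state_file"
  return 0
}
```

A monitoring loop could then call something like address_changed talend "$ETL_SERVER" and run the update script only when it returns true.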
And now the shell script. It asks for the internal address, then updates the corresponding line. For that, you must keep your file tidy : labels (comment lines) are needed, because the script rewrites the line that follows each label.
################################
#                              #
#       IP address lookup      #
#                              #
################################
PG_HBA=/mnt/postgres/data/pg_hba.conf
# POSTGRES (DATABASE) Server
# Public DNS : ec2-12-345-678-999.eu-west-1.compute.amazonaws.com
# TALEND (ETL) Server
ETL_SERVER=$(host ec2-11-222-33-444.eu-west-1.compute.amazonaws.com | sed -n 's/.*has address //p' | head -n 1)
# JASPER (BI & reports) Server
JASPER_SERVER=$(host ec2-22-33-444-555.eu-west-1.compute.amazonaws.com | sed -n 's/.*has address //p' | head -n 1)
# Abort if a lookup failed, so that no broken line is written to pg_hba.conf
if [ -z "$ETL_SERVER" ] || [ -z "$JASPER_SERVER" ]; then
  echo "DNS lookup failed, pg_hba.conf left unchanged" >&2
  exit 1
fi
# Echoing all
echo ""
echo "################## EC2 Addresses Update ######################"
echo "Will update EC2 Talend Server address with : $ETL_SERVER"
echo "Will update EC2 Jasper Server address with : $JASPER_SERVER"
echo ""
# Find the Talend Server label and replace the line that follows it
TALEND_NB=$(grep -n "Talend server connexion" "$PG_HBA" | head -n 1 | cut -d":" -f1)
TALEND_NB=$((TALEND_NB+1))
sed -i "${TALEND_NB}s%.*%host all all $ETL_SERVER/32 md5%" "$PG_HBA"
# Find the Jasper Server label and replace the line that follows it
JASPER_NB=$(grep -n "JasperServer connexion" "$PG_HBA" | head -n 1 | cut -d":" -f1)
JASPER_NB=$((JASPER_NB+1))
sed -i "${JASPER_NB}s%.*%host all all $JASPER_SERVER/32 md5%" "$PG_HBA"
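If you have more than two servers to track, the find-and-replace step can be factored into one function. This is only a sketch under a few assumptions : update_hba_line is a name I made up, GNU sed is available (for sed -i), and the label layout is the one described above. It also refuses to touch the file when the address does not look like an IPv4 address or when the label is missing.

```shell
#!/bin/bash
# usage: update_hba_line <pg_hba-file> <label> <ip>
# Rewrites the line that follows the first line containing <label>.
# Returns 1 if <ip> is not a dotted IPv4 address, 2 if the label is missing.
update_hba_line() {
  local file="$1" label="$2" ip="$3" line
  # Refuse anything that is not a dotted IPv4 address (e.g. a failed lookup)
  echo "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  line=$(grep -n "$label" "$file" | head -n 1 | cut -d: -f1)
  # Label not found: leave the file alone
  [ -n "$line" ] || return 2
  sed -i "$((line + 1))s%.*%host all all $ip/32 md5%" "$file"
}
```

It would be called once per server, for example update_hba_line /mnt/postgres/data/pg_hba.conf "Talend server connexion" "$ETL_SERVER".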
The end
Having a small (or even big) BI architecture up and running in EC2 is not a big deal. Having it properly set up, in order not to pay extra fees, is something different and needs some basic thinking beforehand. The addressing issue, which is technically simple, can have a negative impact on your project if you don’t manage it from the start.
I would recommend that any AWS / EC2 user (BI or not) create their own admin tools and scripts, based on the various available APIs, in order to :
- reduce reaction time,
- be fully independent,
- save time (graphical tools are nice but need clicks, clicks and clicks …)
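As a tiny example of such an admin script, here is a sketch that starts or stops the three servers in one command with the AWS command line tool. The instance IDs are hypothetical placeholders, the function name is mine, and the script only prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/bash
# Dry-run sketch of a tiny admin helper. The instance IDs are hypothetical
# placeholders; set DRY_RUN=0 to really call the AWS CLI.
DRY_RUN="${DRY_RUN:-1}"
INSTANCES="i-0000000000000001 i-0000000000000002 i-0000000000000003"

# usage: bi_stack start|stop
bi_stack() {
  case "$1" in
    start|stop) ;;
    *) echo "usage: bi_stack start|stop" >&2; return 1 ;;
  esac
  if [ "$DRY_RUN" = "1" ]; then
    echo "aws ec2 $1-instances --instance-ids $INSTANCES"
  else
    # INSTANCES is left unquoted on purpose so each ID is a separate argument
    aws ec2 "$1-instances" --instance-ids $INSTANCES
  fi
}

bi_stack stop
```

One command instead of a series of clicks in the console, which is exactly the kind of time saving meant above.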
Some useful links about AWS / EC2 documentation :
Feel free to contact me if this article is not clear enough.