As usual, too many things to share and too little time to write.
Well, this time, I'm doing it.
I'm currently working on a Data Federation / Data Virtualization project, aiming at virtualizing data coming from many horizons: public data from web services, commercial data coming from data feeds, internal relational data, and so on.
One of my data sources is the Statistical Data Warehouse (SDW) from the European Central Bank (ECB). That's funny because 12 or 13 years ago, I was feeding that very warehouse while working for the French national central bank (one of the NCBs).
This warehouse is open to everyone and you will find data about employment, production and many other economic topics, well organized into "Concepts":
- Monetary operations,
- Exchange rates,
- Payment and securities trading,
- Monetary statistics, and lots of other fun things.
SDMX, theory

SDMX stands for Statistical Data and Metadata eXchange. This project started in 2002 and aims at providing a standard for statistical data and metadata exchange. Several famous institutions are at the origin of SDMX (the BIS, the ECB, Eurostat, the IMF, the OECD, the UN and the World Bank).
You will find a nice SDMX tutorial here; for the moment, here is a quick description of the model:
- Descriptor concepts: give meaning to a statistical observation.
- Packaging structure: the hierarchy of statistical data: observation level, group level, dataset level, and so on.
- Dimensions and attributes: dimensions both identify and describe; attributes describe only.
- Keys: dimension values are grouped into a key sequence that identifies an item.
- Code lists: the lists of allowed values.
- Data Structure Definition: the description of the structures.
SDMX, by example
- Metadata:
- Series key: gives the value of each dimension.
- Attributes: the attribute values defined for this dataset.
- Observation dimension: time, in this example.
- Observation value: the value itself.
How to build a query for ECB data
These web services will allow you to:
- query metadata,
- query structure definitions,
- query data: this is the interesting part of this article!
But let me give you a quick overview first. It is very simple as soon as you understand it.
Let's write a REST query for some ECB data.
First you need the base service URL:
- Easy, it is: https://sdw-wsrest.ecb.europa.eu/service/
Then append the resource you want; here we want data:
- Now you have the URL: https://sdw-wsrest.ecb.europa.eu/service/data
Then add the agency maintaining the dataflow, eurostat:
- We have the URL: https://sdw-wsrest.ecb.europa.eu/service/data/eurostat
Finally, add a comma and the dataflow identifier, ieaq:
- The URL is: https://sdw-wsrest.ecb.europa.eu/service/data/eurostat,ieaq
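The steps above are easy to script. Here is a minimal sketch that composes the final URL from its parts (the component names in the variables are mine, for illustration):

```shell
#!/bin/sh
# Compose the SDW REST data URL from its parts, step by step as above.
BASE="https://sdw-wsrest.ecb.europa.eu/service"   # base service URL
RESOURCE="data"                                   # we want data, not metadata
AGENCY="eurostat"                                 # agency maintaining the dataflow
FLOW="ieaq"                                       # dataflow identifier

URL="$BASE/$RESOURCE/$AGENCY,$FLOW"
echo "$URL"
```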
The combination of dimensions uniquely identifies statistical data. This combination is known as the series key in SDMX.
Look at the picture below: it shows how to build a series key targeting data for our ongoing example.
When looking at the metadata of the IEAQ series, we see that we need 13 keys to identify an indicator. These keys range from easy ones like FREQ (frequency) or REF_AREA (country) to complex business keys like ESA95TP_ASSET or ESA95TP_SECTOR.
Now we need a value for each of these dimensions, then we "stack" these values with dots (don't forget to follow the order given in the metadata screenshot).
We now have our key: Q.FR.N.V.LE.F2M.S1M.A1.S.1.N.N.Z
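The "stacking" is a one-liner in shell; a sketch using the 13 values from the example:

```shell
#!/bin/sh
# The 13 dimension values, in the order given by the IEAQ metadata.
DIMS="Q FR N V LE F2M S1M A1 S 1 N N Z"

# Join them with dots to obtain the series key.
KEY=$(echo "$DIMS" | tr ' ' '.')
echo "$KEY"
```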
Another way to understand this is to think of the keys as coordinates. By choosing a value for each key, you build coordinates, like latitude/longitude, that identify and locate a dataset.
I chose the cube representation below to illustrate the concept of keys as coordinates (of course, a dataset can have more dimensions than a cube has axes). You can see how a flat metadata representation is translated into a multidimensional structure.
Now, query some data
Here is the top of the XML answer, showing some metadata.
In red, the data. In green, the time dimension. In blue, the value!
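Since colours don't travel well in text, here is a trimmed, hypothetical sketch of what one observation looks like in the generic data message, and how its two interesting attributes can be pulled out with plain sed (the sample XML below is illustrative, not an actual SDW response):

```shell
#!/bin/sh
# A trimmed, hypothetical sample of one observation from a generic data message:
# the Obs element carries the time dimension and the observed value.
XML='<generic:Obs>
  <generic:ObsDimension value="2018-Q3"/>
  <generic:ObsValue value="1234.5"/>
</generic:Obs>'

# Quick-and-dirty extraction with sed; a real XML parser is used below.
TIME=$(printf '%s\n' "$XML" | sed -n 's/.*ObsDimension value="\([^"]*\)".*/\1/p')
VALUE=$(printf '%s\n' "$XML" | sed -n 's/.*ObsValue value="\([^"]*\)".*/\1/p')
echo "$TIME;$VALUE"
```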
Query and process the data!

Ok, calling for data from a web browser is nice but not really useful: the data stays in the browser, and we need to parse and transform it in order to set up a dataset.
Here I will introduce some shell code I used in a larger project, where I had to run massive queries against the ECB SDW and build a full data streaming process.
The command below will allow you to run a query and parse the data for easy extraction. I'm using the powerful xmlstarlet tool here, piping the curl output into it (the value-printing flags at the end are restored from memory):

curl -s "https://sdw-wsrest.ecb.europa.eu/service/data/eurostat,ieaq/Q.FR.N.V.LE.F2M.S1M.A1.S.1.N.N.Z" \
  | xmlstarlet sel -t -m "/message:GenericData/message:DataSet/generic:Series/generic:Obs" \
      -v "generic:ObsDimension/@value" -o ";" -v "generic:ObsValue/@value" -n

This prints one "period;value" line per observation.
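For the massive queries mentioned above, the same kind of command can simply be looped over a list of series keys. A sketch, varying only REF_AREA and printing the URLs instead of actually calling curl (the country codes are just illustrative):

```shell
#!/bin/sh
# Build one data URL per country by swapping the 2nd dimension (REF_AREA)
# of the series key. The network call itself is left out on purpose.
BASE="https://sdw-wsrest.ecb.europa.eu/service/data/eurostat,ieaq"
COUNT=0
for AREA in FR DE IT; do
  KEY="Q.$AREA.N.V.LE.F2M.S1M.A1.S.1.N.N.Z"
  echo "$BASE/$KEY"
  COUNT=$((COUNT+1))
done
```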
Conclusion

The ECB SDW is massive. It contains loads and loads of series, datasets and so on.
Have a look at this partial inventory I did recently.
As you can see, the amount of data is huge.
At this point, my best recommendation would be to first:
- read about the ECB SDW metadata,
- read about the ECB SDW structures,
- learn how to build complex queries (I only gave a very simple example here).
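As a teaser for those complex queries, the SDMX 2.1 RESTful syntax lets you wildcard a dimension by leaving its position empty, and OR several values by joining them with '+'. A sketch building two such keys (still on the IEAQ example):

```shell
#!/bin/sh
# Two useful tricks of the SDMX 2.1 REST key syntax:
# - leave a position empty to wildcard that dimension,
# - join several values with '+' to OR them.
BASE="https://sdw-wsrest.ecb.europa.eu/service/data/eurostat,ieaq"

# All countries (REF_AREA, the 2nd position, left empty):
KEY_ALL="Q..N.V.LE.F2M.S1M.A1.S.1.N.N.Z"
# France or Germany:
KEY_OR="Q.FR+DE.N.V.LE.F2M.S1M.A1.S.1.N.N.Z"

echo "$BASE/$KEY_ALL"
echo "$BASE/$KEY_OR"
```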
Here is, once again, the most important documentation about the SDW and how it is organized :