Saturday, 12 October 2013

EANs, open catalog, POD and other product definition data

Hi all,

I’m currently building, enriching and maintaining large product databases. I have to organize, load and transform hundreds of thousands of product definitions.

These product definitions are very detailed :

  • an ID (I’ll come back later on this)
  • a description : ingredients, composition, origin, lifestyle, packaging, contents, brand … some are very verbose
  • pricing information : a price for each distributor

For that purpose, I’m using a blend of a relational database for the ETL staging and a document database (MongoDB) for the target. I’m very happy with MongoDB.

Managing all these products, and linking them with prices, requires a robust primary key. Here comes the EAN !
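To make the idea concrete, here is a minimal Python sketch of a product document keyed by its EAN. The field names (`description`, `prices`), the helper functions and the sample values are assumptions made up for illustration, not my actual schema; only the `_id` field is the standard MongoDB primary key name.

```python
def make_product(ean, description):
    """Build a product document; the EAN doubles as the MongoDB-style `_id`."""
    return {"_id": ean, "description": dict(description), "prices": {}}


def set_price(product, distributor, price):
    """Record (or overwrite) the price charged by one distributor."""
    product["prices"][distributor] = price
    return product


# Hypothetical sample data, not taken from a real catalog.
catalog = {}
doc = make_product("4006381333931", {"brand": "ExampleBrand", "packaging": "box"})
set_price(doc, "distributor_a", 1.99)
set_price(doc, "distributor_b", 2.10)
catalog[doc["_id"]] = doc  # keyed by EAN, as a collection is keyed by _id
```

With the EAN as the key, loading a new price list from any distributor becomes a simple lookup followed by an update of one entry in `prices`.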

 

What does the EAN really mEAN ?

First, an EAN is the number you can see printed under a barcode.

EAN stands for European Article Number, now renamed International Article Number. Defined by the GS1 organization, an EAN is, most of the time, 13 digits long, and it is used worldwide for marking products. EANs are perfect primary keys for my job, even though some codes can “turn” and change over time (I have a few examples …).

You can learn more about EAN here. I invite you to dig into EAN secrets : you’ll see, they carry a lot of “embedded intelligence” :

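  • prefix : the GS1 member organization (usually a country) that issued the number, which is not necessarily the country where the product was made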
  • company number : id of the company / manufacturer
  • item reference : uniquely identifies a product within the same manufacturer
  • check digit or checksum : a modulo 10 checksum using alternating weights of 1 and 3, similar in spirit to the Luhn checksum.
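The check digit is easy to verify yourself. Below is a small Python sketch of the standard EAN-13 rule : weight the first twelve digits alternately by 1 and 3 (starting from the left), sum them, and take whatever brings the total up to a multiple of 10. The function names are my own. Note that the sketch only extracts the 3-digit GS1 prefix : the company number and item reference have variable lengths, so splitting them would need GS1 allocation data that is not shown here.

```python
def ean13_check_digit(first12):
    """Compute the check digit for the first 12 digits of an EAN-13."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting with the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """Return True when `code` is 13 digits and its last digit is the check digit."""
    return (
        len(code) == 13
        and code.isascii()
        and code.isdigit()
        and ean13_check_digit(code[:12]) == int(code[12])
    )


def gs1_prefix(code):
    """Return the 3-digit GS1 prefix of a valid EAN-13."""
    if not is_valid_ean13(code):
        raise ValueError("not a valid EAN-13")
    return code[:3]
```

Running such a check on every incoming code is a cheap way to reject typos and truncated values before they become primary keys.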

I’m not a number !!!!

The EAN race !

These days, a lot of companies are running the EAN race : they want to build large databases of product definitions WITH their EANs. You can imagine why … : price comparators, product aggregators, open data movement (I approve this message !).

So, the question is : where can I find EANs ? Well … everywhere and nowhere at the same time !

I’ll try to give you an answer. I believe, as a consumer, this data should and must be available to everybody.

 

Winning the EAN race !

You have several options for that :

  • Crawling web retailers :

Easy to start, hard to maintain, hard to deal with data / data structures that will inevitably change over time … but you have to go with that, especially if you are just starting your database.

Note that crawling retailers in order to grab the product catalog (with EANs) is becoming very difficult nowadays (a lot of people / companies / price comparators are doing so). Most of the time, you will have to do some hacking : using Tor or proxies in order to hide your origin / IP, fooling the web server with header rewriting, etc …

  • Use/hack retailer mobile APIs :

Most of these mobile APIs are not protected enough. Sometimes you will have to spoof your geolocation. Worth a try, really ! A lot of unencrypted data here.

  • Simply look on the internet :

I recently discovered some interesting data sources. I will share these with you now !

 

Some interesting data sources

The Product Open Data (POD).

My favorite one : free access, and free database download. Thousands and thousands of product definitions and EANs !

http://www.product-open-data.com/en/1-home.html

http://www.product-open-data.com/download/

ean-search.

The one I used first : ean-search.org. I have to admit : last winter I wrote a little shell program, using curl, tor, custom headers … and boom, I was able to crawl and grab the whole database. I’m not proud of it, but, well … I guess I wasn’t the only one …

ean-search is really a good data source. It’s like a Google for products. Simply give a product name, and you’ll see product definitions and EANs popping up !

GS1 itself.

GS1 has some interesting pages to use. This one gives you access to the GEPIR database (Global Electronic Party Information Registry). It’s like a search engine for all EAN-related data.

eandata.com

eandata.com is very interesting. Lots of educational articles, and a full database download is available.

 

upc database

upcdatabase.com has over 1,640,000 items available (and over 155,000 EANs).
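When mixing UPC and EAN sources, it helps to normalize everything to one key format. A 12-digit UPC-A code is equivalent to a 13-digit EAN with a leading zero, and the check digit stays valid. The Python sketch below shows that normalization ; the function name is hypothetical, and it deliberately ignores other formats (EAN-8, UPC-E), which would need their own handling.

```python
def to_ean13(code):
    """Normalize a UPC-A (12 digits) or EAN-13 (13 digits) string to EAN-13."""
    digits = code.strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("code must contain only digits")
    if len(digits) == 12:
        # UPC-A : prefixing a zero gives the equivalent EAN-13.
        return "0" + digits
    if len(digits) == 13:
        return digits
    raise ValueError("expected a 12-digit UPC-A or a 13-digit EAN-13")
```

For example, `to_ean13("036000291452")` gives `"0036000291452"`, while a code that is already 13 digits long is returned unchanged.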

 

Global product list

Global Product List has a huge list of items, grouped by company name.

 

French company : SOGEDIAL

If you are interested in the French market, SOGEDIAL displays its full catalog online. Even more interesting, products are grouped by category (e.g. [food / fresh food / fruits / bananas]).

opengtindb.org

A German site, with a search engine. Worth a try.

 

Amazon Product Advertising API

Very good service for UPC lookup. Well, the power of Amazon here …

 

Factual

Factual is a huge database of over 650,000 consumer goods. It displays core product data (UPC / EAN, name, manufacturer, brand, size, average price …), nutrition and ingredients. This database, as well as the user interface, is really amazing !

 

Best Buy API

No need to introduce Best Buy and their API. Only for electronics.

Google Shopping API

EAN / UPC lookup here.

 

I’ll come back soon with more things about UPCs and EANs. Stay in touch !
