Mirror site for INSEE data that aims to be exhaustive, faster, and easier to browse.
The project is split into four parts:
- 1 - Scraper
- 2 - API
- 3 - Static Website [Deprecated]
- 4 - React Website
The scraper (aka crawler or spider) is in charge of fetching the data from the original INSEE website. It's a Python 3 project that uses the great Scrapy framework.
The spider runs daily on ScrapingHub. It scrapes the data incrementally, meaning it won't go through all pages every day.
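For intuition, here is a minimal sketch of what such an incremental crawl can look like with Scrapy. It is not the project's actual spider: the selectors and the seen-ids bookkeeping are assumptions.

```python
import scrapy


class IncrementalSketchSpider(scrapy.Spider):
    """Sketch: stop paginating once a page contains only already-seen items."""

    name = "insee-incremental-sketch"  # hypothetical name, not the real spider
    start_urls = ["https://insee.fr/fr/statistiques"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # In practice this would be loaded from a previous run's state.
        self.seen_ids = set()

    def parse(self, response):
        # Hypothetical selectors; the real page structure differs.
        ids = response.css("[data-id]::attr(data-id)").getall()
        new_ids = [doc_id for doc_id in ids if doc_id not in self.seen_ids]
        for doc_id in new_ids:
            self.seen_ids.add(doc_id)
            yield {"id": doc_id}
        # Only follow the next page while we are still finding unseen items.
        if new_ids:
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
```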
cd scrapy-project
mkvirtualenv insee-brut
workon insee-brut
pip3 install -r requirements.txt
To start a crawl:
scrapy crawl insee
The first time, you need to install the shub dependency:
pip3 install shub
shub login
Then, each time you want to deploy:
cd scrapy-project
shub deploy 362041
The API is a Rails API project. It uses Mongoid instead of ActiveRecord.
The production version is accessible at api.insee.pw
You can find the full documentation for the API here: http://api.insee.pw/api/docs
cd api
gem install bundler && bundle install
Optionally, work in a dedicated RVM gemset to make sure your dependencies are well isolated.
rails s
The API is hosted on a DigitalOcean droplet. The droplet uses dokku to host the different apps and provide Heroku-like ease of use.
You can deploy a new version of the app with the following command (it assumes a git remote named production pointing at the dokku host):
git subtree push --prefix api production master
The main task shipped with the API, for convenience, is rake prepare:db.
It filters, cleans, and augments the raw scraped items into more usable ones: it fetches the documents scraped into the insee_items MongoDB collection, then creates new documents in other collections such as root_documents.
This task should be run after each scraping session. It can be optimized.
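As a rough illustration of what that transformation does, here is a minimal pymongo sketch. The real task is a Ruby/Mongoid rake task; the cleaning rule and the derived field below are assumptions, only the collection and class names come from the project.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost/insee_brut")
db = client.get_default_database()

for raw in db.insee_items.find({"_scrapy_item_class": "RootDocument"}):
    # Filter/clean: drop scraping metadata fields.
    doc = {k: v for k, v in raw.items() if not k.startswith("_")}
    # Augment: hypothetical derived field.
    doc["has_html"] = bool(raw.get("contenu_html"))
    # Idempotent: re-running after each scraping session just updates documents.
    db.root_documents.replace_one({"_id": raw["_id"]}, doc, upsert=True)
```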
Currently the app is a fully statically built website, generated by custom Python scripts. It uses the Mustache templating language via the pystache lib.
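For reference, rendering with pystache looks like this (a generic example, not one of the project's actual templates):

```python
import pystache

# Minimal Mustache rendering via pystache.
template = "<h1>{{titre}}</h1><p>Published on {{dateDiffusion}}</p>"
print(pystache.render(template, {"titre": "Some document", "dateDiffusion": "2019-01-01"}))
```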
We are switching to a React app that hits the API instead, for a more app-like site.
## Local Setup
cd client
npm install
## Run local server
npm start
To dump the production database and restore it locally:
mongodump --uri=mongodb://USER:PASSWORD@68.183.79.212/insee_brut
mongorestore --drop -d insee_brut dump/insee_brut
Find the familles with the most Series in them:
db.insee_items.aggregate([
{$match:{_scrapy_item_class:"Serie"}},
{$group: { _id: "$familleBdm.id", count: {$sum:1}}},
{$sort: {count: -1}}
])
Investigate the sorting order key:
db.insee_items.find({"_scrapy_item_class": "RootDocument"}, {_id:0, famille: 0, contenu_html: 0, custom: 0, themes:0, etat:0}).limit(5).sort({dateDiffusion: -1}).pretty()
The INSEE search page uses an AJAX call to an unprotected JSON API that gives us a lot of details.
You can test it yourself with:
curl 'https://insee.fr/fr/solr/consultation?q=*:*' \
-H 'Accept: application/json, text/javascript, */*; q=0.01' \
-H 'Content-Type: application/json; charset=utf-8' \
--data '{"q":"*:*","start":0,"sortFields":[{"field":"dateDiffusion","order":"desc"}],"filters":[{"field":"rubrique","tag":"tagRubrique","values":["statistiques"]},{"field":"diffusion","values":[true]}],"rows":100,"facetsQuery":[]}' \
| jq '.documents[] .titre'
It's possible to add a filter on an ID: {"field":"id","values":[3648291]}
I'm using jq here to parse the results inline.
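The same call also works from a script. Here is a minimal Python sketch using requests, with the id filter added; the payload mirrors the curl call above:

```python
import requests

payload = {
    "q": "*:*",
    "start": 0,
    "rows": 10,
    "facetsQuery": [],
    "sortFields": [{"field": "dateDiffusion", "order": "desc"}],
    # The extra id filter described above.
    "filters": [{"field": "id", "values": [3648291]}],
}
resp = requests.post(
    "https://insee.fr/fr/solr/consultation?q=*:*",
    json=payload,
    headers={"Accept": "application/json, text/javascript, */*; q=0.01"},
)
for doc in resp.json()["documents"]:
    print(doc["titre"])
```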
There is a second API endpoint for 'series'-type data:
curl 'https://insee.fr/fr/statistiques/series/ajax/consultation' \
-H 'Accept-Language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3' --compressed \
-H 'content-type: application/json; charset=utf-8' \
--data '{"q":"*:*","start":0,"rows":10,"facetsField":[],"filters":[{"field":"bdm_idFamille","values":["103212792"]}],"sortFields":[{"field":"dateDiffusion","order":"desc"},{"field":"bdm_idbankSerie","order":"asc"}]}' \
| jq '.documents[] .idBank'
You can also perform tests with the small bash script in /scrapy-project/scripts/test_insee_api.sh.