Possible improvements

François Briatte edited this page Oct 29, 2015 · 1 revision

The code available through the parlnet repository has gone through several rounds of debugging and updates, but the scrapers still require some knowledge of R and some attention to country-specific particularities in order to run properly. The notes below suggest how the code might be improved, in big or in small ways.

IMPROVEMENTS

  1. Most calls to read.csv and write.csv might be handled through the readr package, in order to speed up input/output during code execution. The current version of readr, however, can garble dates, and row names (which are used during network construction) are explicitly forbidden by the data_frame class.
  2. Most calls to the XML package might be handled through (or rewritten for) the rvest package, in order to make the code easier to read. In my experience, this is possible but quite tricky to do.
  3. The networks could be first built as bill-sponsor bipartite graphs, and then collapsed to one-mode cosponsorship networks. This would require using sparse matrices and only slightly different visualisation code. The bills, however, have very few attributes of their own, because very few chambers provide legislative keywords and/or outcomes.
  4. The code could be greatly accelerated by switching to Python or Ruby scrapers and SQL databases, although that would obviously require starting again from scratch. Similarly, adjacency matrices are probably faster than edge lists as network constructors, but are less practical for inspection and debugging purposes.
  5. Some aspects of the code could be further standardised, in particular: the organisation of the raw data folders, links to constituencies, sponsor profiles and photos, and network attributes that describe the country, chamber and legislature. These are all neatly organised in the current code, but not perfectly standardised.
  6. Additional official open data portals could be put to use to retrieve bill or sponsor details. These portals are already used in the code for the French upper chamber, Norway, Sweden and Switzerland, but similar services for Austria, the French lower chamber, the Italian lower chamber and the Italian upper chamber are not.
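
The bipartite approach in item 3 above can be illustrated with a minimal sketch. It is written in standard-library Python rather than the repository's R, the bill identifiers and sponsor names are invented, and the tie weight used here (the number of bills that two sponsors share) is an assumption that may differ from how the parlnet networks are actually weighted. A dictionary keyed by sponsor pair stands in for a sparse matrix, since only non-zero cells are stored.

```python
from collections import Counter
from itertools import combinations


def project_cosponsorship(bill_sponsors):
    """Collapse a bill-sponsor bipartite graph into a one-mode weighted edge list.

    bill_sponsors maps each bill identifier to the list of its sponsors.
    The result maps each unordered sponsor pair to the number of shared bills.
    """
    weights = Counter()
    for sponsors in bill_sponsors.values():
        # Deduplicate and sort so that each undirected pair has a single key.
        for a, b in combinations(sorted(set(sponsors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical bipartite data: three bills, three sponsors.
bills = {
    "bill-1": ["A", "B", "C"],
    "bill-2": ["A", "B"],
    "bill-3": ["C"],  # a single-sponsor bill creates no tie
}

edges = project_cosponsorship(bills)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Keeping the bipartite structure as the starting point means that bill-level attributes, where chambers provide them, could later be attached to the bills without rebuilding the sponsor network.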

LIMITATIONS

  1. There is no self-updating mechanism: the data have to be refreshed manually, because automating the updates would probably require recoding all scrapers in a language supported by scraping platforms like Morph or ScraperWiki. Data collection would only improve for ongoing legislatures, and website redesigns would still require manual updates.
  2. Some repositories rely on manual inputs during data collection. The code for Finland requires editing a URL parameter (which does not even work at the moment, since the entire website has been redesigned), and the code for Hungary requires downloading a few bill indexes by hand.
  3. Network errors in the download loops make it necessary to rerun some of the scripts: rerunning the data.r scripts two or three times is therefore highly recommended. Some (but not all) scripts contain exception lists to skip over the few errors that might have occurred, but some errors are permanent HTTP 404s and cannot be resolved.
  4. Some variables are based on manual or semi-manual imputations: the sex variable is often based on imputation from first names, family names, or both; and the party variable is often based on manual recodings or on the "longest affiliation throughout legislature" rule. These limitations are fully documented in the README files of each repository.
  5. Some variables have many missing values: this issue affects the born variable, which occasionally has high missing counts in upper houses and is completely missing in Hungary, and the constituency variable, which is occasionally missing in Austria and often missing in pre-redistricted Sweden.
  6. Some variables are imperfectly standardised across countries: the committee variable varies considerably because of differing parliamentary practices in committee formation (some networks have many committee co-memberships, others have almost none), and the nyears variable is not always a perfectly continuous measure.
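
The rerun advice in item 3 above (repeat the download, skip known failures, give up on permanent 404s) can be sketched as a small loop. This is an illustration in Python rather than the repository's R: `download_all`, `NotFound`, the `skip` argument and the URLs are all hypothetical, and the fetch function is a stand-in that simulates failures so that the example runs without network access.

```python
from collections import Counter


class NotFound(Exception):
    """Stand-in for a permanent HTTP 404 that retrying cannot fix."""


def download_all(urls, fetch, max_passes=3, skip=()):
    """Fetch every URL, making up to max_passes passes over the failures.

    Returns (pages, failed): pages maps URL to content, failed lists the
    URLs that are still missing after the last pass.
    """
    pages = {}
    permanent = []
    # The exception list: URLs known to be broken are never attempted.
    pending = [u for u in urls if u not in skip]
    for _ in range(max_passes):
        still_failing = []
        for url in pending:
            try:
                pages[url] = fetch(url)
            except NotFound:
                permanent.append(url)      # do not retry permanent errors
            except OSError:
                still_failing.append(url)  # transient: retry on next pass
        pending = still_failing
        if not pending:
            break
    return pages, pending + permanent


attempts = Counter()


def flaky_fetch(url):
    """Simulated fetcher: one URL fails once, another is always missing."""
    attempts[url] += 1
    if url.endswith("/missing"):
        raise NotFound(url)
    if url.endswith("/flaky") and attempts[url] < 2:
        raise ConnectionError("simulated network error")
    return "<html>" + url + "</html>"


urls = [
    "http://example.org/ok",
    "http://example.org/flaky",
    "http://example.org/missing",
    "http://example.org/known-bad",
]
pages, failed = download_all(urls, flaky_fetch, skip={"http://example.org/known-bad"})
print(sorted(pages), failed)
```

Separating transient from permanent failures keeps the retry passes short, and the returned list of failures is a natural starting point for an exception list.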

All improvements and limitations that do not require switching to a different programming language are under consideration for future releases of the repository, as is further integration with the data provided by Every Politician, ParlGov and Wikidata.
