Possible improvements
François Briatte edited this page Oct 29, 2015 · 1 revision
The code available through the parlnet repository has gone through several rounds of debugging and updates, but the scrapers still require some knowledge of R and some attention to country-specific particularities in order to run properly. The notes below suggest how the code might be improved, in big or in small ways.
- Most calls to `read.csv` and `write.csv` might be handled through the `readr` package, in order to speed up input/output during code execution. The current version of `readr`, however, can garble dates, and row names (which are used during network construction) are explicitly forbidden by the `data_frame` class.
- Most calls to the `XML` package might be handled through (or rewritten for) the `rvest` package, in order to make the code easier to read. In my experience, this is possible but highly tricky to do.
- The networks could be first built as bill-sponsor bipartite graphs, and then collapsed to one-mode cosponsorship networks. This would require using sparse matrices and only slightly different visualisation code. The bills, however, have very few attributes of their own, because very few chambers provide legislative keywords and/or outcomes.
- The code could be greatly accelerated by switching to Python or Ruby scrapers and SQL databases, although that would obviously require starting again from scratch. Similarly, adjacency matrices are probably faster than edge lists as network constructors, but are less practical for inspection and debugging purposes.
- Some aspects of the code could be further standardised, in particular: the organisation of the `raw` data folders, links to constituencies, sponsor profiles and photos, and network attributes that describe the country, chamber and legislature. These are all neatly organised in the current code, but not perfectly standardised.
- Additional official open data portals could be put to use to retrieve bill or sponsor details. These portals are already used in the code for the French upper chamber, Norway, Sweden and Switzerland, but similar services for Austria, the French lower chamber, the Italian lower chamber and the Italian upper chamber are not.
- There is no self-updating mechanism: the data have to be refreshed manually, because self-updating the code would probably require recoding all scrapers in a language supported by scraping platforms like Morph or ScraperWiki. Data collection would only improve for ongoing legislatures, and website redesigns would still require manual updates.
- Some repositories rely on manual inputs during data collection. The code for Finland requires editing a URL parameter (which does not even work at the moment, since the entire website has been redesigned), and the code for Hungary requires downloading a few bill indexes by hand.
- Network errors in the download loops make it necessary to rerun some of the scripts: rerunning the `data.r` scripts two or three times is therefore highly recommended. Some (but not all) scripts contain exception lists to skip over the small number of errors that might have occurred, but some errors are permanent HTTP 404s and cannot be solved.
- Some variables are based on manual or semi-manual imputations: the `sex` variable is often based on imputation from first names, family names, or both; and the `party` variable is often based on manual recodings or on the "longest affiliation throughout legislature" rule. These limitations are fully documented in the `README` files of each repository.
- Some variables have many missing values: this issue affects the `born` variable, which occasionally has high missing counts in upper houses and is completely missing in Hungary, and the `constituency` variable, which is occasionally missing in Austria and often missing in pre-redistricted Sweden.
- Some variables are imperfectly standardised across countries: the `committee` variable varies considerably because of differing parliamentary practices in committee formation (some networks have many committee co-memberships, others have almost none), and the `nyears` variable is not always a perfectly continuous measure.
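The bipartite approach mentioned in the list above can be sketched in a few lines of R. This is only an illustration, not the code used in the repository: the `sponsorships` data frame and its toy values are hypothetical, and the projection shown here is a plain undirected count of shared bills, which may differ from the edge weights that the existing scripts compute. It relies on the `Matrix` package, which ships with standard R installations.

```r
library(Matrix)

# Hypothetical toy data: one row per (bill, sponsor) pair.
sponsorships <- data.frame(
  bill    = c("B1", "B1", "B1", "B2", "B2", "B3"),
  sponsor = c("A",  "B",  "C",  "A",  "B",  "C"),
  stringsAsFactors = FALSE
)

bills    <- unique(sponsorships$bill)
sponsors <- unique(sponsorships$sponsor)

# Sparse bill-by-sponsor incidence matrix (the bipartite graph).
incidence <- sparseMatrix(
  i = match(sponsorships$bill, bills),
  j = match(sponsorships$sponsor, sponsors),
  x = 1,
  dims = c(length(bills), length(sponsors)),
  dimnames = list(bills, sponsors)
)

# One-mode projection: entry [a, b] counts the bills
# that sponsors a and b have both signed.
cosponsorship <- crossprod(incidence)

# A sponsor's tie to himself or herself is not a cosponsorship tie.
diag(cosponsorship) <- 0
```

With these toy data, sponsors A and B share two bills (B1 and B2), so `cosponsorship["A", "B"]` equals 2, while the A-C and B-C ties both equal 1. The incidence matrix keeps the bills as rows, so any bill-level attributes (keywords, outcomes) could be attached to its row names where they exist.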
All improvements and limitations that do not require switching to a different programming language are under consideration for future releases of the repository, as is further integration with the data provided by Every Politician, ParlGov and Wikidata.
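As an illustration of the download-error limitation noted in the list above, transient network errors could be absorbed inside the download loop itself rather than by rerunning the scripts. The sketch below is a minimal base R example under stated assumptions: `fetch_with_retries` and `flaky_fetch` are hypothetical names that do not exist in the repository, and the stand-in downloader only simulates a flaky connection.

```r
# Hypothetical helper: call fetch(url) up to max_tries times.
# Returns the first successful result, or NULL if every attempt fails.
# Note: a fetch function that legitimately returns NULL is treated as a failure.
fetch_with_retries <- function(url, fetch, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    # Pause between attempts, but not after the last one.
    if (attempt < max_tries) Sys.sleep(wait)
  }
  NULL
}

# Stand-in downloader that fails twice before succeeding,
# to mimic a temporary network error.
attempts <- 0
flaky_fetch <- function(url) {
  attempts <<- attempts + 1
  if (attempts < 3) stop("simulated network error")
  paste("content of", url)
}

page <- fetch_with_retries("http://example.org/bill/1", flaky_fetch)

# A permanent failure (such as an HTTP 404) exhausts all attempts
# and yields NULL, so the URL can be logged and skipped.
missing <- fetch_with_retries(
  "http://example.org/bill/404",
  function(url) stop("simulated HTTP 404"),
  max_tries = 2
)
```

In a real script, the `fetch` argument would wrap whatever call actually retrieves the page, and URLs that come back as `NULL` could be appended to an exception list so that later runs skip them. Retrying does not help with permanent errors, which still need to be handled by hand.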