Scrapes the Concept2 Logbook site (log.concept2.com) and saves the information in JSON format. Uses multi-threading to spead up the scraping process (~ linear with the number of threads but please be kind to the server!)
Currently only captures data from the rankings (https://log.concept2.com/rankings). It does not record athlete logbook entries that are not ranked.
Some analysis and cleaning tools are included. (TODO)
The default configuration options should be suitable for most systems. See Configuration for available options.
If using the default configuration, you will need to create an "output" and a "cache" folder in the project root directory.
To run the scraper, run C2Scrape.py using python.
The scraper uses multi-threading, provided by the threading module, to speed up the scraping process. The scraper is rate limited by the speed at which it recieved its URL requests.
Multi-threading is optimised for scraping athlete profiles and extended workout profiles. The master process visits each ranking table page in turn, and adds each athlete profile and extended workout profile to a work queue. The worker threads then visit these URLs in turn and scrape the data.
As this scraper is not I/O or processer limited, you can have more threads than CPU cores and still experience a speed up.
The more worker threads present, the more load is put on the Concept2 Logbook server, please consider this when setting the number of threads. This developer's experience is that a URL request to the logbook takes about 1 second.
If scraping for athlete profile and extended workout profiles is disabled, the scraper will not be spead up with additional threads.
The scraper produces up to 5 output files. They are all in JSON format.
Contains all the information from the ranking board page. indexed by workout ID.
Contains all the information found from the additional workout page (found by clicking the athlete name on a ranking board). Indexed by workout ID. Notable additional data: date, time, duration.
Contains the information from an athlets profile on the logbook.
In the same format is the extended profile file. Contains a record of every extended workout profile visited by the scraper during any run.
In the same format is the athlete profile file. Contains a record of every athlete profile visited by the scraper during any run.
The scraper uses caching to minimize the number of URL requests to speed up the scraping process. With caching enabled the scraper will not make additional URL requests for athlete profiles or extended workout profiles that have been saved in the cache files. This is particularly effective when scraping multiple ranking boards and scraping athlete profile data as athletes are present on multiple boards.
TODO
The congiguration is a JSON file called "C2config.json", and should be in the root directory of the project. An example configuration is included in the repo. All are taken as strings unless otherwise specified.
The following options can be configured:
Integer. Specify the maximum number of ranking tables the program should visit. Useful for verifying the program will run to completion on your system before begining a long run. Normally can be left blank ("").
Boolean. Set to true to utilize cached files. This can significantly increase speed. See [caching][#caching]. If set to false, the scraper will neither read nor write cache files.
Integer. Specifify the maximum number of worker threads to use. See Multi-Threading. Set this to 0 if both (get_profile_data)[#get_profile_data] and (get_extended_workout_data)[#get_extended_workout_data] are set to false.
workouts_file/athletes_file/extended_file/athletes_cache_file/extended_cache_file.
Specify the path of the output files. See Output.
The root URL of the an athlete profile on the Concept2 logbook. This will usually not need changed.
The URL of the Concept2 logbook login page. This will usually not need changed.
Boolean. Set to true to login to the website before scraping. Many athlete profiles are only visible to logged in users.
Username for logging in to the Concept2 logbook.
Password for logging in to the Concept2 logbook.
Integer (seconds). The length of time between writing output files. The main process must suspend threads while writing output which takes a few seconds as the files grow in size. The main process checks the elapsed time between writing output files at the end of processing each ranking page. Recomended setting is 60 seconds or greater.
Scrape data from athlete profiles that the program finds while visiting the ranking tables. This requires an extra URL request per profile but this can be mitigated across multiple ranking tables by using [caching](#caching]. See Output.
Scrape additional data about the workout by visiting the workout link on the ranking page. This requires an extra URL request per workout. See Output.
These settings determine which events, types of machine, weight class, gender and adaptive categories will be visited. The program will visit all combinations of ranking tables for the given options. Can be different for each type of machine (row, bike and ski erg). The default congiguration contains every option.
List of strings. Exactly 4 parameters are required for each machine but they can be empty. For example, "rower" and "weight" are only used for the row erg.
List of integers. At least one should be present.
The root URL of the Concept2 Logbook rankings.
List of integers. The year(s) of ranking boards that will be scraped.