This is a ProCyclingStats (PCS) data scraper. It fetches and parses HTML pages, builds model entities from them, and serializes the result for export.
ℹ️ pcs-scraper currently supports scraping teams, riders and races (including results).
The only requirement to run this application is Java 11.
Once Java is installed, build the app using the included Gradle wrapper:
    ./gradlew build
This will place a runnable Java jar under the build/libs directory.
The app can be executed from the command line:
    java -jar scraper/build/libs/scraper.jar -h
Note that a value for the --season option must always be provided on the command line.
    Usage: pcs-scraper options_list
    Options:
        --season, -s -> Season (always required) { Int }
        --cachePath, -c -> Cache path { String }
        --destination, -d -> Destination path (always required) { String }
        --format, -f -> Output file format (always required) { Value should be one of [firebase, json, protobuf, sqlite] }
        --skipCache, -sc [false] -> Skip cache
        --scrapTimeout, -st [20m] -> Scrap timeout { String }
        --retryDelay, -rd [1s] -> Retry delay { String }
        --help, -h -> Usage info
As the usage output shows, there are a few arguments that can be passed in (see the full example after the list below):
- season: Season year to scrape.
- cachePath: Directory to be used as a cache for HTML documents (to avoid fetching from PCS every time).
- destination: Destination path of the output content.
- format: Format of the output file (firebase, json, protobuf or sqlite).
- skipCache: Ignore the cache and force remote fetching.
- scrapTimeout: Timeout before the scraping is stopped (ISO-8601 format or a value as returned by Duration.toString).
- retryDelay: Time to wait between document fetching retry attempts (ISO-8601 format or a value as returned by Duration.toString).
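For instance, a run that scrapes the 2022 season and writes JSON output could look like the sketch below. The option names come from the usage above; the season, cache path, and destination values are only placeholders, and the timeout/retry values use the ISO-8601 duration format mentioned in the list.

    # Example invocation (placeholder values, adjust season and paths to your setup)
    java -jar scraper/build/libs/scraper.jar \
      --season 2022 \
      --cachePath .pcs-cache \
      --destination output \
      --format json \
      --scrapTimeout PT30M \
      --retryDelay PT5S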