Generating tests for your scrapers
journal-scrapers comes with a Ruby script to generate regression tests for scraper definitions.
To generate tests, you'll need:
- a scraper definition in ScraperJSON format
- a list of 5-10 URLs for which the scraper should work
The test generator script is in scripts/make_tests.rb.
To run the test generator you'll need Ruby, rubygems, and the trollop gem installed.
To install Ruby and rubygems use RVM:
\curl -sSL https://get.rvm.io | bash -s stable --ruby=2.1.2
To install trollop use rubygems:
gem install trollop
You also need to have the quickscrape scraper installed. See the quickscrape documentation for instructions.
Place all your URLs (you need between 5 and 10) in a file, one per line, e.g.:
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001874
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001882
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004441
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004433
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0098781
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0099348
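Before handing the file to the generator, it may be worth checking that it really contains between 5 and 10 well-formed URLs, one per line. The following is a minimal sketch of such a check; `check_url_list` is a hypothetical helper written for this page, not something shipped with journal-scrapers.

```ruby
require 'uri'

# Hypothetical helper (not part of journal-scrapers).
# Takes the text of a URL list and returns an array of problem
# descriptions; an empty array means the list looks usable.
def check_url_list(text)
  # One URL per line; ignore blank lines and surrounding whitespace.
  urls = text.lines.map(&:strip).reject(&:empty?)
  problems = []

  unless (5..10).cover?(urls.size)
    problems << "expected 5-10 URLs, found #{urls.size}"
  end

  urls.each do |url|
    begin
      uri = URI.parse(url)
      # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
      problems << "not an http(s) URL: #{url}" unless uri.is_a?(URI::HTTP)
    rescue URI::InvalidURIError
      problems << "cannot parse: #{url}"
    end
  end

  problems
end
```

For example, `puts check_url_list(File.read('urls.txt'))` prints one line per problem and nothing at all when the list passes.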
Next, run the test-generator script. To see its options, print the script's help:
$ scripts/make_tests.rb --help
make_tests.rb - ScraperJSON test generator script
by Richard Smith-Unna <rds45@cam.ac.uk> for ContentMine.org
This script generates a test file from a ScraperJSON scraper definition
and a list of URLs the scraper applies to.
The test files record what the scraper extracts from each URL so that tests can detect when the scrapers break.
Example use:
make_tests.rb --definition scraper.json --urls urls.txt
Options:
--definition, -d <s>: Path to ScraperJSON definition
--urls, -u <s>: File containing a list of 5-10 URLs to test
--help, -h: Show this message
Now just run the test-generator script, passing it the path to your scraper definition and the file containing the URLs:
scripts/make_tests.rb --definition scrapers/somejournal.json --urls tests/somejournal_test_urls.txt
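If you generate tests for several scrapers, you might prefer to wrap that command in a small Ruby helper that fails early when an input is missing or the definition is not valid JSON. This is only a sketch: `make_tests_command` is a hypothetical name, and the only options it relies on are the `--definition` and `--urls` flags shown in the help text above.

```ruby
require 'json'

# Hypothetical wrapper (not part of journal-scrapers).
# Checks both input files, then returns the argument list for
# scripts/make_tests.rb using the options documented in its help.
def make_tests_command(definition_path, urls_path)
  [definition_path, urls_path].each do |path|
    raise ArgumentError, "no such file: #{path}" unless File.file?(path)
  end

  # Raises JSON::ParserError if the definition is not well-formed JSON.
  JSON.parse(File.read(definition_path))

  ['scripts/make_tests.rb', '--definition', definition_path, '--urls', urls_path]
end
```

Passing the result to `system(*make_tests_command(definition, urls))` runs the generator without going through a shell, so paths containing spaces need no extra quoting.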
The test-generator will now run quickscrape for each test URL using your scraper definition. It will store the results in a test format in the test directory. In the case of the example above, the new file will be called test/somejournal.json.
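To illustrate how recorded results let a test detect a broken scraper, here is a rough sketch of the comparison step. It is not the project's actual test runner, and the data layout is an assumption: each URL is taken to map to a hash of extracted field names and values, which may not match the real test file format.

```ruby
require 'json'

# Hypothetical comparison (not the project's test runner).
# `expected` and `actual` are assumed to map each URL to a hash of
# field name => extracted value. Returns a hash of URL => list of
# fields whose freshly scraped value differs from the recorded one.
def find_regressions(expected, actual)
  expected.each_with_object({}) do |(url, fields), report|
    got = actual.fetch(url, {})
    changed = fields.reject { |name, value| got[name] == value }.keys
    report[url] = changed unless changed.empty?
  end
end

# Illustrative data only: the recorded run found a DOI, the fresh run did not.
recorded = JSON.parse('{"http://example.org/a": {"title": "A", "doi": "10.1/a"}}')
fresh    = JSON.parse('{"http://example.org/a": {"title": "A", "doi": null}}')
p find_regressions(recorded, fresh)
```

An empty result means every recorded field was extracted again with the same value; any entry points at the URL and fields where the scraper's output has changed.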
Once the test file has been generated, you're ready to make a pull request with your contributed scraper.