Scrape HTML tables from a Wikipedia page into CSV format.
wikitablescrape
can be used as a shell command or imported as a Python package.
This tool makes it easy to download any Wikipedia table via CLI in a format ready for text processing.
This is especially useful when combined with a tool like xsv
.
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes' --header='costliest' | xsv select "Season" | xsv stats --median | xsv select field,min,max,median,mean,stddev | xsv table
field min max median mean stddev
Season 1965 2018 2002 1999.1228070175441 12.900523823770502
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_best-selling_music_artists' --header='100 million' | xsv select 'Country / Market' | xsv frequency | xsv table
field value count
Country / Market United States 26
Country / Market United Kingdom 10
Country / Market United Kingdom United States 1
Country / Market Australia 1
Country / Market Spain 1
Country / Market Japan 1
You can download the package from PyPI or build from source using Python 3.
python3 -m pip install wikitablescrape
wikitablescrape --help
python3 -m venv venv
. venv/bin/activate
pip install wikitablescrape
wikitablescrape --help
git clone https://github.com/rocheio/wiki-table-scrape
cd ./wiki-table-scrape
python3 -m venv venv
. venv/bin/activate
python setup.py install
wikitablescrape --help
wikitablescrape --url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films" --header="films by year" | tee >(head -1) >(tail -5) >/dev/null
"Year","Title","Worldwide gross","Budget","Reference(s)"
"2015","Star Wars: The Force Awakens","$2,068,223,624","$245,000,000",""
"2016","Captain America: Civil War","$1,153,304,495","$250,000,000",""
"2017","Star Wars: The Last Jedi","$1,332,539,889","$200,000,000",""
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""
wikitablescrape --url="https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list" --output-folder="/tmp/scrape"
Parsing all tables from 'https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list' into '/tmp/scrape'
Writing table 1 to /tmp/scrape/table_1_top_100_list.csv
Writing table 2 to /tmp/scrape/table_2_countries.csv
Writing table 3 to /tmp/scrape/table_3_cities.csv
Writing table 4 to /tmp/scrape/table_4_buildings_&_structures_&_statues.csv
Writing table 5 to /tmp/scrape/table_5_people.csv
Writing table 6 to /tmp/scrape/table_6_people_singers.csv
Writing table 7 to /tmp/scrape/table_7_people_actors.csv
Writing table 8 to /tmp/scrape/table_8_people_romantic_actors.csv
Writing table 9 to /tmp/scrape/table_9_people_athletes.csv
Writing table 10 to /tmp/scrape/table_10_people_modern_political_leaders.csv
Writing table 11 to /tmp/scrape/table_11_people_pre_modern_people.csv
Writing table 12 to /tmp/scrape/table_12_people_3rd_millennium_people.csv
Writing table 13 to /tmp/scrape/table_13_progression_of_the_most_viewed_millennial_persons_on_wikipedia.csv
Writing table 14 to /tmp/scrape/table_14_music_bands_historical_most_viewed_3rd_millennium_persons.csv
Writing table 15 to /tmp/scrape/table_15_sport_teams_historical_most_viewed_3rd_millennium_persons.csv
Writing table 16 to /tmp/scrape/table_16_films_and_tv_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 17 to /tmp/scrape/table_17_albums_historical_most_viewed_3rd_millennium_persons.csv
Writing table 18 to /tmp/scrape/table_18_books_and_book_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 19 to /tmp/scrape/table_19_books_and_book_series_pre_modern_books_and_texts.csv
head -5 /tmp/scrape/table_3_cities.csv
"Rank","Page","Continent","Views in millions"
"1","New York City","North America","75"
"2","Singapore","Asia","63"
"3","London","Europe","61"
"4","Hong Kong","Asia","50"
./scripts/test.sh
# Show coverage data in a browser
coverage html && open htmlcov/index.html
If you would like to contribute to this module, please open an issue or pull request.
If you'd like to read more about this module, please check out my blog post from the initial release.