This project scrapes the TOP500 list, a ranking of the 500 most powerful commercially available computers in the world.
The goal is to extract the details of those computers from the HTML tables on the web site and store them in CSV format for later processing.
## The www.top500.org web site structure
The lists are published twice a year (in June and November), and the web site presents each edition split across 5 pages of 100 entries each.
Each page has one table with one row per entry. Each row provides these details about the supercomputer:
- Rank: the order of the supercomputer in that edition of the list
- Site: where the supercomputer is located
- System: name of the system
- Cores: number of compute cores
- Rmax: maximal LINPACK performance achieved (TFlop/s)
- Rpeak: theoretical peak performance (TFlop/s)
- Power: power consumption (kW)
Details about the collected information are available on the TOP500 description page.
The site has a very tolerant `robots.txt` file:

```
$ curl https://www.top500.org/robots.txt
User-agent: *
Allow: /
Host: www.top500.org
```
```
$ python3 scrape.py -h
usage: scrape.py [-h] [-y {1993,...,2017}] [-m {6,11}] [-z {1993,...,2017}]
                 [-n {6,11}] [-c COUNT] [-f] [outfile]

positional arguments:
  outfile               Output file (default: top500.csv)

optional arguments:
  -h, --help            show this help message and exit
  -y {1993,...,2017}, --year {1993,...,2017}
                        Collect from year (default: 2017)
  -m {6,11}, --month {6,11}
                        Collect from month (default: 11)
  -z {1993,...,2017}, --endyear {1993,...,2017}
                        Collect until year (default: 2017)
  -n {6,11}, --endmonth {6,11}
                        Collect until month (default: 11)
  -c COUNT, --count COUNT
                        Number of entries to get from each edition (default: 500)
  -f, --force           Force a partial count (default: False)
```
In summary: if invoked without arguments, `scrape.py` will download the whole list for the latest available edition (at the time of writing, this is November 2017; this is calculated at run time), i.e. 500 entries. You can specify a starting edition with `--year` and/or `--month`, and it will start from there. It will continue until the latest edition, or you can specify the last edition to collect with `--endyear` and/or `--endmonth`.
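For reference, here is a minimal sketch of how the latest available edition can be derived at run time (our reconstruction; `scrape.py` may compute it differently):

```python
from datetime import date

def latest_edition(today=None):
    """Return (year, month) of the most recent TOP500 edition."""
    today = today or date.today()
    if today.month >= 11:          # the November list is already out
        return today.year, 11
    if today.month >= 6:           # the June list is already out
        return today.year, 6
    return today.year - 1, 11      # before June: last year's November list
```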
By default it collects all 500 systems for each edition. You can change that with `--count` (this specifies the number of systems to collect per edition, not the total number of entries to collect, which wouldn't make much sense). Because the lists are hosted in pages of 100 entries each, by default `--count` only accepts multiples of 100 (100, 200, 300, …) to make the most of the downloaded content. You can force partial scraping of downloaded pages with `--force`: if you want to collect just 250 entries per list, specify `-c 250 -f`; it will still download 300, because that's how the pages are built, but will only scrape the first 250. See the example below.
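For example, collecting 250 entries from each of the two 2016 editions into a custom output file would look like this (the file name is just an example):

```
$ python3 scrape.py --year 2016 --month 6 --endyear 2016 --endmonth 11 -c 250 -f top500-2016.csv
```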
The scraper has the following dependencies:
- Python 3
- The Python Requests module is used to download the pages.
- The Beautiful Soup Python module is used to parse the pages.
The `requests` module has built-in proxy support. If needed, you can set the `HTTPS_PROXY` environment variable pointing to your proxy before running the scraper:

```
HTTPS_PROXY="http://my-proxy.example.com:3128"
```
Some notes about the website structure, to keep in mind during scraping.
Numbers are written in US format, with commas to separate thousands and dots for the decimal point.
Conversion is performed using the `locale` module, setting the locale to `en_US.UTF-8` and using the `atoi` and `atof` functions.
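A minimal sketch of that conversion, assuming the `en_US.UTF-8` locale is available on the system (the numbers are illustrative):

```python
import locale

# Numbers on the site use US formatting: commas for thousands,
# a dot for the decimal point.
locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')

cores = locale.atoi('3,120,000')    # -> 3120000
rmax = locale.atof('33,862.7')      # -> 33862.7
```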
Each edition of the list has a highlights list with the top 10 for that edition. That list has 6 columns, where the 2nd column contains both the system and the site mixed in.
The tables with the full list contain 7 columns: the 2nd column contains the site and the 3rd contains the system.
We scrape only the full lists, as the highlights/summary do not provide any additional information.
A system (3rd column in the list) looks like this in most cases:

```html
<td><a href="https://www.top500.org/system/SYSTEM_ID">SYSTEM_NAME, PROCESSOR, INTERCONNECT, COPROCESSORS</a><br/>MANUFACTURER</td>
```
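As an illustration, such a cell can be pulled apart with Beautiful Soup roughly like this (a sketch with our own variable names, not necessarily how scrape.py structures it):

```python
from bs4 import BeautifulSoup

html = ('<td><a href="https://www.top500.org/system/SYSTEM_ID">'
        'SYSTEM_NAME, PROCESSOR, INTERCONNECT, COPROCESSORS</a>'
        '<br/>MANUFACTURER</td>')

cell = BeautifulSoup(html, 'html.parser').td
system_url = cell.a['href']                  # link to the details page
description = cell.a.get_text(strip=True)    # the comma-separated fields
manufacturer = cell.contents[-1].strip()     # the text after the <br/>
```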
Of these fields, System Name and Co-processors can only be found here (i.e. in that column of the listings); they are not provided in the system's details page. Admittedly, the title of that page copies the whole description, including system name and processors, but the name and co-processors are not listed by themselves anywhere.
Interconnect and Processor are listed in the system's details page (the `href` in that column) and can be properly identified from that page. Therefore, an attempt has been made to extract a system's name and co-processor (stored as `gpu` in the output) from this column.
However:
- not all systems provide information about the processors and/or interconnect in the listing page. For example, Tera-100 - Bull bullx super-node S6010/S6030.
- some systems also include GPU information, like QB-2 - Dell C8220X Cluster, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K20x. GPU information is not detailed in the system details page.
- some systems have more than one type of coprocessor, e.g. Thunder - SGI ICE X, Xeon E5-2699v3/E5-2697 v3, Infiniband FDR, NVIDIA Tesla K40, Intel Xeon Phi 7120P. We bundle all of them together in the gpu field.
- some systems include GPU information at the beginning of their name, e.g. SANAM - Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000.
- some systems have the order of the fields reversed, like Inspur SA5212H5, Xeon E5-2682v4 16C 2.5GHz, NVIDIA Tesla P100, 25G Ethernet, which lists the interconnect after the GPU.
- some systems mention the co-processor but not the interconnect, e.g. Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050.
- some systems simply have no specific name; they are just a concatenation of, e.g., processor + interconnect, like Pentium4 Xeon 2.4 GHz, Myrinet.
As a result of all this, parsing the system entry from the listing to obtain the name and gpu reliably for every system is simply not possible with only the data available on the top500 web site.
Therefore, the code that handles this is “best effort”, and the GPU column of the output has “mixed results”. Some additional processing with the help of specialized data sources (e.g. a database of known co-processors) would improve that.
Lacking this, the approach taken was:
- Remove the known processor and interconnect details from the system’s description. NOTE: these are not exact matches, and unfortunately there’s quite a lot of variety here too.
- With the remaining content (if any), try to guess what is the name and what is the GPU. After some observation, the simplest approach seems to be to use the first remaining part (delimited by commas) as the name and the rest as the GPU; see the sketch after this list.
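A minimal sketch of that heuristic, assuming the processor and interconnect strings have already been identified (the real code uses fuzzy matching rather than the exact comparison shown here; the function name is ours):

```python
def split_name_and_gpu(description, processor, interconnect):
    """Best-effort split of a listing description into (name, gpu)."""
    parts = [p.strip() for p in description.split(',')]
    # Drop the parts matching the known processor/interconnect details.
    remaining = [p for p in parts if p not in (processor, interconnect)]
    if not remaining:
        return '', ''
    # The first remaining part is taken as the name, the rest as the GPU.
    return remaining[0], ', '.join(remaining[1:])

# e.g. split_name_and_gpu(
#     'QB-2, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K20x',
#     'Intel Xeon E5-2680v2 10C 2.8GHz', 'Infiniband FDR')
# -> ('QB-2', 'NVIDIA K20x')
```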
`SequenceMatcher` from `difflib` has been used to compare known details with system descriptions to try to extract the information. It’s challenging.
These are supposed to be equal (interconnect):
```
>>> from difflib import SequenceMatcher as SM
>>> SM(None, 'Custom', 'Custom Interconnect').ratio()
0.48
>>> SM(None, 'Gigabit Ethernet', 'GigEth').ratio()
0.5454545454545454
```
These should be different (system name vs processor):
```
>>> SM(None, 'R12000 300MHz', 'ORIGIN 2000 300 MHz').ratio()
0.75
>>> SM(None, 'PowerPC 604e 332MHz', 'SP PC604e 332 MHz').ratio()
0.7777777777777778
>>> SM(None, 'Cray X1E (1 GHz)', 'Cray X1E 2c 1GHz').ratio()
0.8125
```
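In other words, no single similarity threshold separates the two groups: the pairs that should match score lower than the pairs that should not. Any thresholded helper along these lines (a sketch, not the actual scrape.py code) will misclassify some of them:

```python
from difflib import SequenceMatcher

def looks_like(known, candidate, threshold=0.6):
    """Fuzzy test: does a description part match a known detail?"""
    return SequenceMatcher(None, known, candidate).ratio() >= threshold

looks_like('Custom', 'Custom Interconnect')         # False, should be True
looks_like('R12000 300MHz', 'ORIGIN 2000 300 MHz')  # True, should be False
```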
A site (2nd column in the list) looks like this:
```html
<td><a href="https://www.top500.org/site/SITE_ID">SITE_NAME</a><br>COUNTRY</td>
```
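This layout is regular enough to parse directly; a short sketch along the same lines as above:

```python
from bs4 import BeautifulSoup

html = ('<td><a href="https://www.top500.org/site/SITE_ID">'
        'SITE_NAME</a><br>COUNTRY</td>')

cell = BeautifulSoup(html, 'html.parser').td
site_name = cell.a.get_text(strip=True)   # SITE_NAME
country = cell.contents[-1].strip()       # COUNTRY, the text after the <br>
```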
- Several Power values are marked as “(Submitted)”, e.g. https://www.top500.org/system/177981.
- Some have “(Derived)”: https://www.top500.org/system/178751.
- Many have no value in this field.
Several fields can change on the same system between different editions of the list, including the system’s name, number of cores, Rmax, Rpeak, Power, …
Examples:
- https://www.top500.org/system/178192
- https://www.top500.org/system/177459
- https://www.top500.org/system/177449
The website is a bit inconsistent here: the system details page always shows the latest version of the content, regardless of its history.
The scraper keeps the values that the lists show, overriding the system details with the values from the list in each edition, to avoid this inconsistency as much as possible.
The attached dataset is the full collection of TOP500 lists, i.e. starting from the first available list from June 1993. It was collected with:
```
$ time python3 -u scrape.py --year 1993 --month 6 > scraping.log

real    57m7,175s
user    11m38,674s
sys     0m4,049s
```
While adding the dataset to the repo, a problem was detected:

```
fatal: LF reemplaçaria CRLF en data/top500.csv.
```

(That is Catalan for “LF would be replaced by CRLF in data/top500.csv”.)

It turns out that the way `argparse` opens the output file is not quite what `csv.writer` handles best, and the resulting output file used CRLF line endings. This was manually corrected in the file.
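Two details are at play here: the csv module's default dialect terminates rows with `\r\n`, and the csv documentation asks for files handed to `csv.writer` to be opened with `newline=''`. A hedged sketch of an opening that yields plain LF line endings (not the actual scrape.py code):

```python
import csv

# Open the file ourselves (instead of via argparse.FileType) so we can
# pass newline='' as the csv documentation recommends.
with open('top500.csv', 'w', newline='') as f:
    # The default dialect ends rows with '\r\n'; request plain LF instead.
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(['rank', 'site', 'system'])  # illustrative header row
```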
The web scraping code provided here is released under the GPL v3 license (see LICENSE).
The attached dataset is provided under the CC0 1.0 Public Domain dedication.