This project scrapes the TOP500 list, a ranking of the 500 most powerful commercially available computers in the world.
The goal is to extract the details of those computers from the HTML tables on the web site and store them in CSV format for later processing.
## The www.top500.org web site structure
The lists are published twice a year (in June and November), and the web site presents each edition split across 5 pages of 100 entries each.
Each page has one table with one row per entry. Each row provides these details about the supercomputer:
- Rank: the order of the supercomputer in that edition of the list
- Site: where the supercomputer is located
- System: name of the system
- Cores: number of compute cores
- Rmax: maximal LINPACK performance achieved (TFlop/s)
- Rpeak: theoretical peak performance (TFlop/s)
- Power: power consumption (kW)
Details about the collected information are available on the TOP500 description page.
The site has a very tolerant `robots.txt` file:

```
$ curl https://www.top500.org/robots.txt
User-agent: *
Allow: /
Host: www.top500.org
```
```
$ python3 scrape.py -h
usage: scrape.py [-h] [-y {1993,...,2017}] [-m {6,11}] [-z {1993,...,2017}]
                 [-n {6,11}] [-c COUNT] [-f] [outfile]

positional arguments:
  outfile               Output file (default: top500.csv)

optional arguments:
  -h, --help            show this help message and exit
  -y {1993,...,2017}, --year {1993,...,2017}
                        Collect from year (default: 2017)
  -m {6,11}, --month {6,11}
                        Collect from month (default: 11)
  -z {1993,...,2017}, --endyear {1993,...,2017}
                        Collect until year (default: 2017)
  -n {6,11}, --endmonth {6,11}
                        Collect until month (default: 11)
  -c COUNT, --count COUNT
                        Number of entries to get from each edition (default: 500)
  -f, --force           Force a partial count (default: False)
```
In summary: if invoked without arguments, `scrape.py` will download the whole list for the latest available edition (at the time of writing, this is November 2017; this is calculated at run time), i.e. 500 entries. You can specify a starting edition with `--year` and/or `--month`, and it will start from there. It will continue until the latest edition, or you can specify the last edition to collect with `--endyear` and/or `--endmonth`.
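For reference, here is a minimal sketch of how the latest available edition can be derived at run time (our reconstruction; `scrape.py` may compute it differently):

```python
from datetime import date

def latest_edition(today=None):
    """Return (year, month) of the most recent TOP500 edition."""
    today = today or date.today()
    if today.month >= 11:          # the November list is already out
        return today.year, 11
    if today.month >= 6:           # the June list is already out
        return today.year, 6
    return today.year - 1, 11      # before June: last year's November list
```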
By default it collects all 500 systems for each edition. You can change that with `--count` (this specifies the number of systems to collect per edition, not the total number of entries to collect, which wouldn't make much sense). Because the lists are hosted in pages of 100 entries each, by default `--count` only accepts multiples of 100 (100, 200, 300, …) to make the most of the downloaded content. You can force partial scraping of downloaded pages with `--force`: if you want to collect just 250 entries per list, specify `-c 250 -f`; it will still download 300, because that's how the pages are built, but will only scrape the first 250. See the example below.
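For example, collecting 250 entries from each of the two 2016 editions into a custom output file would look like this (the file name is just an example):

```
$ python3 scrape.py --year 2016 --month 6 --endyear 2016 --endmonth 11 -c 250 -f top500-2016.csv
```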
The scraper has the following dependencies:
- Python 3
- The Python Requests module is used to download the pages.
- The Beautiful Soup Python module is used to parse the pages.
The `requests` module has built-in proxy support. If needed, you can set the `HTTPS_PROXY` environment variable pointing to your proxy before running the scraper:

```
HTTPS_PROXY="http://my-proxy.example.com:3128"
```
Some notes about the website structure, to keep in mind during scraping.
Numbers are written in US format, with commas to separate thousands and dots for the decimal point.
Conversion is performed using the `locale` module, setting the locale to `en_US.UTF-8` and using the `atoi` and `atof` functions.
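A minimal sketch of that conversion, assuming the `en_US.UTF-8` locale is available on the system (the numbers are illustrative):

```python
import locale

# Numbers on the site use US formatting: commas for thousands,
# a dot for the decimal point.
locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')

cores = locale.atoi('3,120,000')    # -> 3120000
rmax = locale.atof('33,862.7')      # -> 33862.7
```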
Each edition of the list has a highlights list with the top 10 for that edition. That list has 6 columns, where the 2nd column contains both the system and the site mixed in.
The tables with the full list contain 7 columns: the 2nd column contains the site and the 3rd contains the system.
We scrape only the full lists, as the highlights/summary do not provide any additional information.
A system (3rd column in the list) looks like this in most cases:

```html
<td><a href="https://www.top500.org/system/SYSTEM_ID">SYSTEM_NAME, PROCESSOR, INTERCONNECT, COPROCESSORS</a><br/>MANUFACTURER</td>
```
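As an illustration, such a cell can be pulled apart with Beautiful Soup roughly like this (a sketch with our own variable names, not necessarily how scrape.py structures it):

```python
from bs4 import BeautifulSoup

html = ('<td><a href="https://www.top500.org/system/SYSTEM_ID">'
        'SYSTEM_NAME, PROCESSOR, INTERCONNECT, COPROCESSORS</a>'
        '<br/>MANUFACTURER</td>')

cell = BeautifulSoup(html, 'html.parser').td
system_url = cell.a['href']                  # link to the details page
description = cell.a.get_text(strip=True)    # the comma-separated fields
manufacturer = cell.contents[-1].strip()     # the text after the <br/>
```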
Of these fields, System Name and Co-processors can only be found here (i.e. in that column of the listings); they are not provided in the system's details page. Admittedly, the title of that page copies the whole description, including system name and processors, but the name and co-processors are not listed by themselves anywhere.
Interconnect and Processor are listed in the system's details page (the `href` in that column) and can be properly identified from that page. Therefore, an attempt has been made to extract a system's name and co-processor (stored as `gpu` in the output) from this column.
However:
- not all systems provide information about the processors and/or interconnect in the listing page. For example, Tera-100 - Bull bullx super-node S6010/S6030.
- some systems also include GPU information, like QB-2 - Dell C8220X Cluster, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K20x. GPU information is not detailed in the system details page.
- some systems have more than one type of coprocessor, e.g. Thunder - SGI ICE X, Xeon E5-2699v3/E5-2697 v3, Infiniband FDR, NVIDIA Tesla K40, Intel Xeon Phi 7120P. We bundle all of them together in the gpu field.
- some systems include GPU information at the beginning of their name, e.g. SANAM - Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000.
- some systems have the order of the fields reversed, like Inspur SA5212H5, Xeon E5-2682v4 16C 2.5GHz, NVIDIA Tesla P100, 25G Ethernet, which lists the interconnect after the GPU.
- some systems mention the co-processor but not the interconnect, e.g. Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050.
- some systems simply have no specific name; they are just a concatenation of, e.g., processor + interconnect, like Pentium4 Xeon 2.4 GHz, Myrinet.
As a result of all this, parsing the system entry from the listing to obtain the name and gpu reliably for every system is simply not possible with only the data available on the top500 web site.
Therefore, the code that handles this is “best effort”, and the GPU column of the output has “mixed results”. Some additional processing with the help of specialized data sources (e.g. a database of known co-processors) would improve that.
Lacking this, the approach taken was:
- Remove the known processor and interconnect details from the system’s description. NOTE: these are not exact matches, and unfortunately there’s quite a lot of variety here too.
- With the remaining content (if any), try to guess what is the name and what is the GPU. After some observation, the simplest approach seems to be to use the first remaining part (delimited by commas) as the name and the rest as the GPU; see the sketch after this list.
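A minimal sketch of that heuristic, assuming the processor and interconnect strings have already been identified (the real code uses fuzzy matching rather than the exact comparison shown here; the function name is ours):

```python
def split_name_and_gpu(description, processor, interconnect):
    """Best-effort split of a listing description into (name, gpu)."""
    parts = [p.strip() for p in description.split(',')]
    # Drop the parts matching the known processor/interconnect details.
    remaining = [p for p in parts if p not in (processor, interconnect)]
    if not remaining:
        return '', ''
    # The first remaining part is taken as the name, the rest as the GPU.
    return remaining[0], ', '.join(remaining[1:])

# e.g. split_name_and_gpu(
#     'QB-2, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K20x',
#     'Intel Xeon E5-2680v2 10C 2.8GHz', 'Infiniband FDR')
# -> ('QB-2', 'NVIDIA K20x')
```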
`SequenceMatcher` from `difflib` has been used to compare known details with system descriptions to try to extract the information. It’s challenging.
These are supposed to be equal (interconnect):
```
>>> from difflib import SequenceMatcher as SM
>>> SM(None, 'Custom', 'Custom Interconnect').ratio()
0.48
>>> SM(None, 'Gigabit Ethernet', 'GigEth').ratio()
0.5454545454545454
```
These should be different (system name vs processor):
```
>>> SM(None, 'R12000 300MHz', 'ORIGIN 2000 300 MHz').ratio()
0.75
>>> SM(None, 'PowerPC 604e 332MHz', 'SP PC604e 332 MHz').ratio()
0.7777777777777778
>>> SM(None, 'Cray X1E (1 GHz)', 'Cray X1E 2c 1GHz').ratio()
0.8125
```
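In other words, no single similarity threshold separates the two groups: the pairs that should match score lower than the pairs that should not. Any thresholded helper along these lines (a sketch, not the actual scrape.py code) will misclassify some of them:

```python
from difflib import SequenceMatcher

def looks_like(known, candidate, threshold=0.6):
    """Fuzzy test: does a description part match a known detail?"""
    return SequenceMatcher(None, known, candidate).ratio() >= threshold

looks_like('Custom', 'Custom Interconnect')         # False, should be True
looks_like('R12000 300MHz', 'ORIGIN 2000 300 MHz')  # True, should be False
```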
A site (2nd column in the list) looks like this:
```html
<td><a href="https://www.top500.org/site/SITE_ID">SITE_NAME</a><br>COUNTRY</td>
```
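This layout is regular enough to parse directly; a short sketch along the same lines as above:

```python
from bs4 import BeautifulSoup

html = ('<td><a href="https://www.top500.org/site/SITE_ID">'
        'SITE_NAME</a><br>COUNTRY</td>')

cell = BeautifulSoup(html, 'html.parser').td
site_name = cell.a.get_text(strip=True)   # SITE_NAME
country = cell.contents[-1].strip()       # COUNTRY, the text after the <br>
```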
- Several Power values are marked as “(Submitted)”, e.g. https://www.top500.org/system/177981.
- Some have “(Derived)”: https://www.top500.org/system/178751.
- Many have no value in this field.
Several fields can change on the same system between different editions of the list, including the system’s name, number of cores, Rmax, Rpeak, Power, …
Examples:
- https://www.top500.org/system/178192
- https://www.top500.org/system/177459
- https://www.top500.org/system/177449
The website is a bit inconsistent here: the system details page always shows the latest version of the content, regardless of its history.
The scraper keeps the values that the lists show, overriding the system details with the values from the list in each edition, to avoid this inconsistency as much as possible.
The attached dataset is the full collection of TOP500 lists, i.e. starting from the first available list from June 1993. It was collected with:
```
$ time python3 -u scrape.py --year 1993 --month 6 > scraping.log

real    57m7,175s
user    11m38,674s
sys     0m4,049s
```
While adding the dataset to the repo, a problem was detected:

```
fatal: LF reemplaçaria CRLF en data/top500.csv.
```

(That is Catalan for “LF would be replaced by CRLF in data/top500.csv”.)

It turns out that the way `argparse` opens the output file is not quite what `csv.writer` handles best, and the resulting output file used CRLF line endings. This was manually corrected in the file.
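Two details are at play here: the csv module's default dialect terminates rows with `\r\n`, and the csv documentation asks for files handed to `csv.writer` to be opened with `newline=''`. A hedged sketch of an opening that yields plain LF line endings (not the actual scrape.py code):

```python
import csv

# Open the file ourselves (instead of via argparse.FileType) so we can
# pass newline='' as the csv documentation recommends.
with open('top500.csv', 'w', newline='') as f:
    # The default dialect ends rows with '\r\n'; request plain LF instead.
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(['rank', 'site', 'system'])  # illustrative header row
```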
The web scraping code provided here is released under the GPL v3 license (see LICENSE).
The attached dataset is provided under the CC0 1.0 Public Domain dedication.