TOP500 list web scraping

This is a project to scrape the TOP500 list, which ranks the 500 most powerful commercially available computer systems in the world.

The goal of this project is to extract the details of the computers from the lists presented on the web site as HTML tables and to store them in CSV format for later processing.

Overview

The www.top500.org web site structure

The lists are published twice a year (in June and November), and the web site presents each edition of the list split across 5 pages of 100 entries each.

Each page has one table with one row per entry. Each row provides these details about each supercomputer:

  • Rank: the order of the supercomputer in that edition of the list
  • Site: where the supercomputer is located
  • System: name of the system
  • Cores: number of compute cores
  • Rmax: maximal LINPACK performance achieved (TFlop/s)
  • Rpeak: theoretical peak performance (TFlop/s)
  • Power: power consumption (kW)

Details of the collected information are available on the TOP500 site's description page.

robots.txt

The site has a very tolerant robots.txt file:

$ curl https://www.top500.org/robots.txt
User-agent: *
Allow: /

Host: www.top500.org
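
The same policy can be checked programmatically; a minimal sketch using Python's standard urllib.robotparser (not part of the scraper itself, just an illustration):

import urllib.robotparser

# Parse the site's robots.txt and check whether a page may be fetched
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.top500.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.top500.org/"))  # True, given "Allow: /"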

Usage

$ python3 scrape.py -h
usage: scrape.py [-h]
                 [-y {1993,...,2017}]
                 [-m {6,11}]
                 [-z {1993,...,2017}]
                 [-n {6,11}] [-c COUNT] [-f]
                 [outfile]

positional arguments:
  outfile               Output file (default: top500.csv)

optional arguments:
  -h, --help            show this help message and exit
  -y {1993,...,2017}, --year {1993,...,2017}
                        Collect from year (default: 2017)
  -m {6,11}, --month {6,11}
                        Collect from month (default: 11)
  -z {1993,...,2017}, --endyear {1993,...,2017}
                        Collect until year (default: 2017)
  -n {6,11}, --endmonth {6,11}
                        Collect until month (default: 11)
  -c COUNT, --count COUNT
                        Number of entries to get from each edition (default:
                        500)
  -f, --force           Force a partial count (default: False)

In summary: if invoked without arguments, scrape.py will download the whole list for the latest available edition (at the time of writing, this is November 2017; this is calculated at run time), i.e. 500 entries.

You can specify a starting edition with --year and/or --month, and it will start from there. It will continue until the latest edition, or you can specify the last edition to collect with --endyear and/or --endmonth.
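
As an illustration of how these options bound the collection, here is a minimal sketch (not the scraper's actual code) that enumerates the (year, month) editions between the two endpoints, following the June/November schedule described above:

def editions(start_year, start_month, end_year, end_month):
    """Yield (year, month) pairs for every TOP500 edition in the range."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield year, month
        # editions are published in June and November of each year
        year, month = (year, 11) if month == 6 else (year + 1, 6)

print(list(editions(2016, 6, 2017, 11)))
# [(2016, 6), (2016, 11), (2017, 6), (2017, 11)]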

By default it collects all 500 systems for each edition. You can change that with --count (this specifies the number of systems to collect per edition, not the total number of entries to collect, which would not make much sense).

Because the lists are hosted in pages of 100 entries each, by default --count only accepts values in hundreds (100, 200, 300, …) to make the most of the downloaded content. You can force partial scraping of downloaded pages with --force (i.e. if you want to collect just 250 entries per list, specify -c 250 -f; it will still download 300, because that’s how the pages are built, but will only scrape the first 250).
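
The page arithmetic this implies can be sketched as follows (hypothetical helper, not the project's actual function): the number of pages to download is the requested count rounded up to the next hundred, and without --force only multiples of 100 are accepted.

import math

PAGE_SIZE = 100  # each list page on the site holds 100 entries

def pages_needed(count, force=False):
    """Return how many 100-entry pages must be downloaded for `count` entries."""
    if count % PAGE_SIZE != 0 and not force:
        raise ValueError("count must be a multiple of 100 unless --force is given")
    return math.ceil(count / PAGE_SIZE)

print(pages_needed(500))        # 5
print(pages_needed(250, True))  # 3 pages downloaded, only 250 entries scraped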

Dependencies

The scraper has the following dependencies:

Proxy support

The requests module has built-in proxy support. If needed, you can set the HTTPS_PROXY environment variable pointing to your proxy before running the scraper:

HTTPS_PROXY="http://my-proxy.example.com:3128"
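
The variable is picked up automatically by requests; a proxy can also be passed explicitly per request. A short sketch (the proxy URL is just the placeholder from above):

import requests

# requests honours HTTPS_PROXY from the environment by default (trust_env=True);
# the same proxy can also be passed explicitly:
proxies = {"https": "http://my-proxy.example.com:3128"}
response = requests.get("https://www.top500.org/robots.txt", proxies=proxies)
print(response.status_code)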

Scraping notes

Some notes about the website structure, to keep in mind during scraping.

List page

Number format

Numbers are written in US format, with commas to separate thousands and dots for the decimal point.

Conversion is performed using the locale module, setting locale to en_US.UTF-8 and using the atoi and atof functions.
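
For example (the sample values are illustrative, not taken from a particular list entry):

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.atoi('1,572,864')   # e.g. a core count
1572864
>>> locale.atof('93,014.6')    # e.g. an Rmax value in TFlop/s
93014.6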

Highlights list vs full list

Each edition of the list has a highlights list with the top 10 for that edition. That list has 6 columns, where the 2nd column contains both the system and the site mixed in.

The tables with the full list contain 7 columns: the 2nd column contains the site and the 3rd contains the system.

We scrape only the full lists, as the highlights/summary do not provide any additional information.
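
One way to tell the two apart is simply the number of cells per row; a minimal sketch assuming an HTML parser such as BeautifulSoup (which may or may not be what scrape.py actually uses):

from bs4 import BeautifulSoup  # assumed parser

def full_list_rows(html):
    """Yield only the rows of the full-list table (7 columns per row)."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 7:  # highlights tables have 6 columns and are skipped
            yield cells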

System name column structure

A system entry (3rd column in the list) looks like this in most cases:

<td><a href="https://www.top500.org/system/SYSTEM_ID">
    SYSTEM_NAME, PROCESSOR, INTERCONNECT, COPROCESSORS
</a><br/>MANUFACTURER</td>

Of these fields, the system name and co-processors can only be found here (i.e. in that column of the listings); they are not provided in the system's details page. The title of that page does repeat the whole string, including system name and processor, but neither the name nor the co-processors are listed on their own anywhere.

Interconnect and Processor are listed in the system’s details page (the href in that column) and can be properly identified from that other page. Therefore, an attempt has been made to extract a system’s name and co-processor (stored as gpu in the output) from this column.
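
A rough sketch of splitting that cell into its raw pieces before any name/gpu guessing, again assuming BeautifulSoup (hypothetical helper, simplified error handling):

from bs4 import BeautifulSoup  # assumed parser

def parse_system_cell(td_html):
    """Split the system cell into link target, description text and manufacturer."""
    td = BeautifulSoup(td_html, "html.parser").td
    link = td.find("a")
    system_url = link["href"]                # https://www.top500.org/system/SYSTEM_ID
    description = link.get_text(strip=True)  # "SYSTEM_NAME, PROCESSOR, INTERCONNECT, ..."
    manufacturer = td.contents[-1].strip()   # the text after the <br/>
    return system_url, description, manufacturer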

However, the known processor and interconnect details cannot be matched exactly against the listing text (see the fuzzy comparison notes below). As a result, parsing the system entry from the listing to obtain the name and gpu cannot be done reliably for all systems using only the data available on the top500 web site.

Therefore, the code that handles this is “best effort” and the GPU column of the output has “mixed results”. Some additional processing with the help of specialized data sources (e.g. a database of known co-processors) would improve that.

Lacking this, the approach taken was:

  • Remove the known processor and interconnect details from the system's description. NOTE: these are not exact matches, and there is quite a lot of variety here too, unfortunately.
  • With the remaining content (if any), try to guess which part is the name and which is the GPU. After some observation, the simplest approach seems to be to use the first remaining part (delimited by commas) as the name and the rest as the GPU. A rough sketch of this heuristic is shown below.
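
A minimal sketch of that heuristic (hypothetical helper; the real code additionally has to match the processor and interconnect strings fuzzily, as discussed below):

def guess_name_and_gpu(description, processor, interconnect):
    """Best-effort split of a system description into (name, gpu)."""
    parts = [p.strip() for p in description.split(",")]
    # Drop the parts matching the known processor/interconnect details;
    # exact matching is used here for simplicity.
    leftovers = [p for p in parts if p not in (processor, interconnect)]
    if not leftovers:
        return "", ""
    name = leftovers[0]              # first remaining part is taken as the name
    gpu = ", ".join(leftovers[1:])   # anything else is assumed to be the co-processor
    return name, gpu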

Fuzzy comparison notes

SequenceMatcher from difflib has been used to compare known details with system descriptions to try to extract the information. It’s challenging.

These are supposed to be equal (interconnect):

>>> from difflib import SequenceMatcher as SM
>>> SM(None, 'Custom', 'Custom Interconnect').ratio()
0.48
>>> SM(None, 'Gigabit Ethernet', 'GigEth').ratio()
0.5454545454545454

These should be different (system name vs processor):

>>> SM(None, 'R12000 300MHz', 'ORIGIN 2000 300 MHz').ratio()
0.75
>>> SM(None, 'PowerPC 604e 332MHz', 'SP PC604e 332 MHz').ratio()
0.7777777777777778
>>> SM(None, 'Cray X1E (1 GHz)', 'Cray X1E 2c 1GHz').ratio()
0.8125
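
Given overlaps like these, any fixed similarity threshold is a compromise; the comparison boils down to something like the following sketch (the 0.6 cut-off is only an illustration, not the value used by the scraper):

from difflib import SequenceMatcher

def looks_like(known, candidate, threshold=0.6):
    """Return True if candidate is similar enough to the known detail."""
    return SequenceMatcher(None, known.lower(), candidate.lower()).ratio() >= threshold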

Site name column structure

A site (2nd column in the list) looks like this:

<td><a href="https://www.top500.org/site/SITE_ID">SITE_NAME</a><br>COUNTRY</td>
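
Extracting it is straightforward; a sketch along the same lines as above (assuming BeautifulSoup):

from bs4 import BeautifulSoup  # assumed parser

def parse_site_cell(td_html):
    """Extract (site_url, site_name, country) from the site cell."""
    td = BeautifulSoup(td_html, "html.parser").td
    link = td.find("a")
    return link["href"], link.get_text(strip=True), td.contents[-1].strip()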

System details page

Power

Several Power values are marked as “(Submitted)”, e.g. https://www.top500.org/system/177981.

Some have “(Derived)”: https://www.top500.org/system/178751

Many have no value in this field.
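
A minimal sketch of normalising such Power values before storing them (assuming the cell text contains only the number and, possibly, one of these annotations; numbers are parsed with locale.atof as described earlier):

import locale

locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')

def parse_power(text):
    """Return the power value in kW as a float, or None if it is missing."""
    cleaned = text.replace("(Submitted)", "").replace("(Derived)", "").strip()
    return locale.atof(cleaned) if cleaned else None

print(parse_power("3,945.00 (Submitted)"))  # 3945.0
print(parse_power(""))                      # None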

Some systems evolve over time

Several fields can change on the same system between different editions of the list, including the system’s name, number of cores, Rmax, Rpeak, Power, …

The web site is a bit inconsistent here: the system details page always shows the latest version of the content, regardless of its history.

The scraper keeps the values that the lists show, overriding the system details with the values from the list of each edition, to avoid this inconsistency as much as possible.

Collected dataset

The attached dataset is the full collection of TOP500 lists, i.e. starting from the first available list from June 1993. It was collected with:

$ time python3 -u scrape.py --year 1993 --month 6 > scraping.log

real	57m7,175s
user	11m38,674s
sys	0m4,049s

While adding the dataset to the repo, a problem was detected:

fatal: LF would be replaced by CRLF in data/top500.csv

It turns out that the way argparse opens the output file is not quite what csv.writer handles best, and the resulting output file used CRLF line endings. This was manually corrected in the file.
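
For reference, the usual way to avoid this with the csv module is to open the output file with newline='' and/or set an explicit line terminator; a sketch (not necessarily how scrape.py handles it):

import csv

# Opening with newline='' lets csv.writer control line endings itself;
# lineterminator='\n' then produces plain LF line endings.
with open("top500.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow(["rank", "site", "system", "cores", "rmax", "rpeak", "power"])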

License

The web scraping code provided here is released under the GPL v3 license (see LICENSE).

The attached dataset is provided under the CC0 1.0 Public Domain Dedication.
