Upgrade to use Python 3, fixes #11 (#31)

Open · wants to merge 7 commits into main

README.md (13 additions, 11 deletions)
@@ -1,8 +1,8 @@
-![Common Crawl Logo](http://commoncrawl.org/wp-content/uploads/2016/12/logocommoncrawl.png)
+![Common Crawl Logo](https://commoncrawl.org/wp-content/uploads/2016/12/logocommoncrawl.png)

-# mrjob starter kit
+# Common Crawl mrjob starter kit

-This project demonstrates using Python to process the Common Crawl dataset with the mrjob framework.
+This project demonstrates using Python to process the Common Crawl dataset with the [mrjob framework](https://mrjob.readthedocs.io/en/latest/).
There are three tasks to run using the three different data formats:

+ Counting HTML tags using Common Crawl's raw response data (WARC files)
@@ -27,7 +27,7 @@ If you would like to create a virtual environment to protect local dependencies:
pip install -r requirements.txt

To develop locally, you'll need at least three data files -- one for each format the crawl uses.
-These can either be downloaded by running the `get-data.sh` command line program or manually by grabbing the [WARC](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz), [WAT](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wat/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wat.gz), and [WET](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wet/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wet.gz) files.
+These can either be downloaded by running the `get-data.sh` command line program or manually by grabbing the [WARC](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz), [WAT](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wat/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wat.gz), and [WET](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wet/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wet.gz) files.

## Running the code

@@ -77,20 +77,22 @@ Using the 'local' runner simulates more features of Hadoop, such as counters:

python tag_counter.py -r local --conf-path mrjob.conf --no-output --output-dir output/ input/test-1.warc

+Note: Python 3 is required. Depending on the underlying operating system, you may need to run the jobs with the `python3` executable. If you need the older version running on Python 2.7 (no longer maintained), please check out the git branch `python-2.7` instead.
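
For orientation, a job script in this repository is shaped roughly like the sketch below. This is only an illustration: it assumes the shared `CCJob` base class in `mrcc.py` calls a `process_record(record)` hook for every WARC record; see `mrcc.py` and `tag_counter.py` for the actual interface.

```python
# Hypothetical sketch of a job built on the shared CCJob base class.
# Assumes CCJob calls process_record() once per WARC record; check
# mrcc.py for the interface this repository actually uses.
from mrcc import CCJob


class TagCounter(CCJob):
    def process_record(self, record):
        # Only raw HTTP responses in WARC files carry HTML payloads.
        if record['Content-Type'] != 'application/http; msgtype=response':
            return
        payload = record.payload.read()
        # Very naive tag counting; a real job would parse the HTML.
        for tag in (b'<a ', b'<img ', b'<script'):
            yield tag.decode('ascii').strip(), payload.count(tag)
        # Counters show up in the 'local' runner output and on EMR.
        self.increment_counter('commoncrawl', 'processed records', 1)


if __name__ == '__main__':
    TagCounter.run()
```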

### Running via Elastic MapReduce

-As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it without incurring any transfer costs.
+As the Common Crawl dataset lives in the [Amazon Web Services’ Open Data Sets Sponsorships](https://aws.amazon.com/opendata/) program, you can [access it for free](https://commoncrawl.org/access-the-data/).
The only cost that you incur is the cost of the machines and Elastic MapReduce itself.

-By default, EMR machines run with Python 2.6.
-The configuration file automatically installs Python 2.7 on your cluster for you.
-The steps to do this are documented in `mrjob.conf`.
+By default, EMR machines run with Python 3.x.
+The configuration file automatically installs Python 3 on your cluster for you.
+The steps to do this are documented in [mrjob.conf](./mrjob.conf).

-The three job examples in this repository (`tag_counter.py`, `server_analysis.py`, `word_count.py`) rely on a common module - `mrcc.py`.
+The three job examples in this repository ([tag_counter.py](./tag_counter.py), [server_analysis.py](./server_analysis.py), [word_count.py](./word_count.py)) rely on a common module - [mrcc.py](./mrcc.py).
By default, this module will not be present when you run the examples on Elastic MapReduce, so you have to include it explicitly.
You have two options:

-1. [Deploy your source tree as a tar ball](http://pythonhosted.org/mrjob/guides/setup-cookbook.html#uploading-your-source-tree-as-an-archive)
+1. [Deploy your source tree as a tar ball](https://mrjob.readthedocs.io/en/latest/guides/setup-cookbook.html#uploading-your-source-tree)
2. Copy-paste the code from mrcc.py into the job example that you are trying to run:

cat mrcc.py tag_counter.py | sed "s/from mrcc import CCJob//" > tag_counter_emr.py
@@ -112,7 +114,7 @@ To launch the job on a Hadoop cluster of AWS EC2 instances (e.g., CDH), see the

To run your mrjob task over the entirety of the Common Crawl dataset, you can use the WARC, WAT, or WET file listings found at `CC-MAIN-YYYY-WW/[warc|wat|wet].paths.gz`.

-As an example, the [August 2014 crawl](http://commoncrawl.org/august-2014-crawl-data-available/) has 52,849 WARC files listed by [warc.paths.gz](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/warc.paths.gz). You'll find pointers to listings for all crawls including the most recent ones on the [commoncrawl Public Data Set bucket](https://commoncrawl.s3.amazonaws.com/crawl-data/index.html) and the [get-started page](https://commoncrawl.org/the-data/get-started/).
+As an example, the [August 2014 crawl](https://commoncrawl.org/august-2014-crawl-data-available/) has 52,849 WARC files listed by [warc.paths.gz](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/warc.paths.gz). You'll find pointers to listings for all crawls including the most recent ones on the [commoncrawl Public Data Set bucket](https://data.commoncrawl.org/crawl-data/index.html) and the [get-started page](https://commoncrawl.org/the-data/get-started/).

It is highly recommended to run over batches of files at a time and then perform a secondary reduce over those results.
Running a single job over the entirety of the dataset complicates the situation substantially.
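
As a rough illustration (the crawl ID, batch size, and output file names below are arbitrary examples, not part of this repository), a listing can be fetched, decompressed, and split into per-batch input files that are then passed to separate job runs:

```python
# Rough sketch: split a crawl's WARC listing into smaller input files,
# one per job run. Crawl ID and batch size are arbitrary examples.
import gzip
import urllib.request

LISTING = ('https://data.commoncrawl.org/crawl-data/'
           'CC-MAIN-2014-35/warc.paths.gz')
BATCH_SIZE = 100  # number of WARC files per job run

with urllib.request.urlopen(LISTING) as resp:
    paths = gzip.decompress(resp.read()).decode('utf-8').splitlines()

for i in range(0, len(paths), BATCH_SIZE):
    with open('batch-{:05d}.txt'.format(i // BATCH_SIZE), 'w') as out:
        out.write('\n'.join(paths[i:i + BATCH_SIZE]) + '\n')
```
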
fastwarc_warc_wrapper.py (new file, 85 additions)
@@ -0,0 +1,85 @@
import io

import fastwarc.warc as fastwarc


class WARCFile():
    """Wrapper around fastwarc.warc.ArchiveIterator to mimic the behavior of
    warc.WARCFile when reading WARC/WAT/WET files"""

    def __init__(self, filename=None, fileobj=None):
        if fileobj is None:
            fileobj = open(filename, "rb")
        self.fileobj = fileobj
        # parse_http=False: HTTP headers are not parsed off the payload
        self.iter = fastwarc.ArchiveIterator(self.fileobj, parse_http=False)

    def __iter__(self):
        try:
            while True:
                yield next(self)
        except StopIteration:
            pass

    def __next__(self):
        return WARCRecord(next(self.iter))

    def close(self):
        self.fileobj.close()

    def tell(self):
        raise NotImplementedError('Use record.stream_pos instead')


class WARCRecord(object):
    """Replacement for warc.WARCRecord backed by fastwarc.warc.WarcRecord"""

    def __init__(self, warc_record):
        self._rec = warc_record

    @property
    def type(self):
        """Record type"""
        return self._rec.record_type

    @property
    def url(self):
        """The value of the WARC-Target-URI header if the record is of type "response"."""
        return self._rec.headers.get('WARC-Target-URI')

    @property
    def ip_address(self):
        """The IP address of the host contacted to retrieve the content of this record.

        This value is available from the WARC-IP-Address header."""
        return self._rec.headers.get('WARC-IP-Address')

    @property
    def date(self):
        """UTC timestamp of the record."""
        return self._rec.headers.get("WARC-Date")

    @property
    def checksum(self):
        return self._rec.headers.get('WARC-Payload-Digest')

    @property
    def payload(self):
        return io.BytesIO(self._rec.reader.read())

    def __getitem__(self, name):
        return self._rec.headers.get(name)

    def __setitem__(self, name, value):
        raise NotImplementedError('FastWARC headers cannot be modified')

    def __contains__(self, name):
        return name in self._rec.headers

    def __str__(self):
        return str(self._rec)

    def __repr__(self):
        return "<WARCRecord: type={} record_id={}>".format(
            self.type, self['WARC-Record-ID'])


class ARCFile(WARCFile):
    def __init__(self, filename=None, fileobj=None):
        raise NotImplementedError('FastWARC cannot read ARC files.')
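
A minimal usage sketch for the wrapper above (the input path is a placeholder; any local WARC/WAT/WET file downloaded by `get-data.sh` will do):

```python
from fastwarc_warc_wrapper import WARCFile

# Placeholder path; point this at any local WARC/WAT/WET file.
warc = WARCFile(filename='input/test-1.warc')
for record in warc:
    # Records without a WARC-Target-URI header (e.g. warcinfo) yield None here.
    if record.url:
        print(record.date, record.url)
warc.close()
```
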
get-data.sh (2 additions, 2 deletions)
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/usr/bin/env bash

ccfiles=(
'crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz'
@@ -11,5 +11,5 @@ for ccfile in ${ccfiles[@]}; do
mkdir -p `dirname $ccfile`
echo "Downloading `basename $ccfile` ..."
echo "---"
-wget --no-clobber https://commoncrawl.s3.amazonaws.com/$ccfile -O $ccfile
+wget --no-clobber https://data.commoncrawl.org/$ccfile -O $ccfile
done