Upgrade to use Python 3, fixes #11 (#31)

Open · wants to merge 7 commits into main

README.md (13 additions, 11 deletions)
@@ -1,8 +1,8 @@
-![Common Crawl Logo](http://commoncrawl.org/wp-content/uploads/2016/12/logocommoncrawl.png)
+![Common Crawl Logo](https://commoncrawl.org/wp-content/uploads/2016/12/logocommoncrawl.png)

-# mrjob starter kit
+# Common Crawl mrjob starter kit

-This project demonstrates using Python to process the Common Crawl dataset with the mrjob framework.
+This project demonstrates using Python to process the Common Crawl dataset with the [mrjob framework](https://mrjob.readthedocs.io/en/latest/).
There are three tasks to run using the three different data formats:

+ Counting HTML tags using Common Crawl's raw response data (WARC files)
@@ -27,7 +27,7 @@ If you would like to create a virtual environment to protect local dependencies:
pip install -r requirements.txt

To develop locally, you'll need at least three data files -- one for each format the crawl uses.
-These can either be downloaded by running the `get-data.sh` command line program or manually by grabbing the [WARC](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz), [WAT](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wat/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wat.gz), and [WET](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wet/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wet.gz) files.
+These can either be downloaded by running the `get-data.sh` command line program or manually by grabbing the [WARC](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz), [WAT](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wat/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wat.gz), and [WET](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/wet/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.wet.gz) files.

## Running the code

@@ -77,20 +77,22 @@ Using the 'local' runner simulates more features of Hadoop, such as counters:

python tag_counter.py -r local --conf-path mrjob.conf --no-output --output-dir output/ input/test-1.warc

+Note: Python 3 is required. Depending on the underlying operating system, you may need to run the jobs with the `python3` executable. If you need the older version running on Python 2.7 (no longer maintained), please check out the git branch `python-2.7` instead.
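
For orientation, a job script in this repository is shaped roughly like the sketch below. This is only an illustration: it assumes the shared `CCJob` base class in `mrcc.py` calls a `process_record(record)` hook for every WARC record; see `mrcc.py` and `tag_counter.py` for the actual interface.

```python
# Hypothetical sketch of a job built on the shared CCJob base class.
# Assumes CCJob calls process_record() once per WARC record; check
# mrcc.py for the interface this repository actually uses.
from mrcc import CCJob


class TagCounter(CCJob):
    def process_record(self, record):
        # Only raw HTTP responses in WARC files carry HTML payloads.
        if record['Content-Type'] != 'application/http; msgtype=response':
            return
        payload = record.payload.read()
        # Very naive tag counting; a real job would parse the HTML.
        for tag in (b'<a ', b'<img ', b'<script'):
            yield tag.decode('ascii').strip(), payload.count(tag)
        # Counters show up in the 'local' runner output and on EMR.
        self.increment_counter('commoncrawl', 'processed records', 1)


if __name__ == '__main__':
    TagCounter.run()
```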

### Running via Elastic MapReduce

-As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it without incurring any transfer costs.
+As the Common Crawl dataset lives in the [Amazon Web Services’ Open Data Sets Sponsorships](https://aws.amazon.com/opendata/) program, you can [access it for free](https://commoncrawl.org/access-the-data/).
The only cost that you incur is the cost of the machines and Elastic MapReduce itself.

-By default, EMR machines run with Python 2.6.
-The configuration file automatically installs Python 2.7 on your cluster for you.
-The steps to do this are documented in `mrjob.conf`.
+By default, EMR machines run with Python 3.x.
+The configuration file automatically installs Python 3 on your cluster for you.
+The steps to do this are documented in [mrjob.conf](./mrjob.conf).

-The three job examples in this repository (`tag_counter.py`, `server_analysis.py`, `word_count.py`) rely on a common module - `mrcc.py`.
+The three job examples in this repository ([tag_counter.py](./tag_counter.py), [server_analysis.py](./server_analysis.py), [word_count.py](./word_count.py)) rely on a common module - [mrcc.py](./mrcc.py).
By default, this module will not be present when you run the examples on Elastic MapReduce, so you have to include it explicitly.
You have two options:

-1. [Deploy your source tree as a tar ball](http://pythonhosted.org/mrjob/guides/setup-cookbook.html#uploading-your-source-tree-as-an-archive)
+1. [Deploy your source tree as a tar ball](https://mrjob.readthedocs.io/en/latest/guides/setup-cookbook.html#uploading-your-source-tree)
2. Copy-paste the code from mrcc.py into the job example that you are trying to run:

cat mrcc.py tag_counter.py | sed "s/from mrcc import CCJob//" > tag_counter_emr.py
@@ -112,7 +114,7 @@ To launch the job on a Hadoop cluster of AWS EC2 instances (e.g., CDH), see the

To run your mrjob task over the entirety of the Common Crawl dataset, you can use the WARC, WAT, or WET file listings found at `CC-MAIN-YYYY-WW/[warc|wat|wet].paths.gz`.

-As an example, the [August 2014 crawl](http://commoncrawl.org/august-2014-crawl-data-available/) has 52,849 WARC files listed by [warc.paths.gz](https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-35/warc.paths.gz). You'll find pointers to listings for all crawls including the most recent ones on the [commoncrawl Public Data Set bucket](https://commoncrawl.s3.amazonaws.com/crawl-data/index.html) and the [get-started page](https://commoncrawl.org/the-data/get-started/).
+As an example, the [August 2014 crawl](https://commoncrawl.org/august-2014-crawl-data-available/) has 52,849 WARC files listed by [warc.paths.gz](https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-35/warc.paths.gz). You'll find pointers to listings for all crawls including the most recent ones on the [commoncrawl Public Data Set bucket](https://data.commoncrawl.org/crawl-data/index.html) and the [get-started page](https://commoncrawl.org/the-data/get-started/).

It is highly recommended to run over batches of files at a time and then perform a secondary reduce over those results.
Running a single job over the entirety of the dataset complicates the situation substantially.
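
As a rough illustration (the crawl ID, batch size, and output file names below are arbitrary examples, not part of this repository), a listing can be fetched, decompressed, and split into per-batch input files that are then passed to separate job runs:

```python
# Rough sketch: split a crawl's WARC listing into smaller input files,
# one per job run. Crawl ID and batch size are arbitrary examples.
import gzip
import urllib.request

LISTING = ('https://data.commoncrawl.org/crawl-data/'
           'CC-MAIN-2014-35/warc.paths.gz')
BATCH_SIZE = 100  # number of WARC files per job run

with urllib.request.urlopen(LISTING) as resp:
    paths = gzip.decompress(resp.read()).decode('utf-8').splitlines()

for i in range(0, len(paths), BATCH_SIZE):
    with open('batch-{:05d}.txt'.format(i // BATCH_SIZE), 'w') as out:
        out.write('\n'.join(paths[i:i + BATCH_SIZE]) + '\n')
```
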
fastwarc_warc_wrapper.py (new file, 85 additions)
@@ -0,0 +1,85 @@
import io

import fastwarc.warc as fastwarc


class WARCFile():
    """Wrapper around fastwarc.warc.ArchiveIterator to mimic the behavior of
    warc.WARCFile when reading WARC/WAT/WET files"""

    def __init__(self, filename=None, fileobj=None):
        if fileobj is None:
            fileobj = open(filename, "rb")
        self.fileobj = fileobj
        # parse_http=False: HTTP headers are not parsed off the payload
        self.iter = fastwarc.ArchiveIterator(self.fileobj, parse_http=False)

    def __iter__(self):
        try:
            while True:
                yield next(self)
        except StopIteration:
            pass

    def __next__(self):
        return WARCRecord(next(self.iter))

    def close(self):
        self.fileobj.close()

    def tell(self):
        raise NotImplementedError('Use record.stream_pos instead')


class WARCRecord(object):
    """Replacement for warc.WARCRecord backed by fastwarc.warc.WarcRecord"""

    def __init__(self, warc_record):
        self._rec = warc_record

    @property
    def type(self):
        """Record type"""
        return self._rec.record_type

    @property
    def url(self):
        """The value of the WARC-Target-URI header if the record is of type "response"."""
        return self._rec.headers.get('WARC-Target-URI')

    @property
    def ip_address(self):
        """The IP address of the host contacted to retrieve the content of this record.

        This value is available from the WARC-IP-Address header."""
        return self._rec.headers.get('WARC-IP-Address')

    @property
    def date(self):
        """UTC timestamp of the record."""
        return self._rec.headers.get("WARC-Date")

    @property
    def checksum(self):
        return self._rec.headers.get('WARC-Payload-Digest')

    @property
    def payload(self):
        return io.BytesIO(self._rec.reader.read())

    def __getitem__(self, name):
        return self._rec.headers.get(name)

    def __setitem__(self, name, value):
        raise NotImplementedError('FastWARC headers cannot be modified')

    def __contains__(self, name):
        return name in self._rec.headers

    def __str__(self):
        return str(self._rec)

    def __repr__(self):
        return "<WARCRecord: type={} record_id={}>".format(
            self.type, self['WARC-Record-ID'])


class ARCFile(WARCFile):
    def __init__(self, filename=None, fileobj=None):
        raise NotImplementedError('FastWARC cannot read ARC files.')
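
A minimal usage sketch for the wrapper above (the input path is a placeholder; any local WARC/WAT/WET file downloaded by `get-data.sh` will do):

```python
from fastwarc_warc_wrapper import WARCFile

# Placeholder path; point this at any local WARC/WAT/WET file.
warc = WARCFile(filename='input/test-1.warc')
for record in warc:
    # Records without a WARC-Target-URI header (e.g. warcinfo) yield None here.
    if record.url:
        print(record.date, record.url)
warc.close()
```
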
get-data.sh (2 additions, 2 deletions)
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/usr/bin/env bash

ccfiles=(
'crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/CC-MAIN-20140820021320-00000-ip-10-180-136-8.ec2.internal.warc.gz'
@@ -11,5 +11,5 @@ for ccfile in ${ccfiles[@]}; do
mkdir -p `dirname $ccfile`
echo "Downloading `basename $ccfile` ..."
echo "---"
-wget --no-clobber https://commoncrawl.s3.amazonaws.com/$ccfile -O $ccfile
+wget --no-clobber https://data.commoncrawl.org/$ccfile -O $ccfile
done