This is the documentation for the experiments reported in the short paper "Finally, a Downloadable Test Collection of Tweets". You can find the paper here.

Download the tweet datasets from the following sources:
- ArchiveTeam JSON Download of Twitter Stream 2013-02
- ArchiveTeam JSON Download of Twitter Stream 2013-03
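The exact download commands depend on the mirror; as a sketch, assuming the standard Internet Archive item layout (the item and file names below are assumptions, verify them against the listings linked above):

```
# Assumed archive.org item/file names; adjust to match the actual listings.
wget https://archive.org/download/archiveteam-twitter-stream-2013-02/archiveteam-twitter-stream-2013-02.tar
wget https://archive.org/download/archiveteam-twitter-stream-2013-03/archiveteam-twitter-stream-2013-03.tar
```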
Verify the MD5 checksums of the downloads:

md5sum archiveteam-twitter-stream-2013-02.tar
md5sum archiveteam-twitter-stream-2013-03.tar

Extract both tars:

tar -xvf archiveteam-twitter-stream-2013-02.tar
tar -xvf archiveteam-twitter-stream-2013-03.tar
Copy the extracted contents into a folder named ArchivedTweets2013, then upload it to HDFS:

hadoop fs -put ArchivedTweets2013 .
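A quick sanity check that the upload landed where the Spark job expects it:

```
# List the top-level contents of the uploaded folder in HDFS.
hadoop fs -ls ArchivedTweets2013
```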
To obtain the collection stats as presented in Table 1 and Table 2 in the paper:
- Clone and set up warcbase. Follow this for detailed instructions.
- Start a spark-shell:
spark-shell --jars lib/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar --num-executors 50 \
--executor-cores 10 --executor-memory 40G
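The counting itself happens inside the spark-shell (see the warcbase instructions above). As an independent sanity check on a small slice of the raw data, a minimal sketch, assuming the tars unpack into per-hour directories of bzip2-compressed JSON-per-line files and that jq is installed:

```
# Count distinct tweet ids (excluding delete notices) in one hour of the stream.
# The YYYY/MM/DD/HH layout is an assumption about the extracted tars.
bzcat ArchivedTweets2013/2013/02/01/00/*.json.bz2 \
  | jq -r 'select(has("delete") | not) | .id_str' \
  | sort -u | wc -l
```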
Source | Count |
---|---|
\|T\| | 259,035,603 |
\|A\| | 246,615,368 |
\|T ∪ A\| | 260,382,756 |
\|T ∩ A\| | 245,268,215 |
\|T − A\| | 13,767,388 |
\|A − T\| | 1,347,153 |
Measure | Overlap |
---|---|
1 − \|T − A\|/\|T\| | 94.69% |
1 − \|A − T\|/\|A\| | 99.45% |
\|T ∩ A\|/\|T\| | 94.69% |
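Note that 1 − \|T − A\|/\|T\| and \|T ∩ A\|/\|T\| are the same quantity (1 − 13,767,388/259,035,603 = 245,268,215/259,035,603 ≈ 94.69%): the fraction of the official TREC collection that survives in the ArchiveTeam crawl.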
Run the following to obtain the results as in Table 3:
python deletionAnalysis.py
Source | Count |
---|---|
\|T\| | 259,035,603 |
\|A\| | 246,615,368 |
\|D(13/02-13/03)\| | 10,631,099 |
\|D(13/04-13/06)\| | 5,091,183 |
\|D(13/07-13/12)\| | 7,197,460 |
\|D(14/01-14/12)\| | 9,698,613 |
\|D(15/01-15/12)\| | 7,928,857 |
\|D(16/01-16/12)\| | 7,496,871 |
\|T − D(13/02-13/03)\| | 248,404,504 |
\|A − D(13/02-13/03)\| | 234,337,730 |
\|T − D(13/02-13/06)\| | 243,313,321 |
\|A − D(13/02-13/06)\| | 230,893,086 |
\|T − D(13/02-13/12)\| | 236,115,861 |
\|A − D(13/02-13/12)\| | 223,695,626 |
\|T − D(13/02-14/12)\| | 226,417,248 |
\|A − D(13/02-14/12)\| | 213,997,013 |
\|T − D(13/02-15/12)\| | 218,488,391 |
\|A − D(13/02-15/12)\| | 206,068,156 |
\|T − D(13/02-16/12)\| | 210,991,520 |
\|A − D(13/02-16/12)\| | 198,571,285 |
The delete list used in the paper can be downloaded from here.
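For reference, delete notices appear in the raw stream as JSON objects with a top-level delete key. A minimal sketch for pulling deleted status ids out of one hour of the stream, under the same layout and jq assumptions as above:

```
# Extract the ids of deleted statuses from delete notices.
bzcat ArchivedTweets2013/2013/02/01/00/*.json.bz2 \
  | jq -r 'select(has("delete")) | .delete.status.id_str'
```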
Source | Missing reldocs | Missing qrels |
---|---|---|
\|T − D(13/02-13/12)\| | 220 (1.12%) | 1,820 (1.41%) |
\|A − D(13/02-13/12)\| | 209 (1.06%) | 1,707 (1.32%) |
\|T − D(13/02-14/12)\| | 539 (2.74%) | 4,456 (3.45%) |
\|A − D(13/02-14/12)\| | 513 (2.61%) | 4,190 (3.24%) |
\|T − D(13/02-15/12)\| | 816 (4.15%) | 6,576 (5.09%) |
\|A − D(13/02-15/12)\| | 776 (3.95%) | 6,193 (4.79%) |
\|T − D(13/02-16/12)\| | 1,095 (5.57%) | 8,500 (6.58%) |
\|A − D(13/02-16/12)\| | 1,042 (5.30%) | 7,997 (6.19%) |
Create the .bz2 delete list for the Internet Archive:

cd deletes-ia
for d in */ ; do cat "$d"/*; done > ../delete-list-13-ia-now.txt
cd ..
bzip2 delete-list-13-ia-now.txt
Create the .bz2 delete list for TREC Microblog:

cd deletes-trec
for d in */ ; do cat "$d"/*; done > ../delete-list-13-trec-now.txt
cd ..
bzip2 delete-list-13-trec-now.txt
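To check that the lists were assembled correctly, count the entries in each compressed list without unpacking it:

```
# Each line is one deleted tweet id.
bzcat delete-list-13-ia-now.txt.bz2 | wc -l
bzcat delete-list-13-trec-now.txt.bz2 | wc -l
```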
Clone Anserini, check out the twitter-search branch, and build:

git clone https://github.com/castorini/Anserini.git
cd Anserini
git checkout twitter-search
mvn clean package appassembler:assemble
Index the IA collection, applying the delete list:
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> -deletes delete-list-13-ia-now.txt -index \
tweets2013-IA-index-del -optimize -store
Retrieve results for the TREC 2013 Microblog topics:

sh target/appassembler/bin/SearchTweets -index tweets2013-IA-index-del -bm25 -topics topics.microblog2013.txt -output run.ia.del.mb13.txt
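Table 5 compares runs built with and without deletes applied. A sketch of the other indexing variants, assuming -deletes can simply be omitted (for the no-deletes run) or pointed at the TREC Microblog list (the index names below are illustrative):

```
# Index without applying any delete list (hypothetical index name).
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> \
  -index tweets2013-IA-index -optimize -store
# Index applying the TREC Microblog delete list instead (hypothetical index name).
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> \
  -deletes delete-list-13-trec-now.txt -index tweets2013-IA-index-trec-del -optimize -store
```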
Download and build the latest trec_eval:

wget http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz
tar -xzvf trec_eval_latest.tar.gz
cd trec_eval.9.0/
make
Evaluate the different configurations to obtain the results shown in Table 5 of the paper, for example:
eval/trec_eval.9.0/trec_eval qrels.mb.txt run.a-d.mb.txt
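To score several runs in one pass, a small loop over whatever run files were produced above:

```
# Evaluate each run file against the Microblog qrels.
for run in run.*.txt; do
  echo "== $run"
  eval/trec_eval.9.0/trec_eval qrels.mb.txt "$run"
done
```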