This is the documentation for the experiments reported in the short paper "Finally, a Downloadable Test Collection of Tweets". You can find the paper here.

Download the tweet datasets from the following sources:
- ArchiveTeam JSON Download of Twitter Stream 2013-02
- ArchiveTeam JSON Download of Twitter Stream 2013-03
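The exact download commands depend on the mirror; as a sketch, assuming the standard Internet Archive item layout (the item and file names below are assumptions, verify them against the listings linked above):

```
# Assumed archive.org item/file names; adjust to match the actual listings.
wget https://archive.org/download/archiveteam-twitter-stream-2013-02/archiveteam-twitter-stream-2013-02.tar
wget https://archive.org/download/archiveteam-twitter-stream-2013-03/archiveteam-twitter-stream-2013-03.tar
```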
Verify the MD5 checksums of the downloads:

md5sum archiveteam-twitter-stream-2013-02.tar
md5sum archiveteam-twitter-stream-2013-03.tar

Extract both tars:

tar -xvf archiveteam-twitter-stream-2013-02.tar
tar -xvf archiveteam-twitter-stream-2013-03.tar
Copy the extracted contents into a folder named ArchivedTweets2013, then upload it to HDFS:

hadoop fs -put ArchivedTweets2013 .
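A quick sanity check that the upload landed where the Spark job expects it:

```
# List the top-level contents of the uploaded folder in HDFS.
hadoop fs -ls ArchivedTweets2013
```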
To obtain the collection stats as presented in Table 1 and Table 2 in the paper:
- Clone and set up warcbase. Follow this for detailed instructions.
- Start a spark-shell:
spark-shell --jars lib/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar --num-executors 50 \
--executor-cores 10 --executor-memory 40G
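The counting itself happens inside the spark-shell (see the warcbase instructions above). As an independent sanity check on a small slice of the raw data, a minimal sketch, assuming the tars unpack into per-hour directories of bzip2-compressed JSON-per-line files and that jq is installed:

```
# Count distinct tweet ids (excluding delete notices) in one hour of the stream.
# The YYYY/MM/DD/HH layout is an assumption about the extracted tars.
bzcat ArchivedTweets2013/2013/02/01/00/*.json.bz2 \
  | jq -r 'select(has("delete") | not) | .id_str' \
  | sort -u | wc -l
```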
Source | Count |
---|---|
\|T\| | 259,035,603 |
\|A\| | 246,615,368 |
\|T ∪ A\| | 260,382,756 |
\|T ∩ A\| | 245,268,215 |
\|T − A\| | 13,767,388 |
\|A − T\| | 1,347,153 |
Measure | Overlap |
---|---|
1 − \|T − A\|/\|T\| | 94.69% |
1 − \|A − T\|/\|A\| | 99.45% |
\|T ∩ A\|/\|T\| | 94.69% |
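Note that 1 − \|T − A\|/\|T\| and \|T ∩ A\|/\|T\| are the same quantity (1 − 13,767,388/259,035,603 = 245,268,215/259,035,603 ≈ 94.69%): the fraction of the official TREC collection that survives in the ArchiveTeam crawl.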
Run the following to obtain the results as in Table 3:
python deletionAnalysis.py
Source | Count |
---|---|
\|T\| | 259,035,603 |
\|A\| | 246,615,368 |
\|D(13/02-13/03)\| | 10,631,099 |
\|D(13/04-13/06)\| | 5,091,183 |
\|D(13/07-13/12)\| | 7,197,460 |
\|D(14/01-14/12)\| | 9,698,613 |
\|D(15/01-15/12)\| | 7,928,857 |
\|D(16/01-16/12)\| | 7,496,871 |
\|T − D(13/02-13/03)\| | 248,404,504 |
\|A − D(13/02-13/03)\| | 234,337,730 |
\|T − D(13/02-13/06)\| | 243,313,321 |
\|A − D(13/02-13/06)\| | 230,893,086 |
\|T − D(13/02-13/12)\| | 236,115,861 |
\|A − D(13/02-13/12)\| | 223,695,626 |
\|T − D(13/02-14/12)\| | 226,417,248 |
\|A − D(13/02-14/12)\| | 213,997,013 |
\|T − D(13/02-15/12)\| | 218,488,391 |
\|A − D(13/02-15/12)\| | 206,068,156 |
\|T − D(13/02-16/12)\| | 210,991,520 |
\|A − D(13/02-16/12)\| | 198,571,285 |
The delete list used in the paper can be downloaded from here.
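For reference, delete notices appear in the raw stream as JSON objects with a top-level delete key. A minimal sketch for pulling deleted status ids out of one hour of the stream, under the same layout and jq assumptions as above:

```
# Extract the ids of deleted statuses from delete notices.
bzcat ArchivedTweets2013/2013/02/01/00/*.json.bz2 \
  | jq -r 'select(has("delete")) | .delete.status.id_str'
```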
Source | Missing reldocs | Missing qrels |
---|---|---|
\|T − D(13/02-13/12)\| | 220 (1.12%) | 1,820 (1.41%) |
\|A − D(13/02-13/12)\| | 209 (1.06%) | 1,707 (1.32%) |
\|T − D(13/02-14/12)\| | 539 (2.74%) | 4,456 (3.45%) |
\|A − D(13/02-14/12)\| | 513 (2.61%) | 4,190 (3.24%) |
\|T − D(13/02-15/12)\| | 816 (4.15%) | 6,576 (5.09%) |
\|A − D(13/02-15/12)\| | 776 (3.95%) | 6,193 (4.79%) |
\|T − D(13/02-16/12)\| | 1,095 (5.57%) | 8,500 (6.58%) |
\|A − D(13/02-16/12)\| | 1,042 (5.30%) | 7,997 (6.19%) |
Create the .bz2 delete list for the Internet Archive:

cd deletes-ia
for d in */ ; do cat "$d"/*; done > ../delete-list-13-ia-now.txt
cd ..
bzip2 delete-list-13-ia-now.txt
Create the .bz2 delete list for TREC Microblog:

cd deletes-trec
for d in */ ; do cat "$d"/*; done > ../delete-list-13-trec-now.txt
cd ..
bzip2 delete-list-13-trec-now.txt
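To check that the lists were assembled correctly, count the entries in each compressed list without unpacking it:

```
# Each line is one deleted tweet id.
bzcat delete-list-13-ia-now.txt.bz2 | wc -l
bzcat delete-list-13-trec-now.txt.bz2 | wc -l
```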
Clone Anserini, check out the twitter-search branch, and build:

git clone https://github.com/castorini/Anserini.git
cd Anserini
git checkout twitter-search
mvn clean package appassembler:assemble
Index the IA collection, applying the delete list:
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> -deletes delete-list-13-ia-now.txt -index \
tweets2013-IA-index-del -optimize -store
Retrieve results for the TREC 2013 Microblog topics:

sh target/appassembler/bin/SearchTweets -index tweets2013-IA-index-del -bm25 -topics topics.microblog2013.txt -output run.ia.del.mb13.txt
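Table 5 compares runs built with and without deletes applied. A sketch of the other indexing variants, assuming -deletes can simply be omitted (for the no-deletes run) or pointed at the TREC Microblog list (the index names below are illustrative):

```
# Index without applying any delete list (hypothetical index name).
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> \
  -index tweets2013-IA-index -optimize -store
# Index applying the TREC Microblog delete list instead (hypothetical index name).
sh target/appassembler/bin/IndexTweets -collection <path of IA collection> \
  -deletes delete-list-13-trec-now.txt -index tweets2013-IA-index-trec-del -optimize -store
```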
Download and build the latest trec_eval:

wget http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz
tar -xzvf trec_eval_latest.tar.gz
cd trec_eval.9.0/
make
Evaluate the different configurations to obtain the results shown in Table 5 of the paper, for example:
eval/trec_eval.9.0/trec_eval qrels.mb.txt run.a-d.mb.txt
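To score several runs in one pass, a small loop over whatever run files were produced above:

```
# Evaluate each run file against the Microblog qrels.
for run in run.*.txt; do
  echo "== $run"
  eval/trec_eval.9.0/trec_eval qrels.mb.txt "$run"
done
```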