Student project: analysis of one month of tweets using the Hadoop framework. The project is divided into 5 packages:
- datacleaning: cleans the raw data.
- wordcount: lists the hashtags and their number of occurrences.
- topk: lists the K most-used hashtags.
- hashtagbyuser: lists the users and their hashtags.
- triplet: lists the triplets of hashtags and their users.
- make
- maven
Go to the project root.
make build
Packages all the modules at once.
If you prefer to package the modules one by one:
make datacleaning
make wordcount
make topk
make hashtagbyuser
make triplet
yarn jar Projet-Tweeter-1.0.jar ${tweet} clean_data
Produces 164 sequence files in the clean_data folder.
yarn jar Projet-Tweeter-TopHashtags-1.0.jar clean_data count_hashtags top_hashtags [K]
Produces 2 output folders, count_hashtags and top_hashtags. K is optional and defaults to 10.

hdfs dfs -text count_hashtags/part-r-<>
Displays the results of the word count pattern.

hdfs dfs -head top_hashtags/part-m-<>
Displays the results of the top-K pattern.
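The word count and top-K patterns can be illustrated with a small in-memory sketch. This is plain Python, not the project's Hadoop code, and it rests on assumptions: the names `count_hashtags` and `top_k` are hypothetical, a hashtag is taken to be `#` followed by word characters, and ties are broken alphabetically.

```python
import heapq
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def count_hashtags(tweets):
    """Word count pattern: emit (hashtag, 1) per occurrence, then sum per hashtag."""
    counts = Counter()
    for text in tweets:
        counts.update(HASHTAG.findall(text))
    return counts


def top_k(counts, k=10):
    """Top-K pattern: keep the k most frequent hashtags (ties broken alphabetically)."""
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


tweets = ["#hadoop is fun", "learning #hadoop and #yarn", "#yarn #hdfs #hadoop"]
counts = count_hashtags(tweets)
print(top_k(counts, k=2))  # [('#hadoop', 3), ('#yarn', 2)]
```

In a distributed job each mapper would typically keep only its local top K and a final step would merge those partial lists; the sketch collapses both steps into one call.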
yarn jar Projet-Tweeter-HashtagByUser-1.0.jar clean_data hashtag_by_user
Produces 1 output folder, hashtag_by_user.

hdfs dfs -text hashtag_by_user/part-r-<>
Displays the users (userId, userName) and their hashtags.
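The grouping of hashtags by user can be sketched in memory as follows. This is a plain Python illustration, not the project's Hadoop code. The function name `hashtags_by_user`, the `(user_id, user_name, text)` input layout, the deduplication of hashtags, and the omission of users without hashtags are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def hashtags_by_user(tweets):
    """Group the distinct hashtags used by each (user_id, user_name) key."""
    grouped = defaultdict(set)
    for user_id, user_name, text in tweets:
        # "map" step: key each tweet's hashtags by its author.
        grouped[(user_id, user_name)].update(HASHTAG.findall(text))
    # "reduce" step: one sorted hashtag list per user; users without hashtags are dropped.
    return {user: sorted(tags) for user, tags in grouped.items() if tags}


sample = [
    (1, "ada", "#x #y"),
    (2, "bob", "no tags here"),
    (1, "ada", "#x #z"),
]
print(hashtags_by_user(sample))  # {(1, 'ada'): ['#x', '#y', '#z']}
```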
yarn jar Projet-Tweeter-HashtagByUser-1.0.jar hashtag_by_user hashtag_triplets
Produces 1 output folder, hashtag_triplets.

hdfs dfs -text hashtag_triplets/part-r-<>
Displays the triplets of hashtags and their users.
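One possible reading of the triplet step, sketched in plain Python rather than Hadoop code: for each user, enumerate every combination of 3 distinct hashtags that user has used, then group the users by triplet. This interpretation is an assumption, as are the function name `hashtag_triplets` and the input layout (a mapping from user to hashtag list).

```python
from collections import defaultdict
from itertools import combinations


def hashtag_triplets(user_hashtags):
    """Map each sorted triplet of hashtags to the users who used all three."""
    triplets = defaultdict(list)
    for user, tags in user_hashtags.items():
        # Sorting gives each triplet a canonical order, so the same three
        # hashtags always produce the same key regardless of input order.
        for triplet in combinations(sorted(set(tags)), 3):
            triplets[triplet].append(user)
    return dict(triplets)


by_user = {
    "ada": ["#a", "#b", "#c", "#d"],
    "bob": ["#c", "#a", "#b"],
}
result = hashtag_triplets(by_user)
print(result[("#a", "#b", "#c")])  # ['ada', 'bob']
```

A user with n distinct hashtags yields n(n-1)(n-2)/6 triplets, so the output can grow quickly for very active users.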
// TODO: wordcount and topk on top of the triplets
Deborah Pereira & Sophie Stan.
Supervising teacher: David Auber.