Add missing dependencies in; addresses #227. #233

Merged 1 commit on May 22, 2018

@ruebot ruebot commented May 21, 2018

GitHub issue(s): #227

What does this Pull Request do?

Adds the missing dependencies so that aut can be loaded with --packages when built from master.

How should this be tested?

  • TravisCI should turn green
  • You can build this branch locally, and then run ./spark-shell --packages "io.archivesunleashed:aut:0.16.1-SNAPSHOT"
  • I also tested DataFrames and Tweet analysis to make sure neither was silently broken.
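
Once the shell is up with --packages, one quick sanity check is to ask the JVM whether a given class resolves. The following is a minimal sketch, not part of this PR: `ClasspathCheck` and `isOnClasspath` are hypothetical names, and the fully qualified class name you pass for aut or one of its dependencies is something you would need to look up yourself.

```scala
import scala.util.Try

// Hypothetical helper (not part of aut): reports whether a class can be
// loaded from the classpath of the running shell.
object ClasspathCheck {
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess
}

// A JDK class always resolves; a made-up name does not.
println(ClasspathCheck.isOnClasspath("java.lang.String"))      // true
println(ClasspathCheck.isOnClasspath("not.a.real.ClassName"))  // false
```

A missing dependency usually shows up as a ClassNotFoundException at load time, which is what the `Try` turns into `false` here.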

Tested with Tweets

// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.TweetUtils._

// Load tweets from a JSON file (a local path in this run)
val tweets = RecordLoader.loadTweets("/home/nruest/Dropbox/donald_search_2018_02_01.json", sc)

// Count them
tweets.count()

// Extract some fields
val r = tweets.map(tweet => (tweet.id, tweet.createdAt, tweet.username, tweet.text, tweet.lang,
                             tweet.isVerifiedUser, tweet.followerCount, tweet.friendCount))

// Take a sample of 10 on console
r.take(10)

// Count the number of tweets per language
val s = tweets.map(tweet => tweet.lang).countItems().collect()

// Count the occurrences of each hashtag
// (Note we don't 'collect' here because it's too much data to bring into the shell)
val hashtags = tweets.map(tweet => tweet.text)
                     .filter(text => text != null)
                     .flatMap(text => {"""#[^ ]+""".r.findAllIn(text).toList})
                     .countItems()

// Take the top 10 hashtags
hashtags.take(10)


// Exiting paste mode, now interpreting.

import io.archivesunleashed._                                                   
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.TweetUtils._
tweets: org.apache.spark.rdd.RDD[org.json4s.JValue] = MapPartitionsRDD[4] at filter at package.scala:66
r: org.apache.spark.rdd.RDD[(String, String, String, String, String, Boolean, Int, Int)] = MapPartitionsRDD[5] at map at <console>:35
s: Array[(String, Int)] = Array((en,711066), (und,131214), (es,4943), (tl,1477), (fr,1434), (in,1240), (tr,769), (it,734), (et,689), (sv,687), (pt,641), (ar,625), (de,573), (ja,511), (da,482), (pl,451), (zh,450), (ht,435), (nl,346), (ru,278), (ro,224), (hi,202), (no,176), (cy,137), (fa,135), (ko,126), (fi,119), (eu,109), (lt,106), (hu,84), (vi,79), (lv,64), (cs,63), (is,37), (sl,28), (iw,18), (th,18), (uk,16), (bn,13), (ur,13), (sr,13), (ml...
scala> 
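
For readers without the tweet file, the hashtag step can be tried on a few in-memory strings. This is a minimal sketch using plain Scala collections; it is not how aut's `countItems` is implemented, and `HashtagCount`, `countHashtags`, and `sample` are hypothetical names introduced only for illustration. It reuses the same regular expression as the session above.

```scala
// Hypothetical stand-in for the RDD pipeline above, using plain collections.
object HashtagCount {
  // Same pattern as in the session: '#' followed by non-space characters.
  private val Hashtag = """#[^ ]+""".r

  // Returns (hashtag, count) pairs, most frequent first, ties broken by name.
  def countHashtags(texts: Seq[String]): Seq[(String, Int)] =
    texts
      .filter(_ != null)                              // skip missing text
      .flatMap(text => Hashtag.findAllIn(text).toList)
      .groupBy(identity)
      .map { case (tag, hits) => (tag, hits.size) }
      .toSeq
      .sortBy { case (tag, n) => (-n, tag) }
}

val sample = Seq("#spark is fun #scala", null, "more #spark")
HashtagCount.countHashtags(sample).foreach(println)
// (#spark,2)
// (#scala,1)
```

Note that `[^ ]+` stops only at a space, so trailing punctuation or a newline would be kept as part of the hashtag.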

Tested with DataFrames

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val df = RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/9848/warcs/*gz", sc).extractImageDetailsDF()
df.printSchema()
df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()

// Exiting paste mode, now interpreting.

2018-05-21 18:33:37 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://theoryandp...|image/jpeg|  780|    95|fe68eb5279968fce3...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...| image/gif|  400|    20|f82d8892a2823dd1a...|R0lGODlhkAEUANUAA...|
|http://theoryandp...|image/jpeg|  682|   379|c68e37b72dc21af40...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  780|    95|b1f2cbe3abdebbf4c...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  181|    50|97eae7340dfd8524a...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  780|    95|80044f52e1da2f1bc...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  780|    95|71fec1ee8d5703c50...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...| image/png|  250|   100|6bf821ec11d4b9ccb...|iVBORw0KGgoAAAANS...|
|http://theoryandp...| image/png|  150|    33|666aaae588eadbe26...|iVBORw0KGgoAAAANS...|
|http://theoryandp...| image/png|  780|    95|64b8ae99b8244bd43...|iVBORw0KGgoAAAANS...|
|http://theoryandp...|image/jpeg|  792|   288|64012ee1b04864433...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  780|    95|582375dbe9696a9b8...|/9j/4QuyRXhpZgAAS...|
|http://theoryandp...|image/jpeg|  780|    95|13c9cedd872f718f9...|/9j/4AAQSkZJRgABA...|
|http://theoryandp...|image/jpeg|  780|    95|12eaaba09fbfc532a...|/9j/4QphRXhpZgAAT...|
|http://theoryandp...| image/png|  704|    50|0aad31170de524195...|iVBORw0KGgoAAAANS...|
+--------------------+----------+-----+------+--------------------+--------------------+

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]

@lintool @ianmilligan1 should be an easy one.


@Natkeeran, @JWZ2018, y'all might be interested in this one too.


codecov bot commented May 21, 2018

Codecov Report

Merging #233 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #233   +/-   ##
=======================================
  Coverage   60.07%   60.07%           
=======================================
  Files          39       39           
  Lines         774      774           
  Branches      137      137           
=======================================
  Hits          465      465           
  Misses        268      268           
  Partials       41       41


@lintool lintool left a comment

lgtm

@lintool lintool merged commit 496cd1b into master May 22, 2018
@ruebot ruebot deleted the issue-227 branch May 22, 2018 12:58
ruebot added a commit that referenced this pull request Nov 21, 2019
ianmilligan1 pushed a commit that referenced this pull request Nov 21, 2019