
Implement DomainFrequency, DomainGraph and PlainText extractors that can be run from the command line #225

Merged (2 commits, May 16, 2018)

Conversation

@TitusAn (Contributor) commented on May 15, 2018

Implement DomainFrequency, DomainGraph and PlainText extractors that can be run via spark-submit on the command line, along with their tests


GitHub issue(s): #195

What does this Pull Request do?

This pull request implements DomainFrequency, DomainGraph and PlainText extractors that can be run via spark-submit on the command line, along with their tests. A job can be submitted like so:

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor EXTRACTOR --input ./aut/src/test/resources/warc/example.warc.gz --output OUTPUT

where EXTRACTOR is one of domainFreq, domainGraph or plainText, and OUTPUT is the directory to write output to.
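For example, to run the domain frequency extractor against the test WARC that ships with the repository (output path illustrative):

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor domainFreq --input ./aut/src/test/resources/warc/example.warc.gz --output ./derivatives/domain-freq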

How should this be tested?

Run mvn install to build the executable and run the tests. There are three new tests, one for each of the extractors added.

Then execute the command above on the command line, substituting the output path and the paths to Spark and the AUT fatjar as necessary.

Additional Notes:

  • Does this change require documentation to be updated?

Possibly yes, because the jar file now has an additional main class that can be invoked directly.

  • Does this change add any new dependencies?

Yes; org.rogach.scallop is used to parse command-line arguments (see the sketch after these notes).

  • Could this change or impact execution of existing code?

No, because this pull request only creates new files.
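For context, a minimal sketch of the kind of Scallop configuration the runner uses (option names follow the Conf snippet quoted in the review below; the descriptions and verify() placement here are illustrative, not the merged code):

import org.rogach.scallop._

class Conf(args: Seq[String]) extends ScallopConf(args) {
  // One of domainFreq, domainGraph or plainText
  val extractor = opt[String](descr = "extractor to run", required = true)
  val input = opt[String](descr = "input WARC file", required = true)
  val output = opt[String](descr = "directory to write output to", required = true)
  mainOptions = Seq(input, output)
  verify()
}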

Interested parties

Tag (@ mention) interested parties.

Thanks in advance for your help with the Archives Unleashed Toolkit!

…ext extractor that can be run via command line in spark-submit, along with their tests
@ruebot (Member) commented on May 15, 2018

@TitusAn just as a future tip, it's best to work off a branch instead of master.

@codecov (bot) commented on May 15, 2018

Codecov Report

Merging #225 into master will decrease coverage by 3.02%.
The diff coverage is 26.56%.

@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
- Coverage    61.7%   58.68%   -3.03%     
==========================================
  Files          34       38       +4     
  Lines         679      743      +64     
  Branches      124      137      +13     
==========================================
+ Hits          419      436      +17     
- Misses        219      266      +47     
  Partials       41       41
Impacted Files Coverage Δ
...o/archivesunleashed/app/CommandLineAppRunner.scala 0% <0%> (ø)
...chivesunleashed/app/DomainFrequencyExtractor.scala 100% <100%> (ø)
.../io/archivesunleashed/app/PlainTextExtractor.scala 100% <100%> (ø)
...o/archivesunleashed/app/DomainGraphExtractor.scala 100% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f9f9b4...1eb56f7.

@ianmilligan1 (Member) commented:

👍 I will try to test this right now before my next meeting.

@ianmilligan1 (Member) commented:

I'm getting an error. Any idea what I'm doing wrong?

Command that I'm using:

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/i2millig/aut/forks/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor domainGraph --input /mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/*.gz --output /mnt/vol1/derivative_data/test/domain-graph

And the output I get is:

2018-05-15 15:02:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--extractor
Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
        at org.rogach.scallop.Scallop.<init>(Scallop.scala:72)
        at org.rogach.scallop.Scallop$.apply(Scallop.scala:12)
        at org.rogach.scallop.ScallopConfBase.<init>(ScallopConfBase.scala:49)
        at org.rogach.scallop.ScallopConf.<init>(ScallopConf.scala:6)
        at io.archivesunleashed.app.CommandLineAppRunner$Conf.<init>(CommandLineAppRunner.scala:40)
        at io.archivesunleashed.app.CommandLineAppRunner$.main(CommandLineAppRunner.scala:51)
        at io.archivesunleashed.app.CommandLineAppRunner.main(CommandLineAppRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-05-15 15:02:12 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-05-15 15:02:12 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4e7658bd-f30c-48be-afdd-039b622b4a94

@TitusAn (Contributor, Author) commented on May 15, 2018

I will investigate this when I get home. It looks like it is caused by an incompatible Scala version between the Scallop command-line parsing library and the one used by the project.

@ianmilligan1 (Member) left a review:

This is going to be a fantastic addition to AUT - thanks Titus. One comment below as well (in addition to the execution problems I'm having).

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}
@ianmilligan1 (Member) commented:

Would it be possible to make the output of DomainGraphExtractor a GEXF file?

https://archivesunleashed.org/aut/#exporting-to-gephi-directly

Or perhaps add a fourth option GEXFGraphExtractor or something along those lines?

Comment (Member):

The cleanest way is probably to add --outputFormat in the top-level app, with plain text as default?

Obviously, some output formats won't make sense with others, but that should be fine.

I like this approach because when we do DF later, we can add an output format for, for example, a MySQL dump... which we can load directly into MySQL.
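Roughly, a sketch of that dispatch (the outputFormat flag and the WriteGEXF helper here are assumptions based on the Gephi-export docs linked above, not the merged code):

// Hypothetical dispatch on an output-format option:
args.outputFormat() match {
  case "GEXF" =>
    // Write the domain graph as GEXF for Gephi
    WriteGEXF(DomainGraphExtractor(rdd), args.output())
  case _ =>
    // Default: plain-text output
    DomainGraphExtractor(rdd).saveAsTextFile(args.output())
}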

Comment (Member):

That would be really nice if possible!

pom.xml Outdated
@@ -580,6 +580,11 @@
<artifactId>tika-parsers</artifactId>
<version>1.12</version>
</dependency>
<dependency>
<groupId>org.rogach</groupId>
<artifactId>scallop_2.12</artifactId>

Comment (Member):

As noted below, this removed the error I was encountering.
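For reference, the change amounts to switching Scallop to its Scala 2.11 artifact in pom.xml (version element elided here, as in the truncated hunk above):

<dependency>
  <groupId>org.rogach</groupId>
  <artifactId>scallop_2.11</artifactId>
  ...
</dependency>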

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.rogach.scallop._
Comment (Member):

I think you can remove this import?

var archive = RecordLoader.loadArchives(args.input(), sc)

args.extractor() match {
case "domainFreq" =>
Comment (Member):

Why don't we make the arg exactly the name of the class? E.g., DomainFrequencyExtractor - I think this will reduce confusion...

class Conf(args: Seq[String]) extends ScallopConf(args) {
mainOptions = Seq(input, output)
var extractor = opt[String](descr =
"extractor, one of domainFreq, domainGraph or plainText", required = true)
Comment (Member):

See below. Otherwise, documenting the options will become unwieldy when we have 20 apps?

@ianmilligan1 (Member) commented:

@TitusAn dropping scallop down to 2.11 got things working as per @lintool's comment above, but it's not accepting the wildcard in the input - it says I should provide exactly one argument for this option. What's the best way to load a directory of files in?
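One possible shape for that, as a sketch (assuming a Scallop list option and unioning per-file RDDs; the follow-up commit may handle it differently):

// In Conf: accept one or more input paths
val input = opt[List[String]](descr = "input file paths", required = true)

// In the runner: load each archive and union the resulting RDDs
val rdd = args.input()
  .map(f => RecordLoader.loadArchives(f, sc))
  .reduce(_ ++ _)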

assert(plainText(0)._2 == "www.archive.org")
assert(plainText(0)._3 == "http://www.archive.org/")
assert(plainText(0)._4 == "HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:25 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT ETag: \"47ac-16e-4f9e5b40\" Accept-Ranges: bytes Content-Length: 366 Connection: close Content-Type: text/html; charset=UTF-8 Please visit our website at: http://www.archive.org")

Comment (Member):

remove blank line.

…o write GEXF output for DomainGraphExtractor (enable via --output-format GEXF). Add support for multiple input files. Other polish and cleanup.
@TitusAn (Contributor, Author) commented on May 16, 2018

  • Restructure CommandLineAppRunner to make it more robust.
  • Add an option to write GEXF output for DomainGraphExtractor (enabled via --output-format GEXF; see the example command below).
  • Add support for multiple input files. (Bash wildcards work fine.)
  • Other polish and cleanup.
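With these changes, a GEXF run should look something like this (paths illustrative):

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input ./warcs/*.warc.gz --output ./derivatives/domain-graph --output-format GEXF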

"DomainFrequencyExtractor" ->
((rdd: RDD[ArchiveRecord], subFolder: String) => {
DomainFrequencyExtractor(rdd).saveAsTextFile(subFolder)}),
"DomainGraphExtractor" ->
Comment (Member):

should this be 2-space indented?

@lintool (Member) left a review:

I'm happy to have this merged once Ian smoke tests.

@ianmilligan1 (Member) left a review:

Tested on a big collection of WARCs and looks quite nice to me.
