
Implement DomainFrequency, DomainGraph and PlainText extractors that can be run from the command line #225

Merged (2 commits, May 16, 2018)

Conversation

@TitusAn (Contributor) commented on May 15, 2018

Implement DomainFrequency, DomainGraph and PlainText extractors that can be run via spark-submit on the command line, along with their tests


GitHub issue(s): #195

What does this Pull Request do?

This pull request implements DomainFrequency, DomainGraph and PlainText extractors that can be run via spark-submit on the command line, along with their tests. A job can be submitted like so:

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor EXTRACTOR --input ./aut/src/test/resources/warc/example.warc.gz --output OUTPUT

where EXTRACTOR is one of domainFreq, domainGraph or plainText, and OUTPUT is the directory to write output to.
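For example, to run the domain frequency extractor against the test WARC that ships with the repository (output path illustrative):

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor domainFreq --input ./aut/src/test/resources/warc/example.warc.gz --output ./derivatives/domain-freq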

How should this be tested?

Run mvn install to build the executable and run the tests. There are three new tests, one for each of the extractors added.

Then execute the command above on the command line, substituting the output path and the paths to Spark and the AUT fatjar as necessary.

Additional Notes:

  • Does this change require documentation to be updated?

Possibly yes, because the jar file now has an additional main class that can be invoked directly.

  • Does this change add any new dependencies?

Yes; org.rogach.scallop is used to parse command-line arguments (see the sketch after these notes).

  • Could this change or impact execution of existing code?

No, because this pull request only creates new files.
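For context, a minimal sketch of the kind of Scallop configuration the runner uses (option names follow the Conf snippet quoted in the review below; the descriptions and verify() placement here are illustrative, not the merged code):

import org.rogach.scallop._

class Conf(args: Seq[String]) extends ScallopConf(args) {
  // One of domainFreq, domainGraph or plainText
  val extractor = opt[String](descr = "extractor to run", required = true)
  val input = opt[String](descr = "input WARC file", required = true)
  val output = opt[String](descr = "directory to write output to", required = true)
  mainOptions = Seq(input, output)
  verify()
}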

Interested parties

Tag (@ mention) interested parties.

Thanks in advance for your help with the Archives Unleashed Toolkit!

…ext extractor that can be run via command line in spark-submit, along with their tests
@ruebot (Member) commented on May 15, 2018

@TitusAn just as a future tip, it's best to work off a branch instead of master.

@codecov (bot) commented on May 15, 2018

Codecov Report

Merging #225 into master will decrease coverage by 3.02%.
The diff coverage is 26.56%.

@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
- Coverage    61.7%   58.68%   -3.03%     
==========================================
  Files          34       38       +4     
  Lines         679      743      +64     
  Branches      124      137      +13     
==========================================
+ Hits          419      436      +17     
- Misses        219      266      +47     
  Partials       41       41
Impacted Files Coverage Δ
...o/archivesunleashed/app/CommandLineAppRunner.scala 0% <0%> (ø)
...chivesunleashed/app/DomainFrequencyExtractor.scala 100% <100%> (ø)
.../io/archivesunleashed/app/PlainTextExtractor.scala 100% <100%> (ø)
...o/archivesunleashed/app/DomainGraphExtractor.scala 100% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f9f9b4...1eb56f7.

@ianmilligan1 (Member) commented:

👍 I will try to test this right now before my next meeting.

@ianmilligan1 (Member) commented:

I'm getting an error. Any idea what I'm doing wrong?

Command that I'm using:

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/i2millig/aut/forks/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor domainGraph --input /mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/*.gz --output /mnt/vol1/derivative_data/test/domain-graph

And the output I get is:

2018-05-15 15:02:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--extractor
Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
        at org.rogach.scallop.Scallop.<init>(Scallop.scala:72)
        at org.rogach.scallop.Scallop$.apply(Scallop.scala:12)
        at org.rogach.scallop.ScallopConfBase.<init>(ScallopConfBase.scala:49)
        at org.rogach.scallop.ScallopConf.<init>(ScallopConf.scala:6)
        at io.archivesunleashed.app.CommandLineAppRunner$Conf.<init>(CommandLineAppRunner.scala:40)
        at io.archivesunleashed.app.CommandLineAppRunner$.main(CommandLineAppRunner.scala:51)
        at io.archivesunleashed.app.CommandLineAppRunner.main(CommandLineAppRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-05-15 15:02:12 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-05-15 15:02:12 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4e7658bd-f30c-48be-afdd-039b622b4a94

@TitusAn (Contributor, Author) commented on May 15, 2018

I will investigate this when I get home. It looks like it is caused by an incompatible Scala version between the Scallop command-line parsing library and the one used by the project.

@ianmilligan1 (Member) left a review:

This is going to be a fantastic addition to AUT - thanks Titus. One comment below as well (in addition to the execution problems I'm having).

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}
@ianmilligan1 (Member) commented:

Would it be possible to make the output of DomainGraphExtractor a GEXF file?

https://archivesunleashed.org/aut/#exporting-to-gephi-directly

Or perhaps add a fourth option GEXFGraphExtractor or something along those lines?

Comment (Member):

The cleanest way is probably to add --outputFormat in the top-level app, with plain text as default?

Obviously, some output formats won't make sense with others, but that should be fine.

I like this approach because when we do DF later, we can add an output format for, for example, a MySQL dump... which we can load directly into MySQL.
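Roughly, a sketch of that dispatch (the outputFormat flag and the WriteGEXF helper here are assumptions based on the Gephi-export docs linked above, not the merged code):

// Hypothetical dispatch on an output-format option:
args.outputFormat() match {
  case "GEXF" =>
    // Write the domain graph as GEXF for Gephi
    WriteGEXF(DomainGraphExtractor(rdd), args.output())
  case _ =>
    // Default: plain-text output
    DomainGraphExtractor(rdd).saveAsTextFile(args.output())
}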

Comment (Member):

That would be really nice if possible!

pom.xml Outdated
@@ -580,6 +580,11 @@
<artifactId>tika-parsers</artifactId>
<version>1.12</version>
</dependency>
<dependency>
<groupId>org.rogach</groupId>
<artifactId>scallop_2.12</artifactId>

Comment (Member):

As noted below, this removed the error I was encountering.
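For reference, the change amounts to switching Scallop to its Scala 2.11 artifact in pom.xml (version element elided here, as in the truncated hunk above):

<dependency>
  <groupId>org.rogach</groupId>
  <artifactId>scallop_2.11</artifactId>
  ...
</dependency>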

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.rogach.scallop._
Comment (Member):

I think you can remove this import?

var archive = RecordLoader.loadArchives(args.input(), sc)

args.extractor() match {
case "domainFreq" =>
Comment (Member):

Why don't we make the arg exactly the name of the class? E.g., DomainFrequencyExtractor - I think this will reduce confusion...

class Conf(args: Seq[String]) extends ScallopConf(args) {
mainOptions = Seq(input, output)
var extractor = opt[String](descr =
"extractor, one of domainFreq, domainGraph or plainText", required = true)
Comment (Member):

See below. Otherwise, documenting the options will become unwieldy when we have 20 apps?

@ianmilligan1 (Member) commented:

@TitusAn dropping scallop down to 2.11 got things working as per @lintool's comment above, but it's not accepting the wildcard in the input - it says I should provide exactly one argument for this option. What's the best way to load a directory of files in?
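One possible shape for that, as a sketch (assuming a Scallop list option and unioning per-file RDDs; the follow-up commit may handle it differently):

// In Conf: accept one or more input paths
val input = opt[List[String]](descr = "input file paths", required = true)

// In the runner: load each archive and union the resulting RDDs
val rdd = args.input()
  .map(f => RecordLoader.loadArchives(f, sc))
  .reduce(_ ++ _)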

assert(plainText(0)._2 == "www.archive.org")
assert(plainText(0)._3 == "http://www.archive.org/")
assert(plainText(0)._4 == "HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:25 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT ETag: \"47ac-16e-4f9e5b40\" Accept-Ranges: bytes Content-Length: 366 Connection: close Content-Type: text/html; charset=UTF-8 Please visit our website at: http://www.archive.org")

Comment (Member):

remove blank line.

…o write GEXF output for DomainGraphExtractor (enable via --output-format GEXF). Add support for multiple input files. Other polish and cleanup.
@TitusAn (Contributor, Author) commented on May 16, 2018

  • Restructure CommandLineAppRunner to make it more robust.
  • Add an option to write GEXF output for DomainGraphExtractor (enabled via --output-format GEXF; see the example command below).
  • Add support for multiple input files. (Bash wildcards work fine.)
  • Other polish and cleanup.
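With these changes, a GEXF run should look something like this (paths illustrative):

./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input ./warcs/*.warc.gz --output ./derivatives/domain-graph --output-format GEXF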

"DomainFrequencyExtractor" ->
((rdd: RDD[ArchiveRecord], subFolder: String) => {
DomainFrequencyExtractor(rdd).saveAsTextFile(subFolder)}),
"DomainGraphExtractor" ->
Comment (Member):

should this be 2-space indented?

@lintool (Member) left a review:

I'm happy to have this merged once Ian smoke tests.

@ianmilligan1 (Member) left a review:

Tested on a big collection of WARCs and looks quite nice to me.
