Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225
…ext extractor that can be run via command line in spark-submit, along with their tests
@TitusAn just as a future tip, it's best to work off a branch instead of master.
Codecov Report
@@            Coverage Diff            @@
##           master    #225     +/-   ##
=========================================
- Coverage    61.7%   58.68%   -3.03%
=========================================
  Files          34       38       +4
  Lines         679      743      +64
  Branches      124      137      +13
=========================================
+ Hits          419      436      +17
- Misses        219      266      +47
  Partials       41       41
👍 I will try to test this right now before my next meeting.
I'm getting an error. Any idea what I'm doing wrong? Command that I'm using:
And the output I get is:
I will investigate this when I get home. It looks like it is caused by an incompatible Scala version between the Scallop command line parsing library and the one used by the project.
This is going to be a fantastic addition to AUT - thanks Titus. One comment below as well (in addition to the execution problems I'm having).
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}
Would it be possible to make the output of DomainGraphExtractor a GEXF file? https://archivesunleashed.org/aut/#exporting-to-gephi-directly
Or perhaps add a fourth option, GEXFGraphExtractor, or something along those lines?
The cleanest way is probably to add --outputFormat in the top-level app, with plain text as the default?
Obviously, some output formats won't make sense for some extractors, but that should be fine.
I like this approach because when we do DF later, we can add output formats such as a MySQL dump, which we can load directly into MySQL.
That would be really nice if possible!
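To make the --outputFormat idea above concrete, here is a minimal sketch of dispatching on a format name with plain text as the default. It is not the code from this pull request: the object name OutputFormatSketch, the Edge type, and the helper names are hypothetical, a plain Seq stands in for the Spark RDD, and the GEXF layout is a bare-bones GEXF 1.2 document rather than whatever AUT's own GEXF writer emits.

```scala
object OutputFormatSketch {
  // Simplified stand-in for one row of domain graph output:
  // ((source domain, target domain), link count). Assumed shape, not AUT's.
  type Edge = ((String, String), Int)

  // Escape characters that are special inside XML attribute values.
  def escapeXml(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

  // Plain text: one tuple-like line per edge.
  def toText(edges: Seq[Edge]): String =
    edges.map { case ((src, dst), n) => s"(($src,$dst),$n)" }.mkString("\n")

  // Minimal GEXF 1.2 document: one node per distinct domain, one weighted edge per row.
  def toGexf(edges: Seq[Edge]): String = {
    val nodes = edges.flatMap { case ((src, dst), _) => Seq(src, dst) }.distinct
    val nodeLines = nodes.map { d =>
      val e = escapeXml(d)
      s"""      <node id="$e" label="$e"/>"""
    }
    val edgeLines = edges.zipWithIndex.map { case (((src, dst), n), i) =>
      s"""      <edge id="$i" source="${escapeXml(src)}" target="${escapeXml(dst)}" weight="$n"/>"""
    }
    val header = Seq(
      """<?xml version="1.0" encoding="UTF-8"?>""",
      """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">""",
      """  <graph mode="static" defaultedgetype="directed">""",
      "    <nodes>")
    val middle = Seq("    </nodes>", "    <edges>")
    val footer = Seq("    </edges>", "  </graph>", "</gexf>")
    (header ++ nodeLines ++ middle ++ edgeLines ++ footer).mkString("\n")
  }

  // Plain text is the default; unknown formats fail loudly instead of silently falling back.
  def render(edges: Seq[Edge], format: String = "TEXT"): String =
    format.toUpperCase match {
      case "TEXT" => toText(edges)
      case "GEXF" => toGexf(edges)
      case other  => throw new IllegalArgumentException(s"Unsupported output format: $other")
    }

  def main(args: Array[String]): Unit =
    println(render(Seq((("a.org", "b.org"), 7)), "GEXF"))
}
```

Keeping the format switch in one place means a later format (for example, the MySQL dump mentioned above) would only need one more case, and an extractor that cannot produce a given format can reject it up front.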
pom.xml
Outdated
@@ -580,6 +580,11 @@
<artifactId>tika-parsers</artifactId>
<version>1.12</version>
</dependency>
<dependency>
<groupId>org.rogach</groupId>
<artifactId>scallop_2.12</artifactId>
FWIW, bespin is on 2.11: https://github.com/lintool/bespin/blob/master/pom.xml#L244
As noted below, this removed the error I was encountering.
import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.rogach.scallop._
I think you can remove this import?
var archive = RecordLoader.loadArchives(args.input(), sc)

args.input() match {
case "domainFreq" =>
Why don't we make the arg exactly the name of the class? E.g., DomainFrequencyExtractor. I think this will reduce confusion...
class Conf(args: Seq[String]) extends ScallopConf(args) {
mainOptions = Seq(input, output)
var extractor = opt[String](descr =
"extractor, one of domainFreq, domainGraph or plainText", required = true)
See below. Otherwise, documenting the options will become unwieldy when we have 20 apps?
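One way to read the two suggestions above (name the argument after the class, and avoid hand-maintaining the option description) is a registry keyed by class name, with the help text derived from its keys. The sketch below is only an illustration under stated assumptions: ExtractorRegistrySketch, Records, domainOf, usage and run are hypothetical names, a Seq of URL strings stands in for RDD[ArchiveRecord], and the extractor bodies are toy versions rather than AUT's implementations.

```scala
object ExtractorRegistrySketch {
  // Stand-in for RDD[ArchiveRecord]: each record is reduced to its URL (assumption).
  type Records = Seq[String]
  type Extractor = Records => Seq[String]

  // Host part of a URL, or the empty string when the URL has no host.
  def domainOf(url: String): String =
    Option(new java.net.URI(url).getHost).getOrElse("")

  // Keys are exactly the class names, so users never learn a second vocabulary.
  val extractors: Map[String, Extractor] = Map(
    "DomainFrequencyExtractor" -> ((urls: Records) =>
      urls.map(domainOf).groupBy(identity).toSeq
        .map { case (domain, hits) => s"($domain,${hits.size})" }
        .sorted),
    "PlainTextExtractor" -> ((urls: Records) => urls)
  )

  // The option description is generated, so adding a 20th app needs no doc edit.
  def usage: String =
    s"extractor, one of ${extractors.keys.toSeq.sorted.mkString(", ")}"

  // Look the extractor up by name and fail with the generated usage text if it is unknown.
  def run(name: String, records: Records): Seq[String] =
    extractors.get(name) match {
      case Some(extract) => extract(records)
      case None => throw new IllegalArgumentException(s"Unknown extractor '$name'; $usage")
    }

  def main(args: Array[String]): Unit = {
    println(usage)
    run("DomainFrequencyExtractor", Seq("http://www.archive.org/")).foreach(println)
  }
}
```

With this shape, the string passed to --extractor, the map key, and the class name stay in sync by construction, and an unknown name produces an error that lists the valid choices.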
assert(plainText(0)._2 == "www.archive.org")
assert(plainText(0)._3 == "http://www.archive.org/")
assert(plainText(0)._4 == "HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:25 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT ETag: \"47ac-16e-4f9e5b40\" Accept-Ranges: bytes Content-Length: 366 Connection: close Content-Type: text/html; charset=UTF-8 Please visit our website at: http://www.archive.org")
|
remove blank line.
…o write GEXF output for DomainGraphExtractor (enable via --output-format GEXF). Add support for multiple input files. Other polish and cleanup.
Restructure CommandLineAppRunner to make it more robust.
"DomainFrequencyExtractor" ->
((rdd: RDD[ArchiveRecord], subFolder: String) => {
DomainFrequencyExtractor(rdd).saveAsTextFile(subFolder)}),
"DomainGraphExtractor" ->
should this be 2-space indented?
I'm happy to have this merged once Ian smoke tests.
Tested on a big collection of WARCs and looks quite nice to me.
Implement DomainFrequency, DomainGraph and PlainText extractor that can be run via command line in spark-submit, along with their tests
GitHub issue(s): #195
What does this Pull Request do?
This pull request implements DomainFrequency, DomainGraph and PlainText extractors that can be run via the command line in spark-submit, along with their tests. A job can be submitted like so:
./spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor EXTRACTOR --input ./aut/src/test/resources/warc/example.warc.gz --output OUTPUT
where EXTRACTOR is one of domainFreq, domainGraph or plainText, and OUTPUT is the directory to write output to.
How should this be tested?
Run mvn install to build the executable and run the tests. There are three new tests, one for each of the operations added. Then execute the above command on the command line, substituting the paths to the output, Spark and the AUT executable as necessary.
Additional Notes:
Possibly yes, because the jar file now has an additional main function that can be invoked directly.
Yes, org.rogach.scallop is used to parse command line arguments.
No, because this pull request only creates new files.
Interested parties
Tag (@ mention) interested parties.
Thanks in advance for your help with the Archives Unleashed Toolkit!