Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files #272
Conversation
Thanks @borislin! I'll try and get this tested tonight.
I'm not able to replicate with all the 499 files, and the two files from aut-resources.
Getting closer! If I remove the empty file, it works. Is something weird going on with how we handle empty files?
@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.
Can confirm that if I remove the empty file, we are successful. The question is whether or not we want to take care of it in this PR, or in a separate one where we create a new issue. My preference would be to just get it done here, since all of this is an issue in production on cloud.archivesunleashed.org.
Codecov Report
@@            Coverage Diff            @@
##           master     #272     +/-  ##
=========================================
+ Coverage   70.35%   70.36%   +<.01%
=========================================
  Files          41       41
  Lines        1039     1046       +7
  Branches      191      192       +1
=========================================
+ Hits          731      736       +5
- Misses        242      244       +2
  Partials       66       66
Continue to review full report at Codecov.
@ruebot Oh, I forgot to mention. Remove the wildcard from your input path.
Is it possible to tweak this to retain the wildcard?
@ianmilligan1 But our loadArchives() function already filters the input down to only ARC and WARC files.
That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.
Yeah, if possible it'd be nice to keep it. In any case I'm still having trouble reproducing your success. I'm using:
and am getting this error log.
@ruebot @ianmilligan1 I see. Let me tweak it further.
@ruebot @ianmilligan1 Okay, I've fixed the archive files input path issue. Now it accepts the wildcard pattern.
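For clarity, a hedged sketch of the two input forms being discussed (the paths are placeholders; `sc` is the SparkContext provided by spark-shell):

```scala
import io.archivesunleashed._

// Hypothetical paths; after this fix, both a bare directory and a
// wildcard pattern should be accepted as input.
val fromDir  = RecordLoader.loadArchives("/path/to/archives", sc)
val fromGlob = RecordLoader.loadArchives("/path/to/archives/*.gz", sc)
```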
Two little things with comments, and I think we're good to go.
@@ -41,17 +42,32 @@ import scala.util.matching.Regex
 package object archivesunleashed {
   /** Loads records from either WARCs, ARCs or Twitter API data (JSON). */
   object RecordLoader {
     /** Gets all non-empty archive files
Needs a full stop.
      *
      * @param dir the path to the directory containing archive files
      * @param fs filesystem
      * @return a String consisting of all non-empty archive files path
Needs a full stop.
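Based on the doc comment under review (a directory path, a Hadoop FileSystem, and a String of all non-empty archive file paths returned), here is a rough sketch of what such a `getFiles` helper could look like. This is an assumption for illustration, not the PR's exact code:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Sketch only; the real implementation in this PR may differ.
def getFiles(dir: Path, fs: FileSystem): String = {
  // globStatus expands wildcard patterns like /path/*.gz and returns
  // null when nothing matches, so guard against that.
  val matches = Option(fs.globStatus(dir)).getOrElse(Array.empty[FileStatus])
  // If a match is a directory, list the files inside it.
  val files = matches.flatMap { s =>
    if (s.isDirectory) fs.listStatus(s.getPath) else Array(s)
  }
  // Drop empty files (the culprit discussed above) and join the remaining
  // paths into a comma-separated string, which Hadoop input APIs accept.
  files.filter(_.getLen > 0).map(_.getPath.toString).mkString(",")
}
```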
@borislin looks good now. Nice work!
@lintool can you give it a look before I merge?
lgtm.
My only comment is that getFiles will be slow when the number of files gets too big... For example, each of the IA buckets has 20k+ files.
Regardless, 🚢 and we'll deal with issues later.
Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.
GitHub issue(s): #246, #271
What does this Pull Request do?
This PR catches exceptions in getContent() in ArcRecordUtils.java when reading the content of corrupted ARC files, allowing Spark jobs to continue running.
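The actual change is in Java (ArcRecordUtils.getContent()); as a minimal Scala sketch of the catch-and-continue pattern, with `readRecordContent` as a hypothetical stand-in for the real decompression logic:

```scala
import java.util.zip.ZipException

// Minimal sketch; readRecordContent is a hypothetical stand-in for the
// ARC record decompression done inside getContent().
def getContentSafely(readRecordContent: () => Array[Byte]): Array[Byte] =
  try {
    readRecordContent()
  } catch {
    // A corrupted ARC record can throw ZipException mid-decompression;
    // returning an empty body lets the surrounding Spark job keep going.
    case _: ZipException => Array.empty[Byte]
  }
```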
How should this be tested?
Build aut on master, as of the latest commit:
mvn clean install
Create an output directory with sub-directories:
mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
Adapt the script below:
/home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
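The -i flag loads a Scala script into spark-shell. The actual 499.scala is not reproduced here; below is a minimal sketch of the kind of script it might be, based on aut's standard plain-text extraction example (the paths, and the exact transformations, are assumptions):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load all archive files under the input path (wildcard now supported)
// and write one line of plain text per page to the all-text directory.
RecordLoader.loadArchives("/path/to/arcs/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/path/to/output/all-text")

sys.exit
```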
Current results
/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
/tuna1/scratch/aut-issue-271/derivatives
Interested parties
@lintool @ianmilligan1 @ruebot