Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files #272
Conversation
Thanks @borislin! I'll try and get this tested tonight.
I'm not able to replicate with all the 499 files, and the two files from aut-resources.
Getting closer! If I remove the empty file, it works. Is something weird going on with how we handle empty files?
@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.
Can confirm that if I remove the empty file, we are successful. The question is whether or not we want to take care of it in this PR, or in a separate one where we create a new issue. My preference would be to just get it done here, since all of this is an issue in production on cloud.archivesunleashed.org.
Codecov Report
@@            Coverage Diff            @@
##           master     #272     +/-  ##
=========================================
+ Coverage   70.35%   70.36%   +<.01%
=========================================
  Files          41       41
  Lines        1039     1046       +7
  Branches      191      192       +1
=========================================
+ Hits          731      736       +5
- Misses        242      244       +2
  Partials       66       66
Continue to review full report at Codecov.
@ruebot Oh, I forgot to mention. Remove the wildcard from your input path.
Is it possible to tweak this to retain the wildcard?
@ianmilligan1 But our loadArchives() function already filters the input down to only ARC and WARC files.
That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.
Yeah, if possible it'd be nice to keep it. In any case I'm still having trouble reproducing your success. I'm using:
and am getting this error log.
@ruebot @ianmilligan1 I see. Let me tweak it further.
@ruebot @ianmilligan1 Okay, I've fixed the archive files input path issue. Now it accepts the wildcard pattern.
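For clarity, a hedged sketch of the two input forms being discussed (the paths are placeholders; `sc` is the SparkContext provided by spark-shell):

```scala
import io.archivesunleashed._

// Hypothetical paths; after this fix, both a bare directory and a
// wildcard pattern should be accepted as input.
val fromDir  = RecordLoader.loadArchives("/path/to/archives", sc)
val fromGlob = RecordLoader.loadArchives("/path/to/archives/*.gz", sc)
```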
Two little things with comments, and I think we're good to go.
@@ -41,17 +42,32 @@ import scala.util.matching.Regex
 package object archivesunleashed {
   /** Loads records from either WARCs, ARCs or Twitter API data (JSON). */
   object RecordLoader {
     /** Gets all non-empty archive files
Needs a full stop.
      *
      * @param dir the path to the directory containing archive files
      * @param fs filesystem
      * @return a String consisting of all non-empty archive files path
Needs a full stop.
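Based on the doc comment under review (a directory path, a Hadoop FileSystem, and a String of all non-empty archive file paths returned), here is a rough sketch of what such a `getFiles` helper could look like. This is an assumption for illustration, not the PR's exact code:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Sketch only; the real implementation in this PR may differ.
def getFiles(dir: Path, fs: FileSystem): String = {
  // globStatus expands wildcard patterns like /path/*.gz and returns
  // null when nothing matches, so guard against that.
  val matches = Option(fs.globStatus(dir)).getOrElse(Array.empty[FileStatus])
  // If a match is a directory, list the files inside it.
  val files = matches.flatMap { s =>
    if (s.isDirectory) fs.listStatus(s.getPath) else Array(s)
  }
  // Drop empty files (the culprit discussed above) and join the remaining
  // paths into a comma-separated string, which Hadoop input APIs accept.
  files.filter(_.getLen > 0).map(_.getPath.toString).mkString(",")
}
```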
@borislin looks good now. Nice work!
@lintool can you give it a look before I merge?
lgtm.
My only comment is that getFiles will be slow when the number of files gets too big... For example, each of the IA buckets has 20k+ files.
Regardless, 🚢 and we'll deal with issues later.
Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.
GitHub issue(s): #246, #271
What does this Pull Request do?
This PR catches exceptions in getContent() in ArcRecordUtils.java when reading the content of corrupted ARC files, allowing Spark jobs to continue running.
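The actual change is in Java (ArcRecordUtils.getContent()); as a minimal Scala sketch of the catch-and-continue pattern, with `readRecordContent` as a hypothetical stand-in for the real decompression logic:

```scala
import java.util.zip.ZipException

// Minimal sketch; readRecordContent is a hypothetical stand-in for the
// ARC record decompression done inside getContent().
def getContentSafely(readRecordContent: () => Array[Byte]): Array[Byte] =
  try {
    readRecordContent()
  } catch {
    // A corrupted ARC record can throw ZipException mid-decompression;
    // returning an empty body lets the surrounding Spark job keep going.
    case _: ZipException => Array.empty[Byte]
  }
```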
How should this be tested?
Build aut on master, as of the latest commit:
mvn clean install
Create an output directory with sub-directories:
mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
Adapt the script below:
/home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
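The -i flag loads a Scala script into spark-shell. The actual 499.scala is not reproduced here; below is a minimal sketch of the kind of script it might be, based on aut's standard plain-text extraction example (the paths, and the exact transformations, are assumptions):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load all archive files under the input path (wildcard now supported)
// and write one line of plain text per page to the all-text directory.
RecordLoader.loadArchives("/path/to/arcs/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/path/to/output/all-text")

sys.exit
```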
Current results
/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
/tuna1/scratch/aut-issue-271/derivatives
Interested parties
@lintool @ianmilligan1 @ruebot