Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272

Merged
merged 4 commits into from
Oct 4, 2018

Conversation

borislin
Copy link
Collaborator

@borislin borislin commented Oct 3, 2018

Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.


GitHub issue(s):

If you are responding to an issue, please mention their numbers below.

What does this Pull Request do?

This PR catches exceptions in getContent() in ArcRecordUtils.java when reading the content of corrupted ARC files, allowing Spark jobs to continue running.

How should this be tested?

  • Build aut on master, as of the latest commit: mvn clean install

  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs

  • Adapt the script below:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

val input = "/tuna1/scratch/aut-issue-271/warcs/*.gz"

val output1 = "/tuna1/scratch/aut-issue-271/derivatives/all-domains"
val output2 = "/tuna1/scratch/aut-issue-271/derivatives/all-text"
val output3 = "/tuna1/scratch/aut-issue-271/derivatives/gephi"

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile(output1)

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile(output2)

val links = RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, output3)

sys.exit
  • Run the following command with adapted paths with Apache Spark 2.1.3:
    /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log

Current results

/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
/tuna1/scratch/aut-issue-271/derivatives

Interested parties

@lintool @ianmilligan1 @ruebot

@borislin borislin requested review from ruebot and lintool October 3, 2018 21:43
@ruebot
Copy link
Member

ruebot commented Oct 3, 2018

Thanks @borislin! I'll try and get this tested tonight.

@ruebot
Copy link
Member

ruebot commented Oct 3, 2018

I'm not able to replicate with all the 499 files, and the two files from aut-resources.

  1. rm -rf ~/m2/repository/*
  2. rm -rf ~/git/aut/target
  3. git fetch --all
  4. git checkout fix-exeception
  5. mvn clean install
  6. Run the following:
  • /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
  • Where 499.scala is:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
  1. It fails. Log here.

@ianmilligan1
Copy link
Member

Getting closer! If I remove the empty ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz my adapted version of @ruebot's script above works. If that empty file is there, however, it fails the same way that it does for @ruebot above.

Is something weird going on with how we handle empty files?

@borislin
Copy link
Collaborator Author

borislin commented Oct 3, 2018

@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.

@ruebot
Copy link
Member

ruebot commented Oct 4, 2018

Can confirm that if I remove the empty file, we are successful.

The question is whether or not we want to take care of it in this PR, or in a separate one were we create a new issue. My preference would be to just get it done here since all of this is an issue in production on cloud.archivesunleashed.org.

@codecov-io
Copy link

codecov-io commented Oct 4, 2018

Codecov Report

Merging #272 into master will increase coverage by <.01%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #272      +/-   ##
==========================================
+ Coverage   70.35%   70.36%   +<.01%     
==========================================
  Files          41       41              
  Lines        1039     1046       +7     
  Branches      191      192       +1     
==========================================
+ Hits          731      736       +5     
- Misses        242      244       +2     
  Partials       66       66
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 84.82% <100%> (+0.7%) ⬆️
...java/io/archivesunleashed/data/ArcRecordUtils.java 78.26% <50%> (-3.56%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c95a51d...7ecdd9e. Read the comment docs.

@ruebot
Copy link
Member

ruebot commented Oct 4, 2018

[nruest@wombat:aut] (git)-[fix-exception]-$ /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
2018-10-03 22:26:19,522 [main] WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-03 22:26:19,630 [main] WARN  Utils - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-03 22:26:19,630 [main] WARN  Utils - Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538619980066).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
<console>:33: error: not found: value links
       WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
                    ^
2018-10-03 22:26:22,651 [Thread-1] INFO  SparkContext - Invoking stop() from shutdown hook
2018-10-03 22:26:22,655 [Thread-1] INFO  ServerConnector - Stopped Spark@1b22bfe2{HTTP/1.1}{0.0.0.0:4040}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2e7af36e{/stages/stage/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4b56b031{/jobs/job/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4a058df8{/api,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@6dca31eb{/,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@64502326{/static,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2d258eff{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1d6751e3{/executors/threadDump,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@63fd4dda{/executors/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@335f5c69{/executors,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@22825e1e{/environment/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@55c57422{/environment,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2c1f8dbd{/storage/rdd/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4cdba2ed{/storage/rdd,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@72b0a004{/storage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@171dc7c3{/storage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@76f2dad9{/stages/pool/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@499ee966{/stages/pool,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@42e4e589{/stages/stage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@43245559{/stages/stage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@5d5c41e5{/stages/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@715f45c6{/stages,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@794f11cd{/jobs/job/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@562919fe{/jobs/job,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1ac6dd3d{/jobs/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@36c7cbe1{/jobs,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,658 [Thread-1] INFO  SparkUI - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-03 22:26:22,663 [dispatcher-event-loop-2] INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2018-10-03 22:26:22,669 [Thread-1] INFO  MemoryStore - MemoryStore cleared
2018-10-03 22:26:22,669 [Thread-1] INFO  BlockManager - BlockManager stopped
2018-10-03 22:26:22,672 [Thread-1] INFO  BlockManagerMaster - BlockManagerMaster stopped
2018-10-03 22:26:22,674 [dispatcher-event-loop-7] INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2018-10-03 22:26:22,675 [Thread-1] INFO  SparkContext - Successfully stopped SparkContext
2018-10-03 22:26:22,675 [Thread-1] INFO  ShutdownHookManager - Shutdown hook called
2018-10-03 22:26:22,676 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130
2018-10-03 22:26:22,680 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130/repl-648f2e1b-6a8b-4fca-9468-142d2160b52a

@borislin
Copy link
Collaborator Author

borislin commented Oct 4, 2018

@ruebot Oh, I forgot to mention. Remove the *.gz part from your 499.scala script.

@ianmilligan1
Copy link
Member

Is it possible to tweak this to retain the *.gz syntax? I can imagine working in directories (and do once in a while) that have several non archive files or directories in them.

@borislin
Copy link
Collaborator Author

borislin commented Oct 4, 2018

@ianmilligan1 but our loadArchives() function filters out only ARC and WARC files.

@ruebot
Copy link
Member

ruebot commented Oct 4, 2018

That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.

@ianmilligan1
Copy link
Member

Yeah, if possible it'd be nice to keep it.

In any case I'm still having trouble reproducing your success. I'm using:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/domains")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/test")
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/mnt/vol1/data_sets/aut_debug/issue-499/results/499-gephi.graphml")
sys.exit

and am getting this error log.

@borislin
Copy link
Collaborator Author

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 I see. Let me tweak it further.

@borislin
Copy link
Collaborator Author

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 Okay, I've fixed the archive files input path issue. Now it accepts the wildcard pattern *.gz. My results are in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log and /tuna1/scratch/aut-issue-271/derivatives. Pls take a look and also validate on your end.

Copy link
Member

@ruebot ruebot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two little things with comments, and I think we're good to go.

@@ -41,17 +42,32 @@ import scala.util.matching.Regex
package object archivesunleashed {
/** Loads records from either WARCs, ARCs or Twitter API data (JSON). */
object RecordLoader {
/** Gets all non-empty archive files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a full stop.

*
* @param dir the path to the directory containing archive files
* @param fs filesystem
* @return a String consisting of all non-empty archive files path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a full stop.

@ruebot
Copy link
Member

ruebot commented Oct 4, 2018

@borislin looks good now. Nice work!

@ruebot
Copy link
Member

ruebot commented Oct 4, 2018

@lintool can you give it a look before I merge?

@ianmilligan1
Copy link
Member

Just echoing @ruebot - this is great work, @borislin! 🎉

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

My only comment is that getFiles will be slow when the number of files get too big... For example, each of the IA buckets has over 20k+ files.

Regardless, 🚢 and we'll deal with issues later.

@ruebot ruebot merged commit b8e57ec into master Oct 4, 2018
@ruebot ruebot deleted the fix-exception branch October 4, 2018 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants