
Unparseable date error #163
Closed · cjer opened this issue Jan 21, 2018 · 13 comments

cjer commented Jan 21, 2018

Hello,
I'm running a simple script to count domains by year over a folder with a large WARC archive, using aut 0.12.1 on a cluster with 4 workers.

This is the code:

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/warc_folder/*.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain))
  .countItems()
  .saveAsTextFile("out/domains")

At a certain point, after the job has gone through a few thousand tasks, I get the following errors, which end up crashing the script:

java.text.ParseException: Unparseable date: "200012060402"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:48)
...

I'm guessing there's an issue with the dates in some of the WARC records. Am I right? Could it be something else? And what should I do about it?

Full traceback:

TaskSetManager: Lost task 1492.0 in stage 0.0 (TID 1492, <WORKER_IP>, executor 0): java.text.ParseException: Unparseable date: "200012060402"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:48)
        at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
        at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[The same stack trace repeats as TaskSetManager warnings for retry attempts 1492.1 (TID 1503, executor 0), 1492.2 (TID 1511, executor 1), and 1492.3 (TID 1523, executor 2).]
18/01/21 20:42:18 ERROR TaskSetManager: Task 1492 in stage 0.0 failed 4 times; aborting job
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1522.0 in stage 0.0 (TID 1525, <WORKER_IP>, executor 1): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1527.0 in stage 0.0 (TID 1530, <WORKER_IP>, executor 1): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1526.0 in stage 0.0 (TID 1529, <WORKER_IP>, executor 3): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1524.0 in stage 0.0 (TID 1527, <WORKER_IP>, executor 3): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1523.0 in stage 0.0 (TID 1526, <WORKER_IP>, executor 2): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1528.0 in stage 0.0 (TID 1531, <WORKER_IP>, executor 2): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1521.0 in stage 0.0 (TID 1524, <WORKER_IP>, executor 0): TaskKilled (stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1492 in stage 0.0 failed 4 times, most recent failure: Lost task 1492.3 in stage 0.0 (TID 1523, <WORKER_IP>, executor 2): java.text.ParseException: Unparseable date: "200012060402"
        [same stack trace as above]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:619)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:620)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:617)
  at io.archivesunleashed.spark.rdd.RecordRDD$CountableRDD.countItems(RecordRDD.scala:40)
  ... 53 elided
Caused by: java.text.ParseException: Unparseable date: "200012060402"
  [same stack trace as above]
scala> 18/01/21 20:42:18 WARN TaskSetManager: Lost task 1525.0 in stage 0.0 (TID 1528, <WORKER_IP>, executor 0): TaskKilled (stage cancelled)

Thanks

ianmilligan1 (Member) commented:

Thanks for catching this! I've monkeyed around with a sample file and, yep, dropping the two digits from the end of a date field trips an exception so you don't get results. I haven't yet been able to make it crash like yours, though (but it does screw up the job).

In your case the WARC has 200012060402, which is 12 digits, while we are looking for a 14-digit date.

Let us chat and see if we can make it more robust.
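A minimal sketch of what more tolerant date handling could look like, assuming the expected format is the 14-digit yyyyMMddHHmmss and that a 12-digit value is simply missing the seconds; this is illustrative only, not aut's actual code:

import java.text.SimpleDateFormat
import scala.util.Try

object LenientCrawlDate {
  // A fresh formatter per call sidesteps SimpleDateFormat's thread-safety issues.
  private def fmt = {
    val f = new SimpleDateFormat("yyyyMMddHHmmss") // the 14-digit form aut expects
    f.setLenient(false)
    f
  }

  // Try the 14-digit form first; on failure, assume the seconds were
  // dropped and retry with "00" appended.
  def parse(raw: String): Option[java.util.Date] =
    Try(fmt.parse(raw))
      .orElse(Try(fmt.parse(raw + "00"))) // "200012060402" -> "20001206040200"
      .toOption
}

Returning an Option pushes the decision about bad records up to the caller instead of throwing inside the record constructor.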

cjer (Author) commented Jan 22, 2018

Thanks!

I actually have another issue: memory allocation and OutOfMemory errors that make it impossible for any run on a big archive folder to finish. I'm pretty new to Spark architecture, so I'm not sure whether I should post that issue here. Could you help with this?

Thanks again :)
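(For reference, driver and executor memory are set when launching spark-shell. A minimal sketch using standard Spark options; the master URL, memory values, and jar path are placeholders:)

spark-shell --master spark://<MASTER_IP>:7077 \
  --driver-memory 8G \
  --executor-memory 8G \
  --jars /path/to/aut-0.12.1-fatjar.jar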

ianmilligan1 (Member) commented:

Yep, if you could post that issue, I'd be happy to look at it. It may, in theory, be easier to fix than this date issue. :/

ianmilligan1 (Member) commented:

Re: the date issue – where did this WARC come from? I'm curious whether a 12-digit date field is a one-off due to an error somewhere, or whether there's a tool that generates these.

ruebot (Member) commented Jan 22, 2018

@cjer for the memory issue, have you looked at #159?

cjer (Author) commented Jan 22, 2018

On the date thing: all I know is that the WARCs were purchased from the Internet Archive. I'll try to find out if I can get more info.

On the memory thing: I'll check that issue out and comment there if it fits.

ianmilligan1 (Member) commented:

OK thanks, please keep us posted!

cjer (Author) commented Jan 23, 2018

OK, so these WARCs were converted from ARCs, some of which had these 12-digit dates. The conversion left the defective dates as-is and did not convert them to ISO 8601.

I don't think this justifies building more robust date extraction, especially since the WARC standard clearly states that dates must be in ISO 8601 only.

We will fix the defective dates in the actual WARC files.

What still concerns me, though, is that a few bad records crashed the whole run. Is there a way around this?

Edit: I ran a small script to try to extract the raw dates, in which I did not explicitly call getCrawlDate; it also crashed with the same error when it reached the records with 12-digit dates.
I looked at the ArchiveRecord code to see why, and as I understand it, getCrawlDate, getUrl, etc. are actually plain class attributes, so the code that computes their values runs even if you never use them. Is that right?

What I ran:

import io.archivesunleashed.spark.matchbox.RecordLoader
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/warc_folder/*.warc.gz", sc)
  .map(r => r.warcRecord.getHeader.getDate)
  .saveAsTextFile("out/raw_dates")

ruebot (Member) commented Jan 23, 2018

@cjer we talked about this on a call today, and we completely agree that we need better exception handling here. Are these files you can share, so we can test as well?
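For illustration, one sketch of what such exception handling could look like at load time: wrap record construction in Try and drop records that fail, instead of letting one bad record fail the whole Spark task. Here makeRecord is a hypothetical stand-in for aut's actual record construction:

import scala.reflect.ClassTag
import scala.util.Try
import org.apache.spark.rdd.RDD

// `makeRecord` is a hypothetical stand-in for aut's record construction.
def keepParseable[A, B: ClassTag](raw: RDD[A])(makeRecord: A => B): RDD[B] =
  raw.flatMap(r => Try(makeRecord(r)).toOption) // records whose construction throws are dropped

The trade-off is silent data loss; counting the failures (for example with a Spark accumulator) would make the drops visible.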

cjer (Author) commented Jan 23, 2018

I'm sorry, but I can't share the files.
I think creating a few test files with a defective date or two in them should do the trick for reproducing the problem.

cjer (Author) commented Feb 21, 2018

The problem was fixed using sed and jwattools.
I think this issue can be closed, but it still calls for a discussion of error-handling options when dealing with bad WARC files/records.
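For anyone hitting the same problem, a minimal sketch of the kind of sed rewrite involved, assuming the defective 12-digit values sit on their own WARC-Date header line in an uncompressed WARC and that the missing two digits were seconds; the header name and the surrounding workflow are assumptions, not cjer's actual commands:

# Pad bare 12-digit dates to the 14-digit form aut expects; in sed
# replacements \1 is the capture group, so \100 appends a literal "00".
sed -E 's/^(WARC-Date: [0-9]{12})$/\100/' input.warc > fixed.warc
# The rewritten file then needs to be re-packaged as a spec-compliant,
# per-record-gzipped .warc.gz, e.g. with jwattools.

A spec-compliant fix would instead rewrite the value into ISO 8601, since that is what the WARC standard mandates; padding just restores the form aut's parser expects.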

ruebot (Member) commented Feb 21, 2018

@cjer please feel free to create a new issue around error handling.

cjer (Author) commented Feb 21, 2018

OK, cool. I'll try to come up with something reasonable.

cjer closed this as completed Feb 21, 2018