
Unparseable date error #163
Closed · cjer opened this issue Jan 21, 2018 · 13 comments

cjer commented Jan 21, 2018

Hello,
I'm running a simple script to count domains by year over a folder with a large WARC archive, using aut 0.12.1 on a cluster with 4 workers.

This is the code:

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/warc_folder/*.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain))
  .countItems()
  .saveAsTextFile("out/domains")

At a certain point, after the job has gone through a few thousand tasks, I get the following errors, which end up crashing the script:

java.text.ParseException: Unparseable date: "200012060402"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:48)
...

I'm guessing there's an issue with the dates in some of the WARC records. Am I right? Could it be something else? And what should I do about it?

Full traceback:

TaskSetManager: Lost task 1492.0 in stage 0.0 (TID 1492, <WORKER_IP>, executor 0): java.text.ParseException: Unparseable date: "200012060402"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:48)
        at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
        at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[The same stack trace repeats as TaskSetManager warnings for retry attempts 1492.1 (TID 1503, executor 0), 1492.2 (TID 1511, executor 1), and 1492.3 (TID 1523, executor 2).]
18/01/21 20:42:18 ERROR TaskSetManager: Task 1492 in stage 0.0 failed 4 times; aborting job
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1522.0 in stage 0.0 (TID 1525, <WORKER_IP>, executor 1): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1527.0 in stage 0.0 (TID 1530, <WORKER_IP>, executor 1): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1526.0 in stage 0.0 (TID 1529, <WORKER_IP>, executor 3): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1524.0 in stage 0.0 (TID 1527, <WORKER_IP>, executor 3): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1523.0 in stage 0.0 (TID 1526, <WORKER_IP>, executor 2): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1528.0 in stage 0.0 (TID 1531, <WORKER_IP>, executor 2): TaskKilled (stage cancelled)
18/01/21 20:42:18 WARN TaskSetManager: Lost task 1521.0 in stage 0.0 (TID 1524, <WORKER_IP>, executor 0): TaskKilled (stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1492 in stage 0.0 failed 4 times, most recent failure: Lost task 1492.3 in stage 0.0 (TID 1523, <WORKER_IP>, executor 2): java.text.ParseException: Unparseable date: "200012060402"
        [same stack trace as above]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:619)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:620)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:617)
  at io.archivesunleashed.spark.rdd.RecordRDD$CountableRDD.countItems(RecordRDD.scala:40)
  ... 53 elided
Caused by: java.text.ParseException: Unparseable date: "200012060402"
  [same stack trace as above]
scala> 18/01/21 20:42:18 WARN TaskSetManager: Lost task 1525.0 in stage 0.0 (TID 1528, <WORKER_IP>, executor 0): TaskKilled (stage cancelled)

Thanks

ianmilligan1 (Member) commented:

Thanks for catching this! I've monkeyed around with a sample file and, yep, dropping the two digits from the end of a date field trips an exception so you don't get results. I haven't yet been able to make it crash like yours, though (but it does screw up the job).

In your case the WARC has 200012060402, which is 12 digits, while we are looking for a 14-digit date.

Let us chat and see if we can make it more robust.
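A minimal sketch of what more tolerant date handling could look like, assuming the expected format is the 14-digit yyyyMMddHHmmss and that a 12-digit value is simply missing the seconds; this is illustrative only, not aut's actual code:

import java.text.SimpleDateFormat
import scala.util.Try

object LenientCrawlDate {
  // A fresh formatter per call sidesteps SimpleDateFormat's thread-safety issues.
  private def fmt = {
    val f = new SimpleDateFormat("yyyyMMddHHmmss") // the 14-digit form aut expects
    f.setLenient(false)
    f
  }

  // Try the 14-digit form first; on failure, assume the seconds were
  // dropped and retry with "00" appended.
  def parse(raw: String): Option[java.util.Date] =
    Try(fmt.parse(raw))
      .orElse(Try(fmt.parse(raw + "00"))) // "200012060402" -> "20001206040200"
      .toOption
}

Returning an Option pushes the decision about bad records up to the caller instead of throwing inside the record constructor.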

cjer (Author) commented Jan 22, 2018

Thanks!

I actually have another issue: memory allocation and OutOfMemory errors that make it impossible for any run on a big archive folder to finish. I'm pretty new to Spark architecture, so I'm not sure whether I should post that issue here. Could you help with this?

Thanks again :)
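(For reference, driver and executor memory are set when launching spark-shell. A minimal sketch using standard Spark options; the master URL, memory values, and jar path are placeholders:)

spark-shell --master spark://<MASTER_IP>:7077 \
  --driver-memory 8G \
  --executor-memory 8G \
  --jars /path/to/aut-0.12.1-fatjar.jar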

ianmilligan1 (Member) commented:

Yep, if you could post that issue, I'd be happy to look at it. It may, in theory, be easier to fix than this date issue. :/

ianmilligan1 (Member) commented:

Re: the date issue – where did this WARC come from? I'm curious whether a 12-digit date field is a one-off due to an error somewhere, or whether there's a tool that generates these.

ruebot (Member) commented Jan 22, 2018

@cjer for the memory issue, have you looked at #159?

cjer (Author) commented Jan 22, 2018

On the date thing: all I know is that the WARCs were purchased from the Internet Archive. I'll try to find out if I can get more info.

On the memory thing: I'll check that issue out and comment there if it fits.

ianmilligan1 (Member) commented:

OK thanks, please keep us posted!

cjer (Author) commented Jan 23, 2018

OK, so these WARCs were converted from ARCs, some of which had these 12-digit dates. The conversion left the defective dates as-is and did not convert them to ISO 8601.

I don't think this justifies building more robust date extraction, especially since the WARC standard clearly states that dates must be in ISO 8601 only.

We will fix the defective dates in the actual WARC files.

What still concerns me, though, is that a few bad records crashed the whole run. Is there a way around this?

Edit: I ran a small script to try to extract the raw dates, in which I did not explicitly call getCrawlDate; it also crashed with the same error when it reached the records with 12-digit dates.
I looked at the ArchiveRecord code to see why, and as I understand it, getCrawlDate, getUrl, etc. are actually plain class attributes, so the code that computes their values runs even if you never use them. Is that right?

What I ran:

import io.archivesunleashed.spark.matchbox.RecordLoader
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/warc_folder/*.warc.gz", sc)
  .map(r => r.warcRecord.getHeader.getDate)
  .saveAsTextFile("out/raw_dates")

ruebot (Member) commented Jan 23, 2018

@cjer we talked about this on a call today, and we completely agree that we need better exception handling here. Are these files you can share, so we can test as well?
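For illustration, one sketch of what such exception handling could look like at load time: wrap record construction in Try and drop records that fail, instead of letting one bad record fail the whole Spark task. Here makeRecord is a hypothetical stand-in for aut's actual record construction:

import scala.reflect.ClassTag
import scala.util.Try
import org.apache.spark.rdd.RDD

// `makeRecord` is a hypothetical stand-in for aut's record construction.
def keepParseable[A, B: ClassTag](raw: RDD[A])(makeRecord: A => B): RDD[B] =
  raw.flatMap(r => Try(makeRecord(r)).toOption) // records whose construction throws are dropped

The trade-off is silent data loss; counting the failures (for example with a Spark accumulator) would make the drops visible.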

cjer (Author) commented Jan 23, 2018

I'm sorry, but I can't share the files.
I think creating a few test files with a defective date or two in them should do the trick for reproducing the problem.

cjer (Author) commented Feb 21, 2018

The problem was fixed using sed and jwattools.
I think this issue can be closed, but it still calls for a discussion of error-handling options when dealing with bad WARC files/records.
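For anyone hitting the same problem, a minimal sketch of the kind of sed rewrite involved, assuming the defective 12-digit values sit on their own WARC-Date header line in an uncompressed WARC and that the missing two digits were seconds; the header name and the surrounding workflow are assumptions, not cjer's actual commands:

# Pad bare 12-digit dates to the 14-digit form aut expects; in sed
# replacements \1 is the capture group, so \100 appends a literal "00".
sed -E 's/^(WARC-Date: [0-9]{12})$/\100/' input.warc > fixed.warc
# The rewritten file then needs to be re-packaged as a spec-compliant,
# per-record-gzipped .warc.gz, e.g. with jwattools.

A spec-compliant fix would instead rewrite the value into ISO 8601, since that is what the WARC standard mandates; padding just restores the form aut's parser expects.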

ruebot (Member) commented Feb 21, 2018

@cjer please feel free to create a new issue around error handling.

cjer (Author) commented Feb 21, 2018

OK, cool. I'll try to come up with something reasonable.

cjer closed this as completed Feb 21, 2018