Unparseable date error #163
Comments
Thanks for catching this. I've monkeyed around with a sample file and yep, dropping the two digits from the end of a date field trips an exception, so you don't get results. I haven't yet been able to make it crash like yours, though (but it does screw up the job). In your case the WARC has Let us chat and see if we can make it more robust.
Thanks! I actually have another issue with memory allocation and OutOfMemory errors that makes it impossible for any run on a big archive folder to finish. I'm pretty new to Spark architecture, so I'm not sure whether I should post this issue here. Could you help with this? Thanks again :)
Yep, if you could post that issue, happy to look at it. It may in theory be easier to fix than this date issue. :/
Re: date issue – where did this WARC come from? Am curious if a 12-digit date field is a one-off due to an error somewhere, or if there's a tool that generates these.
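To make the 12-digit case concrete, here is a minimal Python sketch of a tolerant parser. It is not how aut handles dates. The function name `parse_archive_date` is hypothetical, and treating a 12-digit value as `yyyyMMddHHmm` with the seconds missing is an assumption based on the description above of two digits being dropped from the end of the field.

```python
from datetime import datetime
from typing import Optional


def parse_archive_date(raw: str) -> Optional[datetime]:
    """Parse a 14-digit yyyyMMddHHmmss string, tolerating a 12-digit value.

    Returns None instead of raising, so the caller decides what to do
    with a bad record.
    """
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()):
        return None
    if len(digits) == 12:
        # Assumption: the seconds are what is missing, so pad with "00".
        digits += "00"
    if len(digits) != 14:
        return None
    try:
        # Slice fixed-width fields; datetime() rejects out-of-range values.
        return datetime(
            int(digits[0:4]), int(digits[4:6]), int(digits[6:8]),
            int(digits[8:10]), int(digits[10:12]), int(digits[12:14]),
        )
    except ValueError:
        return None
```

Returning `None` rather than raising leaves it to the caller whether a malformed date should be skipped, logged, or treated as fatal.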
On the date thing: all I know is that the WARCs were purchased from the Internet Archive. I'll try to figure out if I can get more info. On the memory thing: I'll check that issue out and comment on my stuff there if it suits.
OK thanks, please keep us posted!
Ok, so these WARCs were converted from ARCs, where some of them had these 12-digit dates. The conversion just left these defective dates as-is and did not try to convert them to ISO 8601. I don't think this justifies building a more robust date extraction, all the more so as the WARC standard clearly states the date is to be in ISO 8601 only. We will fix the defective dates in the actual WARC files. What still concerns me, though, is that a few bad records crashed the whole run. Is there a way around this? Edit: I ran a small script to try to get the raw dates, where I did not explicitly call What I ran:
@cjer we talked about this on a call today, and completely agree we need better exception handling here. Are these files that you can share, so we can test as well?
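As a rough illustration of what "better exception handling" could mean for the crash described above, the sketch below maps a function over records and sets aside the ones that raise instead of aborting the whole run. It is plain Python, not aut or Spark code. The names `map_skipping_failures` and `year_of` are hypothetical, and the strict 14-digit check is an assumption used only to produce a failure.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_skipping_failures(
    records: Iterable[T],
    fn: Callable[[T], R],
    failures: List[Tuple[T, str]],
) -> Iterator[R]:
    """Yield fn(record) for each record; collect failing records instead of raising."""
    for record in records:
        try:
            yield fn(record)
        except Exception as exc:  # broad on purpose: one bad record must not end the run
            failures.append((record, repr(exc)))


def year_of(raw_date: str) -> int:
    """Hypothetical strict extractor: accepts only 14-digit yyyyMMddHHmmss values."""
    if len(raw_date) != 14 or not raw_date.isdigit():
        raise ValueError(f"expected 14 digits, got {raw_date!r}")
    return int(raw_date[:4])


failures: List[Tuple[str, str]] = []
dates = ["20090101123045", "200901011230", "20101231235959"]
years = list(map_skipping_failures(dates, year_of, failures))
print(years)          # the two well-formed dates
print(len(failures))  # the single 12-digit date
```

Keeping the rejected records in a list (or a counter, for large inputs) makes the skipped data visible afterwards, so silent data loss does not replace the crash.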
I'm sorry, but I can't share the files.
Problem was fixed using sed and jwattools.
@cjer please feel free to create a new issue around error handling.
Ok cool, I'll try to come up with something reasonable.
Hello
I'm running a simple script to count domains by year on a folder with a large WARC archive.
Using aut 0.12.1 on a cluster with 4 workers.
This is the code:
I get these errors at a certain point, after it has gone through a few thousand tasks, and they end up crashing the script:
I'm guessing there's an issue with the dates in some of the WARC records, am I right? Could it be another issue? What should I do about this?
Full traceback:
Thanks