-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid format exception in scanJobLog #239
Labels
Comments
anjackson
added a commit
to ukwa/heritrix3
that referenced
this issue
Mar 11, 2019
I have a patch but haven't tested it yet. |
@anjackson happy to test with that |
@ruebot Awesome, thanks. I've popped a SNAPSHOT build here: Can you give it a spin? |
Good to go on my end: $ tail -f job.log
2019-03-01T12:01:32.130Z INFO Job instantiated
2019-03-01T12:01:42.293Z INFO Job launched
2019-03-01T12:01:43.923Z INFO PREPARING 20190301120143
2019-03-01T12:01:43.955Z INFO PAUSED 20190301120143
2019-03-01T12:01:54.079Z INFO RUNNING 20190301120143
2019-03-01T12:02:01.454Z WARNING nowhere to log added seed: http://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.registrar.yorku.ca/'; in processor 'candidates')
2019-03-01T12:02:07.490Z WARNING nowhere to log added seed: https://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.students.yorku.ca/'; in processor 'candidates')
2019-03-11T12:26:03.810Z INFO CHECKPOINTED cp00001-20190311122554
2019-03-11T12:26:06.453Z INFO PAUSING 20190301120143
2019-03-11T12:26:07.702Z INFO PAUSED 20190301120143 $ cat heritrix_out.log
Tue Mar 12 08:32:54 EDT 2019 Starting heritrix
Linux wombat 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
JAVA_OPTS= -Xmx256m
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256586
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 256586
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Oracle Corporation OpenJDK Runtime Environment 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12
Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore adhoc.keystore -destkeystore adhoc.keystore -deststoretype pkcs12".
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:E7:65:F2:C6:5A:90:A0:79:61:FF:28:1F:24:81:A6:74:D0:38:2C:0B
Verify in browser before accepting exception.
2019-03-12 12:32:55.523 INFO thread-1 org.archive.crawler.framework.Engine.addJobDirectory() added crawl job: academic-calendars
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 9191
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web
interface will be internet-accessible.
Heritrix version: 3.4.0-SNAPSHOT-2019-03-11T21:36:31Z
log4j:WARN No appenders could be found for logger (freemarker.cache).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. |
Patch looks correct to me. |
Thanks both! |
Pull request created: #240 |
ato
pushed a commit
that referenced
this issue
Mar 13, 2019
nlevitt
added a commit
that referenced
this issue
Apr 1, 2019
* master: use constant from rethinkdb lib for default port Update README.md Handle missing closing paren in srcset descriptor Teach jericho extractor srcset Don't run srcset test against jericho, it doesn't handle it Handle commas more compliantly when parsing srcset Ensure we start parsing full lines, for #239.
nlevitt
added a commit
to nlevitt/heritrix3
that referenced
this issue
Apr 10, 2019
* master: replace System.err.println with logger.info Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" Removing outdated test. Disable questionalbe test. Avoid deprecated flag. Supply an iterator, for internetarchive#245 Updated POM to use latest version. Update README.md Handle missing closing paren in srcset descriptor Teach jericho extractor srcset Don't run srcset test against jericho, it doesn't handle it Handle commas more compliantly when parsing srcset Ensure we start parsing full lines, for internetarchive#239.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@ruebot encountered the following exception after checkpointing and restarting Heritrix:
Looks like a bug here:
heritrix3/engine/src/main/java/org/archive/crawler/framework/CrawlJob.java
Line 168 in aa705be
If the job log is larger than 100KB,
startPosition
is set to 100KB from the end which might be in the middle of a line. If that partial line still happens to matchPattern.compile("(\\S+) (\\S+) Job launched")
then an incomplete timestamp may be parsed causing the exception.@anjackson suggests the following fix:
The text was updated successfully, but these errors were encountered: