Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make some urls with whitespace acceptable to JavaScript extractor. #145

Merged
merged 5 commits into from
Feb 11, 2016

Conversation

vonrosen
Copy link
Contributor

No description provided.

@vonrosen
Copy link
Contributor Author

Make some urls with whitespace acceptable to JavaScript extractor. Also, allow UTF-16 forward-slash to be used as path separator for use-case at https://agency.wisconsin.gov/sites/prb/SitePages/Home.aspx where many urls have the format: "\u002fsites\u002fprb\u002fPublic Comment Emails PDF\u002fOpen records of electronic records_1r71vt0k.pdf"

@nlevitt
Copy link
Contributor

nlevitt commented Jan 27, 2016

Thanks Hunter.

  • ExtractorJS.considerString() calls StringEscapeUtils.unescapeJavaScript() -- it would be best if handling of "\u002f" happened somewhere around here -- it looks like StringEscapeUtils.unescapeJavaScript() already unescapes "\u002f" et al, so you can probably just get rid of your code for that
  • looks like you have a stray close-parenthesis here: if (TextUtils.matches(".*[\\s)]+.*", candidate)) { (also, the square brackets are superfluous)
  • could you add some tests to ExtractorJSTest?

Hunter Stern and others added 2 commits January 28, 2016 15:11
…ttern of UriUtils.isVeryLikelyUri(). Involves a subtle adjustment to the regex LIKELY_RELATIVE_URI_PATTERN to ensure the 2nd capturing group always gets the file extension. Also add a couple of tests of strings with spaces and file extensions that are not known good extensions.
vonrosen pushed a commit that referenced this pull request Feb 11, 2016
Make some urls with whitespace acceptable to JavaScript extractor.
@vonrosen vonrosen merged commit bfbd61e into internetarchive:master Feb 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants