-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1448 url encoding with lower case letters #1456
Conversation
core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java
Outdated
Show resolved
Hide resolved
Formatting seems to be off. Think you can resolve that by running
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mstrewe,
thanks for the PR.
Please, see the comment regarding the uppercase representation of percent-encodings.
void testProperURLEncodingWithLowerCase() throws MalformedURLException { | ||
URLFilter urlFilter = createFilter(queryParamsToFilter); | ||
String urlWithEscapedCharacters = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d"; | ||
String expectedResult = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the expected result be %3D%3D
?
This is the canonical representation of percent-encoded characters defined in RFC 3986.
If case variants of percent-encoded chars remain in URLs, this may cause duplicates. Note that in addition to pure lowercase variant, there could be also %3d%3D
and %3D%3d
.
After a closer look into the code: the reason for the issue is likely in line 398 of BasicURLNormalizer.
|
I dont think so.
So the escaping unescaping works like expected. But since the letters now upper case
This line will then encode the percentage character again. Then we have |
Ok, this might require to run a debugger. But it doesn't seem to be the URL constructor:
|
I ran the test again, without the fix. It returned
It seems to work correctly. I found the bug in my software using maybe an older version.. I will investigate |
OK found it. In my Version the line 154 is the following
Sorry for bothering you. I will close the merge Request since problem was my old code. |
No problem. Thanks! |
Will fix #1448 the lower case url encoding error. Test added