#1448 url encoding with lower case letters #1456

mstrewe · 2025-01-27T12:12:50Z

Fixes #1448, the lowercase URL-encoding error. Test added.

core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java

rzo1 · 2025-01-27T19:06:36Z

Formatting seems to be off. I think you can resolve that by running:

mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false

sebastian-nagel

Hi @mstrewe,

thanks for the PR.

Please see the comment regarding the uppercase representation of percent-encodings.

sebastian-nagel · 2025-01-28T08:50:00Z

core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java

+    void testProperURLEncodingWithLowerCase() throws MalformedURLException {
+        URLFilter urlFilter = createFilter(queryParamsToFilter);
+        String urlWithEscapedCharacters = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";
+        String expectedResult = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";


Shouldn't the expected result be %3D%3D?

This is the canonical representation of percent-encoded characters defined in RFC 3986.

If case variants of percent-encoded characters remain in URLs, this may cause duplicates. Note that in addition to the pure lowercase variant, there could also be %3d%3D and %3D%3d.
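To illustrate the duplicate problem, here is a minimal sketch that maps every case variant of a percent-encoded triplet to the uppercase form. The class and method names are hypothetical and are not part of StormCrawler; the sketch only shows the idea, not how BasicURLNormalizer does it.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class PercentEncodingCase {
    // A percent sign followed by exactly two ASCII hex digits.
    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    // Hypothetical helper: upper-case the hex digits of each percent-encoded triplet.
    static String upperCaseEscapes(String url) {
        return ESCAPE.matcher(url)
                .replaceAll(r -> "%" + r.group(1).toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] variants = {"NjAxOA%3d%3d", "NjAxOA%3d%3D", "NjAxOA%3D%3d", "NjAxOA%3D%3D"};
        for (String v : variants) {
            // Every variant is printed as NjAxOA%3D%3D
            System.out.println(upperCaseEscapes(v));
        }
    }
}
```

With all four variants collapsed to one string, a crawler that deduplicates on the normalized URL would see a single entry.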

sebastian-nagel · 2025-01-28T09:48:57Z

After a closer look into the code: the reason for the issue is likely in line 398 of BasicURLNormalizer.

- a percent character is unconditionally converted to %25 even if it's the first character of a valid percent-encoding
- the "basic" URL normalizers of Nutch and crawler-commons treat the percent character separately and do not unconditionally escape it. All three "basic" URL normalizers share the same origin years ago, so they are still quite similar in their source code.
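As a rough sketch of what "treating the percent character separately" could look like: escape a `%` as `%25` only when it does not start a valid percent-encoded triplet. The names below are hypothetical, and this is not the code from Nutch, crawler-commons, or StormCrawler.

```java
public class PercentEscapeSketch {
    // ASCII-only hex check, so that non-ASCII digits are not accepted.
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Hypothetical helper: escape only stray percent characters.
    static String escapeStrayPercents(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            boolean startsTriplet = c == '%'
                    && i + 2 < path.length()
                    && isHex(path.charAt(i + 1))
                    && isHex(path.charAt(i + 2));
            if (c == '%' && !startsTriplet) {
                sb.append("%25"); // stray percent: encode it
            } else {
                sb.append(c); // valid triplet start or any other character: keep
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayPercents("/NjAxOA%3D%3D")); // unchanged
        System.out.println(escapeStrayPercents("/100%"));         // stray % becomes %25
    }
}
```

An unconditional conversion would instead turn `%3D` into `%253D`, which is the double-encoding symptom discussed in this thread.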

mstrewe · 2025-01-28T10:16:09Z

> After a closer look into the code: the reason for the issue is likely in line 398 of BasicURLNormalizer.
>
> - a percent character is unconditionally converted to %25 even if it's the first character of a valid percent-encoding
> - the "basic" URL normalizers of Nutch and crawler-commons treat the percent character separately and do not unconditionally escape it. All three "basic" URL normalizers share the same origin years ago, so they are still quite similar in their source code.

I don't think so.
For the URL given in the test, the path is first unescaped in line 146 and then escaped again in line 147. Up to this point, the encodings differ only in upper versus lower case (the percent character has not been encoded again yet).

// .../NjAxOA%3d%3d     - file
String file2 = unescapePath(file);
// .../NjAxOA==    - file2
file2 = escapePath(file2);
// .../NjAxOA%3D%3D   - file2

So the escaping and unescaping work as expected.

But since the letters are now upper case, the equals check (which does not ignore case) fails and execution reaches line 152, which creates a new URL with file2.

urlToFilter = new URL(protocol, host, port, file2).toString();

This line will then encode the percent character again, which gives .../NjAxOA%253D%253D

sebastian-nagel · 2025-01-28T10:36:24Z

OK, this might require running a debugger. But it doesn't seem to be the URL constructor:

jshell> new URL("http", "www.example.com", -1, "/NjAxOA%3D%3D").toString();
$1 ==> "http://www.example.com/NjAxOA%3D%3D"

mstrewe · 2025-01-28T10:49:17Z

I ran the test again, without the fix.

It returned

BasicURLNormalizerTest.testProperURLEncodingWithLowerCase:313 Failed to normalize url encoded url with lower case letters ==> expected: <http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d> but was: <http://www.example.com/Exhibitions/Detail/NjAxOA%3D%3D>

It seems to work correctly.

I found the bug in my software, which may be using an older version. I will investigate.

mstrewe · 2025-01-28T10:56:38Z

OK, found it.

In my version, line 154 is the following:

urlToFilter = new URI(protocol, null, host, port, file2, null, null).toURL().toString();

Sorry for bothering you. I will close the merge request since the problem was my old code.
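For readers hitting the same symptom: the multi-argument `java.net.URI` constructors always quote the percent character in the components they are given, so an already-escaped path gets encoded a second time. The following is a small sketch of that behavior; the class and method names are made up for illustration, and the host and path are taken from the jshell example above.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {
    // Multi-argument constructor: '%' in the path is always quoted as %25.
    static String viaMultiArgUri(String path) {
        try {
            return new URI("http", null, "www.example.com", -1, path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-argument constructor: the string is parsed, existing escapes are kept.
    static String viaSingleArgUri(String path) {
        try {
            return new URI("http://www.example.com" + path).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String escapedPath = "/NjAxOA%3D%3D";
        System.out.println(viaMultiArgUri(escapedPath));  // http://www.example.com/NjAxOA%253D%253D
        System.out.println(viaSingleArgUri(escapedPath)); // http://www.example.com/NjAxOA%3D%3D
    }
}
```

This matches the observation in the thread: the `URL` constructor left `%3D%3D` alone, while the older `URI`-based line produced `%253D%253D`.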

sebastian-nagel · 2025-01-28T10:57:44Z

> Sorry for bothering you. I will close the merge request since the problem was my old code.

No problem. Thanks!

mstrewe added 4 commits January 27, 2025 12:00

BasicURLNormalizer.java - fix url encoding comparison

4859b59

fix apache#1448

BasicURLNormalizerTest.java: add test for fix

7dd5bb8

apache#1448

Update BasicURLNormalizerTest.java: forgot to change to example.com

e36a038

BasicURLNormalizerTest.java: fix code formatting

00c854b

rzo1 reviewed Jan 27, 2025


BasicURLNormalizer.java: simplify code and formatting

5e02c15

sebastian-nagel requested changes Jan 28, 2025

mstrewe closed this Jan 28, 2025
