Enhance ndpi_has_human_readable_string to return multiple strings and support filtering #2708
Comments
…upport and reorganize string extraction code
- Introduces a new function `extract_readable_strings` to retrieve multiple human-readable substrings from a payload, with an optional filter callback for advanced matching (e.g., FQDN detection).
- Moves the old string extraction function (`ndpi_has_human_readable_string`) from ndpi_utils.c to readable_string.c, consolidating all string extraction logic into one location.
This commit complements or replaces the existing `ndpi_has_human_readable_string` by providing:
1) Multi-substring extraction.
2) More flexible handling of textual data.
3) An optional user-defined filter for fine-tuned processing.
References ntop#2708
While I understand that a "better/more generic/alternative" … Let me rephrase my question: could you provide an HTTP trace where the classification already provided by nDPI is not good enough or not fast enough, please?
To summarize, my idea is to retrieve the first human-readable string in the payload that conforms to the FQDN format and then use it for logging and verification. We could even create a new attribute in the flow data to store this information, which would be useful for reporting. When testing with various browsers (Chrome, Firefox, Opera, Safari), I noticed that they always include the destination website’s hostname in the very first HTTP/HTTPS request. With this additional detail, nDPI could use the hostname to help identify the application or sub-application. Below are some examples from PCAP files in the tests folder (covering protocols beyond just HTTP, TLS, or QUIC) where the hostname is present in the payload but is not currently captured or used in the flow data:
Now, let’s move on to the main focus of this issue, which is improving the ndpi_has_human_readable_string function. If you agree, I’ll submit a PR soon. Regarding detection improvements, I’ll open a separate issue if I spot something that might be really useful. Thanks for your support.
Yes, definitely. Thanks!
Add include readable_string.c
Is your feature request related to a problem? Please describe.
Currently, nDPI offers a function called ndpi_has_human_readable_string that scans the payload for “human-readable” text. However, this function only captures the first readable string it finds and does not allow any form of custom filtering or returning multiple matches. This can be limiting when we want to look for specific strings (e.g., FQDNs) that may appear anywhere in the initial payload, or when we need to run multiple heuristics on potentially more than one human-readable string.
In particular, when analyzing the first HTTP request from a browser (often HTTP/1.1 before the traffic moves to TLS or QUIC), we may want to extract the FQDN or hostname that appears in the plaintext request. With the current ndpi_has_human_readable_string, we risk capturing incomplete or irrelevant data, and cannot filter or retrieve subsequent matches.
Describe the solution you'd like
I propose introducing a more advanced function – e.g., extract_readable_strings – that:
1. Returns all readable substrings from the given payload (rather than just the first).
2. Supports an optional filter function (callback) so the user can decide whether a found substring is relevant (e.g., check if it matches an FQDN pattern, a certain length, etc.).
3. Potentially stores these extracted strings in a list/array, allowing the application to iterate and pick the most relevant one (like the hostname).
This approach would allow nDPI to capture multiple candidate strings and apply a user-defined filter or heuristic for each. It would be especially useful for flows where we expect to see an HTTP Host header, partial domain information, or other readable data that helps classify the flow early.
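The three points above can be sketched roughly as follows. This is only an illustrative sketch, not the prototype from the PR and not an existing nDPI API: the signature, the `readable_filter_fn` callback type, the `MAX_STR_LEN` cap, and the `has_dot` sample filter are all assumptions made for the example. It also assumes that "readable" means a run of printable, non-space ASCII bytes (`isgraph`), so whitespace splits candidates into tokens.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_STR_LEN 64 /* hypothetical cap on each stored string, including the NUL */

/* Hypothetical filter callback: return non-zero to keep the candidate. */
typedef int (*readable_filter_fn)(const char *s, size_t len, void *user);

/*
 * Scan `payload` for runs of printable, non-space ASCII that are at least
 * `min_len` bytes long. Each run is copied (NUL-terminated, truncated to
 * MAX_STR_LEN - 1 bytes) into the next free slot of `out`; the slot is kept
 * only if `filter` is NULL or returns non-zero. Returns the number of
 * strings kept, at most `max_out`.
 */
static size_t extract_readable_strings(const unsigned char *payload, size_t payload_len,
                                       size_t min_len,
                                       readable_filter_fn filter, void *user,
                                       char out[][MAX_STR_LEN], size_t max_out)
{
  size_t count = 0, start = 0, i;

  for (i = 0; i <= payload_len && count < max_out; i++) {
    /* Still inside a readable run: keep going. */
    if (i < payload_len && isgraph(payload[i]))
      continue;

    /* The run [start, i) ended at a non-readable byte or at the end of the payload. */
    size_t run = i - start;
    if (run >= min_len) {
      size_t n = run < MAX_STR_LEN - 1 ? run : MAX_STR_LEN - 1;
      memcpy(out[count], payload + start, n);
      out[count][n] = '\0';
      /* A rejected candidate is simply overwritten by the next one. */
      if (filter == NULL || filter(out[count], n, user))
        count++;
    }
    start = i + 1;
  }
  return count;
}

/* Sample filter (illustrative only): keep candidates that contain a dot. */
static int has_dot(const char *s, size_t len, void *user)
{
  (void)user; /* unused in this sample */
  return memchr(s, '.', len) != NULL;
}
```

Writing into a caller-supplied fixed-size array keeps the sketch free of dynamic allocation; a real implementation might instead return offsets into the payload to avoid copying.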
Describe alternatives you've considered
• Modifying ndpi_has_human_readable_string directly to return a list of strings instead of just one. However, this could break existing usage or require a lot of conditional checks.
• Performing all the logic outside of nDPI: capturing the packet payload and writing separate code for advanced string extraction. But this means duplicating effort instead of leveraging the built-in DPI mechanisms.
• Using specialized protocol parsers (if the traffic is HTTP/1.1, create a mini-parser). While that’s possible, it’s more rigid and doesn’t generalize well to other protocols where we just want “any text that might be relevant.”
Additional context
• I have prototyped a function called extract_readable_strings that scans the payload, extracting all substrings of a specified minimum length, and it accepts a custom filter callback for advanced matching.
• This enhancement helps classify traffic at the very first packet, which is useful if the user is connecting over HTTP before upgrading to TLS, or when analyzing initial QUIC or other protocols that may carry readable strings in early packets.
• In my tests, capturing the FQDN from the initial request significantly improves application detection accuracy in nDPI, especially for short-lived or early-phase flows.
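As an illustration of the kind of FQDN check a filter callback could apply to each candidate string, here is a minimal sketch. The function name `looks_like_fqdn` and the exact rules are assumptions for this example, not part of nDPI. It is a rough shape heuristic rather than a complete domain-name validator: it requires at least two dot-separated labels of letters, digits, and hyphens, and a final label of two or more letters, so dotted IPv4 addresses and names with a trailing dot are rejected.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Rough FQDN shape check (illustrative heuristic only):
 *  - total length between 4 and 253 bytes
 *  - at least two labels separated by '.'
 *  - each label is 1..63 characters from [A-Za-z0-9-] and does not start
 *    or end with '-'
 *  - the last label consists of at least two letters and nothing else
 * Returns 1 if `s` (of length `len`, not necessarily NUL-terminated)
 * passes all checks, 0 otherwise.
 */
static int looks_like_fqdn(const char *s, size_t len)
{
  size_t i, label_len = 0, labels = 0;
  int label_all_alpha = 1; /* tracks the label currently being scanned */

  if (len < 4 || len > 253)
    return 0;

  for (i = 0; i <= len; i++) {
    if (i == len || s[i] == '.') {
      /* Close the current label: reject empty, overlong, or '-'-terminated labels. */
      if (label_len == 0 || label_len > 63 || s[i - 1] == '-')
        return 0;
      labels++;
      if (i < len) { /* a new label follows the dot */
        label_len = 0;
        label_all_alpha = 1;
      }
      continue;
    }

    unsigned char c = (unsigned char)s[i];
    if (!(isalnum(c) || c == '-'))
      return 0;                 /* character not allowed in a hostname label */
    if (c == '-' && label_len == 0)
      return 0;                 /* a label must not start with '-' */
    if (!isalpha(c))
      label_all_alpha = 0;
    label_len++;
  }

  /* After the loop, label_len and label_all_alpha describe the last label. */
  return labels >= 2 && label_all_alpha && label_len >= 2;
}
```

With these rules, a token such as `example.com` passes, while `HTTP/1.1`, `Host:`, and `1.2.3.4` do not, which is what keeps such a filter from picking up other readable tokens in an HTTP request.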