Enhance ndpi_has_human_readable_string to return multiple strings and support filtering #2708

fabiodepin · 2025-01-31T17:40:59Z

Is your feature request related to a problem? Please describe.

Currently, nDPI offers a function called ndpi_has_human_readable_string that scans the payload for “human-readable” text. However, this function only captures the first readable string it finds and does not allow any form of custom filtering or returning multiple matches. This can be limiting when we want to look for specific strings (e.g., FQDNs) that may appear anywhere in the initial payload, or when we need to run multiple heuristics on potentially more than one human-readable string.

In particular, when analyzing the first HTTP request from a browser (often HTTP/1.1 before the traffic moves to TLS or QUIC), we may want to extract the FQDN or hostname that appears in the plaintext request. With the current ndpi_has_human_readable_string, we risk capturing incomplete or irrelevant data, and cannot filter or retrieve subsequent matches.

Describe the solution you'd like
I propose introducing a more advanced function – e.g., extract_readable_strings – that:
1. Returns all readable substrings from the given payload (rather than just the first).
2. Supports an optional filter function (callback) so the user can decide whether a found substring is relevant (e.g., check if it matches an FQDN pattern, a certain length, etc.).
3. Potentially stores these extracted strings in a list/array, allowing the application to iterate and pick the most relevant one (like the hostname).

This approach would allow nDPI to capture multiple candidate strings and apply a user-defined filter or heuristic for each. It would be especially useful for flows where we expect to see an HTTP Host header, partial domain information, or other readable data that helps classify the flow early.
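
As a rough illustration of the three points above, here is a minimal sketch in C. The function signature, the `readable_filter_fn` callback type, the `struct readable_strings` container, the `keep_if_contains` sample filter, and the `MAX_STRINGS`/`MAX_STR_LEN` limits are all assumptions made for this example; they are not existing nDPI API and not necessarily what the prototype looks like. "Readable" is taken here to mean a run of `isprint()` bytes.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_STRINGS 16  /* hypothetical cap on the number of stored matches */
#define MAX_STR_LEN 128 /* hypothetical cap per stored string, including NUL */

/* Hypothetical filter signature: return non-zero to keep the candidate. */
typedef int (*readable_filter_fn)(const char *str, size_t len, void *user_data);

struct readable_strings {
  size_t count;
  char items[MAX_STRINGS][MAX_STR_LEN];
};

/* Sample filter: keep candidates containing the substring in user_data. */
int keep_if_contains(const char *str, size_t len, void *user_data) {
  (void)len;
  return user_data != NULL && strstr(str, (const char *)user_data) != NULL;
}

/* Store one run of printable bytes if it is long enough and passes the filter. */
static void consider_run(const unsigned char *start, size_t len, size_t min_len,
                         readable_filter_fn filter, void *user_data,
                         struct readable_strings *out) {
  char tmp[MAX_STR_LEN];

  if (len == 0 || len < min_len || out->count >= MAX_STRINGS)
    return;
  if (len >= MAX_STR_LEN)
    len = MAX_STR_LEN - 1; /* truncate overly long runs */
  memcpy(tmp, start, len);
  tmp[len] = '\0';
  if (filter == NULL || filter(tmp, len, user_data)) {
    memcpy(out->items[out->count], tmp, len + 1);
    out->count++;
  }
}

/* Collect every printable run of at least min_len bytes; returns the count. */
size_t extract_readable_strings(const unsigned char *payload, size_t payload_len,
                                size_t min_len, readable_filter_fn filter,
                                void *user_data, struct readable_strings *out) {
  size_t i, run_start = 0, run_len = 0;

  out->count = 0;
  if (payload == NULL)
    return 0;

  for (i = 0; i < payload_len; i++) {
    if (isprint(payload[i])) {
      if (run_len == 0)
        run_start = i; /* a new run begins here */
      run_len++;
    } else {
      consider_run(payload + run_start, run_len, min_len, filter, user_data, out);
      run_len = 0;
    }
  }
  /* Flush a run that reaches the end of the payload. */
  consider_run(payload + run_start, run_len, min_len, filter, user_data, out);
  return out->count;
}
```

Passing `NULL` as the filter returns every run of at least `min_len` bytes; passing `keep_if_contains` with `"Host:"` as `user_data` keeps only the runs that contain that substring. A fixed-size output struct avoids heap allocation in the packet path, at the cost of truncating long runs and dropping matches beyond the cap.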

Describe alternatives you've considered
• Modifying ndpi_has_human_readable_string directly to return a list of strings instead of just one. However, this could break existing usage or require a lot of conditional checks.
• Performing all the logic outside of nDPI: capturing the packet payload and writing separate code for advanced string extraction. But this means duplicating effort instead of leveraging the built-in DPI mechanisms.
• Using specialized protocol parsers (if the traffic is HTTP/1.1, create a mini-parser). While that’s possible, it’s more rigid and doesn’t generalize well to other protocols where we just want “any text that might be relevant.”

Additional context
• I have prototyped a function called extract_readable_strings that scans the payload, extracting all substrings of a specified minimum length, and it accepts a custom filter callback for advanced matching.
• This enhancement helps classify traffic at the very first packet, which is useful if the user is connecting over HTTP before upgrading to TLS, or when analyzing initial QUIC or other protocols that may carry readable strings in early packets.
• In my tests, capturing the FQDN from the initial request significantly improves application detection accuracy in nDPI, especially for short-lived or early-phase flows.

…upport and reorganize string extraction code

- Introduces a new function `extract_readable_strings` to retrieve multiple human-readable substrings from a payload, with an optional filter callback for advanced matching (e.g., FQDN detection).
- Moves the old string extraction function (`ndpi_has_human_readable_string`) from ndpi_utils.c to readable_string.c, consolidating all string extraction logic into one location.

This commit complements or replaces the existing `ndpi_has_human_readable_string` by providing:
1) Multi-substring extraction.
2) More flexible handling of textual data.
3) An optional user-defined filter for fine-tuned processing.

References ntop#2708

IvanNardi · 2025-02-07T15:25:00Z

While I understand that a "better/more generic/alternative" ndpi_has_human_readable_string might be interesting/useful, I don't get how it could be used to improve "detection accuracy in nDPI" or to "helps classify traffic at the very first packet".

Let me rephrase my question: could you provide an HTTP trace where the classification already provided by nDPI is not good enough or not fast enough, please?

fabiodepin · 2025-02-12T17:40:20Z

> While I understand that a "better/more generic/alternative" ndpi_has_human_readable_string might be interesting/useful, I don't get how it could be used to improve "detection accuracy in nDPI" or to "helps classify traffic at the very first packet".
>
> Let me rephrase my question: could you provide an HTTP trace where the classification already provided by nDPI is not good enough or not fast enough, please?

To summarize, my idea is to retrieve the first human-readable string in the payload that conforms to the FQDN format and then use it for logging and verification. We could even create a new attribute in the flow data to store this information, which would be useful for reporting.
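
A minimal sketch of that idea in C, assuming hypothetical helpers `looks_like_fqdn` and `find_first_fqdn` (neither is nDPI API). The check is a syntactic heuristic only: labels of 1 to 63 characters from `[A-Za-z0-9-]`, at least two labels, a total length of at most 253, and an all-letter final label of at least two characters. It therefore rejects IPv4 literals, trailing-dot names, and punycode TLDs, and it can still accept look-alikes such as file names (`index.html`).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Heuristic: does str[0..len) look like a fully qualified domain name? */
int looks_like_fqdn(const char *str, size_t len) {
  size_t i, label_len = 0, labels = 0;
  int label_is_alpha = 1; /* tracks whether the current label is letters only */

  if (str == NULL || len == 0 || len > 253)
    return 0;

  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char)str[i];

    if (c == '.') {
      /* A label may not be empty or end with a hyphen. */
      if (label_len == 0 || str[i - 1] == '-')
        return 0;
      labels++;
      label_len = 0;
      label_is_alpha = 1;
    } else if (isalnum(c) || c == '-') {
      if (c == '-' && label_len == 0)
        return 0; /* a label may not start with a hyphen */
      if (!isalpha(c))
        label_is_alpha = 0;
      if (++label_len > 63)
        return 0;
    } else {
      return 0; /* character not allowed in a hostname */
    }
  }

  /* The final label (TLD) must be at least two letters. */
  if (label_len < 2 || !label_is_alpha)
    return 0;
  labels++;
  return labels >= 2;
}

/* Copy the first FQDN-looking token of the payload into out; 1 if found. */
int find_first_fqdn(const unsigned char *payload, size_t payload_len,
                    char *out, size_t out_len) {
  size_t i, start = 0;

  if (payload == NULL || out == NULL || out_len == 0)
    return 0;

  for (i = 0; i <= payload_len; i++) {
    int is_name_char = i < payload_len &&
                       (isalnum(payload[i]) || payload[i] == '-' || payload[i] == '.');
    if (is_name_char)
      continue;
    /* payload[start..i) is a maximal run of hostname characters. */
    if (i > start && i - start < out_len &&
        looks_like_fqdn((const char *)payload + start, i - start)) {
      memcpy(out, payload + start, i - start);
      out[i - start] = '\0';
      return 1;
    }
    start = i + 1;
  }
  return 0;
}
```

Splitting on non-hostname characters rather than on non-printable bytes matters here: in a line such as `Host: shop.ntop.org`, the whole line is one printable run, but only the last token passes the FQDN check.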

When testing with various browsers (Chrome, Firefox, Opera, Safari), I noticed that they always include the destination website’s hostname in the very first HTTP/HTTPS request (in the Host header for plain HTTP, or in the SNI extension of the TLS ClientHello for HTTPS). With this additional detail, nDPI could use the hostname to help identify the application or sub-application.

Below are some examples from PCAP files in the tests folder (covering protocols beyond just HTTP, TLS, or QUIC) where the hostname is present in the payload but is not currently captured or used in the flow data:
• tls_1.2_unidirectional_server.pcapng: upload.video.google.com
• tls_invalid_reads.pcap: e.crashlytics.com
• tls_long_cert.pcap: www.repstatic.it
• tls_missing_ch_frag.pcap: r1---sn-5f5nxgvh5o-hjul.googlevideo.com
• tls_verylong_certificate.pcap: p2.shared.global.fastly.net
• tls_verylong_certificate.pcap: 12wbt.com
• tls_verylong_certificate.pcap: guru.com
• http_guessed_host_and_guessed.pcapng: pornhub.com
• http_ipv6.pcap: shop.ntop.org
• http_origin_different_than_host.pcap: csb.performgroup.io
• xiaomi.pcap: xiaomi.com
• vxlan.pcap: facebook.com
• vxlan.pcap: www.facebook.com
• tls-rdn-extract.pcap: mscrl.microsoft.com
• tls_missing_ch_frag.pcap: r1---sn-5f5nxgvh5o-hjul.googlevideo.com
• smb_frags.pcap: hqdc-02.civilpension.local

fabiodepin · 2025-02-12T18:02:12Z

Now, let’s move on to the main focus of this issue, which is improving the ndpi_has_human_readable_string function.

If you agree, I’ll submit a PR soon.

Regarding detection improvements, I’ll open a separate issue if I spot something that might be really useful.

Thanks for your support.

IvanNardi · 2025-02-13T10:41:18Z

> If you agree, I’ll submit a PR soon.

yes, definitely. Thanks!

fabiodepin added the enhancement label Jan 31, 2025

fabiodepin mentioned this issue Feb 13, 2025

feat: add extract_readable_strings function with advanced filtering support and reorganize string extraction code #2720

Closed

1 task

fabiodepin added a commit to fabiodepin/nDPI that referenced this issue Feb 13, 2025

Fix MSBuild (ntop#2708)

6390a5d

Add include readeable_string.c
