Enhance ndpi_has_human_readable_string to return multiple strings and support filtering #2708

Open
fabiodepin opened this issue Jan 31, 2025 · 4 comments

@fabiodepin
Contributor

Is your feature request related to a problem? Please describe.

Currently, nDPI offers a function called ndpi_has_human_readable_string that scans the payload for “human-readable” text. However, this function only captures the first readable string it finds and does not allow any form of custom filtering or returning multiple matches. This can be limiting when we want to look for specific strings (e.g., FQDNs) that may appear anywhere in the initial payload, or when we need to run multiple heuristics on potentially more than one human-readable string.

In particular, when analyzing the first HTTP request from a browser (often HTTP/1.1 before the traffic moves to TLS or QUIC), we may want to extract the FQDN or hostname that appears in the plaintext request. With the current ndpi_has_human_readable_string, we risk capturing incomplete or irrelevant data, and cannot filter or retrieve subsequent matches.

Describe the solution you'd like
I propose introducing a more advanced function – e.g., extract_readable_strings – that:
1. Returns all readable substrings from the given payload (rather than just the first).
2. Supports an optional filter function (callback) so the user can decide whether a found substring is relevant (e.g., check if it matches an FQDN pattern, a certain length, etc.).
3. Potentially stores these extracted strings in a list/array, allowing the application to iterate and pick the most relevant one (like the hostname).

This approach would allow nDPI to capture multiple candidate strings and apply a user-defined filter or heuristic for each. It would be especially useful for flows where we expect to see an HTTP Host header, partial domain information, or other readable data that helps classify the flow early.
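As a rough illustration of the three points above, here is a minimal sketch in C. It is not the prototype from the referenced commit: the name `extract_readable_strings_sketch`, the `readable_string` struct, the `readable_filter_fn` callback type, the `keep_if_contains` example filter, and the `MAX_STR_LEN` cap are all hypothetical, and "readable" is assumed to mean a run of printable ASCII bytes (0x20 to 0x7e).

```c
#include <stddef.h>
#include <string.h>

#define MAX_STR_LEN 255 /* hypothetical cap on the length of one extracted string */

/* Hypothetical filter callback: return non-zero to keep the candidate. */
typedef int (*readable_filter_fn)(const char *str, size_t len, void *user_data);

/* One extracted string, always NUL-terminated. */
typedef struct {
  char text[MAX_STR_LEN + 1];
  size_t len;
} readable_string;

/* Printable ASCII, checked explicitly so the result does not depend on the locale. */
static int is_printable_ascii(unsigned char c) {
  return c >= 0x20 && c <= 0x7e;
}

/* Example filter: keep strings that contain the substring passed via user_data. */
static int keep_if_contains(const char *str, size_t len, void *user_data) {
  (void)len;
  return user_data != NULL && strstr(str, (const char *)user_data) != NULL;
}

/*
 * Copy every run of printable ASCII of at least min_len bytes into out[],
 * keeping only the runs accepted by the optional filter (NULL keeps all).
 * Runs longer than MAX_STR_LEN are truncated.
 * Returns the number of strings stored, at most max_out.
 */
static size_t extract_readable_strings_sketch(const unsigned char *payload,
                                              size_t payload_len,
                                              size_t min_len,
                                              readable_filter_fn filter,
                                              void *user_data,
                                              readable_string *out,
                                              size_t max_out) {
  size_t found = 0, start = 0, i;

  if (min_len == 0)
    min_len = 1; /* never report empty strings */

  for (i = 0; i <= payload_len && found < max_out; i++) {
    size_t run_len;

    /* Still inside a readable run: keep scanning. */
    if (i < payload_len && is_printable_ascii(payload[i]))
      continue;

    /* The run [start, i) ended at a non-printable byte or at the end of the payload. */
    run_len = i - start;
    if (run_len >= min_len) {
      if (run_len > MAX_STR_LEN)
        run_len = MAX_STR_LEN;
      memcpy(out[found].text, payload + start, run_len);
      out[found].text[run_len] = '\0';
      out[found].len = run_len;
      /* A rejected candidate is simply overwritten by the next one. */
      if (filter == NULL || filter(out[found].text, run_len, user_data))
        found++;
    }
    start = i + 1;
  }

  return found;
}
```

With a payload such as an HTTP/1.1 request, calling this with a `NULL` filter would return the request line and each header as separate strings, while passing `keep_if_contains` with `"Host:"` as `user_data` would keep only the Host header. Writing into a caller-provided array avoids allocation inside the scan, at the cost of a fixed upper bound on the number and length of results.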

Describe alternatives you've considered
• Modifying ndpi_has_human_readable_string directly to return a list of strings instead of just one. However, this could break existing usage or require a lot of conditional checks.
• Performing all the logic outside of nDPI: capturing the packet payload and writing separate code for advanced string extraction. But this means duplicating effort instead of leveraging the built-in DPI mechanisms.
• Using specialized protocol parsers (if the traffic is HTTP/1.1, create a mini-parser). While that’s possible, it’s more rigid and doesn’t generalize well to other protocols where we just want “any text that might be relevant.”

Additional context
• I have prototyped a function called extract_readable_strings that scans the payload, extracting all substrings of a specified minimum length, and it accepts a custom filter callback for advanced matching.
• This enhancement helps classify traffic at the very first packet, which is useful if the user is connecting over HTTP before upgrading to TLS, or when analyzing initial QUIC or other protocols that may carry readable strings in early packets.
• In my tests, capturing the FQDN from the initial request significantly improves application detection accuracy in nDPI, especially for short-lived or early-phase flows.

fabiodepin added a commit to fabiodepin/nDPI that referenced this issue Jan 31, 2025
…upport and reorganize string extraction code

- Introduces a new function `extract_readable_strings` to retrieve multiple 
  human-readable substrings from a payload, with an optional filter callback 
  for advanced matching (e.g., FQDN detection).
- Moves the old string extraction function (`ndpi_has_human_readable_string`) 
  from ndpi_utils.c to readable_string.c, consolidating all string extraction
  logic into one location.

This commit complements or replaces the existing `ndpi_has_human_readable_string`
by providing:
1) Multi-substring extraction.
2) More flexible handling of textual data.
3) An optional user-defined filter for fine-tuned processing.

References ntop#2708
@IvanNardi
Collaborator

While I understand that a "better/more generic/alternative" ndpi_has_human_readable_string might be interesting/useful, I don't get how it could be used to improve "detection accuracy in nDPI" or to "helps classify traffic at the very first packet".

Let me rephrase my question: could you provide an HTTP trace where the classification already provided by nDPI is not good enough or not fast enough, please?

@fabiodepin
Contributor Author

> While I understand that a "better/more generic/alternative" ndpi_has_human_readable_string might be interesting/useful, I don't get how it could be used to improve "detection accuracy in nDPI" or to "helps classify traffic at the very first packet".
>
> Let me rephrase my question: could you provide an HTTP trace where the classification already provided by nDPI is not good enough or not fast enough, please?

To summarize, my idea is to retrieve the first human-readable string in the payload that conforms to the FQDN format and then use it for logging and verification. We could even create a new attribute in the flow data to store this information, which would be useful for reporting.
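To make "conforms to the FQDN format" concrete, here is a minimal sketch in C of what such a check could look like. The function names `looks_like_fqdn` and `find_first_fqdn` are hypothetical and are not part of nDPI. The check assumes the usual hostname shape: dot-separated labels of 1 to 63 letters, digits, or hyphens, no label starting or ending with a hyphen, at most 253 characters in total, and a purely alphabetic last label of at least two characters. It is only a syntactic heuristic, so a token like `index.html` would also pass.

```c
#include <stddef.h>
#include <string.h>

static int is_alpha_ascii(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int is_alnum_ascii(unsigned char c) {
  return is_alpha_ascii(c) || (c >= '0' && c <= '9');
}

/* Characters that may appear inside a hostname token. */
static int is_host_char(unsigned char c) {
  return is_alnum_ascii(c) || c == '-' || c == '.';
}

/* Returns 1 when s[0..len) has the shape of a fully qualified hostname. */
static int looks_like_fqdn(const char *s, size_t len) {
  size_t i, label_len = 0, labels = 0, last_label_start = 0;

  if (len == 0 || len > 253)
    return 0;

  for (i = 0; i <= len; i++) {
    unsigned char c;

    if (i == len || s[i] == '.') {
      /* Close the current label: it must be 1..63 chars and not end with '-'. */
      if (label_len == 0 || label_len > 63 || s[i - 1] == '-')
        return 0;
      labels++;
      if (i < len) {
        label_len = 0;
        last_label_start = i + 1;
      }
      continue;
    }

    c = (unsigned char)s[i];
    if (c == '-') {
      if (label_len == 0)
        return 0; /* a label must not start with '-' */
    } else if (!is_alnum_ascii(c)) {
      return 0;
    }
    label_len++;
  }

  /* Need at least "name.tld", with an alphabetic TLD of two or more chars. */
  if (labels < 2 || len - last_label_start < 2)
    return 0;
  for (i = last_label_start; i < len; i++)
    if (!is_alpha_ascii((unsigned char)s[i]))
      return 0;

  return 1;
}

/*
 * Scan the payload for the first token of hostname characters that passes
 * looks_like_fqdn() and copy it, NUL-terminated, into out.
 * Returns 1 on success, 0 when nothing matched or the match did not fit.
 */
static int find_first_fqdn(const unsigned char *payload, size_t payload_len,
                           char *out, size_t out_len) {
  size_t start = 0, i;

  for (i = 0; i <= payload_len; i++) {
    size_t n;

    if (i < payload_len && is_host_char(payload[i]))
      continue;

    n = i - start; /* token [start, i) ended */
    if (n > 0 && n < out_len &&
        looks_like_fqdn((const char *)payload + start, n)) {
      memcpy(out, payload + start, n);
      out[n] = '\0';
      return 1;
    }
    start = i + 1;
  }

  return 0;
}
```

In this sketch, IPv4 literals and version strings such as the `1.1` in `HTTP/1.1` are rejected by the alphabetic-TLD rule, and names with a trailing dot are rejected as well. Whether those choices are right for nDPI is an open question.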

When testing with various browsers (Chrome, Firefox, Opera, Safari), I noticed that they always include the destination website’s hostname in the very first HTTP/HTTPS request. With this additional detail, nDPI could use the hostname to help identify the application or sub-application.

Below are some examples from PCAP files in the tests folder (covering protocols beyond just HTTP, TLS, or QUIC) where the hostname is present in the payload but is not currently captured or used in the flow data:
• tls_1.2_unidirectional_server.pcapng: upload.video.google.com
• tls_invalid_reads.pcap: e.crashlytics.com
• tls_long_cert.pcap: www.repstatic.it
• tls_missing_ch_frag.pcap: r1---sn-5f5nxgvh5o-hjul.googlevideo.com
• tls_verylong_certificate.pcap: p2.shared.global.fastly.net
• tls_verylong_certificate.pcap: 12wbt.com
• tls_verylong_certificate.pcap: guru.com
• http_guessed_host_and_guessed.pcapng: pornhub.com
• http_ipv6.pcap: shop.ntop.org
• http_origin_different_than_host.pcap: csb.performgroup.io
• xiaomi.pcap: xiaomi.com
• vxlan.pcap: facebook.com
• vxlan.pcap: www.facebook.com
• tls-rdn-extract.pcap: mscrl.microsoft.com
• smb_frags.pcap: hqdc-02.civilpension.local

@fabiodepin
Contributor Author

Now, let’s move on to the main focus of this issue, which is improving the ndpi_has_human_readable_string function.

If you agree, I’ll submit a PR soon.

Regarding detection improvements, I’ll open a separate issue if I spot something that might be really useful.

Thanks for your support.

fabiodepin added a commit to fabiodepin/nDPI that referenced this issue Feb 12, 2025
…upport and reorganize string extraction code

References ntop#2708
@IvanNardi
Collaborator

> If you agree, I’ll submit a PR soon.

Yes, definitely. Thanks!
