Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consolidate regular expression matching for numeric timezones #288

Open
jtmoon79 opened this issue Apr 21, 2024 · 0 comments
Open

consolidate regular expression matching for numeric timezones #288

jtmoon79 opened this issue Apr 21, 2024 · 0 comments
Labels
code improvement enhancement not seen by the user enhancement New feature or request P1 important

Comments

@jtmoon79
Copy link
Owner

jtmoon79 commented Apr 21, 2024

Summary

Three regular expressions are made to match the differing numeric timezone formats +00, +0000, +00:00. Combining these three regular expressions into one regular expression would reduce the total number of DTPD!, i.e. DateTimeParseInstr regular expressions.

Current behavior

Currently, three DateTimeParseInstr entries (each holding a compiled regular expression) are created for the three timezone numeric formats; +00, +0000, +00:00.
For example, datetime.rs:#L3438-L3471.
There are currently 153 regular expressions that might be compiled during a program run.

A run of s4 to process ./logs/other/tests/numbers3.log does not successfully parse any log messages. This means all 153 regular expressions were compiled in unsuccessful attempts to match a datetimestamp. According to the attached flamegraph, 37.80% of that runtime was spent in regex::regex::bytes::Regex::new.

flamegraph-numbers3

Suggested behavior

Create one DateTimeParseInstr (regular expression) to match the three types of numeric timezone formats.
This would greatly reduce the maximum number of regular expressions that may be compiled at runtime, and reduce the number "zero block" match attempts made for some log files (match attempts made before a DateTimeParseInstr is chosen for that file).

Tradeoff

There is an advantage to have multiple more precise DateTimeParseInstr (regular expressions); errant matches are less likely.

Additionally, the maximum cost for compiling regular expressions is a fixed amount. As the amount of data processed increases (larger files, more files) then that fixed cost shrinks relative to the total runtime.

... so I'm thinking the multiple timezone matches should be left in-place... 🤔🤔🤔

A good compromise would be allowing regex with "high confidence" matches to consolidate numeric timezones. Whereas matches that are generalized, "low confidence", would have distinguished timezones.

A "high confidence" might be:

[INFO] [20220609T004532+0000] Hello | goodbye

Matched [INFO] [20220609T004532+0000]

Notice the matched brackets and [INFO] and anchored from the line beginning give higher confidence this is the correct datetimestamp.

a "low confidence" might be:

Hello 20220609T004532+0000 goodbye

Matched 20220609T004532+0000

Notice the match is not "anchored" to the line beginning. Notice there are no brackets hinting the datetimestamp is or is not part of the surrounding substrings, e.g. 20220609T004532+0000 might be a substring in the logged message and not a datetimestamp.

@jtmoon79 jtmoon79 added enhancement New feature or request code improvement enhancement not seen by the user P1 important labels Apr 21, 2024
@jtmoon79 jtmoon79 changed the title consolidate matching numeric timezones consolidate regular expression matching for numeric timezones Apr 21, 2024
@jtmoon79 jtmoon79 added good first issue Good for newcomers and removed good first issue Good for newcomers labels Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code improvement enhancement not seen by the user enhancement New feature or request P1 important
Projects
None yet
Development

No branches or pull requests

1 participant