consolidate regular expression matching for numeric timezones #288

jtmoon79 · 2024-04-21T02:10:40Z

Summary

Three regular expressions are made to match the differing numeric timezone formats +00, +0000, +00:00. Combining these three regular expressions into one regular expression would reduce the total number of DTPD!, i.e. DateTimeParseInstr regular expressions.

Current behavior

Currently, three DateTimeParseInstr entries (each holding a compiled regular expression) are created for the three timezone numeric formats; +00, +0000, +00:00.
For example, datetime.rs:#L3438-L3471.
There are currently 153 regular expressions that might be compiled during a program run.

A run of s4 to process ./logs/other/tests/numbers3.log does not successfully parse any log messages. This means all 153 regular expressions were compiled in unsuccessful attempts to match a datetimestamp. According to the attached flamegraph, 37.80% of that runtime was spent in regex::regex::bytes::Regex::new.

Suggested behavior

Create one DateTimeParseInstr (regular expression) to match the three types of numeric timezone formats.
This would greatly reduce the maximum number of regular expressions that may be compiled at runtime, and reduce the number "zero block" match attempts made for some log files (match attempts made before a DateTimeParseInstr is chosen for that file).

Tradeoff

There is an advantage to have multiple more precise DateTimeParseInstr (regular expressions); errant matches are less likely.

Additionally, the maximum cost for compiling regular expressions is a fixed amount. As the amount of data processed increases (larger files, more files) then that fixed cost shrinks relative to the total runtime.

... so I'm thinking the multiple timezone matches should be left in-place... 🤔🤔🤔

A good compromise would be allowing regex with "high confidence" matches to consolidate numeric timezones. Whereas matches that are generalized, "low confidence", would have distinguished timezones.

A "high confidence" might be:

[INFO] [20220609T004532+0000] Hello | goodbye

Matched [INFO] [20220609T004532+0000]

Notice the matched brackets and [INFO] and anchored from the line beginning give higher confidence this is the correct datetimestamp.

a "low confidence" might be:

Hello 20220609T004532+0000 goodbye

Matched 20220609T004532+0000

Notice the match is not "anchored" to the line beginning. Notice there are no brackets hinting the datetimestamp is or is not part of the surrounding substrings, e.g. 20220609T004532+0000 might be a substring in the logged message and not a datetimestamp.

The text was updated successfully, but these errors were encountered:

jtmoon79 added enhancement New feature or request code improvement enhancement not seen by the user P1 important labels Apr 21, 2024

jtmoon79 changed the title ~~consolidate matching numeric timezones~~ consolidate regular expression matching for numeric timezones Apr 21, 2024

jtmoon79 mentioned this issue Apr 21, 2024

improve efficiency (catch all) #84

Open

jtmoon79 added good first issue Good for newcomers and removed good first issue Good for newcomers labels Apr 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consolidate regular expression matching for numeric timezones #288

consolidate regular expression matching for numeric timezones #288

jtmoon79 commented Apr 21, 2024 •

edited

Loading

consolidate regular expression matching for numeric timezones #288

consolidate regular expression matching for numeric timezones #288

Comments

jtmoon79 commented Apr 21, 2024 • edited Loading

Summary

Current behavior

Suggested behavior

Tradeoff

jtmoon79 commented Apr 21, 2024 •

edited

Loading