Summary
Three regular expressions are made to match the differing numeric timezone formats +00, +0000, +00:00. Combining these three regular expressions into one regular expression would reduce the total number of DTPD! entries, i.e. DateTimeParseInstr regular expressions.
Current behavior
Currently, three DateTimeParseInstr entries (each holding a compiled regular expression) are created for the three numeric timezone formats: +00, +0000, +00:00.
For example, datetime.rs:#L3438-L3471.
There are currently 153 regular expressions that might be compiled during a program run.
A run of s4 to process ./logs/other/tests/numbers3.log does not successfully parse any log messages. This means all 153 regular expressions were compiled in unsuccessful attempts to match a datetimestamp. According to the attached flamegraph, 37.80% of that runtime was spent in regex::regex::bytes::Regex::new.
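For illustration, here is a rough sketch of the current situation using simplified, assumed patterns (not the actual DTPD! entries in datetime.rs): each numeric timezone format gets its own compiled regular expression.

```rust
use regex::bytes::Regex;

// Simplified, illustrative patterns -- not the actual DTPD! entries in
// datetime.rs -- showing how each numeric timezone format currently
// requires its own compiled regular expression.
fn main() {
    // "+00" style, e.g. 20220609T004532+00
    let tz_hh = Regex::new(r"\d{8}T\d{6}[+-]\d{2}$").unwrap();
    // "+0000" style, e.g. 20220609T004532+0000
    let tz_hhmm = Regex::new(r"\d{8}T\d{6}[+-]\d{4}$").unwrap();
    // "+00:00" style, e.g. 20220609T004532+00:00
    let tz_hh_mm = Regex::new(r"\d{8}T\d{6}[+-]\d{2}:\d{2}$").unwrap();

    assert!(tz_hh.is_match(b"20220609T004532+00"));
    assert!(tz_hhmm.is_match(b"20220609T004532+0000"));
    assert!(tz_hh_mm.is_match(b"20220609T004532+00:00"));
}
```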
Suggested behavior
Create one DateTimeParseInstr (regular expression) to match the three types of numeric timezone formats.
This would greatly reduce the maximum number of regular expressions that may be compiled at runtime, and reduce the number of "zero block" match attempts made for some log files (match attempts made before a DateTimeParseInstr is chosen for that file).
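A hedged sketch of what a single consolidated pattern could look like, again with a simplified assumed pattern rather than the real DateTimeParseInstr regexes:

```rust
use regex::bytes::Regex;

// One combined pattern covering all three numeric timezone formats:
// "+00", "+0000", and "+00:00" (illustrative only; the real
// DateTimeParseInstr patterns in datetime.rs are more involved).
fn main() {
    // [+-]\d{2} matches "+00"; the optional ":?\d{2}" extends it to
    // "+0000" or "+00:00".
    let tz_any = Regex::new(r"\d{8}T\d{6}[+-]\d{2}(?::?\d{2})?$").unwrap();

    let samples: [&[u8]; 3] = [
        b"20220609T004532+00",
        b"20220609T004532+0000",
        b"20220609T004532+00:00",
    ];
    for line in samples {
        assert!(tz_any.is_match(line));
    }
}
```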
Tradeoff
There is an advantage to having multiple, more precise DateTimeParseInstr entries (regular expressions): errant matches are less likely.
Additionally, the maximum cost of compiling regular expressions is a fixed amount. As the amount of data processed increases (larger files, more files), that fixed cost shrinks relative to the total runtime; the compile cost that dominates a run over a small unparseable file becomes negligible over gigabytes of logs.
... so I'm thinking the multiple timezone matches should be left in-place... 🤔🤔🤔
A good compromise would be to allow regexes with "high confidence" matches to consolidate the numeric timezone formats, whereas generalized, "low confidence" matches would keep distinguished timezone formats (a regex sketch follows the examples below).
A "high confidence" match might be:
[INFO] [20220609T004532+0000] Hello | goodbye
Matched [INFO] [20220609T004532+0000]
Notice that the matched brackets, the [INFO] prefix, and the anchor at the beginning of the line give higher confidence that this is the correct datetimestamp.
A "low confidence" match might be:
Hello 20220609T004532+0000 goodbye
Matched 20220609T004532+0000
Notice the match is not "anchored" to the line beginning. Notice there are no brackets hinting whether the datetimestamp is or is not part of the surrounding substrings; e.g. 20220609T004532+0000 might be a substring within the logged message and not a datetimestamp.
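A sketch of that compromise, using simplified assumed patterns (not the real DateTimeParseInstr entries): the anchored, bracketed "high confidence" pattern consolidates the numeric timezone forms, while the unanchored "low confidence" pattern keeps a single exact timezone form.

```rust
use regex::bytes::Regex;

// Illustrative sketch: a "high confidence" pattern is anchored to the
// line start and requires the surrounding "[INFO] [...]" brackets, so it
// can safely accept any of "+00", "+0000", or "+00:00"; a "low confidence"
// pattern is unanchored and bare, so it keeps one exact timezone format
// to reduce errant matches.
fn main() {
    // High confidence: anchored, bracketed, consolidated timezone.
    let high = Regex::new(
        r"^\[INFO\] \[(\d{8}T\d{6}[+-]\d{2}(?::?\d{2})?)\]"
    ).unwrap();

    // Low confidence: unanchored, no surrounding hints, so only the exact
    // "+0000" form is accepted here.
    let low = Regex::new(r"(\d{8}T\d{6}[+-]\d{4})").unwrap();

    assert!(high.is_match(b"[INFO] [20220609T004532+0000] Hello | goodbye"));
    assert!(low.is_match(b"Hello 20220609T004532+0000 goodbye"));
}
```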