Handle military clock time (0800) in time standardizer. #1056
Changes from 3 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -185,6 +185,9 @@ class DateTimeStandardizer(Standardizer): | |
""" | ||
|
||
DEFAULT_FORMAT = "%B %d, %Y %H:%M:%S%Z" | ||
clock_re = re.compile(r"\d:[0-5]\d") | ||
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)") | ||
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?") | ||
Would appreciate comments for the regexes.
I had hoped the names were descriptive. Do you want examples of what they match? I'm not sure how to make regexes clearer without just explaining the grammar of regexes, which various web pages already do better.
Examples would be good, or even a short explanation of their functionality. It is very dense right now, and not easy to understand without trying to make sense of the regex itself.
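As a rough illustration of the kind of examples requested here, the sketch below copies the three patterns from the diff and labels a few sample tokens. The `classify` helper and the sample tokens are hypothetical, added only for this example; the order of checks mirrors the `if`/`elif` chain in the `preprocess` method added later in this diff.

```python
import re

# Patterns copied verbatim from the diff.
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")


def classify(token: str) -> str:
    """Hypothetical helper: label a token in the order preprocess() tests it."""
    if clock_re.search(token):
        return "clock"   # already has a colon-separated time, e.g. 8:00
    if year_re.search(token):
        return "year"    # a date carrying a year, e.g. 2024-03-01 or 3/1/2024
    if digitpair_re.fullmatch(token):
        return "digits"  # HHMM or HHMMSS with no separators, e.g. 0800
    return "other"


samples = ["8:00", "2024-03-01", "3/1/2024", "0800", "083015", "2024", "0860", "March"]
for token in samples:
    print(f"{token!r:14} -> {classify(token)}")
```

Note that a bare `2024` is labelled `digits`, not `year`: `year_re` only fires when the year sits next to a `-` or `/`, which is the ambiguity the comment in `preprocess` mentions. `0860` is rejected because the minute pair must start with 0 to 5.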
||
|
||
@staticmethod | ||
def fixer(raw_dateTime: str) -> datetime: | ||
|
@@ -205,10 +208,11 @@ def fixer(raw_dateTime: str) -> datetime: | |
""" | ||
assert raw_dateTime is not None, "raw_dateTime is None" | ||
try: | ||
raw_dateTime = raw_dateTime.strip() | ||
raw_dateTime = DateTimeStandardizer.preprocess(raw_dateTime) | ||
raw_dateTime = raw_dateTime.replace("Local", "") | ||
raw_dateTime = raw_dateTime.replace("local", "") | ||
raw_dateTime = raw_dateTime.replace(".", ":") | ||
logging.error(f"FIXME {raw_dateTime}") | ||
Why do we have this?
Oops.
||
parsed = dateparser.parse(raw_dateTime) | ||
if not parsed: | ||
raise ValueError(f"Invalid date format: {raw_dateTime}") | ||
|
@@ -222,6 +226,35 @@ def fixer(raw_dateTime: str) -> datetime: | |
# Handle any other exceptions | ||
raise RuntimeError(f"Unexpected error occurred while processing: {raw_dateTime}") from e | ||
|
||
@staticmethod | ||
def preprocess(raw: str) -> str: | ||
This looks like it's only meant for military time, so I'd rather rename this to
Fine. I was thinking other preprocessing might go here too. Either way.
We can always rename it at that point.
||
# Fix up military clock time with just digits (0800) | ||
raw = raw.strip() | ||
tokens = raw.split() | ||
saw_clock = 0 | ||
saw_year = 0 | ||
saw_digits = 0 | ||
for token in tokens: | ||
if DateTimeStandardizer.clock_re.search(token): | ||
saw_clock += 1 | ||
elif DateTimeStandardizer.year_re.search(token): | ||
saw_year += 1 | ||
elif DateTimeStandardizer.digitpair_re.fullmatch(token): | ||
saw_digits += 1 | ||
# If unsure there's exactly one military clock time, bail out. | ||
# Note that numbers like 2024 could be times or years. | ||
if (saw_clock > 0) or (saw_year == 0) or (saw_digits != 1): | ||
return raw | ||
pieces: list[str] = [] | ||
for token in tokens: | ||
if match := DateTimeStandardizer.digitpair_re.fullmatch(token): | ||
clock = ":".join([x for x in match.groups() if x]) | ||
before = token[: match.start(0)] | ||
after = token[match.end(0) :] | ||
token = before + clock + after | ||
pieces.append(token) | ||
return " ".join(pieces) | ||
|
||
@staticmethod | ||
def standardize( | ||
doc: Document, | ||
|
@@ -305,7 +338,7 @@ def ignore_errors(doc: Document, standardizer: Standardizer, key_path: list[str] | |
try: | ||
doc = standardizer.standardize(doc, key_path=key_path) | ||
except KeyError: | ||
logger.warn(f"Key {key_path} not found in document: {doc}") | ||
logger.warning(f"Key {key_path} not found in document: {doc}") | ||
except Exception as e: | ||
logger.error(e) | ||
return doc |
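To make the behaviour of the new `preprocess` step concrete, here is a standalone sketch of the same logic as a module-level function, run on a few inputs. It is lightly condensed from the diff and is not the method as merged; the sample strings are made up for illustration.

```python
import re

# Patterns copied verbatim from the diff.
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")


def preprocess(raw: str) -> str:
    """Standalone sketch: turn a lone HHMM / HHMMSS token into HH:MM[:SS]."""
    raw = raw.strip()
    tokens = raw.split()
    saw_clock = saw_year = saw_digits = 0
    for token in tokens:
        if clock_re.search(token):
            saw_clock += 1
        elif year_re.search(token):
            saw_year += 1
        elif digitpair_re.fullmatch(token):
            saw_digits += 1
    # Bail out unless there is no explicit clock time, at least one dated
    # token, and exactly one all-digit candidate.
    if saw_clock > 0 or saw_year == 0 or saw_digits != 1:
        return raw
    pieces = []
    for token in tokens:
        if match := digitpair_re.fullmatch(token):
            # fullmatch spans the whole token, so there is no text before or after it.
            token = ":".join(g for g in match.groups() if g)
        pieces.append(token)
    return " ".join(pieces)


print(preprocess("2024-03-01 0800"))    # rewritten: one dated token, one digit token
print(preprocess("3/1/2024 083015"))    # rewritten, seconds included
print(preprocess("March 1 2024 0800"))  # unchanged: no token matches year_re
print(preprocess("2024-03-01 08:00"))   # unchanged: already has a clock time
```

In this sketch the third input is left alone for two reasons: no token matches `year_re`, and both `2024` and `0800` count as digit candidates, so the function cannot tell which one is the time.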
Why do we need to preserve the timezone? Why not convert to UTC time? That seems far easier to parse and a better basis especially when others don't have any timezone mentioned.
It's not my decision to make. I assume the reason is that sometimes the timezone isn't specified in either the document or the query and it's better to be in a position to do late binding. The way datetime and dateparser work is that timezone can be specified if known and will be used if specified. It's easy to convert a datetime to UTC if that's the goal.
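A minimal sketch of the late-binding point, using only the standard library: a timezone-aware `datetime` converts to UTC in one call, while a naive one is left untouched until a timezone is known. The `to_utc_if_known` helper and the fixed-offset zone are hypothetical and exist only for this example; they are not part of the PR.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset, for illustration only (no daylight-saving rules).
eastern = timezone(timedelta(hours=-5), "EST")

aware = datetime(2024, 3, 1, 8, 0, tzinfo=eastern)  # timezone was specified
naive = datetime(2024, 3, 1, 8, 0)                  # timezone unknown


def to_utc_if_known(dt: datetime) -> datetime:
    """Hypothetical helper: convert to UTC only when the offset is known."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Calling astimezone() on a naive datetime would assume the system's
        # local timezone, so leave it unbound instead.
        return dt
    return dt.astimezone(timezone.utc)


print(to_utc_if_known(aware))  # 2024-03-01 13:00:00+00:00
print(to_utc_if_known(naive))  # 2024-03-01 08:00:00
```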
But if we want to 'standardize' it, isn't standardizing to one timezone part of the process? It seems likely to reduce errors as well. Also, how does OpenSearch handle timezones?