Handle military clock time (0800) in time standardizer. #1056

Merged Dec 5, 2024 (4 commits)

Conversation

alexaryn
Contributor

@alexaryn alexaryn commented Dec 5, 2024

No description provided.

Contributor

@karanataryn karanataryn left a comment

Comments and suggestions. This helps handle a new case, though I suspect we will need to use LLMs to handle general cases soon.

("2023-07-15 10.30.00", "2023-07-15 10:30:00"),
("15/07/2023 10.30.00", "2023-07-15 10:30:00"),
("2023-07-15 10.30.00 Local", "2023-07-15 10:30:00"),
("2023-07-15 10.30.00PDT", "2023-07-15 10:30:00-07:00"),
Contributor

@karanataryn karanataryn Dec 5, 2024

Why do we need to preserve the timezone? Why not convert to UTC time? That seems far easier to parse and a better basis especially when others don't have any timezone mentioned.

Contributor Author

It's not my decision to make. I assume the reason is that sometimes the timezone isn't specified in either the document or the query and it's better to be in a position to do late binding. The way datetime and dateparser work is that timezone can be specified if known and will be used if specified. It's easy to convert a datetime to UTC if that's the goal.
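To make the late-binding point concrete, here is a minimal sketch using only the standard library `datetime` module. The sample values are taken from the test cases above; the fixed -07:00 offset for PDT is an assumption made for illustration, and none of this is the standardizer's own code.

```python
from datetime import datetime, timedelta, timezone

# Aware value: the source text carried a zone (PDT assumed to be UTC-7).
pdt = timezone(timedelta(hours=-7), "PDT")
aware = datetime(2023, 7, 15, 10, 30, tzinfo=pdt)

# Naive value: no zone was given, so none is invented here.
naive = datetime(2023, 7, 15, 10, 30)

# Converting an aware value to UTC is a single call.
as_utc = aware.astimezone(timezone.utc)

print(aware.isoformat())   # 2023-07-15T10:30:00-07:00
print(as_utc.isoformat())  # 2023-07-15T17:30:00+00:00
print(naive.tzinfo)        # None

# Late binding: attach a zone to the naive value once it becomes known.
bound = naive.replace(tzinfo=pdt)
print(bound == aware)      # True
```

Keeping the offset when it is present loses nothing, since the UTC form can be derived later, whereas a naive value converted to UTC up front would bake in a guess.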

Contributor

But if we want to 'standardize' it, isn't standardizing to one timezone part of the process? It seems like that's likely to reduce errors as well in that case. Also, how does OpenSearch handle timezones?

@@ -185,6 +185,9 @@ class DateTimeStandardizer(Standardizer):
"""

DEFAULT_FORMAT = "%B %d, %Y %H:%M:%S%Z"
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")
Contributor

Would appreciate comments for the regexes

Contributor Author

I had hoped the names were descriptive. Do you want examples of what they match? Not sure how to make regexes clearer without just explaining the grammar of regexes, which is already better done on various web pages.

Contributor

Examples would be good, or even a short explanation of their functionality. It is very dense right now, and not easy to understand without trying to make sense of the regex itself.
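As a sketch of the kind of examples being asked for, the snippet below copies the three patterns from the diff and shows what each one matches on a few sample strings. The sample inputs and the variable names holding the results are made up for illustration; how the standardizer combines the patterns is not shown in this excerpt.

```python
import re

# Patterns copied verbatim from the diff.
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")

# clock_re: a digit, a colon, then a minute-like pair (00-59).
# It finds colon-separated times and ignores a bare "0800".
has_clock = clock_re.search("2023-07-15 10:30:00")
no_clock = clock_re.search("2023-07-15 0800")
print(has_clock.group())  # 0:30
print(no_clock)           # None

# year_re: three alternatives that each hint at a numeric date.
iso_year = year_re.search("2023-07-15")   # four-digit year followed by "-"
slash_year = year_re.search("07/2023")    # "/" followed by a four-digit year
slash_date = year_re.search("7/15/23")    # digit/day/digit, as in m/d/y
print(iso_year.group())    # 2023-
print(slash_year.group())  # /2023
print(slash_date.group())  # 7/15/2

# digitpair_re: hour pair, minute pair, optional seconds pair.
hhmm = digitpair_re.search("0800")
hhmmss = digitpair_re.search("083015")
print(hhmm.groups())    # ('08', '00', None)
print(hhmmss.groups())  # ('08', '30', '15')

# Caveat: a four-digit year also looks like HHMM to this pattern.
year_like = digitpair_re.search("2023")
print(year_like.groups())  # ('20', '23', None)
```

The last case suggests why a separate year check is useful: on its own, `digitpair_re` cannot tell `2023` the year from `20:23` the time.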

raw_dateTime = raw_dateTime.replace("Local", "")
raw_dateTime = raw_dateTime.replace("local", "")
raw_dateTime = raw_dateTime.replace(".", ":")
logging.error(f"FIXME {raw_dateTime}")
Contributor

Why do we have this?

Contributor Author

Oops.

@@ -222,6 +226,35 @@ def fixer(raw_dateTime: str) -> datetime:
# Handle any other exceptions
raise RuntimeError(f"Unexpected error occurred while processing: {raw_dateTime}") from e

@staticmethod
def preprocess(raw: str) -> str:
Contributor

@karanataryn karanataryn Dec 5, 2024

This looks like it's only meant for military time, so I'd rather rename this to military_time_preprocess, given that we're doing some processing in the main function as well.

Contributor Author

Fine. I was thinking other preprocessing might go here too. Either way.

Contributor

We can always rename it at that point.
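The excerpt shows only the signature of `preprocess`, not its body. Purely as an illustration of the kind of rewrite the PR title describes, here is a hypothetical sketch built on the three patterns from the diff. The function name `military_time_sketch` and all of its logic are assumptions, not the merged code, and it handles only a trailing 4- or 6-digit time with no timezone suffix.

```python
import re

# Patterns copied verbatim from the diff.
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")


def military_time_sketch(raw: str) -> str:
    """Hypothetical helper: rewrite a trailing 0800-style time as 08:00."""
    # Leave the input alone if it already has a colon-separated time.
    if clock_re.search(raw):
        return raw
    # Only act when a numeric date with a year is present, so that a bare
    # "July 15, 2023" is not misread as the time 20:23.
    if not year_re.search(raw):
        return raw
    head, _, last = raw.rpartition(" ")
    match = digitpair_re.fullmatch(last)
    if not head or match is None:
        return raw
    hours, minutes, seconds = match.groups()
    # The pattern allows hours up to 29 and any seconds pair; reject those.
    if int(hours) > 23 or (seconds is not None and int(seconds) > 59):
        return raw
    clock = f"{hours}:{minutes}" + (f":{seconds}" if seconds else "")
    return f"{head} {clock}"


print(military_time_sketch("2023-07-15 0800"))      # 2023-07-15 08:00
print(military_time_sketch("15/07/2023 083015"))    # 15/07/2023 08:30:15
print(military_time_sketch("2023-07-15 10:30:00"))  # unchanged
print(military_time_sketch("July 15, 2023"))        # unchanged
```

Inputs such as `0800PDT` or a military time after a spelled-out month fall outside this sketch and would need extra handling.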

@alexaryn alexaryn merged commit 223e6dc into main Dec 5, 2024
12 of 14 checks passed
@alexaryn alexaryn deleted the alex_military_clock branch December 5, 2024 19:42
2 participants