Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle military clock time (0800) in time standardizer. #1056

Merged
merged 4 commits into from
Dec 5, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions lib/sycamore/sycamore/tests/unit/transforms/test_standardizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
DateTimeStandardizer,
ignore_errors,
)
import pytest
import unittest
from datetime import date, datetime

Expand Down Expand Up @@ -266,3 +267,41 @@ def test_ignore_errors_key_missing(self):
key_path = ["nonExistentKey"]
expected_output = {"event": {"coolKey": ""}}
self.assertEqual(ignore_errors(doc, standardizer, key_path), expected_output)


@pytest.mark.parametrize(
"raw, want",
[
("March 17, 2023, 14.25 Local", "2023-03-17 14:25:00"),
("March 17, 2023, 14.25", "2023-03-17 14:25:00"),
("March 17, 2023 14:25:00", "2023-03-17 14:25:00"),
("March 17, 2023 2:25PM", "2023-03-17 14:25:00"),
("March 17, 2023 2:25AM", "2023-03-17 02:25:00"),
("17 March 2023 14:25", "2023-03-17 14:25:00"),
("2023-07-15 10.30.00", "2023-07-15 10:30:00"),
("15/07/2023 10.30.00", "2023-07-15 10:30:00"),
("2023-07-15 10.30.00 Local", "2023-07-15 10:30:00"),
("2023-07-15 10.30.00PDT", "2023-07-15 10:30:00-07:00"),
Copy link
Contributor

@karanataryn karanataryn Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to preserve the timezone? Why not convert to UTC time? That seems far easier to parse and a better basis especially when others don't have any timezone mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not my decision to make. I assume the reason is that sometimes the timezone isn't specified in either the document or the query and it's better to be in a position to do late binding. The way datetime and dateparser work is that timezone can be specified if known and will be used if specified. It's easy to convert a datetime to UTC if that's the goal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we want to 'standardize' it, isn't standardizing to one timezone part of the process? It seems like that's likely to reduce errors as well in that case. Also, how does OpenSearch do with timezones?

("2024/6/01 23:59:59 PDT", "2024-06-01 23:59:59-07:00"),
("2024/12/04 15:25:39 PST", "2024-12-04 15:25:39-08:00"),
("03/02/1995 0815 CST", "1995-03-02 08:15:00-06:00"),
("03/02/1995 081500 CST", "1995-03-02 08:15:00-06:00"),
("3/2/95 0815", "1995-03-02 08:15:00"),
("1995-03-02 0815 CST", "1995-03-02 08:15:00-06:00"),
("1995-03-02 081500 CST", "1995-03-02 08:15:00-06:00"),
("4/30/1970 10:15:00 JST", "1970-04-30 10:15:00+09:00"),
("1/2/2034 12:13:14 GMT", "2034-01-02 12:13:14+00:00"),
("2034-01-02T12:13:14+00:00", "2034-01-02 12:13:14+00:00"),
("", "fail"),
("wrongdate", "fail"),
("2023123-07-15 10.30.00 Local", "fail"),
("April 1, 1999 1259", "fail"),
],
)
def test_date_fixer(raw, want):
try:
dt = DateTimeStandardizer.fixer(raw)
s = dt.isoformat(sep=" ", timespec="seconds")
except ValueError:
s = "fail"
assert s == want
37 changes: 35 additions & 2 deletions lib/sycamore/sycamore/transforms/standardizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,9 @@ class DateTimeStandardizer(Standardizer):
"""

DEFAULT_FORMAT = "%B %d, %Y %H:%M:%S%Z"
clock_re = re.compile(r"\d:[0-5]\d")
year_re = re.compile(r"([12]\d\d\d-)|(/[12]\d\d\d)|(\d/[0-3]?\d/\d)")
digitpair_re = re.compile(r"([0-2]\d)([0-5]\d)(\d\d)?")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would appreciate comments for the regexes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had hoped the names were descriptive. Do you want examples of what they match? Not sure how to make regexes clearer without just explaining the grammar of regexes, which is already better done on various web pages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples would be good, or even a short explanation of their functionality. It is very dense right now, and not easy to understand without trying to make sense of the regex itself.


@staticmethod
def fixer(raw_dateTime: str) -> datetime:
Expand All @@ -205,10 +208,11 @@ def fixer(raw_dateTime: str) -> datetime:
"""
assert raw_dateTime is not None, "raw_dateTime is None"
try:
raw_dateTime = raw_dateTime.strip()
raw_dateTime = DateTimeStandardizer.preprocess(raw_dateTime)
raw_dateTime = raw_dateTime.replace("Local", "")
raw_dateTime = raw_dateTime.replace("local", "")
raw_dateTime = raw_dateTime.replace(".", ":")
logging.error(f"FIXME {raw_dateTime}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops.

parsed = dateparser.parse(raw_dateTime)
if not parsed:
raise ValueError(f"Invalid date format: {raw_dateTime}")
Expand All @@ -222,6 +226,35 @@ def fixer(raw_dateTime: str) -> datetime:
# Handle any other exceptions
raise RuntimeError(f"Unexpected error occurred while processing: {raw_dateTime}") from e

@staticmethod
def preprocess(raw: str) -> str:
Copy link
Contributor

@karanataryn karanataryn Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's only meant for military time, so I'd rather rename this to miltary_time_preprocess, given that we're doing some processing in the main function as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine. I was thinking other preprocessing might go here too. Either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always rename it at that point.

# Fix up military clock time with just digits (0800)
raw = raw.strip()
tokens = raw.split()
saw_clock = 0
saw_year = 0
saw_digits = 0
for token in tokens:
if DateTimeStandardizer.clock_re.search(token):
saw_clock += 1
elif DateTimeStandardizer.year_re.search(token):
saw_year += 1
elif DateTimeStandardizer.digitpair_re.fullmatch(token):
saw_digits += 1
# If unsure there's exactly one military clock time, bail out.
# Note that numbers like 2024 could be times or years.
if (saw_clock > 0) or (saw_year == 0) or (saw_digits != 1):
return raw
pieces: list[str] = []
for token in tokens:
if match := DateTimeStandardizer.digitpair_re.fullmatch(token):
clock = ":".join([x for x in match.groups() if x])
before = token[: match.start(0)]
after = token[match.end(0) :]
token = before + clock + after
pieces.append(token)
return " ".join(pieces)

@staticmethod
def standardize(
doc: Document,
Expand Down Expand Up @@ -305,7 +338,7 @@ def ignore_errors(doc: Document, standardizer: Standardizer, key_path: list[str]
try:
doc = standardizer.standardize(doc, key_path=key_path)
except KeyError:
logger.warn(f"Key {key_path} not found in document: {doc}")
logger.warning(f"Key {key_path} not found in document: {doc}")
except Exception as e:
logger.error(e)
return doc
Loading