Uncontrolled datetime formats #93

Closed
antalgu opened this issue Nov 1, 2024 · 1 comment

Comments

Contributor

antalgu commented Nov 1, 2024

We've been using your library for a while and have seen quite a few validation errors caused by datetime formats it does not handle. About two hundred of the affected date-time strings fall into these formats:

2010-07-04 04:18:23 +03:00
2024-02-19 01:30:15.927683+11
07 Aug 2024
2008-08-31 04:14:06 KST
2024/01/01 01:05:04 (JST)

The parser from dateutil (dateutil.parser.parse) handled the first three formats out of the box. By adding a helper that strips the "(" and ")" and passing tzinfos, I was able to convert all of them properly:

import re

def clean_timezone(date_str):
    return re.sub(r'\(([^)]+)\)', r'\1', date_str).strip()

... Existing code ...
tzinfos = {
    'KST': tz.gettz('Asia/Seoul'),  # Korea Standard Time UTC+9
    'JST': tz.gettz('Asia/Tokyo'),  # Japan Standard Time UTC+9
}
....
if isinstance(whois_info['created'], str):
    created = parse(clean_timezone(whois_info['created']), tzinfos=tzinfos)
else:
    created = whois_info['created']
...
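
To make the effect of the clean_timezone helper above concrete, here is a minimal standard-library sketch that applies the same regular expression to a few of the strings from this issue. The `samples` list is only illustrative; the helper removes the parentheses and keeps their contents, so the timezone token survives for the parser to interpret.

```python
import re


def clean_timezone(date_str: str) -> str:
    # Replace "(X)" with "X": drop the parentheses, keep what is inside them.
    return re.sub(r"\(([^)]+)\)", r"\1", date_str).strip()


# Illustrative inputs taken from the formats discussed in this issue.
samples = [
    "2024/01/01 01:05:04 (JST)",
    "2011/06/01 01:05:01 (+0900)",
    "2010-07-04 04:18:23 +03:00",  # no parentheses, should be left untouched
]

for sample in samples:
    print(f"{sample!r} -> {clean_timezone(sample)!r}")
```

Strings without parentheses pass through unchanged, so the helper can be applied to every date string before parsing.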

I noticed you keep a list of datetime formats and try each one in turn when parsing a string:

def _parse_date(date_string: str) -> Union[datetime, str]:
    """
    Attempts to convert the given date string to a datetime.datetime object
    otherwise returns the input `date_string`
    :param date_string: a date string
    :return: a datetime.datetime object
    """
    for date_format in KNOWN_DATE_FORMATS:
        try:
            date = datetime.strptime(date_string, date_format)
            return date
        except ValueError:
            continue
    return date_string

I think you would be better off using dateutil.parser with a few tzinfos as the first step, and keeping a reduced list of formats that is only tried when dateutil can't parse the string.
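
A rough sketch of that two-stage idea, under stated assumptions: `parse_with_fallback` and `FALLBACK_FORMATS` are hypothetical names, and `datetime.fromisoformat` stands in for `dateutil.parser.parse` so the snippet runs with only the standard library. Since dateutil's `ParserError` is a subclass of `ValueError`, passing `dateutil.parser.parse` as `primary` should fit the same `except` clause.

```python
from datetime import datetime
from typing import Callable, Sequence, Union

# Hypothetical reduced list: only formats the primary parser cannot handle.
FALLBACK_FORMATS = ("before %b-%Y", "%Y-%m-%d %H:%M:%SZ")


def parse_with_fallback(
    date_string: str,
    primary: Callable[[str], datetime],
    fallback_formats: Sequence[str] = FALLBACK_FORMATS,
) -> Union[datetime, str]:
    """Try a general-purpose parser first, then a short strptime list."""
    try:
        return primary(date_string)
    except (ValueError, OverflowError):
        pass  # fall through to the explicit formats
    for fmt in fallback_formats:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    # Nothing matched: return the input unchanged, like the current function.
    return date_string


# Stand-in primary parser; dateutil.parser.parse could be passed instead.
print(parse_with_fallback("2000-01-02", datetime.fromisoformat))
print(parse_with_fallback("before aug-1996", datetime.fromisoformat))
print(parse_with_fallback("not a date", datetime.fromisoformat))
```

Keeping the primary parser as a parameter is only a convenience for this sketch; in the library it would presumably be called directly.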

Attached is a Python script (as .txt) that tests all of your formats plus these 5, and compares timings between dateutil.parser and your function. Six formats are commented out; those are the only ones you would still need to handle with an except and your existing method rather than with dateutil.parser. The timings might not be fully representative: your list may be ordered by popularity, in which case running the same number of tests for each format is not the fairest method. However, I think the difference in time is large enough to suggest the new method might be better.
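
The point about list ordering can be checked with a small timing sketch. This is not the attached script: `FORMATS` is an illustrative four-entry subset and `time_parse` is a hypothetical helper. It times the try-each-format loop for a string that matches the first format and for one that matches only the last, which is where position in the list would show up.

```python
import timeit
from datetime import datetime

# Illustrative subset; the real list is much longer.
FORMATS = ["%d-%b-%Y", "%Y-%m-%d", "%Y/%m/%d %H:%M:%S", "%d %b %Y"]


def parse_by_list(date_string):
    # Same strategy as the original _parse_date: first matching format wins.
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    return date_string


def time_parse(date_string, number=2000):
    # Total seconds spent on `number` calls for one input string.
    return timeit.timeit(lambda: parse_by_list(date_string), number=number)


first_hit = time_parse("02-jan-2000")  # matches the first format
last_hit = time_parse("07 Aug 2024")   # matches only the last format
print(f"first format: {first_hit:.4f}s, last format: {last_hit:.4f}s")
```

Absolute numbers depend on the machine, so a comparison like this is only meaningful relative to another parser timed in the same run.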

tests
asynwhois_vs_custom_tests.txt

antalgu changed the title from "Unlisted datetime formats" to "Uncontrolled datetime formats" on Nov 1, 2024
Owner

pogzyb commented Nov 1, 2024

Hi,

Thanks for clearly outlining and describing the issue! I agree with your assessment: dateutil.parser.parse seems like a much better option for handling date parsing. I attempted to add your suggestions in #94. Specifically, the BaseParser class and its _parse_date method now look like this:

class BaseParser:
    reg_expressions = {}

    date_keys = ()
    multiple_match_keys = ()

    # For handling special cases in TLD parser classes
    known_date_formats = []
    # Extra formats that dateutil might not figure out
    extra_date_formats = [
        "%Y-%m-%dT%H:%M:%SZ[%Z]",  # 2007-01-26T19:10:31Z[UTC]
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2018-12-01T16:17:30.568Z
        "%Y-%m-%dT%H:%M:%S%zZ",  # 1970-01-01T02:00:00+02:00Z
        "%Y-%m-%dt%H:%M:%S.%fz",  # 2007-01-26t19:10:31.00z
        "%Y-%m-%d %H:%M:%SZ",  # 2000-08-22 18:55:20Z
        "before %b-%Y",  # before aug-1996
    ]
    # Additional timezone info for dateutil
    timezone_info = {
        "KST": tz.gettz("Asia/Seoul"),  # Korea Standard Time UTC+9
        "JST": tz.gettz("Asia/Tokyo"),  # Japan Standard Time UTC+9
        "EEST": tz.gettz("Europe/Athens"),  # Eastern European Summertime UTC+3
    }

    ...

    def _parse_date(self, date_string: str) -> Union[datetime, str]:
        """
        Attempts to convert the given date string to a datetime.datetime object
        otherwise returns the input `date_string`
        :param date_string: a date string
        :return: a datetime.datetime object
        """

        def _datetime_or_none(dt_string: str, dt_format: str) -> Union[datetime, None]:
            try:
                return datetime.strptime(dt_string, dt_format)
            except ValueError:
                return None

        # first, try the known formats
        for date_format in self.known_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # next, try dateutil.parse
        try:
            clean_date_string = re.sub(r"\(([^)]+)\)", r"\1", date_string).strip()
            return parse(clean_date_string, tzinfos=self.timezone_info)
        except ParserError:
            pass
        # finally, try extra formats
        for date_format in self.extra_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # no luck parsing
        return date_string
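
As an illustration of the final stage, the sketch below runs three of the extra formats through `datetime.strptime` on their own, using the example strings from the comments above. The `extra_examples` mapping is just for this demo, and it assumes Python 3.7 or later, where `%z` accepts an offset with a colon.

```python
from datetime import datetime

# format -> example string, copied from the extra_date_formats comments
extra_examples = {
    "%Y-%m-%dT%H:%M:%SZ[%Z]": "2007-01-26T19:10:31Z[UTC]",
    "%Y-%m-%dT%H:%M:%S%zZ": "1970-01-01T02:00:00+02:00Z",
    "before %b-%Y": "before aug-1996",
}

# Parse each example with its own format string.
parsed = {
    text: datetime.strptime(text, fmt) for fmt, text in extra_examples.items()
}

for text, result in parsed.items():
    print(f"{text!r} -> {result!r}")
```

Only the `%z` format yields a timezone-aware datetime here, so callers comparing results from different stages may need to account for a mix of aware and naive values.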

I also added and modified your example script under tests/test_dateparsers.py, which looks like:

from datetime import datetime

from asyncwhois.parse import BaseParser


def test_dateparsers():  # noqa
    date_and_time_examples = [
        "2010-07-04 04:18:23 +03:00",
        "2024-02-19 01:30:15.927683+11",
        "2008-08-31 04:14:06 KST",
        "2024/01/01 01:05:04 (JST)",
        "07 Aug 2024",
    ]
    date_and_time_examples += [
        "02-jan-2000",
        "11-February-2000",
        "20-10-2000",
        "2000-01-02",
        "2.1.2000",
        "2000.01.02",
        "2000/01/02",
        "2011/06/01 01:05:01",
        "2011/06/01 01:05:01 (+0900)",
        "20170209",
        "20110908 14:44:51",
        "02/01/2013",
        "2000. 01. 02.",
        "2014.03.08 10:28:24",
        "24-Jul-2009 13:20:03 UTC",
        "Tue Jun 21 23:59:59 GMT 2011",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26T19:10:31Z[UTC]",  # extra
        "2018-12-01T16:17:30.568Z",  # extra
        "2011-09-08T14:44:51.622265+03:00",
        "2013-12-06T08:17:22-0800",
        "1970-01-01T02:00:00+02:00Z",  # extra
        "2011-09-08t14:44:51.622265",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26t19:10:31.00z",  # extra
        "2011-03-30T19:36:27+0200",
        "2011-09-08T14:44:51.622265+03:00",
        "2000-08-22 18:55:20Z",  # extra
        "2000-08-22 18:55:20",
        "08 Apr 2013 05:44:00",
        "23/04/2015 12:00:07",
        "23/04/2015 12:00:07 EEST",
        "23/04/2015 12:00:07.619546 EEST",
        "2015-04-23 12:00:07.619546",
        "August 14 2017",
        "08.03.2014 10:28:24",
        "Tue Dec 12 2000",
        "before aug-1996",  # extra
        "2017-09-26 11:38:29 (GMT+00:00)",
    ]

    bp = BaseParser()

    for dt in date_and_time_examples:
        result = bp._parse_date(dt)
        assert isinstance(result, datetime), f"Failed to parse date string: {dt}"

Let me know if this looks OK or if there is anything I may have overlooked in your suggestion.

pogzyb closed this as completed on Nov 4, 2024