Uncontrolled datetime formats #93
Comments
Hi, thanks for clearly outlining and describing the issue! I agree with your assessment:
class BaseParser:
    reg_expressions = {}
    date_keys = ()
    multiple_match_keys = ()
    # For handling special cases in TLD parser classes
    known_date_formats = []
    # Extra formats that dateutil might not figure out
    extra_date_formats = [
        "%Y-%m-%dT%H:%M:%SZ[%Z]",  # 2007-01-26T19:10:31Z[UTC]
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2018-12-01T16:17:30.568Z
        "%Y-%m-%dT%H:%M:%S%zZ",  # 1970-01-01T02:00:00+02:00Z
        "%Y-%m-%dt%H:%M:%S.%fz",  # 2007-01-26t19:10:31.00z
        "%Y-%m-%d %H:%M:%SZ",  # 2000-08-22 18:55:20Z
        "before %b-%Y",  # before aug-1996
    ]
    # Additional timezone info for dateutil
    timezone_info = {
        "KST": tz.gettz("Asia/Seoul"),  # Korea Standard Time UTC+9
        "JST": tz.gettz("Asia/Tokyo"),  # Japan Standard Time UTC+9
        "EEST": tz.gettz("Europe/Athens"),  # Eastern European Summertime UTC+3
    }
    ...
    def _parse_date(self, date_string: str) -> Union[datetime, str]:
        """
        Attempts to convert the given date string to a datetime.datetime object
        otherwise returns the input `date_string`
        :param date_string: a date string
        :return: a datetime.datetime object
        """
        def _datetime_or_none(dt_string: str, dt_format: str) -> Union[datetime, None]:
            try:
                return datetime.strptime(dt_string, dt_format)
            except ValueError:
                return None
        # first, try the known formats
        for date_format in self.known_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # next, try dateutil.parse
        try:
            clean_date_string = re.sub(r"\(([^)]+)\)", r"\1", date_string).strip()
            return parse(clean_date_string, tzinfos=self.timezone_info)
        except ParserError:
            pass
        # finally, try extra formats
        for date_format in self.extra_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # no luck parsing
        return date_string
I also added a modified version of your example script as a test:
from datetime import datetime
from asyncwhois.parse import BaseParser
def test_dateparsers():  # noqa
    date_and_time_examples = [
        "2010-07-04 04:18:23 +03:00",
        "2024-02-19 01:30:15.927683+11",
        "2008-08-31 04:14:06 KST",
        "2024/01/01 01:05:04 (JST)",
        "07 Aug 2024",
    ]
    date_and_time_examples += [
        "02-jan-2000",
        "11-February-2000",
        "20-10-2000",
        "2000-01-02",
        "2.1.2000",
        "2000.01.02",
        "2000/01/02",
        "2011/06/01 01:05:01",
        "2011/06/01 01:05:01 (+0900)",
        "20170209",
        "20110908 14:44:51",
        "02/01/2013",
        "2000. 01. 02.",
        "2014.03.08 10:28:24",
        "24-Jul-2009 13:20:03 UTC",
        "Tue Jun 21 23:59:59 GMT 2011",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26T19:10:31Z[UTC]",  # extra
        "2018-12-01T16:17:30.568Z",  # extra
        "2011-09-08T14:44:51.622265+03:00",
        "2013-12-06T08:17:22-0800",
        "1970-01-01T02:00:00+02:00Z",  # extra
        "2011-09-08t14:44:51.622265",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26t19:10:31.00z",  # extra
        "2011-03-30T19:36:27+0200",
        "2011-09-08T14:44:51.622265+03:00",
        "2000-08-22 18:55:20Z",  # extra
        "2000-08-22 18:55:20",
        "08 Apr 2013 05:44:00",
        "23/04/2015 12:00:07",
        "23/04/2015 12:00:07 EEST",
        "23/04/2015 12:00:07.619546 EEST",
        "2015-04-23 12:00:07.619546",
        "August 14 2017",
        "08.03.2014 10:28:24",
        "Tue Dec 12 2000",
        "before aug-1996",  # extra
        "2017-09-26 11:38:29 (GMT+00:00)",
    ]
    bp = BaseParser()
    for dt in date_and_time_examples:
        result = bp._parse_date(dt)
        assert isinstance(result, datetime), f"Failed to parse date string: {dt}"
Let me know if this looks OK or if there is anything I may have overlooked in your suggestion.
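For reference, the six entries marked `# extra` can be checked against their `strptime` formats in isolation, without dateutil. This is only a sketch: the `EXTRA` list is a hypothetical name, and its pairs are copied from the comments in `extra_date_formats`. It assumes Python 3.7 or later, where `%z` accepts an offset written with a colon.

```python
from datetime import datetime

# (format, example) pairs copied from the comments in extra_date_formats.
EXTRA = [
    ("%Y-%m-%dT%H:%M:%SZ[%Z]", "2007-01-26T19:10:31Z[UTC]"),
    ("%Y-%m-%dT%H:%M:%S.%fZ", "2018-12-01T16:17:30.568Z"),
    ("%Y-%m-%dT%H:%M:%S%zZ", "1970-01-01T02:00:00+02:00Z"),
    ("%Y-%m-%dt%H:%M:%S.%fz", "2007-01-26t19:10:31.00z"),
    ("%Y-%m-%d %H:%M:%SZ", "2000-08-22 18:55:20Z"),
    ("before %b-%Y", "before aug-1996"),
]

# strptime raises ValueError if any pair does not match.
parsed = [datetime.strptime(example, fmt) for fmt, example in EXTRA]

for (fmt, example), value in zip(EXTRA, parsed):
    print(f"{example!r} -> {value.isoformat()}")
```

Only the `%z` entry yields a timezone-aware datetime here. The others come back naive, which may matter if the results are later compared with aware values.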
We've been using your library for a while, and we have seen quite a few validation errors caused by datetime formats that the parser does not handle. About two hundred of the failing date-time strings use one of these formats:
2010-07-04 04:18:23 +03:00
2024-02-19 01:30:15.927683+11
07 Aug 2024
2008-08-31 04:14:06 KST
2024/01/01 01:05:04 (JST)
dateutil.parser.parse was able to handle the first three formats. By adding a function that removes the "(" and ")" and passing tzinfos, I was able to convert all of them properly:
def clean_timezone(date_str):
    # Strip the parentheses around a timezone token, e.g. "(JST)" -> "JST"
    return re.sub(r'\(([^)]+)\)', r'\1', date_str).strip()
... Existing code ...
tzinfos = {
    'KST': tz.gettz('Asia/Seoul'),  # Korea Standard Time UTC+9
    'JST': tz.gettz('Asia/Tokyo'),  # Japan Standard Time UTC+9
}
....
if isinstance(whois_info['created'], str):
    created = parse(clean_timezone(whois_info['created']), tzinfos=tzinfos)
else:
    created = whois_info['created']
...
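To make the parenthesis handling concrete, here is a minimal stdlib-only sketch of the cleaning step on its own. It assumes the intended pattern matches literal parentheses, so they are escaped as `\(` and `\)`. The name `strip_parens` is illustrative and not part of asyncwhois.

```python
import re

# Matches a parenthesised group such as "(JST)" and captures its contents.
_PAREN_GROUP = re.compile(r"\(([^)]+)\)")


def strip_parens(date_str: str) -> str:
    """Replace "(X)" with "X" so the timezone token is left bare."""
    return _PAREN_GROUP.sub(r"\1", date_str).strip()


print(strip_parens("2024/01/01 01:05:04 (JST)"))    # 2024/01/01 01:05:04 JST
print(strip_parens("2011/06/01 01:05:01 (+0900)"))  # 2011/06/01 01:05:01 +0900
print(strip_parens("2008-08-31 04:14:06 KST"))      # 2008-08-31 04:14:06 KST
```

Without the backslashes the outer parentheses form a regex group rather than matching the literal characters, so the substitution leaves "(JST)" in place.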
I noticed you're using a list of date_time formats and trying to parse the string with these formats:
asyncwhois/asyncwhois/parse.py
Lines 337 to 350 in b9cdc92
I think you would be better off trying dateutil.parser first, with a few tzinfos, and falling back to a reduced list of explicit formats only when it fails.
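The proposed order can be sketched roughly as follows. This is an illustration, not the library's code: `parse_with_fallback` and `FALLBACK_FORMATS` are hypothetical names, and `datetime.fromisoformat` stands in for `dateutil.parser.parse` so the sketch runs with the standard library alone. Catching `ValueError` would also cover dateutil, because its `ParserError` is a `ValueError` subclass.

```python
from datetime import datetime
from typing import Callable, Sequence, Union


def parse_with_fallback(
    date_string: str,
    general_parser: Callable[[str], datetime],
    fallback_formats: Sequence[str],
) -> Union[datetime, str]:
    """Try a general parser first, then a short list of explicit formats."""
    try:
        return general_parser(date_string)
    except ValueError:
        pass  # fall through to the explicit formats
    for fmt in fallback_formats:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    return date_string  # nothing matched; hand the input back unchanged


# Only the formats the general parser cannot handle need to be listed.
FALLBACK_FORMATS = ["before %b-%Y"]

# datetime.fromisoformat is a stand-in for dateutil.parser.parse here.
print(parse_with_fallback("2015-04-23 12:00:07", datetime.fromisoformat, FALLBACK_FORMATS))
print(parse_with_fallback("before aug-1996", datetime.fromisoformat, FALLBACK_FORMATS))
print(parse_with_fallback("not a date", datetime.fromisoformat, FALLBACK_FORMATS))
```

Passing the general parser in as an argument keeps the fallback logic testable on its own. With dateutil, the call could be wrapped in a small function or `functools.partial` that supplies `tzinfos`.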
I have attached a Python script (as a .txt file) that tests all of your formats plus these 5, and times them with dateutil.parser and with your function. Six formats are commented out; those are the only ones you would still need to handle with an except branch and your method rather than with dateutil.parser. The tests might not be 100% accurate, because your list could be ordered by popularity, in which case running the same number of tests for each format is not the fairest comparison. However, I think the difference in time is large enough to indicate that the new method might be better.
asynwhois_vs_custom_tests.txt
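The caveat about list order can be seen with a small timing harness. This sketch is not the attached script. `FORMATS` and `try_formats` are hypothetical names, and it only measures how the cost of a format list depends on where the matching format sits. Absolute numbers will vary by machine, so none are shown.

```python
import timeit
from datetime import datetime
from typing import Optional

# A short, hypothetical format list; the last entry is the most expensive to reach.
FORMATS = ["%d-%b-%Y", "%Y.%m.%d", "%Y/%m/%d", "%Y-%m-%d %H:%M:%S"]


def try_formats(date_string: str) -> Optional[datetime]:
    """Return the first successful strptime result, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    return None


samples = {"first format": "02-jan-2000", "last format": "2000-08-22 18:55:20"}
for label, sample in samples.items():
    seconds = timeit.timeit(lambda: try_formats(sample), number=2000)
    print(f"{label}: {seconds:.4f}s for 2000 calls")
```

A string matching a late entry pays for every failed attempt before it, so a fair benchmark would weight each format by how often it appears in real WHOIS output.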