Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update natural language parser #48

Merged
merged 8 commits into from
May 22, 2024
Merged

Conversation

ColeDCrawford
Copy link
Contributor

@ColeDCrawford ColeDCrawford commented May 13, 2024

This PR updates the natural language parser to work with the 2018 spec. As noted in the EDTF docs,

This specification differs from the earlier draft as follows:

  • the unspecified date character (formerly lower case ‘u’) is superseded by the character (upper case) 'X';
  • Masked precision is eliminated;
  • the uncertain and approximate qualifiers, '?' and '~', when applied together, are combined into a single qualifier character '%';
  • “qualification from the left” is introduced and replaces the grouping mechanism using parentheses;
  • the extended interval syntax keywords 'unknown' and 'open' have been replaced with null and the double-dot notation ['..'] respectively;
  • the year prefix 'y' and the exponential indicator 'e', both previously lowercase, are now 'Y' and 'E' (uppercase); and
  • the significant digit indicator 'p' is now 'S' (uppercase).

Checklist for the PR:

  • Replace the unspecified date character u with X
  • Remove masked precision
  • Combine ?~ into % for uncertain + approximate
  • Use qualification from the left rather than parentheses for grouping
  • Replace unknown with null syntax in intervals
  • Replace open with .. syntax syntax in intervals
  • Replace y, e, and p with Y, E, and S syntax for exponential years and significant digits

In the parser:
- Update regular expressions for SHORT_YEAR_RE and LONG_YEAR_RE to use X instead of x and u and Y instead of y
- Replaced`unknown` with null as per the 2018 spec. It does not look like python-edtf currently has open intervals (`open` before, `..` now)?
- Replaced `?~` with `%`

In the tests:
- eliminate masked precision - no u/x just X for unknown regardless of why the data is missing
- replace unknown with null
- replace ~? with %
@ColeDCrawford
Copy link
Contributor Author

ColeDCrawford commented May 13, 2024

@aweakley do you want to remove the old parentheses grouping examples from BAD_EXAMPLES here? Those are from the old spec.

Do we have any examples of the difference between null and .. for extended intervals? It doesn't look like the current version of python-edtf has any open syntax?

@aweakley
Copy link
Member

Do the parentheses ones make things more complex? I quite like that it's clear from the tests that inputs like that will raise an error, but if it'll make our lives easier to get rid of them then I'm happy with that.

# Group qualification: a qualification character to the immediate right of a component applies
# to that component as well as to all components to the left.
# year, month, and day are uncertain and approximate
('2004-06-11%', ('2004-06-11', '2004-06-09', '2004-06-13')),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this one be a broader range, given the year and month are also uncertain and approximate?

Copy link
Contributor Author

@ColeDCrawford ColeDCrawford May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. This points at some other potential implementation issues. In the upper_fuzzy() / lower_fuzzy() docs there is this example:

>>> e = parse_edtf('1912-04~')
>>> e.lower_fuzzy()[:3]  # padding is 100% of a month
(1912, 3, 1)
>>> e.upper_fuzzy()[:3]
(1912, 5, 30)

1912-04~ would imply that the whole date should be padded ("year-month approximate"), right? Ending up with bounds of "1911-04-01" and "1913-05-31"? Whereas 1912-~04 (~ qualification to the left of 04) would be "month approximate" and we would pad by 100% of the month for bounds of "1912-3-1" and "1912-5-30".

1912-04~ and 2004-06-11% both are parsed to the L1 UncertainOrApproximate class, since the only qualification symbol is on the far right and applies to the entire EDTF object. 1912-~04 instead parses to the L2 PartialUncertainOrApproximate class. I think the problem with UncertainOrApproximate is that it currently uses the date.precision to determine the bounds. For 1912-04~, it determines there is month precision and applies 100% * 1 month of padding:

>>> from edtf.parser.grammar import parse_edtf as parse
>>> e = parse('1912-04~')
>>> e
UncertainOrApproximate: '1912-04~'
>>> e.ua
UA: '~'
>>> e.date.precision
'month'
>>> e.ua._get_multiplier()
1.0
>>> e.lower_fuzzy()[:3]
(1912, 3, 1)
>>> d = parse('2004-06-11%')
>>> d
UncertainOrApproximate: '2004-06-11%'
>>> d.date.precision
'day'
>>> d.lower_fuzzy()[:3]
(2004, 6, 9)

I think it's clear that _get_fuzzy_padding() method needs to be updated to account for the fact that the fuzziness is applied to the entire date, not just the rightmost part.

A remaining question: when the qualification symbol applies to multiple parts of the date, how should we express that in terms of fuzziness? 100% of the largest part (e.g. year) or also fuzz the other parts?

  • Ex: 2004-06-11% fuzzing just the year gives bounds of "2002-06-11" and "2006-06-11", vs fuzzing on all parts of the date gives bounds of "2002-04-09" and "2006-08-13" (200% of y m d given that it is both uncertain and approximate)

Copy link
Member

@aweakley aweakley May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can say 2004%-06-11 instead if you want the narrow range. So I think we should fuzz all parts to the left of the % symbol, and we'd get the broader "2002-04-09" - "2006-08-13" range.

@ColeDCrawford
Copy link
Contributor Author

Do the parentheses ones make things more complex? I quite like that it's clear from the tests that inputs like that will raise an error, but if it'll make our lives easier to get rid of them then I'm happy with that.

Nope, they fail just fine as they currently are so I'm also happy to leave them.

@ColeDCrawford
Copy link
Contributor Author

Not sure why the tests are hanging but they look like they are passing if you click in.

I've updated the tests, specifically EDTF level 1 "Qualification of a date (complete)" dates. These parse as UncertainOrApproximate classes.

@ColeDCrawford ColeDCrawford marked this pull request as ready for review May 21, 2024 13:49
@aweakley
Copy link
Member

This looks great, thank you.

@aweakley aweakley merged commit 077eac5 into ixc:v5 May 22, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants