Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve annotation of fromstring #64

Merged
merged 5 commits into from
Jul 26, 2022

Conversation

DMRobertson
Copy link
Contributor

Resolves #63. See discussion there for more context.

  1. parser seems like it can be an HTMLParser. There's code in the wild
    that does this without obviously being wrong. As far as I can tell from
    the lxml Cython source, parser is a _BaseParser, which encompasses
    XMLParser, HTMLParser and the iterparse class. I wasn't sure if the
    latter ought to be allowed here too. I erred on the side of caution.

  2. It's possible for this function to return None, if the parser has
    recover=True. I couldn't find a good way to express this without
    affecting every other call to fromstring. (Perhaps one could make the
    Parser generic over some kind of Recoverable indicator type... but
    that seemed like overkill).

`parser` seems like it can be an `HTMLParser. There's code in the wild
that does this without obviously being wrong. As far as I can tell from
the lxml Cython source, `parser` is a `_Baseparser`. I wasn't sure if
the `iterparse` class ought to be allowed here too. I erred on the side
of caution.

It's possible for this function to return None, if the parser has
`recover=True`. I couldn't find a good way to express this without
affecting every other call to `fromstring`. (Perhaps one could make the
Parser generic over some kind of `Recoverable` indicator type... but
that seemed like overkill)

Resolves lxml#63.
@@ -496,8 +496,11 @@ def parse(
source: _FileSource, parser: XMLParser = ..., base_url: _AnyStr = ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTMLParser is also allowed here.

Comment on lines 4 to 6
from lxml import etree
document = etree.fromstring("<doc></doc>")
reveal_type(document) # N: Revealed type is "lxml.etree._Element"
reveal_type(document) # N: Revealed type is "Union[lxml.etree._Element, None]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since None is really a thing that you'd only get if you pass a suitable parser (and that gets in the way otherwise), maybe we can split up the declarations of fromstring() and parse() to only allow None return values if the parser argument is provided?

Admittedly, you can also change the default parser globally, which then allows a None return value here. And, in fact, it's not even guaranteed that the return value is an Element otherwise, since parsers can really return whatever they choose. However, that's such an extremely rare use case that I'd lean towards not expressing it in the type system. (OTOH, the whole point of this PR is to cater for an extremely rare use case, so …)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In typeshed we generally avoid Unions in return types, for reasons discussed in python/mypy#1693. In this case, it sounds like returning None is a very rare edge case, but if we return Element | None, we force every caller to check for None after calling this function. As you say, one improvement could be to use overloads for cases where we know the return value is definitely not None. Another approach we've used in typeshed is to return something like Element | Any. This gives you a lot of type safety but also allows callers to do None checks without hitting "unreachable code" warnings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks both!

On returning Union types: I'd read some of the discussion in python/typing#566 which sounded promising at first, but as Jukka writes here:

For new code and new APIs the recommended way is to avoid signatures that would require the use of AnyOf anyway.

I'm about to push three commits which:

  • clarify that parse accepts an HTML parser too;
  • express that fromstring() without a parser returns an _Element (the comment about changing the default parser notwithstanding; and
  • change the return type of fromstring to _Element | Any.

To be explicit: I don't have strong opinions about this PR. I like the idea of annotations that describe everything that a function can possibly do; but there's a balance to between that and pragmatism, and I'm happy to drop the | None change if it's too niche. It'd be nice to get the bit about HTMLParser in though, if nothing else!

This isn't strictly true: one can apparently change the default parser.
But it's a nice refinement.
so as to not burden existing users with having to check the None-ness of
`fromstring()` return values
@DMRobertson DMRobertson requested a review from scoder May 3, 2022 22:53
lxml-stubs/etree.pyi Outdated Show resolved Hide resolved
Co-authored-by: scoder <stefan_ml@behnel.de>
@scoder scoder merged commit 0e97999 into lxml:master Jul 26, 2022
@scoder
Copy link
Member

scoder commented Jul 26, 2022

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

return type for lxml.etree.fromstring should be Optional (?)
3 participants