Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom extension attributes' introspection and usability #3707

Closed
bdewilde opened this issue May 8, 2019 · 6 comments · Fixed by #3729
Closed

custom extension attributes' introspection and usability #3707

bdewilde opened this issue May 8, 2019 · 6 comments · Fixed by #3729
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects

Comments

@bdewilde
Copy link

bdewilde commented May 8, 2019

Feature description

Hello! I'm exploring ways to better leverage spacy's increased customizability in textacy, and one option is to replace the textacy.Doc wrapper with a collection of custom extension attributes added to the regular spacy.Doc. Seems doable so far, but I've run into a couple significant usability issues:

  • tab-completion doesn't work on these extensions (at least not in a jupyter notebook), since doc._. + TAB only shows doc_extensions, get, has, mutable_types (which maybe we don't need to see?), set, span_extensions, and token_extensions. I don't know if there's actually any way to store the extensions in a tab-completable way, but it would be super helpful to users.
  • docstrings for the functions behind extensions don't carry through, seemingly because of the usage of functools.partial. This is a known problem, but possible solutions are controversial:
Signature:     
doc._.as_bag_of_words(
    normalize='lemma',
    weighting='count',
    as_strings=False,
)
Call signature: doc._.as_bag_of_words(*args, **kwargs)
Type:           partial
String form:    functools.partial(<function as_bag_of_words at 0x11fea32f0>, This is an example sentence.)
File:           ~/.pyenv/versions/3.7.0/lib/python3.7/functools.py
Docstring:     
partial(func, *args, **keywords) - new function with partial application
of the given arguments and keywords.

Together, these issues makes it harder for users to 1. know what extensions are available to them, and 2. understand what they do and how to use them. The existing functionality — Doc.has_extension() and Doc.get_extension(), or just doc._.doc_extensions — work well for code, but their outputs aren't human-friendly:

>>> doc._.doc_extensions
{'lang': (None, None, <function __main__.get_lang(doc)>, None),
 'tokens': (None, None, <function __main__.get_tokens(doc)>, None),
 'metadata': (None,
  None,
  <function __main__.get_metadata(doc)>,
  <function __main__.set_metadata(doc, value)>),
 'n_tokens': (None, None, <function __main__.get_n_tokens(doc)>, None),
 'n_sents': (None, None, <function __main__.get_n_sents(doc)>, None),
 'to_tokenized_text': (None,
  <function __main__.to_tokenized_text(doc)>,
  None,
  None),
 'to_tagged_text': (None, <function __main__.to_tagged_text(doc)>, None, None),
 'to_bag_of_words': (None,
  <function __main__.to_bag_of_words(doc, normalize='lemma', weighting='count', as_strings=False)>,
  None,
  None),
 'to_terms_list': (None,
  <function __main__.to_terms_list(doc, ngrams=(1, 2, 3), named_entities=True, normalize='lemma', as_strings=False, **kwargs)>,
  None,
  None),
 'to_bag_of_terms': (None,
  <function __main__.to_bag_of_terms(doc, ngrams=(1, 2, 3), named_entities=True, normalize='lemma', weighting='count', as_strings=False, **kwargs)>,
  None,
  None),
 'to_semantic_network': (None,
  <function __main__.to_semantic_network(doc, nodes='words', normalize='lemma', edge_weighting='default', window_width=10)>,
  None,
  None)}

I don't know if there are any good solutions to these issues, or if you'd want to take them on, but I think they'd make extension attributes a great deal easier to work with.

@bdewilde
Copy link
Author

bdewilde commented May 9, 2019

I did some digging. I think tab-completion could be made to work if you define a custom __dir__ method on the Underscore class to go with the custom __getattr__; the Python docs explain this a bit. As I wrote before, how best to present docstrings for partial functions is "controversial", but it seems like overwriting the partial function's __doc__ with the original function's __doc__ plus a leading caveat about the first Doc / Span / Token argument already being supplied might suffice?

@ines ines added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels May 10, 2019
@ines
Copy link
Member

ines commented May 10, 2019

Thanks for looking into this – I was hoping that something like this would be possible! Having tab completion for custom attributes would definitely be really nice. If you have time to experiment with this and want to submit a PR, that'd be cool. Otherwise I'm also happy to take a look at this for a future release 🙂

@ines ines added help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) labels May 10, 2019
@ines
Copy link
Member

ines commented May 11, 2019

@bdewilde See #3729 for my first draft on this – I've only tested this in our test suite and in my Ipython shell and it works as expected for me 😃 What do you think?

Making textacy expose custom attributes and methods sounds like a really cool idea btw, so I'm excited to see this come together. doc._.as_bag_of_words is a perfect example of what I had imagined when we came up with the feature!

@bdewilde
Copy link
Author

@ines , you are such a rockstar. Sorry I didn't get to this — have been hacking on textacy for the past couple days 😅 — but glad to see you did what I'd imagined in __dir__.

As for the docstring change, it might be worth prepending a little note to the original docstring re: the initial Doc arg being pre-filled by functools, so users don't get confused by a call signature that doesn't match what their code tells them. Or not? I'm honestly not sure what users would find more intuitive.

Thanks again for the prompt turnaround!

@ines
Copy link
Member

ines commented May 11, 2019

@bdewilde Yeah, I think you're right – more details definitely won't hurt. I just updated my branch accordingly. If it all works as expected for you, I'll merge this. (Release has to wait until v2.1.5, though, since we've already triggered the wheel builds for v2.1.4 😅)

@ines ines removed help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) labels May 11, 2019
ines added a commit that referenced this issue May 11, 2019
* Add custom __dir__ to Underscore (see #3707)

* Make sure custom extension methods keep their docstrings (see #3707)

* Improve tests

* Prepend note on partial to docstring (see #3707)

* Remove print statement

* Handle cases where docstring is None
@lock
Copy link

lock bot commented Jun 11, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants