Improve Stemming #4

Open
seanmacavaney opened this issue Feb 24, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@seanmacavaney
Collaborator

How easily can we support other stemmers? This will be necessary for e.g., the TREC NeuCLIR track.

Right now, performing stemming as a pipeline step with the Pisa stemmer set to none works. But it may be more convenient for users if the stemmer could be a property of the PisaIndex itself.

As suggested by @amallia, we also need to check whether the Krovetz stemmer works as intended: it was added to PISA but not tested.

@seanmacavaney added the enhancement (New feature or request) label Feb 24, 2022
@cmacdonald
Contributor

cmacdonald commented Feb 24, 2022

A stemmer in Pisa is any function that maps std::string to std::string (see https://github.com/pisa-engine/pisa/blob/master/include/pisa/query/term_processor.hpp#L14). A Python function could be mapped to that. This could be done by detecting a Python callable being passed as a stemmer, and somehow wrapping it so that it matches that signature.

stemmer = lambda s : s.replace("ed", "")
# proposed usage: PisaIndex would need to accept a Python callable here
index = PisaIndex('./mystem', stemmer=stemmer)
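To make the "detect a Python callable" idea concrete, here is a minimal sketch of the dispatch step only. Everything in it is an assumption for illustration: `resolve_stemmer` and `BUILTIN_STEMMERS` are hypothetical names, the set of built-in stemmer names is a guess, and treating `None` as "no stemming" is a choice made for this sketch. It does not show how a callable would be wrapped on the C++ side.

```python
from typing import Callable, Optional, Tuple

# Hypothetical list of stemmer names handled natively; the real set
# depends on the PISA build and is only assumed here.
BUILTIN_STEMMERS = {"porter2", "krovetz", "none"}


def resolve_stemmer(stemmer) -> Tuple[str, Optional[Callable[[str], str]]]:
    """Split a user-supplied stemmer into (native stemmer name, Python callable).

    A callable is kept on the Python side and the native stemmer is
    switched off, so that a term is never stemmed twice.
    """
    if stemmer is None:
        # Assumption for this sketch: None means "no stemming".
        return "none", None
    if callable(stemmer):
        return "none", stemmer
    if isinstance(stemmer, str) and stemmer in BUILTIN_STEMMERS:
        return stemmer, None
    raise ValueError(f"unsupported stemmer: {stemmer!r}")
```

With this split, the caller would apply the returned callable (if any) itself and pass only the name down to the native index.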

On the other hand, as you acknowledge, a PyTerrier workaround would simply be to apply a pre-processing transformer:

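import pyterrier as pt
from pyterrier_pisa import PisaIndex

stemmer = lambda s : s.replace("ed", "")
pipe = (
  # stem each space-separated token, then rejoin into a single string
  pt.apply.text( lambda row: " ".join(stemmer(t) for t in row["text"].split(" ")) )
  >> PisaIndex('./mystem', stemmer=None)
)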

(I of course acknowledge that there are better tokenisations than str.split(" "))
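As a rough illustration of a slightly better tokenisation for the pre-processing route, the sketch below lowercases the text, extracts word tokens with a regular expression, and stems each one. The names `stem_text`, `toy_stemmer` and `TOKEN_RE` are made up for this example, and the toy stemmer is only a stand-in for a real one.

```python
import re
from typing import Callable

# Word-character tokeniser: drops punctuation instead of splitting on spaces only.
TOKEN_RE = re.compile(r"\w+")


def stem_text(text: str, stemmer: Callable[[str], str]) -> str:
    """Lowercase, tokenise, stem each token, and rejoin with single spaces."""
    tokens = TOKEN_RE.findall(text.lower())
    stemmed = (stemmer(tok) for tok in tokens)
    # Skip tokens that the stemmer reduced to an empty string.
    return " ".join(tok for tok in stemmed if tok)


def toy_stemmer(token: str) -> str:
    """Illustrative stand-in: strip a trailing 'ed' from longer tokens."""
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    return token
```

In a PyTerrier pipeline this could replace the lambda in the text transformer, e.g. `pt.apply.text(lambda row: stem_text(row["text"], toy_stemmer))`, with the same function applied to queries so that both sides are stemmed consistently.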

@seanmacavaney
Collaborator Author

I think I prefer the PyTerrier route over trying to shoehorn a Python function into Pisa for stemming. It's the same suggestion we give for Terrier indexing as well.

@cmacdonald
Contributor

Yes, in hindsight, it would need more changes in https://github.com/pisa-engine/pisa/blob/master/src/query/term_processor.cpp


2 participants