Speed-Up Preprocessing + NLP #162
base: master
Conversation
support MultiIndex as function parameter: returns MultiIndex where Representation was returned (missing: correct test). Co-authored-by: Henri Froese <hf2000510@gmail.com>
missing: adapt tests for new types. Co-authored-by: Henri Froese <hf2000510@gmail.com>
WIP. Co-authored-by: Henri Froese <hf2000510@gmail.com>
…nto decorator_for_parallelization
missing documentation. Co-authored-by: Henri Froese <hf2000510@gmail.com>
@jbesomi, we have now tested more with Modin and thought about different implementations. The problem is that there is no
Options 3+4 seem like a bad idea. Option 1 is the easiest, as it's already ready to merge and gives users a great speed boost for data that fits into RAM (probably the case for most users). Option 2 is probably the best for users with big datasets, as they will profit from the whole Modin framework (partitioning etc.) and can use our library fully parallelized; of course, it's a little more coding for us, but not that much 🥴 🥈. Interested in what you think.
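For a sense of what Option 2 would mean on the user side, Modin's documented drop-in usage looks roughly like this (not from the PR; the file name and the `hero.clean` call on a Modin Series are hypothetical here):

```python
# Modin as a drop-in pandas replacement: only the import changes for the user.
# Whether texthero functions run fully parallelized on the resulting Series is
# exactly what Option 2 would have to provide; the call below is hypothetical.
import modin.pandas as pd   # instead of: import pandas as pd
import texthero as hero

df = pd.read_csv("big_dataset.csv")        # Modin partitions the frame across cores
df["clean_text"] = hero.clean(df["text"])  # hypothetical in this context
```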
Thank you for the detailed comments!
We have now created a notebook that compares the two implementations (1 and 4). The conversion, as you can see at the bottom of the notebook, is quite cheap, but the computation part takes about twice as long as what we could achieve. 🥇
We have now merged #157, i.e. the current master, into this PR and are ready for review/merge 🐤 🙏
TODO:
We spent a lot of time looking at every possible option to speed up the library. Main findings:

- `str.replace` and `s.apply(re.sub)` are equally fast (see Speed up preprocessing module #124) -> we don't need to change this (a rough sketch of that comparison is below)
- `pd.apply` inside the library
- python multiprocessing decorator
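For reference, the kind of comparison behind the first finding could be reproduced roughly like this (not taken from #124; the Series contents and repetition count are made up):

```python
# Quick-and-dirty timing of pandas' built-in str.replace vs. an explicit
# apply(re.sub) over the same Series; both take a similar amount of time.
import re
import timeit

import pandas as pd

s = pd.Series(["Some text with 123 digits and   extra whitespace!"] * 100_000)

t_str_replace = timeit.timeit(
    lambda: s.str.replace(r"\d+", "", regex=True), number=5
)
t_apply_sub = timeit.timeit(
    lambda: s.apply(lambda text: re.sub(r"\d+", "", text)), number=5
)

print(f"str.replace:   {t_str_replace:.2f}s")
print(f"apply(re.sub): {t_apply_sub:.2f}s")
```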
The only way we see to parallelize things while still keeping a great user experience and not slowing down users with small and medium-sized datasets is the following: implement a function `helper.parallel` that handles parallelization for Pandas Series and use that to wrap our functions (an example is sketched below). So we keep all functionality that is not parallelizable (like initializing patterns, downloading a spaCy model, ...) in a function `f` with a nice docstring that the user can see. At the end, this function calls `helper.parallel(s, _f, other arguments)`, where `_f` houses all the parallelizable functionality.
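A minimal sketch of that pattern, assuming a made-up `remove_digits` function; only the `helper.parallel(s, _f, ...)` call shape comes from the description above:

```python
# Sketch of the wrapping pattern described above. remove_digits is the
# user-facing function with the docstring; _remove_digits is the parallelizable
# part handed to helper.parallel together with the extra arguments it needs.
import re

import pandas as pd

from texthero import helper   # assumed module layout


def _remove_digits(s: pd.Series, pattern) -> pd.Series:
    """Parallelizable part: applied independently to each chunk of the Series."""
    return s.apply(lambda text: pattern.sub("", text))


def remove_digits(s: pd.Series) -> pd.Series:
    """Remove digits from every document in the Series (user-visible docstring)."""
    pattern = re.compile(r"\d+")   # non-parallelizable setup stays in the outer function
    return helper.parallel(s, _remove_digits, pattern)
```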
Here's the `helper.parallel` implementation; it should be rather self-explanatory (a sketch follows below). We know that this does introduce some added complexity for developers in the `nlp` and `preprocessing` modules, but we believe it's the best solution for users (automatic parallelization and the same user experience as before), and with some more developer documentation, other contributors should not have any problems.

MultiProcessing Implementation in Texthero.pdf
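The actual implementation lives in the attached PDF and the diff rather than in this excerpt; as a rough sketch, a `helper.parallel` along these lines could be built with the standard library (the per-core chunking and the `MIN_LINES_FOR_PARALLELIZATION` threshold are assumptions, not the PR's values):

```python
# Sketch only: split the Series into one chunk per core, run func on each chunk
# in a separate process, and glue the results back together in order.
import multiprocessing

import numpy as np
import pandas as pd

MIN_LINES_FOR_PARALLELIZATION = 10_000  # assumed threshold; below it, overhead dominates


def _run_chunk(payload):
    """Unpack one work item; module-level so it can be pickled by multiprocessing."""
    chunk, func, args, kwargs = payload
    return func(chunk, *args, **kwargs)


def parallel(s: pd.Series, func, *args, **kwargs) -> pd.Series:
    # Small inputs: process start-up cost outweighs any gain, so run sequentially.
    if len(s) < MIN_LINES_FOR_PARALLELIZATION:
        return func(s, *args, **kwargs)

    num_chunks = multiprocessing.cpu_count()
    chunks = np.array_split(s, num_chunks)   # keeps each chunk's original index

    with multiprocessing.Pool(num_chunks) as pool:
        results = pool.map(_run_chunk, [(c, func, args, kwargs) for c in chunks])

    return pd.concat(results)                # chunks come back in order
```

One practical point with this kind of approach: under spawn-based start methods (Windows, recent macOS) the wrapped functions have to live at module level so they can be pickled, which the wrapping pattern sketched above already satisfies.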
Of course, lots of tests are added to `test_helper.py` (a sketch of the kind of check is below).

Note: so many lines changed only because this builds upon #157.
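The tests themselves aren't shown here; the essential check is presumably that the parallel path returns exactly what the sequential call would, along these lines (reusing the `parallel` sketch above, not the real module):

```python
# Sketch of the kind of equality check such a test could make; `parallel` refers
# to the sketch above, not to the actual helper module from the PR.
import pandas as pd
from pandas.testing import assert_series_equal


def _upper(s: pd.Series) -> pd.Series:
    return s.str.upper()


def test_parallel_matches_sequential():
    s = pd.Series(["lorem ipsum", "dolor sit amet"] * 20_000)
    # The parallelized result must be identical (values and index) to the plain call.
    assert_series_equal(parallel(s, _upper), _upper(s))
```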