Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase versions, speed & update model downloading #10

Merged
merged 12 commits into from
Jul 27, 2022

Conversation

Muennighoff
Copy link
Contributor

@Muennighoff Muennighoff commented Jun 5, 2022

Solved Issues:

  • Prior to this model downloading is slow, recipe is slow, Flair model download is complex
  • Prior to this installation fails for Py37 due to transformers rust compiler problem (Need higher transformers version)
  • Prior to this Py37 Flair usage fails due to outdated Pandas version
  • Prior to this the webapp does not work in Py37 due to ImportError: cannot import name 'Markup' from 'jinja2'

Remaining issues:

  • For it to work on Py36 + Mac, you need to add tokenizers==0.10.3 to the requirements

Other notes:

  • The NumPy version pin is due to this error in Py37 (https://stackoverflow.com/questions/65998646/pandas-error-for-creating-an-emptydataframe)
  • Removed torch version pin as I did not find any problems leaving it as is
  • In an experiment on a 10K row dataset spaCy now needs 60% of the time it needed previously
  • spaCy:
    • Adding a new model in requirements: Edit requirements, update env
    • Adding a completely new model: Edit requirements, edit plugin.json, edit spacy_utils, update env
  • Flair:
    • Adding a new model: Edit resources_init, edit recipe.py, update env
      • Could probably have a model name field instead for Flair to remove the recipe.py step, but then we'll need to catch incorrect names

Do you find a nice way to shorten any of the above for adding a new model?
Should we add the instructions & the version pin for Py36 to the plugin website?

Manual Tests:

  • Mac Py37 / Py36 (with version pin) Recipe & Webapp 👍
  • Linux Py36 Recipe & Webapp 👍

👻👻👻

@Muennighoff Muennighoff changed the title Move FLAIR to resources_init & make non-en spaCy models optional Increase versions, speed & update model downloading Jun 6, 2022
Copy link

@amandamilberg amandamilberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on

  1. Python 37, could download flair model
  2. Python 37, webapp working accordingly
  3. Ran NER recipe on new / old plugin and Spacy model did run faster (1min37 vs 1min53)

@Muennighoff I did not have an issue re-building the code env for python 36 on a mac.. it worked with your requirements.txt file and used tokenizers==0.12.1

@nicolasdalsass
Copy link

I think this PR is a good first step, however I need to mention a couple things :

  • In no way should we accept that some features are locked behind editing the plugin in dev mode. If it's not available natively, we should consider we don't support it. I think there's a missed opportunity of nicely weaving the download model macro in the new version to support extra languages and model, instead of just removing it.

  • While there's indeed an appreciable speed gain running this new version, I suppose it's mainly due to the bumping of lib version and not much else, meaning there's still something to explore regarding parallelized exec. I tested the code, and running the recipe uses 100% of a single core, which quite frankly is not great. We should aim to use 100% of the available resources, and trust users to have a correct setup using cgroups to not make their servers explode.

So I'm ok with merging this PR since imo it's a net improvement over the existing situation (modulo the little comments I made), but I think we should be able to go sensibly further in the future.

@nicolasdalsass
Copy link

Cf #11 because I was curious :-D

* Spacy optimizations

- Disable unused pipeline algos - divides recipe time roughly by two
- Allow multi-cpu processing - on a 8 core machine, divives recipe time roughly by 3
Overall, 6x improvement for spacy recipe !
@nicolasdalsass nicolasdalsass self-requested a review July 27, 2022 07:39
@nicolasdalsass nicolasdalsass merged commit 99cdf91 into master Jul 27, 2022
@nicolasdalsass nicolasdalsass deleted the fix/model-downloading branch July 27, 2022 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants