
Is ipywidgets really required? #219

Closed
PeterJCLaw opened this issue Jul 20, 2023 · 12 comments
Assignees
Labels
dependencies Pull requests that update a dependency file

Comments


PeterJCLaw commented Jul 20, 2023

Hi,

This package looks like it could offer our data scientists a substantial speedup, so in that regard it looks great. However, it appears to pull in a number of packages which seem to be unused; together these significantly increase the requirements footprint of this otherwise small package. Notably, this dependency ends up pulling in quite a lot of packages that we'd rather not deploy into a production system, meaning that we're less likely to be able to use it.

Is ipywidgets really needed? Could it be removed?

It looks like the dependency appeared as a result of #42, though I'm not really sure what the underlying issue was there. I'm not massively familiar with Jupyter notebooks, so I don't know whether there's an implicit dependency being created there?

Other dependencies which appear unused:

  • bleach
  • cloudpickle
  • parso
@jmcarpenter2
Owner

Hey @PeterJCLaw, thanks for noticing this! I will review those four dependencies today and see if I can remove them. Less is more, as they say!

@jmcarpenter2 jmcarpenter2 self-assigned this Jul 20, 2023
@jmcarpenter2 jmcarpenter2 added the dependencies Pull requests that update a dependency file label Jul 20, 2023
@jmcarpenter2
Owner

jmcarpenter2 commented Jul 20, 2023

Here are my notes:

  1. ipywidgets, although not explicitly imported, is part of core swifter functionality (the progress bar). When I uninstall it, I get the warning below on import. It could be an optional extras_require, e.g. pip install swifter[progress_bar], but I worry that many users would find this abrasive: they currently expect the progress bar by default, and I think that's a highly appreciated feature
/usr/local/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
  2. dask requires cloudpickle (dask is a core utility of swifter):
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 4
      2 import numpy as np
      3 from datetime import datetime, timedelta
----> 4 import swifter
      5 import platform
      6 import ray

File /mnt/swifter/__init__.py:5
      3 import warnings
      4 from logging import config
----> 5 from .swifter import SeriesAccessor, DataFrameAccessor, set_defaults
      6 from .parallel_accessor import (
      7     register_parallel_dataframe_accessor,
      8     register_parallel_series_accessor,
      9     register_modin,
     10 )
     12 warnings.filterwarnings("ignore", category=FutureWarning)

File /mnt/swifter/swifter.py:8
      5 import pandas as pd
      7 from abc import abstractmethod
----> 8 from dask import dataframe as dd
      9 from functools import partial
     10 from tqdm.auto import tqdm

File /usr/local/lib/python3.9/site-packages/dask/__init__.py:5
      3 from dask import config, datasets
      4 from dask._version import get_versions
----> 5 from dask.base import (
      6     annotate,
      7     compute,
      8     get_annotations,
      9     is_dask_collection,
     10     optimize,
     11     persist,
     12     visualize,
     13 )
     14 from dask.core import istask
     15 from dask.delayed import delayed

File /usr/local/lib/python3.9/site-packages/dask/multiprocessing.py:15
     12 from functools import partial
     13 from warnings import warn
---> 15 import cloudpickle
     17 from dask import config
     18 from dask.local import MultiprocessingPoolExecutor, get_async, reraise

ModuleNotFoundError: No module named 'cloudpickle'
  3. parso was added as an explicit requirement (it was already implicitly required prior to this) here to resolve a security vulnerability
  4. bleach is the same story: added here to resolve another security vulnerability

@PeterJCLaw
Author

Thanks for looking at this.

In general I would strongly discourage any mention of transitive dependencies within a library's requirements. (Obviously it's fine to include transitive dependencies which are also directly used by the library.) Including them tends to create confusion and forces a particular setup upon all users, opening everyone up to potential security issues even if there's a short-term benefit for some subset of them.

Ideally a library's requirements should specify exactly and only the packages which the library directly requires, using version ranges to express compatibility. The versions of transitive dependencies should have no impact on the library itself and should thus not be specified. This ensures that end-users have the clearest idea of which packages the library depends upon and the most freedom to set up and/or secure their environment.
Where the versions of transitive dependencies impact end-users' code/setup/etc., it is their responsibility to choose (and maybe pin) versions of those dependencies -- they're the ones depending on them, after all.
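As a hedged illustration of that principle, a library's requirement list might look something like this (package names and version ranges are hypothetical, not swifter's actual requirements):

```python
# Hypothetical example of a requirements policy that lists only direct
# dependencies, each with a compatibility range. Transitive-only packages
# (parso, bleach, cloudpickle, ...) are deliberately absent: their versions
# are the business of whichever direct dependency actually imports them.
install_requires = [
    "pandas>=1.0,<3",           # direct: the library imports pandas
    "dask[dataframe]>=2021.1",  # direct: the library imports dask.dataframe
    "tqdm>=4.0",                # direct: progress bars
]
```

Note that nothing here is pinned with "==": exact pinning belongs in an application's lock file, not a library's metadata.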

For 1), are you running that in a Jupyter notebook or similar? What happens if you run the equivalent code (can you share it?) as a plain command line script? This sounds like there's an optional dependency here which tqdm is warning you about due to the runtime environment, and which is unrelated to swifter.

I appreciate that security concerns have motivated the pinning; however, those security issues will always be the responsibility of the end-user to address. The responsibility of a library with regard to security issues in dependencies extends only as far as a) direct dependencies, and b) ensuring compatibility with patched versions and including those versions in the permitted ranges.

Libraries doing otherwise actually make dependency management considerably harder for end-users as it becomes unclear what the true dependencies are, which (as alluded to above) potentially creates other security issues. Consider what would happen if a newer version of dask dropped the need for bleach, then bleach was found to be vulnerable. With the current spelling, users of swifter are forced by their package managers to keep bleach installed and are potentially vulnerable even though dask and swifter no longer need the package. (Unfortunately there are plenty of ways for this to impact users even if the code appears unused).

@jmcarpenter2
Owner

jmcarpenter2 commented Jul 20, 2023

Hey @PeterJCLaw, I totally agree with your approach here.

Let me see if you agree with the following proposed changes:

  1. Right off the bat, we can remove parso and bleach as explicit requirements, let upstream requirements handle installing them, and leave users to resolve security vulnerabilities themselves
  2. I will look into which compatibilities ipywidgets is needed for. I was in fact using a Jupyter notebook to test, which means that other runtime environments may not require ipywidgets.
    • If the answer is "it's only needed for Jupyter notebooks", I have a couple of proposals:
      1. Remove it as a core dependency and instead add it as an extras_require so that it can be easily auto-installed for users who wish to keep the same functionality
      2. Alternatively, remove it altogether and allow users to resolve their runtime environments independently
    • If the answer is "it's needed for progress bars in general":
      1. Remove the core dependency and add it as an extras_require so that users who want progress bars can install it
      2. Alternatively, remove it altogether and allow users to resolve their runtime environments independently
    • The two paths above are the same; the only difference is what I would call the extras_require
  3. I will confirm that dask already installs cloudpickle by default (I was testing by pip-uninstalling cloudpickle). It may already be a dask requirement, in which case I shouldn't be explicitly listing it either, just like parso and bleach.
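A minimal sketch of what the extras_require route might look like in setup.py (the extra's name and version range here are illustrative placeholders, not a decision):

```python
# Illustrative only: ipywidgets moved from install_requires to an optional
# extra, so `pip install swifter` stays lean while
# `pip install "swifter[notebook]"` opts into the rich notebook progress bar.
install_requires = ["pandas", "dask[dataframe]", "tqdm"]

extras_require = {
    "notebook": ["ipywidgets>=7.0"],  # extra name is a placeholder
}

# In setup.py these would be passed to setuptools.setup(), e.g.:
# setup(name="swifter", install_requires=install_requires,
#       extras_require=extras_require, ...)
```

Either way, the core install no longer carries ipywidgets, and users who want the notebook experience can opt in with one install command.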

Thoughts?

@PeterJCLaw
Author

Yup sounds good.

A quick test of pip install dask in a fresh virtualenv shows that it does indeed install cloudpickle too.

I've used tqdm a lot in various contexts and never needed ipywidgets before. Combining that with the fact that you only see a warning (rather than an ImportError-based crash) when ipywidgets isn't present, I'm pretty confident it's not needed in the general case.

Up to you how you want the extras of your package to work.

@jmcarpenter2
Owner

Sweet. Thanks for the extra input and validation on this, I'll put up a PR soon and add you as a reviewer

@xquyvu

xquyvu commented Jul 21, 2023

@jmcarpenter2 Thanks for your quick response. From experience, I believe ipy* packages such as ipywidgets and ipykernel are only required when you are running a notebook; running a normal Python script (and in prod) doesn't require them. I looked at swifter's documentation here, and there is an option to disable the progress bar. Maybe we could throw an error when the user is running swifter in a notebook with the progress bar enabled but without ipywidgets installed?
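A minimal sketch of the check suggested above, assuming hypothetical helper names (in_notebook, check_progress_bar_support are illustrative, not swifter's actual API):

```python
import importlib.util


def in_notebook() -> bool:
    """Best-effort check for an IPython kernel (Jupyter) environment."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False  # IPython not installed: definitely not a notebook
    shell = get_ipython()
    # get_ipython() returns None outside IPython; the Jupyter kernel's
    # shell class is ZMQInteractiveShell.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


def check_progress_bar_support(progress_bar: bool) -> None:
    """Fail fast when a notebook progress bar was requested but can't render."""
    if (
        progress_bar
        and in_notebook()
        and importlib.util.find_spec("ipywidgets") is None
    ):
        raise ImportError(
            "Progress bars in notebooks require ipywidgets; "
            "install it (e.g. pip install ipywidgets) and restart the kernel."
        )
```

Plain scripts pass the check regardless of whether ipywidgets is installed; only the notebook-plus-progress-bar-minus-ipywidgets combination raises.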

Regarding cloudpickle, it is no longer a requirement for dask: dask/dask#2013

Looking back, if what's suggested here is correct, then maybe you wanted dask[dataframe] instead of dask in the requirements, rather than explicitly listing cloudpickle.

@jmcarpenter2
Owner

Thanks @xquyvu, that's a good callout. I will definitely remove cloudpickle. Ironically, I'm already using dask[dataframe] (https://github.com/jmcarpenter2/swifter/blob/master/setup.py#L16), just never removed cloudpickle.

I am seeing strange behavior with ipywidgets though.

At first I was thinking "this is fine: it gives a warning to pip install -U ipywidgets in the event it's not installed for Jupyter notebook compatibility, and I won't even need to offer an extras_require". But when I manually run the pip install inside my notebook environment and restart the notebook, it doesn't correctly give me the rich progress bar output. So I think I will offer a progress_bar extras_require or some similarly named extra. I'm struggling to come up with something concise that means "Jupyter notebook rich progress bar" as an extras_require name. Any suggestions?

@jmcarpenter2
Owner

For review: #220

@PeterJCLaw
Author

> At first, I was thinking... "this is fine, it gives a warning to pip install -U ipywidgets in the event it's not installed for jupyter notebook compatibility and I won't even need to offer an extras_require".. but when i manually run the pip install inside my notebook environment and restart the notebook it doesn't correctly give me the rich progress bar output. So I think i will offer a progress_bar extras_require or some similar name.. struggling to come up with something concise that means "jupyter notebook rich progress bar" as an extras_require name. Any suggestions?

Maybe just jupyter or notebook for the extra's name, or even just ipywidgets? That obviously doesn't convey the feature, however it does convey the environment which seems perhaps more pertinent as that's where the warning might be shown.

I would discourage anything which only mentions the progress bar unless ipywidgets is needed for all progress bars. Given that tqdm works fine on its own outside notebooks, that seems not to be the case.

> For review: #220

Thanks, that looks like it's moving in the right direction.

There's a lot going on in that PR, though. I would encourage keeping PRs to a single topic so that they're easier to review, easier to revert if needed, and so that one set of changes isn't held up by discussion of an unrelated set. Just like with commits, the rule of thumb of "smallest useful increment" works well for PRs too.

@jmcarpenter2
Owner

Thanks for the input. I think notebook is a pretty clean extras_require name. Good suggestions.

I agree. I should've submitted separate PRs for each issue and then done a release PR.

@jmcarpenter2
Owner

#220 is merged, thanks all for input!
