Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cuml + optuna HPO example #141

Merged
merged 11 commits into from
Mar 30, 2023

Conversation

betatim
Copy link
Member

@betatim betatim commented Feb 8, 2023

Closes rapidsai/cloud-ml-examples#227

🚧 This isn't ready to merge yet.

I've updated things to the latest version of optuna (3.1.0). This means that we no longer need the dask_optuna extension as the dask integration is now part of optuna.

I also removed the section on the dashboard. This is because you now need to install a separate package to get the dashboard. It seems like this would add a lot of explaining to an already long notebook for not so much benefit. If there was something particular to point out or do with the dashboard it would be worth it, currently the notebook just pointed out that the dashboard exists. There are visualisation examples in the notebook already.

Right now there are two (and a half) problems:

  1. sometimes the notebook just "hangs" when you run the HPO loop. As far as I can tell nothing is happening when this state is reached. top doesn't report any CPU usage and nvidia-smi also doesn't report anything happening. Waiting for half an hour or so, also doesn't solve the problem. The weird thing is that this morning it "just worked" but a few days ago I had the same problem, and now again. If someone has an idea where to start poking, let me know.
  2. you need to remove the SQLite DB used as storage each time you run the notebook. I assume this is a bug in the dask storage extension(?) that somehow leaves the DB in a state where it can't be used again?? Or maybe we'd have to use a new study name? We can solve this by using an in memory storage (don't provide a path to DaskStorage()). Does anyone see a downside to doing this?
  3. when the notebook does run (and doesn't get stuck), it doesn't seem like it is using more than one GPU. nvidia-smi output in the collapsible below. First question: is using nvidia-smi a good (approximate tool) to see if more than one GPU is being used? Second question: has it always been like this?
Details
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   55C    P0    44W /  70W |   3459MiB / 15109MiB |     80%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   53C    P0    29W /  70W |    100MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   46C    P0    27W /  70W |    100MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   44C    P0    28W /  70W |    100MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    925320      C   ...-deploy-optuna/bin/python     3359MiB |
|    0   N/A  N/A    925478      C   ...-deploy-optuna/bin/python       97MiB |
|    1   N/A  N/A    925486      C   ...-deploy-optuna/bin/python       97MiB |
|    2   N/A  N/A    925481      C   ...-deploy-optuna/bin/python       97MiB |
|    3   N/A  N/A    925488      C   ...-deploy-optuna/bin/python       97MiB |
+-----------------------------------------------------------------------------+

Before merging this we need to run the notebook at least once from start to finish. Other steps depend on the answer to the questions.

Copying task list from the issue:

  • Test if notebook works
  • Fix up anything that needs changing
  • Ensure notebook has good title, description and metadata tags in the first cell
  • Replace deployment instructions with links to docs pages
  • Copy notebook into a new folder into deployment docs examples
  • Copy any supporting files to the folder
  • Add notebook to examples toctree
  • Make PR to deployment docs repo

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@jacobtomlinson jacobtomlinson marked this pull request as draft February 8, 2023 16:38
@hcho3 hcho3 self-assigned this Feb 8, 2023
@betatim
Copy link
Member Author

betatim commented Feb 8, 2023

Should this notebook even be here? Maybe, because it uses LocalCUDACluster, but maybe the RAPIDS examples are a better home?

Related to that question, which tags should we use?

@jacobtomlinson
Copy link
Member

@skirui-source was also looking at a notebook that only used LocalCUDACluster. I think it would be good to have an example or two that use it, but I don't want tons and tons.

Perhaps you two could sync up and decide which of the two should go here. I'd be happy with at least one and I'm fine with both. But we probably don't need more than that. But if there is a ton of overlap between then two then they could be merged.

@betatim
Copy link
Member Author

betatim commented Feb 9, 2023

Sounds like a plan. I left a comment on the other (obvious) optuna notebook issues regarding -1/0/+1 for migrating. I'll wait and see which one @skirui-source has tackled.

@betatim betatim marked this pull request as ready for review February 16, 2023 11:32
@betatim
Copy link
Member Author

betatim commented Feb 16, 2023

I figured out how to address all the problems I mentioned in the top comment 🥳 . I think a lot of it was related to switching from dask_optuna to the dask extension built in to optuna in the newer version.

@review-notebook-app
Copy link

review-notebook-app bot commented Feb 17, 2023

View / edit / reply to this conversation on ReviewNB

hcho3 commented on 2023-02-17T01:36:52Z
----------------------------------------------------------------

Line #1.    import pandas as pd

Not needed


@review-notebook-app
Copy link

review-notebook-app bot commented Feb 17, 2023

View / edit / reply to this conversation on ReviewNB

hcho3 commented on 2023-02-17T01:36:53Z
----------------------------------------------------------------

The objective function will be the one we optimize in Optuna Study.

Should be revised to "We will optimize the objective function using Optuna Study"


@review-notebook-app
Copy link

review-notebook-app bot commented Feb 17, 2023

View / edit / reply to this conversation on ReviewNB

hcho3 commented on 2023-02-17T01:36:54Z
----------------------------------------------------------------

Optuna uses study and trials

Revise to "Optuna uses studies and trials"


@review-notebook-app
Copy link

review-notebook-app bot commented Feb 17, 2023

View / edit / reply to this conversation on ReviewNB

hcho3 commented on 2023-02-17T01:36:55Z
----------------------------------------------------------------

# Submit 4 optimization tasks, where each task runs about 40 optimization trials

"Submit n_workers optimization tasks..."


@review-notebook-app
Copy link

review-notebook-app bot commented Feb 17, 2023

View / edit / reply to this conversation on ReviewNB

hcho3 commented on 2023-02-17T01:36:56Z
----------------------------------------------------------------

futures = [
    c.submit(
        study.optimize,
        lambda trial: objective(trial, X, y),
        n_trials=N_TRIALS // 4,
        pure=False,
    )
    for _ in range(4)
]

Use n_workers instead of hard-coding 4.


@hcho3
Copy link
Contributor

hcho3 commented Feb 17, 2023

Overall, the notebook looks good to me. I left a few minor comments.

I was able to run the notebook end-to-end using the latest stable Rapids Docker container (nvcr.io/nvidia/rapidsai/rapidsai-core:23.02-cuda11.8-runtime-ubuntu22.04-py3.10).

@betatim
Copy link
Member Author

betatim commented Feb 17, 2023

Thanks for the comments. I implemented them all as is, except for the import pandas one: there was an additional unused import I also removed :D

@betatim
Copy link
Member Author

betatim commented Feb 17, 2023

I'm a bit puzzled about the build error. The problem is that there is a \\ in the notebook that makes the version templating extension unhappy. The problem is around byte 3755442. If you print that part of the file there is indeed a \\ there, but there is also a \\ about 70 characters before. The offending bit is "var x = new MutationObserver(function (mutations, observer) {{\\n",\n which is part of the plotly(?) JS that gets injected. But it looks totally legit to have a \\n there. So yeah, I'm a bit stumped.

Does someone have an idea where to go poking around?

@jacobtomlinson
Copy link
Member

Hmm I wondered if throwing notebooks through jinja would throw up some problems like this.

I guess the options are:

  • Remove the offending plotly section
  • Rework the extension to handle this case

@betatim
Copy link
Member Author

betatim commented Feb 22, 2023

Could we skip notebooks in the version templating extension? At least temporarily and then work on improving the extension.

Switching away from plotly would be a bit tricky, one of the plots uses parallel coordinates, which is not something that you can (easily) do in matplotlib. And it is a useful plot type for the thing we are trying to show. But then again, maybe we should just remove the visualisation part and instead refer the reader to some optuna docs about how to do that? It would fit with "one idea per example", which I guess in this case is "look, optuna + dask + rapids = it just works!".

The JS injected by plotly causes parsing errors in the templating
extension we use to render the docs. For now the plots are removed,
could be re-added if the extension is reworked.
@betatim
Copy link
Member Author

betatim commented Feb 22, 2023

The latest commit implements the idea of removing the plotting. If we go ahead with this I'll create an issue to remind us to work on the extension (and add the plotting back).

@betatim
Copy link
Member Author

betatim commented Feb 22, 2023

With #160 merged, let's put back the visualisation stuff.

@betatim betatim force-pushed the notebook-cuml-optuna-hpo branch from 965b197 to a776a44 Compare February 28, 2023 09:54
@betatim betatim force-pushed the notebook-cuml-optuna-hpo branch from e0baa97 to b485938 Compare February 28, 2023 10:10
@betatim
Copy link
Member Author

betatim commented Feb 28, 2023

This is ready for review/merge. In the end I removed the plotting because it doesn't render/appear in the rendered docs. Maybe because we are not including all the right JS libs that the notebook UI includes/loads? Maybe something for a future PR dedicated to making plotlyJS work.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks great thanks!

Could you add a few more tags. The notebook uses libraries including cudf, cuml, and numpy which we need tags for. We probably also need a new dataset tag for BNP Paribas Cardif Claims Management although not sure what the best short slug is for that.

@betatim
Copy link
Member Author

betatim commented Mar 1, 2023

Added more tags. What do you think of dataset/bnp-claims? (I've added that for now)

@betatim
Copy link
Member Author

betatim commented Mar 9, 2023

Time to merge this?

@jacobtomlinson
Copy link
Member

Sorry for the delay in merging here. Let's get this in.

@jacobtomlinson jacobtomlinson merged commit eed6b0d into rapidsai:main Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate optuna/notebooks/optuna_rapids.ipynb to deployment docs
3 participants