#313 enabled fix differing versions of dependency scikit learn #316

Merged
1 change: 1 addition & 0 deletions doc/changes/changes_2.1.0.md
@@ -27,6 +27,7 @@ Version: 2.1.0
## Bug Fixes

* #303: Fixed AWS Codebuild
+* #313: Fixed differing versions of dependency `scikit-learn`

## Documentation

@@ -0,0 +1,132 @@
{
Collaborator @ahsimb (Aug 16, 2024):

The specific version of the library used by the default SLC depends ...

"AI-Lab adapts its own version of the library if required." I don't think it's a good explanation of what is happening here. AI-Lab itself is not doing anything. The user should run the script like the one in this notebook to make sure the versions are aligned.



Contributor Author:

I updated the paragraph in the next push.
Please review again.

Collaborator:

@ahsimb thanks, that is also what I meant when I said it needed to be phrased differently, and thanks @ckunki for changing it.

Collaborator @ahsimb (Aug 16, 2024):

Can you split this code cell in two parts and add markdown cells for each part explaining the code?

In the first part we create a UDF that reads the version of the scikit_learn library. The UDF runs inside the language container, therefore the library version detected by the UDF is the version installed in this container.

In the second part we compare the version returned by the UDF with the version in the AI-Lab environment. If they differ we install the UDF's version in the AI-Lab environment.

BTW, do we need to run pip with the --upgrade option?



Contributor Author:

Sure, see next push.

Contributor Author:

> BTW, do we need to run pip with the --upgrade option?

That I cannot answer. My tests have been successful without this option.

Collaborator:

Looks good to me
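
The two-part logic discussed in this thread (detect the SLC version, then compare and install) can be sketched outside the notebook as follows. This is a minimal sketch, not the notebook's code: the helper names `needs_install` and `pinned_install_command` are hypothetical, and the version strings are passed in as plain arguments instead of being read from a UDF. On the `--upgrade` question: with an exact `==` pin, pip replaces an installed version that does not match the pin, so the flag should not be needed, which matches the successful tests reported above.

```python
import sys


def needs_install(slc_version: str, local_version: str) -> bool:
    """Return True when the local version differs from the SLC version."""
    # Surrounding whitespace is ignored so that a trailing newline does not count as a difference.
    return slc_version.strip() != local_version.strip()


def pinned_install_command(package: str, version: str) -> list[str]:
    """Build a pip command line that pins the package to one exact version."""
    # sys.executable makes pip target the interpreter that runs this code.
    return [sys.executable, "-m", "pip", "install", f"{package}=={version}"]


if __name__ == "__main__":
    # Example values only (assumptions); in the notebook they come from
    # the UDF result and from sklearn.__version__.
    slc, local = "1.0.2", "1.5.1"
    if needs_install(slc, local):
        print(" ".join(pinned_install_command("scikit-learn", slc)))
    else:
        print("Nothing to do.")
```

The command is returned as a list so it could be handed to `subprocess.run` without shell quoting; inside a notebook the `%pip` magic used in the PR serves the same purpose.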

"cells": [
{
"cell_type": "markdown",
"id": "289d2a8c-953d-46e5-8c73-ad810c29b20f",
"metadata": {},
"source": [
"# Fix the Version of Python Library Scikit-learn\n",
"\n",
"This notebook ensures the AI-Lab is using the same version of the Python library `scikit-learn` as the one used by the built-in [Script Language Container (SLC)](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/adding_new_packages_script_languages.htm#ScriptLanguageContainer) inside the Exasol database.\n",
"\n",
"## Rationale\n",
"\n",
"Using identical versions is required when transferring the Scikit-learn model from the AI-Lab to the database SLC.\n",
"\n",
"The AI-Lab serializes the Scikit-learn model with [pickle](https://docs.python.org/3/library/pickle.html) and uploads it into the BucketFS of the database. The UDF using the built-in SLC can only _deserialize_ the model if it is using the same version of Scikit-learn as was used for serializing it. The specific version of the library available in the built-in SLC depends on the release version of the database and cannot be controlled by the AI-Lab.\n",
"\n",
"Running the following script will update the version of the library used in the AI-Lab, if required.\n",
"\n",
"## Open Secure Configuration Storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d86ca808-044e-4fbd-be30-5ba8324f501e",
"metadata": {},
"outputs": [],
"source": [
"%run ../utils/access_store_ui.ipynb\n",
"display(get_access_store_ui('../'))"
]
},
{
"cell_type": "markdown",
"id": "055ed302-69aa-426c-b5ec-861c63b82d33",
"metadata": {},
"source": [
"## Detect the Version of Scikit-learn Used in the SLC\n",
"\n",
"The following cell creates a User Defined Function (UDF) called `detect_scikit_learn_version()` and then executes the UDF using the built-in SLC via an SQL statement.\n",
"\n",
"The UDF queries and returns the version of Scikit-learn available in the built-in SLC, which is then stored in the variable `slc_scikit_learn_version`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa6c628f-853e-4850-8bab-46f7f645856e",
"metadata": {},
"outputs": [],
"source": [
"import textwrap\n",
"from exasol.nb_connector.connections import open_pyexasol_connection\n",
"\n",
"sql = textwrap.dedent(\"\"\"\n",
"CREATE OR REPLACE PYTHON3 SCALAR SCRIPT {schema!q}.detect_scikit_learn_version() RETURNS VARCHAR(100) AS\n",
"import sklearn\n",
"def run(ctx):\n",
"    return sklearn.__version__\n",
"/\n",
"\"\"\")\n",
"\n",
"with open_pyexasol_connection(ai_lab_config, compression=True) as conn:\n",
"    query_params = {'schema': ai_lab_config.db_schema}\n",
"    conn.execute(sql, query_params)\n",
"    result = conn.execute(\"select {schema!q}.detect_scikit_learn_version()\", query_params).fetchone()\n",
"    slc_scikit_learn_version = result[0]"
]
},
{
"cell_type": "markdown",
"id": "e4b0dc24-6e02-4305-8fa1-15f68afac360",
"metadata": {},
"source": [
"## Compare the Scikit-learn Version and Update the AI-Lab if Required\n",
"\n",
"The next cell compares the Scikit-learn version returned by the UDF with the Scikit-learn version in the AI-Lab environment. If they differ, then the cell installs the UDF's Scikit-learn version in the AI-Lab environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50b88871-4c37-4cc1-ac85-841a22e98153",
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"from importlib import reload\n",
"\n",
"my_version = sklearn.__version__\n",
"\n",
"if slc_scikit_learn_version == my_version:\n",
"    print(f\"AI-Lab scikit-learn version {my_version} is identical to that of the SLC.\\nNothing to do.\")\n",
"else:\n",
"    print(f\"AI-Lab scikit-learn version {my_version} differs from SLC.\\nInstalling version {slc_scikit_learn_version} ...\")\n",
"    %pip install \"scikit_learn=={slc_scikit_learn_version}\"\n",
"    sklearn = reload(sklearn)\n",
"    print(f\"Updated AI-Lab scikit-learn to version {sklearn.__version__}.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0ea6891-d171-4841-a2d8-edf8ac252d86",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
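
The rationale stated in the notebook (a pickled model can only be deserialized reliably with the same library version that serialized it) can be illustrated with a small guard that stores the library version next to the pickled payload and checks it before loading. This is a sketch under assumptions: `dump_with_version` and `load_with_version` are hypothetical helpers, not part of the AI-Lab or of scikit-learn, and a plain dict stands in for a trained model.

```python
import pickle


def dump_with_version(obj, library_version: str) -> bytes:
    """Serialize obj together with the library version used to create it."""
    # The payload is pickled separately so the envelope can be read
    # without deserializing the model itself.
    envelope = {"version": library_version, "payload": pickle.dumps(obj)}
    return pickle.dumps(envelope)


def load_with_version(data: bytes, library_version: str):
    """Deserialize the payload only if the stored version matches."""
    envelope = pickle.loads(data)
    if envelope["version"] != library_version:
        raise RuntimeError(
            f"Model was pickled with version {envelope['version']}, "
            f"but version {library_version} is installed."
        )
    return pickle.loads(envelope["payload"])


if __name__ == "__main__":
    # A dict stands in for a trained model; the version strings are example values.
    blob = dump_with_version({"max_depth": 3}, "1.0.2")
    print(load_with_version(blob, "1.0.2"))
```

A guard like this does not remove the need to align versions; it would only turn a hard-to-read deserialization failure inside the UDF into an explicit message.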
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
-"id": "c2aa3433-eb40-495e-a80e-38ab0bee10cf",
+"id": "c7b3a9b5-1b5f-4613-bd8d-6c14f00af33e",
"metadata": {},
"source": [
"# Introduction\n",
@@ -16,9 +16,10 @@
"## Prerequisites\n",
"\n",
"Before using this set of notebooks the following steps need to be completed:\n",
-"1. [Configure the AI-Lab](../main_config.ipynb).\n",
-"2. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb).\n",
-"3. [Load the Abalone data](../data/data_abalone.ipynb).\n",
+"1. [Configure the AI-Lab](../main_config.ipynb)\n",
+"2. [Fix the Version of Python Library Scikit-learn](./sklearn_fix_version.ipynb)\n",
+"3. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb)\n",
+"4. [Load the Abalone data](../data/data_abalone.ipynb)\n",
"\n",
"## Content\n",
"\n",
@@ -58,7 +59,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.10.12"
}
},
"nbformat": 4,
@@ -91,7 +91,7 @@
"\n",
Collaborator @ahsimb (Aug 16, 2024):

Line #12.    model.fit(X_train.values, y_train)

Why? The model is happy to take a DataFrame as an input.



Contributor Author:

We observed an error message as specified in comment-2293085752.
The notebook sklearn_train_abalone.ipynb did use .values already.

Collaborator @ahsimb (Aug 16, 2024):

Maybe earlier versions of scikit-learn didn't care about feature names. It seems that when we read the input in a UDF with a call like `ctx.get_dataframe(num_rows=1000, start_col=1)`, the resulting pandas DataFrame does not carry the column names. That's strange.

"# Create and train the model.\n",
"model = tree.DecisionTreeClassifier()\n",
-"model.fit(X_train, y_train)\n",
+"model.fit(X_train.values, y_train)\n",
"\n",
"print(f\"Training took: {stopwatch}\")"
]
2 changes: 2 additions & 0 deletions test/notebooks/nbtest_sklearn.py
@@ -22,6 +22,7 @@ def test_regression(notebook_runner) -> None:
     os.chdir('./data')
     notebook_runner('data_abalone.ipynb')
     os.chdir('../sklearn')
+    notebook_runner('sklearn_fix_version.ipynb')
     notebook_runner('sklearn_predict_udf.ipynb')
     notebook_runner('sklearn_train_abalone.ipynb')
     notebook_runner('sklearn_predict_abalone.ipynb')
@@ -38,6 +39,7 @@ def test_classification(notebook_runner) -> None:
     os.chdir('./data')
     notebook_runner('data_telescope.ipynb')
     os.chdir('../sklearn')
+    notebook_runner('sklearn_fix_version.ipynb')
     notebook_runner('sklearn_predict_udf.ipynb')
     notebook_runner('sklearn_train_telescope.ipynb')
     notebook_runner('sklearn_predict_telescope.ipynb')