Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Simplification of the DFtoVW tutorial #4693

Merged
merged 7 commits into from
May 23, 2024
Merged
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
307 changes: 307 additions & 0 deletions python/docs/source/tutorials/python_simplified_dftovw_tuto.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,307 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "51f41eaf-f24f-44fc-8178-3270efa46ec4",
"metadata": {},
"source": [
"# Simple pandas to vowpalwabbit conversion tutorial"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b9a21a43-39ad-4213-9c7f-814bbafd8a54",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from vowpalwabbit.dftovw import DFtoVW\n",
"from vowpalwabbit import Workspace"
]
},
{
"cell_type": "markdown",
"id": "fc831353-b5aa-4bb0-a928-c47b340397a5",
"metadata": {},
"source": [
"### Building simple examples using `DftoVW.from_column_names`"
]
},
{
"cell_type": "markdown",
"id": "c60089f1-ce41-49ee-a3a9-74f0fb2cb34f",
"metadata": {},
"source": [
"Let's create the following pandas dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a31118c2-b315-4129-b28a-2ea37d2dae50",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(\n",
" [\n",
" {\n",
" \"income\": 0,\n",
" \"age\": 27,\n",
" \"marital-status\": \"Separated\",\n",
" \"education\": \"HS-grad\",\n",
" \"occupation\": \"Handlers-cleaners\",\n",
" \"hours-per-week\": 25,\n",
" },\n",
" {\n",
" \"income\": 1,\n",
" \"age\": 34,\n",
" \"marital-status\": \"Married-civ-spouse\",\n",
" \"education\": \"Bachelors\",\n",
" \"occupation\": \"Prof-specialty\",\n",
" \"hours-per-week\": 40,\n",
" },\n",
" {\n",
" \"income\": 0,\n",
" \"age\": 44,\n",
" \"marital-status\": \"Never-married\",\n",
" \"education\": \"Assoc-voc\",\n",
" \"occupation\": \"Priv-house-serv\",\n",
" \"hours-per-week\": 25,\n",
" },\n",
" {\n",
" \"income\": 1,\n",
" \"age\": 38,\n",
" \"marital-status\": \"Married-civ-spouse\",\n",
" \"education\": \"Bachelors\",\n",
" \"occupation\": \"Prof-specialty\",\n",
" \"hours-per-week\": 60,\n",
" },\n",
" {\n",
" \"income\": 0,\n",
" \"age\": 34,\n",
" \"marital-status\": \"Married-civ-spouse\",\n",
" \"education\": \"HS-grad\",\n",
" \"occupation\": \"Other-service\",\n",
" \"hours-per-week\": 36,\n",
" },\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "473e5c72-ab6c-4d72-a466-7352ec604393",
"metadata": {},
"source": [
"The user builds the examples using the class method `DftoVW.from_column_names`. The method is called using the dataframe object (`df`) and its various column names. The conversion to vowpal wabbit examples is then performed by calling the `convert_df` method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2be83f6c-ecaa-45cb-bb3f-2f47827d6016",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['0 | age:27 marital-status=Separated education=HS-grad occupation=Handlers-cleaners hours-per-week:25',\n",
" '1 | age:34 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:40',\n",
" '0 | age:44 marital-status=Never-married education=Assoc-voc occupation=Priv-house-serv hours-per-week:25',\n",
" '1 | age:38 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:60',\n",
" '0 | age:34 marital-status=Married-civ-spouse education=HS-grad occupation=Other-service hours-per-week:36']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"converter = DFtoVW.from_column_names(\n",
" df=df,\n",
" y=\"income\",\n",
" x=[\"age\", \"marital-status\", \"education\", \"occupation\", \"hours-per-week\"],\n",
")\n",
"examples = converter.convert_df()\n",
"examples"
]
},
{
"cell_type": "markdown",
"id": "6109f95e-cd17-485b-947d-8c2c33a5843a",
"metadata": {},
"source": [
"Note that the vowpal wabbit format for categorical features is `feature_name=feature_value` whereas for numerical features the format is `feature_name:feature_value`. When using `DFtoVW` class, the appropriate format will be inferred from the dataframe columns types.\n",
"\n",
"We then train the model on these examples:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c0269980-78b3-4123-84eb-27e0fba929b4",
"metadata": {},
"outputs": [],
"source": [
"model = Workspace(P=1, enable_logging=True)\n",
"\n",
"for ex in examples:\n",
" model.learn(ex)\n",
"model.finish()"
]
},
{
"cell_type": "markdown",
"id": "50470ca2-f33d-495e-a3f9-46ae1a618e6d",
"metadata": {},
"source": [
"### Building more complex examples"
]
},
{
"cell_type": "markdown",
"id": "30a526a6-7f8f-48e4-8dca-f9058a0d87fb",
"metadata": {},
"source": [
"The class method `DFtoVW.from_column_names` represents a quick and simple way to build the examples, but if the user needs more control over the way the examples are created, she or he can either use the class `Feature` or the class `Namespace` for building features, and any of the label class available (see below) based on the nature of the task. \n",
"\n",
"- When using `Namespace` class (see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces for the meaning) the user specifies the name of the namespace with the `name` field, and will pass one or a list of `Feature` object to the `features` field.\n",
"\n",
"- The `Feature` class has a `value` field, which is the name of the column. The user can also rename the feature using the `rename_feature` field or choose to enforce a specific type (`\"numerical\"` or `\"categorical\"`) using `as_type` field.\n",
"\n",
"Regarding the labels, multiple classes are available:\n",
"- `SimpleLabel` for regression\n",
"- `MulticlassLabel` and `Multilabel` for classification\n",
"- `ContextualbanditLabel`.\n",
"\n",
"In the following examples we'll build 2 namespaces based on socio-demographic features and the job features."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "90a69d90-a0a6-42d4-8867-5d1b0e73f4ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['0 |ns_sociodemo age:27 marital-status=Separated education=HS-grad |ns_job occupation=Handlers-cleaners hours-per-week:25',\n",
" '1 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:40',\n",
" '0 |ns_sociodemo age:44 marital-status=Never-married education=Assoc-voc |ns_job occupation=Priv-house-serv hours-per-week:25',\n",
" '1 |ns_sociodemo age:38 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:60',\n",
" '0 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=HS-grad |ns_job occupation=Other-service hours-per-week:36']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from vowpalwabbit.dftovw import SimpleLabel, Namespace, Feature\n",
"\n",
"ns_sociodemo = Namespace(\n",
" features=[Feature(col) for col in [\"age\", \"marital-status\", \"education\"]],\n",
" name=\"ns_sociodemo\",\n",
")\n",
"ns_job = Namespace(\n",
" features=[Feature(col) for col in [\"occupation\", \"hours-per-week\"]], name=\"ns_job\"\n",
")\n",
"label = SimpleLabel(\"income\")\n",
"\n",
"converter_advanced = DFtoVW(df=df, namespaces=[ns_sociodemo, ns_job], label=label)\n",
"examples_advanced = converter_advanced.convert_df()\n",
"examples_advanced[:5]"
]
},
{
"cell_type": "markdown",
"id": "071326d7-f969-4db1-a73e-3cee225921f4",
"metadata": {},
"source": [
"We train the model by also including interactions between the variables of the 2 namespaces:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f0ed661f-d9a0-4ebb-93b8-f5747347c7b4",
"metadata": {},
"outputs": [],
"source": [
"model_advanced = Workspace(\n",
" # arg_str=\"--interactions ns_sociodemo:ns_job\", P=1, enable_logging=True\n",
" arg_str=\"--redefine a:=ns_job b:=ns_sociodemo -q ab \",\n",
" P=1,\n",
" enable_logging=True,\n",
")\n",
"\n",
"for ex in examples_advanced:\n",
" model_advanced.learn(ex)\n",
"\n",
"model_advanced.finish()"
]
},
{
"cell_type": "markdown",
"id": "5bb2208e-9d0e-44ef-8d91-faccedf41ac0",
"metadata": {},
"source": [
"Finally, we can get the estimated weights associated to each namespace and feature:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "06aabeab-2365-4f86-bf60-7043b0e59190",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('ns_job', 'occupation', 0.0),\n",
" ('ns_job', 'hours-per-week', 0.0019117757910862565),\n",
" ('ns_sociodemo', 'age', 0.001858704723417759),\n",
" ('ns_sociodemo', 'marital-status', 0.0),\n",
" ('ns_sociodemo', 'education', 0.0)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[\n",
" (ns.name, feature.name, model_advanced.get_weight_from_name(feature.name, ns.name))\n",
" for ns in [ns_job, ns_sociodemo]\n",
" for feature in ns.features\n",
"]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3_8_12",
"language": "python",
"name": "python3_8_12"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading