From bc6c826351b119c8f800a2c847ad78bc5d77ba3b Mon Sep 17 00:00:00 2001 From: Matt Bowers Date: Mon, 18 Sep 2023 14:36:29 +0200 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- about.html | 2 +- archive-python.xml | 4942 +++++++++----- archive.html | 47 +- archive.xml | 5880 +++++++++-------- gradient-boosting-series.html | 16 +- index.html | 71 +- listings.json | 2 + posts/8020-pandas-tutorial/index.html | 2 +- .../index.html | 2 +- posts/consider-the-decision-tree/index.html | 2 +- posts/decision-tree-from-scratch/index.html | 2 +- posts/drafts/conda-cheat-sheet/index.html | 2 +- .../get-down-with-gradient-descent/index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- posts/hello-pyspark/index.html | 2 +- posts/hello-world/index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- posts/xgboost-explained/index.html | 2 +- .../index.html | 1446 ++++ .../figure-html/cell-19-output-1.png | Bin 0 -> 17808 bytes .../figure-html/cell-20-output-1.png | Bin 0 -> 94075 bytes .../figure-html/cell-7-output-1.png | Bin 0 -> 17274 bytes .../figure-html/cell-8-output-1.png | Bin 0 -> 16837 bytes .../kigali-branches.jpg | Bin 0 -> 79493 bytes posts/xgboost-from-scratch/index.html | 2 +- search.json | 405 +- sitemap.xml | 50 +- 30 files changed, 8348 insertions(+), 4543 deletions(-) create mode 100644 posts/xgboost-for-regression-in-python/index.html create mode 100644 posts/xgboost-for-regression-in-python/index_files/figure-html/cell-19-output-1.png create mode 100644 posts/xgboost-for-regression-in-python/index_files/figure-html/cell-20-output-1.png create mode 100644 posts/xgboost-for-regression-in-python/index_files/figure-html/cell-7-output-1.png create mode 100644 posts/xgboost-for-regression-in-python/index_files/figure-html/cell-8-output-1.png create mode 100644 posts/xgboost-for-regression-in-python/kigali-branches.jpg diff --git a/.nojekyll b/.nojekyll index d06df9a..72312a6 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -7d0c39bd \ No newline at end of file +e110b739 \ No newline at end of file diff --git a/about.html b/about.html index 489a190..ccb50da 100644 --- a/about.html +++ b/about.html @@ -134,7 +134,7 @@
Subscribe
- + diff --git a/archive-python.xml b/archive-python.xml index 8c47f22..95005db 100644 --- a/archive-python.xml +++ b/archive-python.xml @@ -10,7 +10,2235 @@ A blog about data science, statistics, machine learning, and the scientific method quarto-1.3.433 -Tue, 05 Sep 2023 21:00:00 GMT +Tue, 24 Oct 2023 22:00:00 GMT + + XGBoost for Regression in Python + Matt Bowers + https://randomrealizations.com/posts/xgboost-for-regression-in-python/index.html + In this post I’m going to show you my process for solving regression problems with XGBoost in python, using either the native xgboost API or the scikit-learn interface. This is a powerful methodology that can produce world class results in a short time with minimal thought or effort. While we’ll be working on an old Kaggle competition for predicting the sale prices of bulldozers and other heavy machinery, you can use this flow to solve whatever tabular data regression problem you’re working on.

+

This post serves as the explanation and documentation for the XGBoost regression jupyter notebook from my ds-templates repo on GitHub, so go ahead and download the notebook and follow along with your own data.

+

If you’re not already comfortable with the ideas behind gradient boosting and XGBoost, you’ll find it helpful to read some of my previous posts to get up to speed. I’d start with this introduction to gradient boosting, and then read this explanation of how XGBoost works.

+

Let’s get into it! 🚀

+
+

Install and import the xgboost library

+

If you don’t already have it, go ahead and use conda to install the xgboost library, e.g.

+
$ conda install -c conda-forge xgboost
+
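If you prefer pip, the library is also published on PyPI under the same name, so a plain pip install works as well.

$ pip install xgboost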

Then import it along with the usual suspects.

+
+
import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import xgboost as xgb
+
+
+
+

Read dataset into python

+

In this example we’ll work on the Kaggle Bluebook for Bulldozers competition, which asks us to build a regression model to predict the sale price of heavy equipment. Amazingly, you can solve your own regression problem by swapping this data out with your organization’s data before proceeding with the tutorial.

+

Go ahead and download the Train.zip file from Kaggle and extract it to get Train.csv. Then read the data into a pandas dataframe.

+
+
df = pd.read_csv('Train.csv', parse_dates=['saledate']);
+
+

Notice I cheated a little bit, checking the columns ahead of time and telling pandas to treat the saledate column as a date. In general it will make life easier to read in any date-like columns as dates.

+
+
df.info()
+
+
<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 401125 entries, 0 to 401124
+Data columns (total 53 columns):
+ #   Column                    Non-Null Count   Dtype         
+---  ------                    --------------   -----         
+ 0   SalesID                   401125 non-null  int64         
+ 1   SalePrice                 401125 non-null  int64         
+ 2   MachineID                 401125 non-null  int64         
+ 3   ModelID                   401125 non-null  int64         
+ 4   datasource                401125 non-null  int64         
+ 5   auctioneerID              380989 non-null  float64       
+ 6   YearMade                  401125 non-null  int64         
+ 7   MachineHoursCurrentMeter  142765 non-null  float64       
+ 8   UsageBand                 69639 non-null   object        
+ 9   saledate                  401125 non-null  datetime64[ns]
+ 10  fiModelDesc               401125 non-null  object        
+ 11  fiBaseModel               401125 non-null  object        
+ 12  fiSecondaryDesc           263934 non-null  object        
+ 13  fiModelSeries             56908 non-null   object        
+ 14  fiModelDescriptor         71919 non-null   object        
+ 15  ProductSize               190350 non-null  object        
+ 16  fiProductClassDesc        401125 non-null  object        
+ 17  state                     401125 non-null  object        
+ 18  ProductGroup              401125 non-null  object        
+ 19  ProductGroupDesc          401125 non-null  object        
+ 20  Drive_System              104361 non-null  object        
+ 21  Enclosure                 400800 non-null  object        
+ 22  Forks                     192077 non-null  object        
+ 23  Pad_Type                  79134 non-null   object        
+ 24  Ride_Control              148606 non-null  object        
+ 25  Stick                     79134 non-null   object        
+ 26  Transmission              183230 non-null  object        
+ 27  Turbocharged              79134 non-null   object        
+ 28  Blade_Extension           25219 non-null   object        
+ 29  Blade_Width               25219 non-null   object        
+ 30  Enclosure_Type            25219 non-null   object        
+ 31  Engine_Horsepower         25219 non-null   object        
+ 32  Hydraulics                320570 non-null  object        
+ 33  Pushblock                 25219 non-null   object        
+ 34  Ripper                    104137 non-null  object        
+ 35  Scarifier                 25230 non-null   object        
+ 36  Tip_Control               25219 non-null   object        
+ 37  Tire_Size                 94718 non-null   object        
+ 38  Coupler                   213952 non-null  object        
+ 39  Coupler_System            43458 non-null   object        
+ 40  Grouser_Tracks            43362 non-null   object        
+ 41  Hydraulics_Flow           43362 non-null   object        
+ 42  Track_Type                99153 non-null   object        
+ 43  Undercarriage_Pad_Width   99872 non-null   object        
+ 44  Stick_Length              99218 non-null   object        
+ 45  Thumb                     99288 non-null   object        
+ 46  Pattern_Changer           99218 non-null   object        
+ 47  Grouser_Type              99153 non-null   object        
+ 48  Backhoe_Mounting          78672 non-null   object        
+ 49  Blade_Type                79833 non-null   object        
+ 50  Travel_Controls           79834 non-null   object        
+ 51  Differential_Type         69411 non-null   object        
+ 52  Steering_Controls         69369 non-null   object        
+dtypes: datetime64[ns](1), float64(2), int64(6), object(44)
+memory usage: 162.2+ MB
+
+
+
+
+

Prepare raw data for XGBoost

+

When faced with a new tabular dataset for modeling, we have two format considerations: data types and missingness. From the call to df.info() above, we can see we have both mixed types and missing values.

+

When it comes to missing values, some models like the gradient booster or random forest in scikit-learn require purely non-missing inputs. One of the great strengths of XGBoost is that it relaxes this requirement, allowing us to pass in missing feature values, so we don’t have to worry about them.

+
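As a quick illustration of that point, here’s a tiny toy example (separate from the bulldozer data; the names are just for illustration) showing that feature arrays containing np.nan can be passed straight in, with XGBoost treating the NaNs as missing values.

X_demo = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 6.0]])
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
xgb.XGBRegressor(n_estimators=5).fit(X_demo, y_demo)  # trains fine; NaN entries are handled as missing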

Regarding data types, all ML models for tabular data require inputs to be numeric, either integers or floats, so we’re going to have to deal with those object columns.

+
+

Encode string features

+

The simplest way to encode string variables is to map each unique string value to an integer; this is called integer encoding.

+

We have a couple of options for how to implement this transformation: pandas categoricals or the scikit-learn label encoder. We can use the categorical type in pandas to generate mappings from string values to integers for each string feature. The category type is a bit like the factor type in R. Pandas stores the underlying data as integers, and it also keeps a mapping from the integers to the string values. XGBoost will be able to access the integers for model fitting. This is nice because we can still access the actual categories which can be helpful when we start taking a closer look at the data. If you prefer, you can also use the scikit-learn label encoder to replace the string columns with their integer-mapped counterparts.

+
+
def encode_string_features(df, use_cats=True):
+    out_df = df.copy()
+    for feature, feature_type in df.dtypes.items():
+        if feature_type == 'object':
+            if use_cats:
+                out_df[feature] = out_df[feature].astype('category')
+            else:
+                from sklearn.preprocessing import LabelEncoder
+                out_df[feature] = LabelEncoder() \
+                    .fit_transform(out_df[feature].astype('str'))
+    return out_df
+
+df = encode_string_features(df, use_cats=False)
+
+
+
+

Encode date and timestamp features

+

While dates feel sort of numeric, they are not numbers, so we need to transform them into numeric columns. Unfortunately, encoding timestamps isn’t as straightforward as encoding strings, so we actually might need to engage in a little bit of feature engineering. A single date has many different attributes, e.g. days since epoch, year, quarter, month, day, day of year, day of week, is holiday, etc. As a starting point, we can just add a few of these attributes as features. Once a feature is represented as a date or timestamp data type, you can access various attributes via the dt attribute.

+
+
def encode_datetime_features(df, datetime_features, datetime_attributes):
+    out_df = df.copy()
+    for datetime_feature in datetime_features:
+        for datetime_attribute in datetime_attributes:
+            if datetime_attribute == 'days_since_epoch':
+                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
+                    (out_df[datetime_feature] 
+                     - pd.Timestamp(year=1970, month=1, day=1)).dt.days
+            else:
+                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
+                    getattr(out_df[datetime_feature].dt, datetime_attribute)
+    return out_df
+
+datetime_features = [
+    'saledate',
+]
+datetime_attributes = [
+    'year',
+    'month',
+    'day',
+    'quarter',
+    'day_of_year',
+    'day_of_week',
+    'days_since_epoch',
+]
+
+df = encode_datetime_features(df, datetime_features, datetime_attributes)
+
+
+
+

Transform the target if necessary

+

In the interest of speed and efficiency, we didn’t bother doing any EDA with the feature data. Part of my justification for this is that trees are incredibly robust to outliers, collinearity, missingness, and other assorted nonsense in the feature data. However, they are not necessarily robust to nonsense in the target variable, so it’s worth having a look at it before proceeding any further.

+
+
df.SalePrice.hist(); plt.xlabel('SalePrice');
+
+

histogram of sale price showing right-skewed data

+
+
+

Often when predicting prices it makes sense to use log price, especially when they span multiple orders of magnitude or have a strong right skew. These data look pretty friendly, lacking outliers and exhibiting only a mild positive skew; we could probably get away without doing any transformation. But checking the evaluation metric used to score the Kaggle competition, we see they’re using root mean squared log error. That’s equivalent to using RMSE on log-transformed target data, so let’s go ahead and work with log prices.

+
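To make that equivalence concrete, here’s a small sketch of the metric itself (we won’t need this helper below): RMSLE on raw prices is just RMSE computed on log1p-transformed values.

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true))**2))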
+
df['logSalePrice'] = np.log1p(df['SalePrice'])
+df.logSalePrice.hist(); plt.xlabel('logSalePrice');
+
+

histogram of log sale price showing a more symmetric distribution

+
+
+
+
+
+

Train and Evaluate the XGBoost regression model

+

Having prepared our dataset, we are now ready to train an XGBoost model. We’ll walk through the flow step-by-step first, then later we’ll collect the code in a single cell, so it’s easier to quickly iterate through variations of the model.

+
+

Specify target and feature columns

+

First we’ll put together a list of our features and define the target column. I like to have an actual list defined in the code so it’s easier to see everything we’re putting into the model and easier to add or remove features as we iterate. Just run something like list(df.columns) in a cell to get a copy-pasteable list of columns, then edit it down to the full list of features, i.e. remove the target, date columns, and other non-feature columns.

+
+
# list(df.columns)
+
+
+
features = [
+    'SalesID',
+    'MachineID',
+    'ModelID',
+    'datasource',
+    'auctioneerID',
+    'YearMade',
+    'MachineHoursCurrentMeter',
+    'UsageBand',
+    'fiModelDesc',
+    'fiBaseModel',
+    'fiSecondaryDesc',
+    'fiModelSeries',
+    'fiModelDescriptor',
+    'ProductSize',
+    'fiProductClassDesc',
+    'state',
+    'ProductGroup',
+    'ProductGroupDesc',
+    'Drive_System',
+    'Enclosure',
+    'Forks',
+    'Pad_Type',
+    'Ride_Control',
+    'Stick',
+    'Transmission',
+    'Turbocharged',
+    'Blade_Extension',
+    'Blade_Width',
+    'Enclosure_Type',
+    'Engine_Horsepower',
+    'Hydraulics',
+    'Pushblock',
+    'Ripper',
+    'Scarifier',
+    'Tip_Control',
+    'Tire_Size',
+    'Coupler',
+    'Coupler_System',
+    'Grouser_Tracks',
+    'Hydraulics_Flow',
+    'Track_Type',
+    'Undercarriage_Pad_Width',
+    'Stick_Length',
+    'Thumb',
+    'Pattern_Changer',
+    'Grouser_Type',
+    'Backhoe_Mounting',
+    'Blade_Type',
+    'Travel_Controls',
+    'Differential_Type',
+    'Steering_Controls',
+    'saledate_year',
+    'saledate_month',
+    'saledate_day',
+    'saledate_quarter',
+    'saledate_day_of_year',
+    'saledate_day_of_week',
+    'saledate_days_since_epoch'
+]
+
+target = 'logSalePrice'
+
+
+
+

Split the data into training and validation sets

+

Next we split the dataset into a training set and a validation set. Of course since we’re going to evaluate against the validation set a number of times as we iterate, it’s best practice to keep a separate test set reserved to check our final model to ensure it generalizes well. Assuming that final test set is hidden away, we can use the rest of the data for training and validation.

+

There are two main ways we might want to select the validation set. If there isn’t a temporal ordering of the observations, we might be able to randomly sample. In practice, it’s much more common that observations have a temporal ordering, and that models are trained on observations up to a certain time and used to predict on observations occurring after that time. Since this data is temporal, we don’t want to split randomly; instead we’ll split on observation date, reserving the latest observations for the validation set.

+
+
# Temporal Validation Set
+def train_test_split_temporal(df, datetime_column, n_test):
+    idx_sort = np.argsort(df[datetime_column])
+    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
+    return df.iloc[idx_train, :], df.iloc[idx_test, :]
+
+
+# Random Validation Set
+def train_test_split_random(df, n_test):
+    np.random.seed(42)
+    idx_sort = np.random.permutation(len(df))
+    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
+    return df.iloc[idx_train, :], df.iloc[idx_test, :]
+
+my_train_test_split = lambda d, n_valid: train_test_split_temporal(d, 'saledate', n_valid)
+# my_train_test_split = lambda d, n_valid: train_test_split_random(d, n_valid)
+
+
+
n_valid = 12000
+train_df, valid_df = my_train_test_split(df, n_valid)
+
+train_df.shape, valid_df.shape
+
+
((389125, 61), (12000, 61))
+
+
+
+
+

Create DMatrix data objects

+

XGBoost uses its own data structure called DMatrix, which is optimized for memory efficiency and training speed, so next we need to create DMatrix objects for our training and validation datasets.

+
+

If you prefer to use the scikit-learn interface to XGBoost, you don’t need to create these DMatrix objects. More on that below.

+
+
+
dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+
+
+

Set the XGBoost parameters

+

XGBoost has numerous hyperparameters. Fortunately, just a handful of them tend to be the most influential; furthermore, the default values are not bad in most situations. I like to start out with a dictionary containing the default parameter values for just the ones I think are most important. For training there is one more key setting, the number of boosting rounds num_boost_round, which I set to 50 as a starting point; you can make this smaller initially if training takes too long.

+
+
# default values for important parameters
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+
+
+

Train the XGBoost model

+

Check out the documentation on the learning API to see all the training options. During training, I like to have XGBoost print out the evaluation metric on the train and validation set after every few boosting rounds and again at the end of training; that can be done by setting evals and verbose_eval. You can also save the evaluation results in a dictionary passed into evals_result to inspect and plot the objective curve over the training iterations.

+
+
evals_result = {}
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')],
+              verbose_eval=10,
+              evals_result=evals_result)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+
+

Train the XGBoost model using the sklearn interface

+

You can optionally use the sklearn estimator interface to XGBoost. This will bypass the need to use the DMatrix data objects for training and prediction, and it will allow you to leverage many of the other scikit-learn ecosystem tools like pipelines, parameter search, partial dependence plots, etc. The XGBRegressor is available in the xgboost library that we’ve already imported.

+
+
# scikit-learn interface
+reg = xgb.XGBRegressor(n_estimators=num_boost_round, **params)
+reg.fit(train_df[features], train_df[target], 
+        eval_set=[(train_df[features], train_df[target]), (valid_df[features], valid_df[target])], 
+        verbose=10);
+
+
[0] validation_0-rmse:6.74422   validation_1-rmse:6.79733
+[10]    validation_0-rmse:0.34798   validation_1-rmse:0.37158
+[20]    validation_0-rmse:0.26289   validation_1-rmse:0.28239
+[30]    validation_0-rmse:0.25148   validation_1-rmse:0.27028
+[40]    validation_0-rmse:0.24375   validation_1-rmse:0.26420
+[49]    validation_0-rmse:0.23738   validation_1-rmse:0.25855
+
+
+
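Just to show what that ecosystem interoperability buys us, here’s a sketch of dropping the estimator into scikit-learn’s cross-validation utilities (purely an API illustration; random folds ignore the temporal structure of this data):

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(xgb.XGBRegressor(n_estimators=num_boost_round, **params),
                            train_df[features], train_df[target],
                            scoring='neg_root_mean_squared_error', cv=3)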

Since not all features of XGBoost are available through the scikit-learn estimator interface, you might want to get the native booster object back out of the sklearn wrapper.

+
+
m = reg.get_booster()
+
+
+
+

Evaluate the model and check for overfitting

+

We get the model evaluation metrics on the training and validation sets printed to stdout when we use the evals argument to the training API. Typically I just look at those printed metrics, but let’s double check by hand.

+
+
def root_mean_squared_error(y_true, y_pred):
+    return np.sqrt(np.mean((y_true - y_pred)**2))
+
+root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))
+
+
0.25855368
+
+
+
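If you went the scikit-learn interface route, the same check works without a DMatrix; a quick sketch using the reg estimator fitted above:

root_mean_squared_error(valid_df[target].values, reg.predict(valid_df[features]))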

So, how good is that RMSLE of 0.259? Well, checking the Kaggle leaderboard for this competition, we would have come in 53rd out of 474, which is in the top 12% of submissions. That’s not bad for 10 minutes of work doing the bare minimum necessary to transform the raw data into a format consumable by XGBoost and then training a model using default hyperparameter values.

+
+

Note that we’re using a different validation set from that used for the final leaderboard (which is long closed), but our score is likely still a decent approximation for how we would have done in the competition.

+
+

It can be helpful to take a look at objective curves for training and validation data to get a sense for the extent of overfitting. A huge difference between training and validation performance indicates overfitting. In the below curve, there is very little overfitting, indicating we can be aggressive with hyperparameters that increase model flexibility. More on that soon.

+
+
pd.DataFrame({
+    'train': evals_result['train']['rmse'],
+    'valid': evals_result['valid']['rmse']
+}).plot(); plt.xlabel('boosting round'); plt.ylabel('objective');
+
+

line plot showing objective function versus training iteration for training and validation sets

+
+
+
+
+

Check feature importance

+

It’s helpful to get an idea of how much the model is using each feature. In following iterations we might want to try dropping low-signal features or examining the important ones more closely for feature engineering ideas. The gigantic caveat to keep in mind here is that there are different measures of feature importance, and each one will give different importances. XGBoost provides several importance measures, chief among them weight, gain, and cover; I tend to prefer looking at the weight measure because its rankings usually seem most intuitive.

+
+
fig, ax = plt.subplots(figsize=(5,10))
+feature_importances = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
+feature_importances.plot.barh(ax=ax)
+plt.title('Feature Importance');
+
+

feature importance plot showing a few high importance features and many low importance ones

+
+
+
+
+
+

Improve performance using a model iteration loop

+

At this point we have a half-decent prototype model. Now we enter the model iteration loop in which we adjust features and model parameters to find configurations that have better and better performance.

+

Let’s start by putting the feature and target specification, the training/validation split, the model training, and the evaluation all together in one code block that we can copy paste for easy model iteration.

+
+

Note that for this process to be effective, model training needs to take less than 10 seconds. Otherwise you’ll be sitting around waiting way too long. If training takes too long, try training on a sample of the training data, or try reducing the number of boosting rounds.

+
+
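For example, one easy way to speed things up (a hypothetical shortcut, not used in the runs below) is to iterate on a random subsample of the training data and scale back up once you’ve settled on a configuration.

train_sample_df = train_df.sample(frac=0.25, random_state=42)  # swap in for train_df while iterating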
+
features = [
+    'SalesID',
+    'MachineID',
+    'ModelID',
+    'datasource',
+    'auctioneerID',
+    'YearMade',
+    'MachineHoursCurrentMeter',
+    'UsageBand',
+    'fiModelDesc',
+    'fiBaseModel',
+    'fiSecondaryDesc',
+    'fiModelSeries',
+    'fiModelDescriptor',
+    'ProductSize',
+    'fiProductClassDesc',
+    'state',
+    'ProductGroup',
+    'ProductGroupDesc',
+    'Drive_System',
+    'Enclosure',
+    'Forks',
+    'Pad_Type',
+    'Ride_Control',
+    'Stick',
+    'Transmission',
+    'Turbocharged',
+    'Blade_Extension',
+    'Blade_Width',
+    'Enclosure_Type',
+    'Engine_Horsepower',
+    'Hydraulics',
+    'Pushblock',
+    'Ripper',
+    'Scarifier',
+    'Tip_Control',
+    'Tire_Size',
+    'Coupler',
+    'Coupler_System',
+    'Grouser_Tracks',
+    'Hydraulics_Flow',
+    'Track_Type',
+    'Undercarriage_Pad_Width',
+    'Stick_Length',
+    'Thumb',
+    'Pattern_Changer',
+    'Grouser_Type',
+    'Backhoe_Mounting',
+    'Blade_Type',
+    'Travel_Controls',
+    'Differential_Type',
+    'Steering_Controls',
+    'saledate_year',
+    'saledate_month',
+    'saledate_day',
+    'saledate_quarter',
+    'saledate_day_of_year',
+    'saledate_day_of_week',
+    'saledate_days_since_epoch',
+]
+
+target = 'logSalePrice'
+
+train_df, valid_df = train_test_split_temporal(df, 'saledate', 12000)
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')],verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+

Feature selection

+
+

Drop low-importance features

+

Let’s try training a model on only the top k most important features. You can try different values of k for the rankings created from each of the three importance measures. You can play with how many to keep, looking for the optimal number manually.

+
+
feature_importances_weight = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
+feature_importances_cover = pd.Series(m.get_score(importance_type='cover')).sort_values(ascending=False)
+feature_importances_gain = pd.Series(m.get_score(importance_type='gain')).sort_values(ascending=False)
+
+
+
# features = list(feature_importances_weight[:30].index)
+# features = list(feature_importances_cover[:35].index)
+features = list(feature_importances_gain[:30].index)
+
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37150
+[20]    train-rmse:0.26182  valid-rmse:0.27986
+[30]    train-rmse:0.24974  valid-rmse:0.26896
+[40]    train-rmse:0.24282  valid-rmse:0.26043
+[49]    train-rmse:0.23768  valid-rmse:0.25664
+
+
+

Looks like keeping the top 30 from the gain importance type gives a slight performance improvement.

+
+
+

Drop one feature at a time

+

Next try dropping each feature out of the model one-at-a-time to see if there are any more features that you can drop. For each feature, drop it from the feature set, then train a new model, then record the evaluation score. At the end, sort the scores to see which features are the best candidates for removal.

+
+
features = [
+    'Coupler_System',
+     'Tire_Size',
+     'Scarifier',
+     'ProductSize',
+     'Ride_Control',
+     'fiBaseModel',
+     'Enclosure',
+     'Pad_Type',
+     'YearMade',
+     'fiSecondaryDesc',
+     'ProductGroup',
+     'Drive_System',
+     'Ripper',
+     'saledate_days_since_epoch',
+     'fiModelDescriptor',
+     'fiProductClassDesc',
+     'MachineID',
+     'Hydraulics',
+     'SalesID',
+     'Track_Type',
+     'ModelID',
+     'fiModelDesc',
+     'Travel_Controls',
+     'Transmission',
+     'Blade_Extension',
+     'fiModelSeries',
+     'Grouser_Tracks',
+     'Undercarriage_Pad_Width',
+     'Stick',
+     'Thumb'
+]
+
+# drop each feature one-at-a-time
+scores = []
+for i, feature in enumerate(features):
+    drop_one_features = features[:i] + features[i+1:]
+
+    dtrain = xgb.DMatrix(data=train_df[drop_one_features], label=train_df[target], enable_categorical=True)
+    dvalid = xgb.DMatrix(data=valid_df[drop_one_features], label=valid_df[target], enable_categorical=True)
+
+    params = {
+        'learning_rate': 0.3,
+        'max_depth': 6,
+        'min_child_weight': 1,
+        'subsample': 1,
+        'colsample_bynode': 1,
+        'objective': 'reg:squarederror',
+    }
+    num_boost_round = 50
+
+    m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+                evals=[(dtrain, 'train'), (dvalid, 'valid')],
+                verbose_eval=False)
+    score = root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))
+    scores.append(score)
+
+results_df = pd.DataFrame({
+    'feature': features,
+    'score': scores
+})
+results_df.sort_values(by='score')
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
featurescore
18SalesID0.252617
5fiBaseModel0.253710
27Undercarriage_Pad_Width0.254032
17Hydraulics0.254114
20ModelID0.254169
4Ride_Control0.254278
16MachineID0.254413
19Track_Type0.254825
6Enclosure0.254958
28Stick0.255164
1Tire_Size0.255365
10ProductGroup0.255404
22Travel_Controls0.255895
29Thumb0.256300
23Transmission0.256380
26Grouser_Tracks0.256395
11Drive_System0.256652
24Blade_Extension0.256698
7Pad_Type0.256952
25fiModelSeries0.257073
2Scarifier0.257590
12Ripper0.257848
0Coupler_System0.258074
21fiModelDesc0.258712
13saledate_days_since_epoch0.259856
14fiModelDescriptor0.260439
9fiSecondaryDesc0.260782
15fiProductClassDesc0.263790
3ProductSize0.268068
8YearMade0.313105
+ +
+
+
+

Next try removing the feature with the best removal score. Then with that feature still removed, also try removing the feature with the next best removal score and so on. Repeat this process until the model evaluation metric is no longer improving. I think this could be considered a faster version of backward stepwise feature selection.

+
+
features = [
+    'Coupler_System',
+     'Tire_Size',
+     'Scarifier',
+     'ProductSize',
+     'Ride_Control',
+#      'fiBaseModel',
+     'Enclosure',
+     'Pad_Type',
+     'YearMade',
+     'fiSecondaryDesc',
+     'ProductGroup',
+     'Drive_System',
+     'Ripper',
+     'saledate_days_since_epoch',
+     'fiModelDescriptor',
+     'fiProductClassDesc',
+     'MachineID',
+#      'Hydraulics',
+#      'SalesID',
+     'Track_Type',
+     'ModelID',
+     'fiModelDesc',
+     'Travel_Controls',
+     'Transmission',
+     'Blade_Extension',
+     'fiModelSeries',
+     'Grouser_Tracks',
+#      'Undercarriage_Pad_Width',
+     'Stick',
+     'Thumb'
+]
+
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79145
+[10]    train-rmse:0.34882  valid-rmse:0.37201
+[20]    train-rmse:0.26050  valid-rmse:0.27386
+[30]    train-rmse:0.24844  valid-rmse:0.26205
+[40]    train-rmse:0.24042  valid-rmse:0.25426
+[49]    train-rmse:0.23549  valid-rmse:0.25004
+
+
+

So here I was able to remove four more features before the score started getting worse. With our reduced feature set, we’re now ranking 39th on that Kaggle leaderboard. Let’s see how far we can get with some hyperparameter tuning.

+
+
+
+

Tune the XGBoost hyperparameters

+

This is a topic which deserves its own full-length post, but just for fun, here I’ll do a quick and dirty hand tuning without a ton of explanation.

+

Broadly speaking, my process is to increase model expressiveness by increasing the maximum tree depth until it looks like I’m overfitting. At that point, I start pushing tree pruning parameters like min child weight and regularization parameters like lambda to counteract the overfitting. That process led me to the following parameters.

+
+
params = {
+    'learning_rate': 0.3,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 5,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74473  valid-rmse:6.80196
+[10]    train-rmse:0.31833  valid-rmse:0.34151
+[20]    train-rmse:0.22651  valid-rmse:0.24885
+[30]    train-rmse:0.21501  valid-rmse:0.23904
+[40]    train-rmse:0.20897  valid-rmse:0.23645
+[49]    train-rmse:0.20418  valid-rmse:0.23412
+
+
+

That gets us up to 12th place. Next I start reducing the learning rate and increasing the boosting rounds in proportion to one another.

+
+
params = {
+    'learning_rate': 0.3/5,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 5,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50*5
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:9.04930  valid-rmse:9.12743
+[10]    train-rmse:4.88505  valid-rmse:4.93769
+[20]    train-rmse:2.64630  valid-rmse:2.68501
+[30]    train-rmse:1.44703  valid-rmse:1.47923
+[40]    train-rmse:0.81123  valid-rmse:0.84079
+[50]    train-rmse:0.48441  valid-rmse:0.51272
+[60]    train-rmse:0.32887  valid-rmse:0.35434
+[70]    train-rmse:0.26276  valid-rmse:0.28630
+[80]    train-rmse:0.23720  valid-rmse:0.26026
+[90]    train-rmse:0.22658  valid-rmse:0.24932
+[100]   train-rmse:0.22119  valid-rmse:0.24441
+[110]   train-rmse:0.21747  valid-rmse:0.24114
+[120]   train-rmse:0.21479  valid-rmse:0.23923
+[130]   train-rmse:0.21250  valid-rmse:0.23768
+[140]   train-rmse:0.21099  valid-rmse:0.23618
+[150]   train-rmse:0.20928  valid-rmse:0.23524
+[160]   train-rmse:0.20767  valid-rmse:0.23445
+[170]   train-rmse:0.20658  valid-rmse:0.23375
+[180]   train-rmse:0.20558  valid-rmse:0.23307
+[190]   train-rmse:0.20431  valid-rmse:0.23252
+[200]   train-rmse:0.20316  valid-rmse:0.23181
+[210]   train-rmse:0.20226  valid-rmse:0.23145
+[220]   train-rmse:0.20133  valid-rmse:0.23087
+[230]   train-rmse:0.20045  valid-rmse:0.23048
+[240]   train-rmse:0.19976  valid-rmse:0.23023
+[249]   train-rmse:0.19902  valid-rmse:0.23009
+
+
+

Decreasing the learning rate and increasing the boosting rounds got us up to a 2nd place score. Notice that the score is still decreasing on the validation set. We can actually continue boosting on this model by passing it to the xgb_model argument in the train function. We want to go very very slowly here to avoid overshooting the minimum of the objective function. To do that I ramp up the lambda regularization parameter and boost a few more rounds from where we left off.

+
+
# second stage
+params = {
+    'learning_rate': 0.3/10,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 60,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50*3
+
+m1 = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10,
+              xgb_model=m)
+
+
[0] train-rmse:0.19900  valid-rmse:0.23007
+[10]    train-rmse:0.19862  valid-rmse:0.22990
+[20]    train-rmse:0.19831  valid-rmse:0.22975
+[30]    train-rmse:0.19796  valid-rmse:0.22964
+[40]    train-rmse:0.19768  valid-rmse:0.22955
+[50]    train-rmse:0.19739  valid-rmse:0.22940
+[60]    train-rmse:0.19714  valid-rmse:0.22935
+[70]    train-rmse:0.19689  valid-rmse:0.22927
+[80]    train-rmse:0.19664  valid-rmse:0.22915
+[90]    train-rmse:0.19646  valid-rmse:0.22915
+[100]   train-rmse:0.19620  valid-rmse:0.22910
+[110]   train-rmse:0.19604  valid-rmse:0.22907
+[120]   train-rmse:0.19583  valid-rmse:0.22901
+[130]   train-rmse:0.19562  valid-rmse:0.22899
+[140]   train-rmse:0.19546  valid-rmse:0.22898
+[149]   train-rmse:0.19520  valid-rmse:0.22886
+
+
+
+
root_mean_squared_error(dvalid.get_label(), m1.predict(dvalid))
+
+
0.22885828
+
+
+

And that gets us to 1st place on the leaderboard.

+
+
+
+

Wrapping Up

+

There you have it, how to use XGBoost to solve a regression problem in python with world class performance. Remember you can use the XGBoost regression notebook from my ds-templates repo to make it easy to follow this flow on your own problems. If you found this helpful, or if you have additional ideas about solving regression problems with XGBoost, let me know down in the comments.

+
+ + ]]>
+ python + tutorial + gradient boosting + xgboost + https://randomrealizations.com/posts/xgboost-for-regression-in-python/index.html + Tue, 24 Oct 2023 22:00:00 GMT + +
Blogging with Quarto and Jupyter: The Complete Guide Matt Bowers @@ -479,7 +2707,7 @@ image-alt: "A London Underground train emerging from a tunnel" tutorial blogging https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/index.html - Tue, 05 Sep 2023 21:00:00 GMT + Tue, 05 Sep 2023 22:00:00 GMT @@ -2431,7 +4659,7 @@ xgboost score: 0.24123239765807963 gradient boosting from scratch https://randomrealizations.com/posts/xgboost-from-scratch/index.html - Fri, 06 May 2022 21:00:00 GMT + Fri, 06 May 2022 22:00:00 GMT @@ -3525,2303 +5753,1611 @@ font-style: inherit;">= np.argsort(x) sort_y, sort_x = y[sort_idx], x[sort_idx] - sum_y, n = y.sum(), len(y) - sum_y_right, n_right = sum_y, n - sum_y_left, n_left = 0., 0 - - for i in range(0, self.n - self.min_samples_leaf): - y_i, x_i, x_i_next = sort_y[i], sort_x[i], sort_x[i + 1] - sum_y_left += y_i; sum_y_right -= y_i - n_left += 1; n_right -= 1 - if n_left < self.min_samples_leaf or x_i == x_i_next: - continue - score = - sum_y_left**2 / n_left - sum_y_right**2 / n_right + sum_y**2 / n - if score < self.best_score_so_far: - self.best_score_so_far = score - self.split_feature_idx = feature_idx - self.threshold = (x_i + x_i_next) / 2 - - def __repr__(self): - s = f'n: {self.n}' - s += f'; value:{self.value:0.2f}' - if not self.is_leaf: - split_feature_name = self.X.columns[self.split_feature_idx] - s += f'; split: {split_feature_name} <= {self.threshold:0.3f}' - return s - - def predict(self, X): - return np.array([self._predict_row(row) for i, row in X.iterrows()]) - - def _predict_row(self, row): - if self.is_leaf: - return self.value - child = self.left if row[self.split_feature_idx] <= self.threshold \ - else sum_y, n self.right - = y.return child._predict_row(row)
- - -
-

From Scratch versus Scikit-Learn

-

As usual, we’ll test our homegrown handiwork by comparing it to the existing implementation in scikit-learn. First let’s train both models on the California Housing dataset which gives us 20k instances and 8 features to predict median house price by district.

-
-
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

max_depth = 8
min_samples_leaf = 16

tree = DecisionTree(X_train, y_train, max_depth=max_depth, min_samples_leaf=min_samples_leaf)
pred = tree.predict(X_test)

sk_tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
sk_tree.fit(X_train, y_train)
sk_pred = sk_tree.predict(X_test)

print(f'from scratch MSE: {mean_squared_error(y_test, pred):0.4f}')
print(f'scikit-learn MSE: {mean_squared_error(y_test, sk_pred):0.4f}')

from scratch MSE: 0.3988
scikit-learn MSE: 0.3988

We get similar accuracy on a held-out test dataset.

Let’s benchmark the two implementations on training time.

%%time
sk_tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
sk_tree.fit(X_train, y_train);

CPU times: user 45.3 ms, sys: 555 µs, total: 45.8 ms
Wall time: 45.3 ms

DecisionTreeRegressor(max_depth=8, min_samples_leaf=16)

%%time
tree = DecisionTree(X_train, y_train, max_depth=max_depth, min_samples_leaf=min_samples_leaf)

CPU times: user 624 ms, sys: 1.65 ms, total: 625 ms
Wall time: 625 ms
-
-
-

Wow, the scikit-learn implementation absolutely smoked us, training an order of magnitude faster. This is to be expected, since they implement split finding in cython, which generates compiled C code that can run much faster than our native python code. Maybe we can take a look at how to optimize python code with cython here on the blog one of these days.

-
-
-

Wrapping Up

-

Holy cow, we just implemented a decision tree using nothing but numpy. I hope you enjoyed the scratch build as much as I did, and I hope you got a little bit better at coding (I certainly did). That was actually way harder than I expected, but looking back at the finished product, it doesn’t seem so bad right? I almost thought we were going to get away with not implementing our own decision tree, but it turns out that this will be super helpful for us when it comes time to implement XGBoost from scratch.

-
-
-

References

-

This implementation is inspired and partially adapted from Jeremy Howard’s live coding of a Random Forest as part of the fastai ML course.

-
- - ]]> - python - gradient boosting - from scratch - https://randomrealizations.com/posts/decision-tree-from-scratch/index.html - Sun, 12 Dec 2021 21:00:00 GMT - - - - How to Implement a Gradient Boosting Machine that Works with Any Loss Function - Matt Bowers - https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/index.html - -
-

-
Cold water cascades over the rocks in Erwin, Tennessee.
-
- -

Friends, this is going to be an epic post! Today, we bring together all the ideas we’ve built up over the past few posts to nail down our understanding of the key ideas in Jerome Friedman’s seminal 2001 paper: “Greedy Function Approximation: A Gradient Boosting Machine.” In particular, we’ll summarize the highlights from the paper, and we’ll build an in-house python implementation of his generic gradient boosting algorithm which can train with any differentiable loss function. What’s more, we’ll go ahead and take our generic gradient boosting machine for a spin by training it with several of the most popular loss functions used in practice.

-

Are you freaking stoked or what?

-

Sweet. Let’s do this.

-
-

Friedman 2001: TL;DR

-

I’ve mentioned this paper a couple of times before, but as far as I can tell, this is the origin of gradient boosting; it is therefore, a seminal work worth reading. You know what, I think you might like to pick up the paper and read it yourself. Like many papers, there is a lot of scary looking math in the first few pages, but if you’ve been following along on this blog, you’ll find that it’s actually totally approachable. This is the kind of thing that cures imposter syndrome, so give it a shot. That said, here’s the TL;DR as I see it.

-

The first part of the paper introduces the idea of fitting models by doing gradient descent in function space, an ingenious idea we spent an entire post demystifying earlier. Friedman goes on to introduce the generic gradient boost algorithm, which works with any differentiable loss function, as well as specific variants for minimizing absolute error, Huber loss, and binary deviance. In terms of hyperparameters, he points out that the learning rate can be used to reduce overfitting, while increased tree depth can help capture more complex interactions among features. He even discusses feature importance and partial dependence methods for interpreting fitted gradient boosting models.

-

Friedman concludes by musing about the advantages of gradient boosting with trees. He notes some key advantages afforded by the use of decision trees including no need to rescale input data, robustness against irrelevant input features, and elegant handling of missing feature values. He points out that gradient boosting manages to capitalize on the benefits of decision trees while minimizing their key weakness (crappy accuracy). I think this offers a great insight into why gradient boosting models have become so widespread and successful in practical ML applications.

-
-
-

Friedman’s Generic Gradient Boosting Algorithm

-

Let’s take a closer look at Friedman’s original gradient boost algorithm, Alg. 1 in Section 3 of the paper (translated into the notation we’ve been using so far).

-

Like last time, we have training data where is a length- vector of target values, and is an matrix with observations of features. We also have a differentiable loss function , a “learning rate” hyperparameter , and a fixed number of model iterations .

-

Algorithm: gradient_boost returns: model

-
    -
  1. Let base model , where

  2. -
  3. for = to :

  4. -
  5.      Let “pseudo-residual” vector

  6. -
  7.      Train decision tree regressor to predict (minimizing squared error)

  8. -
  9.      foreach terminal leaf node :

  10. -
  11.           Let

  12. -
  13.           Set terminal leaf node to predict value

  14. -
  15.     

  16. -
  17. Return composite model

  18. -
-

By now, most of this is already familiar to us. We begin by setting the base model equal to the constant prediction value that minimizes the loss over all examples in the training dataset (line 1). Then we begin the boosting iterations (line 2), each time computing the negative gradients of the loss with respect to the current model predictions (known as the pseudo residuals) (line 3). We then fit our next decision tree regressor to predict the pseudo residuals (line 4).

-

Then we encounter something new on lines 5-7. When we fit a vanilla decision tree regressor to predict pseudo residuals, we’re using mean squared error as the loss function to train the tree. As you might imagine, this works well when the global loss function is also squared error. But if we want to use a global loss other than squared error, there is an additional trick we can use to further increase the composite model’s accuracy. The idea is to continue using squared error to train each decision tree, keeping its structure and split conditions but altering the predicted value in each leaf to help minimize the global loss function. Instead of using the mean target value as the prediction for each node (as we would do when minimizing squared error), we use a numerical optimization method like line search to choose the constant value for that leaf that leads to the best overall loss. This is the same thing we did in line 1 of the algorithm to set the base prediction, but here we choose the optimal prediction for each terminal node of the newly trained decision tree.

-
-
-

Implementation

-

I did some (half-assed) searching on the interweb for an implementation of GBM that allows the user to provide a custom loss function, and you know what? I couldn’t find anything. If you find another implementation, post in the comments so we can learn from it too.

-

Since we need to modify the values predicted by our decision trees’ terminal nodes, we’ll want to brush up on the scikit-learn decision tree structure before we get going. You can see explanations of all the necessary decision tree hacks in this notebook.

-
-
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize

class GradientBoostingMachine():
    '''Gradient Boosting Machine supporting any user-supplied loss function.

    Parameters
    ----------
    n_trees : int
        number of boosting rounds

    learning_rate : float
        learning rate hyperparameter

    max_depth : int
        maximum tree depth
    '''

    def __init__(self, n_trees, learning_rate=0.1, max_depth=1):
        self.n_trees=n_trees;
        self.learning_rate=learning_rate
        self.max_depth=max_depth;

    def fit(self, X, y, objective):
        '''Fit the GBM using the specified loss function.

        Parameters
        ----------
        X : ndarray of size (number observations, number features)
            design matrix

        y : ndarray of size (number observations,)
            target values

        objective : loss function class instance
            Class specifying the loss function for training.
            Should implement two methods:
                loss(labels: ndarray, predictions: ndarray) -> float
                negative_gradient(labels: ndarray, predictions: ndarray) -> ndarray
        '''

        self.trees = []
        self.base_prediction = self._get_optimal_base_value(y, objective.loss)
        current_predictions = self.base_prediction * np.ones(shape=y.shape)
        for _ in range(self.n_trees):
            pseudo_residuals = objective.negative_gradient(y, current_predictions)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)
            self._update_terminal_nodes(tree, X, y, current_predictions, objective.loss)
            current_predictions += self.learning_rate * tree.predict(X)
            self.trees.append(tree)

    def _get_optimal_base_value(self, y, loss):
        '''Find the optimal initial prediction for the base model.'''
        fun = lambda c: loss(y, c)
        c0 = y.mean()
        return minimize(fun=fun, x0=c0).x[0]

    def _update_terminal_nodes(self, tree, X, y, current_predictions, loss):
        '''Update the tree's predictions according to the loss function.'''
        # terminal node id's
        leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
        # compute leaf for each sample in ``X``.
        leaf_node_for_each_sample = tree.apply(X)
        for leaf in leaf_nodes:
            samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
            y_in_leaf = y.take(samples_in_this_leaf, axis=0)
            preds_in_leaf = current_predictions.take(samples_in_this_leaf, axis=0)
            val =


From Scratch versus Scikit-Learn

As usual, we’ll test our homegrown handiwork by comparing it to the existing implementation in scikit-learn. First let’s train both models on the California Housing dataset which gives us 20k instances and 8 features to predict median house price by district.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

max_depth = 8
min_samples_leaf = 16

tree = DecisionTree(X_train, y_train, max_depth=max_depth, min_samples_leaf=min_samples_leaf)
pred = tree.predict(X_test)

sk_tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
sk_tree.fit(X_train, y_train)
sk_pred = sk_tree.predict(X_test)

print(f'from scratch MSE: {mean_squared_error(y_test, pred):0.4f}')
print(f'scikit-learn MSE: {mean_squared_error(y_test, sk_pred):0.4f}')

from scratch MSE: 0.3988
scikit-learn MSE: 0.3988

We get similar accuracy on a held-out test dataset.

Let’s benchmark the two implementations on training time.

%%time
sk_tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
sk_tree.fit(X_train, y_train);

CPU times: user 45.3 ms, sys: 555 µs, total: 45.8 ms
Wall time: 45.3 ms

DecisionTreeRegressor(max_depth=8, min_samples_leaf=16)

%%time
tree = DecisionTree(X_train, y_train, max_depth=max_depth, min_samples_leaf=min_samples_leaf)

CPU times: user 624 ms, sys: 1.65 ms, total: 625 ms
Wall time: 625 ms
+
+
+

Wow, the scikit-learn implementation absolutely smoked us, training an order of magnitude faster. This is to be expected, since they implement split finding in cython, which generates compiled C code that can run much faster than our native python code. Maybe we can take a look at how to optimize python code with cython here on the blog one of these days.

+
+
+

Wrapping Up

+

Holy cow, we just implemented a decision tree using nothing but numpy. I hope you enjoyed the scratch build as much as I did, and I hope you got a little bit better at coding (I certainly did). That was actually way harder than I expected, but looking back at the finished product, it doesn’t seem so bad right? I almost thought we were going to get away with not implementing our own decision tree, but it turns out that this will be super helpful for us when it comes time to implement XGBoost from scratch.

+
+
+

References

+

This implementation is inspired and partially adapted from Jeremy Howard’s live coding of a Random Forest as part of the fastai ML course.

+
+ + ]]>
+ python + gradient boosting + from scratch + https://randomrealizations.com/posts/decision-tree-from-scratch/index.html + Sun, 12 Dec 2021 22:00:00 GMT + +
+ + How to Implement a Gradient Boosting Machine that Works with Any Loss Function + Matt Bowers + https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/index.html + +
+

+
Cold water cascades over the rocks in Erwin, Tennessee.
+
+ +

Friends, this is going to be an epic post! Today, we bring together all the ideas we’ve built up over the past few posts to nail down our understanding of the key ideas in Jerome Friedman’s seminal 2001 paper: “Greedy Function Approximation: A Gradient Boosting Machine.” In particular, we’ll summarize the highlights from the paper, and we’ll build an in-house python implementation of his generic gradient boosting algorithm which can train with any differentiable loss function. What’s more, we’ll go ahead and take our generic gradient boosting machine for a spin by training it with several of the most popular loss functions used in practice.

+

Are you freaking stoked or what?

+

Sweet. Let’s do this.

+
+

Friedman 2001: TL;DR

+

I’ve mentioned this paper a couple of times before, but as far as I can tell, this is the origin of gradient boosting; it is therefore, a seminal work worth reading. You know what, I think you might like to pick up the paper and read it yourself. Like many papers, there is a lot of scary looking math in the first few pages, but if you’ve been following along on this blog, you’ll find that it’s actually totally approachable. This is the kind of thing that cures imposter syndrome, so give it a shot. That said, here’s the TL;DR as I see it.

+

The first part of the paper introduces the idea of fitting models by doing gradient descent in function space, an ingenious idea we spent an entire post demystifying earlier. Friedman goes on to introduce the generic gradient boost algorithm, which works with any differentiable loss function, as well as specific variants for minimizing absolute error, Huber loss, and binary deviance. In terms of hyperparameters, he points out that the learning rate can be used to reduce overfitting, while increased tree depth can help capture more complex interactions among features. He even discusses feature importance and partial dependence methods for interpreting fitted gradient boosting models.

+

Friedman concludes by musing about the advantages of gradient boosting with trees. He notes some key advantages afforded by the use of decision trees including no need to rescale input data, robustness against irrelevant input features, and elegant handling of missing feature values. He points out that gradient boosting manages to capitalize on the benefits of decision trees while minimizing their key weakness (crappy accuracy). I think this offers a great insight into why gradient boosting models have become so widespread and successful in practical ML applications.

+
+
+

Friedman’s Generic Gradient Boosting Algorithm

+

Let’s take a closer look at Friedman’s original gradient boost algorithm, Alg. 1 in Section 3 of the paper (translated into the notation we’ve been using so far).

+

Like last time, we have training data $(\mathbf{X}, \mathbf{y})$ where $\mathbf{y}$ is a length-$n$ vector of target values, and $\mathbf{X}$ is an $n \times p$ matrix with $n$ observations of $p$ features. We also have a differentiable loss function $L(\mathbf{y}, \hat{\mathbf{y}})$, a “learning rate” hyperparameter $\eta$, and a fixed number of model iterations $M$.

+

Algorithm: gradient_boost($\mathbf{X}$, $\mathbf{y}$, $L$, $\eta$, $M$) returns: model $F_M$

  1. Let base model $F_0(\mathbf{x}) = c_0$, where $c_0 = \text{argmin}_c \sum_i L(y_i, c)$
  2. for $m$ = $1$ to $M$:
  3.      Let “pseudo-residual” vector $\mathbf{r}_m = -\nabla_{\hat{\mathbf{y}}} L(\mathbf{y}, \hat{\mathbf{y}}) \big|_{\hat{\mathbf{y}} = F_{m-1}(\mathbf{X})}$
  4.      Train decision tree regressor $h_m(\mathbf{x})$ to predict $\mathbf{r}_m$ (minimizing squared error)
  5.      foreach terminal leaf node $t \in h_m$:
  6.           Let $v_t = \text{argmin}_c \sum_{i \in t} L(y_i, F_{m-1}(\mathbf{x}_i) + c)$
  7.           Set terminal leaf node $t$ to predict value $v_t$
  8.      Let $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$
  9. Return composite model $F_M(\mathbf{x})$

By now, most of this is already familiar to us. We begin by setting the base model equal to the constant prediction value that minimizes the loss over all examples in the training dataset (line 1). Then we begin the boosting iterations (line 2), each time computing the negative gradients of the loss with respect to the current model predictions (known as the pseudo residuals) (line 3). We then fit our next decision tree regressor to predict the pseudo residuals (line 4).

+

Then we encounter something new on lines 5-7. When we fit a vanilla decision tree regressor to predict pseudo residuals, we’re using mean squared error as the loss function to train the tree. As you might imagine, this works well when the global loss function is also squared error. But if we want to use a global loss other than squared error, there is an additional trick we can use to further increase the composite model’s accuracy. The idea is to continue using squared error to train each decision tree, keeping its structure and split conditions but altering the predicted value in each leaf to help minimize the global loss function. Instead of using the mean target value as the prediction for each node (as we would do when minimizing squared error), we use a numerical optimization method like line search to choose the constant value for that leaf that leads to the best overall loss. This is the same thing we did in line 1 of the algorithm to set the base prediction, but here we choose the optimal prediction for each terminal node of the newly trained decision tree.
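To make that idea concrete, here is a tiny standalone sketch (my own toy numbers, not from the paper or the post) of what the leaf-value line search does under absolute error loss, where the optimal constant for a leaf works out to be the median of its targets rather than the mean.

# toy illustration of choosing a leaf's constant prediction by numerical line search
import numpy as np
from scipy.optimize import minimize

y_in_leaf = np.array([1.0, 2.0, 7.0])      # made-up targets falling in one leaf
current_preds = np.zeros_like(y_in_leaf)   # predictions from the model so far
leaf_loss = lambda c: np.mean(np.abs(y_in_leaf - (current_preds + c)))
best_c = minimize(leaf_loss, x0=y_in_leaf.mean(), method='Nelder-Mead').x[0]
print(best_c)  # lands near 2.0 (the median), not the mean of 3.33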

+
+
+

Implementation

+

I did some (half-assed) searching on the interweb for an implementation of GBM that allows the user to provide a custom loss function, and you know what? I couldn’t find anything. If you find another implementation, post in the comments so we can learn from it too.

+

Since we need to modify the values predicted by our decision trees’ terminal nodes, we’ll want to brush up on the scikit-learn decision tree structure before we get going. You can see explanations of all the necessary decision tree hacks in this notebook.

+
+
import numpy as np 
from sklearn.tree import DecisionTreeRegressor 
from scipy.optimize import minimize

class GradientBoostingMachine():
    '''Gradient Boosting Machine supporting any user-supplied loss function.
    
    Parameters
    ----------
    n_trees : int
        number of boosting rounds
        
    learning_rate : float
        learning rate hyperparameter
        
    max_depth : int
        maximum tree depth
    '''
    
    def __init__(self, n_trees, learning_rate=0.1, max_depth=1):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth
    
    def fit(self, X, y, objective):
        '''Fit the GBM using the specified loss function.
        
        Parameters
        ----------
        X : ndarray of size (number observations, number features)
            design matrix
            
        y : ndarray of size (number observations,)
            target values
            
        objective : loss function class instance
            Class specifying the loss function for training.
            Should implement two methods:
                loss(labels: ndarray, predictions: ndarray) -> float
                negative_gradient(labels: ndarray, predictions: ndarray) -> ndarray
        '''
        
        self.trees = []
        self.base_prediction = self._get_optimal_base_value(y, objective.loss)
        current_predictions = self.base_prediction * np.ones(shape=y.shape)
        for _ in range(self.n_trees):
            pseudo_residuals = objective.negative_gradient(y, current_predictions)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)
            self._update_terminal_nodes(tree, X, y, current_predictions, objective.loss)
            current_predictions += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
     
    def _get_optimal_base_value(self, y, loss):
        '''Find the optimal initial prediction for the base model.'''
        fun = lambda c: loss(y, c)
        c0 = y.mean()
        return minimize(fun=fun, x0=c0).x[0]
        
    def _update_terminal_nodes(self, tree, X, y, current_predictions, loss):
        '''Update the tree's predictions according to the loss function.'''
        # terminal node id's
        leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
        # compute leaf for each sample in ``X``.
        leaf_node_for_each_sample = tree.apply(X)
        for leaf in leaf_nodes:
            samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
            y_in_leaf = y.take(samples_in_this_leaf, axis=0)
            preds_in_leaf = current_predictions.take(samples_in_this_leaf, axis=0)
            val = self._get_optimal_leaf_value(y_in_leaf, 
                                               preds_in_leaf,
                                               loss)
            tree.tree_.value[leaf, 0, 0] = val
            
    def _get_optimal_leaf_value(self, y, current_predictions, loss):
        '''Find the optimal prediction value for a given leaf.'''
        fun = lambda c: loss(y, current_predictions + c)
        c0 = y.mean()
        return minimize(fun=fun, x0=c0).x[0]
          
    def predict(self, X):
        '''Generate predictions for the given input data.'''
        return (self.base_prediction 
                + self.learning_rate 
                * np.sum([tree.predict(X) for tree in self.trees], axis=0))

In terms of design, we implement a class for the GBM with scikit-like fit and predict methods. Notice that the fit method is only 10 lines long, and it corresponds very closely to Friedman’s gradient boost algorithm from above. Most of the complexity comes from the helper methods for updating the leaf values according to the specified loss function.

-

When the user wants to call the fit method, they’ll need to supply the loss function they want to use for boosting. We’ll make the user implement their loss (a.k.a. objective) function as a class with two methods: (1) a loss method taking the labels and the predictions and returning the loss score and (2) a negative_gradient method taking the labels and the predictions and returning an array of negative gradients.
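Schematically, a user-supplied objective just needs the following shape (the class name here is purely illustrative):

# skeleton of the interface our GBM expects from an objective (illustrative only)
class MyCustomLoss():
    def loss(self, y, preds):
        '''Return a single float scoring predictions against labels.'''
        ...
    def negative_gradient(self, y, preds):
        '''Return an ndarray of -dLoss/dpreds, evaluated elementwise.'''
        ...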

-
-
-

Testing our Model

-

Let’s test drive our custom-loss-ready GBM with a few different loss functions! We’ll compare it to the scikit-learn GBM to sanity check our implementation.

-
-
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

rng = np.random.default_rng()

# test data
def make_test_data(n, noise_scale):
    x = np.linspace(0, 10, 500).reshape(-1,1)
    y = (np.where(x < 5, x, 5) + rng.normal(0, noise_scale, size=x.shape)).ravel()
    return x, y
    
# print model loss scores
def print_model_loss_scores(obj, y, preds, sk_preds):
    print(f'From Scratch Loss = {obj.loss(y, pred):0.4}')
    print(f'Scikit-Learn Loss = {obj.loss(y, sk_pred):0.4}')

Mean Squared Error

-

Mean Squared Error (a.k.a. Least Squares) loss produces estimates of the mean target value conditioned on the feature values. Here’s the implementation.

-
-
x, y = make_test_data(500, 0.4)

# from scratch GBM
class SquaredErrorLoss():
    '''User-Defined Squared Error Loss'''
    
    def loss(self, y, preds):
        return np.mean((y - preds)**2)
    
    def negative_gradient(self, y, preds):
        return y - preds
    

gbm = GradientBoostingMachine(n_trees=10,
                              learning_rate=0.5,
                              max_depth=1)
gbm.fit(x, y, SquaredErrorLoss())
pred = gbm.predict(x)

# scikit-learn GBM
sk_gbm = GradientBoostingRegressor(n_estimators=10,
                                   learning_rate=0.5,
                                   max_depth=1,
                                   loss='squared_error')
sk_gbm.fit(x, y)
sk_pred = sk_gbm.predict(x)
print_model_loss_scores(SquaredErrorLoss(), y, pred, sk_pred)
-
-
From Scratch Loss = 0.168
-Scikit-Learn Loss = 0.168
-
-
-
-
-

Scatterplot showing data and model prediction of y given x

-
-
-
-
-

Mean Absolute Error

-

Mean Absolute Error (a.k.a. Least Absolute Deviations) loss produces estimates of the median target value conditioned on the feature values. Here’s the implementation.

-
-
x, y = make_test_data(500, 0.4)

# from scratch GBM
class AbsoluteErrorLoss():
    '''User-Defined Absolute Error Loss'''
    
    def loss(self, y, preds):
        return np.mean(np.abs(y - preds))
    
    def negative_gradient(self, y, preds):
        return np.sign(y - preds)


gbm = GradientBoostingMachine(n_trees=10,
                              learning_rate=0.5,
                              max_depth=1)
gbm.fit(x, y, AbsoluteErrorLoss())
pred = gbm.predict(x)

# scikit-learn GBM
sk_gbm = GradientBoostingRegressor(n_estimators=10,
                                   learning_rate=0.5,
                                   max_depth=1,
                                   loss='absolute_error')
sk_gbm.fit(x, y)
sk_pred = sk_gbm.predict(x)
print_model_loss_scores(AbsoluteErrorLoss(), y, pred, sk_pred)
-
-
From Scratch Loss = 0.3225
-Scikit-Learn Loss = 0.3208
-
-
-
-
-

Figure showing scatterplot of data and model prediction of median of y given x

-
self.trees]">

Quantile Loss

-

Quantile loss yields estimates of a given quantile of the target variable conditioned on the features. Here’s my implementation.

-
-
x, y = make_test_data(500, 1)

# from scratch GBM
class QuantileLoss():
    '''Quantile Loss
    
    Parameters
    ----------
    alpha : float
        quantile to be estimated, 0 < alpha < 1
    '''
    
    def __init__(self, alpha):
        if alpha < 0 or alpha > 1:
            raise ValueError('alpha must be between 0 and 1')
        self.alpha = alpha
        
    def loss(self, y, preds):
        e = y - preds
        return np.mean(np.where(e > 0, self.alpha * e, (self.alpha - 1) * e))
    
    def negative_gradient(self, y, preds):
        e = y - preds 
        return np.where(e > 0, self.alpha, self.alpha - 1)

gbm = GradientBoostingMachine(n_trees=10,
                              learning_rate=0.5,
                              max_depth=1)
gbm.fit(x, y, QuantileLoss(alpha=0.9))
pred = gbm.predict(x)

# scikit-learn GBM
sk_gbm = GradientBoostingRegressor(n_estimators=10,
                                   learning_rate=0.5,
                                   max_depth=1,
                                   loss='quantile', alpha=0.9)
sk_gbm.fit(x, y)
sk_pred = sk_gbm.predict(x)

print_model_loss_scores(QuantileLoss(alpha=0.9), y, pred, sk_pred)

From Scratch Loss = 0.1853
Scikit-Learn Loss = 0.1856

Figure showing scatterplot of data and model prediction of 0.9 quantile of y given x
Binary Cross Entropy Loss

-

The previous losses are useful for regression problems, where the target is numeric. But we can also solve classification problems, simply by swapping in an appropriate loss function. Here we’ll implement binary cross entropy, a.k.a. binary deviance, a.k.a. negative binomial log likelihood (sometimes abusively called log loss). One thing to remember is that, as with logistic regression, our model is actually predicting the log odds ratio, not the probability of the positive class. Thus we use expit transformations (the inverse of logit) whenever probabilities are needed, e.g., when predicting the probability that an observation belongs to the positive class.

-
-
# make categorical test data

def expit(t):
    return np.exp(t) / (1 + np.exp(t))

x = np.linspace(-3, 3, 500)
p = expit(x)
y = rng.binomial(1, p, size=p.shape)
x = x.reshape(-1,1)

# from scratch GBM
class BinaryCrossEntropyLoss():
    '''Binary Cross Entropy Loss
    
    Note that the predictions should be log odds ratios.
    '''
    
    def __init__(self):
        self.expit = lambda t: np.exp(t) / (1 + np.exp(t))
    
    def loss(self, y, preds):
        p = self.expit(preds)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    
    def negative_gradient(self, y, preds):
        p = self.expit(preds)
        return y / p - (1 - y) / (1 - p)

    
gbm = GradientBoostingMachine(n_trees=10,
                              learning_rate=0.5,
                              max_depth=1)
gbm.fit(x, y, BinaryCrossEntropyLoss())
pred = expit(gbm.predict(x))

# scikit-learn GBM
sk_gbm = GradientBoostingClassifier(n_estimators=10,
                                    learning_rate=0.5,
                                    max_depth=1,
                                    loss='log_loss')
sk_gbm.fit(x, y)
sk_pred = sk_gbm.predict_proba(x)[:, 1]

print_model_loss_scores(BinaryCrossEntropyLoss(), y, pred, sk_pred)

From Scratch Loss = 0.6379
Scikit-Learn Loss = 0.6403

Figure showing data and model prediction of probability that y equals one given x

Wrapping Up

-

Woohoo! We did it! We finally made it through Friedman’s paper in its entirety, and we implemented the generic gradient boosting algorithm which works with any differentiable loss function. If you made it this far, great job, gold star! By now you hopefully have a pretty solid grasp on gradient boosting, which is good, because soon we’re going to dive into the modern Newton descent gradient boosting frameworks like XGBoost. Onward!

-
-
-

References

-

Friedman’s 2001 paper: Greedy Function Approximation: A Gradient Boosting Machine

-
- - ]]>
- python - gradient boosting - from scratch - https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/index.html - Fri, 22 Oct 2021 21:00:00 GMT - -
- - Hello PySpark! - Matt Bowers - https://randomrealizations.com/posts/hello-pyspark/index.html - -
-

-
A big day at Playa Guiones
-
- -

Well, you guessed it: it’s time for us to learn PySpark!

-

I know, I know, I can hear you screaming into your pillow. Indeed we just spent all that time converting from R and learning python and why the hell do we need yet another API for working with dataframes?

-

That’s a totally fair question.

-

So what happens when we’re working on something in the real world, where datasets get large in a hurry, and we suddenly have a dataframe that no longer fits into memory? We need a way for our computations and datasets to scale across multiple nodes in a distributed system without having to get too fussy about all the distributed compute details.

-

Enter PySpark.

-

I think it’s fair to think of PySpark as a python package for working with arbitrarily large dataframes, i.e., it’s like pandas but scalable. It’s built on top of Apache Spark, a unified analytics engine for large-scale data processing. PySpark is essentially a way to access the functionality of spark via python code. While there are other high-level interfaces to Spark (such as Java, Scala, and R), for data scientists who are already working extensively with python, PySpark will be the natural interface of choice. PySpark also has great integration with SQL, and it has a companion machine learning library called MLlib that’s more or less a scalable scikit-learn (maybe we can cover it in a future post).

-

So, here’s the plan. First we’re going to get set up to run PySpark locally in a jupyter notebook on our laptop. This is my preferred environment for interactively playing with PySpark and learning the ropes. Then we’re going to get up and running in PySpark as quickly as possible by reviewing the most essential functionality for working with dataframes and comparing it to how we would do things in pandas. Once we’re comfortable running PySpark on the laptop, it’s going to be much easier to jump onto a distributed cluster and run PySpark at scale.

-

Let’s do this.

-
-

How to Run PySpark in a Jupyter Notebook on Your Laptop

-

Ok, I’m going to walk us through how to get things installed on a Mac or Linux machine where we’re using homebrew and conda to manage virtual environments. If you have a different setup, your favorite search engine will help you get PySpark set up locally.

-
-
-
- -
-
-Note -
-
-
-

It’s possible for Homebrew and Anaconda to interfere with one another. The simple rule of thumb is that whenever you want to use the brew command, first deactivate your conda environment by running conda deactivate. See this Stack Overflow question for more details.

-
-
-
-

Install Spark

-

Install Spark with homebrew.

-

+

Quantile Loss

+

Quantile loss yields estimates of a given quantile of the target variable conditioned on the features. Here’s my implementation.

+
+
brew install apache-spark

Next we need to set up a SPARK_HOME environment variable in the shell. Check where Spark is installed.

-
brew info apache-spark

You should see something like

-
==> apache-spark: stable 3.3.2 (bottled), HEAD
-Engine for large-scale data processing
-https://spark.apache.org/
-/opt/homebrew/Cellar/apache-spark/3.3.2 (1,453 files, 320.9MB) *
-...
-

Set the SPARK_HOME environment variable to your spark installation path with /libexec appended to the end. To do this I added the following line to my .zshrc file.

-
export SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.3.2/libexec

Restart your shell, and test the installation by starting the Spark shell.

-
spark-shell
-
...
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/  '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
-      /_/
-         
-Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 19.0.2)
-Type in expressions to have them evaluated.
-Type :help for more information.
-
-scala> 
-

If you get the scala> prompt, then you’ve successfully installed Spark on your laptop!

-
-
-

Install PySpark

-

Use conda to install the PySpark python package. As usual, it’s advisable to do this in a new virtual environment.

-
$ conda install pyspark
-

You should be able to launch an interactive PySpark REPL by saying pyspark.

-
$ pyspark
-...
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/  '_/
-   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
-      /_/
-
-Using Python version 3.8.3 (default, Jul  2 2020 11:26:31)
-Spark context Web UI available at http://192.168.100.47:4041
-Spark context available as 'sc' (master = local[*], app id = local-1624127229929).
-SparkSession available as 'spark'.
->>> 
-

This time we get a familiar python >>> prompt. This is an interactive shell where we can easily experiment with PySpark. Feel free to run the example code in this post here in the PySpark shell, or, if you prefer a notebook, read on and we’ll get set up to run PySpark in a jupyter notebook.

-
-
-
- -
-
-Note -
-
-
-

When I tried following this setup on a new Mac, I hit an error about being unable to find the Java Runtime. This stack overflow question lead me to the fix.

-
1)">
-
-
-

The Spark Session Object

-

You may have noticed that when we launched that PySpark interactive shell, it told us that something called SparkSession was available as 'spark'. So basically, what’s happening here is that when we launch the pyspark shell, it instantiates an object called spark which is an instance of class pyspark.sql.session.SparkSession. The spark session object is going to be our entry point for all kinds of PySpark functionality, i.e., we’re going to be saying things like spark.this() and spark.that() to make stuff happen.

-

The PySpark interactive shell is kind enough to instantiate one of these spark session objects for us automatically. However, when we’re using another interface to PySpark (like say a jupyter notebook running a python kernal), we’ll have to make a spark session object for ourselves.

-
-
-

Create a PySpark Session in a Jupyter Notebook

-

There are a few ways to run PySpark in jupyter which you can read about here.

-

For derping around with PySpark on your laptop, I think the best way is to instantiate a spark session from a jupyter notebook running on a regular python kernel. The method we’ll use involves running a standard jupyter notebook session with a python kernel and using the findspark package to initialize the spark session. So, first install the findspark package.

-

+

conda install -c conda-forge findspark
-

Launch jupyter as usual.

-
jupyter notebook

Go ahead and fire up a new notebook using a regular python 3 kernel. Once you land inside the notebook, there are a couple things we need to do to get a spark session instantiated. You can think of this as boilerplate code that we need to run in the first cell of a notebook where we’re going to use PySpark.

-
-
import pyspark
import findspark
from pyspark.sql import SparkSession

findspark.init()
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
-
-

First we’re running findspark’s init() method to find our Spark installation. If you run into errors here, make sure you got the SPARK_HOME environment variable correctly set in the install instructions above. Then we instantiate a spark session as spark. Once you run this, you’re ready to rock and roll with PySpark in your jupyter notebook.

-
-
-
- -
-
-Note -
-
-
-

Spark provides a handy web UI that you can use for monitoring and debugging. Once you instantiate the spark session You can open the UI in your web browser at http://localhost:4040/jobs/.

-
-
-
-
-
-

PySpark Concepts

-

PySpark provides two main abstractions for data: the RDD and the dataframe. RDD’s are just a distributed list of objects; we won’t go into details about them in this post. For us, the key object in PySpark is the dataframe.

-

While PySpark dataframes expose much of the functionality you would expect from a library for tabular data manipulation, they behave a little differently from pandas dataframes, both syntactically and under-the-hood. There are a couple of key concepts that will help explain these idiosyncracies.

-

Immutability - Pyspark RDD’s and dataframes are immutable. This means that if you change an object, e.g. by adding a column to a dataframe, PySpark returns a reference to a new dataframe; it does not modify the existing dataframe. This is kind of nice, because we don’t have to worry about that whole view versus copy nonsense that happens in pandas.
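Here’s a quick toy demonstration of that point (this snippet is mine, not part of the original walkthrough, and assumes the spark session we created above):

# adding a column returns a brand new dataframe; the original is untouched
toy_df = spark.createDataFrame([[1, 'a'], [2, 'b']], schema=['n', 'letter'])
toy_df2 = toy_df.withColumn('n_plus_one', toy_df.n + 1)
print(toy_df.columns)   # ['n', 'letter']
print(toy_df2.columns)  # ['n', 'letter', 'n_plus_one']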

-

Lazy Evaluation - Lazy evaluation means that when we start manipulating a dataframe, PySpark won’t actually perform any of the computations until we explicitly ask for the result. This is nice because it potentially allows PySpark to do fancy optimizations before executing a sequence of operations. It’s also confusing at first, because PySpark will seem to blaze through complex operations and then take forever to print a few rows of the dataframe.
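And a similarly tiny illustration of lazy evaluation (again my own toy example, assuming the spark session from above):

# chaining transformations returns immediately without doing any work...
lazy_df = (spark.createDataFrame([[1, 'a'], [2, 'b']], schema=['n', 'letter'])
           .filter('n > 1'))
# ...computation only happens when we call an action like show(), count(), or toPandas()
lazy_df.show()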

-
-
-

PySpark Dataframe Essentials

-
-

Creating a PySpark dataframe with createDataFrame()

-

The first thing we’ll need is a way to make dataframes. createDataFrame() allows us to create PySpark dataframes from python objects like nested lists or pandas dataframes. Notice that createDataFrame() is a method of the spark session class, so we’ll call it from our spark session sparkby saying spark.createDataFrame().

-
-
# create pyspark dataframe from nested lists
my_df = spark.createDataFrame(
    data=[
        [2022, "tiger"],
        [2023, "rabbit"],
        [2024, "dragon"]
    ],
    schema=['year', 'animal']
)
-
-

Let’s read the seaborn tips dataset into a pandas dataframe and then use it to create a PySpark dataframe.

-
-
import pandas as pd

# load tips dataset into a pandas dataframe
pandas_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# create pyspark dataframe from a pandas dataframe
pyspark_df = spark.createDataFrame(pandas_df)
-
-
-
-
- -
-
-Note -
-
-
-

In real life when we’re running PySpark on a large-scale distributed system, we would not generally want to use python lists or pandas dataframes to load data into PySpark. Ideally we would want to read data directly from where it is stored on HDFS, e.g. by reading parquet files, or by querying directly from a hive database using spark sql.

-
-
-
-
-

Peeking at a dataframe’s contents

-

The default print method for the PySpark dataframe will just give you the schema.

-
-
pyspark_df
-
-
DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]
-
-
-

If we want to peek at some of the data, we’ll need to use the show() method, which is analogous to the pandas head(). Remember that show() will cause PySpark to execute any operations that it’s been lazily waiting to evaluate, so sometimes it can take a while to run.

-
-
# show the first few rows of the dataframe
pyspark_df.show(5)
-
-
+----------+----+------+------+---+------+----+
-|total_bill| tip|   sex|smoker|day|  time|size|
-+----------+----+------+------+---+------+----+
-|     16.99|1.01|Female|    No|Sun|Dinner|   2|
-|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
-|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
-|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
-|     24.59|3.61|Female|    No|Sun|Dinner|   4|
-+----------+----+------+------+---+------+----+
-only showing top 5 rows
-
-
-
-

-
-
-

We thus encounter our first rude awakening. PySpark’s default representation of dataframes in the notebook isn’t as pretty as that of pandas. But no one ever said it would be pretty, they just said it would be scalable.

-

You can also use the printSchema() method for a nice vertical representation of the schema.

-
-
# show the dataframe schema
pyspark_df.printSchema()
-
-
root
- |-- total_bill: double (nullable = true)
- |-- tip: double (nullable = true)
- |-- sex: string (nullable = true)
- |-- smoker: string (nullable = true)
- |-- day: string (nullable = true)
- |-- time: string (nullable = true)
- |-- size: long (nullable = true)
-
-
-
-
-
-

Select columns by name

-

You can select specific columns from a dataframe using the select() method. You can pass either a list of names, or pass names as arguments.

-
-
# select some of the columns
pyspark_df.select('total_bill', 'tip')

# select columns in a list
pyspark_df.select(['day', 'time', 'total_bill'])
-
-
-

Filter rows based on column values

-

Analogous to the WHERE clause in SQL, and the query() method in pandas, PySpark provides a filter() method which returns only the rows that meet the specified conditions. Its argument is a string specifying the condition to be met for rows to be included in the result. You specify the condition as an expression involving the column names and comparison operators like <, >, <=, >=, == (equal), and != (not equal). You can specify compound expressions using and and or, and you can even do a SQL-like in to check if the column value matches any items in a list.

-
-

+
# compare a column to a value
pyspark_df.filter('total_bill > 20')

# compare two columns with arithmetic
pyspark_df.filter('tip > 0.15 * total_bill')

# check equality with a string value
pyspark_df.filter('sex == "Male"')

# check equality with any of several possible values
pyspark_df.filter('day in ("Sat", "Sun")')

# use "and" 
pyspark_df.filter('day == "Fri" and time == "Lunch"')
-

If you’re into boolean indexing with the brackets, PySpark does support that too, but I encourage you to use filter() instead. Check out my rant about why you shouldn’t use boolean indexing for the details. The TLDR is that filter() requires less typing, makes your code more readable and portable, and it allows you to chain method calls together using dot chains.

-

Here’s the boolean indexing equivalent of the last example from above.

-
-

+
+


+
+
# using boolean indexing
pyspark_df[(pyspark_df.day == 'Fri') & (pyspark_df.time == 'Lunch')]
-
-

I know, it looks horrendous, but not as horrendous as the error message you’ll get if you forget the parentheses.

-
-
-

Add new columns to a dataframe

-

You can add new columns which are functions of the existing columns with the withColumn() method.

-
-
import pyspark.sql.functions as f

# add a new column using col() to reference other columns
pyspark_df.withColumn('tip_percent', f.col('tip') / f.col('total_bill'))
-
-

Notice that we’ve imported the pyspark.sql.functions module. This module contains lots of useful functions that we’ll be using all over the place, so it’s probably a good idea to go ahead and import it whenever you’re using PySpark. BTW, it seems like folks usually import this module as f or F. In this example we’re using the col() function, which allows us to refer to columns in our dataframe using string representations of the column names.

-

You could also achieve the same result using the dot to reference the other columns, but this requires us to type the dataframe name over and over again, which makes it harder to reuse this code on different dataframes or in dot chains.

-
-
# add a new column using the dot to reference other columns (less recommended)
pyspark_df.withColumn('tip_percent', pyspark_df.tip / pyspark_df.total_bill)
-
-

If you want to apply numerical transformations like exponents or logs, use the built-in functions in the pyspark.sql.functions module.

-
-
# log 
pyspark_df.withColumn('log_bill', f.log(f.col('total_bill')))

# exponent
pyspark_df.withColumn('bill_squared', f.pow(f.col('total_bill'), 2))
-
-

You can implement conditional assignment like SQL’s CASE WHEN construct using the when() function and the otherwise() method.

-
-
# conditional assignment (like CASE WHEN)
pyspark_df.withColumn('is_male', f.when(f.col('sex') == 'Male', True).otherwise(False))

# using multiple when conditions and values
pyspark_df.withColumn('bill_size', 
    f.when(f.col('total_bill') < 10, 'small')
    .when(f.col('total_bill') < 20, 'medium')
    .otherwise('large')
)
-
-

Remember that since PySpark dataframes are immutable, calling withColumns() on a dataframe returns a new dataframe. If you want to persist the result, you’ll need to make an assignment.

-
pyspark_df = pyspark_df.withColumns(...)
-
-
-

Group by and aggregate

-

PySpark provides a groupBy() method similar to the pandas groupby(). Just like in pandas, we can call methods like count() and mean() on our grouped dataframe, and we also have a more flexible agg() method that allows us to specify column-aggregation mappings.

-
-

# group by and count
pyspark_df.groupBy('time').count().show()
-
-
+------+-----+
-|  time|count|
-+------+-----+
-|Dinner|  176|
-| Lunch|   68|
-+------+-----+
-
-
-
-
-

# group by and specify column-aggregation mappings with agg()
pyspark_df.groupBy('time').agg({'total_bill': 'mean', 'tip': 'max'}).show()
-
-
+------+--------+------------------+
-|  time|max(tip)|   avg(total_bill)|
-+------+--------+------------------+
-|Dinner|    10.0| 20.79715909090909|
-| Lunch|     6.7|17.168676470588235|
-+------+--------+------------------+
-
-
-
-

If you want to get fancier with your aggregations, it might just be easier to express them using hive syntax. Read on to find out how.

-
-
-

Run Hive SQL on dataframes

-

One of the mind-blowing features of PySpark is that it allows you to write hive SQL queries on your dataframes. To take a PySpark dataframe into the SQL world, use the createOrReplaceTempView() method. This method takes one string argument which will be the dataframes name in the SQL world. Then you can use spark.sql() to run a query. The result is returned as a PySpark dataframe.

-
-

# put pyspark dataframe in SQL world and query it
pyspark_df.createOrReplaceTempView('tips')
spark.sql('select * from tips').show(5)
-
-
+----------+----+------+------+---+------+----+
-|total_bill| tip|   sex|smoker|day|  time|size|
-+----------+----+------+------+---+------+----+
-|     16.99|1.01|Female|    No|Sun|Dinner|   2|
-|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
-|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
-|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
-|     24.59|3.61|Female|    No|Sun|Dinner|   4|
-+----------+----+------+------+---+------+----+
-only showing top 5 rows
-
-
-
-

This is awesome for a couple of reasons. First, it allows us to easily express any transformations in hive syntax. If you’re like me and you’ve already been using hive, this will dramatically reduce the PySpark learning curve, because when in doubt, you can always bump a dataframe into the SQL world and simply use hive to do what you need. Second, if you have a hive deployment, PySpark’s SQL world also has access to all of your hive tables. This means you can write queries involving both hive tables and your PySpark dataframes. It also means you can run hive commands, like inserting into a table, directly from PySpark.
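For instance, something along these lines (the database and table names here are made up for illustration) lets you persist a query result as a hive table straight from PySpark:

# write a query result out to a hive table directly from pyspark (illustrative names;
# assumes a hive metastore and a database called some_database exist)
spark.sql("""
    create table if not exists some_database.tip_summary as
    select time, avg(tip) as avg_tip from tips group by time
""")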

-

Let’s do some aggregations that might be a little trickier to do using the PySpark built-in functions.

-
-

# run hive query and save result to dataframe
tip_stats_by_time = spark.sql("""
    select
        time
        , count(*) as n 
        , avg(tip) as avg_tip
        , percentile_approx(tip, 0.5) as med_tip
        , avg(case when tip > 3 then 1 else 0 end) as pct_tip_gt_3
    from 
        tips
    group by 1
""")

tip_stats_by_time.show()
-
-
+------+---+------------------+-------+-------------------+
-|  time|  n|           avg_tip|med_tip|       pct_tip_gt_3|
-+------+---+------------------+-------+-------------------+
-|Dinner|176| 3.102670454545455|    3.0|0.44886363636363635|
-| Lunch| 68|2.7280882352941176|    2.2|0.27941176470588236|
-+------+---+------------------+-------+-------------------+
-
-
+font-style: inherit;">= expit(gbm.predict(x))
-
-
-
-

Visualization with PySpark

-

There aren’t any tools for visualization included in PySpark. But that’s no problem, because we can just use the toPandas() method on a PySpark dataframe to pull data back into pandas. Once we have a pandas dataframe, we can happily build visualizations as usual. Of course, if your PySpark dataframe is huge, you wouldn’t want to use toPandas() directly, because PySpark will attempt to read the entire contents of its huge dataframe into memory. Instead, it’s best to use PySpark to generate aggregations of your data for plotting or to pull only a sample of your full data into pandas.

-
-

+
# read aggregated pyspark dataframe into pandas for plotting
plot_pdf = tip_stats_by_time.toPandas()
plot_pdf.plot.bar(x='time', y=['avg_tip', 'med_tip']);
-

Figure showing a bar plot of average and median tips by time

+


+

Wrapping Up

-

So that’s a wrap on our crash course in working with PySpark. You now have a good idea of what pyspark is and how to get started manipulating dataframes with it. Stay tuned for a future post on PySpark’s companion ML library MLlib. In the meantime, may no dataframe be too large for you ever again.

+


]]>
python - PySpark - tutorial - https://randomrealizations.com/posts/hello-pyspark/index.html - Mon, 21 Jun 2021 21:00:00 GMT - + gradient boosting + from scratch + https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/index.html + Fri, 22 Oct 2021 22:00:00 GMT +
diff --git a/archive.html b/archive.html index afa80e1..f7afd72 100644 --- a/archive.html +++ b/archive.html @@ -168,7 +168,7 @@
Subscribe
- + @@ -202,7 +202,26 @@

Archive

-
+
+ + +
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+

diff --git a/archive.xml b/archive.xml index f1eeab2..5119f35 100644 --- a/archive.xml +++ b/archive.xml @@ -10,3505 +10,4035 @@ A blog about data science, statistics, machine learning, and the scientific method quarto-1.3.433 -Tue, 05 Sep 2023 21:00:00 GMT +Tue, 24 Oct 2023 22:00:00 GMT - Blogging with Quarto and Jupyter: The Complete Guide + XGBoost for Regression in Python Matt Bowers - https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/index.html + https://randomrealizations.com/posts/xgboost-for-regression-in-python/index.html -

Ahh, blogging. I think we can all agree it’s probably one of the greatest forms of written communication to have ever existed.

-

What’s that you say? You’d like to set up your own blog? And you say you want to use a dead simple, data science friendly tech stack? And you wouldn’t be caught dead handing over your painstakingly crafted content to Medium? No worries, friend, I know exactly what you need.

-

Enter Quarto.

-

In this post we’ll set up a blog using a lightweight tech stack consisting of a terminal running quarto, git, and jupyter, and we’ll use Github Pages to host our website for free. Optionally, for a few dollars a year, we can even host our website at our own custom domain.

-

A quick note on how to use this post. Quarto’s documentation on blogging provides a nice high-level overview of the blogging workflow, and I refer to it and many other bits of Quarto documentation here. At the time of writing, the handful of other blog posts about setting up quarto blogs are aimed at the RStudio user. This post exists to provide a jupyter and python-centric path for you to follow through the entire setup of your new quarto blog, and to impart my opinionated recommendations about best practices.

-

Let’s get into it!

-
-

What is Quarto?

-

Quarto is a way to render plain text source files containing markdown and code in python, R, and other languages into published formats like websites, books, slides, journal articles, etc. There is clearly a lot that we can do with it, but today, we’ll use it to make a nice looking blog out of some jupyter notebook files.

-

Quarto follows the familiar convention of using a project directory to house all material for a given project. The directory will include source files like jupyter notebooks or Rmarkdown files, as well as configuration files that control how output files are rendered. We can then use the quarto command line utility to perform actions like previewing and rendering within the project directory.

-
-
-

Instantiate your blog

-
-

Create a new Quarto project

-

After installing quarto fire up a new terminal and check that the install was successful by running

-
ds-templates repo on GitHub, so go ahead and download the notebook and follow along with your own data.

+

If you’re not already comfortable with the ideas behind gradient boosting and XGBoost, you’ll find it helpful to read some of my previous posts to get up to speed. I’d start with this introduction to gradient boosting, and then read this explanation of how XGBoost works.

+

Let’s get into it! 🚀

+
+

Install and import the xgboost library

+

If you don’t already have it, go ahead and use conda to install the xgboost library, e.g.

+
$ conda install -c conda-forge xgboost
+

Then import it along with the usual suspects.

+
+
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb


quarto --version

Now think of a name for your blog’s project directory; this will also be the name of its git repository. The name will have no effect on your website’s name or URL, so don’t think too hard. The quarto documentation calls it myblog, so we’ll one-up them and call ours pirate-ninja-blog. Run the following command to create it in the current directory.

quarto create-project pirate-ninja-blog --type website:blog

That command creates a directory called pirate-ninja-blog containing everything you need to render your new blog. You can preview your website by running

quarto preview pirate-ninja-blog

Your local website will open in a new browser window. As you edit various aspects of your blog, the preview will update with your changes. This preview feature is so simple and so great.

Previewing your blog with quarto preview command

-
-

Set up a git repo

-

Change into your project directory and we’ll start setting up your git repo.

-

+

Read dataset into python

+

In this example we’ll work on the Kagle Bluebook for Bulldozers competition, which asks us to build a regression model to predict the sale price of heavy equipment. Amazingly, you can solve your own regression problem by swapping this data out with your organization’s data before proceeding with the tutorial.

+

Go ahead and download the Train.zip file from Kaggle and extract it into Train.csv. Then read the data into a pandas dataframe.

+
+
df = pd.read_csv('Train.csv', parse_dates=['saledate']);

cd pirate-ninja-blog

initialize a new git repo.

git init -b main

The _site/ directory is where quarto puts the rendered output files, so you’ll want to ignore it in git. I also like to just ignore any hidden files too, so add the following to your .gitignore file.

.gitignore
/.quarto/
/_site/
.*

df.info()
<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 401125 entries, 0 to 401124
+Data columns (total 53 columns):
+ #   Column                    Non-Null Count   Dtype         
+---  ------                    --------------   -----         
+ 0   SalesID                   401125 non-null  int64         
+ 1   SalePrice                 401125 non-null  int64         
+ 2   MachineID                 401125 non-null  int64         
+ 3   ModelID                   401125 non-null  int64         
+ 4   datasource                401125 non-null  int64         
+ 5   auctioneerID              380989 non-null  float64       
+ 6   YearMade                  401125 non-null  int64         
+ 7   MachineHoursCurrentMeter  142765 non-null  float64       
+ 8   UsageBand                 69639 non-null   object        
+ 9   saledate                  401125 non-null  datetime64[ns]
+ 10  fiModelDesc               401125 non-null  object        
+ 11  fiBaseModel               401125 non-null  object        
+ 12  fiSecondaryDesc           263934 non-null  object        
+ 13  fiModelSeries             56908 non-null   object        
+ 14  fiModelDescriptor         71919 non-null   object        
+ 15  ProductSize               190350 non-null  object        
+ 16  fiProductClassDesc        401125 non-null  object        
+ 17  state                     401125 non-null  object        
+ 18  ProductGroup              401125 non-null  object        
+ 19  ProductGroupDesc          401125 non-null  object        
+ 20  Drive_System              104361 non-null  object        
+ 21  Enclosure                 400800 non-null  object        
+ 22  Forks                     192077 non-null  object        
+ 23  Pad_Type                  79134 non-null   object        
+ 24  Ride_Control              148606 non-null  object        
+ 25  Stick                     79134 non-null   object        
+ 26  Transmission              183230 non-null  object        
+ 27  Turbocharged              79134 non-null   object        
+ 28  Blade_Extension           25219 non-null   object        
+ 29  Blade_Width               25219 non-null   object        
+ 30  Enclosure_Type            25219 non-null   object        
+ 31  Engine_Horsepower         25219 non-null   object        
+ 32  Hydraulics                320570 non-null  object        
+ 33  Pushblock                 25219 non-null   object        
+ 34  Ripper                    104137 non-null  object        
+ 35  Scarifier                 25230 non-null   object        
+ 36  Tip_Control               25219 non-null   object        
+ 37  Tire_Size                 94718 non-null   object        
+ 38  Coupler                   213952 non-null  object        
+ 39  Coupler_System            43458 non-null   object        
+ 40  Grouser_Tracks            43362 non-null   object        
+ 41  Hydraulics_Flow           43362 non-null   object        
+ 42  Track_Type                99153 non-null   object        
+ 43  Undercarriage_Pad_Width   99872 non-null   object        
+ 44  Stick_Length              99218 non-null   object        
+ 45  Thumb                     99288 non-null   object        
+ 46  Pattern_Changer           99218 non-null   object        
+ 47  Grouser_Type              99153 non-null   object        
+ 48  Backhoe_Mounting          78672 non-null   object        
+ 49  Blade_Type                79833 non-null   object        
+ 50  Travel_Controls           79834 non-null   object        
+ 51  Differential_Type         69411 non-null   object        
+ 52  Steering_Controls         69369 non-null   object        
+dtypes: datetime64[ns](1), float64(2), int64(6), object(44)
+memory usage: 162.2+ MB

Prepare raw data for XGBoost

+

When faced with a new tabular dataset for modeling, we have two format considerations: data types and missingness. From the call to df.info() above, we can see we have both mixed types and missing values.

+

When it comes to missing values, some models like the gradient booster or random forest in scikit-learn require purely non-missing inputs. One of the great strengths of XGBoost is that it relaxes this requirement, allowing us to pass in missing feature values, so we don’t have to worry about them.

+
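As a tiny illustration (my own sketch, not from the original post), a DMatrix built from data containing NaN works as-is; XGBoost treats NaN as missing by default and learns a default split direction for those rows:

import numpy as np
import pandas as pd
import xgboost as xgb

X_toy = pd.DataFrame({'x1': [1.0, np.nan, 3.0], 'x2': [0.5, 1.5, np.nan]})
y_toy = pd.Series([1.0, 2.0, 3.0])
dm_toy = xgb.DMatrix(X_toy, label=y_toy)  # no imputation required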

Regarding data types, all ML models for tabular data require inputs to be numeric, either integers or floats, so we’re going to have to deal with those object columns.

+
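A quick way to see which columns those are (a one-off sketch of my own, not part of the original flow):

df.select_dtypes(include='object').columns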
+

Encode string features

+

The simplest way to encode string variables is to map each unique string value to an integer; this is called integer encoding.

+
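As a toy illustration of integer encoding (my own sketch, not from the post):

import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'high'])
s.astype('category').cat.codes  # maps each unique string value to an integer code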

We have a couple of options for how to implement this transformation: pandas categoricals or the scikit-learn label encoder. We can use the categorical type in pandas to generate mappings from string values to integers for each string feature. The category type is a bit like the factor type in R. Pandas stores the underlying data as integers, and it also keeps a mapping from the integers to the string values. XGBoost will be able to access the integers for model fitting. This is nice because we can still access the actual categories which can be helpful when we start taking a closer look at the data. If you prefer, you can also use the scikit-learn label encoder to replace the string columns with their integer-mapped counterparts.

+
+
def encode_string_features(df, use_cats=True):
    out_df = df.copy()
    for feature, feature_type in df.dtypes.items():
        if feature_type == 'object':
            if use_cats:
                out_df[feature] = out_df[feature].astype('category')
            else:
                from sklearn.preprocessing import LabelEncoder
                out_df[feature] = LabelEncoder() \
                    .fit_transform(out_df[feature].astype('str'))
    return out_df

df = encode_string_features(df, use_cats=False)

For now we’ll just stage the .gitignore file for the initial commit. Eventually you’ll want to commit the other files in your project too, either now or later as you edit them.

-
git add .gitignore
git commit -m "Initial commit."
-

Then follow GitHub’s instructions to add the local repo to GitHub using git. Basically just create a new blank repo on GitHub’s website, copy the remote repository url, then add the remote repo url to your local git repo.

-
git remote add origin <REMOTE_URL>
-

Then you’ll be able to push any commits you make to your remote repository on GitHub by saying git push.

-
-
-
-

Understand the components of a Quarto blog

-
-

Contents of the quarto project directory

-

Let’s have a quick look at what quarto put inside of the project directory.

-
_quarto.yml
-about.qmd
-index.qmd
-profile.jpg
-posts
-styles.css
-_site
-
    -
  • Quarto uses yaml files to specify configurations. The _quarto.yml file specifies project-wide configurations.
  • Quarto’s markdown file type uses the extension qmd. Each qmd file will correspond to a page in our website. index.qmd is the homepage and about.qmd is the About page.
  • profile.jpg is an image that is included on the about page.
  • styles.css defines css styles for the website.
  • posts is a directory where we can put qmd and other documents which will be rendered into blog posts.
  • posts/_metadata.yml contains configurations that apply to all documents in the posts directory.
  • _site is a directory that contains the rendered website. Whereas all the other files and directories constitute the source code for our blog, _site is the rendered output, i.e. the website itself.
-

Let’s take a closer look at these components and start to make the blog yours.

-
-
-

Project-wide Configurations

-

The _quarto.yml file controls project-wide configurations, website options, and HTML document options. Options in this file are specified in yaml in a key/value structure with three top level keys: project, website, and format. The quarto website options documentation has the full list of options that you can set here. It will be very helpful to take a look at some example _quarto.yml files in the wild, such as the one from quarto.org or even the one from this blog.

-

Under the website key, go ahead and set the title and description for your blog.

-
website:
-  title: "Pirate Ninja Blog"
-  description: "A blog about pirates, ninjas, and other things"
-

You can also customize your navbar which is visible at the top of all pages on your site. Also go ahead and set your github and twitter urls for the icons in the navbar.

-

Under the format key, you can also try changing the HTML theme to one of the other 25 built-in themes.

-
-
-

The About Page

-

The about.qmd file defines an About page for the blog. Go ahead and fill in your details in the about.qmd file; you can also replace the profile.jpg file with your own image. Have a look at the quarto documentation on About pages to explore more functionality. Notably, you can change the template option to change the page layout.

-
-
-

The Homepage

-

The index.qmd file defines the landing page for your website. It is a listing page which shows links to all the pages in the posts directory. For now we don’t need to change anything here.

-
-
-

The posts/ directory

-

The posts directory contains all your blog posts. There aren’t really requirements for subdirectory structure inside the posts directory, but it’s a best practice to create a new subdirectory for each new blog post. This just helps keep auxiliary files like images or conda environment files organized. Out of the box, the posts directory looks like this.

-
posts
-├── _metadata.yml
-├── post-with-code
-│   ├── image.jpg
-│   └── index.qmd
-└── welcome
-    ├── index.qmd
-    └── thumbnail.jpg
-

There are two reasons we want to be deliberate about how we organize and name things in the posts directory. First, the vast majority of our blog’s content will live here, so we don’t want it to be a big confusing mess. Second, the directory structure and file naming will be reflected in the URLs to our blog posts; if you prefer tidy-looking URLs, and I know you do, then you want to use tidy directory and file names in the posts directory.

-

You can check how the URLs look by navigating to one of the pre-populated posts in the site preview in your browser. For instance, the welcome post’s URL would be

-
https://example.com/posts/welcome/
-

When quarto renders the qmd file at posts/welcome/index.qmd it creates an output document in the website at posts/welcome/index.html. In fact the full URL to the post is,

-
https://example.com/posts/welcome/index.html
-

but the browser knows if you give it a URL with a path ending in a /, then it should look for the index.html file inside that directory.

-

So I think the best practice here is to name your new post subdirectory with the title of the post in all lower case with dashes for spaces, e.g. post-with-code. Then to force all output pages to be called index.html, you can set the output-file key in the posts/_metadata.yml file like this.

-
-
-
posts/_metadata.yml
output-file: index.html
-

Note that alternative naming conventions are possible; notably you might want to prefix each post name with the date in yyyy-mm-dd format, so the post subdirectories sort temporally and look nice in a list. That’s the convention used in Quarto’s own blog at quarto.org. As long as you keep everything for a given post inside its subdirectory, you should be good to go with nice-looking URLs.

-
-
-
-

Authoring posts with jupyter

-
-

Creating a new post

-

It turns out that quarto will render not only .qmd files, but also .ipynb files in the posts directory. So let’s create a new blog post from a notebook.

-

I think it’s a best practice to write draft posts in their own git branches; that way, if you need to deploy some kind of hotfix to main while you’re drafting a post, you won’t have to deploy a half-written post living on the main branch. To start a new post, create a new development branch, change into the posts directory, create a new subdirectory with your preferred naming convention, change into that new directory, and fire up jupyter.

-
git checkout -b new-post
cd posts
mkdir new-post
cd new-post
jupyter notebook
-

Now create a new notebook from the jupyter UI. In order for quarto to recognize the document, the first cell of the notebook must be a raw text cell (press r in command mode to change a cell to raw text), and it must contain the document’s yaml front matter. You can use the following as a frontmatter template.

-
---
-title: New Post
-date: 2023-07-12
-description: A nice new post
-categories: [nonsense, code]
----
-

Now to preview your post, open a new terminal, change into your blog’s project directory and run the quarto preview command. You’ll see a link to the new post in the listing on the homepage. I usually like to have the preview open in a browser while I’m editing the jupyter notebook, just to make sure things look the way I want in the rendered output. From here you can keep editing the notebook, and the preview will update in the browser dynamically.

-
-
-

Markdown and code cells

-

From here you can put text in markdown cells and you can write code in code cells. Let’s add a markdown cell with some markdown formatting.

-
## A nice heading
-
-Here is some lovely text and an equation.
-
-$$ a^2 + b^2 = c^2 $$
-
-Here's a list.
-
-- a link to an [external website](https://quarto.org).
-- a link to [another post in this blog](/posts/welcome/index.qmd).
-

This markdown will be rendered into the HTML page for the post. The last line in the above cell demonstrates the best practice for using relative urls to link to other resources within your website. Instead of providing the full url in the parentheses, just give the path to the qmd or ipynb file that you want to link to. Note that paths need to start with the / at the root of the quarto project, since without it, quarto will try to resolve paths relative to the location of the current document instead of the root of the project.

-

Then create a code cell with some code. Try something like this.

-
print('Hello, Quarto!')
-

By default, both code and cell output will be rendered into the HTML output. So far our jupyter notebook looks like this.

-
-
-

-
View of a new post being written in jupyter notebook
-
-
-

Back in the browser window running your blog preview, you can see the rendered page of the new post.

-
-
-

-
View of the preview of the rendered post
-
-
-

Figures

-

Let’s add a figure to our post. Add a new code cell with the following code.

# | fig-cap: This is my lovely line plot
# | fig-alt: A line plot extending up and to the right

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y = 2 * x + 1
plt.plot(x, y);

Encode date and timestamp features

+

While dates feel sort of numeric, they are not numbers, so we need to transform them into numeric columns. Unfortunately, encoding timestamps isn’t as straightforward as encoding strings, so we actually might need to engage in a little bit of feature engineering. A single date has many different attributes, e.g. days since epoch, year, quarter, month, day, day of year, day of week, is holiday, etc. As a starting point, we can just add a few of these attributes as features. Once a feature is represented as a date or timestamp data type, you can access various attributes via the dt attribute.

+
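For example (a quick sketch of my own, not from the original post), once saledate is a datetime column its parts are available through the dt accessor:

df['saledate'].dt.year       # calendar year of each sale
df['saledate'].dt.quarter    # quarter 1 through 4
df['saledate'].dt.dayofweek  # Monday=0 through Sunday=6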
+
def encode_datetime_features(df, datetime_features, datetime_attributes):
    out_df = df.copy()
    for datetime_feature in datetime_features:
        for datetime_attribute in datetime_attributes:
            if datetime_attribute == 'days_since_epoch':
                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
                    (out_df[datetime_feature]
                     - pd.Timestamp(year=1970, month=1, day=1)).dt.days
            else:
                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
                    getattr(out_df[datetime_feature].dt, datetime_attribute)
    return out_df

datetime_features = [
    'saledate',
]
datetime_attributes = [
    'year',
    'month',
    'day',
    'quarter',
    'day_of_year',
    'day_of_week',
    'days_since_epoch',
]

df = encode_datetime_features(df, datetime_features, datetime_attributes)
-

Notice a couple of important details. First I placed a semicolon at the end of the last line. That suppresses the [<matplotlib.lines.Line2D at 0x1111d00a0>] text output, which would otherwise show up in your blog post too.

-

Second, I added a couple of special comments at the top of the cell. Quarto allows you to specify numerous code execution options, designated by the # | prefix, to control the behavior and appearance of the code and output at a cell level. I set two keys here, fig-cap and fig-alt which respectively set the figure caption text and the image alt tag text. The fig-alt key is particularly important to set on all your figures because it provides the non-visual description for screenreader users reading your post. The alt tag should be a simple description of what the plot is and possibly what it shows or means. Be a friend of the blind and visually impaired community and set fig-alt on all of your figures.

-
-
-

Version control

-

As you edit your new post, go ahead and commit your changes on your development branch. Once you’ve finished your new post, you can merge it into main like this.

-
git checkout main
git merge new-post
-

Then you can push to GitHub by running git push. You should also be sure to run a final quarto preview to check that everything looks good before publishing to the web.

-
- -
-

Publishing your blog to the web

-
-

Hosting with GitHub Pages

-

It’s likely that the easiest (read best) option for you is to host your blog on GitHub Pages. This is because GitHub pages is free, and since you already have your blog’s source code checked into a remote repository at GitHub, it’s very easy to set up. Quarto’s documentation on publishing to GitHub Pages outlines three ways to publish your website, but I recommend their option 2, using the quarto publish command. Once you set up your gh-pages branch as described in the documentation, you simply run quarto publish at the command line and your updates are deployed to your website.

-
-
-

Setting up your domain name

-

By default, if you choose to host with GitHub Pages, your website will be published to a url in the form https://username.github.io/reponame/. You can certainly do this; for example Jake VanderPlas’s awesome blog Pythonic Perambulations lives at http://jakevdp.github.io.

-

But, like me, you might want to get your own custom domain by buying, or really renting, one from a registrar. I use Namecheap. If you decide to go for a custom domain, refer to GitHub’s documentation on custom domains. You’ll also need to point your domain registrar to the IP address where GitHub Pages is hosting your website. For an example of how to do this at Namecheap, see Namecheap’s documentation about GitHub Pages

-

Whether you decide to use the standard github.io domain or your own custom domain, be sure to set the site-url key in your _quarto.yml file to ensure other quarto functionality works correctly. For example

-
-
-
_quarto.yml
website:
  site-url: https://example.com/
-
-

Edit: I found that after upgrading to quarto 1.3, using quarto publish to publish from the gh-pages branch obliterates the CNAME file that is created when you set a custom domain in your repository settings > Pages > Custom Domain. That breaks the mapping from your custom domain to your published website. See this discussion thread for details. The fix is to manually create a CNAME file in the root of your project, and include it in the rendered website using the resources option under the project key in _quarto.yml. The CNAME file should just contain your custom domain, excluding any https://.

-
-
-
CNAME
example.com
-

With the CNAME file in the root of your quarto project, you can then include it in the rendered output.

-
-
-
_quarto.yml
project:
  resources:
    - CNAME
-
-

Keep in touch with your readers

-
-

RSS Feed

-

The RSS feed is handy for syndicating your posts to feed readers, other websites, and to your email subscribers. As described in quarto’s documentation on RSS feeds, you can automatically generate an RSS feed for your blog by first setting the value of site-url under the website key in _quarto.yml, and then setting feed: true under the listing key in the frontmatter of index.qmd. This will generate an RSS feed in the root of your website called index.xml. Once you have an RSS feed, go ahead and submit it to Python-Bloggers to have your work syndicated to a wider audience and to strengthen our little community of independent data science blogs.

-
-
-

Email Subscriptions

-

The idea here is to have a form field on your website where readers can input their email address to be added to your mailing list. Quarto’s documentation on subscriptions describes how to set up a subscribe box on your blog using MailChimp, so we won’t repeat it here. Once you have some subscribers, you can send them updates whenever you write a new post. You could do this manually or, in my case, set up an automation through MailChimp which uses your RSS feed to send out email updates to the list about new posts.

-
-
-

Comments

-

Quarto has built-in support for three different comment systems: hypothesis, utterances, and giscus. The good news is that these are all free to use, easy to set up, and AFAIK do not engage in any sketchy tracking activities. The bad news is that none of them are ideal because they all require the user to create an account and log in to leave a comment. We want to encourage readers to comment, so we don’t want them to have to create accounts or deal with passwords or pick all the squares with bicycles or any such nonsense, just to leave a little comment. To that end, I’ve actually been working on self-hosted login-free comments for this blog using isso, but it’s a bit more involved than these built-in solutions, so we’ll have to discuss it at length in a future post.

-

If you prefer an easy, out-of-the-box solution, I can recommend utterances, which uses GitHub issues to store comments for each post. I used utterances for comments on the first jekyll-based incarnation of this blog; you can still see the utterances comments on posts before this one. Go check out the Quarto documentation on comments to see how to set up utterances in your project.

-
-
-

Analytics

-

As a data enthusiast, you’ll likely enjoy collecting some data about page views and visitors to your site. You might be tempted to use Google Analytics to do this; indeed quarto makes it very easy to just add a line to your _quarto.yml file to set it up. Unfortunately, in this case, going with the easy and free solution means supporting Google’s dubious corporate surveillance activities. Be a conscientious internet citizen and avoid using Google Analytics on your blog. Fortunately, there are numerous privacy-friendly alternatives to Google Analytics. For this blog I’m self-hosting umami analytics, which might warrant its own post in the future.

-
-
-
-

More humbly suggested best practices

-
-

Using conda environments for reproducibility

-

As you know, it’s a good practice to use an environment manager to keep track of packages, their versions, and other dependencies for software in a data science project. The same applies to blog posts; especially if you’re using unusual or bleeding-edge packages in a post. This will help us out a lot when we have to go back and re-run a notebook a couple years later to regenerate the output. Here we’ll use conda as our environment manager.

-

To be clear, I don’t bother doing this if I’m just using fairly stable functionality in standard packages like pandas, numpy, and matplotlib, but we’ll do it here for illustration. From a terminal sitting inside our post subdirectory at posts/new-post, create a new conda environment with the packages you’re using in the post.

-
conda create -p ./venv jupyter numpy matplotlib
-

Note the -p flag which tells conda to save the environment to ./venv in the current working directory. This will save all the installed packages here in the post directory instead of in your system-wide location for conda environments. Note also that you’ll want to avoid checking anything in the venv directory into source control, so add venv to the .gitignore file at the root of the quarto project to ignore all venv directories throughout your quarto project.

-

Now whenever you work on this post, you’ll navigate to the post subdirectory with a terminal and activate the conda environment.

-
conda activate ./venv
-

Then you can fire up your jupyter notebook from the command line, and it will use the active conda environment.

-

Since we don’t want to check the venv directory with all its installed libraries into source control, we need to create an environment.yml file from which the environment can later be reproduced. With the local conda environment active, run the following.

-
conda env export --from-history > environment.yml
-

The --from-history flag tells conda to skip adding a bunch of system specific stuff that will gunk up your environment yaml file and make it harder to use for cross-platform reproducibility. This environment.yml file is the only environment management artifact that you need to check into git.

-

Later if you need to recreate the environment from the environment.yml file, you can use the following command.

-
conda env create -f environment.yml -p ./venv
-
-
-

Image file best practices

-

Let’s talk about image file sizes. The key idea is that we want images to have just enough resolution to look good; any more than that and we’re just dragging around larger-than-necessary files, wasting bandwidth, and slowing down page load times.

-

You can read all about choosing optimal image sizes, but the TLDR is that images should be just large enough (in pixels) to fill the containers they occupy on the page. In our quarto blog, the two most common kinds of images are inline images we put in the body of posts and image thumbnails that show up as the associated image for a post, e.g. in the listing on our homepage. The inline image container seems to be about 800 pixels wide in my browser and the thumbnails are smaller, so adding some margin of error, I decided to go for 1000x750 for inline images and 500x375 for the thumbnails.

-

I use a command line tool called Image Magick to resize image files. Go ahead and install image magick with homebrew, and let’s add some images to our new post.

-

For this example I’ll use a nice shot of the London Underground from Wikipedia. Save your image as image.jpg. Then use image magick to create two new resized images for inline and thumbnail use.

-
convert image.jpg -resize 1000x1000 main.jpg
convert image.jpg -resize 500x500 thumbnail.jpg
-

These commands do not change the aspect ratio of the image; they just reduce the size so that the image fits within the size specified.

-

Now move both of your new images into the post subdirectory at posts/new-post/. To specify the thumbnail image, set the image key in the post’s front matter. Be sure to also add an alt tag description of the image using the image-alt key to keep it accessible for screen reader users. Our post’s frontmatter now looks like this.

-
---
-title: New Post
-date: 2023-07-12
-description: A nice new post
-categories: [nonsense, code]
-image: thumbnail.jpg
-image-alt: "A London Underground train emerging from a tunnel"
----
-

To include an image within the body of a post, use markdown in the post to include the image. I added a markdown cell just under the front matter containing the following.

-
![A London Underground train emerging from a tunnel](main.jpg "")
-

In your preview browser window, you can see we have the thumbnail for our new post on the homepage listing.

-
-
-

-
A screenshot of the homepage showing the new post’s thumbnail image
-
-
-

And we also have the inline image appearing in the body of the post.

-
-
-

-
A screenshot of the new post showing the image included in the body of the post
-
-
-

You can take a look at the source code for this blog to see some examples of including images in posts.

-
-
-
-

SEO

-

SEO is a huge topic, but here we’ll just focus on a few fundamental technical aspects that we want to be sure to get right. This boils down to registering with the top search engines by market share and ensuring that we’re providing them with the information they need to properly index our pages.

-

I checked the top search engines by global market share and as of 2023 it looks like Google has about 85%, Bing has about 8%, and the others have 2% or less each. So let’s focus on setting our site up to work well with Google search and Bing to get over 90% coverage.

-
-

Google Search Console and Bing Webmaster Tools

-

Google Search Console is a tool for web admins to help analyze search traffic and identify any technical issues that might prevent pages from appearing or ranking well in search. Go ahead and set up an account and register your blog in search console. You can refer to Google’s documentation on search console to guide you through setup and configuration.

-

Once you get set up on GSC, you can also create an account for Bing Webmaster Tools. Do this after setting up GSC because there is an option to import your information from your GSC account.

-

Once you’re set up with GSC and BWT, you’ll get email alerts anytime they crawl your site and detect any indexing problems. When that happens, track down the issues and fix them so your pages can appear in organic searches.

-
-
-

Sitemap

-

A sitemap is an xml document that lists all the pages on your website. It’s a map for the search engine bots that crawl the web looking for new pages to index. Quarto will automatically generate a sitemap called sitemap.xml in the root of your website, as long as you’ve filled out the site-url key in _quarto.yml. You can submit your website for indexing by providing your sitemap in Google Search Console and Bing Webmaster Tools.

-
-
-
-

Wrapping Up

-

Boy howdy, that was a lot, but at this point you should have a fully functioning blog, built with a minimalist, data-science-friendly tech stack consisting of quarto, jupyter, and GitHub. If you do create a blog using quarto, drop a link to it in the comments, and we can all check it out and celebrate your creation!

-
- - ]]> - python - tutorial - blogging - https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/index.html - Tue, 05 Sep 2023 21:00:00 GMT - - - - Random Realizations Resurrected - Matt Bowers - https://randomrealizations.com/posts/random-realizations-resurrected/index.html - -
-

-
Christ the Redeemer towers into a vast blue Brazilian sky.
-
-
-

Well it’s been over a year since I posted anything here. You see, a lot has been going on here at the Random Realizations Remote Global Headquarters that has distracted from producing the high-quality data science content that you’re used to. Mostly I went on hiatus from work and started traveling, which turns out to be its own full-time job. I had aspirations of writing more after leaving work, but of course, after leaving, I couldn’t be bothered to sit down at my laptop and type stuff about data science to yall. After all, life is bigger than that.

-

When I finally felt like opening up my laptop, I was confronted with an email from the maintainers of fastpages, the open source content management system (CMS) I originally used to create this blog, notifying me that the project was being deprecated and that I would need to migrate my content to some other platform.

-

Boo.

-

That didn’t sound like much fun, so I spent another few months ignoring the blog. But eventually, dear reader, I decided it was time to roll up my sleeves and get this blog thriving once again.

-

Ok so fastpages was going to be deprecated, and I needed to find a new CMS. My requirements were pretty simple: I wanted to write the blog posts with jupyter notebook, and I wanted to host the site on my own domain. Helpfully, the former maintainers of fastpages recommended an alternative CMS called Quarto which I had never heard of. Apparently I had been living under a rock because Quarto appears to be all the rage. Quarto’s website says it’s an open-source scientific and technical publishing system. I think it’s fair to think of it as a way to render plain text or source code from languages like python, R, and julia into a variety of different published formats like websites, books, or journal articles. It was developed by the good folks over at RStudio, and the project has a pretty active following over on github, so I think it’s less likely to suddenly disappear like fastpages.

-

So anyway, I’ve been migrating my content over into this new quarto universe.

-

You may officially consider this blog resurrected from the dead, because this is the first new post published after the migration. The site has a bit of a new look and feel, so I hope you like it. Do let me know in the comments if you find anything amiss with the new website. Otherwise we’ll just assume it’s fabulous.

-

I’m working on a post about how to create a blog with quarto using jupyter and python, so you can too!

-

See you in more posts real soon! Love, Matt.

blogging
https://randomrealizations.com/posts/random-realizations-resurrected/index.html
Tue, 01 Aug 2023 21:00:00 GMT

XGBoost from Scratch
Matt Bowers
https://randomrealizations.com/posts/xgboost-from-scratch/index.html
-

-
A weathered tree reaches toward the sea at Playa Mal País
-
-
-

Well, dear reader, it’s that time again, time for us to do a seemingly unnecessary scratch build of a popular algorithm that most people would simply import from the library without a second thought. But readers of this blog are not most people. Of course you know that when we do scratch builds, it’s not for the hell of it, it’s for the purpose of demystification. To that end, today we are going to implement XGBoost from scratch in python, using only numpy and pandas.

-

Specifically we’re going to implement the core statistical learning algorithm of XGBoost, including most of the key hyperparameters and their functionality. Our implementation will also support user-defined custom objective functions, meaning that it can perform regression, classification, and whatever exotic learning tasks you can dream up, as long as you can write down a twice-differentiable objective function. We’ll refrain from implementing some simple features like column subsampling which will be left to you, gentle reader, as exercises. In terms of tree methods, we’re going to implement the exact tree-splitting algorithm, leaving the sparsity-aware method (used to handle missing feature values) and the approximate method (used for scalability) as exercises or maybe topics for future posts.

-

As always, if something is unclear, try backtracking through the previous posts on gradient boosting and decision trees to clarify your intuition. We’ve already built up all the statistical and computational background needed to make sense of this scratch build. Here are the most important prerequisite posts:

-
    -
  1. Gradient Boosting Machine from Scratch
  2. Decision Tree From Scratch
  3. How to Understand XGBoost
-

Great, let’s do this.

-
-

The XGBoost Model Class

-

We begin with the user-facing API for our model, a class called XGBoostModel which will implement gradient boosting and prediction. To be more consistent with the XGBoost library, we’ll pass hyperparameters to our model in a parameter dictionary, so our init method is going to pull relevant parameters out of the dictionary and set them as object attributes. Note the use of python’s defaultdict so we don’t have to worry about handling key errors if we try to access a parameter that the user didn’t set in the dictionary.

-
-
import math
import numpy as np
import pandas as pd
from collections import defaultdict


class XGBoostModel():
    '''XGBoost from Scratch
    '''

    def __init__(self, params, random_seed=None):
        self.params = defaultdict(lambda: None, params)
        self.subsample = self.params['subsample'] \
            if self.params['subsample'] else 1.0
        self.learning_rate = self.params['learning_rate'] \
            if self.params['learning_rate'] else 0.3
        self.base_prediction = self.params['base_score'] \
            if self.params['base_score'] else 0.5
        self.max_depth = self.params['max_depth'] \
            if self.params['max_depth'] else 5
        self.rng = np.random.default_rng(seed=random_seed)
+

Transform the target if necessary

+

In the interest of speed and efficiency, we didn’t bother doing any EDA with the feature data. Part of my justification for this is that trees are incredibly robust to outliers, collinearity, missingness, and other assorted nonsense in the feature data. However, they are not necessarily robust to nonsense in the target variable, so it’s worth having a look at it before proceeding any further.

+
+
df.SalePrice.hist(); plt.xlabel('SalePrice');

histogram of sale price showing right-skewed data
+

Often when predicting prices it makes sense to use log price, especially when they span multiple orders of magnitude or have a strong right skew. These data look pretty friendly, lacking outliers and exhibiting only a mild positive skew; we could probably get away without doing any transformation. But checking the evaluation metric used to score the Kaggle competition, we see they’re using root mean squared log error. That’s equivalent to using RMSE on log-transformed target data, so let’s go ahead and work with log prices.
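To see that equivalence concretely, here is a quick check (my own sketch, not from the original post; the helper names rmse and rmsle are made up for illustration):

import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([10_000., 50_000., 120_000.])
y_pred = np.array([12_000., 45_000., 100_000.])

assert np.isclose(rmsle(y_true, y_pred), rmse(np.log1p(y_true), np.log1p(y_pred)))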

+
+
df['logSalePrice'] = np.log1p(df['SalePrice'])
df.logSalePrice.hist(); plt.xlabel('logSalePrice');

histogram of log sale price showing a more symmetric distribution
+ +
+

Train and Evaluate the XGBoost regression model

+

Having prepared our dataset, we are now ready to train an XGBoost model. We’ll walk through the flow step-by-step first, then later we’ll collect the code in a single cell, so it’s easier to quickly iterate through variations of the model.

+
+

Specify target and feature columns

+

First we’ll put together a list of our features and define the target column. I like to have an actual list defined in the code so it’s easier to see everything we’re putting into the model and easier to add or remove features as we iterate. Just run something like list(df.columns) in a cell to get a copy-pasteable list of columns, then edit it down to the full list of features, i.e. remove the target, date columns, and other non-feature columns.

+
+
# list(df.columns)

features = [
    'SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
    'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
    'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
    'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
    'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
    'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
    'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
    'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler',
    'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
    'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
    'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
    'Differential_Type', 'Steering_Controls', 'saledate_year',
    'saledate_month', 'saledate_day', 'saledate_quarter',
    'saledate_day_of_year', 'saledate_day_of_week',
    'saledate_days_since_epoch',
]

target = 'logSalePrice'
-

The fit method, based on our classic GBM, takes a feature dataframe, a target vector, the objective function, and the number of boosting rounds as arguments. The user-supplied objective function should be an object with loss, gradient, and hessian methods, each of which takes a target vector and a prediction vector as input; the loss method should return a scalar loss score, the gradient method should return a vector of gradients, and the hessian method should return a vector of hessians.

-

In contrast to boosting in the classic GBM, instead of computing residuals between the current predictions and the target, we compute gradients and hessians of the loss function with respect to the current predictions, and instead of predicting residuals with a decision tree, we fit a special XGBoost tree booster (which we’ll implement in a moment) using the gradients and hessians. I’ve also added row subsampling by drawing a random subset of instance indices and passing them to the tree booster during each boosting round. The rest of the fit method is the same as the classic GBM, and the predict method is identical too.
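For concreteness, here is one way such an objective could look for squared error; this is my own minimal sketch of the interface described above (the class name and exact scaling are illustrative, not taken from the post):

import numpy as np

class SquaredErrorObjective():
    '''Implements the objective interface: scalar loss, per-instance gradient and hessian.'''
    def loss(self, y, pred):
        return np.mean((y - pred) ** 2)
    def gradient(self, y, pred):
        return pred - y          # derivative of 0.5 * (y - pred)^2 with respect to pred
    def hessian(self, y, pred):
        return np.ones(len(y))   # second derivative is constant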

-
-
def fit(self, X, y, objective, num_boost_round, verbose=False):
    current_predictions = self.base_prediction * np.ones(shape=y.shape)
    self.boosters = []
    for i in range(num_boost_round):
        gradients = objective.gradient(y, current_predictions)
        hessians = objective.hessian(y, current_predictions)
        sample_idxs = None if self.subsample == 1.0 \
            else self.rng.choice(len(y),
                                 size=math.floor(self.subsample*len(y)),
                                 replace=False)
        booster = TreeBooster(X, gradients, hessians,
                              self.params, self.max_depth, sample_idxs)
        current_predictions += self.learning_rate * booster.predict(X)
        self.boosters.append(booster)
        if verbose:
            print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')

def predict(self, X):
    return (self.base_prediction + self.learning_rate
            * np.sum([booster.predict(X) for booster in self.boosters], axis=0))

XGBoostModel.fit = fit
XGBoostModel.predict = predict
+

Split the data into training and validation sets

+

Next we split the dataset into a training set and a validation set. Of course since we’re going to evaluate against the validation set a number of times as we iterate, it’s best practice to keep a separate test set reserved to check our final model to ensure it generalizes well. Assuming that final test set is hidden away, we can use the rest of the data for training and validation.

+

There are two main ways we might want to select the validation set. If there isn’t a temporal ordering of the observations, we might be able to randomly sample. In practice, it’s much more common that observations have a temporal ordering, and that models are trained on observations up to a certain time and used to predict on observations occurring after that time. Since this data is temporal, we don’t want to split randomly; instead we’ll split on observation date, reserving the latest observations for the validation set.

+
+
# Temporal Validation Set
def train_test_split_temporal(df, datetime_column, n_test):
    idx_sort = np.argsort(df[datetime_column])
    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
    return df.iloc[idx_train, :], df.iloc[idx_test, :]


# Random Validation Set
def train_test_split_random(df, n_test):
    np.random.seed(42)
    idx_sort = np.random.permutation(len(df))
    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
    return df.iloc[idx_train, :], df.iloc[idx_test, :]


my_train_test_split = lambda d, n_valid: train_test_split_temporal(d, 'saledate', n_valid)
# my_train_test_split = lambda d, n_valid: train_test_split_random(d, n_valid)

n_valid = 12000
train_df, valid_df = my_train_test_split(df, n_valid)

train_df.shape, valid_df.shape

((389125, 61), (12000, 61))
+
+
+
+
+

Create DMatrix data objects

+

XGBoost uses its own optimized data structure called DMatrix for efficient training and prediction, so next we need to create DMatrix objects for our training and validation datasets.

+
+

If you prefer to use the scikit-learn interface to XGBoost, you don’t need to create these dense matrix objects. More on that below.

+
+
+
dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+
+
+

Set the XGBoost parameters

+

XGBoost has numerous hyperparameters. Fortunately, just a handful of them tend to be the most influential; furthermore, the default values are not bad in most situations. I like to start out with a dictionary containing the default parameter values for just the ones I think are most important. For training there is one required boosting parameter called num_boost_round which I set to 50 as a starting point; you can make this smaller initially if training takes too long.

+
+
# default values for important parameters
params = {
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50
-

All we have to do now is implement the tree booster.

-
-
-

The XGBoost Tree Booster

-

The XGBoost tree booster is a modified version of the decision tree that we built in the decision tree from scratch post. Like the decision tree, we recursively build a binary tree structure by finding the best split rule for each node in the tree. The main difference is the criterion for evaluating splits and the way that we define a leaf’s predicted value. Instead of being functions of the target values of the instances in each node, the criterion and predicted values are functions of the instance gradients and hessians. Thus we need only make a couple of modifications to our previous decision tree implementation to create the XGBoost tree booster.

-
-

Initialization and Inserting Child Nodes

-

Most of the init method is just parsing the parameter dictionary to assign parameters as object attributes. The one notable difference from our decision tree is in the way we define the node’s predicted value. We define self.value according to equation 5 of the XGBoost paper, a simple function of the gradient and hessian values of the instances in the current node. Of course the init also goes on to build the tree via the maybe insert child nodes method. This method is nearly identical to the one we implemented for our decision tree. So far so good.
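As a reminder of where that formula comes from (my paraphrase of equation 5 in the XGBoost paper, not a quote from the post), the optimal leaf weight for a node with gradient sum $G$ and hessian sum $H$ is

$$ w^* = -\frac{G}{H + \lambda} $$

which is exactly what self.value computes from the instances in the current node.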

-
-
class TreeBooster():

    def __init__(self, X, g, h, params, max_depth, idxs=None):
        self.params = params
        self.max_depth = max_depth
        assert self.max_depth >= 0, 'max_depth must be nonnegative'
        self.min_child_weight = params['min_child_weight'] \
            if params['min_child_weight'] else 1.0
        self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0
        self.gamma = params['gamma'] if params['gamma'] else 0.0
        self.colsample_bynode = params['colsample_bynode'] \
            if params['colsample_bynode'] else 1.0
        if isinstance(g, pd.Series): g = g.values
        if isinstance(h, pd.Series): h = h.values
        if idxs is None: idxs = np.arange(len(g))
        self.X, self.g, self.h, self.idxs = X, g, h, idxs
        self.n, self.c = len(idxs), X.shape[1]
        self.value = -g[idxs].sum() / (h[idxs].sum() + self.reg_lambda) # Eq (5)
        self.best_score_so_far = 0.
        if self.max_depth > 0:
            self._maybe_insert_child_nodes()

    def _maybe_insert_child_nodes(self):
        for i in range(self.c): self._find_better_split(i)
        if self.is_leaf: return
        x = self.X.values[self.idxs, self.split_feature_idx]
        left_idx = np.nonzero(x <= self.threshold)[0]
        right_idx = np.nonzero(x > self.threshold)[0]
        self.left = TreeBooster(self.X, self.g, self.h, self.params,
                                self.max_depth - 1, self.idxs[left_idx])
        self.right = TreeBooster(self.X, self.g, self.h, self.params,
                                 self.max_depth - 1, self.idxs[right_idx])

    @property
    def is_leaf(self): return self.best_score_so_far == 0.

    def _find_better_split(self, feature_idx):
        pass
+

Train the XGBoost model

+

Check out the documentation on the learning API to see all the training options. During training, I like to have XGBoost print out the evaluation metric on the train and validation set after every few boosting rounds and again at the end of training; that can be done by setting evals and verbose_eval. You can also save the evaluation results in a dictionary passed into evals_result to inspect and plot the objective curve over the training iterations.

+
+
evals_result = {}
m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')],
              verbose_eval=10,
              evals_result=evals_result)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+
+

Train the XGBoost model using the sklearn interface

+

You can optionally use the sklearn estimator interface to XGBoost. This will bypass the need to use the DMatrix data objects for training and prediction, and it will allow you to leverage many of the other scikit-learn ecosystem tools like pipelines, parameter search, partial dependence plots, etc. The XGBRegressor is available in the xgboost library that we’ve already imported.

+
+
# scikit-learn interface
reg = xgb.XGBRegressor(n_estimators=num_boost_round, **params)
reg.fit(train_df[features], train_df[target],
        eval_set=[(train_df[features], train_df[target]), (valid_df[features], valid_df[target])],
        verbose=10);
+
+
[0] validation_0-rmse:6.74422   validation_1-rmse:6.79733
+[10]    validation_0-rmse:0.34798   validation_1-rmse:0.37158
+[20]    validation_0-rmse:0.26289   validation_1-rmse:0.28239
+[30]    validation_0-rmse:0.25148   validation_1-rmse:0.27028
+[40]    validation_0-rmse:0.24375   validation_1-rmse:0.26420
+[49]    validation_0-rmse:0.23738   validation_1-rmse:0.25855
+
+
+

Since not all features of XGBoost are available through the scikit-learn estimator interface, you might want to get the native booster object back out of the sklearn wrapper.

+
+
m = reg.get_booster()
+
+
+
+

Evaluate the model and check for overfitting

+

We get the model evaluation metrics on the training and validation sets printed to stdout when we use the evals argument to the training API. Typically I just look at those printed metrics, but let’s double check by hand.

+
+
def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))

0.25855368
+
+
+

So, how good is that RMSLE of 0.259? Well, checking the Kaggle leaderboard for this competition, we would have come in 53rd out of 474, which is in the top 12% of submissions. That’s not bad for 10 minutes of work doing the bare minimum necessary to transform the raw data into a format consumable by XGBoost and then training a model using default hyperparameter values.

+
+

Note that we’re using a different validation set from that used for the final leaderboard (which is long closed), but our score is likely still a decent approximation for how we would have done in the competition.

+
+

It can be helpful to take a look at objective curves for training and validation data to get a sense for the extent of overfitting. A huge difference between training and validation performance indicates overfitting. In the below curve, there is very little overfitting, indicating we can be aggressive with hyperparameters that increase model flexibility. More on that soon.

+
+
pd.DataFrame({
    'train': evals_result['train']['rmse'],
    'valid': evals_result['valid']['rmse']
}).plot(); plt.xlabel('boosting round'); plt.ylabel('objective');
+
+

line plot showing objective function versus training iteration for training and validation sets

+
+
+
+
+

Check feature importance

+

It’s helpful to get an idea of how much the model is using each feature. In following iterations we might want to try dropping low-signal features or examining the important ones more closely for feature engineering ideas. The gigantic caveat to keep in mind here is that there are different measures of feature importance, and each one will give different importances. XGBoost provides three importance measures; I tend to prefer looking at the weight measure because its rankings usually seem most intuitive.

+
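If you want to eyeball how the three measures disagree, a quick side-by-side comparison (my own sketch, not from the post) looks like this:

imp = pd.DataFrame({t: pd.Series(m.get_score(importance_type=t))
                    for t in ['weight', 'gain', 'cover']})
imp.sort_values('weight', ascending=False).head(10)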
+
fig, ax = plt.subplots(figsize=(5, 10))
feature_importances = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
feature_importances.plot.barh(ax=ax)
plt.title('Feature Importance');
+
+

feature importance plot showing a few high importance features and many low importance ones

+
+
+
+
+
+

Improve performance using a model iteration loop

+

At this point we have a half-decent prototype model. Now we enter the model iteration loop in which we adjust features and model parameters to find configurations that have better and better performance.

+

Let’s start by putting the feature and target specification, the training/validation split, the model training, and the evaluation all together in one code block that we can copy paste for easy model iteration.

+
+

Note that for this process to be effective, model training needs to take less than 10 seconds. Otherwise you’ll be sitting around waiting way too long. If training takes too long, try training on a sample of the training data, or try reducing the number of boosting rounds.
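If training is too slow for comfortable iteration, one option (a sketch of my own, not from the original post; train_sample_df is a made-up name) is to iterate on a random subset of the training rows:

train_sample_df = train_df.sample(frac=0.25, random_state=42)
dtrain_sample = xgb.DMatrix(data=train_sample_df[features], label=train_sample_df[target],
                            enable_categorical=True)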

+
+
+
features = [
    'SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
    'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
    'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
    'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
    'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
    'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
    'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
    'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler',
    'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
    'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
    'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
    'Differential_Type', 'Steering_Controls', 'saledate_year',
    'saledate_month', 'saledate_day', 'saledate_quarter',
    'saledate_day_of_year', 'saledate_day_of_week',
    'saledate_days_since_epoch',
]

target = 'logSalePrice'

train_df, valid_df = train_test_split_temporal(df, 'saledate', 12000)
dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)

params = {
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50

m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
-
-

Split Finding

-

Split finding follows the exact same pattern that we used in the decision tree, except we keep track of gradient and hessian stats instead of target value stats, and of course we use the XGBoost gain criterion (equation 7 from the paper) for evaluating splits.
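For reference, the split criterion being implemented is (my paraphrase of equation 7 from the XGBoost paper, not a quote from the post):

$$ \text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma $$

where $G$ and $H$ denote sums of gradients and hessians on the left and right sides of the candidate split.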

-
-
def _find_better_split(self, feature_idx):
    x = self.X.values[self.idxs, feature_idx]
    g, h = self.g[self.idxs], self.h[self.idxs]
    sort_idx = np.argsort(x)
    sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]
    sum_g, sum_h = g.sum(), h.sum()
    sum_g_right, sum_h_right = sum_g, sum_h
    sum_g_left, sum_h_left = 0., 0.

    for i in range(0, self.n - 1):
        g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]
        sum_g_left += g_i; sum_g_right -= g_i
        sum_h_left += h_i; sum_h_right -= h_i
        if sum_h_left < self.min_child_weight or x_i == x_i_next: continue
        if sum_h_right < self.min_child_weight: break

        gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))
                        + (sum_g_right**2 / (sum_h_right + self.reg_lambda))
                        - (sum_g**2 / (sum_h + self.reg_lambda))
                        ) - self.gamma/2 # Eq(7) in the xgboost paper
        if gain > self.best_score_so_far:
            self.split_feature_idx = feature_idx
            self.best_score_so_far = gain
            self.threshold = (x_i + x_i_next) / 2

TreeBooster._find_better_split = _find_better_split
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+

Feature selection

+
+

Drop low-importance features

+

Let’s try training a model on only the top k most important features. You can try different values of k for the rankings created from each of the three importance measures. You can play with how many to keep, looking for the optimal number manually.

+
+
feature_importances_weight = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
feature_importances_cover = pd.Series(m.get_score(importance_type='cover')).sort_values(ascending=False)
feature_importances_gain = pd.Series(m.get_score(importance_type='gain')).sort_values(ascending=False)

# features = list(feature_importances_weight[:30].index)
# features = list(feature_importances_cover[:35].index)
features = list(feature_importances_gain[:30].index)

dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)

params = {
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50

m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37150
+[20]    train-rmse:0.26182  valid-rmse:0.27986
+[30]    train-rmse:0.24974  valid-rmse:0.26896
+[40]    train-rmse:0.24282  valid-rmse:0.26043
+[49]    train-rmse:0.23768  valid-rmse:0.25664
+
+
+

Looks like keeping the top 30 from the gain importance type gives a slight performance improvement.

+
+
+

Drop one feature at a time

+

Next try dropping each feature out of the model one-at-a-time to see if there are any more features that you can drop. For each feature, drop it from the feature set, then train a new model, then record the evaluation score. At the end, sort the scores to see which features are the best candidates for removal.

+
+
features = [
     'Coupler_System',
     'Tire_Size',
     'Scarifier',
     'ProductSize',
     'Ride_Control',
     'fiBaseModel',
     'Enclosure',
     'Pad_Type',
     'YearMade',
     'fiSecondaryDesc',
     'ProductGroup',
     'Drive_System',
     'Ripper',
     'saledate_days_since_epoch',
     'fiModelDescriptor',
     'fiProductClassDesc',
     'MachineID',
     'Hydraulics',
     'SalesID',
     'Track_Type',
     'ModelID',
     'fiModelDesc',
     'Travel_Controls',
     'Transmission',
     'Blade_Extension',
     'fiModelSeries',
     'Grouser_Tracks',
     'Undercarriage_Pad_Width',
     'Stick',
     'Thumb'
]

# drop each feature one-at-a-time
scores = []
for i, feature in enumerate(features):
    drop_one_features = features[:i] + features[i+1:]

    dtrain = xgb.DMatrix(data=train_df[drop_one_features], label=train_df[target], enable_categorical=True)
    dvalid = xgb.DMatrix(data=valid_df[drop_one_features], label=valid_df[target], enable_categorical=True)

    params = {
        'learning_rate': 0.3,
        'max_depth': 6,
        'min_child_weight': 1,
        'subsample': 1,
        'colsample_bynode': 1,
        'objective': 'reg:squarederror',
    }
    num_boost_round = 50

    m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
                  evals=[(dtrain, 'train'), (dvalid, 'valid')],
                  verbose_eval=False)
    score = root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))
    scores.append(score)

results_df = pd.DataFrame({
    'feature': features,
    'score': scores
})
results_df.sort_values(by='score')
      feature                      score
18    SalesID                      0.252617
5     fiBaseModel                  0.253710
27    Undercarriage_Pad_Width      0.254032
17    Hydraulics                   0.254114
20    ModelID                      0.254169
4     Ride_Control                 0.254278
16    MachineID                    0.254413
19    Track_Type                   0.254825
6     Enclosure                    0.254958
28    Stick                        0.255164
1     Tire_Size                    0.255365
10    ProductGroup                 0.255404
22    Travel_Controls              0.255895
29    Thumb                        0.256300
23    Transmission                 0.256380
26    Grouser_Tracks               0.256395
11    Drive_System                 0.256652
24    Blade_Extension              0.256698
7     Pad_Type                     0.256952
25    fiModelSeries                0.257073
2     Scarifier                    0.257590
12    Ripper                       0.257848
0     Coupler_System               0.258074
21    fiModelDesc                  0.258712
13    saledate_days_since_epoch    0.259856
14    fiModelDescriptor            0.260439
9     fiSecondaryDesc              0.260782
15    fiProductClassDesc           0.263790
3     ProductSize                  0.268068
8     YearMade                     0.313105

Next try removing the feature with the best removal score. Then with that feature still removed, also try removing the feature with the next best removal score and so on. Repeat this process until the model evaluation metric is no longer improving. I think this could be considered a faster version of backward stepwise feature selection.

+
+
features = [
     'Coupler_System',
     'Tire_Size',
     'Scarifier',
     'ProductSize',
     'Ride_Control',
#      'fiBaseModel',
     'Enclosure',
     'Pad_Type',
     'YearMade',
     'fiSecondaryDesc',
     'ProductGroup',
     'Drive_System',
     'Ripper',
     'saledate_days_since_epoch',
     'fiModelDescriptor',
     'fiProductClassDesc',
     'MachineID',
#      'Hydraulics',
#      'SalesID',
     'Track_Type',
     'ModelID',
     'fiModelDesc',
     'Travel_Controls',
     'Transmission',
     'Blade_Extension',
     'fiModelSeries',
     'Grouser_Tracks',
#      'Undercarriage_Pad_Width',
     'Stick',
     'Thumb'
]

dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)

params = {
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50

m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)

[0] train-rmse:6.74422  valid-rmse:6.79145
[10]    train-rmse:0.34882  valid-rmse:0.37201
[20]    train-rmse:0.26050  valid-rmse:0.27386
[30]    train-rmse:0.24844  valid-rmse:0.26205
[40]    train-rmse:0.24042  valid-rmse:0.25426
[49]    train-rmse:0.23549  valid-rmse:0.25004

So here I was able to remove four more features before the score started getting worse. With our reduced feature set, we’re now ranking 39th on that Kagle leaderboard. Let’s see how far we can get with some hyperparameter tuning.

+
+
+
+

Tune the XGBoost hyperparameters

+

This is a topic which deserves its own full-length post, but just for fun, here I’ll do a quick and dirty hand tuning without a ton of explanation.

+

Broadly speaking, my process is to increase model expressiveness by increasing the maximum tree depth until it looks like I’m overfitting. At that point, I start pushing tree pruning parameters like min child weight and regularization parameters like lambda to counteract the overfitting. That process led me to the following parameters.

+
+
params = {
    'learning_rate': 0.3,
    'max_depth': 10,
    'min_child_weight': 14,
    'lambda': 5,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50

m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)

[0] train-rmse:6.74473  valid-rmse:6.80196
[10]    train-rmse:0.31833  valid-rmse:0.34151
[20]    train-rmse:0.22651  valid-rmse:0.24885
[30]    train-rmse:0.21501  valid-rmse:0.23904
[40]    train-rmse:0.20897  valid-rmse:0.23645
[49]    train-rmse:0.20418  valid-rmse:0.23412

That gets us up to 12th place. Next I start reducing the learning rate and increasing the boosting rounds in proportion to one another.

+
+
params = {
    'learning_rate': 0.3/5,
    'max_depth': 10,
    'min_child_weight': 14,
    'lambda': 5,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50*5

m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)

[0] train-rmse:9.04930  valid-rmse:9.12743
[10]    train-rmse:4.88505  valid-rmse:4.93769
[20]    train-rmse:2.64630  valid-rmse:2.68501
[30]    train-rmse:1.44703  valid-rmse:1.47923
[40]    train-rmse:0.81123  valid-rmse:0.84079
[50]    train-rmse:0.48441  valid-rmse:0.51272
[60]    train-rmse:0.32887  valid-rmse:0.35434
[70]    train-rmse:0.26276  valid-rmse:0.28630
[80]    train-rmse:0.23720  valid-rmse:0.26026
[90]    train-rmse:0.22658  valid-rmse:0.24932
[100]   train-rmse:0.22119  valid-rmse:0.24441
[110]   train-rmse:0.21747  valid-rmse:0.24114
[120]   train-rmse:0.21479  valid-rmse:0.23923
[130]   train-rmse:0.21250  valid-rmse:0.23768
[140]   train-rmse:0.21099  valid-rmse:0.23618
[150]   train-rmse:0.20928  valid-rmse:0.23524
[160]   train-rmse:0.20767  valid-rmse:0.23445
[170]   train-rmse:0.20658  valid-rmse:0.23375
[180]   train-rmse:0.20558  valid-rmse:0.23307
[190]   train-rmse:0.20431  valid-rmse:0.23252
[200]   train-rmse:0.20316  valid-rmse:0.23181
[210]   train-rmse:0.20226  valid-rmse:0.23145
[220]   train-rmse:0.20133  valid-rmse:0.23087
[230]   train-rmse:0.20045  valid-rmse:0.23048
[240]   train-rmse:0.19976  valid-rmse:0.23023
[249]   train-rmse:0.19902  valid-rmse:0.23009

Decreasing the learning rate and increasing the boosting rounds got us up to a 2nd place score. Notice that the score is still decreasing on the validation set. We can actually continue boosting on this model by passing it to the xgb_model argument in the train function. We want to go very very slowly here to avoid overshooting the minimum of the objective function. To do that I ramp up the lambda regularization parameter and boost a few more rounds from where we left off.

+
+
# second stage
params = {
    'learning_rate': 0.3/10,
    'max_depth': 10,
    'min_child_weight': 14,
    'lambda': 60,
    'subsample': 1,
    'colsample_bynode': 1,
    'objective': 'reg:squarederror',
}
num_boost_round = 50*3

m1 = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
               evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10,
               xgb_model=m)

[0] train-rmse:0.19900  valid-rmse:0.23007
[10]    train-rmse:0.19862  valid-rmse:0.22990
[20]    train-rmse:0.19831  valid-rmse:0.22975
[30]    train-rmse:0.19796  valid-rmse:0.22964
[40]    train-rmse:0.19768  valid-rmse:0.22955
[50]    train-rmse:0.19739  valid-rmse:0.22940
[60]    train-rmse:0.19714  valid-rmse:0.22935
[70]    train-rmse:0.19689  valid-rmse:0.22927
[80]    train-rmse:0.19664  valid-rmse:0.22915
[90]    train-rmse:0.19646  valid-rmse:0.22915
[100]   train-rmse:0.19620  valid-rmse:0.22910
[110]   train-rmse:0.19604  valid-rmse:0.22907
[120]   train-rmse:0.19583  valid-rmse:0.22901
[130]   train-rmse:0.19562  valid-rmse:0.22899
[140]   train-rmse:0.19546  valid-rmse:0.22898
[149]   train-rmse:0.19520  valid-rmse:0.22886
+
root_mean_squared_error(dvalid.get_label(), m1.predict(dvalid))
+
+
0.22885828
+
+
+

And that gets us to 1st place on the leaderboard.

+
+
+
+

Wrapping Up

+

There you have it, how to use XGBoost to solve a regression problem in python with world class performance. Remember you can use the XGBoost regression notebook from my ds-templates repo to make it easy to follow this flow on your own problems. If you found this helpful, or if you have additional ideas about solving regression problems with XGBoost, let me know down in the comments.

+
]]>
python
tutorial
gradient boosting
xgboost
https://randomrealizations.com/posts/xgboost-for-regression-in-python/index.html
Tue, 24 Oct 2023 22:00:00 GMT

Blogging with Quarto and Jupyter: The Complete Guide
Matt Bowers
https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/index.html

Ahh, blogging. I think we can all agree it’s probably one of the greatest forms of written communication to have ever existed.

+

What’s that you say? You’d like to set up your own blog? And you say you want to use a dead simple, data science friendly tech stack? And you wouldn’t be caught dead handing over your painstakingly crafted content to Medium? No worries, friend, I know exactly what you need.

+

Enter Quarto.

+

In this post we’ll set up a blog using a lightweight tech stack consisting of a terminal running quarto, git, and jupyter, and we’ll use Github Pages to host our website for free. Optionally, for a few dollars a year, we can even host our website at our own custom domain.

+

A quick note on how to use this post. Quarto’s documentation on blogging provides a nice high-level overview of the blogging workflow, and I refer to it and many other bits of Quarto documentation here. At the time of writing, the handful of other blog posts about setting up quarto blogs are aimed at the RStudio user. This post exists to provide a jupyter and python-centric path for you to follow through the entire setup of your new quarto blog, and to impart my opinionated recommendations about best practices.

+

Let’s get into it!

+
+

What is Quarto?

+

Quarto is a way to render plain text source files containing markdown and code in python, R, and other languages into published formats like websites, books, slides, journal articles, etc. There is clearly a lot that we can do with it, but today we’ll use it to make a nice looking blog out of some jupyter notebook files.

+

Quarto follows the familiar convention of using a project directory to house all material for a given project. The directory will include source files like jupyter notebooks or Rmarkdown files, as well as configuration files that control how output files are rendered. We can then use the quarto command line utility to perform actions like previewing and rendering within the project directory.

+
+
+

Instantiate your blog

+
+

Create a new Quarto project

+

After installing quarto fire up a new terminal and check that the install was successful by running

+
quarto --version

Now think of a name for your blog’s project directory; this will also be the name of its git repository. The name will have no effect on your website’s name or URL, so don’t think too hard. The quarto documentation calls it myblog, so we’ll one-up them and call ours pirate-ninja-blog. Run the following command to create it in the current directory.

+
quarto create-project pirate-ninja-blog --type website:blog

That command creates a directory called pirate-ninja-blog containing everything you need to render your new blog. You can preview your website by running

+
quarto preview pirate-ninja-blog

Your local website will open in a new browser window. As you edit various aspects of your blog, the preview will update with your changes. This preview feature is so simple and so great.

+
+
+

+
Previewing your blog with quarto preview command
+
+
+
+
+

Set up a git repo

+

Change into your project directory and we’ll start setting up your git repo.

+
cd pirate-ninja-blog

Initialize a new git repo.

+
git init -b main

The _site/ directory is where quarto puts the rendered output files, so you’ll want to ignore it in git. I also like to just ignore any hidden files too, so add the following to your .gitignore file.

+
+
+
.gitignore

/.quarto/
/_site/

For now we’ll just stage the .gitignore file for the initial commit. Eventually you’ll want to commit the other files in your project too, either now or later as you edit them.

+
git add .gitignore
git commit -m "Initial commit."

Then follow GitHub’s instructions to add the local repo to GitHub using git. Basically just create a new blank repo on GitHub’s website, copy the remote repository url, then add the remote repo url to your local git repo.

+
git remote add origin <REMOTE_URL>

Then you’ll be able to push any commits you make to your remote repository on GitHub by saying git push.

+
+
+
+

Understand the components of a Quarto blog

+
+

Contents of the quarto project directory

+

Let’s have a quick look at what quarto put inside of the project directory.

+
_quarto.yml
about.qmd
index.qmd
profile.jpg
posts
styles.css
_site
  • Quarto uses yaml files to specify configurations. The _quarto.yml file specifies project-wide configurations.
  • Quarto’s markdown file type uses the extension `.qmd`. Each qmd file will correspond to a page in our website; `index.qmd` is the homepage and `about.qmd` is the About page.
  • profile.jpg is an image that is included on the about page.
  • styles.css defines css styles for the website.
  • posts is a directory where we can put qmd and other documents which will be rendered into blog posts.
  • posts/_metadata.yml contains configurations that apply to all documents in the posts directory.
  • _site is a directory that contains the rendered website. Whereas all the other files and directories constitute the source code for our blog, _site is the rendered output, i.e. the website itself.

Let’s take a closer look at these components and start to make the blog yours.

+
+
+

Project-wide Configurations

+

The _quarto.yml file controls project-wide configurations, website options, and HTML document options. Options in this file are specified in yaml in a key/value structure with three top level keys: project, website, and format. The quarto website options documentation has the full list of options that you can set here. It will be very helpful to take a look at some example _quarto.yml files in the wild, such as the one from quarto.org or even the one from this blog.

+

Under the website key, go ahead and set the title and description for your blog.

+
website:
  title: "Pirate Ninja Blog"
  description: "A blog about pirates, ninjas, and other things"

You can also customize your navbar which is visible at the top of all pages on your site. Also go ahead and set your github and twitter urls for the icons in the navbar.

+

Under the format key, you can also try changing the HTML theme to one of the other 25 built-in themes.
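As a rough sketch of what those tweaks can look like in _quarto.yml (the username, URLs, and theme below are placeholders I made up, not values generated by the project), you might end up with something along these lines:

website:
  title: "Pirate Ninja Blog"
  navbar:
    right:
      - about.qmd
      - icon: github
        href: https://github.com/your-username
      - icon: twitter
        href: https://twitter.com/your-username

format:
  html:
    theme: darkly
    css: styles.css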

+
+
+

The About Page

+

The about.qmd file defines an About page for the blog. Go ahead and fill in your details in the about.qmd file; you can also replace the profile.jpg file with your own image. Have a look at the quarto documentation on About pages to explore more functionality. Notably, you can change the template option to change the page layout.
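For instance, a minimal about.qmd front matter using one of the built-in templates might look roughly like this; the template choice and link URL here are just illustrative placeholders, not the defaults quarto generates:

---
title: "About"
image: profile.jpg
about:
  template: jolla
  links:
    - icon: github
      text: GitHub
      href: https://github.com/your-username
---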

+
+
+

The Homepage

+

The index.qmd file defines the landing page for your website. It is a listing page which shows links to all the pages in the posts directory. For now we don’t need to change anything here.

+
+
+

The posts/ directory

+

The posts directory contains all your blog posts. There aren’t really requirements for subdirectory structure inside the posts directory, but it’s a best practice to create a new subdirectory for each new blog post. This just helps keep auxiliary files like images or conda environment files organized. Out of the box, the posts directory looks like this.

+
posts
├── _metadata.yml
├── post-with-code
│   ├── image.jpg
│   └── index.qmd
└── welcome
    ├── index.qmd
    └── thumbnail.jpg

There are two reasons we want to be deliberate about how we organize and name things in the posts directory. First, the vast majority of our blog’s content will live here, so we don’t want it to be a big confusing mess. Second, the directory structure and file naming will be reflected in the URLs to our blog posts; if you prefer tidy-looking URLs, and I know you do, then you want to use tidy directory and file names in the posts directory.

+

You can check how the URLs look by navigating to one of the pre-populated posts in the site preview in your browser. For instance, the welcome post’s URL would be

+
https://example.com/posts/welcome/
+

When quarto renders the qmd file at posts/welcome/index.qmd it creates an output document in the website at posts/welcome/index.html. In fact the full URL to the post is,

+
https://example.com/posts/welcome/index.html
+

but the browser knows if you give it a URL with a path ending in a /, then it should look for the index.html file inside that directory.

+

So I think the best practice here is to name your new post subdirectory with the title of the post in all lower case with dashes for spaces, e.g. post-with-code. Then to force all output pages to be called index.html, you can set the output-file key in the posts/_metadata.yml file like this.

+
+
+
posts/_metadata.yml

output-file: index.html

Note that alternative naming conventions are possible; notably you might want to prefix each post name with the date in yyyy-mm-dd format, so the post subdirectories sort temporally and look nice in a list. That’s the convention used in Quarto’s own blog at quarto.org. As long as you keep everything for a given post inside its subdirectory, you should be good to go with nice-looking URLs.

+
+
+
+

Authoring posts with jupyter

+
+

Creating a new post

+

It turns out that quarto will render not only .qmd files, but also .ipynb files in the posts directory. So let’s create a new blog post from a notebook.

+

I think it’s a best practice to write draft posts in their own git branches; that way if you need to deploy some kind of hotfix to main while you’re drafting a post, you won’t have to deploy a half-written post living on the main branch. To start a new post, create a new development branch, change into the posts directory, create a new subdirectory with your preferred naming convention, change into that new directory, and fire up jupyter.

+
git checkout -b new-post
cd posts
mkdir new-post
cd new-post
jupyter notebook

Now create a new notebook from the jupyter UI. In order for quarto to recognize the document, the first cell of the notebook must be a raw text cell (press r in command mode to change a cell to raw text), and it must contain the document’s yaml front matter. You can use the following as a frontmatter template.

+
---
title: New Post
date: 2023-07-12
description: A nice new post
categories: [nonsense, code]
---

Now to preview your post, open a new terminal, change into your blog’s project directory and run the quarto preview command. You’ll see a link to the new post in the listing on the homepage. I usually like to have the preview open in a browser while I’m editing the jupyter notebook, just to make sure things look the way I want in the rendered output. From here you can keep editing the notebook, and the preview will update in the browser dynamically.

+
+
+

Markdown and code cells

+

From here you can put text in markdown cells and you can write code in code cells. Let’s add a markdown cell with some markdown formatting.

+
## A nice heading

Here is some lovely text and an equation.

$$ a^2 + b^2 = c^2 $$

Here's a list.

- a link to an [external website](https://quarto.org).
- a link to [another post in this blog](/posts/welcome/index.qmd).

This markdown will be rendered into the HTML page for the post. The last line in the above cell demonstrates the best practice for using relative urls to link to other resources within your website. Instead of providing the full url in the parentheses, just give the path to the qmd or ipynb file that you want to link to. Note that paths need to start with the / at the root of the quarto project, since without it, quarto will try to resolve paths relative to the location of the current document instead of the root of the project.

+

Then create a code cell with some code. Try something like this.

+
print('Hello, Quarto!')

By default, both code and cell output will be rendered into the HTML output. So far our jupyter notebook looks like this.

+
+
+

+
View of a new post being written in jupyter notebook
+
+
+

Back in the browser window running your blog preview, you can see the rendered page of the new post.

+
+
+

+
View of the preview of the rendered post
+
+
+
+
+

Figures

+

Let’s add a figure to our post. Add a new code cell with the following code.

+
# | fig-cap: This is my lovely line plot
# | fig-alt: A line plot extending up and to the right

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y = 2 * x + 1
plt.plot(x, y);

Notice a couple of important details. First I placed a semicolon at the end of the last line. That suppresses the [<matplotlib.lines.Line2D at 0x1111d00a0>] text output, which would otherwise show up in your blog post too.

+

Second, I added a couple of special comments at the top of the cell. Quarto allows you to specify numerous code execution options, designated by the # | prefix, to control the behavior and appearance of the code and output at a cell level. I set two keys here, fig-cap and fig-alt which respectively set the figure caption text and the image alt tag text. The fig-alt key is particularly important to set on all your figures because it provides the non-visual description for screenreader users reading your post. The alt tag should be a simple description of what the plot is and possibly what it shows or means. Be a friend of the blind and visually impaired community and set fig-alt on all of your figures.

+
+
+

Version control

+

As you edit your new post, go ahead and commit your changes on your development branch. Once you’ve finished your new post, you can merge it into main like this.

+
git checkout main
git merge new-post

Then you can push to GitHub by running git push. You should also be sure to run a final quarto preview to check that everything looks good before publishing to the web.

+
+
+
+

Publishing your blog to the web

+
+

Hosting with GitHub Pages

+

It’s likely that the easiest (read best) option for you is to host your blog on GitHub Pages. This is because GitHub pages is free, and since you already have your blog’s source code checked into a remote repository at GitHub, it’s very easy to set up. Quarto’s documentation on publishing to GitHub Pages outlines three ways to publish your website, but I recommend their option 2, using the quarto publish command. Once you set up your gh-pages branch as described in the documentation, you simply run quarto publish at the command line and your updates are deployed to your website.
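Assuming you’ve already created the gh-pages branch as described in that documentation, deploying is then a one-liner run from the project root:

quarto publish gh-pages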

+
+
+

Setting up your domain name

+

By default, if you choose to host with GitHub Pages, your website will be published to a url in the form https://username.github.io/reponame/. You can certainly do this; for example Jake VanderPlas’s awesome blog Pythonic Perambulations lives at http://jakevdp.github.io.

+

But, like me, you might want to get your own custom domain by buying, or really renting, one from a registrar. I use Namecheap. If you decide to go for a custom domain, refer to GitHub’s documentation on custom domains. You’ll also need to point your domain registrar to the IP address where GitHub Pages is hosting your website. For an example of how to do this at Namecheap, see Namecheap’s documentation about GitHub Pages
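Just to illustrate the shape of that DNS configuration (don’t copy these values blindly; check GitHub’s custom domain documentation for the current IP addresses), an apex domain setup generally amounts to four A records plus a CNAME for the www subdomain, something like:

A      @      185.199.108.153
A      @      185.199.109.153
A      @      185.199.110.153
A      @      185.199.111.153
CNAME  www    your-username.github.io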

+

Whether you decide to use the standard github.io domain or your own custom domain, be sure to set the site-url key in your _quarto.yml file to ensure other quarto functionality works correctly. For example

+
+
+
_quarto.yml

website:
  site-url: https://example.com/
+

Edit: I found that after upgrading to quarto 1.3, using quarto publish to publish from the gh-pages branch obliterates the CNAME file that is created when you set a custom domain in your repository settings > Pages > Custom Domain. That breaks the mapping from your custom domain to your published website. See this discussion thread for details. The fix is to manually create a CNAME file in the root of your project, and include it in the rendered website using the resources option under the project key in _quarto.yml. The CNAME file should just contain your custom domain, excluding any https://.

+
+
+
CNAME

example.com
+

With the CNAME file in the root of your quarto project, you can then include it in the rendered output.

+
+
+
_quarto.yml

project:
  resources:
    - CNAME
+
+
+
+

Keep in touch with your readers

+
+

RSS Feed

+

The RSS feed is handy for syndicating your posts to feed readers, other websites, and to your email subscribers. As described in quarto’s documentation on RSS feeds, you can automatically generate an RSS feed for your blog by first setting the value of site-url under the website key in _quarto.yml, and then setting feed: true under the listing key in the frontmatter of index.qmd. This will generate an RSS feed in the root of your website called index.xml. Once you have an RSS feed, go ahead and submit it to Python-Bloggers to have your work syndicated to a wider audience and to strengthen our little community of independent data science blogs.
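Putting that together, the front matter of index.qmd might end up looking roughly like this; the contents and sort settings shown are just typical listing options for context, not something you need to copy exactly:

---
title: "Pirate Ninja Blog"
listing:
  contents: posts
  sort: "date desc"
  feed: true
---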

+
+
+

Email Subscriptions

+

The idea here is to have a form field on your website where readers can input their email address to be added to your mailing list. Quarto’s documentation on subscriptions describes how to set up a subscribe box on your blog using MailChimp, so we won’t repeat it here. Once you have some subscribers, you can send them updates whenever you write a new post. You could do this manually or, in my case, set up an automation through MailChimp which uses your RSS feed to send out email updates to the list about new posts.

+
+
+

Comments

+

Quarto has built-in support for three different comment systems: hypothesis, utterances, and giscus. The good news is that these are all free to use, easy to set up, and AFAIK do not engage in any sketchy tracking activities. The bad news is that none of them are ideal because they all require the user to create an account and log in to leave a comment. We want to encourage readers to comment, so we don’t want them to have to create accounts or deal with passwords or pick all the squares with bicycles or any such nonsense, just to leave a little comment. To that end, I’ve actually been working on self-hosted login-free comments for this blog using isso, but it’s a bit more involved than these built-in solutions, so we’ll have to discuss it at length in a future post.

+

If you prefer an easy, out-of-the-box solution, I can recommend utterances, which uses GitHub issues to store comments for each post. I used utterances for comments on the first jekyll-based incarnation of this blog; you can still see the utterances comments on posts before this one. Go check out the Quarto documentation on comments to see how to set up utterances in your project.
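If you do go with utterances, the configuration lives under the comments key in _quarto.yml and just points at the GitHub repo whose issues will store the comments; here is a sketch with a placeholder repo name:

website:
  comments:
    utterances:
      repo: your-username/pirate-ninja-blog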

+
+
+

Analytics

+

As a data enthusiast, you’ll likely enjoy collecting some data about page views and visitors to your site. You might be tempted to use Google Analytics to do this; indeed quarto makes it very easy to just add a line to your _quarto.yml file to set it up. Unfortunately, in this case, going with the easy and free solution means supporting Google’s dubious corporate surveillance activities. Be a conscientious internet citizen and avoid using Google Analytics on your blog. Fortunately, there are numerous privacy-friendly alternatives to Google Analytics. For this blog I’m self-hosting umami analytics, which might warrant its own post in the future.

+
+
+
+

More humbly suggested best practices

+
+

Using conda environments for reproducibility

+

As you know, it’s a good practice to use an environment manager to keep track of packages, their versions, and other dependencies for software in a data science project. The same applies to blog posts; especially if you’re using unusual or bleeding-edge packages in a post. This will help us out a lot when we have to go back and re-run a notebook a couple years later to regenerate the output. Here we’ll use conda as our environment manager.

+

To be clear, I don’t bother doing this if I’m just using fairly stable functionality in standard packages like pandas, numpy, and matplotlib, but we’ll do it here for illustration. From a terminal sitting inside our post subdirectory at posts/new-post, create a new conda environment with the packages you’re using in the post.

+
conda create -p ./venv jupyter numpy matplotlib

Note the -p flag which tells conda to save the environment to ./venv in the current working directory. This will save all the installed packages here in the post directory instead of in your system-wide location for conda environments. Note also that you’ll want to avoid checking anything in the venv directory into source control, so add venv to the .gitignore file at the root of the quarto project to ignore all venv directories throughout your quarto project.
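In other words, the project-level .gitignore picks up one more entry alongside the ones we added earlier:

venv/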

+

Now whenever you work on this post, you’ll navigate to the post subdirectory with a terminal and activate the conda environment.

+
conda activate ./venv

Then you can fire up your jupyter notebook from the command line, and it will use the active conda environment.

+

Since we don’t want to check the venv directory with all its installed libraries into source control, we need to create an environment.yml file from which the environment can later be reproduced. With the local conda environment active, run the following.

+
conda env export --from-history > environment.yml

The --from-history flag tells conda to skip adding a bunch of system specific stuff that will gunk up your environment yaml file and make it harder to use for cross-platform reproducibility. This environment.yml file is the only environment management artifact that you need to check into git.

+

Later if you need to recreate the environment from the environment.yml file, you can use the following command.

+
conda env create -f environment.yml -p ./venv
+
+
+

Image file best practices

+

Let’s talk about image file sizes. The key idea is that we want images to have just enough resolution to look good; any more than that and we’re just dragging around larger-than-necessary files and wasting bandwidth and slowing down page load times.

+

You can read all about choosing optimal image sizes, but the TLDR is that images should be just large enough (in pixels) to fill the containers they occupy on the page. In our quarto blog, the two most common kinds of images are inline images we put in the body of posts and image thumbnails that show up as the associated image for a post, e.g. in the listing on our homepage. The inline image container seems to be about 800 pixels wide in my browser and the thumbnails are smaller, so adding some margin of error, I decided to go for 1000x750 for inline images and 500x375 for the thumbnails.

+

I use a command line tool called Image Magick to resize image files. Go ahead and install image magick with homebrew, and let’s add some images to our new post.

+

For this example I’ll use a nice shot of the London Underground from Wikipedia. Save your image as image.jpg. Then use image magick to create two new resized images for inline and thumbnail use.

+
convert image.jpg -resize 1000x1000 main.jpg
convert image.jpg -resize 500x500 thumbnail.jpg

These commands do not change the aspect ratio of the image; they just reduce the size so that the image fits within the size specified.

+

Now move both of your new images into the post subdirectory at posts/new-post/. To specify the thumbnail image, set the image key in the post’s front matter. Be sure to also add an alt tag description of the image using the image-alt key to keep it accessible for screen reader users. Our post’s frontmatter now looks like this.

+
---
title: New Post
date: 2023-07-12
description: A nice new post
categories: [nonsense, code]
image: thumbnail.jpg
image-alt: "A London Underground train emerging from a tunnel"
---

To include an image within the body of a post, use markdown in the post to include the image. I added a markdown cell just under the front matter containing the following.

+
![A London Underground train emerging from a tunnel](main.jpg "")
+

In your preview browser window, you can see we have the thumbnail for our new post on the homepage listing.

+
+
+

+
A screenshot of the homepage showing the new post’s thumbnail image
+
+
+

And we also have the inline image appearing in the body of the post.

+
+
+

+
A screenshot of the new post showing the image included in the body of the post
+
+
+

You can take a look at the source code for this blog to see some examples of including images in posts.

+
+
+
+

SEO

+

SEO is a huge topic, but here we’ll just focus on a few fundamental technical aspects that we want to be sure to get right. This boils down to registering with the top search engines by market share and ensuring that we’re providing them with the information they need to properly index our pages.

+

I checked the top search engines by global market share and as of 2023 it looks like Google has about 85%, Bing has about 8%, and the others have 2% or less each. So let’s focus on setting our site up to work well with Google search and Bing to get over 90% coverage.

+
+

Google Search Console and Bing Webmaster Tools

+

Google Search Console is a tool for web admins to help analyze search traffic and identify any technical issues that might prevent pages from appearing or ranking well in search. Go ahead and set up an account and register your blog in search console. You can refer to Google’s documentation on search console to guide you through setup and configuration.

+

Once you get set up on GSC, you can also create an account for Bing Webmaster Tools. Do this after setting up GSC because there is an option to import your information from your GSC account.

+

Once you’re set up with GSC and BWT, you’ll get email alerts anytime they crawl your site and detect any indexing problems. When that happens, track down the issues and fix them so your pages can appear in organic searches.

+
+
+

Sitemap

+

A sitemap is an xml document that lists all the pages on your website. It’s a map for the search engine bots that crawl the web looking for new pages to index. Quarto will automatically generate a sitemap called sitemap.xml in the root of your website, as long as you’ve filled out the site-url key in _quarto.yml. You can submit your website for indexing by providing your sitemap in Google Search Console and Bing Webmaster Tools.

+
+
+
+

Wrapping Up

+

Boy howdy, that was a lot, but at this point you should have a fully functioning blog, built with a minimalist, data-science-friendly tech stack consisting of quarto, jupyter, and GitHub. If you do create a blog using quarto, drop a link to it in the comments, and we can all check it out and celebrate your creation!

+
]]>
python
tutorial
blogging
https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/index.html
Tue, 05 Sep 2023 22:00:00 GMT

Random Realizations Resurrected
Matt Bowers
https://randomrealizations.com/posts/random-realizations-resurrected/index.html
+

+
Christ the Redeemer towers into a vast blue Brazilian sky.
+
+
+

Well it’s been over a year since I posted anything here. You see, a lot has been going on here at the Random Realizations Remote Global Headquarters that has distracted from producing the high-quality data science content that you’re used to. Mostly I went on hiatus from work and started traveling, which turns out to be its own full-time job. I had aspirations of writing more after leaving work, but of course, after leaving, I couldn’t be bothered to sit down at my laptop and type stuff about data science to y’all. After all, life is bigger than that.

+

When I finally felt like opening up my laptop, I was confronted with an email from the maintainers of fastpages, the open source content management system (CMS) I originally used to create this blog, notifying me that the project was being deprecated and that I would need to migrate my content to some other platform.

+

Boo.

+

That didn’t sound like much fun, so I spent another few months ignoring the blog. But eventually, dear reader, I decided it was time to roll up my sleeves and get this blog thriving once again.

+

Ok so fastpages was going to be deprecated, and I needed to find a new CMS. My requirements were pretty simple: I wanted to write the blog posts with jupyter notebook, and I wanted to host the site on my own domain. Helpfully, the former maintainers of fastpages recommended an alternative CMS called Quarto which I had never heard of. Apparently I had been living under a rock because Quarto appears to be all the rage. Quarto’s website says it’s an open-source scientific and technical publishing system. I think it’s fair to think of it as a way to render plain text or source code from languages like python, R, and julia into a variety of different published formats like websites, books, or journal articles. It was developed by the good folks over at RStudio, and the project has a pretty active following over on github, so I think it’s less likely to suddenly disappear like fastpages.

+

So anyway, I’ve been migrating my content over into this new quarto universe.

+

You may officially consider this blog resurrected from the dead, because this is the first new post published after the migration. The site has a bit of a new look and feel, so I hope you like it. Do let me know in the comments if you find anything amiss with the new website. Otherwise we’ll just assume it’s fabulous.

+

I’m working on a post about how to create a blog with quarto using jupyter and python, so you can too!

+

See you in more posts real soon! Love, Matt.

]]>
blogging
https://randomrealizations.com/posts/random-realizations-resurrected/index.html
Tue, 01 Aug 2023 22:00:00 GMT

XGBoost from Scratch
Matt Bowers
https://randomrealizations.com/posts/xgboost-from-scratch/index.html
+

+
A weathered tree reaches toward the sea at Playa Mal País
+
+
+

Well, dear reader, it’s that time again, time for us to do a seemingly unnecessary scratch build of a popular algorithm that most people would simply import from the library without a second thought. But readers of this blog are not most people. Of course you know that when we do scratch builds, it’s not for the hell of it, it’s for the purpose of demystification. To that end, today we are going to implement XGBoost from scratch in python, using only numpy and pandas.

+

Specifically we’re going to implement the core statistical learning algorithm of XGBoost, including most of the key hyperparameters and their functionality. Our implementation will also support user-defined custom objective functions, meaning that it can perform regression, classification, and whatever exotic learning tasks you can dream up, as long as you can write down a twice-differentiable objective function. We’ll refrain from implementing some simple features like column subsampling which will be left to you, gentle reader, as exercises. In terms of tree methods, we’re going to implement the exact tree-splitting algorithm, leaving the sparsity-aware method (used to handle missing feature values) and the approximate method (used for scalability) as exercises or maybe topics for future posts.

+

As always, if something is unclear, try backtracking through the previous posts on gradient boosting and decision trees to clarify your intuition. We’ve already built up all the statistical and computational background needed to make sense of this scratch build. Here are the most important prerequisite posts:

+
    +
  1. Gradient Boosting Machine from Scratch
  2. Decision Tree From Scratch
  3. How to Understand XGBoost
+

Great, let’s do this.

+
+

The XGBoost Model Class

+

We begin with the user-facing API for our model, a class called XGBoostModel which will implement gradient boosting and prediction. To be more consistent with the XGBoost library, we’ll pass hyperparameters to our model in a parameter dictionary, so our init method is going to pull relevant parameters out of the dictionary and set them as object attributes. Note the use of python’s defaultdict so we don’t have to worry about handling key errors if we try to access a parameter that the user didn’t set in the dictionary.

+
+
import math
import numpy as np 
import pandas as pd
from collections import defaultdict


class XGBoostModel():
    '''XGBoost from Scratch
    '''
    
    def __init__(self, params, random_seed=None):
        self.params = defaultdict(lambda: None, params)
        self.subsample = self.params['subsample'] \
            if self.params['subsample'] else 1.0
        self.learning_rate = self.params['learning_rate'] \
            if self.params['learning_rate'] else 0.3
        self.base_prediction = self.params['base_score'] \
            if self.params['base_score'] else 0.5
        self.max_depth = self.params['max_depth'] \
            if self.params['max_depth'] else 5
        self.rng = np.random.default_rng(seed=random_seed)
+

The fit method, based on our classic GBM, takes a feature dataframe, a target vector, the objective function, and the number of boosting rounds as arguments. The user-supplied objective function should be an object with loss, gradient, and hessian methods, each of which takes a target vector and a prediction vector as input; the loss method should return a scalar loss score, the gradient method should return a vector of gradients, and the hessian method should return a vector of hessians.

+

In contrast to boosting in the classic GBM, instead of computing residuals between the current predictions and the target, we compute gradients and hessians of the loss function with respect to the current predictions, and instead of predicting residuals with a decision tree, we fit a special XGBoost tree booster (which we’ll implement in a moment) using the gradients and hessians. I’ve also added row subsampling by drawing a random subset of instance indices and passing them to the tree booster during each boosting round. The rest of the fit method is the same as the classic GBM, and the predict method is identical too.

+
+
def fit(self, X, y, objective, num_boost_round, verbose=False):
    current_predictions = self.base_prediction * np.ones(shape=y.shape)
    self.boosters = []
    for i in range(num_boost_round):
        gradients = objective.gradient(y, current_predictions)
        hessians = objective.hessian(y, current_predictions)
        sample_idxs = None if self.subsample == 1.0 \
            else self.rng.choice(len(y), 
                                 size=math.floor(self.subsample*len(y)), 
                                 replace=False)
        booster = TreeBooster(X, gradients, hessians, 
                              self.params, self.max_depth, sample_idxs)
        current_predictions += self.learning_rate * booster.predict(X)
        self.boosters.append(booster)
        if verbose: 
            print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')
        
def predict(self, X):
    return (self.base_prediction + self.learning_rate 
            * np.sum([booster.predict(X) for booster in self.boosters], axis=0))

XGBoostModel.fit = fit
XGBoostModel.predict = predict

Testing

-

Let’s take this baby for a spin and benchmark its performance against the actual XGBoost library. We use the scikit learn California housing dataset for benchmarking.

-
-
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=43)

Let’s start with a nice friendly squared error objective function for training. We should probably have a future post all about how to define custom objective functions in XGBoost, but for now, here’s how I define squared error.

-
-
class SquaredErrorObjective():
    def loss(self, y, pred): return np.mean((y - pred)**2)
    def gradient(self, y, pred): return pred - y
    def hessian(self, y, pred): return np.ones(len(y))
-

Here I use a more or less arbitrary set of hyperparameters for training. Feel free to play around with tuning and trying other parameter combinations yourself.

import xgboost as xgb

params = {
    'learning_rate': 0.1,
    'max_depth': 5,
    'subsample': 0.8,
    'reg_lambda': 1.5,
    'gamma': 0.0,
    'min_child_weight': 25,
    'base_score': 0.0,
    'tree_method': 'exact',
}
num_boost_round = 50

# train the from-scratch XGBoost model
model_scratch = XGBoostModel(params, random_seed=42)
model_scratch.fit(X_train, y_train, SquaredErrorObjective(), num_boost_round)

# train the library XGBoost model
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
model_xgb = xgb.train(params, dtrain, num_boost_round)

+

The XGBoost Tree Booster

+

The XGBoost tree booster is a modified version of the decision tree that we built in the decision tree from scratch post. Like the decision tree, we recursively build a binary tree structure by finding the best split rule for each node in the tree. The main difference is the criterion for evaluating splits and the way that we define a leaf’s predicted value. Instead of being functions of the target values of the instances in each node, the criterion and predicted values are functions of the instance gradients and hessians. Thus we need only make a couple of modifications to our previous decision tree implementation to create the XGBoost tree booster.

+
+

Initialization and Inserting Child Nodes

+

Most of the init method is just parsing the parameter dictionary to assign parameters as object attributes. The one notable difference from our decision tree is in the way we define the node’s predicted value. We define self.value according to equation 5 of the XGBoost paper, a simple function of the gradient and hessian values of the instances in the current node. Of course the init also goes on to build the tree via the maybe insert child nodes method. This method is nearly identical to the one we implemented for our decision tree. So far so good.
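Just as a handy reference (this is my own restatement using the paper's notation, not a quote from it), the leaf value computed in the init below is

$$ w^* = -\frac{\sum_{i \in I} g_i}{\sum_{i \in I} h_i + \lambda} $$

where the sum runs over the instances $I$ in the node, $g_i$ and $h_i$ are their gradients and hessians, and $\lambda$ is the regularization parameter.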

+
+
class TreeBooster():
 
    def __init__(self, X, g, h, params, max_depth, idxs=None):
        self.params = params
        self.max_depth = max_depth
        assert self.max_depth >= 0, 'max_depth must be nonnegative'
        self.min_child_weight = params['min_child_weight'] \
            if params['min_child_weight'] else 1.0
        self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0
        self.gamma = params['gamma'] if params['gamma'] else 0.0
        self.colsample_bynode = params['colsample_bynode'] \
            if params['colsample_bynode'] else 1.0
        if isinstance(g, pd.Series): g = g.values
        if isinstance(h, pd.Series): h = h.values
        if idxs is None: idxs = np.arange(len(g))
        self.X, self.g, self.h, self.idxs = X, g, h, idxs
        self.n, self.c = len(idxs), X.shape[1]
        self.value = -g[idxs].sum() / (h[idxs].sum() + self.reg_lambda) # Eq (5)
        self.best_score_so_far = 0.
        if self.max_depth > 0:
            self._maybe_insert_child_nodes()

Let’s check the models’ performance on the held out test data to benchmark our implementation.

-
-
pred_scratch = model_scratch.predict(X_test)
pred_xgb = model_xgb.predict(dtest)
print(f'scratch score: {SquaredErrorObjective().loss(y_test, pred_scratch)}')
print(f'xgboost score: {SquaredErrorObjective().loss(y_test, pred_xgb)}')
scratch score: 0.2434125759558149
xgboost score: 0.24123239765807963
-
-
-

Well, look at that! Our scratch-built XGBoost is looking pretty consistent with the library. Go us!

-
-
-

Wrapping Up

-

I’d say this is a pretty good milestone for us here at Random Realizations. We’ve been hammering away at the various concepts around gradient boosting, leaving a trail of equations and scratch-built algos in our wake. Today we put all of that together to create a legit scratch build of XGBoost, something that would have been out of reach for me before we embarked on this journey together over a year ago. To anyone with the patience to read through this stuff, cheers to you! I hope you’re learning and enjoying this as much as I am.

-
-
-

Reader Exercises

-

If you want to take this a step further and deepen your understanding and coding abilities, let me recommend some exercises for you.

-
    -
  1. Implement column subsampling. XGBoost itself provides column subsampling by tree, by level, and by node. Try implementing by tree first, then try adding by level or by node as well. These should be pretty straightforward to do.
  2. Implement sparsity aware split finding for missing feature values (Algorithm 2 in the XGBoost paper). This will be a little more involved, since you’ll need to refactor and modify several parts of the tree booster class.
-
]]>
python
gradient boosting
from scratch
https://randomrealizations.com/posts/xgboost-from-scratch/index.html
Fri, 06 May 2022 21:00:00 GMT

XGBoost Explained
Matt Bowers
https://randomrealizations.com/posts/xgboost-explained/index.html
-

-
Tree branches on a chilly day in Johnson City
-
-
-

Ahh, XGBoost, what an absolutely stellar implementation of gradient boosting. Once Tianqi Chen and Carlos Guestrin of the University of Washington published the XGBoost paper and shared the open source code in the mid 2010’s, the algorithm quickly gained adoption in the ML community, appearing in over half of winning Kaggle submissions in 2015. Nowadays it’s certainly among the most popular gradient boosting libraries, along with LightGBM and CatBoost, although the highly scientific indicator of GitHub stars per year indicates that it is in fact the most beloved gradient boosting package of all. Since it was the first of the modern popular boosting frameworks, and since benchmarking indicates that no other boosting algorithm outperforms it, we can comfortably focus our attention on understanding XGBoost.

-

The XGBoost authors identify two key aspects of a machine learning system: (1) a flexible statistical model and (2) a scalable learning system to fit that model using data. XGBoost improves on both of these aspects, providing a more flexible and feature-rich statistical model and building a truly scalable system to fit it. In this post we’re going to focus on the statistical modeling innovations, outlining the key differences from the classic gradient boosting machine and diving into the mathematical derivation of the XGBoost learning algorithm. If you’re not already familiar with gradient boosting, go back and read the earlier posts in the series before jumping in here.

-

Buckle up, dear reader. Today we understand how XGBoost works, no hand waving required.

-
-

XGBoost is a Gradient Boosting Machine

-

At a high level, XGBoost is an iteratively constructed composite model, just like the classic gradient boosting machine we discussed back in the GBM post . The final model takes the form

$$\hat{y}_i = b + \eta \sum_{k=1}^{K} f_k(\mathbf{x}_i)$$

where $b$ is the base prediction, $\eta$ is the learning rate hyperparameter that helps control overfitting by reducing the contributions of each booster, and each of the boosters $f_k$ is a decision tree. To help us connect the dots between theory and code, whenever we encounter new hyperparameters, I’ll point out their names from the XGBoost Parameter Documentation. So, $b$ can be set by base_score, and $\eta$ is set by either eta or learning_rate.
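For example (an illustrative fragment, values arbitrary), these two knobs appear in a native-API parameter dictionary as:

params = {
    'base_score': 0.5,     # the base prediction b
    'learning_rate': 0.1,  # the shrinkage factor (alias: eta)
}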

-

XGBoost introduces two key statistical learning improvements over the classic gradient boosting model. First, it reimagines the gradient descent algorithm used for training, and second it uses a custom-built decision tree with extra functionality as its booster. We’ll dive into each of these key innovations in the following sections.

-
-
-

Descent Algorithm Innovations

-
-

Regularized Objective Function

-

In the post on GBM with any loss function, we looked at loss functions of the form $L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$ which compute some distance between targets and predictions and sum them up over the training dataset. XGBoost introduces regularization into the objective function so that the objective takes the form

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is some twice-differentiable loss function and $\Omega$ is a regularization term that penalizes the complexity of each tree booster, taking the form

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaf nodes and $\sum_j w_j^2$ is the squared sum of the leaf prediction values. This introduces two new hyperparameters: $\gamma$, which penalizes the number of leaf nodes, and $\lambda$, which is the L-2 regularization parameter for leaf predicted values. These are set by gamma and reg_lambda in the XGBoost parametrization. Together, these provide powerful new controls to reduce overfitting due to overly complex tree boosters. Note that setting $\gamma = 0$ and $\lambda = 0$ reduces the objective back to an unregularized loss function as used in the classic GBM.

-
-
-

An Aside on Newton’s Method

-

As we’ll see soon, XGBoost uses Newton’s Method to minimize its objective function, so let’s start with a quick refresher.

-

Newton’s method is an iterative procedure for minimizing a function $f(x)$. At each step we have some input $x_0$, and our goal is to find a nudge value $\Delta x$ such that

$$f(x_0 + \Delta x) < f(x_0)$$

To find a good nudge value $\Delta x$, we generate a local quadratic approximation of the function in the neighborhood of the input $x_0$, and then we find the value of $\Delta x$ that would bring us to the minimum of the quadratic approximation.

-
-
-

-
Schematic of Newton’s method
-
-
-

The figure shows a single Newton step where we start at $x_0$, find the local quadratic approximation, and then jump a distance $\Delta x$ along the $x$-axis to land at the minimum of the quadratic. If we iterate in this way, we are likely to land close to the minimum of $f$.

-

So how do we compute the quadratic approximation? We use the second order Taylor series expansion of $f$ near the point $x_0$.

$$f(x_0 + \Delta x) \approx f(x_0) + f'(x_0)\,\Delta x + \frac{1}{2} f''(x_0)\,\Delta x^2$$

To find the nudge value that minimizes the quadratic approximation, we can take the derivative with respect to $\Delta x$, set it to zero, and solve for $\Delta x$.

$$f'(x_0) + f''(x_0)\,\Delta x = 0$$

$$\Delta x = -\frac{f'(x_0)}{f''(x_0)}$$

And as long as $f''(x_0) > 0$ (i.e., the parabola is pointing up), this choice of $\Delta x$ takes us to the minimum of the quadratic approximation.
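To make the procedure concrete, here is a minimal sketch (my own illustration, not code from this series) of Newton’s method for a one-dimensional function, assuming we can supply the first and second derivatives directly.

def newton_minimize(f_prime, f_double_prime, x0, n_steps=10):
    # repeatedly apply the Newton update x <- x - f'(x) / f''(x)
    x = x0
    for _ in range(n_steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# toy example: f(x) = (x - 3)**2 + 1 has f'(x) = 2*(x - 3) and f''(x) = 2
newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)  # returns 3.0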

-
-
-

Tree Boosting with Newton’s Method

-

This lands us at the heart of XGBoost, which uses Newton’s method, rather than gradient descent, to guide each round of boosting. This explanation will correspond very closely to section 2.2 of the XGBoost paper, but here I’ll explicitly spell out some of the intermediate steps which are omitted from their derivation, and you’ll get some additional commentary from me along the way.

-
-

Newton Descent in Tree Space

-

Suppose we’ve done $t-1$ boosting rounds, and we want to add the $t$-th booster $f_t$ to our composite model. Our current model’s prediction for instance $i$ is $\hat{y}_i^{(t-1)}$. If we add a new tree booster $f_t$ to our model, the objective function would give

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)$$

We need to choose $f_t$ so that it decreases the loss, i.e. we want

$$\sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) < \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)}\right)$$

Does that sound familiar? In the previous section we used Newton’s method to find a value of $\Delta x$ that would make $f(x_0 + \Delta x) < f(x_0)$. Let’s try the same thing with our loss function. To be explicit, the parallels are: $x_0 \leftrightarrow \hat{y}_i^{(t-1)}$, $\Delta x \leftrightarrow f_t(\mathbf{x}_i)$, and $f \leftrightarrow l$.

-

Let’s start by finding the second order Taylor series approximation for the loss around the point $\hat{y}_i^{(t-1)}$.

$$l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) \approx l\left(y_i,\, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i)$$

where

$$g_i = \partial_{\hat{y}^{(t-1)}} \, l\left(y_i,\, \hat{y}_i^{(t-1)}\right)$$

and

$$h_i = \partial^2_{\hat{y}^{(t-1)}} \, l\left(y_i,\, \hat{y}_i^{(t-1)}\right)$$

are the first and second order partial derivatives of the loss with respect to the current predictions. The XGBoost paper calls these the gradients and hessians, respectively. Remember that when we specify an actual loss function to use, we would also specify the functional form of the gradients and hessians, so that they are directly computable.
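For example, for squared error loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$ (the convention used by the squared error objective later in this series), the gradients and hessians are simply

$$g_i = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = 1.$$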

-

Now we can go back and substitute our quadratic approximation in for the loss function to get an approximation of the objective function in the neighborhood of ..

-

-

Since is constant regardless of our choice of , we can drop it and instead work with the modified objective, which gives us Equation (3) from the paper.

-

-

Now the authors are about to do something great. They’re about to show how to directly compute the optimal prediction values for the leaf nodes of . We’ll circle back in a moment about how we find a good structure for , i.e. good node splits, but we’re going to find the optimal predicted values for any tree structure having terminal nodes. Let denote the set of instances that are in the -th leaf node of . Then we can rewrite the objective.

-

-

We notice that for all instances in , the tree yields the same predicted value . Substituting in for the predicted values and expanding we get

-

-

Rearranging terms we obtain Equation (4).

-

-

For each leaf node , our modified objective function is quadratic in . To find the optimal predicted values we take the derivative, set to zero, and solve for .

-

-

This yields Equation (5).

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
-
-

Split Finding

-

Now that we know how to find the optimal predicted value for any leaf node, we need to identify a criterion for finding a good tree structure, which boils down to finding the best split for a given node. Back in the decision tree from scratch post, we derived a split evaluation metric based on the reduction in the objective function associated with a particular split. To do that, first we need a way to compute the objective function given a particular tree structure. Substituting the optimal predicted values into the objective function, we get Equation (6).

$$\tilde{\text{obj}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

We can then evaluate potential splits by comparing the objective before making a split to the objective after making a split, where the split with the maximum reduction in objective (a.k.a. gain) is best.

-

More formally, let $I$ be the set of data instances in the current node, and let $I_L$ and $I_R$ be the instances that fall into the left and right child nodes of a proposed split. Let $L_I$ be the total loss for all instances in the node, while $L_{I_L}$ and $L_{I_R}$ are the losses for the left and right child nodes. The total loss contributed by instances in node $I$ prior to any split is

$$L_I = -\frac{1}{2} \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} + \gamma$$

And the loss after splitting into $I_L$ and $I_R$ is

$$L_{I_L} + L_{I_R} = -\frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} \right] + 2\gamma$$

The gain from this split is then

$$\text{Gain} = L_I - (L_{I_L} + L_{I_R}) = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$

which is Equation (7) from the paper. In practice it makes sense to accept a split only if the gain is positive, thus the parameter $\gamma$ sets the minimum gain required to make a further split. This is why $\gamma$ can be set with the parameter gamma or the more descriptive min_split_loss.
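As a quick illustration of Equation (7) (the helper below and its inputs are my own, not part of XGBoost’s API), the gain of a candidate split can be computed directly from the gradient and hessian sums of the proposed left and right children:

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    # Equation (7): reduction in the regularized objective from making the split
    def score(G, H): return G**2 / (H + reg_lambda)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

split_gain(G_left=-12.0, H_left=20.0, G_right=15.0, H_right=30.0)  # positive, so the split helps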

-
-
-
-
-

Tree Booster Innovations

-
-

Missing Values and Sparsity-Aware Split Finding

-

The XGBoost paper also introduces a modified algorithm for tree split finding which explicitly handles missing feature values. Recall that in order to find the best threshold value for a given feature, we can simply try all possible threshold values, recording the score for each. If some feature values are missing, the XGBoost split finding algorithm simply scores each threshold twice: once with missing value instances in the left node and once with them in the right node. The best split will then specify both the threshold value and to which node instances with missing values should be assigned. The paper calls this the sparsity aware split finding routine, which is defined as Algorithm 2.
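To sketch the idea (this is my own illustrative function, not the library’s implementation), scoring a single threshold under the sparsity-aware routine might look like this, where the gain is evaluated once with the missing-value rows grouped into the left child and once with them in the right child:

import numpy as np

def split_gain_with_missing(g, h, feature, threshold, reg_lambda=1.0, gamma=0.0):
    # score the same threshold twice: once sending missing-value rows to the left
    # child, once sending them to the right; keep whichever direction gains more
    def score(G, H): return G**2 / (H + reg_lambda)
    def gain(left_mask):
        right_mask = ~left_mask
        return 0.5 * (score(g[left_mask].sum(), h[left_mask].sum())
                      + score(g[right_mask].sum(), h[right_mask].sum())
                      - score(g.sum(), h.sum())) - gamma
    missing = np.isnan(feature)
    goes_left = ~missing & (feature <= threshold)
    gain_left, gain_right = gain(goes_left | missing), gain(goes_left)
    return (gain_left, 'left') if gain_left >= gain_right else (gain_right, 'right')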

-
-
-

Preventing Further Splitting

-

In addition to min_split_loss discussed above, XGBoost offers another parameter for limiting further tree splitting called min_child_weight. This name is a little confusing to me because the word “weight” has various meanings. In the context of this parameter, “weight” refers to the sum of the hessians over instances in the node. For squared error loss, $h_i = 1$ for every instance, so this is equivalent to the number of samples. Thus this parameter generalizes the notion of the minimum number of samples allowed in a terminal node.

-
-
-

Sampling

-

XGBoost takes a cue from Random Forest and introduces both column and row subsampling. These sampling methods can prevent overfitting and reduce training time by limiting the amount of data to be processed during boosting.

-

Like random forest, XGBoost implements column subsampling, which limits tree split finding to randomly selected subsets of features. XGBoost provides column sampling for each tree, for each depth level within a tree, and for each split point within a tree, controlled by colsample_bytree, colsample_bylevel, and colsample_bynode respectively.

-

One interesting distinction is that XGBoost implements row sampling without replacement using subsample, whereas random forest uses bootstrapping. The choice to bootstrap rows in RF probably stemmed from a desire to use as much data as possible while training on the smaller datasets of the 1990’s when RF was developed. With larger datasets and the ability to generate a large number of trees, XGBoost simply takes a subsample of rows for each tree.
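To make these knobs concrete, here is an illustrative native-API parameter dictionary (the values are arbitrary; the keys are standard XGBoost parameters):

params = {
    'min_child_weight': 25,    # minimum sum of instance hessians allowed in a node
    'subsample': 0.8,          # row subsampling (without replacement) per tree
    'colsample_bytree': 0.8,   # column subsampling per tree
    'colsample_bylevel': 1.0,  # column subsampling per depth level
    'colsample_bynode': 1.0,   # column subsampling per split
}
# e.g. model = xgb.train(params, dtrain, num_boost_round=50), where dtrain is an xgb.DMatrix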

-
-
-
-

Scalability

-

Even though we’re focused on statistical learning, I figured I’d comment on why XGBoost is highly scalable. Basically it boils down to efficient, parallelizable, and distributable methods for growing trees. You’ll notice there is a tree_method parameter which allows you to choose between the greedy exact algorithm (like the one we discussed in the decision tree from scratch post) and the approximate algorithm, which offers various scalability-related functionality, notably including the ability to consider only a small number of candidate split points instead of trying all possible splits. The algorithm also uses clever tricks like pre-sorting data for split finding and caching frequently needed values.
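For example (an illustrative fragment), switching between the two algorithms is just a parameter change:

params_exact = {'tree_method': 'exact'}    # greedy exact split finding: try every threshold
params_approx = {'tree_method': 'approx'}  # approximate split finding: only candidate quantiles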

-
-

Why XGBoost is so Successful

-

As I mentioned in the intro, XGBoost is simply a very good implementation of the gradient boosting tree model. Therefore it inherits all the benefits of decision trees and tree ensembles, while making even further improvements over the classic gradient boosting machine. These improvements boil down to

-
    -
  1. more ways to control overfitting
  2. elegant handling of custom objectives
  3. scalability
-

First, XGBoost introduces two new tree regularization hyperparameters, $\gamma$ and $\lambda$, which are baked directly into its objective function. Combining these with the additional column and row sampling functionality provides a variety of ways to reduce overfitting.

-

Second, the XGBoost formulation provides a much more elegant way to train models on custom objective functions. Recall that for custom objectives, the classic GBM finds tree structure by fitting a squared error decision tree to the gradients of the loss function and then sets each leaf’s predicted value by running a numerical optimization routine to find the optimal predicted value.

-

The XGBoost formulation improves on this two-stage approach by unifying the generation of tree structure and predicted values. Both the split scoring metric and the predicted values are directly computable from the instance gradient and hessian values, which are connected directly back to the overall training objective. This also removes the need for additional numerical optimizations, which contributes to speed, stability, and scalability.

-

Finally, speaking of scalability, XGBoost emerged at a time when industrial dataset size was exploding. Many use cases require scalable ML systems, and all use cases benefit from faster training and higher model development velocity.

-
-
-
-

Wrapping Up

-

Well, there you go, those are the salient ideas behind XGBoost, the gold standard in gradient boosting model implementations. Hopefully now we all understand the mathematical basis for the algorithm and appreciate the key improvements it makes over the classic GBM. If you want to go even deeper, you can join us for the next post where we’ll roll up our sleeves and implement XGBoost entirely from scratch.

-
-
-

References

-

The XGBoost paper

-
-
-

Exercise

-

Prove that the XGBoost Newton Descent generalizes the classic GBM gradient descent. Hint: show that XGBoost with a squared error objective and no regularization reduces to the classic GBM.
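One way to start (my sketch of the argument, not the post’s): for squared error loss we have $g_i = \hat{y}_i^{(t-1)} - y_i$ and $h_i = 1$, so with $\lambda = 0$ Equation (5) gives

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{|I_j|} = \frac{1}{|I_j|} \sum_{i \in I_j} \left(y_i - \hat{y}_i^{(t-1)}\right),$$

the mean residual in the leaf, which is exactly the terminal node value the classic GBM uses for squared error; with $\gamma = 0$ the gain criterion likewise reduces to the squared-error split score of a plain regression tree fit to the residuals.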

-
- - ]]> - gradient boosting - https://randomrealizations.com/posts/xgboost-explained/index.html - Sat, 12 Mar 2022 21:00:00 GMT - - - - Decision Tree from Scratch - Matt Bowers - https://randomrealizations.com/posts/decision-tree-from-scratch/index.html -

-

Yesterday we had a lovely discussion about the key strengths and weaknesses of decision trees and why tree ensembles are so great. But today, gentle reader, we abandon our philosophizing and get down to the business of implementing one of these decision trees from scratch.

-

A note before we get started. This is going to be the most involved scratch-build that we’ve done at Random Realizations so far. It is not the kind of algorithm that I could just sit down and write all at once. We need to start with a basic frame and then add functionality step by step, testing all along the way to make sure things are working properly. Since I’m writing this in a jupyter notebook, I’ll try to give you a sense for how I actually put the algorithm together interactively in pieces, eventually landing on a fully-functional final product.

-

Shall we?

-
-

Binary Tree Data Structure

-

A decision tree takes a dataset with features and a target, partitions the feature space into chunks, and assigns a prediction value to each chunk. Since each partitioning step divides one chunk in two, and since the partitioning is done recursively, it’s natural to use a binary tree data structure to represent a decision tree.

-

The basic idea of the binary tree is that we define a class to represent nodes in the tree. If we want to add children to a given node, we simply assign them as attributes of the parent node. The child nodes we add are themselves instances of the same class, so we can add children to them in the same way.

-

Let’s start out with a simple class for our decision tree. It takes a single value called max_depth as input, which will dictate how many layers of child nodes should be inserted below the root. This controls the depth of the tree. As long as max_depth is positive, the parent will instantiate two new instances of the binary tree node class, passing along max_depth decremented by one and attaching the two children to itself as attributes called left and right.

-
import math
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt


class DecisionTree():
    
    def __init__(self, max_depth):
        assert max_depth >= 0, 'max_depth must be nonnegative'
        self.max_depth = max_depth
        if max_depth > 0:
            self.left = DecisionTree(max_depth=max_depth-1)
            self.right = DecisionTree(max_depth=max_depth-1)

Let’s make a new instance of our decision tree class, a tree with depth 2.

-
t = DecisionTree(max_depth=2)

-
Binary tree structure diagram
-
-
-

We can access individual nodes and check their value of max_depth.

-
-
t.max_depth, t.left.max_depth, t.left.right.max_depth
-
-
(2, 1, 0)
-
-
-

Our full decision tree can expand on this idea where each node receives some input, modifies it, creates two child nodes, and passes the modified input along to them. Specifically, each node in our decision tree will receive a dataset, determine how best to split the dataset into two parts, create two child nodes, and pass one part of the data to the left child and the other part to the right child.

-

All we have to do now is add some additional functionality to our decision tree. First we’ll start by capturing all the inputs we need to grow a tree, which include the feature dataframe X, the target array y, max_depth to explicitly limit tree depth, min_samples_leaf to specify the minimum number of observations that are allowed in a leaf node, and an optional idxs which specifies the indices of data that the node should use. The indices argument is useful for users of our decision tree because it will allow them to implement row subsampling in ensemble methods like random forest. It will also be handy for internal use inside the decision tree when passing data along to child nodes; instead of passing copies of the two data subsets, we’ll just pass a reference to the full dataset and pass along a set of indices to identify that node’s instance subset.

-

Once we get our input, we’ll do a little bit of input validation and store things that we want to keep as object attributes. In case this is a leaf node, we’ll go ahead and compute its predicted value; since this is a regression tree, the prediction is just the mean of the target y. We’ll also go ahead and initialize a score metric which we’ll use to help us find the best split later; since lower scores are going to be better, we’ll initialize it to positive infinity. Finally, we’ll push the logic to add child nodes into a method called _maybe_insert_child_nodes that we’ll define next.

-
-
-
Note: a leading underscore in a method name indicates the method is for internal use and not part of the user-facing API of the class.

-
-
-
-
class DecisionTree():
    
    def __init__(self, X, y, min_samples_leaf=5, max_depth=6, idxs=None):
        assert max_depth >= 0, 'max_depth must be nonnegative'
        assert min_samples_leaf > 0, 'min_samples_leaf must be positive'
        self.min_samples_leaf, self.max_depth = min_samples_leaf, max_depth
        if isinstance(y, pd.Series): y = y.values
        if idxs is None: idxs = np.arange(len(y))
        self.X, self.y, self.idxs = X, y, idxs
        self.n, self.c = len(idxs), X.shape[1]
        self.value = np.mean(y[idxs]) # node's prediction value
        self.best_score_so_far = float('inf') # initial loss before split finding
        if self.max_depth > 0:
            self._maybe_insert_child_nodes()
            
    def _maybe_insert_child_nodes(self):
        pass

Split Finding

+

Split finding follows the exact same pattern that we used in the decision tree, except we keep track of gradient and hessian stats instead of target value stats, and of course we use the XGBoost gain criterion (equation 7 from the paper) for evaluating splits.

+
+
def _find_better_split(self, feature_idx):
    x = self.X.values[self.idxs, feature_idx]
    g, h = self.g[self.idxs], self.h[self.idxs]
    sort_idx = np.argsort(x)
    sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]
    sum_g, sum_h = g.sum(), h.sum()
    sum_g_right, sum_h_right = sum_g, sum_h
    sum_g_left, sum_h_left = 0., 0.

    for i in range(0, self.n - 1):
        g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]
        sum_g_left += g_i; sum_g_right -= g_i
        sum_h_left += h_i; sum_h_right -= h_i
        if sum_h_left < self.min_child_weight or x_i == x_i_next:
            continue
        if sum_h_right < self.min_child_weight: break

        gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))
                      + (sum_g_right**2 / (sum_h_right + self.reg_lambda))
                      - (sum_g**2 / (sum_h + self.reg_lambda))
                      ) - self.gamma/2 # Eq(7) in the xgboost paper
        if gain > self.best_score_so_far: 
            self.split_feature_idx = feature_idx
            self.best_score_so_far = gain
            self.threshold = (x_i + x_i_next) / 2
            
TreeBooster._find_better_split = _find_better_split

Now in order to test our class, we’ll need some actual data. We can use the same scikit-learn diabetes data from the last post.

-
-
from sklearn.datasets import load_diabetes

X, y = load_diabetes(as_frame=True, return_X_y=True)
t = DecisionTree(X, y, min_samples_leaf=5, max_depth=5)

So far, so good.

-
-
-

Inserting Child Nodes

-

Our node inserting function _maybe_insert_child_nodes needs to first find the best split; then if a valid split exists, it needs to insert the child nodes. To find the best valid split, we need to loop through the columns and search each one for the best valid split. Again we’ll push the logic of finding the best split into a function that we’ll define later. Next if no split was found, we need to bail by returning before trying to insert the child nodes. To check if this node is a leaf (i.e. it shouldn’t have child nodes), we define a property called is_leaf which will just check if the best score so far is still infinity, in which case no split was found and the node is a leaf.

-

If a valid split was found, then we need to insert the child nodes. We’ll assume that our split finding function assigned attributes called split_feature_idx and threshold to tell us the split feature’s index and the split threshold value. We then use these to compute the indices of the data to be passed to the child nodes; the left child gets instances where the split feature value is less than or equal to the threshold, and the right child node gets instances where the split feature value is greater than the threshold. Then we create two new decision trees, passing the corresponding data indices to each and assigning them to the left and right attributes of the current node.

-
-
def _maybe_insert_child_nodes(self):
    for j in range(self.c): 
        self._find_better_split(j)
    if self.is_leaf: # do not insert children
        return 
    x = self.X.values[self.idxs, self.split_feature_idx]
    left_idx = np.nonzero(x <= self.threshold)[0]
    right_idx = np.nonzero(x > self.threshold)[0]
    self.left = DecisionTree(self.X, self.y, self.min_samples_leaf, 
                             self.max_depth - 1, self.idxs[left_idx])
    self.right = DecisionTree(self.X, self.y, self.min_samples_leaf, 
                              self.max_depth - 1, self.idxs[right_idx])

def _find_better_split(self, feature_idx):
    pass

@property
def is_leaf(self): return self.best_score_so_far == float('inf')

Prediction

+

Prediction works exactly the same as in our decision tree, and the methods are nearly identical.

+
+
def predict(self, X):
    return np.array([self._predict_row(row) for i, row in X.iterrows()])

def _predict_row(self, row):
    if self.is_leaf: 
        return self.value
    child = self.left if row[self.split_feature_idx] <= self.threshold \
        else self.right
    return child._predict_row(row)

TreeBooster.predict = predict 
TreeBooster._predict_row = _predict_row
-

To test these new methods, we can assign them to our DecisionTree class and create a new class instance to make sure things are still working.

-
-
DecisionTree._maybe_insert_child_nodes = _maybe_insert_child_nodes
DecisionTree._find_better_split = _find_better_split
DecisionTree.is_leaf = is_leaf
t = DecisionTree(X, y, min_samples_leaf=5, max_depth=6)

Yep, we’re still looking good.

-
-

Split Finding

-

Now we need to fill in the functionality of the split finding method. The overall strategy is to consider every possible way to split on the current feature, measuring the quality of each potential split with some scoring mechanism, and keeping track of the best split we’ve seen so far. We’ll come back to the issue of how to try all the possible splits in a moment, but let’s start by figuring out how to score a particular potential split.

-

Like other machine learning models, trees are trained by attempting to minimize some loss function that measures how well the model predicts the target data. We’ll be training our regression tree to minimize squared error.

-

-

For a given node, we can replace with because each node uses the sample mean of its target instances as its prediction. We can then rewrite the loss for a given node as

-

-

We can then evaluate potential splits by comparing the loss after splitting to the loss before splitting, where the split with the greatest loss reduction is best. Let’s work out a simple expression for the loss reduction from a given split.

-

Let be the set of data instances in the current node, and let and be the instances that fall into the left and right child nodes of a proposed split. Let be the total loss for all instances in the node, while and are the losses for the left and right child nodes. The total loss contributed by instances in prior to any split is

-

-

And the loss after splitting into and is

-

-

The reduction in loss from this split is

$$\Delta L = L_I - (L_{I_L} + L_{I_R}) = \sum_{i \in I} (y_i - \bar{y}_I)^2 - \sum_{i \in I_L} (y_i - \bar{y}_{I_L})^2 - \sum_{i \in I_R} (y_i - \bar{y}_{I_R})^2$$

Since the $\sum_i y_i^2$ terms cancel, we can simplify.

$$\Delta L = \frac{\left(\sum_{i \in I_L} y_i\right)^2}{n_L} + \frac{\left(\sum_{i \in I_R} y_i\right)^2}{n_R} - \frac{\left(\sum_{i \in I} y_i\right)^2}{n}$$

This is a really nice formulation of the split scoring metric from a computational complexity perspective. We can sort the data by the feature values then, starting with the smallest min_samples_leaf instances in the left node and the rest in the right node, we check the score. Then to check the next split, we simply move a single target value from the right node into the left node, updating the score by subtracting it from the right node’s partial sum and adding it to the left node’s partial sum. The third term is constant for all splits, so we only need to compute it once. If any split’s score is lower than the best score so far, then we update the best score so far, the split feature, and the threshold value. When we’re done we can be sure we found the best possible split. The time bottleneck is the sort, which puts us at an average time complexity of .

-
-
def _find_better_split(self, feature_idx):
    x = self.X.values[self.idxs, feature_idx]
    y = self.y[self.idxs]
    sort_idx = np.argsort(x)
    sort_y, sort_x = y[sort_idx], x[sort_idx]
    sum_y, n = y.sum(), len(y)
    sum_y_right, n_right = sum_y, n
    sum_y_left, n_left = 0., 0

    for i in range(0, self.n - self.min_samples_leaf):
        y_i, x_i, x_i_next = sort_y[i], sort_x[i], sort_x[i + 1]
        sum_y_left += y_i; sum_y_right -= y_i
        n_left += 1; n_right -= 1
        if n_left < self.min_samples_leaf or x_i == x_i_next:
            continue
        score = - sum_y_left**2 / n_left - sum_y_right**2 / n_right + sum_y**2 / n
        if score < self.best_score_so_far:
            self.best_score_so_far = score
            self.split_feature_idx = feature_idx
            self.threshold = (x_i + x_i_next) / 2

The Complete XGBoost From Scratch Implementation

+

Here’s the entire implementation which produces a usable XGBoostModel class with fit and predict methods.

+
+
from collections import defaultdict

# the TreeBooster class defined above provides the tree boosters used here
class XGBoostModel():
    '''XGBoost from Scratch
    '''
    
    def __init__(self, params, random_seed=None):
        self.params = defaultdict(lambda: None, params)
        self.subsample = self.params['subsample'] \
            if self.params['subsample'] else 1.0
        self.learning_rate = self.params['learning_rate'] \
            if self.params['learning_rate'] else 0.3
        self.base_prediction = self.params['base_score'] \
            if self.params['base_score'] else 0.5
        self.max_depth = self.params['max_depth'] \
            if self.params['max_depth'] else 5
        self.rng = np.random.default_rng(seed=random_seed)
                
    def fit(self, X, y, objective, num_boost_round, verbose=False):
        current_predictions = self.base_prediction * np.ones(shape=y.shape)
        self.boosters = []
        for i in range(num_boost_round):
            gradients = objective.gradient(y, current_predictions)
            hessians = objective.hessian(y, current_predictions)
            sample_idxs = None if self.subsample == 1.0 \
                else self.rng.choice(len(y), 
                                     size=math.floor(self.subsample*len(y)), 
                                     replace=False)
            booster = TreeBooster(X, gradients, hessians, 
                                  self.params, self.max_depth, sample_idxs)
            current_predictions += self.learning_rate * booster.predict(X)
            self.boosters.append(booster)
            if verbose: 
                print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')
            
    def predict(self, X):
        return (self.base_prediction + self.learning_rate 
                * np.sum([booster.predict(X) for booster in self.boosters], axis=0))

Again, we assign the split finding method to our class and instantiate a new tree to make sure things are still working.

-
-
DecisionTree._find_better_split = _find_better_split
t = DecisionTree(X, y, min_samples_leaf=5, max_depth=6)
X.columns[t.split_feature_idx], t.threshold

('s5', -0.0037611760063045703)

Nice! Looks like the tree started with a split on the s5 feature.

-
-
-

Inspecting the Tree

-

While we’re developing something complex like a decision tree class, we need a good way to inspect the object to help with testing and debugging. Let’s write a quick string representation method to make it easier to check what’s going on with a particular node.

-
-
def __repr__(self):
    s = f'n: {self.n}'
    s += f'; value:{self.value:0.2f}'
    if not self.is_leaf:
        split_feature_name = self.X.columns[self.split_feature_idx]
        s += f'; split: {split_feature_name} <= {self.threshold:0.3f}'
    return s

We can assign the string representation method to the class and print a few nodes.

-
-
DecisionTree.__repr__ = __repr__
t = DecisionTree(X, y, min_samples_leaf=5, max_depth=2)
print(t)
print(t.left)
print(t.left.left)

n: 442; value:152.13; split: s5 <= -0.004
n: 218; value:109.99; split: bmi <= 0.006
n: 171; value:96.31

Prediction

-

We need a public predict method that takes a feature dataframe and returns an array of predictions. We’ll need to look up the predicted value for one instance at a time and stitch them together in an array. We can do that by iterating over the feature dataframe rows with a list comprehension that calls a _predict_row method to grab the prediction for each row. The row predict method needs to return the current node’s predicted value if it’s a leaf, or if not, it needs to identify the appropriate child node based on its split and ask it for a prediction.

-
-
def predict(self, X):
    return np.array([self._predict_row(row) for i, row in X.iterrows()])

def _predict_row(self, row):
    if self.is_leaf: 
        return self.value
    child = self.left if row[self.split_feature_idx] <= self.threshold \
        else self.right
    return child._predict_row(row)

Let’s assign the predict methods and make predictions on a few rows.

-
-
DecisionTree.predict = predict
DecisionTree._predict_row = _predict_row
t.predict(X.iloc[:3, :])

array([225.87962963,  96.30994152, 225.87962963])

The Complete Decision Tree Implementation

-

Here’s the implementation, all in one place.

-
-
class DecisionTree():
    
    def __init__(self, X, y, min_samples_leaf=5, max_depth=6, idxs=None):
        assert max_depth >= 0, 'max_depth must be nonnegative'
        assert min_samples_leaf > 0, 'min_samples_leaf must be positive'
        self.min_samples_leaf, self.max_depth = min_samples_leaf, max_depth
        if isinstance(y, pd.Series): y = y.values
        if idxs is None: idxs = np.arange(len(y))
        self.X, self.y, self.idxs = X, y, idxs
        self.n, self.c = len(idxs), X.shape[1]
        self.value = np.mean(y[idxs]) # node's prediction value
        self.best_score_so_far = float('inf') # initial loss before split finding
        if self.max_depth > 0:
            self._maybe_insert_child_nodes()
            
    def _maybe_insert_child_nodes(self):
        for j in range(self.c): 
            self._find_better_split(j)
        if self.is_leaf: # do not insert children
            return 
        x = self.X.values[self.idxs, self.split_feature_idx]
        left_idx = np.nonzero(x <= self.threshold)[0]
        right_idx = np.nonzero(x > self.threshold)[0]
        self.left = DecisionTree(self.X, self.y, self.min_samples_leaf, 
                                 self.max_depth - 1, self.idxs[left_idx])
        self.right = DecisionTree(self.X, self.y, self.min_samples_leaf, 
                                  self.max_depth - 1, self.idxs[right_idx])
    
    @property
    def is_leaf(self): return self.best_score_so_far == float('inf')
    
    def _find_better_split(self, feature_idx):
        x = self.X.values[self.idxs, feature_idx]
        y = self.y[self.idxs]
        sort_idx = np.argsort(x)
        sort_y, sort_x = y[sort_idx], x[sort_idx]
        sum_y, n = y.sum(), len(y)
        sum_y_right, n_right = sum_y, n
        sum_y_left, n_left = 0., 0
    
        for i in range(0, self.n - self.min_samples_leaf):
            y_i, x_i, x_i_next = sort_y[i], sort_x[i], sort_x[i + 1]
            sum_y_left += y_i; sum_y_right -= y_i
            n_left += 1; n_right -= 1
            if n_left < self.min_samples_leaf or x_i == x_i_next:
                continue
            score = - sum_y_left**2 / n_left - sum_y_right**2 / n_right + sum_y**2 / n
            if score < self.best_score_so_far:
                self.best_score_so_far = score
                self.split_feature_idx = feature_idx
                self.threshold = (x_i + x_i_next) / 2
                
    def __repr__(self):
        s = f'n: {self.n}'
        s += f'; value:{self.value:0.2f}'
        if not self.is_leaf:
            split_feature_name = self.X.columns[self.split_feature_idx]
            s += f'; split: {split_feature_name} <= {self.threshold:0.3f}'
        return s
    
    def predict(self, X):
        return np.array([self._predict_row(row) for i, row in X.iterrows()])
    
    def _predict_row(self, row):
        if self.is_leaf: 
            return self.value
        child = self.left if row[self.split_feature_idx] <= self.threshold \
            else self.right
        return child._predict_row(row)

From Scratch versus Scikit-Learn

-

As usual, we’ll test our homegrown handiwork by comparing it to the existing implementation in scikit-learn. First let’s train both models on the California Housing dataset which gives us 20k instances and 8 features to predict median house price by district.

-
-

+

Testing

+

Let’s take this baby for a spin and benchmark its performance against the actual XGBoost library. We use the scikit learn California housing dataset for benchmarking.

+
+
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=43)

Let’s start with a nice friendly squared error objective function for training. We should probably have a future post all about how to define custom objective functions in XGBoost, but for now, here’s how I define squared error.

+
+
class SquaredErrorObjective():
+    def loss(self, y, pred): return np.mean((y - pred)**2)
+    def gradient(self, y, pred): return pred - y
+    def hessian(self, y, pred): return np.ones(len(y))
+
+

Here I use a more or less arbitrary set of hyperparameters for training. Feel free to play around with tuning and trying other parameter combinations yourself.

+
+
import xgboost as xgb

params = {
    'learning_rate': 0.1,
    'max_depth': 5,
    'subsample': 0.8,
    'reg_lambda': 1.5,
    'gamma': 0.0,
    'min_child_weight': 25,
    'base_score': 0.0,
    'tree_method': 'exact',
}
num_boost_round = 50

# train the from-scratch XGBoost model
model_scratch = XGBoostModel(params, random_seed=42)
model_scratch.fit(X_train, y_train, SquaredErrorObjective(), num_boost_round)

# train the library XGBoost model
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
model_xgb = xgb.train(params, dtrain, num_boost_round)

Let’s check the models’ performance on the held out test data to benchmark our implementation.

+
+
pred_scratch = model_scratch.predict(X_test)
pred_xgb = model_xgb.predict(dtest)
print(f'scratch score: {SquaredErrorObjective().loss(y_test, pred_scratch)}')
print(f'xgboost score: {SquaredErrorObjective().loss(y_test, pred_xgb)}')

scratch score: 0.2434125759558149
xgboost score: 0.24123239765807963

Well, look at that! Our scratch-built XGBoost is looking pretty consistent with the library. Go us!

Wrapping Up

-

Holy cow, we just implemented a decision tree using nothing but numpy. I hope you enjoyed the scratch build as much as I did, and I hope you got a little bit better at coding (I certainly did). That was actually way harder than I expected, but looking back at the finished product, it doesn’t seem so bad right? I almost thought we were going to get away with not implementing our own decision tree, but it turns out that this will be super helpful for us when it comes time to implement XGBoost from scratch.

+

I’d say this is a pretty good milestone for us here at Random Realizations. We’ve been hammering away at the various concepts around gradient boosting, leaving a trail of equations and scratch-built algos in our wake. Today we put all of that together to create a legit scratch build of XGBoost, something that would have been out of reach for me before we embarked on this journey together over a year ago. To anyone with the patience to read through this stuff, cheers to you! I hope you’re learning and enjoying this as much as I am.

-
-

References

-

This implementation is inspired and partially adapted from Jeremy Howard’s live coding of a Random Forest as part of the fastai ML course.

+
+

Reader Exercises

+

If you want to take this a step further and deepen your understanding and coding abilities, let me recommend some exercises for you.

+
    +
  1. Implement column subsampling. XGBoost itself provides column subsampling by tree, by level, and by node. Try implementing by tree first, then try adding by level or by node as well. These should be pretty straightforward to do. (See the sketch below for one way to get started.)
  2. Implement sparsity aware split finding for missing feature values (Algorithm 2 in the XGBoost paper). This will be a little more involved, since you'll need to refactor and modify several parts of the tree booster class.
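For exercise 1, one possible starting point (my own sketch, not part of the original class) is a helper that picks the subset of columns a single node may split on; inside _maybe_insert_child_nodes you could then loop over these indices instead of range(self.c).

import numpy as np

def sample_feature_idxs(n_features, colsample_bynode, rng=None):
    # choose the random subset of columns that a single node may split on
    rng = rng or np.random.default_rng()
    n_sampled = max(1, round(colsample_bynode * n_features))
    return rng.choice(n_features, size=n_sampled, replace=False)

# e.g. inside _maybe_insert_child_nodes:
#     for i in sample_feature_idxs(self.c, self.colsample_bynode): self._find_better_split(i)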
]]>
python
gradient boosting
from scratch
https://randomrealizations.com/posts/xgboost-from-scratch/index.html
Fri, 06 May 2022 22:00:00 GMT
+

+
Tree branches on a chilly day in Johnson City
+
+
+

Ahh, XGBoost, what an absolutely stellar implementation of gradient boosting. Once Tianqi Chen and Carlos Guestrin of the University of Washington published the XGBoost paper and shared the open source code in the mid 2010’s, the algorithm quickly gained adoption in the ML community, appearing in over half of winning Kagle submissions in 2015. Nowadays it’s certainly among the most popular gradient boosting libraries, along with LightGBM and CatBoost, although the highly scientific indicator of GitHub stars per year indicates that it is in fact the most beloved gradient boosting package of all. Since it was the first of the modern popular boosting frameworks, and since benchmarking indicates that no other boosting algorithm outperforms it, we can comfortably focus our attention on understanding XGBoost.

+

The XGBoost authors identify two key aspects of a machine learning system: (1) a flexible statistical model and (2) a scalable learning system to fit that model using data. XGBoost improves on both of these aspects, providing a more flexible and feature-rich statistical model and building a truly scalable system to fit it. In this post we’re going to focus on the statistical modeling innovations, outlining the key differences from the classic gradient boosting machine and divinginto the mathematical derivation of the XGBoost learning algorithm. If you’re not already familiar with gradient boosting, go back and read the earlier posts in the series before jumping in here.

+

Buckle up, dear reader. Today we understand how XGBoost works, no hand waving required.

+
+

XGBoost is a Gradient Boosting Machine

+

At a high level, XGBoost is an iteratively constructed composite model, just like the classic gradient boosting machine we discussed back in the GBM post . The final model takes the form

+

+

where is the base prediction, is the learning rate hyperparameter that helps control overfitting by reducing the contributions of each booster, and each of the boosters is a decision tree. To help us connect the dots between theory and code, whenever we encounter new hyperparameters, I’ll point out their names from the XGBoost Parameter Documentation. So, can be set by base_score, and is set by either eta or learning_rate.

+

XGBoost introduces two key statistical learning improvements over the classic gradient boosting model. First, it reimagines the gradient descent algorithm used for training, and second it uses a custom-built decision tree with extra functionality as its booster. We’ll dive into each of these key innovations in the following sections.

+
+
+

Descent Algorithm Innovations

+
+

Regularized Objective Function

+

In the post on GBM with any loss function, we looked at loss functions of the form which compute some distance between targets and predictions and sum them up over the training dataset. XGBoost introduces regularization into the objective function so that the objective takes the form

+

+

where is some twice-differentiable loss function. is a regularization that penalizes the complexity of each tree booster, taking the form

+

+

where is the number of leaf nodes and is the squared sum of the leaf prediction values. This introduces two new hyperparameters: which penalizes the number of leaf nodes and which is the L-2 regularization parameter for leaf predicted values. These are set by gamma and reg_lambbda in the XGBoost parametrization. Together, these provide powerful new controls to reduce overfitting due to overly complex tree boosters. Note that reduces the objective back to an unregularized loss function as used in the classic GBM.

+
+
+

An Aside on Newton’s Method

+

As we’ll see soon, XGBoost uses Newton’s Method to minimize its objective function, so let’s start with a quick refresher.

+

Newton’s method is an iterative procedure for minimizing a function . At each step we have some input , and our goal is to find a nudge value such that

+

+

To find a good nudge value , we generate a local quadratic approximation of the function in the neighborhood of the input , and then we find the input value that would bring us to the minimum of the quadratic approximation.

+
+
+

+
Schematic of Newton’s method
+
+
+

The figure shows a single Newton step where we start at , find the local quadratic approximation, and then jump a distance along the -axis to land at the minimum of the quadratic. If we iterate in this way, we are likely to land close to the minimum of .

+

So how do we compute the quadratic approximation? We use the second order Taylor series expansion of near the point .

+

+

To find the nudge value that minimizes the quadratic approximation, we can take the derivative with respect to , set it to zero, and solve for .

+

+

+

And as long as (i.e., the parabola is pointing up), .

+
+
+

Tree Boosting with Newton’s Method

+

This lands us at the heart of XGBoost, which uses Newton’s method, rather than gradient descent, to guide each round of boosting. This explanation will correspond very closely to section 2.2 of the XGBoost paper, but here I’ll explicitly spell out some of the intermediate steps which are omitted from their derivation, and you’ll get some additional commentary from me along the way.

+
+

Newton Descent in Tree Space

+

Suppose we’ve done boosting rounds, and we want to add the -th booster to our composite model. Our current model’s prediction for instance is . If we add a new tree booster to our model, the objective function would give

+

+

We need to choose so that it decreases the loss, i.e. we want

+

+

Does that sound familiar? In the previous section we used Newton’s method to find a value of that would make . Let’s try the same thing with our loss function. To be explicit, the parallels are: , , and .

+

Let’s start by finding the second order Taylor series approximation for the loss around the point .

+

+

where

+

+

and

+

+

are the first and second order partial derivatives of the loss with respect to the current predictions. The XGBoost paper calls these the gradients and hessians, respectively. Remember that when we specify an actual loss function to use, we would also specify the functional form of the gradients and hessians, so that they are directly computable.

+

Now we can go back and substitute our quadratic approximation in for the loss function to get an approximation of the objective function in the neighborhood of ..

+

+

Since is constant regardless of our choice of , we can drop it and instead work with the modified objective, which gives us Equation (3) from the paper.

+

+

Now the authors are about to do something great. They’re about to show how to directly compute the optimal prediction values for the leaf nodes of . We’ll circle back in a moment about how we find a good structure for , i.e. good node splits, but we’re going to find the optimal predicted values for any tree structure having terminal nodes. Let denote the set of instances that are in the -th leaf node of . Then we can rewrite the objective.

+

+

We notice that for all instances in , the tree yields the same predicted value . Substituting in for the predicted values and expanding we get

+

+

Rearranging terms we obtain Equation (4).

+

+

For each leaf node , our modified objective function is quadratic in . To find the optimal predicted values we take the derivative, set to zero, and solve for .

+

+

This yields Equation (5).

+

+
+
+

Split Finding

+

Now that we know how to find the optimal predicted value for any leaf node, we need to identify a criterion for finding a good tree structure, which boils down to finding the best split for a given node. Back in the [decision tree from scratch](/decision-tree-from-scratch post, we derived a split evaluation metric based on the reduction in the objective function associated with a particular split.
+To do that, first we need a way to compute the objective function given a particular tree structure. Substituting the optimal predicted values into the objective function, we get Equation (6).

+

+

We can then evaluate potential splits by comparing the objective before making a split to the objective after making a split, where the split with the maximum reduction in objective (a.k.a. gain) is best.

+

More formally, let be the set of data instances in the current node, and let and be the instances that fall into the left and right child nodes of a proposed split. Let be the total loss for all instances in the node, while and are the losses for the left and right child nodes. The total loss contributed by instances in node prior to any split is

+

+

And the loss after splitting into and is

+

+

The gain from this split is then

+

+

which is Equation (7) from the paper. In practice it makes sense to accept a split only if the gain is positive, thus the parameter sets the minimum gain required to make a further split. This is why can be set with the parameter gamma or the more descriptivemin_loss_split.

+
+
+
+
+

Tree Booster Innovations

+
+

Missing Values and Sparsity-Aware Split Finding

+

The XGBoost paper also introduces a modified algorithm for tree split finding which explicitly handles missing feature values. Recall that in order to find the best threshold value for a given feature, we can simply try all possible threshold values, recording the score for each. If some feature values are missing, the XGBoost split finding algorithm simply scores each threshold twice: once with missing value instances in the left node and once with them in the right node. The best split will then specify both the threshold value and to which node instances with missing values should be assigned. The paper calls this the sparsity aware split finding routine, which is defined as Algorithm 2.

+
+
+

Preventing Further Splitting

+

In addition to min_split_loss discussed above, XGBoost offers another parameter for limiting further tree splitting called min_child_weight. This name is a little confusing to me because the word "weight" has various meanings. In the context of this parameter, "weight" refers to the sum of the hessians over instances in the node. For squared error loss the hessian is 1 for every instance, so the sum is just the number of samples in the node. Thus this parameter generalizes the notion of the minimum number of samples allowed in a terminal node.

+
+
+

Sampling

+

XGBoost takes a cue from Random Forest and introduces both column and row subsampling. These sampling methods can prevent overfitting and reduce training time by limiting the amount of data to be processed during boosting.

+

Like random forest, XGBoost implements column subsampling, which limits tree split finding to randomly selected subsets of features. XGBoost provides column sampling for each tree, for each depth level within a tree, and for each split point within a tree, controlled by colsample_bytree, colsample_bylevel, and colsample_bynode respectively.

+

One interesting distinction is that XGBoost implements row sampling without replacement using subsample, whereas random forest uses bootstrapping. The choice to bootstrap rows in RF probably stemmed from a desire to use as much data as possible while training on the smaller datasets of the 1990s when RF was developed. With larger datasets and the ability to generate a large number of trees, XGBoost simply takes a subsample of rows for each tree.

+
+
+
+

Scalability

+

Even though we’re focused on statistical learning, I figured I’d comment on why XGBoost is highly scalable. Basically it boils down to efficient, parallelizable, and distributable methods for growing trees. You’ll notice there is a tree_method parameter which allows you to choose between the greedy exact algorithm (like the one we discussed in the decision tree from scratch post) and the approximate algorithm, which offers various scalability-related functionality, notably including the ability to consider only a small number of candidate split points instead of trying all possible splits. The algorithm also uses clever tricks like pre-sorting data for split finding and caching frequently needed values.
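For example, switching to the histogram-based approximate algorithm is just a parameter change. Here's a tiny runnable sketch on made-up data; tree_method and max_bin are real XGBoost parameters, but the data and values here are arbitrary placeholders.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((10_000, 5))
y = X @ np.arange(1, 6) + rng.normal(scale=0.1, size=len(X))
dtrain = xgb.DMatrix(X, label=y)

params = {
    'tree_method': 'hist',   # histogram-based approximate split finding
    'max_bin': 256,          # number of candidate thresholds per feature
    'objective': 'reg:squarederror',
}
booster = xgb.train(params, dtrain, num_boost_round=20)
```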

+
+

Why XGBoost is so Successful

+

As I mentioned in the intro, XGBoost is simply a very good implementation of the gradient boosting tree model. Therefore it inherits all the benefits of decision trees and tree ensembles, while making even further improvements over the classic gradient boosting machine. These improvements boil down to

+
    +
  1. more ways to control overfitting
  2. +
  3. elegant handling of custom objectives
  4. +
  5. scalability
  6. +
+

First, XGBoost introduces two new tree regularization hyperparameters, gamma and lambda, which are baked directly into its objective function. Combining these with the additional column and row sampling functionality provides a variety of ways to reduce overfitting.

+

Second, the XGBoost formulation provides a much more elegant way to train models on custom objective functions. Recall that for custom objectives, the classic GBM finds tree structure by fitting a squared error decision tree to the gradients of the loss function and then sets each leaf’s predicted value by running a numerical optimization routine to find the optimal predicted value.

+

The XGBoost formulation improves on this two-stage approach by unifying the generation of tree structure and predicted values. Both the split scoring metric and the predicted values are directly computable from the instance gradient and hessian values, which are connected directly back to the overall training objective. This also removes the need for additional numerical optimizations, which contributes to speed, stability, and scalability.

+

Finally, speaking of scalability, XGBoost emerged at a time when industrial dataset size was exploding. Many use cases require scalable ML systems, and all use cases benefit from faster training and higher model development velocity.

+
+
+
+

Wrapping Up

+

Well, there you go, those are the salient ideas behind XGBoost, the gold standard in gradient boosting model implementations. Hopefully now we all understand the mathematical basis for the algorithm and appreciate the key improvements it makes over the classic GBM. If you want to go even deeper, you can join us for the next post where we’ll roll up our sleeves and implement XGBoost entirely from scratch.

+
+
+

References

+

The XGBoost paper

+
+
+

Exercise

+

Prove that XGBoost’s Newton descent generalizes the classic GBM’s gradient descent. Hint: show that XGBoost with a squared error objective and no regularization reduces to the classic GBM.

+
+ + ]]> + gradient boosting + https://randomrealizations.com/posts/xgboost-explained/index.html + Sat, 12 Mar 2022 22:00:00 GMT + diff --git a/gradient-boosting-series.html b/gradient-boosting-series.html index 869b94d..09c5118 100644 --- a/gradient-boosting-series.html +++ b/gradient-boosting-series.html @@ -168,7 +168,7 @@
Subscribe
- + @@ -204,7 +204,7 @@

Gradient Boosting

I recommend reading through the series in order, since concepts tend to build on earlier ideas.

-
+
@@ -242,7 +242,7 @@

-
+
@@ -274,7 +274,7 @@

-
+
@@ -306,7 +306,7 @@

-
+
@@ -344,7 +344,7 @@

-
+
@@ -376,7 +376,7 @@

-
+
@@ -414,7 +414,7 @@

-
+
diff --git a/index.html b/index.html index 8152445..48e4f4d 100644 --- a/index.html +++ b/index.html @@ -168,7 +168,7 @@
Subscribe
- + @@ -179,7 +179,7 @@
Subscribe
-
Categories
All (13)
PySpark (1)
blogging (3)
from scratch (4)
gradient boosting (8)
pandas (1)
python (7)
tutorial (3)
+
Categories
All (14)
PySpark (1)
blogging (3)
from scratch (4)
gradient boosting (9)
pandas (1)
python (8)
tutorial (4)
xgboost (1)
@@ -202,7 +202,48 @@

Home

-
+
+
+

branches reach into the Kigali sky

+
+ + +
+
@@ -240,7 +281,7 @@

-
+
@@ -272,7 +313,7 @@

-
+
@@ -310,7 +351,7 @@

-
+
@@ -342,7 +383,7 @@

-
+
@@ -380,7 +421,7 @@

-
+
@@ -412,7 +453,7 @@

-
+
@@ -450,7 +491,7 @@

-
+
@@ -488,7 +529,7 @@

-
+
@@ -520,7 +561,7 @@

-
+
@@ -552,7 +593,7 @@

-
+
@@ -590,7 +631,7 @@

-
+
@@ -628,7 +669,7 @@

-
+
diff --git a/listings.json b/listings.json index 491f066..579de06 100644 --- a/listings.json +++ b/listings.json @@ -14,6 +14,7 @@ { "listing": "/index.html", "items": [ + "/posts/xgboost-for-regression-in-python/index.html", "/posts/blogging-with-quarto-and-jupyter/index.html", "/posts/random-realizations-resurrected/index.html", "/posts/xgboost-from-scratch/index.html", @@ -32,6 +33,7 @@ { "listing": "/archive.html", "items": [ + "/posts/xgboost-for-regression-in-python/index.html", "/posts/blogging-with-quarto-and-jupyter/index.html", "/posts/random-realizations-resurrected/index.html", "/posts/xgboost-from-scratch/index.html", diff --git a/posts/8020-pandas-tutorial/index.html b/posts/8020-pandas-tutorial/index.html index 5acb914..345c330 100644 --- a/posts/8020-pandas-tutorial/index.html +++ b/posts/8020-pandas-tutorial/index.html @@ -175,7 +175,7 @@
Subscribe
- + diff --git a/posts/blogging-with-quarto-and-jupyter/index.html b/posts/blogging-with-quarto-and-jupyter/index.html index e906a58..6539bd4 100644 --- a/posts/blogging-with-quarto-and-jupyter/index.html +++ b/posts/blogging-with-quarto-and-jupyter/index.html @@ -172,7 +172,7 @@
Subscribe
- + diff --git a/posts/consider-the-decision-tree/index.html b/posts/consider-the-decision-tree/index.html index 37a6ba6..7e618d2 100644 --- a/posts/consider-the-decision-tree/index.html +++ b/posts/consider-the-decision-tree/index.html @@ -172,7 +172,7 @@
Subscribe
- + diff --git a/posts/decision-tree-from-scratch/index.html b/posts/decision-tree-from-scratch/index.html index 0caa852..33f54f6 100644 --- a/posts/decision-tree-from-scratch/index.html +++ b/posts/decision-tree-from-scratch/index.html @@ -177,7 +177,7 @@
Subscribe
- + diff --git a/posts/drafts/conda-cheat-sheet/index.html b/posts/drafts/conda-cheat-sheet/index.html index 06de5d4..0b8f493 100644 --- a/posts/drafts/conda-cheat-sheet/index.html +++ b/posts/drafts/conda-cheat-sheet/index.html @@ -170,7 +170,7 @@
Subscribe
- + diff --git a/posts/get-down-with-gradient-descent/index.html b/posts/get-down-with-gradient-descent/index.html index 7aee0d4..e358bc4 100644 --- a/posts/get-down-with-gradient-descent/index.html +++ b/posts/get-down-with-gradient-descent/index.html @@ -174,7 +174,7 @@
Subscribe
- + diff --git a/posts/gradient-boosting-machine-from-scratch/index.html b/posts/gradient-boosting-machine-from-scratch/index.html index 1d3d164..5cd7431 100644 --- a/posts/gradient-boosting-machine-from-scratch/index.html +++ b/posts/gradient-boosting-machine-from-scratch/index.html @@ -174,7 +174,7 @@
Subscribe
- + diff --git a/posts/gradient-boosting-machine-with-any-loss-function/index.html b/posts/gradient-boosting-machine-with-any-loss-function/index.html index 1cec980..b53bdd4 100644 --- a/posts/gradient-boosting-machine-with-any-loss-function/index.html +++ b/posts/gradient-boosting-machine-with-any-loss-function/index.html @@ -174,7 +174,7 @@
Subscribe
- + diff --git a/posts/hello-pyspark/index.html b/posts/hello-pyspark/index.html index d8a6ed5..e0f0b48 100644 --- a/posts/hello-pyspark/index.html +++ b/posts/hello-pyspark/index.html @@ -172,7 +172,7 @@
Subscribe
- + diff --git a/posts/hello-world/index.html b/posts/hello-world/index.html index adea3b8..a4decda 100644 --- a/posts/hello-world/index.html +++ b/posts/hello-world/index.html @@ -138,7 +138,7 @@
Subscribe
- + diff --git a/posts/how-gradient-boosting-does-gradient-descent/index.html b/posts/how-gradient-boosting-does-gradient-descent/index.html index 2b464d2..d73805a 100644 --- a/posts/how-gradient-boosting-does-gradient-descent/index.html +++ b/posts/how-gradient-boosting-does-gradient-descent/index.html @@ -140,7 +140,7 @@
Subscribe
- + diff --git a/posts/random-realizations-resurrected/index.html b/posts/random-realizations-resurrected/index.html index 522dbe6..95e22de 100644 --- a/posts/random-realizations-resurrected/index.html +++ b/posts/random-realizations-resurrected/index.html @@ -138,7 +138,7 @@
Subscribe
- + diff --git a/posts/xgboost-explained/index.html b/posts/xgboost-explained/index.html index c7160ae..3053236 100644 --- a/posts/xgboost-explained/index.html +++ b/posts/xgboost-explained/index.html @@ -140,7 +140,7 @@
Subscribe
- + diff --git a/posts/xgboost-for-regression-in-python/index.html b/posts/xgboost-for-regression-in-python/index.html new file mode 100644 index 0000000..32df79c --- /dev/null +++ b/posts/xgboost-for-regression-in-python/index.html @@ -0,0 +1,1446 @@ + + + + + + + + + + + + +Random Realizations – XGBoost for Regression in Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

XGBoost for Regression in Python

+
+
python
+
tutorial
+
gradient boosting
+
xgboost
+
+
+ +
+
A step-by-step tutorial on regression with XGBoost in python using sklearn and the xgboost library
+
+ + +
+ +
+
Author
+
+

Matt Bowers

+
+
+ +
+
Published
+
+

October 25, 2023

+
+
+ + +
+ + +
+ +

In this post I’m going to show you my process for solving regression problems with XGBoost in python, using either the native xgboost API or the scikit-learn interface. This is a powerful methodology that can produce world class results in a short time with minimal thought or effort. While we’ll be working on an old Kagle competition for predicting the sale prices of bulldozers and other heavy machinery, you can use this flow to solve whatever tabular data regression problem you’re working on.

+

This post serves as the explanation and documentation for the XGBoost regression jupyter notebook from my ds-templates repo on GitHub, so go ahead and download the notebook and follow along with your own data.

+

If you’re not already comfortable with the ideas behind gradient boosting and XGBoost, you’ll find it helpful to read some of my previous posts to get up to speed. I’d start with this introduction to gradient boosting, and then read this explanation of how XGBoost works.

+

Let’s get into it! 🚀

+
+

Install and import the xgboost library

+

If you don’t already have it, go ahead and use conda to install the xgboost library, e.g.

+
$ conda install -c conda-forge xgboost
+

Then import it along with the usual suspects.

+
+
import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import xgboost as xgb
+
+
+
+

Read dataset into python

+

In this example we’ll work on the Kaggle Bluebook for Bulldozers competition, which asks us to build a regression model to predict the sale price of heavy equipment. Amazingly, you can solve your own regression problem by swapping this data out with your organization’s data before proceeding with the tutorial.

+

Go ahead and download the Train.zip file from Kaggle and extract it into Train.csv. Then read the data into a pandas dataframe.

+
+
df = pd.read_csv('Train.csv', parse_dates=['saledate']);
+
+

Notice I cheated a little bit, checking the columns ahead of time and telling pandas to treat the saledate column as a date. In general it will make life easier to read in any date-like columns as dates.

+
+
df.info()
+
+
<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 401125 entries, 0 to 401124
+Data columns (total 53 columns):
+ #   Column                    Non-Null Count   Dtype         
+---  ------                    --------------   -----         
+ 0   SalesID                   401125 non-null  int64         
+ 1   SalePrice                 401125 non-null  int64         
+ 2   MachineID                 401125 non-null  int64         
+ 3   ModelID                   401125 non-null  int64         
+ 4   datasource                401125 non-null  int64         
+ 5   auctioneerID              380989 non-null  float64       
+ 6   YearMade                  401125 non-null  int64         
+ 7   MachineHoursCurrentMeter  142765 non-null  float64       
+ 8   UsageBand                 69639 non-null   object        
+ 9   saledate                  401125 non-null  datetime64[ns]
+ 10  fiModelDesc               401125 non-null  object        
+ 11  fiBaseModel               401125 non-null  object        
+ 12  fiSecondaryDesc           263934 non-null  object        
+ 13  fiModelSeries             56908 non-null   object        
+ 14  fiModelDescriptor         71919 non-null   object        
+ 15  ProductSize               190350 non-null  object        
+ 16  fiProductClassDesc        401125 non-null  object        
+ 17  state                     401125 non-null  object        
+ 18  ProductGroup              401125 non-null  object        
+ 19  ProductGroupDesc          401125 non-null  object        
+ 20  Drive_System              104361 non-null  object        
+ 21  Enclosure                 400800 non-null  object        
+ 22  Forks                     192077 non-null  object        
+ 23  Pad_Type                  79134 non-null   object        
+ 24  Ride_Control              148606 non-null  object        
+ 25  Stick                     79134 non-null   object        
+ 26  Transmission              183230 non-null  object        
+ 27  Turbocharged              79134 non-null   object        
+ 28  Blade_Extension           25219 non-null   object        
+ 29  Blade_Width               25219 non-null   object        
+ 30  Enclosure_Type            25219 non-null   object        
+ 31  Engine_Horsepower         25219 non-null   object        
+ 32  Hydraulics                320570 non-null  object        
+ 33  Pushblock                 25219 non-null   object        
+ 34  Ripper                    104137 non-null  object        
+ 35  Scarifier                 25230 non-null   object        
+ 36  Tip_Control               25219 non-null   object        
+ 37  Tire_Size                 94718 non-null   object        
+ 38  Coupler                   213952 non-null  object        
+ 39  Coupler_System            43458 non-null   object        
+ 40  Grouser_Tracks            43362 non-null   object        
+ 41  Hydraulics_Flow           43362 non-null   object        
+ 42  Track_Type                99153 non-null   object        
+ 43  Undercarriage_Pad_Width   99872 non-null   object        
+ 44  Stick_Length              99218 non-null   object        
+ 45  Thumb                     99288 non-null   object        
+ 46  Pattern_Changer           99218 non-null   object        
+ 47  Grouser_Type              99153 non-null   object        
+ 48  Backhoe_Mounting          78672 non-null   object        
+ 49  Blade_Type                79833 non-null   object        
+ 50  Travel_Controls           79834 non-null   object        
+ 51  Differential_Type         69411 non-null   object        
+ 52  Steering_Controls         69369 non-null   object        
+dtypes: datetime64[ns](1), float64(2), int64(6), object(44)
+memory usage: 162.2+ MB
+
+
+
+
+

Prepare raw data for XGBoost

+

When faced with a new tabular dataset for modeling, we have two format considerations: data types and missingness. From the call to df.info() above, we can see we have both mixed types and missing values.

+

When it comes to missing values, some models like the gradient booster or random forest in scikit-learn require purely non-missing inputs. One of the great strengths of XGBoost is that it relaxes this requirement, allowing us to pass in missing feature values, so we don’t have to worry about them.
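As a quick illustration of that point (a toy example, not part of the bulldozer workflow), XGBoost will happily fit rows containing NaN without any imputation:

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.0],
              [4.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Missing entries are routed down a learned default direction at each split.
model = xgb.XGBRegressor(n_estimators=5).fit(X, y)
```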

+

Regarding data types, all ML models for tabular data require inputs to be numeric, either integers or floats, so we’re going to have to deal with those object columns.

+
+

Encode string features

+

The simplest way to encode string variables is to map each unique string value to an integer; this is called integer encoding.

+

We have a couple of options for how to implement this transformation: pandas categoricals or the scikit-learn label encoder. We can use the categorical type in pandas to generate mappings from string values to integers for each string feature. The category type is a bit like the factor type in R. Pandas stores the underlying data as integers, and it also keeps a mapping from the integers to the string values. XGBoost will be able to access the integers for model fitting. This is nice because we can still access the actual categories which can be helpful when we start taking a closer look at the data. If you prefer, you can also use the scikit-learn label encoder to replace the string columns with their integer-mapped counterparts.

+
+
def encode_string_features(df, use_cats=True):
+    out_df = df.copy()
+    for feature, feature_type in df.dtypes.items():
+        if feature_type == 'object':
+            if use_cats:
+                out_df[feature] = out_df[feature].astype('category')
+            else:
+                from sklearn.preprocessing import LabelEncoder
+                out_df[feature] = LabelEncoder() \
+                    .fit_transform(out_df[feature].astype('str'))
+    return out_df
+
+df = encode_string_features(df, use_cats=False)
+
+
+
+

Encode date and timestamp features

+

While dates feel sort of numeric, they are not numbers, so we need to transform them into numeric columns. Unfortunately, encoding timestamps isn’t as straightforward as encoding strings, so we actually might need to engage in a little bit of feature engineering. A single date has many different attributes, e.g. days since epoch, year, quarter, month, day, day of year, day of week, is holiday, etc. As a starting point, we can just add a few of these attributes as features. Once a feature is represented as a date or timestamp data type, you can access various attributes via the dt attribute.

+
+
def encode_datetime_features(df, datetime_features, datetime_attributes):
+    out_df = df.copy()
+    for datetime_feature in datetime_features:
+        for datetime_attribute in datetime_attributes:
+            if datetime_attribute == 'days_since_epoch':
+                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
+                    (out_df[datetime_feature] 
+                     - pd.Timestamp(year=1970, month=1, day=1)).dt.days
+            else:
+                out_df[f'{datetime_feature}_{datetime_attribute}'] = \
+                    getattr(out_df[datetime_feature].dt, datetime_attribute)
+    return out_df
+
+datetime_features = [
+    'saledate',
+]
+datetime_attributes = [
+    'year',
+    'month',
+    'day',
+    'quarter',
+    'day_of_year',
+    'day_of_week',
+    'days_since_epoch',
+]
+
+df = encode_datetime_features(df, datetime_features, datetime_attributes)
+
+
+
+

Transform the target if necessary

+

In the interest of speed and efficiency, we didn’t bother doing any EDA with the feature data. Part of my justification for this is that trees are incredibly robust to outliers, collinearity, missingness, and other assorted nonsense in the feature data. However, they are not necessarily robust to nonsense in the target variable, so it’s worth having a look at it before proceeding any further.

+
+
df.SalePrice.hist(); plt.xlabel('SalePrice');
+
+

histogram of sale price showing right-skewed data

+
+
+

Often when predicting prices it makes sense to use log price, especially when they span multiple orders of magnitude or have a strong right skew. These data look pretty friendly, lacking outliers and exhibiting only a mild positive skew; we could probably get away without doing any transformation. But checking the evaluation metric used to score the Kaggle competition, we see they’re using root mean squared log error (RMSLE). That’s equivalent to using RMSE on log-transformed target data, so let’s go ahead and work with log prices.
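To see that equivalence concretely, here's a tiny check with made-up numbers (not competition data); it assumes scikit-learn is installed for the mean_squared_log_error helper.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([20_000., 35_000., 50_000.])   # raw sale prices
y_pred = np.array([22_000., 30_000., 55_000.])   # raw-scale predictions

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
rmse_on_logs = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
assert np.isclose(rmsle, rmse_on_logs)
```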

+
+
df['logSalePrice'] = np.log1p(df['SalePrice'])
+df.logSalePrice.hist(); plt.xlabel('logSalePrice');
+
+

histogram of log sale price showing a more symmetric distribution

+
+
+
+
+
+

Train and Evaluate the XGBoost regression model

+

Having prepared our dataset, we are now ready to train an XGBoost model. We’ll walk through the flow step-by-step first, then later we’ll collect the code in a single cell, so it’s easier to quickly iterate through variations of the model.

+
+

Specify target and feature columns

+

First we’ll put together a list of our features and define the target column. I like to have an actual list defined in the code so it’s easier to see everything we’re putting into the model and easier to add or remove features as we iterate. Just run something like list(df.columns) in a cell to get a copy-pasteable list of columns, then edit it down to the full list of features, i.e. remove the target, date columns, and other non-feature columns.

+
+
# list(df.columns)
+
+
+
features = [
+    'SalesID',
+    'MachineID',
+    'ModelID',
+    'datasource',
+    'auctioneerID',
+    'YearMade',
+    'MachineHoursCurrentMeter',
+    'UsageBand',
+    'fiModelDesc',
+    'fiBaseModel',
+    'fiSecondaryDesc',
+    'fiModelSeries',
+    'fiModelDescriptor',
+    'ProductSize',
+    'fiProductClassDesc',
+    'state',
+    'ProductGroup',
+    'ProductGroupDesc',
+    'Drive_System',
+    'Enclosure',
+    'Forks',
+    'Pad_Type',
+    'Ride_Control',
+    'Stick',
+    'Transmission',
+    'Turbocharged',
+    'Blade_Extension',
+    'Blade_Width',
+    'Enclosure_Type',
+    'Engine_Horsepower',
+    'Hydraulics',
+    'Pushblock',
+    'Ripper',
+    'Scarifier',
+    'Tip_Control',
+    'Tire_Size',
+    'Coupler',
+    'Coupler_System',
+    'Grouser_Tracks',
+    'Hydraulics_Flow',
+    'Track_Type',
+    'Undercarriage_Pad_Width',
+    'Stick_Length',
+    'Thumb',
+    'Pattern_Changer',
+    'Grouser_Type',
+    'Backhoe_Mounting',
+    'Blade_Type',
+    'Travel_Controls',
+    'Differential_Type',
+    'Steering_Controls',
+    'saledate_year',
+    'saledate_month',
+    'saledate_day',
+    'saledate_quarter',
+    'saledate_day_of_year',
+    'saledate_day_of_week',
+    'saledate_days_since_epoch'
+]
+
+target = 'logSalePrice'
+
+
+
+

Split the data into training and validation sets

+

Next we split the dataset into a training set and a validation set. Of course since we’re going to evaluate against the validation set a number of times as we iterate, it’s best practice to keep a separate test set reserved to check our final model to ensure it generalizes well. Assuming that final test set is hidden away, we can use the rest of the data for training and validation.

+

There are two main ways we might want to select the validation set. If there isn’t a temporal ordering of the observations, we might be able to randomly sample. In practice, it’s much more common that observations have a temporal ordering, and that models are trained on observations up to a certain time and used to predict on observations occurring after that time. Since this data is temporal, we don’t want to split randomly; instead we’ll split on observation date, reserving the latest observations for the validation set.

+
+
# Temporal Validation Set
+def train_test_split_temporal(df, datetime_column, n_test):
+    idx_sort = np.argsort(df[datetime_column])
+    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
+    return df.iloc[idx_train, :], df.iloc[idx_test, :]
+
+
+# Random Validation Set
+def train_test_split_random(df, n_test):
+    np.random.seed(42)
+    idx_sort = np.random.permutation(len(df))
+    idx_train, idx_test = idx_sort[:-n_test], idx_sort[-n_test:]
+    return df.iloc[idx_train, :], df.iloc[idx_test, :]
+
+my_train_test_split = lambda d, n_valid: train_test_split_temporal(d, 'saledate', n_valid)
+# my_train_test_split = lambda d, n_valid: train_test_split_random(d, n_valid)
+
+
+
n_valid = 12000
+train_df, valid_df = my_train_test_split(df, n_valid)
+
+train_df.shape, valid_df.shape
+
+
((389125, 61), (12000, 61))
+
+
+
+
+

Create DMatrix data objects

+

XGBoost uses its own optimized data container called DMatrix for efficient training and prediction, so next we need to create DMatrix objects for our training and validation datasets.

+
+

If you prefer to use the scikit-learn interface to XGBoost, you don’t need to create these DMatrix objects. More on that below.

+
+
+
dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+
+
+

Set the XGBoost parameters

+

XGBoost has numerous hyperparameters. Fortunately, just a handful of them tend to be the most influential; furthermore, the default values are not bad in most situations. I like to start out with a dictionary containing the default parameter values for just the ones I think are most important. For training there is also the num_boost_round argument, which I set to 50 as a starting point; you can make this smaller initially if training takes too long.

+
+
# default values for important parameters
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+
+
+

Train the XGBoost model

+

Check out the documentation on the learning API to see all the training options. During training, I like to have XGBoost print out the evaluation metric on the train and validation set after every few boosting rounds and again at the end of training; that can be done by setting evals and verbose_eval. You can also save the evaluation results in a dictionary passed into evals_result to inspect and plot the objective curve over the training iterations.

+
+
evals_result = {}
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')],
+              verbose_eval=10,
+              evals_result=evals_result)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+
+

Train the XGBoost model using the sklearn interface

+

You can optionally use the sklearn estimator interface to XGBoost. This will bypass the need to use the DMatrix data objects for training and prediction, and it will allow you to leverage many of the other scikit-learn ecosystem tools like pipelines, parameter search, partial dependence plots, etc. The XGBRegressor is available in the xgboost library that we’ve already imported.

+
+
# scikit-learn interface
+reg = xgb.XGBRegressor(n_estimators=num_boost_round, **params)
+reg.fit(train_df[features], train_df[target], 
+        eval_set=[(train_df[features], train_df[target]), (valid_df[features], valid_df[target])], 
+        verbose=10);
+
+
[0] validation_0-rmse:6.74422   validation_1-rmse:6.79733
+[10]    validation_0-rmse:0.34798   validation_1-rmse:0.37158
+[20]    validation_0-rmse:0.26289   validation_1-rmse:0.28239
+[30]    validation_0-rmse:0.25148   validation_1-rmse:0.27028
+[40]    validation_0-rmse:0.24375   validation_1-rmse:0.26420
+[49]    validation_0-rmse:0.23738   validation_1-rmse:0.25855
+
+
+

Since not all features of XGBoost are available through the scikit-learn estimator interface, you might want to get the native booster object back out of the sklearn wrapper.

+
+
m = reg.get_booster()
+
+
+
+

Evaluate the model and check for overfitting

+

We get the model evaluation metrics on the training and validation sets printed to stdout when we use the evals argument to the training API. Typically I just look at those printed metrics, but let’s double check by hand.

+
+
def root_mean_squared_error(y_true, y_pred):
+    return np.sqrt(np.mean((y_true - y_pred)**2))
+
+root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))
+
+
0.25855368
+
+
+

So, how good is that RMSLE of 0.259? Well, checking the Kaggle leaderboard for this competition, we would have come in 53rd out of 474, which is in the top 12% of submissions. That’s not bad for 10 minutes of work doing the bare minimum necessary to transform the raw data into a format consumable by XGBoost and then training a model using default hyperparameter values.

+
+

Note that we’re using a different validation set from that used for the final leaderboard (which is long closed), but our score is likely still a decent approximation for how we would have done in the competition.

+
+

It can be helpful to take a look at objective curves for training and validation data to get a sense for the extent of overfitting. A huge difference between training and validation performance indicates overfitting. In the below curve, there is very little overfitting, indicating we can be aggressive with hyperparameters that increase model flexibility. More on that soon.

+
+
pd.DataFrame({
+    'train': evals_result['train']['rmse'],
+    'valid': evals_result['valid']['rmse']
+}).plot(); plt.xlabel('boosting round'); plt.ylabel('objective');
+
+

line plot showing objective function versus training iteration for training and validation sets

+
+
+
+
+

Check feature importance

+

It’s helpful to get an idea of how much the model is using each feature. In following iterations we might want to try dropping low-signal features or examining the important ones more closely for feature engineering ideas. The gigantic caveat to keep in mind here is that there are different measures of feature importance, and each one will give different importances. XGBoost provides three importance measures; I tend to prefer looking at the weight measure because its rankings usually seem most intuitive.

+
+
fig, ax = plt.subplots(figsize=(5,10))
+feature_importances = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
+feature_importances.plot.barh(ax=ax)
+plt.title('Feature Importance');
+
+

feature importance plot showing a few high importance features and many low importance ones

+
+
+
+
+
+

Improve performance using a model iteration loop

+

At this point we have a half-decent prototype model. Now we enter the model iteration loop in which we adjust features and model parameters to find configurations that have better and better performance.

+

Let’s start by putting the feature and target specification, the training/validation split, the model training, and the evaluation all together in one code block that we can copy paste for easy model iteration.

+
+

Note that for this process to be effective, model training needs to take less than 10 seconds. Otherwise you’ll be sitting around waiting way too long. If training takes too long, try training on a sample of the training data, or try reducing the number of boosting rounds.

+
+
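If training is too slow on the full training set, one option (a hypothetical tweak that reuses the train_df, features, and target defined above, not part of the original notebook) is to iterate on a random sample of the training rows and switch back to the full data once the feature list and parameters have settled down:

```python
# iterate faster by fitting on a sample of the training rows
train_sample_df = train_df.sample(n=50_000, random_state=42)
dtrain_sample = xgb.DMatrix(data=train_sample_df[features],
                            label=train_sample_df[target],
                            enable_categorical=True)
```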
+
features = [
+    'SalesID',
+    'MachineID',
+    'ModelID',
+    'datasource',
+    'auctioneerID',
+    'YearMade',
+    'MachineHoursCurrentMeter',
+    'UsageBand',
+    'fiModelDesc',
+    'fiBaseModel',
+    'fiSecondaryDesc',
+    'fiModelSeries',
+    'fiModelDescriptor',
+    'ProductSize',
+    'fiProductClassDesc',
+    'state',
+    'ProductGroup',
+    'ProductGroupDesc',
+    'Drive_System',
+    'Enclosure',
+    'Forks',
+    'Pad_Type',
+    'Ride_Control',
+    'Stick',
+    'Transmission',
+    'Turbocharged',
+    'Blade_Extension',
+    'Blade_Width',
+    'Enclosure_Type',
+    'Engine_Horsepower',
+    'Hydraulics',
+    'Pushblock',
+    'Ripper',
+    'Scarifier',
+    'Tip_Control',
+    'Tire_Size',
+    'Coupler',
+    'Coupler_System',
+    'Grouser_Tracks',
+    'Hydraulics_Flow',
+    'Track_Type',
+    'Undercarriage_Pad_Width',
+    'Stick_Length',
+    'Thumb',
+    'Pattern_Changer',
+    'Grouser_Type',
+    'Backhoe_Mounting',
+    'Blade_Type',
+    'Travel_Controls',
+    'Differential_Type',
+    'Steering_Controls',
+    'saledate_year',
+    'saledate_month',
+    'saledate_day',
+    'saledate_quarter',
+    'saledate_day_of_year',
+    'saledate_day_of_week',
+    'saledate_days_since_epoch',
+]
+
+target = 'logSalePrice'
+
+train_df, valid_df = train_test_split_temporal(df, 'saledate', 12000)
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')],verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37158
+[20]    train-rmse:0.26289  valid-rmse:0.28239
+[30]    train-rmse:0.25148  valid-rmse:0.27028
+[40]    train-rmse:0.24375  valid-rmse:0.26420
+[49]    train-rmse:0.23738  valid-rmse:0.25855
+
+
+
+

Feature selection

+
+

Drop low-importance features

+

Let’s try training a model on only the top k most important features, using the rankings created from each of the three importance measures. You can play with different values of k for each ranking, looking for the best-performing number of features to keep manually.

+
+
feature_importances_weight = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)
+feature_importances_cover = pd.Series(m.get_score(importance_type='cover')).sort_values(ascending=False)
+feature_importances_gain = pd.Series(m.get_score(importance_type='gain')).sort_values(ascending=False)
+
+
+
# features = list(feature_importances_weight[:30].index)
+# features = list(feature_importances_cover[:35].index)
+features = list(feature_importances_gain[:30].index)
+
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79733
+[10]    train-rmse:0.34798  valid-rmse:0.37150
+[20]    train-rmse:0.26182  valid-rmse:0.27986
+[30]    train-rmse:0.24974  valid-rmse:0.26896
+[40]    train-rmse:0.24282  valid-rmse:0.26043
+[49]    train-rmse:0.23768  valid-rmse:0.25664
+
+
+

Looks like keeping the top 30 from the gain importance type gives a slight performance improvement.

+
+
+

Drop one feature at a time

+

Next try dropping each feature out of the model one-at-a-time to see if there are any more features that you can drop. For each feature, drop it from the feature set, then train a new model, then record the evaluation score. At the end, sort the scores to see which features are the best candidates for removal.

+
+
features = [
+    'Coupler_System',
+     'Tire_Size',
+     'Scarifier',
+     'ProductSize',
+     'Ride_Control',
+     'fiBaseModel',
+     'Enclosure',
+     'Pad_Type',
+     'YearMade',
+     'fiSecondaryDesc',
+     'ProductGroup',
+     'Drive_System',
+     'Ripper',
+     'saledate_days_since_epoch',
+     'fiModelDescriptor',
+     'fiProductClassDesc',
+     'MachineID',
+     'Hydraulics',
+     'SalesID',
+     'Track_Type',
+     'ModelID',
+     'fiModelDesc',
+     'Travel_Controls',
+     'Transmission',
+     'Blade_Extension',
+     'fiModelSeries',
+     'Grouser_Tracks',
+     'Undercarriage_Pad_Width',
+     'Stick',
+     'Thumb'
+]
+
+# drop each feature one-at-a-time
+scores = []
+for i, feature in enumerate(features):
+    drop_one_features = features[:i] + features[i+1:]
+
+    dtrain = xgb.DMatrix(data=train_df[drop_one_features], label=train_df[target], enable_categorical=True)
+    dvalid = xgb.DMatrix(data=valid_df[drop_one_features], label=valid_df[target], enable_categorical=True)
+
+    params = {
+        'learning_rate': 0.3,
+        'max_depth': 6,
+        'min_child_weight': 1,
+        'subsample': 1,
+        'colsample_bynode': 1,
+        'objective': 'reg:squarederror',
+    }
+    num_boost_round = 50
+
+    m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+                evals=[(dtrain, 'train'), (dvalid, 'valid')],
+                verbose_eval=False)
+    score = root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))
+    scores.append(score)
+
+results_df = pd.DataFrame({
+    'feature': features,
+    'score': scores
+})
+results_df.sort_values(by='score')
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
     feature                      score
18   SalesID                      0.252617
5    fiBaseModel                  0.253710
27   Undercarriage_Pad_Width      0.254032
17   Hydraulics                   0.254114
20   ModelID                      0.254169
4    Ride_Control                 0.254278
16   MachineID                    0.254413
19   Track_Type                   0.254825
6    Enclosure                    0.254958
28   Stick                        0.255164
1    Tire_Size                    0.255365
10   ProductGroup                 0.255404
22   Travel_Controls              0.255895
29   Thumb                        0.256300
23   Transmission                 0.256380
26   Grouser_Tracks               0.256395
11   Drive_System                 0.256652
24   Blade_Extension              0.256698
7    Pad_Type                     0.256952
25   fiModelSeries                0.257073
2    Scarifier                    0.257590
12   Ripper                       0.257848
0    Coupler_System               0.258074
21   fiModelDesc                  0.258712
13   saledate_days_since_epoch    0.259856
14   fiModelDescriptor            0.260439
9    fiSecondaryDesc              0.260782
15   fiProductClassDesc           0.263790
3    ProductSize                  0.268068
8    YearMade                     0.313105
+ +
+
+
+

Next try removing the feature with the best removal score. Then with that feature still removed, also try removing the feature with the next best removal score and so on. Repeat this process until the model evaluation metric is no longer improving. I think this could be considered a faster version of backward stepwise feature selection.

+
+
features = [
+    'Coupler_System',
+     'Tire_Size',
+     'Scarifier',
+     'ProductSize',
+     'Ride_Control',
+#      'fiBaseModel',
+     'Enclosure',
+     'Pad_Type',
+     'YearMade',
+     'fiSecondaryDesc',
+     'ProductGroup',
+     'Drive_System',
+     'Ripper',
+     'saledate_days_since_epoch',
+     'fiModelDescriptor',
+     'fiProductClassDesc',
+     'MachineID',
+#      'Hydraulics',
+#      'SalesID',
+     'Track_Type',
+     'ModelID',
+     'fiModelDesc',
+     'Travel_Controls',
+     'Transmission',
+     'Blade_Extension',
+     'fiModelSeries',
+     'Grouser_Tracks',
+#      'Undercarriage_Pad_Width',
+     'Stick',
+     'Thumb'
+]
+
+dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)
+dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)
+
+params = {
+    'learning_rate': 0.3,
+    'max_depth': 6,
+    'min_child_weight': 1,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',
+}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74422  valid-rmse:6.79145
+[10]    train-rmse:0.34882  valid-rmse:0.37201
+[20]    train-rmse:0.26050  valid-rmse:0.27386
+[30]    train-rmse:0.24844  valid-rmse:0.26205
+[40]    train-rmse:0.24042  valid-rmse:0.25426
+[49]    train-rmse:0.23549  valid-rmse:0.25004
+
+
+

So here I was able to remove four more features before the score started getting worse. With our reduced feature set, we’re now ranking 39th on that Kaggle leaderboard. Let’s see how far we can get with some hyperparameter tuning.

+
+
+
+

Tune the XGBoost hyperparameters

+

This is a topic which deserves its own full-length post, but just for fun, here I’ll do a quick and dirty hand tuning without a ton of explanation.

+

Broadly speaking, my process is to increase model expressiveness by increasing the maximum tree depth until it looks like I’m overfitting. At that point, I start pushing tree pruning parameters like min_child_weight and regularization parameters like lambda to counteract the overfitting. That process led me to the following parameters.

+
+
params = {
+    'learning_rate': 0.3,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 5,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:6.74473  valid-rmse:6.80196
+[10]    train-rmse:0.31833  valid-rmse:0.34151
+[20]    train-rmse:0.22651  valid-rmse:0.24885
+[30]    train-rmse:0.21501  valid-rmse:0.23904
+[40]    train-rmse:0.20897  valid-rmse:0.23645
+[49]    train-rmse:0.20418  valid-rmse:0.23412
+
+
+

That gets us up to 12th place. Next I start reducing the learning rate and increasing the boosting rounds in proportion to one another.

+
+
params = {
+    'learning_rate': 0.3/5,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 5,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50*5
+
+m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)
+
+
[0] train-rmse:9.04930  valid-rmse:9.12743
+[10]    train-rmse:4.88505  valid-rmse:4.93769
+[20]    train-rmse:2.64630  valid-rmse:2.68501
+[30]    train-rmse:1.44703  valid-rmse:1.47923
+[40]    train-rmse:0.81123  valid-rmse:0.84079
+[50]    train-rmse:0.48441  valid-rmse:0.51272
+[60]    train-rmse:0.32887  valid-rmse:0.35434
+[70]    train-rmse:0.26276  valid-rmse:0.28630
+[80]    train-rmse:0.23720  valid-rmse:0.26026
+[90]    train-rmse:0.22658  valid-rmse:0.24932
+[100]   train-rmse:0.22119  valid-rmse:0.24441
+[110]   train-rmse:0.21747  valid-rmse:0.24114
+[120]   train-rmse:0.21479  valid-rmse:0.23923
+[130]   train-rmse:0.21250  valid-rmse:0.23768
+[140]   train-rmse:0.21099  valid-rmse:0.23618
+[150]   train-rmse:0.20928  valid-rmse:0.23524
+[160]   train-rmse:0.20767  valid-rmse:0.23445
+[170]   train-rmse:0.20658  valid-rmse:0.23375
+[180]   train-rmse:0.20558  valid-rmse:0.23307
+[190]   train-rmse:0.20431  valid-rmse:0.23252
+[200]   train-rmse:0.20316  valid-rmse:0.23181
+[210]   train-rmse:0.20226  valid-rmse:0.23145
+[220]   train-rmse:0.20133  valid-rmse:0.23087
+[230]   train-rmse:0.20045  valid-rmse:0.23048
+[240]   train-rmse:0.19976  valid-rmse:0.23023
+[249]   train-rmse:0.19902  valid-rmse:0.23009
+
+
+

Decreasing the learning rate and increasing the boosting rounds got us up to a 2nd place score. Notice that the score is still decreasing on the validation set. We can actually continue boosting on this model by passing it to the xgb_model argument in the train function. We want to go very very slowly here to avoid overshooting the minimum of the objective function. To do that I ramp up the lambda regularization parameter and boost a few more rounds from where we left off.

+
+
# second stage
+params = {
+    'learning_rate': 0.3/10,
+    'max_depth': 10,
+    'min_child_weight': 14,
+    'lambda': 60,
+    'subsample': 1,
+    'colsample_bynode': 1,
+    'objective': 'reg:squarederror',}
+num_boost_round = 50*3
+
+m1 = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
+              evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10,
+              xgb_model=m)
+
+
[0] train-rmse:0.19900  valid-rmse:0.23007
+[10]    train-rmse:0.19862  valid-rmse:0.22990
+[20]    train-rmse:0.19831  valid-rmse:0.22975
+[30]    train-rmse:0.19796  valid-rmse:0.22964
+[40]    train-rmse:0.19768  valid-rmse:0.22955
+[50]    train-rmse:0.19739  valid-rmse:0.22940
+[60]    train-rmse:0.19714  valid-rmse:0.22935
+[70]    train-rmse:0.19689  valid-rmse:0.22927
+[80]    train-rmse:0.19664  valid-rmse:0.22915
+[90]    train-rmse:0.19646  valid-rmse:0.22915
+[100]   train-rmse:0.19620  valid-rmse:0.22910
+[110]   train-rmse:0.19604  valid-rmse:0.22907
+[120]   train-rmse:0.19583  valid-rmse:0.22901
+[130]   train-rmse:0.19562  valid-rmse:0.22899
+[140]   train-rmse:0.19546  valid-rmse:0.22898
+[149]   train-rmse:0.19520  valid-rmse:0.22886
+
+
+
+
root_mean_squared_error(dvalid.get_label(), m1.predict(dvalid))
+
+
0.22885828
+
+
+

And that gets us to 1st place on the leaderboard.

+
+
+
+

Wrapping Up

+

There you have it, how to use XGBoost to solve a regression problem in python with world class performance. Remember you can use the XGBoost regression notebook from my ds-templates repo to make it easy to follow this flow on your own problems. If you found this helpful, or if you have additional ideas about solving regression problems with XGBoost, let me know down in the comments.

+
+ +
+
+ + +

Comments

+ +
+ + +
+ +
+ + + + \ No newline at end of file diff --git a/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-19-output-1.png b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-19-output-1.png new file mode 100644 index 0000000000000000000000000000000000000000..b67113d8a9f88a11b2ed36817c2a30783d3a3482 GIT binary patch literal 17808 zcmajH1yoh*_cppnkw#KN1VM01n6&f;MFch?A{~m-(nu>QDN-t3g3>KWBZ2`)C?F{z zQUcQb&9(XcV|@3Hd&fBE=vn*S?|NhAGoQJiYN#nwk+YH`2tsx3s-h->5MmI7pookV zo&@o#QNSPKPFNkMn|9_-t|kumkQ*jW53TK-tSwAWy4-Vcw6L?iD0Erqg22fKPEHRU zC4_}-{(FFsox^=$j0h(dCOP`>>K#V}p)o=KCCHJ8I0;gvq_VD3MD}X*YUmu5EjERAqm<;jkDTe`3eYaghy6 z#wDw$s5t4TRYF8dM@Ls4hz$r14)$aUC5Qjp5@2HF=H~u)M*xEkD%H_~&)V)6=fKdY z&h9qqZ^vEUb7ZR}$?NE_FkquNg5dL?a%7J~LdbbepO(+E+Pg4ChSIW^1A15_Soub3D2d*BQGRfZF5uZ+!0{X z5n!73Djk^5cT-gC4ewuY>HYrwQ^c>zGF`!5%W~z4sf*Zq?XNELpFP|B_2or9FEw-X z)5$*jj}`;Cxzla43ksYjWN&RHtv&CP=PefK|361yAq z2R3Wsf*iPODk}1i9*Gxjb8)Au(aWEdk>OlhTU&S)7WRr>KDD}?oDG`y1o18>jinsJDtdEd+uSZm&x60Pv}+JwiKJSp1b#%;X_Lc*QHAg8Oo6#VM_)TuD_IJLej^QTt^!7 zD@Q`$3|Hs+Xsr4Q)_TuXKIPmi8;iNj@D#E9tk!%|l9G<@+RdAG3fn0^t`BTHW-d@J zD)91o%`9vf)*M7*wlvo0_A7uHL3TD4k04TOy*J&~zTfW3R_A=7dOA_+(J|3sCwfEo z8D3%%5|gceQ!k_*U7+TW@Hed0l^OoEHJ#FWwy@?&{qB^o+umB=d9Q7!$*C!>3l~aO z>?)|F(z~Brxz9nr_rOvGF^mFb>`tx~~hTsi#OYUy;Ay_*-?zYh(D&V5RWy&XmZvZ!y-R zb+unc*&=hwPVz|5A*t9a8hX?$st!b;Rt%*1tVayr8UC!hV$z#~6WYjq`*vw}rE_AX zQ>OLJ&1~axUEIx^e!IK7B9#}N{%A}uJbOk9+Mf3A-JE)qG0s9R<~*gjiR?Q~BCRA7 z92fWOYX+>vMyqh$Y*Vglvgf~Rc}Ddo3hTBp+v^JyevQ6^cJ&<2&dxTr%DLaGhi)^p zoE<&4Fq$!nHI-lEXZ%HW+tN6fKTx7NW&SE0g-!R2Q^z94^2?8Wz4PeK+~@KU-_M^_ zCV&5Sv6t9bW{+i34E&=(cnV4VXPq{A3BJP|pO_fN>9b2$RaM1*=8VPK1Le^=uVTyZ zn90dWUBw_HMhpYkR9>gFrCyGww%(Y-Q# z_g52YN*h3d&OQ2fSKnuEjpe+le|%2`r_8nmHdf?V&XL*A>Ua%w!tL#Bw9?4zj+NN{ z{`#^#8C}Ya1d*z|gv(ZOU-l?u$Sp0@;!8>{R64jYz=fo%agrcK_T8eOc@{cax+WA6 z1PK4F`#h+o%+|pZ`dNO=t5?T9$-x?`g_&ZtA2dxMh+LfJr)&T9ghh23@ui&;|DPws z%BK;8GMzLhH!d*{%&~_l&I0Ry=2&^UN%a`Ua0~R6hbUC#YFM9U<#7yflui~x9jb#NC-YhF zVd#<6?V&UNpTlA1$y)&d=!re!@WW;a-gFcx0D=3_Y53PsZ6}Q_MZgzF9 zFASdd+c+ zUOzr6s*87VLLN*G zY3&&%gjU~2+u;}(U3s>7u@Ww;Q4UnX!B=`Nw=lC@alceyI~>wy=s|@nO-8Gm?rpE( z?Ovsw5ep^hkoV3)PHwJMUj_Dvbl|^BOtB&?-0VGnf_|Pw^sc1-P*DS2G|b{(*x$s8 zpo(w%o{H0ZlPluHm5}|t9ka={D8x5kuhJA0<=xs3Jpu$Sq^-T(W}~8~r)Rm)XOH6Y z<;&+-C=p+q!HP$_SIYcBY#JFDDRLjMa3^bA2SaJQy*jfw9xA-l>9bG2G2|HO==J|l zb=}gl{y#6+{JfN-g&FcW*b{p=hScr;yI!+Bn`^Ml^vVQhq2Tm_AnV~*wX>KURact(?o>Or96>ZohkhOIo_FXE4f8%X@2zS!!>! 
zN2k)3!_w07TfL7@4a6fIi+s-n1wNl|v+RHTs#NxB*(6*+Z>*C9$M)J>Yow2aT<&+5 z?W-!M-e9MOsvKD)T?7iMezWH3RZ@P+)YOv$=gFRje*Pj{AC{TPBcNCDwMs}_{ElHN(+B@qoIqHx76y0Qkk~-K8t;jQ z1AFO(eiRsl;xd$eMM(aYwYC-rq~VOH9QA(xeOga|=|#JFd{%A%+a>-0GuFM3(!``V z2(lA0nPR~bn9X!&{pl;@{84BW8N}($5_x5vWdwwgDJA9Q%cB`xfAk?*u9nJ3=g;}3 zh#;1ojLuv8U_<{%y}<`?AA;##$oNR3FGSrwjnonnCuHz*d}X>lA?V*fhiOd>jj`Co z%hmjdtbsUcRii|&31W^OJ$hB$^~&@qK|!jmtt~-mqr$#ZQc^?rmIFaLHui#5yhmyN zGuEs%q9sisNdE_iv?oig$3|RC&brxmK|!KY?C)MLq9KiZ%A4XR#2@>ceZwA zC@*&@bN+(xstH`f-H(>9?ODA9QNzgn5Az0lZ}m@?PObZ8>HSSJ$`ev7!gZV6OoF=p zGd-wPBXvp?f;gOLQ+hWwC|wMBYu2|PLNun4hMc8&&ziX!adzWOyZ=I*^xlv5V1oiAVYnb`NgUwB08vTBCi88ZmRxc`wd z#Znjj;lQ?N+v^@0i^te3qP#$sR;t4-FbZHe=*Iamv#hA&RGg%O|4*g1N>r-X`k(Pn zR3p-lX?9wyJ`E18K__@}3hp9poxDL?7{7%vGu#Yf=ERD>h9P2hTzalhuv;Lr^s(b>u@6K|>_z8JhT;^N_(sVX-hDqCll zfsBfK))E$_l^aQj=!l0)!`%20+b5NAbg?bZqrt$DNW(slaTOzo^JQW6)BpKOCb=U6 zBM1KOW*uOPOi_u~T{MTcZ4iOY@_TZM9zjehE;QK2vq7B1-%y<6q~R8mk&|O`#L`U* zq%7ww`sl73f5%ge)}*%=9h%X%iNV`KsK1M6cNGerjX^^!{vvc-j-j7s3mK9la1TBF zSrzo1{N$ML;OSVs&vT(GH^td2+<+@wGsmBIZ!8!HG`%~V}k$Ygs@s{10cLs~i=-O<-#k`IQv{)@K#|v-H z*=sV&Q*BYFuBxgks;H2^R!a&2GgP^)AKF3)S#e2QDupXZNe*60W>rA7@5LJgZjAbH z&h-_xLXaMwp3WV1{32;qmLRk5*RQH9;Y!obV`3(!K2XciQT6=M>m-oB`UyaXUJicw z%?_iVkZ5GSem(vsJGlw0al#ebVjbgRoy3iHJQAQ`Xcz$yK;hKx^0#aH>&quwlVm1| z>rE?ES-c=z}Oq=+&hj3+h#j zeRxX0x!lTWzq3%;Y%OB;G3@vSGlQ6}RO}hn>@FGJLot57-PQRin$Lb?*fk?Fllc7k z^Y6eN7uyVqto|L=GtUdve8ceKJDo3L*|EIijRxQ$z3C;UE#xoXFUD9J@E8-!em)e@ z+skhpwRW!=G;i$nW5n=Md)owOAALoP6iG!LDR;V1bNYCJ1~}>J|HSu%%%LF(WpKE2 za~&#bV7eII>1YixF#Ou&Zh2^R=1`Ei1u(Yc5ia`j96XkfUpeB#bdu9djH1|AVf;*| zTaz5CE@YrW3ZRcDFN@zpO+_L^fVb*Ua{1xmaVlMgGZFNb?bueHY>-oORhUwff#i3p zBH}ys^(a0OBBT#h;p+`pa~k6R63+%f1-~pv_!A8GaU``+>WnBE2(zy#ZpJ_L#a%@9 zUDh+lH}Fzt0;X8CbG(^NLJWVff*LJvP>UfIj?JQX|Hxv`N*5t1r4CxvHld>>k#=9P z$WbVE_F|3VOL)`+kGQl78!ckZgknRQCJr}t0_J;rVHH7iRK(&@x0Vg7(DZ87$Gmk& z&OoiZg#acgGengXA+{GnQLad^9y|)xoQSNO0o}AFY6Q>0Bq3L?2HxQR_EaMEU)j7E zSRR4^Y*w_p(s5icKjJGE&xaQPV-ws=W-=Y=Zuhx(m0RZvchRr&z}LyU-`Xp2&>%Vd z5}5xQx$KKk;6_sGcy8O$!gV5WodiFYco1Jr!;|7i;e24WWyJZe9pal7ff+pl9zQDhTPQA zlKQ|jLV=`SaKxXIIxLO$MhqPF*sKvg+azZfz??V+`wl7od&#&G?_d~M;Cnn0bduu7CB1oO*5maOcb(a}BDd)3Z!&ytfz z{x0O{6cfI8T~snRH-DcjV>cod6CW=b|AcUKJDB&h^iig?0CJaxug5Jxmn5!FbcCX? 
diff --git a/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-20-output-1.png b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-20-output-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..2805460e9cbfe6524a4e5dd6759a06fe7a9cd760
Binary files /dev/null and b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-20-output-1.png differ
diff --git a/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-7-output-1.png b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-7-output-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..88e3a25156c2b11eb1ac423d6e036c9a84f9a814
Binary files /dev/null and b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-7-output-1.png differ
zU8$@gYO&{1TpWkf=h)*HTPjab!k_Jjhja&uH#wIr*|>F&x~n@B8y~sYF78%RR@PgB z8*rC}L~J_vP~~aI>2x{~au-H!C4J*qHYq_|jE$)Ir166z{RO_ctcK|PPq>VK3DByBu=jOKX@T5X@&+!D1 zX#D*A)mxl=sj%=5nigpP@WT&7dKauJ1Di@0dedra&y>zo@`l6%xiUVSF*5$z}8z-piiH{&etN@r=uS0IBWeGf<%+9^hUSM4^UKsFLBlyu7=y5K(>9 zFg%w=U-FanntWQunm2(OXGY$!N-U}r3d_Tn1}ek#dHWls)edHq`g4^iJwu!JFCKy! zy&qJ)iBPL3MZ9sy1#VS}8#Od;VbjV|^hydunBKm5vlvwM=1p>DW+vGQjC{AU~TBV_vNI;UlQZgo`B5E@A7yXwSvBf zW^w3{4(M8h0eFJC_=`XtoSLC+TgA9@UkJqxCp;ay3oauU4V0l2n8>l)d!t6EDX?`E zr9ND56eG;6!!svH$j5YWbld`ocxV{fq;A)mgDkrHq{p)#;zgz{on8Fg z{i3p&Qxrhk%MR-+w84G`*XQ{0;}X{IH}f66^fIgR-{tr33%wG+Ta${=fCWDESf4Qs zLec1hQ~XURR}IBgZmL8Nu)+{dgw_v8i@_{k@kL<;(8UoI$-D6Gr@Z`pqU5R!E(mu~ zYM&edm*YZC2MsQqnK9)uT>Pi=8m?~}?la2p(-2>T78`p1eheVJLQ_!%#z-84LCreY z8gLrv%y3bFrDl1{Et%(Ix?rvXA#831=4U08!Jwol-1{}_)*UHZo~f8dzga?5! z9X&lOaPoMM*wl)HOl0{~#;r2lXK64z2u9SZ;?{;r2t);0EPvcq2`XYBq(5_pVD801 zAa({GyWMwTQZi99(xLlu6<;MNwsl0v?@@jY4weRb9tc^}U5bjmYNt;hp;(l7H=&=S zSC97;Q}&679D(^cuT`DK{JVjUwR`#?}1)XtotbyXdks_=4xk1BvH*!u7qYe-hO9;00qbl)&p#&jNl z5t`;@_H+5$$wTrLFA?Wx2ri)7VH8#2zGa!4pU(sF6>k{kIwG(_Za6fu*fT}2H(()=FIP(h%|j(*W@?2!1s}%xuz^RSdJ$e- zOY~o^@|S?nW>b56yYF5WACw*`)nI-nG*v(&rLNzHH49;H9=NOqYl=w%7n*p&QjG;8 z2t~#!!QJ_VyYl5!MZ(H7oivA^$fK+esIW9Gxb!Mrc;7y8n6akL&Q3@CNG(j*dhF23 zqHzirr0k^8ifCzR;f76r*J5MQ1Ln(&Z0D~<8$_PvN=)<0Bo`KO;r^WFK7;CKji)TGM@C({-7@P z6~HaN;t;Tdip5xJ@XrL`i(Ct&Nfq|)#}^Do`J(9QC`j1%g;+De{%0XjPJpDW`U_2g zPPpleFBxc7Z=e`N%Pjm^dH2BRncFaO9HO!yATDixWH71%2jn2I6GdRyBb%A8t*;s{ zxW%KGv^B{~xl;!ZL18hv?n6}osjWTVVI z2~ht0VQZn(JcK%^U_k4^i82sofvM5jt%A)yza)H2Yq+b!RFvf8f~wzNg|}Cwa6z4A z1o$sBlmQD(lS^?i=BJ;2f>V0*Gz-@I5@Wki z#ayb(aF>w=1i<6(8D`+FHw=^0jdDHQjN^%(WL0qvNCgsNs0SIx?(UK z0e01UJM8NxR3}y`Cr3b>0!{}dLpHs92A;qw$qu&hps%lJEb*D6_4-bY4g@Q;CF{g3 zw*`OA%PoH;9&^i(>PAs;Lp>9dG$N5FXOFNj7Ir11ox_a~5dy1*(F)r+IX}NGea2fJ zm=KJ@_|n6bdnh)x!e&9RLTV)tEG$F~!A$jlQ+meFxF8HYHa-6XN`EfJzA;!LK$zs| z>J76i9^4S4R#bHPa{m#)(?5sGk57CKmDl?J>U;Wc^XeIm9B}#w)?Dq>LINaU(V!kQ z$?smokIdJkSt=^7w3)bU2a+!VOD7Q8;sB&Q@d|k&jIMH)Nk#u9?iEJXiUULfiT+?Y z*|wEH_qzpA8W3V>QXGfu#hUf&B_LJduso{ee;C@7?TuTv zcpxeeHdbL3tYEWVK(X7;j|~Lt2q-&&AmG`&obRWNk}Dy85r}8bZx0EC0G1Vm;)lrE ztWiHLXhUz_yy50jpxw^~WH5l$e$Q4;g zOUdRfn6LDZecXG zv|Nygej}=^Gh!I9BbD zagB#Y0S{1v5gLZ0bpv<0THD&jP*6M1=bHioI$ewr#I8UAH-wZ~>0G@Wf{1W1@drHg zCs_zFXYqTz3)ziMLqJd+9aL4Ad9w+H9%y$hmY=COLHHGl0O;Gk9-5d;0U(+qZlDS3 zLO(P28Qg--n4)fsAPGuc!PMLr0w3z@NiT>oz~P`UkOIhqgq1-H_DM(;@yL5WcP^zH zfMaoPo0vH*?FTNda5$BMCT~o-d?^T_iQah>@<9CnPRk(C5svrjR+(4<44_=+QcQ-q zhXA5IG~mucH(S!sH!&`kPOdo?&3wg0h!I%+z=2y~El9!*`}0plPSCu=`fXP}xMW3|P4Mj>7f)^5+0Cm=Yi+ zNF(_(Nep7!KHWu6#}RRlm(%^Gg0y#1u(mEJ4`~dFAu0zh_4?2$^SFjL=c?p zV_qkZq47mny3dv$#WZrD96bcf&O+fq@+ROc8nbW395TRdkh*;ab_1Whw>CsR&_NsY zVw7xwV+@XzOnr#D0e0PuMaRGap&*BuMas<`fuoz~tT!x2N!L$mun<9ZQiH4`91uq8 zFdee&j*T(OXkKB4;qXt}VjbBg2U*=ns)kERSjEAK#N@~*a%=6XV}1Q?+*b%zh>k%T z0z^D>znPPt**t+owLmEj1{A^6y;0uB#Y$-*$cu&y4*2qmJ(Y}1NI#$z28QIZq$>Lb z8d=K`QkZKuY-k0R0eooLq2l%HqNi?ZYHH!sEKhz6W3}>{W92BNQVN%Y4i<@;mn5U~ zW@Ajgtvq`)nw?VPv@({(R)L;~h6M>M zlU4}BZ-8(Wy~cvAi--pbf7v$hLz6j!GA^`j{xjRqCf0R~q?$9|D<+P6F;tsu34Vv-PEK;x&ctsP@oaSI)Hg4ijLvm)m?%q1sp zKmM<~0uEd_4!#d%ywf5)gBn>n{dyhR_aJ?Fz_Lz-d=rZ9f_a6mX}` zTr5!b&kPvAaWcFNHvH+}dIi9iAhUtd@Oc()-1x2(THlc4wM?|~dCd=?z}nq^|9@n_ aR=Qz8YnQ3^CIf}XU{7kE$UJ`j+W!GdMXdz@ literal 0 HcmV?d00001 diff --git a/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-8-output-1.png b/posts/xgboost-for-regression-in-python/index_files/figure-html/cell-8-output-1.png new file mode 100644 index 0000000000000000000000000000000000000000..edcade02a80b93ca317a33d3d854d25ff45e014e GIT binary patch literal 
16837 zcmdsfc|4Zu+V;~tr=paK29boyEW<*S%23KY6q)BK^RP51LyAhqCmN6`nF*;7LS{0P zd7fv!pG9~7|wNHM(XUAjWiocB+?d%b7wA- zNGruiq!n-1uf=!53${<{fS-o-B&dgO&5h|>9Vr&B{pX*n6ZK!a6)hcn`uH*VQxs5!M_s*~P7T~?>xb)@6 zXI~P{xsMfdNpib*?VYDgmL~m*dFdu?&#!kq@QV+9 V02WjFHk!weN=p{bN&`bEN z-tCTmNurPFC3Y--cjaa=yhchCYNq5NONHdq^tes_j=I;_dTZU*PKG^jjL@W`;TX(q{p42g<;y*R2Puz> z_^7Icw5nfS@Z;Jb7R+ZAcEOL;Z=%46j;7XwZ!+}PO#BM*+u?m;7qZ85sg*AH&|fER z?&oc|gr4)E-*AFT!iRpv-9z~IxqoPPZ&Vz^!Goa%rI}Mrn<^bY-nf4KU{6mEJtHH# z`Hyoq`)jvzi_`D8{ykA|yu0ky+6?%KEuCC$g5T_;4A9gm?S zN|CNdF|ysb>qu6($nLLA~$bx zWmpd}u(SKNwQ1ZOY$#Y^QB zv@><>yMN~1nM3jQZWMHDdj*-;X}m&=m<4@jN0jY_sQKo;eNd~6r`Xajwtr40Uywt$ z!K1pj09zKD_l~IBRyF0J?lk5olfU;mi>)Sy?DW%svVV^uS_A|so#aP1)i z_FQ%9r6B_C`ZPn~1~svG{h*x1v9ZqL_E0xk{dezEQ=dr%^9A>qDqmcXec*TckMA2X zlsnR%udFqnuvLs${LVm=2N_DMB(vS<*SktD82jXLhQARC49KH(7#7z4`fPiK>e3x^ zUS3|K(MFvkM~+}EJKrsC-TTMCi~E_?u~Yi4cqc5=e(K%L0VOr3nU30biHRJ~2aid` zsby*uF*7r3%3GC}mq$cKhQ4`oY*7D_me!F^cei*{?c*_hULYtm-Eu3{jLYadRZ7;I zH}1L|-+yc@XS^eFP$X2ww*$x6=;p_yjx5$2za7V)C-Z@$u1qslGhfkBs_dOsQFmPe zqh|gCFP6TFa0)#=Ju|-R%`_bMUcP*p8ZbI)o@~-g8&yzJ@;W^|pg52_ zUmtIL#>&b{rFe3no{5dkyNZgMnsVcv-5k2#>h|G`rGMp|J6~4d%cz)axW|`4GR$%A zZZYMK9Y>EIy*Aj8Hr`j`?X)m^(t1#Z&#L!~ni?$$$2VgrD-h?TJk#Df-KsAyS}|ca zkLk7kM%{(Esjk<7#u}qP3i!vqvXdGtN(DUkazDT(Bbl3=#jG&3{9KUu4}kKg8@ zVuJQb+tC}%whR(JCFc`F%f2TW5aqzS(MsOeDnX~T^YeYmi@|(Pu&#GE?-4OGI~jc1 z`au={%n#p0lje~7l#J$WpErnm(d6QN&T43AM0e&=tKXO%?}@jLlzevLS22%Wxz&se z4DYmyXZ9{c)=Zd)C$;`i|3FDQ0*?%WA)%)-++u9Euj&L--RRlD~a_rNPpzl9MjQ>(+`ZGM4kY}pRl2`-F>lLXz|vN(^Q7#P*X4t^2n1ohdVLK zDS6fFgo*Y1SA$1@Jy1)cc$3)Gt5-ME2?o7(-zv|7^XoYy!iM|?(@$O4`UVG0P85u`Z-y!?@VknL; zKzsD#CbkoHq!@$9zS^cbYNqKwXeUw z1P;k*>%kjKiwo#5Rt_hltCXRi|(aI?g z*c@6n5oZmhVAWflQrct3j(sU8xNk%hi<26wU%h&D(f<_j*mih{x9gWzR1_2zh59@5*HEiT&FZdaR}~Vly7>KYjD9&4|VzE>1yJL}XqkcD3-g8;Y7<22xj@xj4z=Ew?C6 zY36y>4t#Q>@O|`1SV~GNN~m#{mpBs>({6b+d&1VIcyMl-zJh2yw zLxOt+9eB{|JW!rQUt?NzRer^{aA`VRP8;h)Rb}N{IYFdsqDH;+&DAIytmRCBw37%~(9xzh zGCXx*vR;$(-wt#0(SbxAQZHT-IWfBd*1{$O7L&U0HHI^&ChTwY); z4}E+--rhiIv$QbPuK(uEn=hrMjQh4yZgndtE$#XiDr&|#H#wj=KRtpr?c0m96tAVp zQj(IA>=f!g^pR2taa0Qf&5JD9@$ux8W~bB1#*6ba#}(eXpnZwgD%yo3vYTCd^VtXc z?pIM!QN7i#IwlvbkTW~c$7?r!v^~%3MRGEioUfnXSw+P?AUISzcM@#J$;rvk(a4Ae zok|CIPPBS<2tkcd1GGzmY?B=yVlC3NoozI6aqY2Nt2Vj1gooI<^ICM8_Qj;E9G)L5 z^$iFpyugNCq8*3QL{K^cP?n{pZoUQj7=$u5eOi9 zcXO-rLflfWU)K1Mt^0UBNITEUqOzh__zOG*{u6j={3ARC6~4C}v*`|X3pmck%KBP1 zT#9^ink5YTN^%kxS3tm1?p;1MT3Uxu2kBPVYjiX!hM6`KRr1PODno320IZ2j9;bPm zD9`3-8OyO^?tCXY9@7JLb#)FS9_l}4y2v`+-$F_B^q|lPioj-H}-hY^``~vCHLs*b&cP73#{9`kLMxkuwV9EM)s=S z*tYTBDrPxZ?CXp-zE7WuOG<75Oqt1h9AjqQ!)2%t-TC3yRg#kvIqT>?dnM>qyiT3K z>t|k+F~~m@WTM)*edC>gnBjgvP@YW#0sfgjLYPC@}=Om$Kq}!SK@jm!jc9n z-e*TcKB=4vOPV#4BFj>N1}7O%10}d#7Zw%O zWjWeMGob(K?bQd^?2D>P)T6AbsxsrOPcaDxvdGCovB5U|_FTBj9Gz*;sj6sZF#;q| z5XJ`^1A(|svlV1>bVCf10cS)+Rzk=?=imq8RTif%)L9aIAzUi>OHt9P1N*jGOf_1w zTXbGbw;qy8F>ZXuX|QY6x~-m>dt;oJ78@)pBrk^W`|E_*u63~=OmjoQf8_7K;TI@I ztJZElAS6W2r?vOIwT|MHLG^J4>5%@@rG(<7CwJ5i#Vc(fxqf>16UEGYmryC{R$Z#u zVR==nout1DQeGoJe5K2=1PoF^U%L*CicyGf^Xwxi=<(w(-@YAA zvU#D{j&F^IQ(xQk{cc* ztESFQcS5=wNNFwsjC2EF3~m2Pb9)jUeLTc@(e?qIPzk_k^)J6^Zcxk!%}oXJ0U^5q zNduD%YF55As9^ zN0mK#Qa3*S`{fALZBtzxg@PQZ3if5?QwB}y)ez-qP|57d`(+X}ayL3H%{3=z6+LC> z1FIvQla%yQ4gs<^W!h_U609fV;}vz;Y)5b;0sBeAQ%-*r<0@+Zh&tg7uU|cR{5Y|| ziPe&&&TWP}u~w!l_tP!ukRc|OZQyJkG1PHcv-(k)cRZB`uFTl%`w_s>cl!4binK}AC$?db$Hr@8-k0jG3+ zsQt1w05CeUR_$4RSi7yW;%_RYo&QDjw|+MVC_e2MeL2}?JN%E4r!1OU$(x4Lo!GnY-o2|WZOOTn z2AaUc%>2TnIcph)djCf-0C=Psa4<-i>D zmia`tRT)WXYimoo{ptE()|@f=MhWzEJ=HOL3H-teEVqJSmKGI~+=P}FV%xcgG?>;! 
zOK2wA+6*M9tebbAaG9TXptbKmxs{gx3)I4A{I=#%LOUV*u?px+3?{JqLjz`%3cA)? z8F}UE)sqm%-9f%_0nwcWjO3S&bCVZ?cpgzQN}bTJJQ`v@$h(7*Qj?P{Yx+<{xU_fX z1tYZI#G^|VJbCgFr$O}ztKKVjDd?ok%}?R-`4^|#Ssa(4*R&+G8ydvDjNspCd9VMb zmNz%}Q_C|hu{RA#rkcH@BHz3@giZt|5YIjad;+T=x2nHB z#TzQVuaA$2H?6=kK}SC5H^ux8AfJ3_r(r1^@fMstYLYpzi}+aT$eyP*PV0E8Fzhen zEVyac@oQyaVup3^?m(T-^W5|4(c!Zin*TDk6uUsMrO3%*CR!mUhlYW(w6?dg6bT?a zYd7!yAS*498^mjoI#I2>!e()%OF~kz9a$yu=Hpqo1GloiOP)F)OQ<8ygT6)DJm+)|FZgFVBiJx z6@Aq)4L0pw;XxMkI}Yx+|Jj&M&|$ZWE(dHTfqFRyhKcr+NuWVYnVXEpHx13CS27s~ z21|>cEIX8mVsxx`=QQZJVNenRUR`K60G?=@vll1gRLpXB63*RAI{)kck)i84W9-49wZ8}$7>7Cdkb0?PW?%Lb?9)zS| zOyJHYbpCKo6Se8o94D&e_GY}ANiBQ+XO!CSPCNi+=5}9UUdDcl*?U&I01Q zFN$|Xj54=&aX>B@qUMlzUvst)ccUexFQYUhypRDk|dE<;a7vV#xq zpB*x!R1i;2#HANDfW~tdFM3wJpKkNGNEQ^BWKG&Y$?#UUEUb3+^XD_w(TblcBW3qM zDJ>`{fNMQ6-R@2PlN}wZ3jsz)h5Q%jis`A1KfH;Wm7quAF$69b`LPMLXW2~+7WY;~ zl`0&T@EKQ}Y1F~O{w}Ftx>bEF^g;yzI#na01n5M}eGn6J6Z5e=IC7rIGAz6O{}xdB zwn&61UMAoCsmBRk{hjl+ngF~pfmauP?;!pAE(Q4l!epN9iwnno8tgm?&I}Ar~0`HCI+@I%DBKe35o0IIo>ZV&Y5vSarn{I1-Wm7+b4Wt z0Rbg706oh|=q!-PyUSjzLz@!*`t@sGg5s>^V5fqQ@&JUw4^aU8#7DT1M5rVG!G4db zuVf+enoYZ~Ck`bW)`=)7(V&L3A~6yl;Oe?kE=CE5ftj$H&U({oVFOoISLf+t_a8-% zh;Ttz*rCG0LTqL4n7Ei2t^V5heW!1F!pTYi5s-^jDL`_e^cCH*4w1#13%SrN#g9c< z9QCvg@X$P+$G!ePBLoHoVz;u=0qni5%FD0%hT}_`?5`6AEkPFC7azcHLj#31Fx@Hs`OegL@}eUb8gJFji0McZYazod>TQYq>;S(HvaE(y%@ zhSxcsn3%9p3_#N^pFO+NBx{C&2s0VlVkgU3?+llZQ;Y8ai;9opLy)1evQh@dl66wV zyLUd} z_FVe6y1#^)DyLdSoceW{C%i*rsywta=Z>#miA7nw@2BqMuf8-}Gu~6dXzU0p=G=t~ z9`^1LAb3`N)wwH6p9|kQcmBM46w(qpP{!i}`uh#vL`BuTx3!4Y#$5r)@~dD?1>?*q zNP@(o#RB4ScD=>f4zPaJ-DE#RG=FV?=%ov2;56eJQ5cYq^8E+crIGaSa-Z6AJyz`I zHNVj?3I$*}l5+wip8zBv4O(#xND>HzQ~*a67JQ`zC2-UJJZ6UoK2{lViGQlWf}VjP zPhZVpL=46St-arOcpb=JT{3jk&^TmJ9eo|NjWF6_j$ln}=11}at4Od|bhVFebd`3t zzX3@Tp$V#8yAYN5+gz*nx7ZyPW(jtfu4T2kU8#i427%9 z$B!Z~*O5}uMy3tb`*1A7|CmJyU}wJUbPL1Un}MypL%SgxOa8mGiSu6-q~n?bYx2=; zZrEa+HKUD>PTAQ1l*>ee{35F~EG3o8;`7wIqkGejQol4<Mo(FWdfF($ygyg~`SP+_(go0_Pb)Hdr@5xBhl|xfh zkktHRXwLDeC_fQ#GkQl(;9RVHNU#S+%MH-$WH~maaSL}u$3&Hslp18uKz%?ghIN zwzj;mbiAv;&}CIsi#v-0H8}xLVd?37vDzq_kNy1e^`TpBxFdo{Uc?a!glSdR1~|Uz zMh7@Q;_d9&*p#3{p{1oonCuW=W&iiX2%=>~XmzS2cDAPbVgSc|1Q{OGCFocnMu&Xd zc_1-#h)@JBVc6y;+m046>LG*y;q3>ac%D!uiC_5q`LjZ?2V2$`V5sKM<-ilyMH+Mt zANhMRyBSqNJIT~`3xg)JV|BQ5;6Dchi=ICN0#@-~2natFH~O9m)rAiM&;b%gyUylg z6FNZKk{d6e{h>QE0ps)aV{?B`H&;p4wGs^`jrzG4BWNq~MlCcp_g`zS3!v#5682(@np zU!Y>Nf!&Z+3!9c`I+u&>br+l;7S%T}fQYv2Q=AtQ%VkROhxbJxFpkVD6bHR0yj)Lf z80^!A5q$bxCG!)Zv*Rzb)1>1yzirCXpwX_H7#g7_BI01TT*mAu1g8(MYk$4hC}=xM zzm9^Ii28&#zJ9e&(1B2H;7mySv8oa-_Nl)S20pN!$Na~7fCEc9*@KH}0`Ybn1qS&$ zFrx!i1l~|@ytDp}h*=Q=hNVf>RML?|+J2#6%VeuB5t06xmbQX9B0U*)ByL+(hOWA9 zW2oplqQW5a)G+GfvjeblGejGLGpii+z>`BR5SAX2$%GGpRBvaU4q2-?%h3UTj!?5K zBBj*+c;b$8_064#sJAf+&1`@|0qhzRy7`o)$>wfsJ+R{OfqFmk{13(;e%sN+ zdP1H-8^bX(Gv&NA#T8^*bQ)`gMnc7^>MAiCU|4$sbk;o22q2q}9=jW@4LEW~PdM55 zMO|`i=VaSu9>ykzG08k{I%yDZ7@hA&3z^OSmb5%wX(kUU3(W=#PWTH~qvVpxwq3hkan1_{%(Mk*Gl_!l=SPdc4$G0e0TTjGp1attX4rMcV7yg>pRVO=Cd9D$ zmUTkcU}(Uyc^DE>vfsKP2SXWt()&}6lxz^av}wAo)G4n*TBtZo%!3F9jCCz*MVb&^ z$ms)6n}~4r0`#c3E!qusSyP!qmjH;H*xrArH_4U+w|{ZgyHf~JSSiJ9^dDI>#U`T@ zan8X;^~ocj=oTO1JR_Y~$Q5E+WMItJtNOL;GZLk#liv*#6Lrg&YD)i>DsbO1tv!-p zwTV}jyP+36=`d|d z>o{Ik%HNr5Bb6vl*V1;HSPtYvc(RukhGxpVojwmaFWDpZuGM7QwI7olrgqj_OaF9mnyL`f2cAi`W4JFpMkE!3{fa&c<=Q|7-L#BD*` z)uPZyyEJ6KibXM}SrMA99iZxoRh@OaN?ud@edk@VI!Z3aoj9L29zK1#i+{HJMZA{t zLZA9AQ=WbN3*!~x`*qqr-_JoDk1FLjH#hfmiFH;*n}sn8OO*!*&YD)!)SWut(`lw! 
zeIXuDl5&!b>IJQWJQjj1Id zMM~q%y)`NpTTny@_ZzNS@SUqa*orW36s<$+?Njxpm`hm-!h5BNvMT zr6;Bur#*uAn$$EHv0CcXdlVvvX^f_T)#9U$4tK;u;&E=`uL4uAEEQ(_e22r=G( zqLPyPSXt7--t3s)N8U?pEG%1E9y>RCExFa$WON!^gr2TfDcVW*$bWfW}>mY zVMkQXcZqV)XyKF)Mi2}TcC{m}PfsuJmg$l9L<8#PLf{sA){&(vI2_{G*GlI4U6Qcq zh`1AmRBF_ngOy>fJ+^{!k95~(7G`)+$TJfujhbZ>K#U)|mc#AMawxVh#$p8+f>5tS| z+oIJn`Dzn%q=!d~*wzWyjjt!_MU-vJuj^ci!2JMZ^-~H-`d5b$BVLV;oPy|1Nb8|( z+6rzB&4o#@CN6VQX&5@pn%vynoTM7Hy$zu#a&w=B(*mn^dIH#kzLvEpW6;MZ^hi4wB%fWq5r!f2?HYcs@&q6tp8jxBrMe0Z|n zWE~Em@;A(Vv>-DQe-c^XTkB|T4&0ox&Yn|!`0UwkqKxH$rMaG^bL7wkg1<#Di%Tf0 zQ`bukvdCJfusP(#@$lt$BT!;363!Hu-Kp%%pHg&Gq7%d)Sbh=VDO^gth8S@v&k>&D z%6nx>)NamUYro?8z! z2}5wcmE}0M)}r(4{kcs+9D0{eV1}%AfV|MJlY@Pl`y4PK4Ds?-oa2B&4&J=vQ@?UG zo70HLbM%wKnBa3oK(+--y#*#6m2z_`1qS3y-j9fzHd@sXk^A+rS+l)FUuneOS6$Yi zrn0f?%tHP~jzxdG+qTtKCSpryPWB<%uqIUIC{A+L_V-9Um}!ESACglaB79BPuV23& zKuRICGr)k={dp}-CQ8D3m>m0wS+QxP)#wnzUJVi0$mUA%l%697E$WzRmZtNC9(b8C z1yRD3^cZni@5(+Ews?4h)Wld|^E?BWxIr-G2H|0HoIcw8bkCUpV#M*kpVzW{iYTyz zuyD>yhT6N+SIL)cI4GD2Pq#qyb704h^b+R^2PjH^!%fj z7IIgw=Ay?T9xaja;BIzc$C)%%DJGJ;U%os?uylT*c}W1idW!{lPQ`E`d$BM36k-)w z6EP-b$nD4Ch%IMAFi#i7bnIB0f%+T`d9I*~3gwBIK^JrB0bO6|l#8PCVVAleHg|`i(5AoII*!TfsN2jyU%2+fz zP9YPW2Zh20Bb9A=P<`8LvKKj_Gw0|{S@hRp{^wZ^GH=r<&U4%do#i(#Ei_vo$1hv$ z=jX?+@p*mL=##kNYGo5*_6C;fGaj?HJ>G~g@^V?b2rZ6y&jUkmVD=~6F|(ZLA7Rr> zcR4TW+#L9xOLUGq!uAZEkvO(Y{V1ajb3?OHvGj~ z2iCh1ZCwtc+wIBOSRtYV%E-k2c#fo}FlI?zaTRMZEwq8Ku-QvOVF8m|i2ykUmWcvN zT8-kh$@Rc`r|}nbgok2Qk1&A{$rYnQ>u^jl?zfSMP}I&!26C;zH!09K?p`gtZ5{__ zEB1^lz*m;mdE4q7V9}}rN12#-JNSi)`00b4)tG!NrEq6r5I!)4bk0x_wpqc~uQ#xx zNC?&i?ln$5Xq=IkvJUYGn4m@q(H7>|a6NK^IwVahrj#H{chzk>QNvTn1mxd{pKEMr2YY^%?@H zv%OJ3fE{R8spH;{ZbYWP6~U|)zwCLfz!K5r>dB{rsDmaH4q}89jVcD`F+1lau!R%d zS%}lD-mc>cAAs%`FxO3VSztGwUXKC;i^J1}sBJqS5u=N5&d+5poz7kuFe#n7Lcd{{Q_Sc47x07w4e e?R92O6aPyEl7yJlnOKo)uKx$2_yPR@ literal 0 HcmV?d00001 diff --git a/posts/xgboost-for-regression-in-python/kigali-branches.jpg b/posts/xgboost-for-regression-in-python/kigali-branches.jpg new file mode 100644 index 0000000000000000000000000000000000000000..93e9735d0a97b114b9d11954e1ac7c0d1d5f9523 GIT binary patch literal 79493 zcmeFZby!u;7B{{REg+3_NQiXTp}V_5k?!WuC7=jMr;^eQ(v2X}(%m2^aA*ZV^4)m9 zdhhc-_xt#0zoLv9-LkE-WrLylt%jKt%;$0(sDY z`w(of8vsB7YbgZhH%15Xdk7wY2woxJ2Y`?Q$iFcFv_Q!I;*XHW|L}ksAe_HtQ9-^z z2-j~+0OARV@L!$ZAd<~i|G{V`ATJ={;^JTl*n$povZUtZ zyd!}=ywmZoD<>Zt9~(%%a}1P25X#9%&B+ht7KHK&{HGLkNZKE}5j;aO|Kh-KntzWW zoZ%l#4FGWfa5oktP%T#+;-B##gvBBK#c6T>@)yX5P#K5&n|_BI;_&|BjsysB=g+$_ z_a-3x?RO*r@h^QQ0qHMZNkIL_2GHXP_x`~+AV&Wy`z?XsHy!jP!a;)not?;pApV&U z|GQr#UTzK`A>rTn9l{@;Z)_Y(@{%>>k-)PJnTu=@U z0QkSJ0N@}209bhdFv9=mNGtB*69ugMMWHi-iF0KN#$W=MM(G$@>>00(^h)gEWBu4+eRl|AB@6VgL~S zi;)0{zZd`%|KdCU0UMBBNERd;@)W`gp@O6U900-ZNkstJ;6Mj}k8ls64?+PEgfK(i z1NR`0Ad29Sl)*_u0H>7?G7DLRynti_$PjW!DUc0VLxdp3fH&k7*p2|G11})YAc_!7 zhz>vk*#t~M7uW;O09S}Fg#R}OKSTk71{nYdfiPePs0JL3S@2vDLGSOe@Jf&c@=12P0;0Gm-2$fDDoYU_mH>RnP+}5Ov5;Kn{opj=>;)1V{rDaRT z9I_J90W5fy&LJ> z6Cd;L`hPuvsm)By?Wrw1OdY7bY%MIQ%}pIG-At)@*tmEd9jUF@Ioa+EyAKZgE*BBN zwf1gVz0(G+iTCcFEWhdCVht`sf9ZESj5|O7%hxEs=^(zx1={eN|M%eT|D{90dHPM) zzSD#Imo5pemH(y_|D`K}>*JjbcRDriM8NW=zdL(DjGqo(Ex=0oPaF7byQAaM|E6O| zM1upllesGp00njl`dIJp705pYfQJgVUl1NENJ%|Y*Hn{LP?q@}JqX06;4bD)oud=X zO;b*iTHnBs8uc5fgvjOs{dT?@W0E?nSb$4vfk<1yUvRdcq!NISl6!R zyFkqTZ@&ME&@I4-2V1Vd%3;ODCPXfFJ-XIk;rl0`6dP zJAg~54?qnrtla+(exBd_Qh+na?E|=ha@JtV!$27ckaDNN62uCC6JQQ5#w;M_0vCNQ zfa}iBpe$H^PXFUSmVeViR{=oe!|m;D>%VD=9{`}59{|vs|4n-gMhGqj0Q5Jzn7Wz% zRtLs1fM8_>0N;ziU^f7xXAG1va<}5<`rQt}lmt^}<@WXmxEi1)0>E|L?d@gm?d^3Q zmpx;R?GOsTyc 
z9h}^u-Xb)=l|w;%M`ov?{>=ij7opKtQKy!2cC)1BW8+}spaG@bEUciK(z5@^f_EY` z{}}1z<;CX3&F1W8&CV$(D9Fyi#m>dW3UaWz`#8Z&y;+^yY5!=D2II}m)&*wk>_mO1 z(bUY@113TP_WUoC99{mZ{jX#H!)OVfR{lHZUCVDb{b6FD zmeODqp}E_4vT_Kra`I{aIX|$0`{+CE|1dhcX*)YRh|v7&#PP2N6=w@uE1&jxNsbwlGTtM^o#&zG$fbnghKP{+A)(#QwjmUm`$o7cNQige$Vf5d#ETVXc%bs(C-R}Ffq_E?{4mt{7V!884(c~9Tf%jH_QL0uiH)l7Y*VC zIYETb00_7cL|n-2r{4!G6hsK*_fGlu*WD@o9y)-ChH%#|51#EHh+ww}pt&f>5LD#f z2YFloJj&xevTR-VQ! z@8{5?7~L&g+1DX=BTf#5ti@NdFel_Dj_1ARS1IXedfMcfb(5cGPw2E z!%tmR7ItRW6uBT~Y4;uH6qFvaOCyk^Zp?OV-6`Q^V7V6dtBZ3xW3ntJ72D>zPRv1vPA zYANBL_B@2tpl93yh508u=FV2My{k&07Zd^BW?f9GfeSy!xD=8}_A~9hEaF}|kZ*j5 zX7cRK;z#m{k&Uh?>f5c#_4#}}J5C*AL(B!m&j`(Q&=w)FNnYF{?Md+tVv!EGw3+9a zsd8C7)O*pSsXj!gBmZba)o2@u-78j%qDJqk{)LRJ(a|o3g%e=pKV4pJkk+GO3uM3d9;kM>(ppbh5Bt$r_4B;)flxh%Q)G() zsWiH_`Q&_X?T)ivuiT3)fHxTg@8AoXrn=tm-A}?t{MIyuq66^KrW2k?G%?zk8^BpLaNNPBc7}nVv_mg#0%Cjh;Vbr6r zJcJ~T#LlJ^&t^hrum9;iNSLg4MpRVv(Eg=Z(n2;Uj39xfP%jLV@chULGEckNk;%ZH4cV+KxZox-KVt+30 zCm1*7!FQ0Fr>Py=p@bjm|J>btXX6vPO6V{Z zre#ffou-cdf%de5j+E<5dd}L(Q4%J3YHsPdc*Gn_V!_p zcdn%y7mL24#_gQmvsO!Ol&w#_Q172vco057FMFq@C8dFZFVbP9d@oC2f!{!~%23o- zLUt^PmObI7JO6DT*Zd)lvePaOOKCxdwrP>^vSL>&g5L&GK)sOY^T5hv;nBTxFIr*a zI_<)}c_Ay^#mlssZIUY83$JM1EbZja6$Yg8eI-iF-PpPWc~Pw#TFUqiHfvRD=gugyDoRSX9|2#aJ*c;kqWnYTLOdQ1ckNM? z@(kI^U{L4{%_F=x72o|i~u#ra3eY!%HO6F6jTd9}H@<$f`5}_* z@p41!6mOk}o|jIP{K?Idn*`=?7##V!Aua^{xd%(CkV4Udc6~qUDHmOz`32&2{ls@h z7y~M0??qryVkt|M_vN;5g_z#h=glY;j4GQ{oQRPAw zRiphQ=^5&6`eA;j1B5v~*mKdi#>{kS4z1Q70+^kQ+zPo2^)KAdCIV0Gyp#BM7aOuz z^(#Lv21v;;tfJD&=2l}-T=EG)a{EjA&kuj(z_b&lnly{3ihAGHndr!9iiqB$^xKdj z^dPu)_!LV@kZ`Nk|Fm#v;Rgzl8{p&FS=D< z&}W4S*>MFeEzN>A*$v4JrPvOB0hr6JnX1}x$L~FNS_htyCOn6+E1pDoP8fR9M=&Td zm7*qaALK!fGx0|qk~gWbQ4y6$bFxi{zYot@OWByo29~tquGu?^G)cDisXh z_#+x$XLNE3=d3r=5$yz1@xn$<3fXzid(6NI;QV#8l33ZPS;_H$sHRAJs*^J9G(5> z-t$@-%4r~hk6V>I7oD#CxS@=*U!)4pObbI?>iQ+KHmgV4K6^|+>DnU7UTSJc(d)=4 z_7FTCrArXd(Ltn3@B;w4Oo^aMFEMh_Euu(ma-nt0j%oKrf05$ykk=kPDsL+j28(O` zG{Zz-0rKT}1@}CxF1mE5j;7PRnmgg47&IQ<2VC4+I*H%R?G#KlR(`&pbxdnvWaNK2 zfvCotuO$pjE4{Euf)F`n=LgU>aB;xqH>U_u(VwJ~wz_MNDw<1>uAP*~2#B;L_-*V7 z@5t3U@DHyx?WeQb&)e-*&KmabRKN&0Qqu}^=9r`-+vnegL&OE?Z0Q>*Ee&=$3Q|bY zXj8sJ_QA({0L!`2CyJr2KnBMS?!6{k;~({n$P_cn6#iC*wnc@>meH26CBAXVkmn}y zM-p9A1ZLk7eOXq9M11{R_%4SIA1!gMt*q5lRDLy^pDGE;xsp$vt8}p{G0r*3hWB5m z?x4QknoksDRT*CJ_6R8AQdr< z7}AT-BgyQ5`Lc?i?ORT-7FLZYnLbY~FiF8KBVz*k3KDDk*j>Z6&1drMb9Kmv-gJq4 z$7(T5$c%Tk3=pvjpxE!oHVPST_LReYT$xVhJ3ZT1zg_t-Ys>CR*mZz9a~z@N7SM*( z{Gxh;z-2yhT<6S(7A@mocYF^TFqd77T`lIY)!-qp+VfCGov#n2^22A6f&_{l;GO$i zu^}H9fm zJ{Eozqc1a{#a~pHa%1eu0&W3tp_vka4oXI*v0PRwjfd?-$v52sMCRe}X2U8W-v%Ys zwMI2L9Wt%~6&(y&cvm=eSQiqG1XW>{i~VK`I|-YxJ~_!l2V{{CYS6_4zk#T=>-rsn z&jp{}xUDW;+7B_b1q@(IZ8hHlvs!h-?5dY?sYM}XL8l1=M8hI$bQp*gq7E%dsvxpa(Jrtjc796Yl)4VH7n+{?lAeD=*GNP zf()-ic)cUN;=Ei&ERBB0WVP`-k1|bP#VH&v!x-F>C^6{6R!Qhrv!9^d!NN+}+5wz%Q{> z=dWR3iio|EjW?F`N~?S~JX)zd>GwIr_qNGEoK6p$rpN9Vcqqh^QJU%oN8k#<) zM=ZUSh3Zb$H}cGj+^9A9%5F(WapPWGflN$|JnAX_qJM6q)C1#BHHmWpVRTiAVqdH` zD2K23CbpPgX{S%D(1-CaEowgAu7r)TI8iDvjyx<&&R(Po3w<9%7cJXI(QNjOEO~fV zqvrhdW7*15Nm2TuVauXlq+)yqYNQLMW|?`lF7=fyEwaNgM-(+R5 zRZAx=18jKkVOG3ntrexHQhwu^O1_vdyE<%LuYku*a)@(5( zqwTdy?l+g5^$yn0v?cT1A0Xv*7!#@;Am?Ye&+|;n65}$m1L~`aVxXt2hn?H&snUIJ ztDH>iqF!mVaXP6o+$Z{^?7M_z<3Xj(d68Np; zR8Cd-nX6n>o9>1cJl*fuFZ^tH+&lBl5NwIaz@EV&BYu_{4<+a$hnR|30BKo`mCvUe z79nnZJHAv}{|7A-v*>e*2g65HHI~Ai3%&6rml|M; zpImZ24L@c6Mj5$k|0PwRwYM;9+EXdB1H)2TcPgxcn8)o)q(Dm~SD5-dy5DoG^bYY_T!J_e@&|1Gq1{0a|mn^ z*5pnXdG&Bce0ZvZjG^HL{mPf}7&Z^;UbhdkPOG~Y+w3Ev7oD7$xQ?-@bXVuc-SMLQ zTF1MgagOOP;b@ffDvJ#{O`mdR3nCxvB%m^%oPOG#6Bw8(c3ZzPHTo9oq`QCM*ep@M 
zTHs>1>b8D*T+!F{QAthjkUaydE?!l*1o{GrrM+*YA^$u3D}>{8`~G zAb7UOj_u9)8M)=@dh=uDDI>~#2`vSW82K^b_k`ZpjUE(o7f$Obd>-@?w6#oz;Tkuj zS)9a=eb+4de!(OBo_QbKFYwD>GIBaIa<0mGqa^cGLib2TjrxF65GPEQ3NxQVF}2{s zv4XqIU|VZYsQ)7KG4m^eOxUFHBi(>u$Kbhm8Xbj4%r7isXwor%54R z;f1uZ@pO4G2-=3Q6%?AZWWI*-Gjd(tOs93>k)_hhwEW;{H9Zt1*|n_UWpW&iHx#h% zQJIfuUnErbv`>{6Owx9BjwJAt##&dr##%V_>d;0nw2s?gsD{!v&A#mdev10-=5pq; zW@~C_u`4#Z=ZHOzA64sHl_5yHbi0;o2;}jT5O%}yt-%9g=HZ~0d?q=GsV8k;-kSi{UbD1h9&?;)%y}IB-5>iXsdTl4G^Kg1#Eg^ z1w?bBIDuE5=!Bgxfxdx>!bvwYWlpUUDmP&YdPfgD=?|Jd#MeAt&dE79#UkmQQ~8l{ zIhJbXxlPjVXB5NreOTAFxgAI>yb{vz21o>5m#D5`28~~fOPbNrWG$VK5{vwtCWa-f z747VdN3y4YvrC_;j10vdM$FT|<+t2#&;-wk$9!LF^*x}ECG|Sfoz8uyvm1qC0k<;$ z71WS^ea2d|sz}q=Y^HwNz`3eI^8JBE4iycWer`r^MP%1Qyg_l`+>wE~>XhJnGG5@8 z__rk;r=Bc_n-}mlBZdL$i<-(b1B_yd(cs+l&ChTsG33Xk4Z_mbAUFzO-851^pmz{L z50&?hHg*xSPXX5_54G~wg|-amcF1V8b#^eG3!B*Uwm)8C4_p|=Af3k%4O=N7 z6fQfCq%>^WXt#JBw%GelYi!iBsd4@1@^}R2NV|0|+9UJ*p_pX2%_hT+!*Jkn0yN0! zu@#)oV&h7zu(aODxwt?zmF?v7%7pUa%vg0NZhxSJSaR%Y{*`ry2#1L2*%<;GU&pgw z<;&CjBUFimgy}AN1G75y>Xv#H7rsS#B z^?Lo{HxLySZSpuHGP7u7zLzEUZM$ZzmLSK(SHa^4 z<$`7o&BK<^Q~Ze{)Jv)lJCz39IX@r&B*JN`?-ChM>N&W7%(T^b#u{(1-x|`RZ&Ff( z$w2e>4qXGH!WiEY#ylU3=XgPd zpFb5}QZ84QzUy-=Y?jedv-dFEK#c9KftBZ%To{P z2>b9dg(R=6&HOkv4K4Ugeor4qoc8J=>?qUB_5<@xSMPY|vmUh}Ru9pkbs`Sk@KUqn~8mOg~%W2x@Hv4t%41<%L?*;V|?`C%j-@ zw~+iqTXP`hW@?P5T7Y7(KhU*I;m0j7nC5QO?~H%W>$Z5`zgc_VF0#E$QE2$;OBS~X zd0Yk_-I?}d#2+5XB5&YDZidzim%V<32jftIH}W>C98Hw#XAechlU;N+1gVrTH#Rgv zOY|>fdMk+rnZj)=r|Q0m1!f8E7$Ffxk3rWma9rd%p`le@eJZIv&isA0cEXxUqw`E z&_fD+Rg-_Nq^aizv#7?BJrCldaHlHAM)Z#Ud`Rtg7$(d~h`(ZLXjCs!n8It_N>BdL zhY=7s7{tINPl>`VE6bl?>}la=vV790J6W;7A*IhOJvJx_71&83+JSeo`BQFk^i%ts zk1L{kt{WrI$ch_ub`=-RjcB^m)G^ADN z4U@c=!lGx9)#kRO=sD`-X=xw2ugWa=kt&hBqVqN`GA>9&^68&xG=8;fJ-WYPoL<58 zX)QbWM@7ukDFfyBlbe*`OdZ#k<* z&`T|wwuN21(s{RcB{qcxX?&OowCo-x;?(fDt)37K#|KTzt@aLwLddG0S5ipHdSbpJ z${5)CsFGQzkYst#4&e0oAUqmp~2dgKXNtKmmzH;nqsO=A4L-d(yaYTAY z{2=28g{5^ek{Z6|_9N^?&-kt)mn4sGftNMi7HLkA6w5a0$DS<-2FZG>65;tK1uryT zCQ^}Z4G#xa#yqr%DZ0Q4e%qmM{eCKi>Py1L@Xv9tMSq#iv#|7#MfVu}mmQ_0b&@oJ z9~m{~XFSm54I-Hg8a8?SdB z8t8S11vnV~A~uD`o^N+iEHB)SdOBwDW7Yt43XFSWm1f{=Aqe7%y{h{6?h?9nAXnWd(jdm-s_%6 z%;r(IJ9CVQO&x__utn#HW_ZcFFsXU;gw_3{4)kpn_hf8`oTIIQV$VSzQ=`*0)J67N z6R-5iG;EwySDXNqu8FqEL{{u{%y-Cmrt-PgkX@lnIA>7;@nH9I}6t)QW! zJ~6~o@YPe0ZO`qq+5laGYE&s-a;3heAK`Jqfp%Rf`oQWB3-9YXCG}cOl|9|gqxWafaXzWd3+EW+ASYf%ZjjugLxsI~k6eN5X&(Q1{<>`8aq}M;W zKN3%&@gs1i&M-Nf^vqGT;uH8qJJKLI0Xt!nj7@+sZ==gZUyF(;G`YgbKDK0zIyH_v z4zJpj*fdgpHY)l+&t!d}qn?8PeX<`55%!gwi$5@7bgG`=U5F68nKCtb)=*5{38!1> z1Cs-@PWuj{b<5&cPDQh%B=37v>u;zmJ2)Sed(K>>&_~hX`C~_WnI$Iixm!l6iNm-! z`^Pg03h*Cxz#cFO;@;HJp3Q9cHO{~C-&q+e%Q0%O@G;?5bo)R`@`MpiOOL|2$(P`H zS@9H)FnUCEf$LrCfX0ck|1W5|5$Op=7V!iZ5%|_5T|B>F*MxvPmVuO%?u?5w@xtCi zgo6cJ?W{K(n{PoKjuW$SGX9A`vA?)1H0Ia&?2tcac;Cs5e>CajqD7{Zbo;AJx;Hxr z4c8SV%XWFRv<*Rr1=p`#wAJ6Ss_QX^0OyysMArC|-%v)l<1-)czYpYl^RX7j;ZXb? 
zD~B~HRyKOrqe@tQeFgJV{+%2ij)-hxv63;K!l9P~g5$|_ z5Pa$ls>)Dpf^s6{28H~eNzYM7!es2^vF?rTRuPuYADNKwvT3TgJ;P*En6~p!{FwlKw_5=k8qtF;6XyY=IrVb(V{IYKpkCWIGt#;+zj<4 zwewvF&nb&eP{ww0h#Od~*8N45Ga@#r`|(7pu;kT5=<}qr*CHJWI_o{uGL3qE$8AsD zF5NxMYx`fw@2peI2X&WbXqB(G$E!XiUjK+!5jwZAAyy2nugW_kD){1Dkd5TU+gq4y zN4vgE-{IUfpgL%nSelf;I3T}x@l2C4RTqv`I#0TK<^27_9`i{iugH{^eR9g>f>8Donb4*cQjoEM~{9qg?I%DMeRDL zw-oYyn~Zqx;wP_I+h$a}7t(Km1lC18)hQm=`|GtdiRjxF$)?ScMPiz#q_#(>RtZ{( z(+kS%C!E?NhYAxljPM=6KcJp;CDS;(ZRoq>@YC9u#v8W%x#9-TFInKvR1j)woiHq> z=qYlk&J}yDqVv)VLM4qbp$hRz!@a)li4R9yK6#E*AuWs1C@H>0{P0PnUnDfSqR(kV z4uv#pwXx}?&s#X(1@*J9Uc_(dVMRK`1S90~kb~H%=gCY+uvKRdcJs(zis~Z%ibWNW zoz#{O&XY)R53vY8=k{A5zP-75qj1PTAoyN_GUww&&CgAHoFV4Mo3V%_Q>e?xk~6#n z4rNG=hdfU6Lvh+TqC(bCwxylmz;q1jJio|JW_x|K_aTzd$ikq{+w{I7_f4bJ$ett8 z#W(0Aor61B?!0fo#x#hY7+Xb+zih`h6GQzNV%r8?K7{&9Uderr9&g)m^y1G~+eNpv z-iecSO?Jw7zPdiD*IR-*FkTt=Sw=H4HxCCp+Pv2+keV2?j4 zpHWm-)gAY~k25_J+9KY+1u8ovspA9Yp@Rv&isw{?p*J-8UquG5L$dm`%9xgNkWYq( z5tJrQlS<>OzIF;i(O9kAR)fL(LSL@FzOqq|Qx>bFGmS+w4k&p>K_`K8$?smMX%qbp zf1y%@LKaV)WUt$B=3XB1%G*m%5k(gs4;3Azka0~j`u&4edwXMK$?!t6fepll^o)+^ zcQMO6I7_A_C9jtw9_gmAL#8NqFDyJKDt}ql9*}nXHZ0kZW>vhZW~Sbb=%?%4=(j|? ztVAQFZZcdK(ea#HERdTipYMb7u$!!yHLf6cbVMq)FnxJHtts1VO}9=GC>7#|yV{I6 z^74HRI+yDr!pNa zzc4E>+J((@Uxe4JY*4$sLm5!^9FctV)uw9xjIMDd7)OPZrjt=Op$7%9_Q?MB5d(dLyPi&D!rfaICLE{(9 z&ozFv`oW^5K09N6%R~>)O%aMjMLMHz4?#gu8*yD6i4$5pw*-ILpu(=pB=pr~47OLk zlKl!Z1mhOqG%U`g4$K@W%$xj?okjHHM`zuaGwkiJ^?Djp=py3{Oix$r=}j>QFyzPI zB~xro-u`2U9nRk^Z&$eT#uQvVM59>mC8Y4L?psnvTVx!%<=n+$>78jV9mOmW9QLfR}= zaSw(9T8xQ~CIZGB-SuRUQhlBKdN9Wy#)VFkDji^De7&ygwIuzPq1W6)pg5hARAS$z z{hC`+$C%ftG8SD{%6fy=knJ+0G$mfX@#8qLXiZsGWmU#`*Vqk-u7rU-0scBk$J_5g z{O};DTfh-z)kxFNVZzCf_*2p#g7&9!G(?d%QCOWfZ%`#UOB`k({E-5z_N*cT595%c zMYlTltEv?1UBlr9bSRv&NReo|*I1D|c62!vVOf#1acr2B-Eq;!F z*JR16>hRk+dBLMnUopYQ_|?}IxCm$%+&V=S|5k7O7><|aXRmTVPM5Uyp5c^=RD=-O zl99#q`Q^3+3+l{s$_(>-FRIGMMX_YER=x#}R&c(y zB)gC?m!2$F5P0RBud9j??Ijsx3Q;wyy_Hjdqjgn34?OSM zKbz*TVNVki^cZJ!;$+b)0Z4yWqoD{eF>tw^7AG-4zRFqCPZ zh|DT9r(xFo{CQq{fQ}wLe1!4%Y-$RD^!vH*!dX@Dl6pBd1}@?F$=C$+JA03a0~S(%Y_yg$=NNf7 z^r58oRX|j0tWZrT=6 zwgsy&^zU`?-4hbvyWPyP;jZ1J!T86~qRF%>4Og;0&{Setim+_Te3sI%j<2ffT-PTG zT8h_}?}JkvJ-d^^6a?q4BX^m7IrnIpE$?{-s>OD?x;QLk%huaQk}WT z35aQX>nWNaB3(Wcwi!%nWK8h6*O>>GXL);mtV=*%$n zd((QY7zGyQriFSM6f@?#(s8BY@;4LHkcQXNoqTW5p^r+z43lxP-Awg#o9!MkE%y?& z7qD3@GF1$^?_Cu0l?P#kJotQGq~MrY*m zwdE70w+=!zOvV`*jmm@%-S>2u*20KoMaHt$-1fU2bhL=VN~HQ9ihl2PJuC|8^g;5( z6t?);53T7rwA-&(Y%KFk*{{}o>Zr%$j@-NG8f+O-Nc)w7tQH&fgc>&lGK=y9dt0x% zQDww_l$~7vMh-(hLCf*Fn>?|A8}SQMJv=Z@_#Ukrmz)Oi6!!&Rf%h%Yc!6Y<^%kSe zs&Yvv<`~xQ#C)nU*r1SG?i)GZ{%b`vu3cscK^{#mL7Y-GoUc_)7)M7(;HE@`I_RbQ zoPC>!mIz0wpzzm4dsiyg6`L|I!ajGs`O;EW4^0IL4f*%<%J_1sV<-`CB4)%7^oeKl z2_34?^hxL%%RXhI>&q)ffq!&OwP+x#5v^34bMWwFDrrA8p;FOm)OLdomOTspK19%u zsl!_ULyS0Vee=@a<&jlMN~`T*GutBrOq$FcCv{qe8iLEWc$2EnieaY$J5FDv378V1 z=g2Evu7j_+kFPK7%x-}fxeY${=Rd5A-uF@Pz!u#KmC`kXbV**%QWu>0w)Se1NSgmZ zak#QKcgoM6z0Q*x;7n=H6xKN=Z5hW2{3Vgla@_j!`YDf9M@&wU%=zII>Z;?Ggat?Ev;KzcL1zyYP{8N`hNC@#9ZISe2?$}&Q1oA zwoIoEWhamOd;naqmq+R;=G1!RRzvGUB=M_u zdffh?8;YQyn1u?~o6P%zM4VN2(r1W`fi-=CR89*G5B3*lDh?Z*{Fq^y_bM9G?rl5a za&7W(%Z`7z;H2$y;nf?l@n-N_ljCfy3MDI$uUMFkU@ga_osiL$fJP4mth zLOiMcRJuy9fQYp9&BQg!dxFc{u5;PklAMeOnl(t&B}gW*b1@6yYEMN~ zo5|O#(brcp2-g<%`47IdrN0Q7Hr|mT>=34x4trBAf03U+T^j877W-vyf!Vv-DNc`r zBFv(-*l3y2vZqI`3~54rn2i($zGS*%`HkvYe$UZv0esxzHS319h>f=jS|LR6nbgR_$Xr8%H|62O z1UW|4)a$R{T)f9NTbv3a9LT1sjhfv8BV4Hc2i!kLJcfI?bG6*G{S=nv;dD_p(mIOj z{nd=Lw*VWAlqkk(@5B19*N3@O`tC(aq$_lNDU5}+}j^N=7Qp8 z!V4cB8KFvkDb0r9nV`aXT4j+npOo5uVI-%Vs6HS=6g!()bTp`))mqu(jdpGxpX$?+ 
zb16*9k5i5!q0kT1Bt}5{g(yx*QVO8dJiIvDuPTHp`rj+t^pDwm-1}0ix><~7yU1pM9K@7`o+Hub0cT-S*-vUdV;B&=JTu|q9cgx(w<%h%30sUNh)+@WYjRthG;Sd{p`pCzh z)Jfnp8QcUq>J-|j=-in+kE$2qD%s>ShxW!!`xy zJWnq_!gn54=EP))j5e0xzLZI;l)t!6p@T44h)I_uxfFFFp&AQ(cJtsktSp|?jS#^t zxl;6V+oGIZe{@+=;#n_e(4MeB+xXn!k$U-z z>HR^$M>}>G<{E%6j9A3s*{kIa!bhr!u{ z&2f?^8HHs%zGOz=q8B}x$&K8M%LlezIll4Wd(Ba{|4P=lOh9`O8~>XzTnXVC9_{ZP z5;t+hHUS@n!au+8LVO~aX?4?+(-bugW_m(LN+AyoG^M-R0hy%RBA_+&x^8RQVB+ih z1CDYOpIZQJuew-pL==sxFwy9TiPLhlZ?nr*kmEpw*K(_GN!^O$snmTrBV6x-m$BU} z#j>)2Xd@~90Yo$3Ve$nfoc*7K%$d3i!UEBzCN7|}-$i;m@VY)aZhYLY z()!e1nyTcd^JKp|f7ZxRc^cC=GrjI}{FDNI22U}E`{ZQ!>K1n*&sY2O%=w{fcth<` zisIMeg;ew;$S$ipwXdG@uTIpQaOzXp74GO~(PmmsV?3K*t}yjFMs;Y0D76EqV`=wu zyg8L$BQ^IQP6WDULenwgI_v~WJ{o3m9HqWm-49^>;2uQVKHt#TtOILM9lwlM;pG<> zdzF08A{hg_AQTo~82*yqigOU>(^q}09(_+Hxh^W|Gg5Lx-3vtRZghpu2?(FHI_HVJ*evQZ8_JYYDI`4Z@K>j6z$-2A)$@MAQ}Wk36Gj+| zt?^a-tVIrO$v#V6oF+1s43iHL*Rta;cbWrYN@Dtoo^fBcCdD{3ydvbAmI3l^tWtJ2 z`EBVHNgmsUhvek{{H~qv#$EQjhr~VPo|aW%qv7k++!BR2Jlyea6d*opp+qGB+%H!? z7#bcRWs2~vs4ecI_F`Ioe*X-bk*$`kZLHMne=@lK{2S_9igt&Nc~AfCOS3c*y5|14 zS)FbI1r;hI1={crf=rca20L1rMg*l}&9rs{II*jdI=IyR*D!b6=15Pf*CY(R*wJ+7 z=AZ5t^{BkE`l8#@uC>HUG-N?I10VF6kV#-v=OB^PE4C;D)RdSR53rJ8O7{|%lkd^} z1kOoow_}y{94}9a3=2GH-j7ZM$|k$(7$wu%F))-!T(pJe^Rn2TT-&PRLi!9y4)gXK z{fZh?6IDxZ=J%(X`*uqd2qYI?I}ZgE5uPS86^U=d_k%CC!;bB93d?&-qImjhQ6~iV{kH+aVy#eJG>R9tvCw(XMS^K` z(lpVy?k@1{+v!6{o z=G-OH?xqdy-wI>dc;m}o_&rqkk^4GxgcE}V?W#SDSI0p%4!zV6R+PAuNO#_KoZn+> zOV9El0Hdk}Pac)xL`TCIUxd!*u(Y_EOtfBJmpO6KAa|Z=s9D6^pzzC!23^u+&BOri z-gf<40Phx<)VjA#2p2J}S~wAY=^A-T$UC!Q=|uX(8er|se62*^yxRAHjL5q-){RTu zG7tFVd9=Q=H`n;kW;unal;nW|ZBd%OzMQ(bQ;JRY_kKvPVq(=6CLq100)mxkG4 zsb>;86f*Z6StVlRA4PZPhj+Ks)P$f>hX|4H4)n!KtcH2&lv-9e5uufoM2sk5osW@{ zzL~%7b1J`S`SMJGO1m`(+DhWh-pAS}rP7m5r-7EP4mo~f8nyvHAev|XrUqNvmygy#c$WJK0 z(75O}M6RQ)+LLpW(JB8DH|75Tus~10GOhsy8|S{bvOAH}1@NGXg-t{eFv__@^+r2` z^W*;jeM+bBr|}bG?T%x3ud+8T>9*3|FWWBO_FH7p%9CAVrmC5zXq-qRV+fKm@iH^? 
z$<~aaWa^2!OYI#8w*^t@jGPIi$4NQxfA#IDBK{d|uy+SL+s}IYa6EPn8O#Go6jEX8 ztEr<$jhRa$bfkc?6*xU4m0~n8M&$nh5BwVVmJhYo(M{JXN*IMCsBCFJyf5@uG-zZh z7>oc{!24%Oguc|<(TYk#V;Rv!#_shXmz0(#I?=}Y0nX93uAX5xToGjp5Dq(Ns^L<{ ztVW>pFVqjtx5IvtBsB6!z!CJ|`Otg3@jXAdG1ONhzPh4`Fp95nQ9{Bwp%}-J;ZV5UNTmtz{ndXu+K0s4W|*52z>Pa-fm3~+wh zgCAV0f?Rz=FUu+ruP9th!sET8ZjfTXc@-A1u9H{+^$Zdkkkt>!U*zJnC4e5jTIZ#=G0lH-`4j zF|O3Cq!lCQy*-M4NJ zZCk$ijgs|Kw&-QMD@Ikn64dBq%AVks&PTx~QOoMfu{V7;IPmV>!?zlS&v@H3RMrcO zN;)nLPN(qMH_o`%)|Z!i&hFs0-{aoir83K~ z>m`cq5?~QQR?#V#cF*+YNb~kt3HT>n-m!D^&3zoWemN|Fjw0G_RxSSkd9_`m2~!2i zQ7~4>$!z3f;O9TzKy9}PEp^b`VnE)+iy=lp!Pmnn2eAcCJN@+H^J!r=&MZCW*;E&L z2x?oWKU9hgZhuUa2mAiIZQ_Y!MGf1y8PDIh{+i~-)Ul>?KEXS-%$7UkvC>9G9Z2`+ z8@?JA?mx@H{YHS&&C;x_Gn{uiTGH&PxQAh+xyH2i!x&0>bx|Ql_TyA$!xDO()VarC z%)kIU1YOqMUt_P9)pDmXR6_kSE_{C9aD0QJo#+ho@qGeC8U1zjUDFsnwc;sVVEd`o z=JutDSr2jKX#PxXrl?svGO0Pn8$)U2F?~_n<0s*i1ERM^NA24_HM*Xu;y+Oxk8%2G zTKX_x(cP-w5z?G|hin0@X4_XR!Dgw99AE{PI0xffX{Lg34?`UHAT~58OG^rqPi`vY zU^tJc{{XIq(SpaK+*Tzb+K$n*fQE?o?c>Lv?lh^{d-Af^x~Z;|R0dkwb!e`%V4wGh zd^b6j!GPUYTQzpOP2-HKOBeu8MTMQqvhFhyVhBNk4K1i)vkm@JR=y zCqA?C!^gy}f}WD&vNz0>7Hz{dWynyixBch{tI3cR4G*~}b(5eJcKU_^)ag^yQ`ANz zhGdag21#RZ%IbTD1a=^f;Ey9!Kj6n1?K^6srrW&jk~CKtq9SM!N3!3PyCKI=BcvRE z%i~rH^k(9l@6JrEnpw8P3R{tZ;kI(=_2VEN!e*<4V-yoMi|fdJb5~dt_baxuDZt*GBj@c zrWJ}+9n`$>5J4Tm1+p-F>Q_IA7im`BI7`F5%^8%_+xH#Do>f!y2AYyKRwEv91Aui&6umBA?u4J63ts~$SuG>t**S@HMRFRD-9X?dZs zx`=@Fhv0?VfPSF-UD-9VRMb;mcNE)}$zNHPlA;RsHHh6^x{AnVjJMtsu4~&hGf~`@ z)dMLKLMDw^-YDL=8)qW=I9_RO~2Ih*0W_MXG-$#>Y5v~^X2P(?w4yCFJ$ z3N1f}uhTtrSvR2cwm$ z?aOH1wWiZ@nvUUStcLGtxKcMvv^6D}mE$?;$YkmV$8ryxYp8v8wqEGC?rsb7L4QPs zr*Lrko}T9feI_)WWJw>Tfgqpvv|tBd2g&e7=@Y=N81a{#I4wKo!qxW6mc+PLya`t{ z3o}Gucv*yaJ-XY;{{Rsg_}69n?eK!#eB`#q_Df83^qXUF#|11D>(Y=#EgZ2%il7sK z@xq{Z`}}Ca5_o4M^{@W`1m*LI%$|pQU~Qev)n^Xw6BAFw9jMgM2;(_Q>JrsaF(cuk zH7lRLPmOmsH~y2Hn{$n5o_*uPIz7L+SToY^EvZQz6?#fR0jB6sl}Vful7pXyQ{Pby z%104jg#2UJoM4-%MYlIa149 zkoQx9GBOA_ARpUZK8GhPZa+lwT4|6Y&Cj71IsCHH%f5JJOAi&cNd3l$8vz4 zqQ$5SJGv-POUBMVq1b$J6)jC%G}BbfjqVvrOB=2_Ty`Ljf-!33L*Zg44)}pxByMij*pKp>sXE8GElmty29c1yWQ?OTkj>TK zH!UfJ^mXUQm1~6eYtIU# z-aB%J-qka1jfWhywD2N^$(k~YG)w^MIYOWj^&N8kee#agZ;DCf5(gO>VbsK)!yiBU zYmRV~9?@v4=vLd3mbN2LRRYNysZ-Pe>_!jlar2~_?btSc;Cn^U;44uWHp%=4PB*!4zz;-y%Bh%g5q8SS8( zPhm`zAK|>_&v$McPV3~`byw9^Q0<#%^UbFu&HXwWSE>n7NgAwRD*N-9i#xM1lq#rH zCsMI&0mzm71KgcgFVh#BE2pN{8!kL#-kdmH>14Uu8r5A>73tusvr|0L$t2N(AGa_3 zBn%vOP{nY^RwvrGi(cWe(rybi#cHT4>0+y)s6=HG!740~$O8aCWMyCo2fuxEfb}R12>~KjGB%MhD+TR}1?=Fpdd<`;UzTYN@Aof+fIi+0kXLy%yh`A&R(pLpI!{wiEK->m;C$;Y_h>?g>+TP54vLBcV2Zo*QN+{8 zy-ZJRp1|meUC1hGI@chU_|awOaLG7m7qaKhik7{Z@e>^V#;+I?azf3w6FoSHHhcHc z_S3Hp3_W>i7r%Wr@!i9oV-?ibaWAr?v((vW~PBEsA=Y*w$wDCho2NS_wV= zNF*X(KYV%CQCMp(bD3eB0^P#5__PA+Y^qsAVHmMLI#iLZnOS&_L!Pop9!`laby-$o z!026No~aY9ff)0my7(lGdK)^VQU$gW+^5D!IL?ZyDi2OGq4y+gof1^B1cG&Rt&@|Z z8gtZe`ppL{N$gIuRpSz0C*&)a(zZ9UbQjN)Xmd0SVE);!0J=f z#lx>jQ}-RU4Bn`Impo7O^x`92qWAASIlNX${>tvfjGt^5C1~po0FJ3wr~d#fqv}FW zQmu?_X-yntl>zWOljBXNJv<=YoOtBRiJTtTG*q=$t;MuzX{~g1uq265IK!-9I3PO( zWJbnSoDf0puNrWV8c6gZCZ5W-MR=;PGm45?x~geH)1-gX3Sto=fSvQ>2fi>o>U^T> zL)sHo+RIh;6AfJ_S!l^Z8@hr!kG7p}PhKu@pMgBv?ZLpVK2Y%5)Y!>W^m8 z6*QGGO3_?<9MZEz7;rx@!0KF*7^_6@Yc%pIXP|&_*dMtXFu%}VC^LJuP43gkn5d;> z4T17;{w2qe8p~>EI#a!({-HB95XD( zp>kqJ9l8j_kzH%!$tXq-j{4|OrgO^EqwPh<-Kkdh0Kx7x>iKbBzt^InJS<*SBrg*}2xY?n!++S+GDGf+h=dZ& z0gZa4Y>bYq^v1bj0-P2zA##}e%uj-QYC6A&Uxt?2-Vbp5jjEE0t2W}A%`>c7j6+ii znF%Afe{u{d@q$Od*7Xs*EEt_+B7bv01&X4ra;~rrlHYvlQon}}hQhmv+)m;-Ia^-X zwAtb-E;7?iB+Zo{r83~-w|sn!MaJWtKHRJ7QV+G*w`}+QH6k8{eEM?L%EkM6>>fK@ 
zsy6LL@t)ZB&CxUt_UMubrc+N)$O(Y;$O_0oQ`gc;<2vK?)Uqe}IwyQIO#)FWAW(xN zbx%pn*v+bJt`D^K{RjDPey^cQ@Y^unLc2GM=9RzQV!IgmR zjd2uNNTeAch3tQAS%=lLX>!wXEVqFa$tOywYD$pG zg-1)%%TB``MMBP(&LehlZQpKQHuVKDyHQC_(?(mOWhz0>f)9g_kJmt>mNK222R?JB z%Zl4NhIl{QN*Wqwk~vXs@s%uOQdMJ+j~_>Zdv-ba(o1#l{)2SOPj#T4DmTab;SmAu z{E`CsZ;$R_%)_wv6w14%Sh z6$MDaD17V7BYiBV=eCBKgX1g{fNCq{36AQh2E9X54L9`dRPxHlz!`R9{&>+%9m=MV zu)*!yr$0t0VDHXbx!B;%t9sa9Z*RSueu4_< zAgrfJ<2}h$mKXY=m04LzuvSv6Jb69k*MJ-@<-27j%i)I8w@}7qnp&!=TYPZKRMJ#S zKlK?#SRs_Ef=E?SjB2`moZhCKTJ+806=i!&7XlXi`&4IxxVIRmXHJ zhXxJ|+;;a1xW5(3AqBqUawX#1*AuoZH1wZ(U5p4GPEOlT%Qyy2hS2ujRMZ?XB_!D=a+_ZN(F<(*a z##I(rr2+6jJrfh$5PC@LNYp5E0l_=}00DV`+#DI*6?D_q*tZfU%vT3(e171r$@7BP(^u(X%f+vN ze4^MY%?n3)-f^w|_aursfti2Z;TZn_0zuJ%_mR)ex(>AU9AtUMyS+9pI4WL$?3|_Y z(F6Q5)r#HruIsN+~6YE)hSoBz<9eKmdm%;|M`G2UE|rXqL}v zrLiogM53gMnmGw1xQ!J7$@89^{{Vdz?fVUzZoOHtS?Q>SeQcGL;*JG~RMb8L%_m45 zZrvj+a!v+xrJ<&d6*3X~XnKE4y*ram?9-nk(WyPvhB4Ud#Py;D_ZWV|UtS=`rE+q0 zsi_Hn>aSPZI_ePA)f2p-!rdT4O#sAfb<)?yczOS32 zQ58p9M6wrAAdHd}9+RC$$2c6c@k^90U~e3`@fz<_K19u^P?T!Yg&9n z#CF2~dmp}Zl&&+zr>M7|I_`{ooYK%9x}rqx4{x0lT&^XM{aMys?u@Qt>Nv)Unpvt} z+m9rZ<5!GHwrHxVk=&}Ep8EL+a&nD>_s)cyaZcyC*7{3RO5av-ol?~cOa^0={!mVi z_NAI?%GCu}oQz|>bS{H#%QE2pT;m#3?VDvKmZfT(F=g%uIMn4dmWGU7i+!uI;`|r6 zJNpi!{wqEL{{U~D749n;qM3`xvk=Fnm%-4I8;g%BSb`&?)Z=ASJ=^`9`^y zBDzQ>ogxP^ApNnT8|4i&pt2wb=SPMZe--@wL+Zb82aTev!;L~g`I`jkS)I!#sJ zKljp}`dUX~SfAfXn!eHzpPhBUl(q*cp85X(O=i7oHF{+D$a-YENGIxxs1M&p6c>pl z$d!rh->Xq6!T$i_P>S)h?QOrgVS4+(%3kWM`+;d8R{sDANE#`bay)|A`55p!hm-G7 zep~s&M5N))2w$8a6i2c<(aFHUxC6 zRw&;Ds~lwgvEY2_*HjPqiJF#JW0o0#M~XP#t&oQ`PFi{&$4WL2RYnT zu<+|_+pLx=cK*3jTP=3US?TGaj&?w?mhaP*IQU=6I{~g=QShqrUlZh*&hOYM-YP0O zz|Rbouq^8f1!i8bKWzTG-MaRrMW1QhE^RmS3u03{Jydb2mKZ=10=Qo4!}B z^Vv7|2dXc&YNODX-M2|C%GXxCSpkYCR880(iB%Y9q#CZPy0Q~Q+gXbI57mU0G<#T8AopafATYqeBOM$gi zTPB^Mu7Y+f)b$NdCJ(nPam&(IxCjdY)THz7)2WUJG{gZDs4LXD$omni$-XAm8*Jg7qw`^%sX0uylv0o}9sFq1-9$JQ$2@F^%96i}uT#v1kH%gV zs)epLccFGFND2V&)Al|5_Zs?BqS(hC0KYr<@)G!Q6N@=1`et8 zyqWyX`~08R_8Jdu4NHBiq8+~#8qH4`e#}7Wd;#D7TmIgZy3Ek_;b3q zjo~g=3Ojv8rdstf7nvxkWTuSB(h^=zYB<9;h9f%Y&{?PHVXhMX9PQMX-b{B6&K6gt zmv3Hmj040zHa|nkP6tjsxev$Qr6YFkBkHa5IdXql5d7 znyNazvwW$x&TDC22_!Jc$gWYqW$REJWO!hF5vXOZR#G3Pf(47+v@J+z(cqO-yPxne z!C&*iAN%TUJsCJ1!p~C=M!WYZJYRa6N{%1#7PcL;x?!VMn#}caC1g@S%tu;=mUWdf zGNpTh5rir2p&anZ11cW4Sx>l~V>s-3f5n0U`|4N!0EE+xmm7B^Q>28)A zg-xdSM9{Qs3YLw54TFYUFhBCTpOKb^3U{?uRqfj87FVF0!dRRMRA8fjbt->+>SFzS zy)n3}MZ}BuKioIATPpUd+o+d$TIdo)t)rCyh3*cADJ*T6`q?m30hk=Kh>qf&3?BV9 z>ZGa?E5@{S%$WK=GeZ9We&n+M0Ac#-4RD3M+aH29n|)A(HPYzk;t1&6ixrh**ZU#p~uWd;^gXPMOE2niy!!iE=7h|0ntxRoBl=G;_7z{suohkQz z(W~3l)f+OJn6V3`yFEP2Ywf!G{^2p$MO7}8qhYpkKDG?CNIDnt+Uy_1Gf{wxgR zTGY%6skgAHVGq=yI*A(dPb{zz-vo_(wp^Y%c^r&~z+ilhZl^(APO=46Z*Z?lw|r+q zuFMait*5B^WRgT;J9Np9`s;k6g%~wVH4GHxOF91lu+dktW5Nj79_KCd`)FmVuB{{~ zQ8B=8>9l7DAFhQMF^1(x$Z6uDk<^c=IvIu!+??s{ZdiFmETT9>P^+*&BN zhRv;$6jy3=BORCi6v{KvI*z2t9o3mjF0VN4CGuzi@0Wyk{`v2(CgX3UKGk*cr&%&O zW{Cj*08MDg#o7+MmWbtVoL&WUp~1QyGw>63$8xpRueV!AE)uTSQ;fYOR6vsytNqoN z4uq)(T#v?YNd3{qo^|W^xw<#?7}{Ho^;>tjnrc=x6w8Q$QsIamz-Z^@H?wZ~E%YE*eT5GbX+*@Yx8&7Y##+ml|kxG=3KBnlA7&7$`dX$wm8s(>$ zy^qZY12)$kczJZ9siE8!m?~z5jVg8<)7a56R1C}Z^y--1m% zZqT}8&#z6_=!Pe1TCs6882kopB(hTjC>FxW{ zume4`GNO`>RR$SUF!B%1h??Wjr|STD3$Ft2BQ{P!{q(fh)`OCP93wF5A7i6iOn&$t z;)8+PI!w{b?=Mpm`<)%zZtF;Wwn9K1yXjhCTNZnDR;r>pMhO!w;8JS<#Xip^h+&!KyvRi`@kZD!)%rBnVjJ-rcuQ zY+DY-vY=V+8;+JLN@&y`wL(je{GmY|$L)^#;?I-jpmbw&zad2oyOwKVtfwQ`!x02c zhQRTHM{;}W!aj-|8QUBq^w{Cm!jguaZu{=u+n)ViqxzK+)jNACi_n!0A1d zvkgM;0{Ld#Ja6L0&E?Z&xLdC5yv1*dt<{QAqh;r*a0lg%B$7km9*ksXRu1782>4~o 
zPYAZ>0k~-duvm7jr6_HwJe_j7f?1@IzGRWv4~lb*bz} zU`~Zwt`!7^mO@5wK={y_*yDI}f)BwO2}?$Z-sG*0{{R}=XkDPfp1G%QY8eT*En)f?RPErB} z26LSbsi_?T@cqs7?uS2(vIay^O6u~s%u=5 z90fVm(C?p6@KyE#Jf4<1LGM?JKt4+!pCa;^`_9dv(H*(*zb@q1Bfl zgZkthH~c!>o2O=O8ZIF4gL~aA7n@YiRYhBAt$JFzGPvmpA~#t$_yDV81b4=y=anB; z4qW-R+mxKTdb;rD$#kptRtskoYOSIUm+ll+ya6Pka2o=mA|g5dq9=AfD)#fGPNa2M zbI4b${_@%POJ@vur+GIegRDX~oM?ZWTNkiasf|DtkuE-`P$mPv^+>_vk5qa2d8+jb z-`6f#cidO~xpS$fy4KWCF#If6DyT%YP{iFMt|fN_9Cl}A&Q$6}IIZX-!u`#5w{IR* zINkiZ%~)1xX>alyoPNWVD=jTl7&5O;NLT6ywjUYS3VOP7mA3gx=a%EPID1af*tb64 zx87~=ncdztWH3$-R&klBBXvf~d#fV@<5g(l4VozJv>UPr8HKu~8lc9VU(j!+Kylv~IsV$3f8n;@ z6}O&Fw|@_3t?Gws@7H<-6wauTR?^2g0m&-aktA-F0IyLeJwr``Nx{Tv?lyL}I~=}k zo^H8Y-yFVfD}5}IZf(6&1-7E5m1Z=x{ROG25HTGxB%yHG1Z4NYAk``wTFWKcdu8sC zsjBEIRY;1>5sgc(M^_2j4M+uDvQ?L@hya|E_S6M>33F>~<$s^+Ew6Z8T9a;ZE{pw@ zyU6ZDP*hDSJq+#oEmACG`L==~jiUPve4Y=XQvV)#Uhc_(%p+G!>`$qNFa>q$cJv3 zDpo0Jdh&aHIvIT15?O#T!a!l16~;dLrzOywUu9ppxaE6$a$m>0w=aA|mIkP`2%g(p z1CqBoSya3g3-pi7pO|9{BD&=C>9DvWQ(UNP+oS<6vokV+3OMbbx6}a7>#4;4AYN6h zTuS8`d5*Q(kZwL4*To}M#F2-V*Ha=zJZwt!NzoyN&~<$|kbruQp!)qsSyN}8bzkbl zzUYZ$%Z5YMkKeHx;L9EpIwJWZS^ofrUx<5BhV{j+57t#f4btHaviEMK743?MVfU%$ zD*^g?kjV*Y7r#rWBLE#s1JOq;pi{qCJ<5UORnPRi<9Og-Jp3@?JXMpWy1w@vrsyK7 z2#-l{U&A+0J(vcTH^=_~rZr+Z3W|tjG05aHF%A2JjdbCcD1He^Qxr55lusDyh=~V) zbqZg_56Sv%o+@pQboQ0DV@11Z(#5)}nx#rZa;A&@*o6GU+@(eZyMoYs6OBy^R=j`J zPW`;;i1h;Hii@1yRJVr!tLilCsQc8ESKtG>czAE{2MSbxa65Ji-%lQ;fli2-(aGYQf}KlOUdocTLf}3j=&AS$z$rIu9{Z-V?y1e>K#ib_aW4e9frEQ&|8@+7Q*!j+BWUqJX9~a z=&QEtRK`8H=a!~0Y5KtJlRQzT4gf9)Be>J!sWkjCWhc;?=jYSMn2t&5t$S;Yw6?pH z5|@{5?#mH(rMiSb)fDPnERjsA!HjOBtEZ`YhjGkOswEt#u3zZ#Bp@*R;Cb__8~Vv` z$8qsj)31G6Hg?Wy)yCZ|y6+u46rx&)BBiW~S?7g%S-R2@Aw3vkg}^u@8uoAPm^SVD zmvGv9+OCnmL})<}&5oyxWBeo%%l7(*z|hGBj&!;-T9j-Z!|qbJoy=S`HojB1X>{H9 zL$~TKx4R{*32SM+fPMB@i3Q>S2VR_?tN;nqT~%nkZF}Ul8}*{D-(N*gR;r4+ncB8X_`rUjL;;oyATsz&hJ9_I!P_kQZRBY+^Y0<;Ra)odU3=y0*aB>OM zv2!WQhXK8IIepw5+VbzPrLc1|vMh9wZw;*Tpo4P%07Ql8Zlzc1=#gNkZkJOSf<+)k z+ITfBc?(JGhA5|suBoG+3f%_;p2X)~ky-&8h5XC*_|dl%xIx8E2yXY;Txa2}{{U&; zBaAy*_Z^~IIl`X}(JF#h^%)jS73K0i;kl)=3UwL*R- zw&P2IpQK0ApWK800Bv-irgx}j?eyv5m38pJSAXHomN{A;l#yr7 z1#yF5a|yyb2OIcTbDpzr?s@@PYN@KDlAcG5voF;gfHIKE!|ZX8NFbGPQyiaMxF4pW zXVQDtJ9hHjakBE;%^l$`JTFM1w{1)8P(~`Y)3ks^6s^?03X*`JnfY@w zb;T+=9_sP1Wng>|M`900yGP&G#Mx1j2Ym1JC*v6N;|!^kW%u3Y1vuV<%E2HviRZgCpz zzwT7C7@CJ=n} zB-mdVdsWFV;%KNy6?`H;wk2`B`Y|FPF+YV=ZQTBSwO9VFSws41N@O z(S5)8sW^8Xme(cAl56CbLrE#Utv2_jT11==>56uE438Nj{ZLCnpMv2NNE9oABR-U3KJ^Po7?Cyq0GypI|zjp>c)@dqyVC+1g( zd$VufcOAmpH5T7ZyICD|6|M(vndC~i*ri0S7uzL1lu*hUi#EE>>vnH^)H%o<$vUmq zruVB~8+|`Fr6ns?U3X6n0Pw@3mfE=#KrrFzeKe-wW5 z+!kR=!|qLO6jv{LKB=#}##)MDA3*B?V1_qlFV)g|fK|qNj&;$C+Jq>aqT40H{{Ve8 zaJNd81ozH~Y3T)X84usaxOdaH)~{&uiMuN}N5?I-x3&)Fk~Dp{_+e!zu#GX zfx8WAE|Dq$-2Q<0)|9$aXQd_jv7f6%@=GcLGEeWIbIRqi8(WHEw?d9vgi;WyatG!) 
z(Ouf4b&(lA+~}pQ4ylf_+x6DAulI5be)-d=&>OS~_f+TVAFjS1XrW-kMH5^?u{~YB z`qFNJa^CUNsJ19>ULbz2{{Y(>%@Y3rG-vvV&+Dww;%ybgx$wtw?TN(APb$qDlrvJJ z291F~OAf~%DJQlEjORkE*IFvJj+=DTmCN?J(Z)KDa!7IDf=)^E`s#i8`t+IM4>dAZ zaZ`ow4$U$6B^CsC(kCEZ0sS@9VD*IEP`Wux=Y& zS&rjM(WzvYx75lr?nJ|+hD`l9DUfyZnQ_k)j>S)**Yxo9HshZmw<~VZx>w1%cu^%u z{8rqJm6Dg-A)a~(8|tQ%63y%}5z2GfN$VZ+>o3Ty+ns&}*g2Qr6mNIjn@-GXDXsMG zfBWjmF#Jm_3-dY>B~{PXrPI4~>x8*O^or)!i{Hq1&fKK1Qf$lY?kX>nU+JfSK(aJ) zF{V<|m%_J9eIVtRtE3V{)$nJA5pYw7HjWoqcOCNMZm`n|TV+-Dt!Ao9bU2D5BC7QW z)G8?>AyGjm8P|R&&UkAl^fPKM%HD08eLmo$mWF7(+G@ITvMY|n5rQ%2xH{_p06<=5 zFWd*_^bJs68d{3j`_R`=Ts)i06ZtJDGi*Y@{rn_8#KE$e30>=~`Q2T)ik%7n52_U~* zw|c|wjyv)(zIk8Y5?v^G$!OuFa#?q5)sggWi0Rig4vf7o6?)K#Wa!3a_MOo60t{yE zo6X~BaSqkEHum43mKLasr}4Djh@N7)T?#w0HpHJXJTG26a2t8>Ve2WNUD>uPEo(hpKf4qXO%Yh-WbM=u z`XA4g1YnI;pLpyI-@?uLeBplmvdL__?dzP@`=tYbOf$yCT?rWt+-D%1f)CD3o<&o! zAW4Dt&Y|1*fcad!aQ^_98-F$S9loiy_mverzVBI08Y)3}g&ZEWQ=X}zM1OFMF-93A zYJQ^~!RahBallQDKn26{L;39{&J!Vn1Q2 z$@E|5*N41p`m2>M)DKH`()w3z zp|Yn|cPif4!|g=lq zh{Kg?Er-kw-{wC-@5&d7{{UMH{{T*V%{ls=hCY{ia~mG)7@|^SjQAJBZ zRU4#n#Z+RK9l8=lC{BMv)8FGiZBGZUR{$d7?;uw_^Gfb3xOH@f`3n2a5$W8DYK17O z!P}-(U|Rut57Yr5!7LPl;YB^9c4;Fk7eT@5$I^fQ03bER>K7T7#_b*?wM!)*@(Nc8 z{_^nxl6M7xZpY*F)t`7Bv-l~+4oY@D$&Y2*>vmP&hZd+Ku-DNmK_x5EMg2C6_&Frx zV21jE2M1M>+c(Qi-mx94A{8l_-T<8~l!f}ndxM7U+qnMu)t&iEvQ}BSS>dgkwl+zr zcz8zT6k3cXx845#Irm@D55c3f3&VX32z^|KzTR+0Yz zm&jW*%;zRohG^uBG5VXQC%bEfY zeOZ&$6Oj6jayT&+-(Xv#ZT|qXmTJAR;D0fRfHPa!k74d~hK!!eDvtr%Ro>=9U`@3* z)nnCd+M8uc)iiaqHB`+Mkj|u}u{sb8az`L4%0S0fJ~XaA)<(QOfSYl@C6zrfd6>BK zjmd7p<_`H$5xre)Zw>lL0LSwxS)K{b2W2eLi3r}=*3JSRa>PaN0L~TJd{F6T#X@pBE3m>cOZOxpr3bdYq$RZOm332 zd$z+9O;ZF);AXBWWo&!v(zs;-n+gtBI_R%aZ&2S&{{T?@@fU|wo9kt98*kB=t@nMy zx7Yc8VIZq9szQa9+Iq$m8_Wm0CWiH=#F@8;_GUtwOwN9sivLa zse}-nETpOY%1@D>9thV<`f77;#hy_(@lC~zmg3`6X}61g(#uSgKHc#5wL5s00rgQ}d(V57*P$_>By+ zR542GHibz!5<0McjNo@G)Id4zFg*9?a%ox{^p+0hGnZahy;^xip||e^HFj(Dxh^|X zaX_kT>@{qp2@}85ByH6cuoKeQbvQU-0etEDTXUVr%$0YpEpVnuc8xV>;iz||l4Yv4 z&ZM`s)XK%+c2y^$P^5V8sfzSU-}edjR`cNewXHP;wyK)*Y7E4PzZ2IS>c3#m6mov- z$LC#{y!PJR-PWhE?#;Vzx=}d>S}UCzR6`~_bt8hO9sWCOw;YyyN<{CDycH{=xVBLkHd5ZaYt&XY|xJ%T1sRVS=M;s+%=nNDP>=}nhJt97- zH$?OQ0K3%4;kt6@tf#1deHCtKBB*Ao`jad4s6q8N+?@S}sg24GjU<{LK72h%IhFMb z-WL7A#>>40Ew;%Ut&eQjsoooX0dS&O6^Bu=Mt#IDl zBjASZ+_tMtT2owZb{)G@Sx-$f1uGnn3O7rhrNa5c1K{ZsZL-f-IC&#)WPgh|>g}n( z{u5o#YoyyeV&Q#8{{W#%-{v-}#V*(p6ss`^WrAA7eEo3)?L^8lQMyUMkY=t)_9+y! 
zUYs7Fw+q~Rc8`*-D9c$Sk|c2^#;yUG6tj+|h$tWePEW>*EBP4bO4)onV&Tef5#qV6 zJd?G(wq19;B8+9^>GreIqsbkHb#U0Covn)ubBAkxO;K{{Z>lR%qOQhhCUlrYc}LdYaJG(#F78cU{7nWePzk>Sl1k zyfPI1NEV!Y*fOZabG|K_W*(xTTS*E_-^4IAs2ajS#(R z77+y!FnGalsgwl~yG1L+zlo!eGK_V7IA#6*MwB*dj75>Uy1Kvc{4eoRCUzR)Kb*b$Q@;hL=KDyBoI}B{`yy1w*?`H!-1T6)Kz*A zbCqIUol(i>5O+1Mil&CC>bQSS@yL?aMJbY6s$((!n>=g-B6g9MM#orOEScKdsfS($ zNWePojFw1RIb@zWP_GptqXHBpc-OHPan;-R*Sk8=y)aT#H{V{W5TW`Ge#ctFO`y7} zot65t=i^1ya4&M5vO8$fsJnIp_c+#rMqG?!Wc|GB3J}ymCP_L6G!6N2-%9x`%_I)~ z&N@buw+Olzf;@hj$%@cTJU|c2pmhq-=ReyTF{zIo^Pn{WBtKJQRb7e|RR$Ss{{XIn zO3XjMjSj8oQdAu16r;&LHE4EDX0OB%tk#L@P)J5`tkp@KMRBHz7HHuN-x^8o{txiu zd)=t-Ti1p*3&nl7gF{<&vO!ZU7|9(EA{I~yBn;pl{&eHtt&z;)cm_Zje~g_1rH!SN z+i8B1d*|a`BxHJ2#MQ3HL28m#w%Tdwefp=Vs1&rx7?x)_c~~d`orudZ?oX0*pdQ!Y zNR+7RJ^uO8jCNJ0eyIp_Z_06`{^W=)NsbvhyZ8sN)7^olH~iXi*|WL9;0@=Oz1_U4 ztaKNOiaDO`Y^RQC>6)IVIn=64^fpRh=ir=o)Cuzi%RcGhF8sT4{l|@~Q);~1YD?Sh z2^xtVLmy_ko8!iZ8m29$*F^yPeN#TThaz;yhFgQAHd6eaYgPhsy6T{9fSej8A znQb>}TqSL?d>l&zm^kio(Y9lM45zUX&4b)loiU+4I%RP0n`3d=<*BpXsY$z10M4=(3gcMI9c*xWOwWwpBGTRgG?i?PV zN%-t=G^yqLj$C%-BaJjXR^V3hscNC~~@ zqNKgjZYT(7j6w;W03}EyqkgBkT>NBfM(>PV?UcK^*B@M#(qpaY1oc4QnUCnFx%2a_ z8MP}zQ>rXK$cL&gFP!%9>Wb-R+*EsidqqyExb`*Hs6$gpQUL2{41|6nor&s=)RyUB z94i7e{nk$G#ThVh{vm_kRe|9C;hSi0T3U)~xR!$Th$@XuIFTcpIf?KH=pm0!R&$5% z+fqTve?A;};8k6-Y~F5jZk?*qMGaM+rfDadnm(WlQHLebfX*UUWo!Y_oE>yW)kRGy zA`*wuD#-(N4uDTrZrX}Z;y>pff>+KJ*!fb}+(xzDcU=zZYR$=Px6IQutJ{Fak<9D; zG?1BJ=;|(oMpbj2c893nHS%zC)5m)8--I`rjb*~k4G!D6!wbalOItG2to<@4RBVdU z954hh9taw)O>G3`8FvJ)?nx~{F-tBSxL5AStItzpl6?H?+45J!7~x+g8CBUYZ?(n3 zQ4DqTztxRapgRQvI3Rt;!SAOBl>Td~xoGACWb+xdWhJb)IR5~f(?Xp;h@q1a)>Ann zqz8qZWxE0uv)ejPR%|=A!`*cj%Y&0oRvt@Z`(aXQcO$p}k)Eb8k)GW@Vl>7j z`&`r)IynHukjRBj2?wQyLo0b!Rbku>R_B|pUt{Gzn|<-in(mQ$hiYAa4M%e^`%h5= zMv})PqLLCot`&wz$U3tOzz`Es9QbA7k@fsVGF?fgR9=}#prVG>wt zZR(K7%mM0f@w_BH4}7UP0BSkXaKECM?2j5TNok>)U&HlO+v@%2KmamE>_}xKA0w%8 z_ZU63WPeIdVfdlW2PKFg~5K2n2ykh_q>I$KP zis)AgdnNkUdzdoPxoS$Xpy5_e$WQGcAGS%-_F#+WY>~Q9Vrs zORQAzU1_6fW1a*A;#Wx8CLN4{fh3sMJlLnkUy`kIM38!&`kC$DO>P}p=df`84b{Fk z1w8ZFZ8eowg|DN8bTO#C{lF?Tla;0S>g`LxACisba7P6#Tu)< z>Wc>Am)ua%RI3#w3pwbT7YyK%JEQoAEVVa&X?CK#4q=~QQd7iENNn6H znXo)|1Hby|$K?~2O73ZW9j^Q>qDViU*to59z3f8D0~(mp5iL1vcK4O$k3GJl9!>_6 zET@wzg7ML!(uoI>*`&4H<+jZW;asj{KnpLyBe?DlkGMJ4OhL(}=da}!!On*asH!iU zJk(ES-n8)#a<6M#>4;Y{^nzZBWN4+1M$fnIbV=b|n!Y7)hWlo?Zi=a?jU={fgn{U% z$dpCy5h{>;7@VHRz5{)=)gF?bqt`w`F1Kx$zqY$>P8x3MVwZ8-(bw@d=~WC(R|IeL zO_;Ef!uS#-YC-By7(!5&Y|@QA9Oo9=U*wb7R*9vz-SHPvpQ?_mkYSX3u2>HJ##hhP zjS}p-Kjs%pY*!^V=c2mP3yowjdy>*JR4UCNQTd7mbJ9OB9V5=AQ`F1SD{oTmXXY=G zjj>f#Zmo2X!T!@7S1?(qGhs@wCy zAGbAKwz6s6{n6q_ilZy;0g@(8tdW!5Z#Hw&;Oo)D!`fU{j(OUU7PF2&EnKwp{{V?jhLNLU zuZ8k3K?EM)_C54TUwvsPe-8fuQ?dm~egFr-*I4tHeWki@>$eIm_K??_4c%W+cA9E< z;{IIUiI^-PM$Up#4|PUmRRg)m*DqbTYq{v&DoWZyCa9KUPtr1GU^88u5FD zJCAKO#xnNSzgcIbTBPY#>WQU5vJRG0^pn|vK6F!HxTQqq3dX_d!k?8NI;52_FWdzH zucVXSqE``j=t-abLDRv=r#4q!OnI%@+{{v0XShwZE{jtot}B~U3xow(CZ=f@Ju30B z?7*oY1_WnFIkt_yyQWsDoz=UBSI_-*;;)AA^cJA6s9cO@qL_sr{^6xlOw@}#aI@C; z(+`LoE1=>$yx{FEx+`e?>8x9-;wf&FWD>0u)XE*Y!BdWopu-HLh8kmd+5Ao3n7dY~ zs5U|0`_fpd3~g0m+N(`Y=s)oh>t>0We&0YR?Tlu-6}DQNG@iv|yzJ5Qx zzBM$5`n2^cF!^5%{{X(5m2Cr#iy3gQ_=deNxSrJX_g*buHf>Ery&lH#+O-iRe`aZ^ zz4(IsWU%k&9(C9}GUB%mIK#J-b?}>r_6^Z&krG6%+t-*YV(D+nNr~I1KOI;<8mLoy zupGL)Wccd@X~5;Tnl3$ZkHmO5v&*j&u3J+32skq8yk$93hlz~DQ!wS=HI?iwJp}wUmsLe5gc|nAc6=4 zGLXu^6(E3h7#xg$6V}Dt)_ZRbx&GNz7Q4#IB7Pz5IsqjV=yQo`=u`TTustBPICUVF z_SXa%N8#5JcxT4ll}*HM4{z%(=&@EPG}mhd$~vlfGNG~5g!HxuB^wMg+znPTQd$+o zqCL5Q{vLi?cNMO>zn6|C?tPncXp39zw|dcYwTVVZ4qZh`Avgqwa-d`>@M8OqT)y)Y 
z$`<5@X7dfh8oj$MQ^>Zu%BK@j+yb#vEj>zQBp*&f6WsRLlm>ID2fb# zETkalUvmKcCsp9yb}05e*{kiRs@YREWv5j=G?JYCW9*Ig&ts>Th5Y{jbC1IwT5o)< z@b`%n)cZ2=1%Ki#)7lbAg4RZ>GRITYB#B7{`c!}mKUYw4r(wvUE%Y?w$x&n)8WKKnz&cOLB=v~eqm*E`L{{WDiNuhnFmt@N|nNlxuNPd!g6Riskz7MI{ODXI$ z4qhO|EFXlW`jv)BZ4_4SC^49&1>4W*tMO!8K)co6Zl*+#fKO!uL^VL?toPTtir1(j zPT2dMd77>#UV#uf`O_srg*O&J4@!6a^ifod!>FGcIill0R(s`+fCT z=vJMDQ$|?$)`zFUmVdU64X#g=@B3?1)I^VMlxoUUglfcPY?6wG3z7%Uvqu*j8-6K> zQT_Fr+{y5e^KxU&ZrJB1mCd8fZq}xbHQbi`+%53P&L}IRV;w~?ECFtb6ppM8r7FX* z8g=Y+v8HC0AkN1C1@Wrf`YHO0ZoW+UtFmqF)xGDU;im7nM@77;D&|p312FdCulkgT zj84j3(*>doq~Hu$t7BJMnx-n5p^8Bu43i@US5h(vBe4Wvf;$jF9(Ci&tCV_w$~{R% zC$5dPw%c_@vN~v~%D7LX?hott(i;6>l7>SsM}DFRZ9A8k;<{Z{S~_G{KpjOqpVzjM zoJ89XOZ|0q!R!@JgQbM1Sww1Tsp*=ptYiCq^}3$kl~RWP00=$sHLB}wnu@rEazQ=v zF`>4Zl@q0upSCs8;E-`^w@@KgmJF30j(lmb^%dfF)8O|ad{5rHmt@VjX=U1#7Q<8O8}3bttPLnFIzrH_7|x*6G8|rNoV!cGx@G#SPNSc(%(& zb+^^1G1ODYy)KLne)$LBfJbmO8K%=kBdRKIdfO5~QA&qzEM#mFtSY(_l~I$qn3=6kadu; zl_1Ce0IG+lmht3gjcv23UVE7LC_Ecnt?3 z+|l1TH|Ec0b4O;fweH(ws;GwNYoe-HUcTQ`1Hn2YztWNgex>T|)PNEJCtTj2BZRof zq}BF!AFtP^2FKY^+qtaZR|_r+U3uIUF>xM>#YHsqh9-mD7Z-seWTpX;gpLX$b%WAd zCsBQFw8OJ2VT8o9+G>}knn~9pFOOyO*@5u6^WQzjO`JGU$Ib?Gt-vMU{4<o7_So8k?=-aq>75M%oa`|DZ+RxO%4duXiESzcDwq>rm+@hdZDQ~h<7~*>9%nTAoEK+8U zR0Uay1@WCl_T=H5t!Ag#JBo7j%`1yF)f6#;L>WNG9YMnOVd0mk4;cx58g9NIa65@H zUF@5tyKL@yCF`NQTPr#lqh)CFkK8;`DRsjF2qjs5G`^oSQ{K%urmU1N;xgm)!)$N| zX>cOjFjKp)QrxbRy8_>bpj9!b@%B~l3Q)JMoVaQ#oY=HW*=d_3T#qRF~0HQUQ?3YuN;E*oF@ zd{R#=bga_Cs?LoWk~tToFvnxk19sKvnX};CKB4?Hr%4zJ8?{w_1We!XC?}8)r@$u} zC$Ra~MoB4prTId|HH_~rYkeI!cgNksHQ$9hhTFBMZZdoCYnqkBcFLy6Y3Uj;N+Hf# z0y;Yo2m}R&o-}YLB0Z9HA!l>FlG(#)}LeXmTv?PPH0HydJrqJELv-GjDIV$o+g_QtE5 z)5a;v1mTglOpYE&xq;o>w{e2Q#ci8f@!U4uNUKN{-=?mKk4;@1s2M{eA$2N#N|Az! zb!0uAe%rh2ZCm%A*|;rLvg<(&ZMLaMB5HY1NgMY2s|{7!$8mdI${V^cQj8uzKqtBA zmmmKC7Sx;iXnMIn1Z`H{UwNauAlTNsTsHlsx+pA#q_%X-$D+BCMrj}sz67foXvAxg z(n%cnGN{SLy%dxiM76|`=b#No-|-o6Xm1WwHm?;ZtdBJNH+rH~)l<_3c&O`PXq&D4 zs(aDE2gg&7li$BK?Z%GrMNfFPS3^}vS5A{sRMX0`M>KAsl0X_RR&{}#_GYa$jj8Z!!APtxf=82 zoF5im6wx)Fr~HTR{%5$gwD^a^s+E?tA(q=)1!*ZwNdytkG*cNydIPhV@JDxLImUGC za%I7b*Djp3SU6F75GBH$s%tLdmQd&_YA1Od1|6FUQF1$dJ-cab%kLfdf6V6q_QxA` zEqwKKciUpf?Ndr1nhKQuokUW^2@yz>EQ-hu2W)mA==Q(bR1ib`$obcPJWx~j7i{O3 zC88m{b2)3e@b{QF`GnhaW*av6pp9VPWB7i|H8%Hj2+`FFrI^ZU-U1@G| zZ7YOx(@8BBV~{utG-s(q$bZD7sTeptjzHCh`Ml+4Z*y_N8F)>&F5-cWYShwJwJ*G} zxJQkZlOU92b_WL|0FN4n9&vd;;@>V8N()yKt!o`^kG&myEE*eyvPnc{Wo7wCRARnR zF$4nI;`K0i^0`)@`7Ie-E6FULn7r{#&SwCsDsOvQ-(cK(GPWB%-rH!IzYWkJR;HW|4Ke!=4g=9-sn#I}9WaH_v-y7;Nbw+%zLyEPsF!j+MT zBLgwWj658A^p)av;o(O%+j^4SJ}sN$8+Q3hSoGzCNlh+1b;mk@5B##ozv9%W@t=eo zJmZel8#{@%AIH)qLR3>2=4hs;m<$JwM8hLsl{oUGf)1-s)p~tBp)ZWAiN8th#dWmv zcd##Zx0)!fUy8NYpi!5no{m;1-TQ$oaW*~-dHB@8Jz!7{F_ETAmaay4FUuwJ`G4A5 za+0Q|d*w}5-ni9?C9Jm5&l+{o^|>l>C%uPN3{k?9*y-M$W1n#3>i(M?@#Cp>Pa{p} z`#sJ=#4+uiYZY=ASE2UI%D}`Xa;RJ%r>KBPT;n7HK+j{K6U^-AE8AYE>O@s=p1fjJ z{{X1hTF`}b?pG?S-s-(>8;lh3)mPsut1A|XnW6XPs+Ht1w4YIrthYfu%Jk>#2A!Eh zy!N2BTtQhO4J|9iOlf#Qo2>Gw#M-6t>g0|IcvF~f0C9c_Vr}n0%rAn9fQJ1N)i28BdlONW*^^190BB~ zj$F)bE#r3JO_et-_hQkd#`&@Bk;^6N?YS#0!kz$xRK*MXYbqldR7#8ylc|^89Msh? 
[GIT binary patch data for added image files omitted]

- + diff --git a/search.json b/search.json index 76bf255..322575c 100644 --- a/search.json +++ b/search.json @@ -11,7 +11,7 @@ "href": "index.html", "title": "Home", "section": "", - "text": "Blogging with Quarto and Jupyter: The Complete Guide\n\n\n\n\n\n\n\npython\n\n\ntutorial\n\n\nblogging\n\n\n\n\nStep-by-step tutorial and best practices for creating a python blog with quarto and jupyter\n\n\n\n\n\n\nSep 6, 2023\n\n\n\n\n\n\n \n\n\n\n\nRandom Realizations Resurrected\n\n\n\n\n\n\n\nblogging\n\n\n\n\nThe world’s favorite data science blog is back.\n\n\n\n\n\n\nAug 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nXGBoost from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nA walkthrough of my from-scratch python implementation of XGBoost.\n\n\n\n\n\n\nMay 7, 2022\n\n\n\n\n\n\n \n\n\n\n\nXGBoost Explained\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nIn-depth explanation and mathematical derivation of the XGBoost algorithm\n\n\n\n\n\n\nMar 13, 2022\n\n\n\n\n\n\n \n\n\n\n\nDecision Tree from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nA detailed walkthrough of my from-scratch decision tree implementation in python.\n\n\n\n\n\n\nDec 13, 2021\n\n\n\n\n\n\n \n\n\n\n\nConsider the Decision Tree\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nUnderstand the core strengths and weaknesses of the decision tree, and see how ensembling makes trees shine.\n\n\n\n\n\n\nDec 12, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow to Implement a Gradient Boosting Machine that Works with Any Loss Function\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nSummarize Friedman’s seminal GBM paper and implement the generic gradient boosting algorithm to train models with any differentiable loss function.\n\n\n\n\n\n\nOct 23, 2021\n\n\n\n\n\n\n \n\n\n\n\nHello PySpark!\n\n\n\n\n\n\n\npython\n\n\nPySpark\n\n\ntutorial\n\n\n\n\nGet up and running fast with a local pyspark installation, and learn the essentials of working with dataframes at scale.\n\n\n\n\n\n\nJun 22, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow Gradient Boosting Does Gradient Descent\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nUnderstand how gradient boosting does gradient descent in function space to minimize any differentiable loss function in the service of creating a good model.\n\n\n\n\n\n\nApr 27, 2021\n\n\n\n\n\n\n \n\n\n\n\nGet Down with Gradient Descent\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nGet down with the intuition for gradient descent via a fresh analogy, develop the mathematical formulation of the algorithm, and implement it from scratch to train a linear regression model.\n\n\n\n\n\n\nJan 22, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow to Build a Gradient Boosting Machine from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nUnderstand the intuition behind the gradient boosting machine (GBM) and learn how to implement it from scratch.\n\n\n\n\n\n\nDec 8, 2020\n\n\n\n\n\n\n \n\n\n\n\nThe 80/20 Pandas Tutorial\n\n\n\n\n\n\n\npython\n\n\npandas\n\n\ntutorial\n\n\n\n\nAn opinionated pandas tutorial on my preferred methods to accomplish the most essential data transformation tasks in a way that will make veteran R and tidyverse users smile.\n\n\n\n\n\n\nNov 25, 2020\n\n\n\n\n\n\n \n\n\n\n\nHello World! 
And Why I’m Inspired to Start a Blog\n\n\n\n\n\n\n\nblogging\n\n\n\n\nA reflection on what inspired me to start a blog and three reasons I think it could be a good idea.\n\n\n\n\n\n\nNov 22, 2020\n\n\n\n\n\n\nNo matching items" + "text": "XGBoost for Regression in Python\n\n\n\n\n\n\n\npython\n\n\ntutorial\n\n\ngradient boosting\n\n\nxgboost\n\n\n\n\nA step-bystep tutorial on regression with XGBoost in python using sklearn and the xgboost library\n\n\n\n\n\n\nOct 25, 2023\n\n\n\n\n\n\n \n\n\n\n\nBlogging with Quarto and Jupyter: The Complete Guide\n\n\n\n\n\n\n\npython\n\n\ntutorial\n\n\nblogging\n\n\n\n\nStep-by-step tutorial and best practices for creating a python blog with quarto and jupyter\n\n\n\n\n\n\nSep 6, 2023\n\n\n\n\n\n\n \n\n\n\n\nRandom Realizations Resurrected\n\n\n\n\n\n\n\nblogging\n\n\n\n\nThe world’s favorite data science blog is back.\n\n\n\n\n\n\nAug 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nXGBoost from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nA walkthrough of my from-scratch python implementation of XGBoost.\n\n\n\n\n\n\nMay 7, 2022\n\n\n\n\n\n\n \n\n\n\n\nXGBoost Explained\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nIn-depth explanation and mathematical derivation of the XGBoost algorithm\n\n\n\n\n\n\nMar 13, 2022\n\n\n\n\n\n\n \n\n\n\n\nDecision Tree from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nA detailed walkthrough of my from-scratch decision tree implementation in python.\n\n\n\n\n\n\nDec 13, 2021\n\n\n\n\n\n\n \n\n\n\n\nConsider the Decision Tree\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nUnderstand the core strengths and weaknesses of the decision tree, and see how ensembling makes trees shine.\n\n\n\n\n\n\nDec 12, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow to Implement a Gradient Boosting Machine that Works with Any Loss Function\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nSummarize Friedman’s seminal GBM paper and implement the generic gradient boosting algorithm to train models with any differentiable loss function.\n\n\n\n\n\n\nOct 23, 2021\n\n\n\n\n\n\n \n\n\n\n\nHello PySpark!\n\n\n\n\n\n\n\npython\n\n\nPySpark\n\n\ntutorial\n\n\n\n\nGet up and running fast with a local pyspark installation, and learn the essentials of working with dataframes at scale.\n\n\n\n\n\n\nJun 22, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow Gradient Boosting Does Gradient Descent\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nUnderstand how gradient boosting does gradient descent in function space to minimize any differentiable loss function in the service of creating a good model.\n\n\n\n\n\n\nApr 27, 2021\n\n\n\n\n\n\n \n\n\n\n\nGet Down with Gradient Descent\n\n\n\n\n\n\n\ngradient boosting\n\n\n\n\nGet down with the intuition for gradient descent via a fresh analogy, develop the mathematical formulation of the algorithm, and implement it from scratch to train a linear regression model.\n\n\n\n\n\n\nJan 22, 2021\n\n\n\n\n\n\n \n\n\n\n\nHow to Build a Gradient Boosting Machine from Scratch\n\n\n\n\n\n\n\npython\n\n\ngradient boosting\n\n\nfrom scratch\n\n\n\n\nUnderstand the intuition behind the gradient boosting machine (GBM) and learn how to implement it from scratch.\n\n\n\n\n\n\nDec 8, 2020\n\n\n\n\n\n\n \n\n\n\n\nThe 80/20 Pandas Tutorial\n\n\n\n\n\n\n\npython\n\n\npandas\n\n\ntutorial\n\n\n\n\nAn opinionated pandas tutorial on my preferred methods to accomplish the most essential data transformation tasks in a way that will make veteran R and tidyverse users smile.\n\n\n\n\n\n\nNov 25, 2020\n\n\n\n\n\n\n 
\n\n\n\n\nHello World! And Why I’m Inspired to Start a Blog\n\n\n\n\n\n\n\nblogging\n\n\n\n\nA reflection on what inspired me to start a blog and three reasons I think it could be a good idea.\n\n\n\n\n\n\nNov 22, 2020\n\n\n\n\n\n\nNo matching items" }, { "objectID": "posts/xgboost-explained/index.html", @@ -147,144 +147,172 @@ "text": "References\nThis implementation is inspired and partially adapted from Jeremy Howard’s live coding of a Random Forest as part of the fastai ML course." }, { - "objectID": "posts/consider-the-decision-tree/index.html", - "href": "posts/consider-the-decision-tree/index.html", - "title": "Consider the Decision Tree", + "objectID": "posts/xgboost-for-regression-in-python/index.html", + "href": "posts/xgboost-for-regression-in-python/index.html", + "title": "XGBoost for Regression in Python", "section": "", - "text": "A California cypress tree abides in silence on Alameda Beach.\nAh, the decision tree. It’s an underrated and often overlooked hero of modern statistical learning. Trees aren’t particularly powerful learning algorithms on their own, but when utilized as building blocks in larger ensemble models like random forest and gradient boosted trees, they can achieve state of the art performance in many practical applications. Since we’ve been focusing on gradient boosting ensembles lately, let’s take a moment to consider the humble decision tree itself. This post gives a high-level intuition for how trees work, an opinionated list of their key strengths and weaknesses, and some perspective on why ensembling makes them truly shine.\nOnward!" + "text": "In this post I’m going to show you my process for solving regression problems with XGBoost in python, using either the native xgboost API or the scikit-learn interface. This is a powerful methodology that can produce world class results in a short time with minimal thought or effort. While we’ll be working on an old Kagle competition for predicting the sale prices of bulldozers and other heavy machinery, you can use this flow to solve whatever tabular data regression problem you’re working on.\nThis post serves as the explanation and documentation for the XGBoost regression jupyter notebook from my ds-templates repo on GitHub, so go ahead and download the notebook and follow along with your own data.\nIf you’re not already comfortable with the ideas behind gradient boosting and XGBoost, you’ll find it helpful to read some of my previous posts to get up to speed. I’d start with this introduction to gradient boosting, and then read this explanation of how XGBoost works.\nLet’s get into it! 🚀" }, { - "objectID": "posts/consider-the-decision-tree/index.html#classification-and-regression-trees", - "href": "posts/consider-the-decision-tree/index.html#classification-and-regression-trees", - "title": "Consider the Decision Tree", - "section": "Classification and Regression Trees", - "text": "Classification and Regression Trees\nA Decision tree is a type of statistical model that takes features or covariates as input and yields a prediction as output. The idea of the decision tree as a statistical learning tool traces back to a monograph published in 1984 by Breiman, Freidman, Olshen, and Stone called “Classification and Regression Trees” (a.k.a. CART). As the name suggests, trees come in two main varieties: classification trees which predict discrete class labels (e.g. DecisionTreeClassifier) and regression trees which predict numeric values (e.g. 
DecisionTreeRegressor).\nAs I mentioned earlier, tree models are not very powerful learners on their own. You might find that an individual tree model is useful for creating a simple and highly interpretable model in specific situations, but in general, trees tend to shine most as building blocks in more complex algorithms. These composite models are called ensembles, and the most important tree ensembles are random forest and gradient boosted trees. While random forest uses either regression or classification trees depending on the type of target, gradient boosting can use regression trees to solve both classification and regression tasks." + "objectID": "posts/xgboost-for-regression-in-python/index.html#install-and-import-the-xgboost-library", + "href": "posts/xgboost-for-regression-in-python/index.html#install-and-import-the-xgboost-library", + "title": "XGBoost for Regression in Python", + "section": "Install and import the xgboost library", + "text": "Install and import the xgboost library\nIf you don’t already have it, go ahead and use conda to install the xgboost library, e.g.\n$ conda install -c conda-forge xgboost\nThen import it along with the usual suspects.\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport xgboost as xgb" }, { - "objectID": "posts/consider-the-decision-tree/index.html#regression-tree-in-action", - "href": "posts/consider-the-decision-tree/index.html#regression-tree-in-action", - "title": "Consider the Decision Tree", - "section": "Regression Tree in Action", - "text": "Regression Tree in Action\nLet’s have a closer look at regression trees by training one on the diabetes dataset from scikit learn. According to the documentation:\n\nTen baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.\n\nFirst we load the data. To make our lives easier, we’ll just use two features: average blood pressure (bp) and the first blood serum measurement (s1) to predict the target. I’ll rescale the features to make the values easier for me to read, but it won’t affect our tree–more on that later.\n\nimport numpy as np \nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\ncolor_palette = \"viridis\"\n\n\nfrom sklearn.datasets import load_diabetes\n\nX, y = load_diabetes(as_frame=True, return_X_y=True)\n\nX = 100 * X[['bp', 's1']]\n\n\n\n\n\n\nLet’s grow a tree to predict the target given values of blood pressure and blood serum.\n\nfrom sklearn.tree import DecisionTreeRegressor\n\ntree = DecisionTreeRegressor(max_depth=2)\ntree.fit(X,y);\n\n\n\n\n\n\nTo make predictions using our fitted tree, we start at the root node (which is at the top), and we work our way down moving left if our feature is less than the split threshold and to the right if it’s greater than the split threshold. For example let’s predict the target for a new case with bp= 1 and s1 = 5. Since our blood pressure of 1 is less than 2.359, we move to the left child node. Here, since our serum of 5 is greater than the threshold at 0.875, we move to the right child node. 
This node has no further children, and thus we return its predicted value of 155.343.\n\ntree.predict(pd.DataFrame({'bp': 1, 's1': 5}, index=[0]))\n\narray([155.34313725])\n\n\nLet’s overlay these splits on our feature scatterplot to see how the tree has partitioned the feature space.\n\n\n\n\n\nThe tree has managed to carve out regions of feature space where the target values tend to be similar within each region, e.g. we have low target values in the bottom left partition and high target values in the far right region.\nLet’s take a look at the regression surface predicted by our tree. Since the tree predicts the exact same value for all instances in a given partition, the surface has only four distinct values.\n\n\n\n\n\nFabulous, now that we’ve seen a tree in action, let’s talk about trees’ key strengths and weaknesses." + "objectID": "posts/xgboost-for-regression-in-python/index.html#read-dataset-into-python", + "href": "posts/xgboost-for-regression-in-python/index.html#read-dataset-into-python", + "title": "XGBoost for Regression in Python", + "section": "Read dataset into python", + "text": "Read dataset into python\nIn this example we’ll work on the Kagle Bluebook for Bulldozers competition, which asks us to build a regression model to predict the sale price of heavy equipment. Amazingly, you can solve your own regression problem by swapping this data out with your organization’s data before proceeding with the tutorial.\nGo ahead and download the Train.zip file from Kagle and extract it into Train.csv. Then read the data into a pandas dataframe.\n\ndf = pd.read_csv('Train.csv', parse_dates=['saledate']);\n\nNotice I cheated a little bit, checking the columns ahead of time and telling pandas to treat the saledate column as a date. In general it will make life easier to read in any date-like columns as dates.\n\ndf.info()\n\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 401125 entries, 0 to 401124\nData columns (total 53 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 SalesID 401125 non-null int64 \n 1 SalePrice 401125 non-null int64 \n 2 MachineID 401125 non-null int64 \n 3 ModelID 401125 non-null int64 \n 4 datasource 401125 non-null int64 \n 5 auctioneerID 380989 non-null float64 \n 6 YearMade 401125 non-null int64 \n 7 MachineHoursCurrentMeter 142765 non-null float64 \n 8 UsageBand 69639 non-null object \n 9 saledate 401125 non-null datetime64[ns]\n 10 fiModelDesc 401125 non-null object \n 11 fiBaseModel 401125 non-null object \n 12 fiSecondaryDesc 263934 non-null object \n 13 fiModelSeries 56908 non-null object \n 14 fiModelDescriptor 71919 non-null object \n 15 ProductSize 190350 non-null object \n 16 fiProductClassDesc 401125 non-null object \n 17 state 401125 non-null object \n 18 ProductGroup 401125 non-null object \n 19 ProductGroupDesc 401125 non-null object \n 20 Drive_System 104361 non-null object \n 21 Enclosure 400800 non-null object \n 22 Forks 192077 non-null object \n 23 Pad_Type 79134 non-null object \n 24 Ride_Control 148606 non-null object \n 25 Stick 79134 non-null object \n 26 Transmission 183230 non-null object \n 27 Turbocharged 79134 non-null object \n 28 Blade_Extension 25219 non-null object \n 29 Blade_Width 25219 non-null object \n 30 Enclosure_Type 25219 non-null object \n 31 Engine_Horsepower 25219 non-null object \n 32 Hydraulics 320570 non-null object \n 33 Pushblock 25219 non-null object \n 34 Ripper 104137 non-null object \n 35 Scarifier 25230 non-null object \n 36 Tip_Control 25219 non-null object \n 37 
Tire_Size 94718 non-null object \n 38 Coupler 213952 non-null object \n 39 Coupler_System 43458 non-null object \n 40 Grouser_Tracks 43362 non-null object \n 41 Hydraulics_Flow 43362 non-null object \n 42 Track_Type 99153 non-null object \n 43 Undercarriage_Pad_Width 99872 non-null object \n 44 Stick_Length 99218 non-null object \n 45 Thumb 99288 non-null object \n 46 Pattern_Changer 99218 non-null object \n 47 Grouser_Type 99153 non-null object \n 48 Backhoe_Mounting 78672 non-null object \n 49 Blade_Type 79833 non-null object \n 50 Travel_Controls 79834 non-null object \n 51 Differential_Type 69411 non-null object \n 52 Steering_Controls 69369 non-null object \ndtypes: datetime64[ns](1), float64(2), int64(6), object(44)\nmemory usage: 162.2+ MB" }, { - "objectID": "posts/consider-the-decision-tree/index.html#why-trees-are-awesome", - "href": "posts/consider-the-decision-tree/index.html#why-trees-are-awesome", - "title": "Consider the Decision Tree", - "section": "Why trees are awesome", - "text": "Why trees are awesome\nTrees are awesome because they are easy to use, and trees are easy to use because they are robust, require minimal data preprocessing, and can learn complex relationships without user intervention.\n\nFeature Scaling\nTrees owe their minimal data preprocessing requirements and their robustness to the fact that split finding is controlled by the sort order of the input feature values, rather than the values themselves. This means that trees are invariant to the scaling of input features, which in turn means that we don’t need to fuss around with carefully rescaling all the numeric features before fitting a tree. It also means that trees tend to work well even if features are highly skewed or contain outliers.\n\n\nCategoricals\nSince trees just split data based on numeric feature values, we can easily handle most categorical features by using integer encoding. For example we might encode a size feature with small = 1, medium = 2, and large = 3. This works particularly well with ordered categories, because partitioning is consistent with the category semantics. It can also work well even if the categories have no order, because with enough splits a tree can carve each category into its own partition.\n\n\nMissing Values\nIt’s worth calling out that different implementations of the decision tree handle missing feature values in different ways. Notably, scikit-learn handles them by throwing an error and telling you not to pull such shenanigans.\nValueError: Input contains NaN, infinity or a value too large for dtype('float32').\nOn the other hand, XGBoost supports an elegant way to make use of missing values, which we will discuss more in a later post.\n\n\nInteractions\nFeature interactions can also be learned automatically. An interaction means that the effect of one feature on the target differs depending on the value of another feature. For example, the effect of some drug may depend on whether or not the patient exercises. After a tree splits on exercise, it can naturally learn the correct drug effects for both exercisers and non-exercisers. This intuition extends to higher-order interactions as well, as long as the tree has enough splits to parse the relationships.\n\n\nFeature Selection\nBecause trees choose the best feature and threshold value at each split, they essentially perform automatic feature selection. This is great because even if we throw a lot of irrelevant features at a decision tree, it will simply tend not to use them for splits. 
Similarly, if two or more features are highly correlated or even redundant, the tree will simply choose one or the other when making each split; having both in the model will not cause catastrophic instability as it could in a linear model.\n\n\nFeature-Target Relationship\nFinally, it is possible for trees to discover complex nonlinear feature-target relationships without the need for user-specification of the relationships. This is because trees use local piecewise constant approximations without making any parametric assumptions. With enough splits, the tree can approximate arbitrary feature-target relationships." + "objectID": "posts/xgboost-for-regression-in-python/index.html#prepare-raw-data-for-xgboost", + "href": "posts/xgboost-for-regression-in-python/index.html#prepare-raw-data-for-xgboost", + "title": "XGBoost for Regression in Python", + "section": "Prepare raw data for XGBoost", + "text": "Prepare raw data for XGBoost\nWhen faced with a new tabular dataset for modeling, we have two format considerations: data types and missingness. From the call to df.info() above, we can see we have both mixed types and missing values.\nWhen it comes to missing values, some models like the gradient booster or random forest in scikit-learn require purely non-missing inputs. One of the great strengths of XGBoost is that it relaxes this requirement, allowing us to pass in missing feature values, so we don’t have to worry about them.\nRegarding data types, all ML models for tabular data require inputs to be numeric, either integers or floats, so we’re going to have to deal with those object columns.\n\nEncode string features\nThe simplest way to encode string variables is to map each unique string value to an integer; this is called integer encoding.\nWe have a couple of options for how to implement this transformation: pandas categoricals or the scikit-learn label encoder. We can use the categorical type in pandas to generate mappings from string values to integers for each string feature. The category type is a bit like the factor type in R. Pandas stores the underlying data as integers, and it also keeps a mapping from the integers to the string values. XGBoost will be able to access the integers for model fitting. This is nice because we can still access the actual categories which can be helpful when we start taking a closer look at the data. If you prefer, you can also use the scikit-learn label encoder to replace the string columns with their integer-mapped counterparts.\n\ndef encode_string_features(df, use_cats=True):\n out_df = df.copy()\n for feature, feature_type in df.dtypes.items():\n if feature_type == 'object':\n if use_cats:\n out_df[feature] = out_df[feature].astype('category')\n else:\n from sklearn.preprocessing import LabelEncoder\n out_df[feature] = LabelEncoder() \\\n .fit_transform(out_df[feature].astype('str'))\n return out_df\n\ndf = encode_string_features(df, use_cats=False)\n\n\n\nEncode date and timestamp features\nWhile dates feel sort of numeric, they are not numbers, so we need to transform them into numeric columns. Unfortunately, encoding timestamps isn’t as straightforward as encoding strings, so we actually might need to engage in a little bit of feature engineering. A single date has many different attributes, e.g. days since epoch, year, quarter, month, day, day of year, day of week, is holiday, etc. As a starting point, we can just add a few of these attributes as features. 
Once a feature is represented as a date or timestamp data type, you can access various attributes via the dt attribute.\n\ndef encode_datetime_features(df, datetime_features, datetime_attributes):\n out_df = df.copy()\n for datetime_feature in datetime_features:\n for datetime_attribute in datetime_attributes:\n if datetime_attribute == 'days_since_epoch':\n out_df[f'{datetime_feature}_{datetime_attribute}'] = \\\n (out_df[datetime_feature] \n - pd.Timestamp(year=1970, month=1, day=1)).dt.days\n else:\n out_df[f'{datetime_feature}_{datetime_attribute}'] = \\\n getattr(out_df[datetime_feature].dt, datetime_attribute)\n return out_df\n\ndatetime_features = [\n 'saledate',\n]\ndatetime_attributes = [\n 'year',\n 'month',\n 'day',\n 'quarter',\n 'day_of_year',\n 'day_of_week',\n 'days_since_epoch',\n]\n\ndf = encode_datetime_features(df, datetime_features, datetime_attributes)\n\n\n\nTransform the target if necessary\nIn the interest of speed and efficiency, we didn’t bother doing any EDA with the feature data. Part of my justification for this is that trees are incredibly robust to outliers, colinearity, missingness, and other assorted nonsense in the feature data. However, they are not necessarily robust to nonsense in the target variable, so it’s worth having a look at it before proceeding any further.\n\ndf.SalePrice.hist(); plt.xlabel('SalePrice');\n\n\n\n\nOften when predicting prices it makes sense to use log price, especially when they span multiple orders of magnitude or have a strong right skew. These data look pretty friendly, lacking outliers and exhibiting only a mild positive skew; we could probably get away without doing any transformation. But checking the evaluation metric used to score the Kagle competition, we see they’re using root mean squared log error. That’s equivalent to using RMSE on log-transformed target data, so let’s go ahead and work with log prices.\n\ndf['logSalePrice'] = np.log1p(df['SalePrice'])\ndf.logSalePrice.hist(); plt.xlabel('logSalePrice');" }, { - "objectID": "posts/consider-the-decision-tree/index.html#why-trees-are-not-so-awesome", - "href": "posts/consider-the-decision-tree/index.html#why-trees-are-not-so-awesome", - "title": "Consider the Decision Tree", - "section": "Why trees are not so awesome", - "text": "Why trees are not so awesome\nThe main weakness of the decision tree is that, on its own, it tends to have poor predictive performance compared to other algorithms. The main reasons for this are the tendency to overfit and prediction quantization issues.\n\nOverfitting\nIf we grow a decision tree until each leaf has exactly one instance in it, we will have simply memorized the training data, and our model will not generalize well. Basically the only defense against overfitting is to reduce the number of leaf nodes in the tree, either by using hyperparameters to stop splitting earlier or by removing certain leaf nodes after growing a deep tree. The problem here is that some of the benefits of trees, like ability to approximate arbitrary target patterns and ability to learn interaction effects, depend on having enough splits for the task. We can sometimes find ourselves in a situation where we cannot learn these complex relationships without overfitting the tree.\n\n\nQuantization\nBecause regression trees use piecewise constant functions to approximate the target, prediction accuracy can deteriorate near split boundaries. 
For example, if the target is increasing with the feature, a tree might tend to overpredict the target on the left side of split boundaries and overpredict on the right side of split boundaries.\n\n\n\n\n\n\n\nExtrapolation\nBecause they are trained by partitioning the feature space in a training dataset, trees cannot intelligently extrapolate beyond the data on which they are trained. For example if we query a tree for predictions beyond the greatest feature value encountered in training, it will just return the prediction corresponding to the largest in-sample feature values.\n\n\n\n\n\n\n\nThe Dark Side of Convenience\nFinally, there is always a price to pay for convenience. While trees can work well even with a messy dataset containing outliers, redundant features, and thoughtlessly encoded categoricals, we will rarely achieve the best performance under these conditions. Taking the time to deal with outliers, removing redundant information, purposefully choosing appropriate categorical encodings, and building an understanding of the data will often lead to much better results." + "objectID": "posts/xgboost-for-regression-in-python/index.html#train-and-evaluate-the-xgboost-regression-model", + "href": "posts/xgboost-for-regression-in-python/index.html#train-and-evaluate-the-xgboost-regression-model", + "title": "XGBoost for Regression in Python", + "section": "Train and Evaluate the XGBoost regression model", + "text": "Train and Evaluate the XGBoost regression model\nHaving prepared our dataset, we are now ready to train an XGBoost model. We’ll walk through the flow step-by-step first, then later we’ll collect the code in a single cell, so it’s easier to quickly iterate through variations of the model.\n\nSpecify target and feature columns\nFirst we’ll put together a list of our features and define the target column. I like to have an actual list defined in the code so it’s easier to see everything we’re puting into the model and easier to add or remove features as we iterate. Just run something like list(df.columns) in a cel to get a copy-pasteable list of columns, then edit it down to the full list of features, i.e. remove the target, date columns, and other non-feature columns..\n\n# list(df.columns)\n\n\nfeatures = [\n 'SalesID',\n 'MachineID',\n 'ModelID',\n 'datasource',\n 'auctioneerID',\n 'YearMade',\n 'MachineHoursCurrentMeter',\n 'UsageBand',\n 'fiModelDesc',\n 'fiBaseModel',\n 'fiSecondaryDesc',\n 'fiModelSeries',\n 'fiModelDescriptor',\n 'ProductSize',\n 'fiProductClassDesc',\n 'state',\n 'ProductGroup',\n 'ProductGroupDesc',\n 'Drive_System',\n 'Enclosure',\n 'Forks',\n 'Pad_Type',\n 'Ride_Control',\n 'Stick',\n 'Transmission',\n 'Turbocharged',\n 'Blade_Extension',\n 'Blade_Width',\n 'Enclosure_Type',\n 'Engine_Horsepower',\n 'Hydraulics',\n 'Pushblock',\n 'Ripper',\n 'Scarifier',\n 'Tip_Control',\n 'Tire_Size',\n 'Coupler',\n 'Coupler_System',\n 'Grouser_Tracks',\n 'Hydraulics_Flow',\n 'Track_Type',\n 'Undercarriage_Pad_Width',\n 'Stick_Length',\n 'Thumb',\n 'Pattern_Changer',\n 'Grouser_Type',\n 'Backhoe_Mounting',\n 'Blade_Type',\n 'Travel_Controls',\n 'Differential_Type',\n 'Steering_Controls',\n 'saledate_year',\n 'saledate_month',\n 'saledate_day',\n 'saledate_quarter',\n 'saledate_day_of_year',\n 'saledate_day_of_week',\n 'saledate_days_since_epoch'\n]\n\ntarget = 'logSalePrice'\n\n\n\nSplit the data into training and validation sets\nNext we split the dataset into a training set and a validation set. 
Of course since we’re going to evaluate against the validation set a number of times as we iterate, it’s best practice to keep a separate test set reserved to check our final model to ensure it generalizes well. Assuming that final test set is hidden away, we can use the rest of the data for training and validation.\nThere are two main ways we might want to select the validation set. If there isn’t a temporal ordering of the observations, we might be able to randomly sample. In practice, it’s much more common that observations have a temporal ordering, and that models are trained on observations up to a certain time and used to predict on observations occuring after that time. Since this data is temporal, we don’t want to split randomly; instead we’ll split on observation date, reserving the latest observations for the validation set.\n\n# Temporal Validation Set\ndef train_test_split_temporal(df, datetime_column, n_test):\n idx_sort = np.argsort(df[datetime_column])\n idx_train, idx_test = idx_sort[:-n_valid], idx_sort[-n_valid:]\n return df.iloc[idx_train, :], df.iloc[idx_test, :]\n\n\n# Random Validation Set\ndef train_test_split_random(df, n_test):\n np.random.seed(42)\n idx_sort = np.random.permutation(len(df))\n idx_train, idx_test = idx_sort[:-n_valid], idx_sort[-n_valid:]\n return df.iloc[idx_train, :], df.iloc[idx_test, :]\n\nmy_train_test_split = lambda d, n_valid: train_test_split_temporal(d, 'saledate', n_valid)\n# my_train_test_split = lambda d, n_valid: train_test_split_random(d, n_valid)\n\n\nn_valid = 12000\ntrain_df, valid_df = my_train_test_split(df, n_valid)\n\ntrain_df.shape, valid_df.shape\n\n((389125, 61), (12000, 61))\n\n\n\n\nCreate DMatrix data objects\nXGBoost uses a data type called dense matrix for efficient training and prediction, so next we need to create DMatrix objects for our training and validation datasets.\n\nIf you prefer to use the scikit-learn interface to XGBoost, you don’t need to create these dense matrix objects. More on that below.\n\n\ndtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)\ndvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)\n\n\n\nSet the XGBoost parameters\nXGBoost has numerous hyperparameters. Fortunately, just a handful of them tend to be the most influential; furthermore, the default values are not bad in most situations. I like to start out with a dictionary containing the default parameter values for just the ones I think are most important. For training there is one required boosting parameter called num_boost_round which I set to 50 as a starting point; you can make this smaller initially if training takes too long.\n\n# default values for important parameters\nparams = {\n 'learning_rate': 0.3,\n 'max_depth': 6,\n 'min_child_weight': 1,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',\n}\nnum_boost_round = 50\n\n\n\nTrain the XGBoost model\nCheck out the documentation on the learning API to see all the training options. During training, I like to have XGBoost print out the evaluation metric on the train and validation set after every few boosting rounds and again at the end of training; that can be done by setting evals and verbose_eval. 
You can also save the evaluation results in a dictionary passed into evals_result to inspect and plot the objective curve over the training iterations.\n\nevals_result = {}\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')],\n verbose_eval=10,\n evals_result=evals_result)\n\n[0] train-rmse:6.74422 valid-rmse:6.79733\n[10] train-rmse:0.34798 valid-rmse:0.37158\n[20] train-rmse:0.26289 valid-rmse:0.28239\n[30] train-rmse:0.25148 valid-rmse:0.27028\n[40] train-rmse:0.24375 valid-rmse:0.26420\n[49] train-rmse:0.23738 valid-rmse:0.25855\n\n\n\n\nTrain the XGBoost model using the sklearn interface\nYou can optionally use the sklearn estimator interface to XGBoost. This will bypass the need to use the DMatrix data objects for training and prediction, and it will allow you to leverage many of the other scikit-learn ecosystem tools like pipelines, parameter search, partial dependence plots, etc. The XGBRegressor is available in the xgboost library that we’ve already imported.\n\n# scikit-learn interface\nreg = xgb.XGBRegressor(n_estimators=num_boost_round, **params)\nreg.fit(train_df[features], train_df[target], \n eval_set=[(train_df[features], train_df[target]), (valid_df[features], valid_df[target])], \n verbose=10);\n\n[0] validation_0-rmse:6.74422 validation_1-rmse:6.79733\n[10] validation_0-rmse:0.34798 validation_1-rmse:0.37158\n[20] validation_0-rmse:0.26289 validation_1-rmse:0.28239\n[30] validation_0-rmse:0.25148 validation_1-rmse:0.27028\n[40] validation_0-rmse:0.24375 validation_1-rmse:0.26420\n[49] validation_0-rmse:0.23738 validation_1-rmse:0.25855\n\n\nSince not all features of XGBoost are available through the scikit-learn estimator interface, you might want to get the native booster object back out of the sklearn wrapper.\n\nm = reg.get_booster()\n\n\n\nEvaluate the model and check for overfitting\nWe get the model evaluation metrics on the training and validation sets printed to stdout when we use the evals argument to the training API. Typically I just look at those printed metrics, but let’s double check by hand.\n\ndef root_mean_squared_error(y_true, y_pred):\n return np.sqrt(np.mean((y_true - y_pred)**2))\n\nroot_mean_squared_error(dvalid.get_label(), m.predict(dvalid))\n\n0.25855368\n\n\nSo, how good is that RMSLE of 0.259? Well, checking the Kagle leaderboard for this competition, we would have come in 53rd out of 474, which is in the top 12% of submissions. That’s not bad for 10 minutes of work doing the bare minimum necessary to transform the raw data into a format consumable by XGBoost and then training a model using default hyperparameter values.\n\nNote that we’re using a different validation set from that used for the final leaderboard (which is long closed), but our score is likely still a decent approximation for how we would have done in the competition.\n\nIt can be helpful to take a look at objective curves for training and validation data to get a sense for the extent of overfitting. A huge difference between training and validation performance indicates overfitting. In the below curve, there is very little overfitting, indicating we can be aggressive with hyperparameters that increase model flexibility. 
More on that soon.\n\npd.DataFrame({\n 'train': evals_result['train']['rmse'],\n 'valid': evals_result['valid']['rmse']\n}).plot(); plt.xlabel('boosting round'); plt.ylabel('objective');\n\n\n\n\n\n\nCheck feature importance\nIt’s helpful to get an idea of how much the model is using each feature. In following iterations we might want to try dropping low-signal features or examining the important ones more closely for feature engineering ideas. The gigantic caveat to keep in mind here is that there are different measures of feature importance, and each one will give different importances. XGBoost provides three importance measures; I tend to prefer looking at the weight measure because its rankings usually seem most intuitive.\n\nfig, ax = plt.subplots(figsize=(5,10))\nfeature_importances = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)\nfeature_importances.plot.barh(ax=ax)\nplt.title('Feature Importance');" }, { - "objectID": "posts/consider-the-decision-tree/index.html#how-ensembling-makes-trees-shine", - "href": "posts/consider-the-decision-tree/index.html#how-ensembling-makes-trees-shine", - "title": "Consider the Decision Tree", - "section": "How ensembling makes trees shine", - "text": "How ensembling makes trees shine\nWe can go a long way toward addressing the issues of overfitting and prediction quantization by using trees as building blocks in larger algorithms called tree ensembles, the most popular examples being random forest and gradient boosted trees. A tree ensemble is a collection of different individual tree models whose predictions are averaged to generate an overall prediction.\nEnsembling helps address overfitting because even if each individual tree is overfitted, the average of their individual noisy predictions will tend to be more stable. Think of it in terms of the bias variance tradeoff, where bias refers to a model’s failure to capture certain patterns and variance refers to how different a model prediction would be if the model were trained on a different sample of training data. Since the ensemble is averaging over the predictions of all the individual models, training it on a different sample of training data would change the individual models predictions, but their overall average prediction will tend to remain stable. Thus, ensembling helps reduce the effects of overfitting by reducing model variance without increasing bias.\nEnsembling also helps address prediction quantization issues. While each individual tree’s predictions might express large jumps in the regression surface, averaging many different trees’ predictions together effectively generates a surface with more partitions and smaller jumps between them. This provides a smoother approximation of the feature-target relationship." + "objectID": "posts/xgboost-for-regression-in-python/index.html#improve-performance-using-a-model-iteration-loop", + "href": "posts/xgboost-for-regression-in-python/index.html#improve-performance-using-a-model-iteration-loop", + "title": "XGBoost for Regression in Python", + "section": "Improve performance using a model iteration loop", + "text": "Improve performance using a model iteration loop\nAt this point we have a half-decent prototype model. 
Now we enter the model iteration loop in which we adjust features and model parameters to find configurations that have better and better performance.\nLet’s start by putting the feature and target specification, the training/validation split, the model training, and the evaluation all together in one code block that we can copy paste for easy model iteration.\n\nNote that for this process to be effective, model training needs to take less than 10 seconds. Otherwise you’ll be sitting around waiting way too long. If training takes too long, try training on a sample of the training data, or try reducing the number of boosting rounds.\n\n\nfeatures = [\n 'SalesID',\n 'MachineID',\n 'ModelID',\n 'datasource',\n 'auctioneerID',\n 'YearMade',\n 'MachineHoursCurrentMeter',\n 'UsageBand',\n 'fiModelDesc',\n 'fiBaseModel',\n 'fiSecondaryDesc',\n 'fiModelSeries',\n 'fiModelDescriptor',\n 'ProductSize',\n 'fiProductClassDesc',\n 'state',\n 'ProductGroup',\n 'ProductGroupDesc',\n 'Drive_System',\n 'Enclosure',\n 'Forks',\n 'Pad_Type',\n 'Ride_Control',\n 'Stick',\n 'Transmission',\n 'Turbocharged',\n 'Blade_Extension',\n 'Blade_Width',\n 'Enclosure_Type',\n 'Engine_Horsepower',\n 'Hydraulics',\n 'Pushblock',\n 'Ripper',\n 'Scarifier',\n 'Tip_Control',\n 'Tire_Size',\n 'Coupler',\n 'Coupler_System',\n 'Grouser_Tracks',\n 'Hydraulics_Flow',\n 'Track_Type',\n 'Undercarriage_Pad_Width',\n 'Stick_Length',\n 'Thumb',\n 'Pattern_Changer',\n 'Grouser_Type',\n 'Backhoe_Mounting',\n 'Blade_Type',\n 'Travel_Controls',\n 'Differential_Type',\n 'Steering_Controls',\n 'saledate_year',\n 'saledate_month',\n 'saledate_day',\n 'saledate_quarter',\n 'saledate_day_of_year',\n 'saledate_day_of_week',\n 'saledate_days_since_epoch',\n]\n\ntarget = 'logSalePrice'\n\ntrain_df, valid_df = train_test_split_temporal(df, 'saledate', 12000)\ndtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)\ndvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)\n\nparams = {\n 'learning_rate': 0.3,\n 'max_depth': 6,\n 'min_child_weight': 1,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',\n}\nnum_boost_round = 50\n\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')],verbose_eval=10)\n\n[0] train-rmse:6.74422 valid-rmse:6.79733\n[10] train-rmse:0.34798 valid-rmse:0.37158\n[20] train-rmse:0.26289 valid-rmse:0.28239\n[30] train-rmse:0.25148 valid-rmse:0.27028\n[40] train-rmse:0.24375 valid-rmse:0.26420\n[49] train-rmse:0.23738 valid-rmse:0.25855\n\n\n\nFeature selection\n\nDrop low-importance features\nLet’s try training a model on only the top k most important features. You can try different values of k for the rankings created from each of the three importance measures. 
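For example, here is one quick way to scan a few candidate values of k using the gain ranking; this little loop is my own addition rather than part of the original notebook, and the particular k values are arbitrary.\n\n# my own quick scan over k (not from the original notebook); the k values are arbitrary\nimportances_gain = pd.Series(m.get_score(importance_type='gain')).sort_values(ascending=False)\nfor k in [20, 30, 40, 50]:\n top_k = list(importances_gain[:k].index)\n dtrain_k = xgb.DMatrix(data=train_df[top_k], label=train_df[target], enable_categorical=True)\n dvalid_k = xgb.DMatrix(data=valid_df[top_k], label=valid_df[target], enable_categorical=True)\n m_k = xgb.train(params=params, dtrain=dtrain_k, num_boost_round=num_boost_round,\n evals=[(dtrain_k, 'train'), (dvalid_k, 'valid')], verbose_eval=False)\n print(k, root_mean_squared_error(dvalid_k.get_label(), m_k.predict(dvalid_k)))\n\n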
You can play with how many to keep, looking for the optimal number manually.\n\nfeature_importances_weight = pd.Series(m.get_score(importance_type='weight')).sort_values(ascending=False)\nfeature_importances_cover = pd.Series(m.get_score(importance_type='cover')).sort_values(ascending=False)\nfeature_importances_gain = pd.Series(m.get_score(importance_type='gain')).sort_values(ascending=False)\n\n\n# features = list(feature_importances_weight[:30].index)\n# features = list(feature_importances_cover[:35].index)\nfeatures = list(feature_importances_gain[:30].index)\n\ndtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)\ndvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)\n\nparams = {\n 'learning_rate': 0.3,\n 'max_depth': 6,\n 'min_child_weight': 1,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',\n}\nnum_boost_round = 50\n\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)\n\n[0] train-rmse:6.74422 valid-rmse:6.79733\n[10] train-rmse:0.34798 valid-rmse:0.37150\n[20] train-rmse:0.26182 valid-rmse:0.27986\n[30] train-rmse:0.24974 valid-rmse:0.26896\n[40] train-rmse:0.24282 valid-rmse:0.26043\n[49] train-rmse:0.23768 valid-rmse:0.25664\n\n\nLooks like keeping the top 30 from the gain importance type gives a slight performance improvement.\n\n\nDrop one feature at a time\nNext try dropping each feature out of the model one-at-a-time to see if there are any more features that you can drop. For each feature, drop it from the feature set, then train a new model, then record the evaluation score. At the end, sort the scores to see which features are the best candidates for removal.\n\nfeatures = [\n 'Coupler_System',\n 'Tire_Size',\n 'Scarifier',\n 'ProductSize',\n 'Ride_Control',\n 'fiBaseModel',\n 'Enclosure',\n 'Pad_Type',\n 'YearMade',\n 'fiSecondaryDesc',\n 'ProductGroup',\n 'Drive_System',\n 'Ripper',\n 'saledate_days_since_epoch',\n 'fiModelDescriptor',\n 'fiProductClassDesc',\n 'MachineID',\n 'Hydraulics',\n 'SalesID',\n 'Track_Type',\n 'ModelID',\n 'fiModelDesc',\n 'Travel_Controls',\n 'Transmission',\n 'Blade_Extension',\n 'fiModelSeries',\n 'Grouser_Tracks',\n 'Undercarriage_Pad_Width',\n 'Stick',\n 'Thumb'\n]\n\n# drop each feature one-at-a-time\nscores = []\nfor i, feature in enumerate(features):\n drop_one_features = features[:i] + features[i+1:]\n\n dtrain = xgb.DMatrix(data=train_df[drop_one_features], label=train_df[target], enable_categorical=True)\n dvalid = xgb.DMatrix(data=valid_df[drop_one_features], label=valid_df[target], enable_categorical=True)\n\n params = {\n 'learning_rate': 0.3,\n 'max_depth': 6,\n 'min_child_weight': 1,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',\n }\n num_boost_round = 50\n\n m = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')],\n verbose_eval=False)\n score = root_mean_squared_error(dvalid.get_label(), m.predict(dvalid))\n scores.append(score)\n\nresults_df = pd.DataFrame({\n 'feature': features,\n 'score': 
scores\n})\nresults_df.sort_values(by='score')\n\n\n\n\n\n\n\n\nfeature\nscore\n\n\n\n\n18\nSalesID\n0.252617\n\n\n5\nfiBaseModel\n0.253710\n\n\n27\nUndercarriage_Pad_Width\n0.254032\n\n\n17\nHydraulics\n0.254114\n\n\n20\nModelID\n0.254169\n\n\n4\nRide_Control\n0.254278\n\n\n16\nMachineID\n0.254413\n\n\n19\nTrack_Type\n0.254825\n\n\n6\nEnclosure\n0.254958\n\n\n28\nStick\n0.255164\n\n\n1\nTire_Size\n0.255365\n\n\n10\nProductGroup\n0.255404\n\n\n22\nTravel_Controls\n0.255895\n\n\n29\nThumb\n0.256300\n\n\n23\nTransmission\n0.256380\n\n\n26\nGrouser_Tracks\n0.256395\n\n\n11\nDrive_System\n0.256652\n\n\n24\nBlade_Extension\n0.256698\n\n\n7\nPad_Type\n0.256952\n\n\n25\nfiModelSeries\n0.257073\n\n\n2\nScarifier\n0.257590\n\n\n12\nRipper\n0.257848\n\n\n0\nCoupler_System\n0.258074\n\n\n21\nfiModelDesc\n0.258712\n\n\n13\nsaledate_days_since_epoch\n0.259856\n\n\n14\nfiModelDescriptor\n0.260439\n\n\n9\nfiSecondaryDesc\n0.260782\n\n\n15\nfiProductClassDesc\n0.263790\n\n\n3\nProductSize\n0.268068\n\n\n8\nYearMade\n0.313105\n\n\n\n\n\n\n\nNext try removing the feature with the best removal score. Then with that feature still removed, also try removing the feature with the next best removal score and so on. Repeat this process until the model evaluation metric is no longer improving. I think this could be considered a faster version of backward stepwise feature selection.\n\nfeatures = [\n 'Coupler_System',\n 'Tire_Size',\n 'Scarifier',\n 'ProductSize',\n 'Ride_Control',\n# 'fiBaseModel',\n 'Enclosure',\n 'Pad_Type',\n 'YearMade',\n 'fiSecondaryDesc',\n 'ProductGroup',\n 'Drive_System',\n 'Ripper',\n 'saledate_days_since_epoch',\n 'fiModelDescriptor',\n 'fiProductClassDesc',\n 'MachineID',\n# 'Hydraulics',\n# 'SalesID',\n 'Track_Type',\n 'ModelID',\n 'fiModelDesc',\n 'Travel_Controls',\n 'Transmission',\n 'Blade_Extension',\n 'fiModelSeries',\n 'Grouser_Tracks',\n# 'Undercarriage_Pad_Width',\n 'Stick',\n 'Thumb'\n]\n\ndtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], enable_categorical=True)\ndvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], enable_categorical=True)\n\nparams = {\n 'learning_rate': 0.3,\n 'max_depth': 6,\n 'min_child_weight': 1,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',\n}\nnum_boost_round = 50\n\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)\n\n[0] train-rmse:6.74422 valid-rmse:6.79145\n[10] train-rmse:0.34882 valid-rmse:0.37201\n[20] train-rmse:0.26050 valid-rmse:0.27386\n[30] train-rmse:0.24844 valid-rmse:0.26205\n[40] train-rmse:0.24042 valid-rmse:0.25426\n[49] train-rmse:0.23549 valid-rmse:0.25004\n\n\nSo here I was able to remove four more features before the score started getting worse. With our reduced feature set, we’re now ranking 39th on that Kagle leaderboard. Let’s see how far we can get with some hyperparameter tuning.\n\n\n\nTune the XGBoost hyperparameters\nThis is a topic which deserves its own full-length post, but just for fun, here I’ll do a quick and dirty hand tuning without a ton of explanation.\nBroadly speaking, my process is to increase model expressiveness by increasing the maximum tree depth untill it looks like I’m overfitting. At that point, I start pushing tree pruning parameters like min child weight and regularization parameters like lambda to counteract the overfitting. 
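If you’d rather scan a small grid than nudge these values purely by hand, a quick and dirty loop like the following works fine at this scale. This sketch is my own addition, and the candidate values are only illustrative, not recommendations from the original notebook.\n\n# my own rough grid scan (not from the original notebook); candidate values are arbitrary\nbest = None\nfor candidate_depth in [6, 8, 10]:\n for candidate_mcw in [1, 5, 15]:\n candidate_params = {**params, 'max_depth': candidate_depth, 'min_child_weight': candidate_mcw}\n m_c = xgb.train(params=candidate_params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=False)\n score = root_mean_squared_error(dvalid.get_label(), m_c.predict(dvalid))\n if best is None or score < best[0]:\n best = (score, candidate_depth, candidate_mcw)\nprint(best)\n\nEither way, the idea is the same: add flexibility until the validation score stops improving, then rein it back in with regularization. In my case I just nudged the parameters by hand.\n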
That process lead me to the following parameters.\n\nparams = {\n 'learning_rate': 0.3,\n 'max_depth': 10,\n 'min_child_weight': 14,\n 'lambda': 5,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',}\nnum_boost_round = 50\n\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)\n\n[0] train-rmse:6.74473 valid-rmse:6.80196\n[10] train-rmse:0.31833 valid-rmse:0.34151\n[20] train-rmse:0.22651 valid-rmse:0.24885\n[30] train-rmse:0.21501 valid-rmse:0.23904\n[40] train-rmse:0.20897 valid-rmse:0.23645\n[49] train-rmse:0.20418 valid-rmse:0.23412\n\n\nThat gets us up to 12th place. Next I start reducing the learning rate and increasing the boosting rounds in proportion to one another.\n\nparams = {\n 'learning_rate': 0.3/5,\n 'max_depth': 10,\n 'min_child_weight': 14,\n 'lambda': 5,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',}\nnum_boost_round = 50*5\n\nm = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10)\n\n[0] train-rmse:9.04930 valid-rmse:9.12743\n[10] train-rmse:4.88505 valid-rmse:4.93769\n[20] train-rmse:2.64630 valid-rmse:2.68501\n[30] train-rmse:1.44703 valid-rmse:1.47923\n[40] train-rmse:0.81123 valid-rmse:0.84079\n[50] train-rmse:0.48441 valid-rmse:0.51272\n[60] train-rmse:0.32887 valid-rmse:0.35434\n[70] train-rmse:0.26276 valid-rmse:0.28630\n[80] train-rmse:0.23720 valid-rmse:0.26026\n[90] train-rmse:0.22658 valid-rmse:0.24932\n[100] train-rmse:0.22119 valid-rmse:0.24441\n[110] train-rmse:0.21747 valid-rmse:0.24114\n[120] train-rmse:0.21479 valid-rmse:0.23923\n[130] train-rmse:0.21250 valid-rmse:0.23768\n[140] train-rmse:0.21099 valid-rmse:0.23618\n[150] train-rmse:0.20928 valid-rmse:0.23524\n[160] train-rmse:0.20767 valid-rmse:0.23445\n[170] train-rmse:0.20658 valid-rmse:0.23375\n[180] train-rmse:0.20558 valid-rmse:0.23307\n[190] train-rmse:0.20431 valid-rmse:0.23252\n[200] train-rmse:0.20316 valid-rmse:0.23181\n[210] train-rmse:0.20226 valid-rmse:0.23145\n[220] train-rmse:0.20133 valid-rmse:0.23087\n[230] train-rmse:0.20045 valid-rmse:0.23048\n[240] train-rmse:0.19976 valid-rmse:0.23023\n[249] train-rmse:0.19902 valid-rmse:0.23009\n\n\nDecreasing the learning rate and increasing the boosting rounds got us up to a 2nd place score. Notice that the score is still decreasing on the validation set. We can actually continue boosting on this model by passing it to the xgb_model argument in the train function. We want to go very very slowly here to avoid overshooting the minimum of the objective function. 
To do that I ramp up the lambda regularization parameter and boost a few more rounds from where we left off.\n\n# second stage\nparams = {\n 'learning_rate': 0.3/10,\n 'max_depth': 10,\n 'min_child_weight': 14,\n 'lambda': 60,\n 'subsample': 1,\n 'colsample_bynode': 1,\n 'objective': 'reg:squarederror',}\nnum_boost_round = 50*3\n\nm1 = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,\n evals=[(dtrain, 'train'), (dvalid, 'valid')], verbose_eval=10,\n xgb_model=m)\n\n[0] train-rmse:0.19900 valid-rmse:0.23007\n[10] train-rmse:0.19862 valid-rmse:0.22990\n[20] train-rmse:0.19831 valid-rmse:0.22975\n[30] train-rmse:0.19796 valid-rmse:0.22964\n[40] train-rmse:0.19768 valid-rmse:0.22955\n[50] train-rmse:0.19739 valid-rmse:0.22940\n[60] train-rmse:0.19714 valid-rmse:0.22935\n[70] train-rmse:0.19689 valid-rmse:0.22927\n[80] train-rmse:0.19664 valid-rmse:0.22915\n[90] train-rmse:0.19646 valid-rmse:0.22915\n[100] train-rmse:0.19620 valid-rmse:0.22910\n[110] train-rmse:0.19604 valid-rmse:0.22907\n[120] train-rmse:0.19583 valid-rmse:0.22901\n[130] train-rmse:0.19562 valid-rmse:0.22899\n[140] train-rmse:0.19546 valid-rmse:0.22898\n[149] train-rmse:0.19520 valid-rmse:0.22886\n\n\n\nroot_mean_squared_error(dvalid.get_label(), m1.predict(dvalid))\n\n0.22885828\n\n\nAnd that gets us to 1st place on the leaderboard." }, { - "objectID": "posts/consider-the-decision-tree/index.html#wrapping-up", - "href": "posts/consider-the-decision-tree/index.html#wrapping-up", - "title": "Consider the Decision Tree", + "objectID": "posts/xgboost-for-regression-in-python/index.html#wrapping-up", + "href": "posts/xgboost-for-regression-in-python/index.html#wrapping-up", + "title": "XGBoost for Regression in Python", "section": "Wrapping Up", - "text": "Wrapping Up\nWell, there you go, that’s my take on the high-level overview of the decision tree and its main strengths and weaknesses. As we’ve seen, ensembling allows us to keep the conveniences of the decision tree while mitigating its core weakness of relatively weak predictive power. This is why tree ensembles are so popular in practical applications. We glossed over pretty much all details of how trees actually do their magic, but fear not, next time we’re going to get rowdy and build one of these things from scratch." + "text": "Wrapping Up\nThere you have it, how to use XGBoost to solve a regression problem in python with world class performance. Remember you can use the XGBoost regression notebook from my ds-templates repo to make it easy to follow this flow on your own problems. If you found this helpful, or if you have additional ideas about solving regression problems with XGBoost, let me know down in the comments." }, { - "objectID": "posts/xgboost-from-scratch/index.html", - "href": "posts/xgboost-from-scratch/index.html", - "title": "XGBoost from Scratch", + "objectID": "posts/hello-world/index.html", + "href": "posts/hello-world/index.html", + "title": "Hello World! And Why I’m Inspired to Start a Blog", "section": "", - "text": "A weathered tree reaches toward the sea at Playa Mal País\nWell, dear reader, it’s that time again, time for us to do a seemingly unnecessary scratch build of a popular algorithm that most people would simply import from the library without a second thought. But readers of this blog are not most people. Of course you know that when we do scratch builds, it’s not for the hell of it, it’s for the purpose of demystification. 
To that end, today we are going to implement XGBoost from scratch in python, using only numpy and pandas.\nSpecifically we’re going to implement the core statistical learning algorithm of XGBoost, including most of the key hyperparameters and their functionality. Our implementation will also support user-defined custom objective functions, meaning that it can perform regression, classification, and whatever exotic learning tasks you can dream up, as long as you can write down a twice-differentiable objective function. We’ll refrain from implementing some simple features like column subsampling which will be left to you, gentle reader, as exercises. In terms of tree methods, we’re going to implement the exact tree-splitting algorithm, leaving the sparsity-aware method (used to handle missing feature values) and the approximate method (used for scalability) as exercises or maybe topics for future posts.\nAs always, if something is unclear, try backtracking through the previous posts on gradient boosting and decision trees to clarify your intuition. We’ve already built up all the statistical and computational background needed to make sense of this scratch build. Here are the most important prerequisite posts:\nGreat, let’s do this." + "text": "Matt raises his arms in joy at the world.!\nWell, I’ve been thinking about getting this blog started for months now. I guess a combination of inertia, up-front investment in blogging platform selection/setup, and spending a little too much time writing and rewriting the first content post has drawn out the period from initial inspiration to making the blog a reality. Needless to say, I’m pretty excited to finally get things going.\nBefore we dive headlong into the weeds of ML algorithms, statistical methods, and whatever I happen to be learning and teaching at the moment, I figured it would be good to articulate why I’ve felt inspired to get started blogging in the first place. Hopefully this will serve the dual purpose of clarifying my intentions and introducing a vastly underappreciated concept in data science that I hope to weave through the posts to come." }, { - "objectID": "posts/xgboost-from-scratch/index.html#the-xgboost-model-class", - "href": "posts/xgboost-from-scratch/index.html#the-xgboost-model-class", - "title": "XGBoost from Scratch", - "section": "The XGBoost Model Class", - "text": "The XGBoost Model Class\nWe begin with the user-facing API for our model, a class called XGBoostModel which will implement gradient boosting and prediction. To be more consistent with the XGBoost library, we’ll pass hyperparameters to our model in a parameter dictionary, so our init method is going to pull relevant parameters out of the dictionary and set them as object attributes. 
Note the use of python’s defaultdict so we don’t have to worry about handling key errors if we try to access a parameter that the user didn’t set in the dictionary.\n\nimport math\nimport numpy as np \nimport pandas as pd\nfrom collections import defaultdict\n\n\nclass XGBoostModel():\n '''XGBoost from Scratch\n '''\n \n def __init__(self, params, random_seed=None):\n self.params = defaultdict(lambda: None, params)\n self.subsample = self.params['subsample'] \\\n if self.params['subsample'] else 1.0\n self.learning_rate = self.params['learning_rate'] \\\n if self.params['learning_rate'] else 0.3\n self.base_prediction = self.params['base_score'] \\\n if self.params['base_score'] else 0.5\n self.max_depth = self.params['max_depth'] \\\n if self.params['max_depth'] else 5\n self.rng = np.random.default_rng(seed=random_seed)\n\nThe fit method, based on our classic GBM, takes a feature dataframe, a target vector, the objective function, and the number of boosting rounds as arguments. The user-supplied objective function should be an object with loss, gradient, and hessian methods, each of which takes a target vector and a prediction vector as input; the loss method should return a scalar loss score, the gradient method should return a vector of gradients, and the hessian method should return a vector of hessians.\nIn contrast to boosting in the classic GBM, instead of computing residuals between the current predictions and the target, we compute gradients and hessians of the loss function with respect to the current predictions, and instead of predicting residuals with a decision tree, we fit a special XGBoost tree booster (which we’ll implement in a moment) using the gradients and hessians. I’ve also added row subsampling by drawing a random subset of instance indices and passing them to the tree booster during each boosting round. The rest of the fit method is the same as the classic GBM, and the predict method is identical too.\n\ndef fit(self, X, y, objective, num_boost_round, verbose=False):\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n self.boosters = []\n for i in range(num_boost_round):\n gradients = objective.gradient(y, current_predictions)\n hessians = objective.hessian(y, current_predictions)\n sample_idxs = None if self.subsample == 1.0 \\\n else self.rng.choice(len(y), \n size=math.floor(self.subsample*len(y)), \n replace=False)\n booster = TreeBooster(X, gradients, hessians, \n self.params, self.max_depth, sample_idxs)\n current_predictions += self.learning_rate * booster.predict(X)\n self.boosters.append(booster)\n if verbose: \n print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')\n \ndef predict(self, X):\n return (self.base_prediction + self.learning_rate \n * np.sum([booster.predict(X) for booster in self.boosters], axis=0))\n\nXGBoostModel.fit = fit\nXGBoostModel.predict = predict \n\nAll we have to do now is implement the tree booster." + "objectID": "posts/hello-world/index.html#learning", + "href": "posts/hello-world/index.html#learning", + "title": "Hello World! And Why I’m Inspired to Start a Blog", + "section": "Learning", + "text": "Learning\nThe initial inception about blogging probably originated from some comments about learning that Jeremy Howard makes in the Practical Deep Learning course from fastai. During one of the lectures, he mentions that it’s a great idea to start blogging. 
To paraphrase Jeremy:\n\nThe thing I really love about blogging is that it helps you learn; by writing things down, you synthesize your ideas.\n\nBeautiful. That definitely rings true for me. I tend to take notes and play around with code when learning new concepts anyway. One of my key hypotheses about this blogging experiment is that making the effort to transform those notes into blog posts will help me learn more effectively." }, { - "objectID": "posts/xgboost-from-scratch/index.html#the-xgboost-tree-booster", - "href": "posts/xgboost-from-scratch/index.html#the-xgboost-tree-booster", - "title": "XGBoost from Scratch", - "section": "The XGBoost Tree Booster", - "text": "The XGBoost Tree Booster\nThe XGBoost tree booster is a modified version of the decision tree that we built in the decision tree from scratch post. Like the decision tree, we recursively build a binary tree structure by finding the best split rule for each node in the tree. The main difference is the criterion for evaluating splits and the way that we define a leaf’s predicted value. Instead of being functions of the target values of the instances in each node, the criterion and predicted values are functions of the instance gradients and hessians. Thus we need only make a couple of modifications to our previous decision tree implementation to create the XGBoost tree booster.\n\nInitialization and Inserting Child Nodes\nMost of the init method is just parsing the parameter dictionary to assign parameters as object attributes. The one notable difference from our decision tree is in the way we define the node’s predicted value. We define self.value according to equation 5 of the XGBoost paper, a simple function of the gradient and hessian values of the instances in the current node. Of course the init also goes on to build the tree via the maybe insert child nodes method. This method is nearly identical to the one we implemented for our decision tree. 
So far so good.\n\nclass TreeBooster():\n \n def __init__(self, X, g, h, params, max_depth, idxs=None):\n self.params = params\n self.max_depth = max_depth\n assert self.max_depth >= 0, 'max_depth must be nonnegative'\n self.min_child_weight = params['min_child_weight'] \\\n if params['min_child_weight'] else 1.0\n self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0\n self.gamma = params['gamma'] if params['gamma'] else 0.0\n self.colsample_bynode = params['colsample_bynode'] \\\n if params['colsample_bynode'] else 1.0\n if isinstance(g, pd.Series): g = g.values\n if isinstance(h, pd.Series): h = h.values\n if idxs is None: idxs = np.arange(len(g))\n self.X, self.g, self.h, self.idxs = X, g, h, idxs\n self.n, self.c = len(idxs), X.shape[1]\n self.value = -g[idxs].sum() / (h[idxs].sum() + self.reg_lambda) # Eq (5)\n self.best_score_so_far = 0.\n if self.max_depth > 0:\n self._maybe_insert_child_nodes()\n\n def _maybe_insert_child_nodes(self):\n for i in range(self.c): self._find_better_split(i)\n if self.is_leaf: return\n x = self.X.values[self.idxs,self.split_feature_idx]\n left_idx = np.nonzero(x <= self.threshold)[0]\n right_idx = np.nonzero(x > self.threshold)[0]\n self.left = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[left_idx])\n self.right = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[right_idx])\n\n @property\n def is_leaf(self): return self.best_score_so_far == 0.\n\n def _find_better_split(self, feature_idx):\n pass\n\n\n\nSplit Finding\nSplit finding follows the exact same pattern that we used in the decision tree, except we keep track of gradient and hessian stats instead of target value stats, and of course we use the XGBoost gain criterion (equation 7 from the paper) for evaluating splits.\n\ndef _find_better_split(self, feature_idx):\n x = self.X.values[self.idxs, feature_idx]\n g, h = self.g[self.idxs], self.h[self.idxs]\n sort_idx = np.argsort(x)\n sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]\n sum_g, sum_h = g.sum(), h.sum()\n sum_g_right, sum_h_right = sum_g, sum_h\n sum_g_left, sum_h_left = 0., 0.\n\n for i in range(0, self.n - 1):\n g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]\n sum_g_left += g_i; sum_g_right -= g_i\n sum_h_left += h_i; sum_h_right -= h_i\n if sum_h_left < self.min_child_weight or x_i == x_i_next:continue\n if sum_h_right < self.min_child_weight: break\n\n gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))\n + (sum_g_right**2 / (sum_h_right + self.reg_lambda))\n - (sum_g**2 / (sum_h + self.reg_lambda))\n ) - self.gamma/2 # Eq(7) in the xgboost paper\n if gain > self.best_score_so_far: \n self.split_feature_idx = feature_idx\n self.best_score_so_far = gain\n self.threshold = (x_i + x_i_next) / 2\n \nTreeBooster._find_better_split = _find_better_split\n\n\n\nPrediction\nPrediction works exactly the same as in our decision tree, and the methods are nearly identical.\n\ndef predict(self, X):\n return np.array([self._predict_row(row) for i, row in X.iterrows()])\n\ndef _predict_row(self, row):\n if self.is_leaf: \n return self.value\n child = self.left if row[self.split_feature_idx] <= self.threshold \\\n else self.right\n return child._predict_row(row)\n\nTreeBooster.predict = predict \nTreeBooster._predict_row = _predict_row" + "objectID": "posts/hello-world/index.html#teaching", + "href": "posts/hello-world/index.html#teaching", + "title": "Hello World! 
And Why I’m Inspired to Start a Blog", + "section": "Teaching", + "text": "Teaching\nAh, teaching. Yes, sometimes it’s that thing that takes time away from your research, forcing you to sit alone in a windowless room squinting at hand-written math on a fat stack of homework assignments. But sometimes it actually involves interacting with students, endeavoring to explain a concept, and watching them light up when they get it. The latter manifestation of teaching was one of my favorite things about grad school and academia in general. While I certainly still get to do some teaching as an industry data scientist, I could see myself returning to a more teaching-centric gig somewhere off in the future. Thus we have our second key hypothesis about the blogging experiment, that the writing will entertain my inclination to teach." }, { - "objectID": "posts/xgboost-from-scratch/index.html#the-complete-xgboost-from-scratch-implementation", - "href": "posts/xgboost-from-scratch/index.html#the-complete-xgboost-from-scratch-implementation", - "title": "XGBoost from Scratch", - "section": "The Complete XGBoost From Scratch Implementation", - "text": "The Complete XGBoost From Scratch Implementation\nHere’s the entire implementation which produces a usable XGBoostModel class with fit and predict methods.\n\nclass XGBoostModel():\n '''XGBoost from Scratch\n '''\n \n def __init__(self, params, random_seed=None):\n self.params = defaultdict(lambda: None, params)\n self.subsample = self.params['subsample'] \\\n if self.params['subsample'] else 1.0\n self.learning_rate = self.params['learning_rate'] \\\n if self.params['learning_rate'] else 0.3\n self.base_prediction = self.params['base_score'] \\\n if self.params['base_score'] else 0.5\n self.max_depth = self.params['max_depth'] \\\n if self.params['max_depth'] else 5\n self.rng = np.random.default_rng(seed=random_seed)\n \n def fit(self, X, y, objective, num_boost_round, verbose=False):\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n self.boosters = []\n for i in range(num_boost_round):\n gradients = objective.gradient(y, current_predictions)\n hessians = objective.hessian(y, current_predictions)\n sample_idxs = None if self.subsample == 1.0 \\\n else self.rng.choice(len(y), \n size=math.floor(self.subsample*len(y)), \n replace=False)\n booster = TreeBooster(X, gradients, hessians, \n self.params, self.max_depth, sample_idxs)\n current_predictions += self.learning_rate * booster.predict(X)\n self.boosters.append(booster)\n if verbose: \n print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')\n \n def predict(self, X):\n return (self.base_prediction + self.learning_rate \n * np.sum([booster.predict(X) for booster in self.boosters], axis=0))\n \nclass TreeBooster():\n \n def __init__(self, X, g, h, params, max_depth, idxs=None):\n self.params = params\n self.max_depth = max_depth\n assert self.max_depth >= 0, 'max_depth must be nonnegative'\n self.min_child_weight = params['min_child_weight'] \\\n if params['min_child_weight'] else 1.0\n self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0\n self.gamma = params['gamma'] if params['gamma'] else 0.0\n self.colsample_bynode = params['colsample_bynode'] \\\n if params['colsample_bynode'] else 1.0\n if isinstance(g, pd.Series): g = g.values\n if isinstance(h, pd.Series): h = h.values\n if idxs is None: idxs = np.arange(len(g))\n self.X, self.g, self.h, self.idxs = X, g, h, idxs\n self.n, self.c = len(idxs), X.shape[1]\n self.value = -g[idxs].sum() / 
(h[idxs].sum() + self.reg_lambda) # Eq (5)\n self.best_score_so_far = 0.\n if self.max_depth > 0:\n self._maybe_insert_child_nodes()\n\n def _maybe_insert_child_nodes(self):\n for i in range(self.c): self._find_better_split(i)\n if self.is_leaf: return\n x = self.X.values[self.idxs,self.split_feature_idx]\n left_idx = np.nonzero(x <= self.threshold)[0]\n right_idx = np.nonzero(x > self.threshold)[0]\n self.left = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[left_idx])\n self.right = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[right_idx])\n\n @property\n def is_leaf(self): return self.best_score_so_far == 0.\n \n def _find_better_split(self, feature_idx):\n x = self.X.values[self.idxs, feature_idx]\n g, h = self.g[self.idxs], self.h[self.idxs]\n sort_idx = np.argsort(x)\n sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]\n sum_g, sum_h = g.sum(), h.sum()\n sum_g_right, sum_h_right = sum_g, sum_h\n sum_g_left, sum_h_left = 0., 0.\n\n for i in range(0, self.n - 1):\n g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]\n sum_g_left += g_i; sum_g_right -= g_i\n sum_h_left += h_i; sum_h_right -= h_i\n if sum_h_left < self.min_child_weight or x_i == x_i_next:continue\n if sum_h_right < self.min_child_weight: break\n\n gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))\n + (sum_g_right**2 / (sum_h_right + self.reg_lambda))\n - (sum_g**2 / (sum_h + self.reg_lambda))\n ) - self.gamma/2 # Eq(7) in the xgboost paper\n if gain > self.best_score_so_far: \n self.split_feature_idx = feature_idx\n self.best_score_so_far = gain\n self.threshold = (x_i + x_i_next) / 2\n \n def predict(self, X):\n return np.array([self._predict_row(row) for i, row in X.iterrows()])\n\n def _predict_row(self, row):\n if self.is_leaf: \n return self.value\n child = self.left if row[self.split_feature_idx] <= self.threshold \\\n else self.right\n return child._predict_row(row)" + "objectID": "posts/hello-world/index.html#contributing", + "href": "posts/hello-world/index.html#contributing", + "title": "Hello World! And Why I’m Inspired to Start a Blog", + "section": "Contributing", + "text": "Contributing\nWorking in the field of data science today is a bit like standing in front of a massive complimentary all-you-can-learn buffet. There is an abundance of free material out on the interwebs for learning pretty much anything in data science from hello world python tutorials to research papers on cutting-edge deep learning techniques. I’ve personally benefited from many a blog post that helped me unpack a new concept or get started using a new tool. And let’s not forget the gigantic cyber warehouse full of freely available open source software tools that volunteer developers have straight-up donated to humanity.\nI realize that up to now, I’ve simply been consuming all of this free goodness without giving anything substantive back in return. Well then, it’s time to start evening the score. Which brings us to key hypothesis number three, that through these blog posts, I might be able to create something helpful, thereby being of service to a community that has freely given so much to me." }, { - "objectID": "posts/xgboost-from-scratch/index.html#testing", - "href": "posts/xgboost-from-scratch/index.html#testing", - "title": "XGBoost from Scratch", - "section": "Testing", - "text": "Testing\nLet’s take this baby for a spin and benchmark its performance against the actual XGBoost library. 
We use the scikit-learn California housing dataset for benchmarking.\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.model_selection import train_test_split\n \nX, y = fetch_california_housing(as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, \n random_state=43)\n\nLet’s start with a nice friendly squared error objective function for training. We should probably have a future post all about how to define custom objective functions in XGBoost, but for now, here’s how I define squared error.\n\nclass SquaredErrorObjective():\n def loss(self, y, pred): return np.mean((y - pred)**2)\n def gradient(self, y, pred): return pred - y\n def hessian(self, y, pred): return np.ones(len(y))\n\nHere I use a more or less arbitrary set of hyperparameters for training. Feel free to play around with tuning and trying other parameter combinations yourself.\n\nimport xgboost as xgb\n\nparams = {\n 'learning_rate': 0.1,\n 'max_depth': 5,\n 'subsample': 0.8,\n 'reg_lambda': 1.5,\n 'gamma': 0.0,\n 'min_child_weight': 25,\n 'base_score': 0.0,\n 'tree_method': 'exact',\n}\nnum_boost_round = 50\n\n# train the from-scratch XGBoost model\nmodel_scratch = XGBoostModel(params, random_seed=42)\nmodel_scratch.fit(X_train, y_train, SquaredErrorObjective(), num_boost_round)\n\n# train the library XGBoost model\ndtrain = xgb.DMatrix(X_train, label=y_train)\ndtest = xgb.DMatrix(X_test, label=y_test)\nmodel_xgb = xgb.train(params, dtrain, num_boost_round)\n\nLet’s check the models’ performance on the held out test data to benchmark our implementation.\n\npred_scratch = model_scratch.predict(X_test)\npred_xgb = model_xgb.predict(dtest)\nprint(f'scratch score: {SquaredErrorObjective().loss(y_test, pred_scratch)}')\nprint(f'xgboost score: {SquaredErrorObjective().loss(y_test, pred_xgb)}')\n\nscratch score: 0.2434125759558149\nxgboost score: 0.24123239765807963\n\n\nWell, look at that! Our scratch-built XGBoost is looking pretty consistent with the library. Go us!"
  },
  {
    "objectID": "posts/xgboost-from-scratch/index.html#wrapping-up",
    "href": "posts/xgboost-from-scratch/index.html#wrapping-up",
    "title": "XGBoost from Scratch",
    "section": "Wrapping Up",
    "text": "Wrapping Up\nI’d say this is a pretty good milestone for us here at Random Realizations. We’ve been hammering away at the various concepts around gradient boosting, leaving a trail of equations and scratch-built algos in our wake. Today we put all of that together to create a legit scratch build of XGBoost, something that would have been out of reach for me before we embarked on this journey together over a year ago. To anyone with the patience to read through this stuff, cheers to you! 
I hope you’re learning and enjoying this as much as I am." + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "", + "text": "Cold water cascades over the rocks in Erwin, Tennessee.\nFriends, this is going to be an epic post! Today, we bring together all the ideas we’ve built up over the past few posts to nail down our understanding of the key ideas in Jerome Friedman’s seminal 2001 paper: “Greedy Function Approximation: A Gradient Boosting Machine.” In particular, we’ll summarize the highlights from the paper, and we’ll build an in-house python implementation of his generic gradient boosting algorithm which can train with any differentiable loss function. What’s more, we’ll go ahead and take our generic gradient boosting machine for a spin by training it with several of the most popular loss functions used in practice.\nAre you freaking stoked or what?\nSweet. Let’s do this." }, { - "objectID": "posts/xgboost-from-scratch/index.html#reader-exercises", - "href": "posts/xgboost-from-scratch/index.html#reader-exercises", - "title": "XGBoost from Scratch", - "section": "Reader Exercises", - "text": "Reader Exercises\nIf you want to take this a step further and deepen your understanding and coding abilities, let me recommend some exercises for you.\n\nImplement column subsampling. XGBoost itself provides column subsampling by tree, by level, and by node. Try implementing by tree first, then try adding by level or by node as well. These should be pretty straightforward to do.\nImplement sparsity aware split finding for missing feature values (Algorithm 2 in the XGBoost paper). This will be a little more involved, since you’ll need to refactor and modify several parts of the tree booster class." + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedman-2001-tldr", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedman-2001-tldr", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "Friedman 2001: TL;DR", + "text": "Friedman 2001: TL;DR\nI’ve mentioned this paper a couple of times before, but as far as I can tell, this is the origin of gradient boosting; it is therefore, a seminal work worth reading. You know what, I think you might like to pick up the paper and read it yourself. Like many papers, there is a lot of scary looking math in the first few pages, but if you’ve been following along on this blog, you’ll find that it’s actually totally approachable. This is the kind of thing that cures imposter syndrome, so give it a shot. That said, here’s the TL;DR as I see it.\nThe first part of the paper introduces the idea of fitting models by doing gradient descent in function space, an ingenious idea we spent an entire post demystifying earlier. Friedman goes on to introduce the generic gradient boost algorithm, which works with any differentiable loss function, as well as specific variants for minimizing absolute error, Huber loss, and binary deviance. In terms of hyperparameters, he points out that the learning rate can be used to reduce overfitting, while increased tree depth can help capture more complex interactions among features. 
He even discusses feature importance and partial dependence methods for interpreting fitted gradient boosting models.\nFriedman concludes by musing about the advantages of gradient boosting with trees. He notes some key advantages afforded by the use of decision trees including no need to rescale input data, robustness against irrelevant input features, and elegant handling of missing feature values. He points out that gradient boosting manages to capitalize on the benefits of decision trees while minimizing their key weakness (crappy accuracy). I think this offers a great insight into why gradient boosting models have become so widespread and successful in practical ML applications." }, { - "objectID": "posts/hello-pyspark/index.html", - "href": "posts/hello-pyspark/index.html", - "title": "Hello PySpark!", - "section": "", - "text": "A big day at Playa Guiones\nWell, you guessed it: it’s time for us to learn PySpark!\nI know, I know, I can hear you screaming into your pillow. Indeed we just spent all that time converting from R and learning python and why the hell do we need yet another API for working with dataframes?\nThat’s a totally fair question.\nSo what happens when we’re working on something in the real world, where datasets get large in a hurry, and we suddenly have a dataframe that no longer fits into memory? We need a way for our computations and datasets to scale across multiple nodes in a distributed system without having to get too fussy about all the distributed compute details.\nEnter PySpark.\nI think it’s fair to think of PySpark as a python package for working with arbitrarily large dataframes, i.e., it’s like pandas but scalable. It’s built on top of Apache Spark, a unified analytics engine for large-scale data processing. PySpark is essentially a way to access the functionality of spark via python code. While there are other high-level interfaces to Spark (such as Java, Scala, and R), for data scientists who are already working extensively with python, PySpark will be the natural interface of choice. PySpark also has great integration with SQL, and it has a companion machine learning library called MLlib that’s more or less a scalable scikit-learn (maybe we can cover it in a future post).\nSo, here’s the plan. First we’re going to get set up to run PySpark locally in a jupyter notebook on our laptop. This is my preferred environment for interactively playing with PySpark and learning the ropes. Then we’re going to get up and running in PySpark as quickly as possible by reviewing the most essential functionality for working with dataframes and comparing it to how we would do things in pandas. Once we’re comfortable running PySpark on the laptop, it’s going to be much easier to jump onto a distributed cluster and run PySpark at scale.\nLet’s do this." + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedmans-generic-gradient-boosting-algorithm", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedmans-generic-gradient-boosting-algorithm", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "Friedman’s Generic Gradient Boosting Algorithm", + "text": "Friedman’s Generic Gradient Boosting Algorithm\nLet’s take a closer look at Friedman’s original gradient boost algorithm, Alg. 
1 in Section 3 of the paper (translated into the notation we’ve been using so far).\nLike last time, we have training data \\((\\mathbf{y}, \\mathbf{X})\\) where \\(\\mathbf{y}\\) is a length-\\(n\\) vector of target values, and \\(\\mathbf{X}\\) is an \\(n \\times p\\) matrix with \\(n\\) observations of \\(p\\) features. We also have a differentiable loss function \\(L(\\mathbf{y}, \\mathbf{\\hat{y}}) = \\sum_{i=1}^n l(y_i, \\hat{y}_i)\\), a “learning rate” hyperparameter \\(\\eta\\), and a fixed number of model iterations \\(M\\).\nAlgorithm: gradient_boost\\((\\mathbf{X},\\mathbf{y},L,\\eta, M)\\) returns: model \\(F_M\\)\n\nLet base model \\(F_0(\\mathbf{x}) = c\\), where \\(c = \\text{argmin}_{c} \\sum_{i=1}^n l(y_i, c)\\)\nfor \\(m\\) = \\(0\\) to \\(M-1\\):\n     Let “pseudo-residual” vector \\(\\mathbf{r}_m = -\\nabla_{\\mathbf{\\hat{y}}_m} L(\\mathbf{y},\\mathbf{\\hat{y}}_m)\\)\n     Train decision tree regressor \\(h_m(\\mathbf{X})\\) to predict \\(\\mathbf{r}_m\\) (minimizing squared error)\n     foreach terminal leaf node \\(t \\in h_m\\):\n          Let \\(v = \\text{argmin}_v \\sum_{i \\in t} l(y_i, F_m(\\mathbf{x}_i) + v)\\)\n          Set terminal leaf node \\(t\\) to predict value \\(v\\)\n     \\(F_{m+1}(\\mathbf{X}) = F_{m}(\\mathbf{X}) + \\eta h_m(\\mathbf{X})\\)\nReturn composite model \\(F_M\\)\n\nBy now, most of this is already familiar to us. We begin by setting the base model \\(F_0\\) equal to the constant prediction value that minimizes the loss over all examples in the training dataset (line 1). Then we begin the boosting iterations (line 2), each time computing the negative gradients of the loss with respect to the current model predictions (known as the pseudo residuals) (line 3). We then fit our next decision tree regressor to predict the pseudo residuals (line 4).\nThen we encounter something new on lines 5-7. When we fit a vanilla decision tree regressor to predict pseudo residuals, we’re using mean squared error as the loss function to train the tree. As you might imagine, this works well when the global loss function is also squared error. But if we want to use a global loss other than squared error, there is an additional trick we can use to further increase the composite model’s accuracy. The idea is to continue using squared error to train each decision tree, keeping its structure and split conditions but altering the predicted value in each leaf to help minimize the global loss function. Instead of using the mean target value as the prediction for each node (as we would do when minimizing squared error), we use a numerical optimization method like line search to choose the constant value for that leaf that leads to the best overall loss. This is the same thing we did in line 1 of the algorithm to set the base prediction, but here we choose the optimal prediction for each terminal node of the newly trained decision tree." }, { - "objectID": "posts/hello-pyspark/index.html#how-to-run-pyspark-in-a-jupyter-notebook-on-your-laptop", - "href": "posts/hello-pyspark/index.html#how-to-run-pyspark-in-a-jupyter-notebook-on-your-laptop", - "title": "Hello PySpark!", - "section": "How to Run PySpark in a Jupyter Notebook on Your Laptop", - "text": "How to Run PySpark in a Jupyter Notebook on Your Laptop\nOk, I’m going to walk us through how to get things installed on a Mac or Linux machine where we’re using homebrew and conda to manage virtual environments. 
If you have a different setup, your favorite search engine will help you get PySpark set up locally.\n\n\n\n\n\n\nNote\n\n\n\nIt’s possible for Homebrew and Anaconda to interfere with one another. The simple rule of thumb is that whenever you want to use the brew command, first deactivate your conda environment by running conda deactivate. See this Stack Overflow question for more details.\n\n\n\nInstall Spark\nInstall Spark with homebrew.\nbrew install apache-spark\nNext we need to set up a SPARK_HOME environment variable in the shell. Check where Spark is installed.\nbrew info apache-spark\nYou should see something like\n==> apache-spark: stable 3.3.2 (bottled), HEAD\nEngine for large-scale data processing\nhttps://spark.apache.org/\n/opt/homebrew/Cellar/apache-spark/3.3.2 (1,453 files, 320.9MB) *\n...\nSet the SPARK_HOME environment variable to your spark installation path with /libexec appended to the end. To do this I added the following line to my .zshrc file.\nexport SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.3.2/libexec\nRestart your shell, and test the installation by starting the Spark shell.\nspark-shell\n...\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\\ \\/ _ \\/ _ `/ __/ '_/\n /___/ .__/\\_,_/_/ /_/\\_\\ version 3.3.2\n /_/\n \nUsing Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 19.0.2)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala> \nIf you get the scala> prompt, then you’ve successfully installed Spark on your laptop!\n\n\nInstall PySpark\nUse conda to install the PySpark python package. As usual, it’s advisable to do this in a new virtual environment.\n$ conda install pyspark\nYou should be able to launch an interactive PySpark REPL by saying pyspark.\n$ pyspark\n...\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\\ \\/ _ \\/ _ `/ __/ '_/\n /__ / .__/\\_,_/_/ /_/\\_\\ version 3.1.2\n /_/\n\nUsing Python version 3.8.3 (default, Jul 2 2020 11:26:31)\nSpark context Web UI available at http://192.168.100.47:4041\nSpark context available as 'sc' (master = local[*], app id = local-1624127229929).\nSparkSession available as 'spark'.\n>>> \nThis time we get a familiar python >>> prompt. This is an interactive shell where we can easily experiment with PySpark. Feel free to run the example code in this post here in the PySpark shell, or, if you prefer a notebook, read on and we’ll get set up to run PySpark in a jupyter notebook.\n\n\n\n\n\n\nNote\n\n\n\nWhen I tried following this setup on a new Mac, I hit an error about being unable to find the Java Runtime. This stack overflow question lead me to the fix.\n\n\n\n\nThe Spark Session Object\nYou may have noticed that when we launched that PySpark interactive shell, it told us that something called SparkSession was available as 'spark'. So basically, what’s happening here is that when we launch the pyspark shell, it instantiates an object called spark which is an instance of class pyspark.sql.session.SparkSession. The spark session object is going to be our entry point for all kinds of PySpark functionality, i.e., we’re going to be saying things like spark.this() and spark.that() to make stuff happen.\nThe PySpark interactive shell is kind enough to instantiate one of these spark session objects for us automatically. 
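For example, right at that >>> prompt you can sanity check the session by making a tiny dataframe; this snippet is just my own quick illustration and is not part of the original post.\n\n# my own tiny sanity check; the example data and column names are made up for illustration\ndf = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'letter'])\ndf.show()\n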
However, when we’re using another interface to PySpark (like say a jupyter notebook running a python kernal), we’ll have to make a spark session object for ourselves.\n\n\nCreate a PySpark Session in a Jupyter Notebook\nThere are a few ways to run PySpark in jupyter which you can read about here.\nFor derping around with PySpark on your laptop, I think the best way is to instantiate a spark session from a jupyter notebook running on a regular python kernel. The method we’ll use involves running a standard jupyter notebook session with a python kernal and using the findspark package to initialize the spark session. So, first install the findspark package.\nconda install -c conda-forge findspark\nLaunch jupyter as usual.\njupyter notebook\nGo ahead and fire up a new notebook using a regular python 3 kernal. Once you land inside the notebook, there are a couple things we need to do to get a spark session instantiated. You can think of this as boilerplate code that we need to run in the first cell of a notebook where we’re going to use PySpark.\n\nimport pyspark\nimport findspark\nfrom pyspark.sql import SparkSession\n\nfindspark.init()\nspark = SparkSession.builder.appName('My Spark App').getOrCreate()\n\nFirst we’re running findspark’s init() method to find our Spark installation. If you run into errors here, make sure you got the SPARK_HOME environment variable correctly set in the install instructions above. Then we instantiate a spark session as spark. Once you run this, you’re ready to rock and roll with PySpark in your jupyter notebook.\n\n\n\n\n\n\nNote\n\n\n\nSpark provides a handy web UI that you can use for monitoring and debugging. Once you instantiate the spark session You can open the UI in your web browser at http://localhost:4040/jobs/." + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#implementation", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#implementation", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "Implementation", + "text": "Implementation\nI did some (half-assed) searching on the interweb for an implementation of GBM that allows the user to provide a custom loss function, and you know what? I couldn’t find anything. If you find another implementation, post in the comments so we can learn from it too.\nSince we need to modify the values predicted by our decision trees’ terminal nodes, we’ll want to brush up on the scikit-learn decision tree structure before we get going. 
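If those internals are unfamiliar, here is a minimal sketch (made-up toy data and illustrative names, not taken from the post) of the three pieces the implementation below relies on: tree_.children_left for finding the leaf nodes, apply() for routing rows to leaves, and tree_.value for overwriting a leaf's predicted value.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data: a noisy ramp that flattens out at 5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, X[:, 0], 5.0) + rng.normal(0, 0.3, size=200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# leaves are the nodes with no children; -1 marks "no child"
leaf_ids = np.nonzero(tree.tree_.children_left == -1)[0]

# apply() reports which leaf each row of X lands in
leaf_for_each_row = tree.apply(X)

# tree_.value holds each node's predicted value; overwriting a leaf's entry
# changes what predict() returns for every row routed to that leaf
some_leaf = leaf_ids[0]
tree.tree_.value[some_leaf, 0, 0] = 42.0
print(tree.predict(X[leaf_for_each_row == some_leaf][:3]))  # all 42.0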
You can see explanations of all the necessary decision tree hacks in this notebook.\n\nimport numpy as np\nfrom sklearn.tree import DecisionTreeRegressor \nfrom scipy.optimize import minimize\n\nclass GradientBoostingMachine():\n '''Gradient Boosting Machine supporting any user-supplied loss function.\n \n Parameters\n ----------\n n_trees : int\n number of boosting rounds\n \n learning_rate : float\n learning rate hyperparameter\n \n max_depth : int\n maximum tree depth\n '''\n \n def __init__(self, n_trees, learning_rate=0.1, max_depth=1):\n self.n_trees=n_trees; \n self.learning_rate=learning_rate\n self.max_depth=max_depth;\n \n def fit(self, X, y, objective):\n '''Fit the GBM using the specified loss function.\n \n Parameters\n ----------\n X : ndarray of size (number observations, number features)\n design matrix\n \n y : ndarray of size (number observations,)\n target values\n \n objective : loss function class instance\n Class specifying the loss function for training.\n Should implement two methods:\n loss(labels: ndarray, predictions: ndarray) -> float\n negative_gradient(labels: ndarray, predictions: ndarray) -> ndarray\n '''\n \n self.trees = []\n self.base_prediction = self._get_optimal_base_value(y, objective.loss)\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n for _ in range(self.n_trees):\n pseudo_residuals = objective.negative_gradient(y, current_predictions)\n tree = DecisionTreeRegressor(max_depth=self.max_depth)\n tree.fit(X, pseudo_residuals)\n self._update_terminal_nodes(tree, X, y, current_predictions, objective.loss)\n current_predictions += self.learning_rate * tree.predict(X)\n self.trees.append(tree)\n \n def _get_optimal_base_value(self, y, loss):\n '''Find the optimal initial prediction for the base model.'''\n fun = lambda c: loss(y, c)\n c0 = y.mean()\n return minimize(fun=fun, x0=c0).x[0]\n \n def _update_terminal_nodes(self, tree, X, y, current_predictions, loss):\n '''Update the tree's predictions according to the loss function.'''\n # terminal node id's\n leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]\n # compute leaf for each sample in ``X``.\n leaf_node_for_each_sample = tree.apply(X)\n for leaf in leaf_nodes:\n samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]\n y_in_leaf = y.take(samples_in_this_leaf, axis=0)\n preds_in_leaf = current_predictions.take(samples_in_this_leaf, axis=0)\n val = self._get_optimal_leaf_value(y_in_leaf, \n preds_in_leaf,\n loss)\n tree.tree_.value[leaf, 0, 0] = val\n \n def _get_optimal_leaf_value(self, y, current_predictions, loss):\n '''Find the optimal prediction value for a given leaf.'''\n fun = lambda c: loss(y, current_predictions + c)\n c0 = y.mean()\n return minimize(fun=fun, x0=c0).x[0]\n \n def predict(self, X):\n '''Generate predictions for the given input data.'''\n return (self.base_prediction \n + self.learning_rate \n * np.sum([tree.predict(X) for tree in self.trees], axis=0))\n\nIn terms of design, we implement a class for the GBM with scikit-like fit and predict methods. Notice in the below implementation that the fit method is only 10 lines long, and corresponds very closely to Friedman’s gradient boost algorithm from above. Most of the complexity comes from the helper methods for updating the leaf values according to the specified loss function.\nWhen the user wants to call the fit method, they’ll need to supply the loss function they want to use for boosting. We’ll make the user implement their loss (a.k.a. 
objective) function as a class with two methods: (1) a loss method taking the labels and the predictions and returning the loss score and (2) a negative_gradient method taking the labels and the predictions and returning an array of negative gradients." }, { - "objectID": "posts/hello-pyspark/index.html#pyspark-concepts", - "href": "posts/hello-pyspark/index.html#pyspark-concepts", - "title": "Hello PySpark!", - "section": "PySpark Concepts", - "text": "PySpark Concepts\nPySpark provides two main abstractions for data: the RDD and the dataframe. RDD’s are just a distributed list of objects; we won’t go into details about them in this post. For us, the key object in PySpark is the dataframe.\nWhile PySpark dataframes expose much of the functionality you would expect from a library for tabular data manipulation, they behave a little differently from pandas dataframes, both syntactically and under-the-hood. There are a couple of key concepts that will help explain these idiosyncracies.\nImmutability - Pyspark RDD’s and dataframes are immutable. This means that if you change an object, e.g. by adding a column to a dataframe, PySpark returns a reference to a new dataframe; it does not modify the existing dataframe. This is kind of nice, because we don’t have to worry about that whole view versus copy nonsense that happens in pandas.\nLazy Evaluation - Lazy evaluation means that when we start manipulating a dataframe, PySpark won’t actually perform any of the computations until we explicitly ask for the result. This is nice because it potentially allows PySpark to do fancy optimizations before executing a sequence of operations. It’s also confusing at first, because PySpark will seem to blaze through complex operations and then take forever to print a few rows of the dataframe." + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#testing-our-model", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#testing-our-model", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "Testing our Model", + "text": "Testing our Model\nLet’s test drive our custom-loss-ready GBM with a few different loss functions! We’ll compare it to the scikit-learn GBM to sanity check our implementation.\n\nfrom sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier\n\nrng = np.random.default_rng()\n\n# test data\ndef make_test_data(n, noise_scale):\n x = np.linspace(0, 10, 500).reshape(-1,1)\n y = (np.where(x < 5, x, 5) + rng.normal(0, noise_scale, size=x.shape)).ravel()\n return x, y\n \n# print model loss scores\ndef print_model_loss_scores(obj, y, preds, sk_preds):\n print(f'From Scratch Loss = {obj.loss(y, pred):0.4}')\n print(f'Scikit-Learn Loss = {obj.loss(y, sk_pred):0.4}')\n\n\nMean Squared Error\nMean Squared Error (a.k.a. Least Squares) loss produces estimates of the mean target value conditioned on the feature values. 
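To see why, consider choosing a single constant prediction \\(c\\) for a set of examples (say, the examples in one leaf). This is just the standard calculus argument: setting the derivative of the squared error to zero,
\\[ \\frac{d}{dc} \\sum_{i=1}^n (y_i - c)^2 = -2 \\sum_{i=1}^n (y_i - c) = 0 \\quad \\Rightarrow \\quad c = \\frac{1}{n} \\sum_{i=1}^n y_i, \\]
so the optimal constant is the mean. The same argument gives the median for absolute error and the \\(\\alpha\\)-quantile for quantile loss, which is exactly the behavior we'll see in the next two sections.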
Here’s the implementation.\n\nx, y = make_test_data(500, 0.4)\n\n\n# from scratch GBM\nclass SquaredErrorLoss():\n '''User-Defined Squared Error Loss'''\n \n def loss(self, y, preds):\n return np.mean((y - preds)**2)\n \n def negative_gradient(self, y, preds):\n return y - preds\n \n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, SquaredErrorLoss())\npred = gbm.predict(x)\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='squared_error')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(SquaredErrorLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.168\nScikit-Learn Loss = 0.168\n\n\n\n\n\n\n\n\n\nMean Absolute Error\nMean Absolute Error (a.k.a.Least Absolute Deviations) loss produces estimates of the median target value conditioned on the feature values. Here’s the implementation.\n\nx, y = make_test_data(500, 0.4)\n\n\n\n# from scratch GBM\nclass AbsoluteErrorLoss():\n '''User-Defined Absolute Error Loss'''\n \n def loss(self, y, preds):\n return np.mean(np.abs(y - preds))\n \n def negative_gradient(self, y, preds):\n return np.sign(y - preds)\n\n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, AbsoluteErrorLoss())\npred = gbm.predict(x)\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='absolute_error')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(AbsoluteErrorLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.3225\nScikit-Learn Loss = 0.3208\n\n\n\n\n\n\n\n\n\nQuantile Loss\nQuantile loss yields estimates of a given quantile of the target variable conditioned on the features. Here’s my implementation.\n\nx, y = make_test_data(500, 1)\n\n\n\n# from scratch GBM\nclass QuantileLoss():\n '''Quantile Loss\n \n Parameters\n ----------\n alpha : float\n quantile to be estimated, 0 < alpha < 1\n '''\n \n def __init__(self, alpha):\n if alpha < 0 or alpha >1:\n raise ValueError('alpha must be between 0 and 1')\n self.alpha = alpha\n \n def loss(self, y, preds):\n e = y - preds\n return np.mean(np.where(e > 0, self.alpha * e, (self.alpha - 1) * e))\n \n def negative_gradient(self, y, preds):\n e = y - preds \n return np.where(e > 0, self.alpha, self.alpha - 1)\n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, QuantileLoss(alpha=0.9))\npred = gbm.predict(x) \n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='quantile', alpha=0.9)\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(QuantileLoss(alpha=0.9), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.1853\nScikit-Learn Loss = 0.1856\n\n\n\n\n\n\n\n\n\nBinary Cross Entropy Loss\nThe previous losses are useful for regression problems, where the target is numeric. But we can also solve classification problems, simply by swapping in an appropriate loss function. Here we’ll implement binary cross entropy, a.k.a. binary deviance, a.k.a. negative binomial log likelihood (sometimes abusively called log loss). One thing to remember is that, as with logistic regression, our model is actually predicting the log odds ratio, not the probability of the positive class. 
Thus we use expit transformations (the inverse of logit) whenever probabilities are needed, e.g., when predicting the probability that an observation belongs to the positive class.\n\n# make categorical test data\n\ndef expit(t):\n return np.exp(t) / (1 + np.exp(t))\n\nx = np.linspace(-3, 3, 500)\np = expit(x)\ny = rng.binomial(1, p, size=p.shape)\nx = x.reshape(-1,1)\n\n\n# from scratch GBM\nclass BinaryCrossEntropyLoss():\n '''Binary Cross Entropy Loss\n \n Note that the predictions should be log odds ratios.\n '''\n \n def __init__(self):\n self.expit = lambda t: np.exp(t) / (1 + np.exp(t))\n \n def loss(self, y, preds):\n p = self.expit(preds)\n return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))\n \n def negative_gradient(self, y, preds):\n p = self.expit(preds)\n return y / p - (1 - y) / (1 - p)\n\n \ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, BinaryCrossEntropyLoss())\npred = expit(gbm.predict(x))\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingClassifier(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='log_loss')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict_proba(x)[:, 1]\n\n\nprint_model_loss_scores(BinaryCrossEntropyLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.6379\nScikit-Learn Loss = 0.6403" }, { - "objectID": "posts/hello-pyspark/index.html#pyspark-dataframe-essentials", - "href": "posts/hello-pyspark/index.html#pyspark-dataframe-essentials", - "title": "Hello PySpark!", - "section": "PySpark Dataframe Essentials", - "text": "PySpark Dataframe Essentials\n\nCreating a PySpark dataframe with createDataFrame()\nThe first thing we’ll need is a way to make dataframes. createDataFrame() allows us to create PySpark dataframes from python objects like nested lists or pandas dataframes. Notice that createDataFrame() is a method of the spark session class, so we’ll call it from our spark session sparkby saying spark.createDataFrame().\n\n# create pyspark dataframe from nested lists\nmy_df = spark.createDataFrame(\n data=[\n [2022, \"tiger\"],\n [2023, \"rabbit\"],\n [2024, \"dragon\"]\n ],\n schema=['year', 'animal']\n)\n\nLet’s read the seaborn tips dataset into a pandas dataframe and then use it to create a PySpark dataframe.\n\nimport pandas as pd\n\n# load tips dataset into a pandas dataframe\npandas_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')\n\n# create pyspark dataframe from a pandas dataframe\npyspark_df = spark.createDataFrame(pandas_df)\n\n\n\n\n\n\n\nNote\n\n\n\nIn real life when we’re running PySpark on a large-scale distributed system, we would not generally want to use python lists or pandas dataframes to load data into PySpark. Ideally we would want to read data directly from where it is stored on HDFS, e.g. by reading parquet files, or by querying directly from a hive database using spark sql.\n\n\n\n\nPeeking at a dataframe’s contents\nThe default print method for the PySpark dataframe will just give you the schema.\n\npyspark_df\n\nDataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]\n\n\nIf we want to peek at some of the data, we’ll need to use the show() method, which is analogous to the pandas head(). 
Remember that show() will cause PySpark to execute any operations that it’s been lazily waiting to evaluate, so sometimes it can take a while to run.\n\n# show the first few rows of the dataframe\npyspark_df.show(5)\n\n+----------+----+------+------+---+------+----+\n|total_bill| tip| sex|smoker|day| time|size|\n+----------+----+------+------+---+------+----+\n| 16.99|1.01|Female| No|Sun|Dinner| 2|\n| 10.34|1.66| Male| No|Sun|Dinner| 3|\n| 21.01| 3.5| Male| No|Sun|Dinner| 3|\n| 23.68|3.31| Male| No|Sun|Dinner| 2|\n| 24.59|3.61|Female| No|Sun|Dinner| 4|\n+----------+----+------+------+---+------+----+\nonly showing top 5 rows\n\n\n\n\n[Stage 0:> (0 + 1) / 1]\n\n \n\n\nWe thus encounter our first rude awakening. PySpark’s default representation of dataframes in the notebook isn’t as pretty as that of pandas. But no one ever said it would be pretty, they just said it would be scalable.\nYou can also use the printSchema() method for a nice vertical representation of the schema.\n\n# show the dataframe schema\npyspark_df.printSchema()\n\nroot\n |-- total_bill: double (nullable = true)\n |-- tip: double (nullable = true)\n |-- sex: string (nullable = true)\n |-- smoker: string (nullable = true)\n |-- day: string (nullable = true)\n |-- time: string (nullable = true)\n |-- size: long (nullable = true)\n\n\n\n\n\nSelect columns by name\nYou can select specific columns from a dataframe using the select() method. You can pass either a list of names, or pass names as arguments.\n\n# select some of the columns\npyspark_df.select('total_bill', 'tip')\n\n# select columns in a list\npyspark_df.select(['day', 'time', 'total_bill'])\n\n\n\nFilter rows based on column values\nAnalogous to the WHERE clause in SQL, and the query() method in pandas, PySpark provides a filter() method which returns only the rows that meet the specified conditions. Its argument is a string specifying the condition to be met for rows to be included in the result. You specify the condition as an expression involving the column names and comparison operators like <, >, <=, >=, == (equal), and ~= (not equal). You can specify compound expressions using and and or, and you can even do a SQL-like in to check if the column value matches any items in a list.\n\n## compare a column to a value\npyspark_df.filter('total_bill > 20')\n\n# compare two columns with arithmetic\npyspark_df.filter('tip > 0.15 * total_bill')\n\n# check equality with a string value\npyspark_df.filter('sex == \"Male\"')\n\n# check equality with any of several possible values\npyspark_df.filter('day in (\"Sat\", \"Sun\")')\n\n# use \"and\" \npyspark_df.filter('day == \"Fri\" and time == \"Lunch\"')\n\nIf you’re into boolean indexing with the brackets, PySpark does support that too, but I encourage you to use filter() instead. Check out my rant about why you shouldn’t use boolean indexing for the details. 
The TLDR is that filter() requires less typing, makes your code more readable and portable, and it allows you to chain method calls together using dot chains.\nHere’s the boolean indexing equivalent of the last example from above.\n\n# using boolean indexing\npyspark_df[(pyspark_df.day == 'Fri') & (pyspark_df.time == 'Lunch')]\n\nI know, it looks horrendous, but not as horrendous as the error message you’ll get if you forget the parentheses.\n\n\nAdd new columns to a dataframe\nYou can add new columns which are functions of the existing columns with the withColumn() method.\n\nimport pyspark.sql.functions as f\n\n# add a new column using col() to reference other columns\npyspark_df.withColumn('tip_percent', f.col('tip') / f.col('total_bill'))\n\nNotice that we’ve imported the pyspark.sql.functions module. This module contains lots of useful functions that we’ll be using all over the place, so it’s probably a good idea to go ahead and import it whenever you’re using PySpark. BTW, it seems like folks usually import this module as f or F. In this example we’re using the col() function, which allows us to refer to columns in our dataframe using string representations of the column names.\nYou could also achieve the same result using the dot to reference the other columns, but this requires us to type the dataframe name over and over again, which makes it harder to reuse this code on different dataframes or in dot chains.\n\n# add a new column using the dot to reference other columns (less recommended)\npyspark_df.withColumn('tip_percent', pyspark_df.tip / pyspark_df.total_bill)\n\nIf you want to apply numerical transformations like exponents or logs, use the built-in functions in the pyspark.sql.functions module.\n\n# log \npyspark_df.withColumn('log_bill', f.log(f.col('total_bill')))\n\n# exponent\npyspark_df.withColumn('bill_squared', f.pow(f.col('total_bill'), 2))\n\nYou can implement conditional assignment like SQL’s CASE WHEN construct using the when() function and the otherwise() method.\n\n# conditional assignment (like CASE WHEN)\npyspark_df.withColumn('is_male', f.when(f.col('sex') == 'Male', True).otherwise(False))\n\n# using multiple when conditions and values\npyspark_df.withColumn('bill_size', \n f.when(f.col('total_bill') < 10, 'small')\n .when(f.col('total_bill') < 20, 'medium')\n .otherwise('large')\n)\n\nRemember that since PySpark dataframes are immutable, calling withColumns() on a dataframe returns a new dataframe. If you want to persist the result, you’ll need to make an assignment.\npyspark_df = pyspark_df.withColumns(...)\n\n\nGroup by and aggregate\nPySpark provides a groupBy() method similar to the pandas groupby(). Just like in pandas, we can call methods like count() and mean() on our grouped dataframe, and we also have a more flexible agg() method that allows us to specify column-aggregation mappings.\n\n\n# group by and count\npyspark_df.groupBy('time').count().show()\n\n+------+-----+\n| time|count|\n+------+-----+\n|Dinner| 176|\n| Lunch| 68|\n+------+-----+\n\n\n\n\n\n# group by and specify column-aggregation mappings with agg()\npyspark_df.groupBy('time').agg({'total_bill': 'mean', 'tip': 'max'}).show()\n\n+------+--------+------------------+\n| time|max(tip)| avg(total_bill)|\n+------+--------+------------------+\n|Dinner| 10.0| 20.79715909090909|\n| Lunch| 6.7|17.168676470588235|\n+------+--------+------------------+\n\n\n\nIf you want to get fancier with your aggregations, it might just be easier to express them using hive syntax. 
Read on to find out how.\n\n\nRun Hive SQL on dataframes\nOne of the mind-blowing features of PySpark is that it allows you to write hive SQL queries on your dataframes. To take a PySpark dataframe into the SQL world, use the createOrReplaceTempView() method. This method takes one string argument which will be the dataframes name in the SQL world. Then you can use spark.sql() to run a query. The result is returned as a PySpark dataframe.\n\n\n# put pyspark dataframe in SQL world and query it\npyspark_df.createOrReplaceTempView('tips')\nspark.sql('select * from tips').show(5)\n\n+----------+----+------+------+---+------+----+\n|total_bill| tip| sex|smoker|day| time|size|\n+----------+----+------+------+---+------+----+\n| 16.99|1.01|Female| No|Sun|Dinner| 2|\n| 10.34|1.66| Male| No|Sun|Dinner| 3|\n| 21.01| 3.5| Male| No|Sun|Dinner| 3|\n| 23.68|3.31| Male| No|Sun|Dinner| 2|\n| 24.59|3.61|Female| No|Sun|Dinner| 4|\n+----------+----+------+------+---+------+----+\nonly showing top 5 rows\n\n\n\nThis is awesome for a couple of reasons. First, it allows us to easily express any transformations in hive syntax. If you’re like me and you’ve already been using hive, this will dramatically reduce the PySpark learning curve, because when in doubt, you can always bump a dataframe into the SQL world and simply use hive to do what you need. Second, if you have a hive deployment, PySpark’s SQL world also has access to all of your hive tables. This means you can write queries involving both hive tables and your PySpark dataframes. It also means you can run hive commands, like inserting into a table, directly from PySpark.\nLet’s do some aggregations that might be a little trickier to do using the PySpark built-in functions.\n\n\n# run hive query and save result to dataframe\ntip_stats_by_time = spark.sql(\"\"\"\n select\n time\n , count(*) as n \n , avg(tip) as avg_tip\n , percentile_approx(tip, 0.5) as med_tip\n , avg(case when tip > 3 then 1 else 0 end) as pct_tip_gt_3\n from \n tips\n group by 1\n\"\"\")\n\ntip_stats_by_time.show()\n\n+------+---+------------------+-------+-------------------+\n| time| n| avg_tip|med_tip| pct_tip_gt_3|\n+------+---+------------------+-------+-------------------+\n|Dinner|176| 3.102670454545455| 3.0|0.44886363636363635|\n| Lunch| 68|2.7280882352941176| 2.2|0.27941176470588236|\n+------+---+------------------+-------+-------------------+" + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#wrapping-up", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#wrapping-up", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "Wrapping Up", + "text": "Wrapping Up\nWoohoo! We did it! We finally made it through Friedman’s paper in its entirety, and we implemented the generic gradient boosting algorithm which works with any differentiable loss function. If you made it this far, great job, gold star! By now you hopefully have a pretty solid grasp on gradient boosting, which is good, because soon we’re going to dive into the modern Newton descent gradient boosting frameworks like XGBoost. Onward!" }, { - "objectID": "posts/hello-pyspark/index.html#visualization-with-pyspark", - "href": "posts/hello-pyspark/index.html#visualization-with-pyspark", - "title": "Hello PySpark!", - "section": "Visualization with PySpark", - "text": "Visualization with PySpark\nThere aren’t any tools for visualization included in PySpark. 
But that’s no problem, because we can just use the toPandas() method on a PySpark dataframe to pull data back into pandas. Once we have a pandas dataframe, we can happily build visualizations as usual. Of course, if your PySpark dataframe is huge, you wouldn’t want to use toPandas() directly, because PySpark will attempt to read the entire contents of its huge dataframe into memory. Instead, it’s best to use PySpark to generate aggregations of your data for plotting or to pull only a sample of your full data into pandas.\n\n# read aggregated pyspark dataframe into pandas for plotting\nplot_pdf = tip_stats_by_time.toPandas()\nplot_pdf.plot.bar(x='time', y=['avg_tip', 'med_tip']);" + "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#references", + "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#references", + "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "section": "References", + "text": "References\nFriedman’s 2001 paper: Greedy Function Approximation: A Gradient Boosting Machine" }, { - "objectID": "posts/hello-pyspark/index.html#wrapping-up", - "href": "posts/hello-pyspark/index.html#wrapping-up", - "title": "Hello PySpark!", + "objectID": "posts/get-down-with-gradient-descent/index.html", + "href": "posts/get-down-with-gradient-descent/index.html", + "title": "Get Down with Gradient Descent", + "section": "", + "text": "Ahh, gradient descent. It’s probably one of the most ubiquitous algorithms used in data science, but you’re unlikely to see it being celebrated in the limelight of the Kaggle podium. Rather than taking center stage, gradient descent operates under the hood, powering the training for a wide range of models including deep neural networks, gradient boosting trees, generalized linear models, and mixed effects models. Getting an intuition for the algorithm will reveal how model fitting actually works and help us to see the common thread connecting a wide range of seemingly unrelated models. In this post we’ll get the intuition for gradient descent with a fresh analogy, develop the mathematical formulation, and ground our understanding by using it to train ourselves a linear regression model." + }, + { + "objectID": "posts/get-down-with-gradient-descent/index.html#intuition", + "href": "posts/get-down-with-gradient-descent/index.html#intuition", + "title": "Get Down with Gradient Descent", + "section": "Intuition", + "text": "Intuition\nBefore we dive into the intuition for gradient descent itself, let’s get a high-level view of why it’s useful in training or fiting a model. Training a model basically means finding the model parameter values that make the model fit a given dataset well. We measure how well a model fits data using a special function variously called a loss or cost or objective function. A loss function takes the dataset and the model as arguments and returns a number that tells us how well our model fits the data. Therefore training is an optimization problem in which we search for the model parameter values that result in the minimum value of the loss function. Enter gradient descent.\nGradient descent is a numerical optimization technique that helps us find the inputs that yield the minimum value of a function. Since most explanations of the gradient descent algorithm seem to use a story about hikers being lost in some foggy mountains, we’re going to try out a new analogy.\nLet’s say you’re at a concert. Remember those? 
They’re these things that used to happen where people played music and everyone danced and had a great time.\n\nNOTE: Chiming in here in 2023 from a sort-of-post COVID 19 world, happily I can report that concerts and live music are back!\n\nNow suppose at this concert there’s a dance floor which has become a bit sweltering from copious amounts of “getting down”. But the temperature isn’t quite uniform; maybe there’s a cool spot from a ceiling fan somewhere.\n\n\n\ndance floor\n\n\nLet’s get ourselves to that cool spot using the following procedure.\n\nFrom our current location, figure out which direction feels coolest.\nTake a step (or simply shimmy) in that direction.\nRepeat steps 1 and 2 until we reach the coolest spot on the dance floor.\n\nThe crux of this procedure is figuring out, at each step, which direction yields the greatest temperature reduction. Our skin is pretty sensitive to temperature, so we can just use awareness of body sensation to sense which direction feels coolest. Luckily, we have a mathematical equivalent to our skin’s ability to sense local variation in temperature.\n\nDetermine which way to go\nLet \\(f(x,y)\\) be the temperature on the dance floor at position \\((x,y)\\). The direction of fastest decrease in temperature is going to be given by some vector in our \\((x,y)\\) space, e.g.,\n[vector component in \\(x\\) direction, vector component in \\(y\\) direction]\nTurns out that the gradient of a function evaluated at a particular location yields a vector that points in the direction of fastest increase in the function, pretty similar to what we’re looking for. The gradient of \\(f(x,y)\\) is given by\n\\[ \\nabla f(x,y) = \\left [ \\frac{\\partial f(x,y)}{\\partial x}, \\frac{\\partial f(x,y)}{\\partial y} \\right ] \\]\nThe components of the gradient vector are the partial derivatives of our function \\(f(x,y)\\), evaluated at the point \\((x,y)\\). These partial derivatives just tell us the slope of \\(f(x,y)\\) in the \\(x\\) and \\(y\\) directions respectively. The intuition is that if \\(\\frac{\\partial f(x,y)}{\\partial x}\\) is a large positive number, then moving in the positive \\(x\\) direction will make \\(f(x,y)\\) increase a lot, whereas if \\(\\frac{\\partial f(x,y)}{\\partial x}\\) is a large negative number, then moving in the negative \\(x\\) direction will make \\(f(x,y)\\) increase a lot.\nIt’s not too hard to see that the direction of fastest decrease is actually just the exact opposite direction from that of fastest increase. Since we can point a vector in the opposite direction by negating its component values, our direction of fastest temperature decrease will be given by the negative gradient of the temperature field \\(-\\nabla f(x,y)\\).\n\n\n\ndance floor with hot and cold sides\n\n\n\n\nTake a step in the right direction\nNow that we have our direction vector, we’re ready to take a step toward the cool part of the dance floor. To do this, we’ll just add our direction vector to our current position. 
The update rule would look like this.\n\\[ [x_\\text{next}, y_\\text{next}] = [x_\\text{prev}, y_\\text{prev}] - \\nabla f (x_\\text{prev}, y_\\text{prev}) = [x_\\text{prev}, y_\\text{prev}] - \\left [ \\frac{\\partial f (x_\\text{prev}, y_\\text{prev})}{\\partial x}, \\frac{\\partial f (x_\\text{prev}, y_\\text{prev})}{\\partial y} \\right ] \\]\nIf we iteratively apply this update rule, we’ll end up tracing a trajectory through the \\((x,y)\\) space on the dance floor and we’ll eventually end up at the coolest spot!\n\n\n\ndance floor with trajectory from hot side to cool side\n\n\nGreat success!" + }, + { + "objectID": "posts/get-down-with-gradient-descent/index.html#general-formulation", + "href": "posts/get-down-with-gradient-descent/index.html#general-formulation", + "title": "Get Down with Gradient Descent", + "section": "General Formulation", + "text": "General Formulation\nLet’s generalize a bit to get to the form of gradient descent you’ll see in references like the wikipedia article.\nFirst we modify our update equation above to handle functions with more than two arguments. We’ll use a bold \\(\\mathbf{x}\\) to indicate a vector of inputs \\(\\mathbf{x} = [x_1,x_2,\\dots,x_p]\\). Our function \\(f(\\mathbf{x}): \\mathbb{R}^p \\mapsto \\mathbb{R}\\) maps a \\(p\\) dimensional input to a scalar output.\nSecond, instead of displacing our current location with the negative gradient vector itself, we’ll first rescale it with a learning rate parameter. This helps address any issues with units on inputs versus outputs. Imagine the input could range between 0 and 1, but the output ranged from 0 to 1,000. We would need to rescale the partial derivatives so the update step doesn’t send us way too far off in input space.\nFinally, we’ll index our updates with \\(t=0,1,\\dots\\). We’ll run for some prespecified number of iterations or we’ll stop the procedure once the change in \\(f(\\mathbf{x})\\) is sufficiently small from one iteration to the next. Our update equation will look like this.\n\\[\\mathbf{x}_{t+1} = \\mathbf{x}_t - \\eta \\nabla f ( \\mathbf{x}_t) \\]\nIn pseudocode we could write it like this.\n# gradient descent\nx = initial_value_of_x \nfor t in range(n_iterations): # or some other convergence condition\n x -= learning_rate * gradient_of_f(x)\nNow let’s see how this algorithm gets used to train models." + }, + { + "objectID": "posts/get-down-with-gradient-descent/index.html#training-a-linear-regression-model-with-gradient-descent", + "href": "posts/get-down-with-gradient-descent/index.html#training-a-linear-regression-model-with-gradient-descent", + "title": "Get Down with Gradient Descent", + "section": "Training a Linear Regression Model with Gradient Descent", + "text": "Training a Linear Regression Model with Gradient Descent\nTo get the intuition for how we use gradient descent to train models, let’s use it to train a linear regression model. Note that we wouldn’t actually use gradient descent to train a linear model in real life since there is an exact analytical solution for the best-fit parameter values.\nAnyway, in the simple linear regression problem we have numerical feature \\(x\\) and numerical target \\(y\\), and we want to find a model of the form\n\\[F(x) = \\alpha + \\beta x\\]\nThis model has two parameters, \\(\\alpha\\) and \\(\\beta\\). Here “training” means finding the parameter values that make \\(F(x)\\) fit our \\(y\\) data best. 
We measure how well, or really how poorly, our model fits the data by using a loss function that yields a small value when a model fits well. Ordinary least squares is so named because it uses mean squared error as its loss function.\n\\[L(y, F(x)) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - F(x_i))^2 = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - (\\alpha + \\beta x_i))^2 \\]\nThe loss function \\(L\\) takes four arguments: \\(x\\), \\(y\\), \\(\\alpha\\), and \\(\\beta\\). But since \\(x\\) and \\(y\\) are fixed given our dataset, we could write the loss as \\(L(\\alpha, \\beta | x, y)\\) to emphasize that \\(\\alpha\\) and \\(\\beta\\) are the only free parameters. So we’re looking for the following.\n\\[\\underset{\\alpha,\\beta}{\\operatorname{argmin}} ~ L(\\alpha,\\beta|x,y) \\]\nThat’s right, we’re looking for the values of \\(\\alpha\\) and \\(\\beta\\) that minimize scalar-valued function \\(L(\\alpha, \\beta)\\). Sounds familiar huh?\nTo solve this minimization problem with gradient descent, we can use the following update rule.\n\\[[\\alpha_{t+1}, \\beta_{t+1}] = [\\alpha_{t}, \\beta_{t}] - \\eta \\nabla L(\\alpha_t, \\beta_t | x, y) \\]\nTo get the gradient \\(\\nabla L(\\alpha,\\beta|x,y)\\), we need the partial derivatives of \\(L\\) with respect to \\(\\alpha\\) and \\(\\beta\\). Since \\(L\\) is just a big sum, it’s easy to calculate the derivatives.\n\\[ \\frac{\\partial L(\\alpha, \\beta)}{\\partial \\alpha} = \\frac{1}{n} \\sum_{i=1}^{n} -2 (y_i - (\\alpha + \\beta x_i)) \\] \\[ \\frac{\\partial L(\\alpha, \\beta)}{\\partial \\beta} = \\frac{1}{n} \\sum_{i=1}^{n} -2x_i (y_i - (\\alpha + \\beta x_i)) \\]\nGreat! We’ve got everything we need to implement gradient descent to train an ordinary least squares model. Everything except data that is.\n\nToy Data\nLet’s make a friendly little linear dataset where \\(\\alpha=-10\\) and \\(\\beta=2\\), i.e.\n\\[ y = -10 + 2x + \\text{noise}\\]\n\nimport numpy as np \n\nalpha_true = -10\nbeta_true = 2\n\nrng = np.random.default_rng(42)\nx = np.linspace(0, 10, 50)\ny = alpha_true + beta_true*x + rng.normal(0, 1, size=x.shape)\n\n\n\n\n\n\n\n\nImplementation\nOur implementation will use a function to compute the gradient of the loss function. Since we have two parameters, we’ll use length-2 arrays to hold their values and their partial derivatives. At each iteration, we update the parameter values by subtracting the rescaled partial derivatives.\n\n\n# linear regression using gradient descent \n\ndef gradient_of_loss(parameters, x, y):\n alpha = parameters[0]\n beta = parameters[1]\n partial_alpha = np.mean(-2*(y - (alpha + beta*x)))\n partial_beta = np.mean(-2*x*(y - (alpha + beta*x)))\n return np.array([partial_alpha, partial_beta])\n\nlearning_rate = 0.02\nparameters = np.array([0.0, 0.0]) # initial values of alpha and beta\n\nfor _ in range(500):\n partial_derivatives = gradient_of_loss(parameters, x, y)\n parameters -= learning_rate * partial_derivatives\n \nparameters\n\narray([-10.07049616, 2.03559051])\n\n\nWe can see the loss function decreasing throughout the 500 iterations.\n\n\n\n\n\nAnd we can visualize the loss function as a contour plot over \\((\\alpha,\\beta)\\) space. The blue points show the trajectory our gradient descent followed as it shimmied from the initial position to the coolest spot in \\((\\alpha, \\beta)\\) space where the loss function is nice and small.\n\n\n\n\n\nOur gradient descent settles in a spot pretty close to \\((-10, 2)\\) in \\((\\alpha,\\beta)\\) space, which gives us the final fitted model below." 
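Since, as noted above, ordinary least squares has an exact analytical solution, a quick sanity check is to solve the same toy problem in closed form and compare. This sketch just reuses the toy data recipe from above; the design-matrix name and the use of np.linalg.lstsq are my own choices for illustration, not the post's.

import numpy as np

# same toy data recipe as above: y = -10 + 2x + noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = -10 + 2 * x + rng.normal(0, 1, size=x.shape)

# closed-form least squares: add an intercept column and solve in one shot
X_design = np.column_stack([np.ones_like(x), x])
alpha_hat, beta_hat = np.linalg.lstsq(X_design, y, rcond=None)[0]
print(alpha_hat, beta_hat)  # lands essentially on top of the gradient descent estimates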
+ }, + { + "objectID": "posts/get-down-with-gradient-descent/index.html#wrapping-up", + "href": "posts/get-down-with-gradient-descent/index.html#wrapping-up", + "title": "Get Down with Gradient Descent", "section": "Wrapping Up", - "text": "Wrapping Up\nSo that’s a wrap on our crash course in working with PySpark. You now have a good idea of what pyspark is and how to get started manipulating dataframes with it. Stay tuned for a future post on PySpark’s companion ML library MLlib. In the meantime, may no dataframe be too large for you ever again." + "text": "Wrapping Up\nThere you have it, gradient descent explained with a fresh new analogy having nothing whatsoever to do with foggy mountains, plus an implemented example fitting a linear model. While we often see gradient descent used to train models by performing an optimization in parameter space, as in generalized linear models and neural networks, there are other ways to use this powerful technique to train models. In particular, we’ll soon see how our beloved gradient boosting tree models use gradient descent in prediction space, rather than parameter space. Stay tuned for that mind bender in a future post." }, { "objectID": "posts/8020-pandas-tutorial/index.html", @@ -350,123 +378,144 @@ "text": "Wrapping Up\nThere you have it, how to pull off the five most essential data transformation tasks using pandas in a style reminiscent of my beloved dplyr. Remember that part of the beauty of pandas is that since there are so many ways to do most tasks, you can develop your own style based on the kind of data you work with, what you like about other tools, how you see others using the tools, and of course your own taste and preferences.\nIf you found this post helpful or if you have your own preferred style for accomplishing any of these key transformations with pandas, do let me know about it in the comments." }, { - "objectID": "posts/get-down-with-gradient-descent/index.html", - "href": "posts/get-down-with-gradient-descent/index.html", - "title": "Get Down with Gradient Descent", + "objectID": "posts/hello-pyspark/index.html", + "href": "posts/hello-pyspark/index.html", + "title": "Hello PySpark!", "section": "", - "text": "Ahh, gradient descent. It’s probably one of the most ubiquitous algorithms used in data science, but you’re unlikely to see it being celebrated in the limelight of the Kaggle podium. Rather than taking center stage, gradient descent operates under the hood, powering the training for a wide range of models including deep neural networks, gradient boosting trees, generalized linear models, and mixed effects models. Getting an intuition for the algorithm will reveal how model fitting actually works and help us to see the common thread connecting a wide range of seemingly unrelated models. In this post we’ll get the intuition for gradient descent with a fresh analogy, develop the mathematical formulation, and ground our understanding by using it to train ourselves a linear regression model." + "text": "A big day at Playa Guiones\nWell, you guessed it: it’s time for us to learn PySpark!\nI know, I know, I can hear you screaming into your pillow. Indeed we just spent all that time converting from R and learning python and why the hell do we need yet another API for working with dataframes?\nThat’s a totally fair question.\nSo what happens when we’re working on something in the real world, where datasets get large in a hurry, and we suddenly have a dataframe that no longer fits into memory? 
We need a way for our computations and datasets to scale across multiple nodes in a distributed system without having to get too fussy about all the distributed compute details.\nEnter PySpark.\nI think it’s fair to think of PySpark as a python package for working with arbitrarily large dataframes, i.e., it’s like pandas but scalable. It’s built on top of Apache Spark, a unified analytics engine for large-scale data processing. PySpark is essentially a way to access the functionality of spark via python code. While there are other high-level interfaces to Spark (such as Java, Scala, and R), for data scientists who are already working extensively with python, PySpark will be the natural interface of choice. PySpark also has great integration with SQL, and it has a companion machine learning library called MLlib that’s more or less a scalable scikit-learn (maybe we can cover it in a future post).\nSo, here’s the plan. First we’re going to get set up to run PySpark locally in a jupyter notebook on our laptop. This is my preferred environment for interactively playing with PySpark and learning the ropes. Then we’re going to get up and running in PySpark as quickly as possible by reviewing the most essential functionality for working with dataframes and comparing it to how we would do things in pandas. Once we’re comfortable running PySpark on the laptop, it’s going to be much easier to jump onto a distributed cluster and run PySpark at scale.\nLet’s do this." }, { - "objectID": "posts/get-down-with-gradient-descent/index.html#intuition", - "href": "posts/get-down-with-gradient-descent/index.html#intuition", - "title": "Get Down with Gradient Descent", - "section": "Intuition", - "text": "Intuition\nBefore we dive into the intuition for gradient descent itself, let’s get a high-level view of why it’s useful in training or fiting a model. Training a model basically means finding the model parameter values that make the model fit a given dataset well. We measure how well a model fits data using a special function variously called a loss or cost or objective function. A loss function takes the dataset and the model as arguments and returns a number that tells us how well our model fits the data. Therefore training is an optimization problem in which we search for the model parameter values that result in the minimum value of the loss function. Enter gradient descent.\nGradient descent is a numerical optimization technique that helps us find the inputs that yield the minimum value of a function. Since most explanations of the gradient descent algorithm seem to use a story about hikers being lost in some foggy mountains, we’re going to try out a new analogy.\nLet’s say you’re at a concert. Remember those? They’re these things that used to happen where people played music and everyone danced and had a great time.\n\nNOTE: Chiming in here in 2023 from a sort-of-post COVID 19 world, happily I can report that concerts and live music are back!\n\nNow suppose at this concert there’s a dance floor which has become a bit sweltering from copious amounts of “getting down”. 
But the temperature isn’t quite uniform; maybe there’s a cool spot from a ceiling fan somewhere.\n\n\n\ndance floor\n\n\nLet’s get ourselves to that cool spot using the following procedure.\n\nFrom our current location, figure out which direction feels coolest.\nTake a step (or simply shimmy) in that direction.\nRepeat steps 1 and 2 until we reach the coolest spot on the dance floor.\n\nThe crux of this procedure is figuring out, at each step, which direction yields the greatest temperature reduction. Our skin is pretty sensitive to temperature, so we can just use awareness of body sensation to sense which direction feels coolest. Luckily, we have a mathematical equivalent to our skin’s ability to sense local variation in temperature.\n\nDetermine which way to go\nLet \\(f(x,y)\\) be the temperature on the dance floor at position \\((x,y)\\). The direction of fastest decrease in temperature is going to be given by some vector in our \\((x,y)\\) space, e.g.,\n[vector component in \\(x\\) direction, vector component in \\(y\\) direction]\nTurns out that the gradient of a function evaluated at a particular location yields a vector that points in the direction of fastest increase in the function, pretty similar to what we’re looking for. The gradient of \\(f(x,y)\\) is given by\n\\[ \\nabla f(x,y) = \\left [ \\frac{\\partial f(x,y)}{\\partial x}, \\frac{\\partial f(x,y)}{\\partial y} \\right ] \\]\nThe components of the gradient vector are the partial derivatives of our function \\(f(x,y)\\), evaluated at the point \\((x,y)\\). These partial derivatives just tell us the slope of \\(f(x,y)\\) in the \\(x\\) and \\(y\\) directions respectively. The intuition is that if \\(\\frac{\\partial f(x,y)}{\\partial x}\\) is a large positive number, then moving in the positive \\(x\\) direction will make \\(f(x,y)\\) increase a lot, whereas if \\(\\frac{\\partial f(x,y)}{\\partial x}\\) is a large negative number, then moving in the negative \\(x\\) direction will make \\(f(x,y)\\) increase a lot.\nIt’s not too hard to see that the direction of fastest decrease is actually just the exact opposite direction from that of fastest increase. Since we can point a vector in the opposite direction by negating its component values, our direction of fastest temperature decrease will be given by the negative gradient of the temperature field \\(-\\nabla f(x,y)\\).\n\n\n\ndance floor with hot and cold sides\n\n\n\n\nTake a step in the right direction\nNow that we have our direction vector, we’re ready to take a step toward the cool part of the dance floor. To do this, we’ll just add our direction vector to our current position. The update rule would look like this.\n\\[ [x_\\text{next}, y_\\text{next}] = [x_\\text{prev}, y_\\text{prev}] - \\nabla f (x_\\text{prev}, y_\\text{prev}) = [x_\\text{prev}, y_\\text{prev}] - \\left [ \\frac{\\partial f (x_\\text{prev}, y_\\text{prev})}{\\partial x}, \\frac{\\partial f (x_\\text{prev}, y_\\text{prev})}{\\partial y} \\right ] \\]\nIf we iteratively apply this update rule, we’ll end up tracing a trajectory through the \\((x,y)\\) space on the dance floor and we’ll eventually end up at the coolest spot!\n\n\n\ndance floor with trajectory from hot side to cool side\n\n\nGreat success!" 
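To make the shimmy concrete, here is a tiny numerical sketch on a made-up bowl-shaped temperature field whose cool spot sits at (3, 1); the field, the starting point, and the small step size (introduced as the learning rate in the next section) are all just illustrative choices, not something from the post.

import numpy as np

# a made-up "temperature" field, coolest at (3, 1)
def gradient(pos):
    x, y = pos
    return np.array([2 * (x - 3), 4 * (y - 1)])  # gradient of (x-3)^2 + 2*(y-1)^2

pos = np.array([9.0, 8.0])   # start out on the hot side of the dance floor
step_size = 0.1              # keep the shimmies small so we don't overshoot
for _ in range(100):
    pos -= step_size * gradient(pos)  # step opposite the gradient, toward cooler air

print(pos)  # ~ [3., 1.]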
+ "objectID": "posts/hello-pyspark/index.html#how-to-run-pyspark-in-a-jupyter-notebook-on-your-laptop", + "href": "posts/hello-pyspark/index.html#how-to-run-pyspark-in-a-jupyter-notebook-on-your-laptop", + "title": "Hello PySpark!", + "section": "How to Run PySpark in a Jupyter Notebook on Your Laptop", + "text": "How to Run PySpark in a Jupyter Notebook on Your Laptop\nOk, I’m going to walk us through how to get things installed on a Mac or Linux machine where we’re using homebrew and conda to manage virtual environments. If you have a different setup, your favorite search engine will help you get PySpark set up locally.\n\n\n\n\n\n\nNote\n\n\n\nIt’s possible for Homebrew and Anaconda to interfere with one another. The simple rule of thumb is that whenever you want to use the brew command, first deactivate your conda environment by running conda deactivate. See this Stack Overflow question for more details.\n\n\n\nInstall Spark\nInstall Spark with homebrew.\nbrew install apache-spark\nNext we need to set up a SPARK_HOME environment variable in the shell. Check where Spark is installed.\nbrew info apache-spark\nYou should see something like\n==> apache-spark: stable 3.3.2 (bottled), HEAD\nEngine for large-scale data processing\nhttps://spark.apache.org/\n/opt/homebrew/Cellar/apache-spark/3.3.2 (1,453 files, 320.9MB) *\n...\nSet the SPARK_HOME environment variable to your spark installation path with /libexec appended to the end. To do this I added the following line to my .zshrc file.\nexport SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.3.2/libexec\nRestart your shell, and test the installation by starting the Spark shell.\nspark-shell\n...\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\\ \\/ _ \\/ _ `/ __/ '_/\n /___/ .__/\\_,_/_/ /_/\\_\\ version 3.3.2\n /_/\n \nUsing Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 19.0.2)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala> \nIf you get the scala> prompt, then you’ve successfully installed Spark on your laptop!\n\n\nInstall PySpark\nUse conda to install the PySpark python package. As usual, it’s advisable to do this in a new virtual environment.\n$ conda install pyspark\nYou should be able to launch an interactive PySpark REPL by saying pyspark.\n$ pyspark\n...\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\\ \\/ _ \\/ _ `/ __/ '_/\n /__ / .__/\\_,_/_/ /_/\\_\\ version 3.1.2\n /_/\n\nUsing Python version 3.8.3 (default, Jul 2 2020 11:26:31)\nSpark context Web UI available at http://192.168.100.47:4041\nSpark context available as 'sc' (master = local[*], app id = local-1624127229929).\nSparkSession available as 'spark'.\n>>> \nThis time we get a familiar python >>> prompt. This is an interactive shell where we can easily experiment with PySpark. Feel free to run the example code in this post here in the PySpark shell, or, if you prefer a notebook, read on and we’ll get set up to run PySpark in a jupyter notebook.\n\n\n\n\n\n\nNote\n\n\n\nWhen I tried following this setup on a new Mac, I hit an error about being unable to find the Java Runtime. This stack overflow question lead me to the fix.\n\n\n\n\nThe Spark Session Object\nYou may have noticed that when we launched that PySpark interactive shell, it told us that something called SparkSession was available as 'spark'. So basically, what’s happening here is that when we launch the pyspark shell, it instantiates an object called spark which is an instance of class pyspark.sql.session.SparkSession. 
The spark session object is going to be our entry point for all kinds of PySpark functionality, i.e., we’re going to be saying things like spark.this() and spark.that() to make stuff happen.\nThe PySpark interactive shell is kind enough to instantiate one of these spark session objects for us automatically. However, when we’re using another interface to PySpark (like say a jupyter notebook running a python kernal), we’ll have to make a spark session object for ourselves.\n\n\nCreate a PySpark Session in a Jupyter Notebook\nThere are a few ways to run PySpark in jupyter which you can read about here.\nFor derping around with PySpark on your laptop, I think the best way is to instantiate a spark session from a jupyter notebook running on a regular python kernel. The method we’ll use involves running a standard jupyter notebook session with a python kernal and using the findspark package to initialize the spark session. So, first install the findspark package.\nconda install -c conda-forge findspark\nLaunch jupyter as usual.\njupyter notebook\nGo ahead and fire up a new notebook using a regular python 3 kernal. Once you land inside the notebook, there are a couple things we need to do to get a spark session instantiated. You can think of this as boilerplate code that we need to run in the first cell of a notebook where we’re going to use PySpark.\n\nimport pyspark\nimport findspark\nfrom pyspark.sql import SparkSession\n\nfindspark.init()\nspark = SparkSession.builder.appName('My Spark App').getOrCreate()\n\nFirst we’re running findspark’s init() method to find our Spark installation. If you run into errors here, make sure you got the SPARK_HOME environment variable correctly set in the install instructions above. Then we instantiate a spark session as spark. Once you run this, you’re ready to rock and roll with PySpark in your jupyter notebook.\n\n\n\n\n\n\nNote\n\n\n\nSpark provides a handy web UI that you can use for monitoring and debugging. Once you instantiate the spark session You can open the UI in your web browser at http://localhost:4040/jobs/." }, { - "objectID": "posts/get-down-with-gradient-descent/index.html#general-formulation", - "href": "posts/get-down-with-gradient-descent/index.html#general-formulation", - "title": "Get Down with Gradient Descent", - "section": "General Formulation", - "text": "General Formulation\nLet’s generalize a bit to get to the form of gradient descent you’ll see in references like the wikipedia article.\nFirst we modify our update equation above to handle functions with more than two arguments. We’ll use a bold \\(\\mathbf{x}\\) to indicate a vector of inputs \\(\\mathbf{x} = [x_1,x_2,\\dots,x_p]\\). Our function \\(f(\\mathbf{x}): \\mathbb{R}^p \\mapsto \\mathbb{R}\\) maps a \\(p\\) dimensional input to a scalar output.\nSecond, instead of displacing our current location with the negative gradient vector itself, we’ll first rescale it with a learning rate parameter. This helps address any issues with units on inputs versus outputs. Imagine the input could range between 0 and 1, but the output ranged from 0 to 1,000. We would need to rescale the partial derivatives so the update step doesn’t send us way too far off in input space.\nFinally, we’ll index our updates with \\(t=0,1,\\dots\\). We’ll run for some prespecified number of iterations or we’ll stop the procedure once the change in \\(f(\\mathbf{x})\\) is sufficiently small from one iteration to the next. 
Our update equation will look like this.\n\\[\\mathbf{x}_{t+1} = \\mathbf{x}_t - \\eta \\nabla f ( \\mathbf{x}_t) \\]\nIn pseudocode we could write it like this.\n# gradient descent\nx = initial_value_of_x \nfor t in range(n_iterations): # or some other convergence condition\n x -= learning_rate * gradient_of_f(x)\nNow let’s see how this algorithm gets used to train models." + "objectID": "posts/hello-pyspark/index.html#pyspark-concepts", + "href": "posts/hello-pyspark/index.html#pyspark-concepts", + "title": "Hello PySpark!", + "section": "PySpark Concepts", + "text": "PySpark Concepts\nPySpark provides two main abstractions for data: the RDD and the dataframe. RDD’s are just a distributed list of objects; we won’t go into details about them in this post. For us, the key object in PySpark is the dataframe.\nWhile PySpark dataframes expose much of the functionality you would expect from a library for tabular data manipulation, they behave a little differently from pandas dataframes, both syntactically and under-the-hood. There are a couple of key concepts that will help explain these idiosyncracies.\nImmutability - Pyspark RDD’s and dataframes are immutable. This means that if you change an object, e.g. by adding a column to a dataframe, PySpark returns a reference to a new dataframe; it does not modify the existing dataframe. This is kind of nice, because we don’t have to worry about that whole view versus copy nonsense that happens in pandas.\nLazy Evaluation - Lazy evaluation means that when we start manipulating a dataframe, PySpark won’t actually perform any of the computations until we explicitly ask for the result. This is nice because it potentially allows PySpark to do fancy optimizations before executing a sequence of operations. It’s also confusing at first, because PySpark will seem to blaze through complex operations and then take forever to print a few rows of the dataframe." }, { - "objectID": "posts/get-down-with-gradient-descent/index.html#training-a-linear-regression-model-with-gradient-descent", - "href": "posts/get-down-with-gradient-descent/index.html#training-a-linear-regression-model-with-gradient-descent", - "title": "Get Down with Gradient Descent", - "section": "Training a Linear Regression Model with Gradient Descent", - "text": "Training a Linear Regression Model with Gradient Descent\nTo get the intuition for how we use gradient descent to train models, let’s use it to train a linear regression model. Note that we wouldn’t actually use gradient descent to train a linear model in real life since there is an exact analytical solution for the best-fit parameter values.\nAnyway, in the simple linear regression problem we have numerical feature \\(x\\) and numerical target \\(y\\), and we want to find a model of the form\n\\[F(x) = \\alpha + \\beta x\\]\nThis model has two parameters, \\(\\alpha\\) and \\(\\beta\\). Here “training” means finding the parameter values that make \\(F(x)\\) fit our \\(y\\) data best. We measure how well, or really how poorly, our model fits the data by using a loss function that yields a small value when a model fits well. Ordinary least squares is so named because it uses mean squared error as its loss function.\n\\[L(y, F(x)) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - F(x_i))^2 = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - (\\alpha + \\beta x_i))^2 \\]\nThe loss function \\(L\\) takes four arguments: \\(x\\), \\(y\\), \\(\\alpha\\), and \\(\\beta\\). 
But since \\(x\\) and \\(y\\) are fixed given our dataset, we could write the loss as \\(L(\\alpha, \\beta | x, y)\\) to emphasize that \\(\\alpha\\) and \\(\\beta\\) are the only free parameters. So we’re looking for the following.\n\\[\\underset{\\alpha,\\beta}{\\operatorname{argmin}} ~ L(\\alpha,\\beta|x,y) \\]\nThat’s right, we’re looking for the values of \\(\\alpha\\) and \\(\\beta\\) that minimize scalar-valued function \\(L(\\alpha, \\beta)\\). Sounds familiar huh?\nTo solve this minimization problem with gradient descent, we can use the following update rule.\n\\[[\\alpha_{t+1}, \\beta_{t+1}] = [\\alpha_{t}, \\beta_{t}] - \\eta \\nabla L(\\alpha_t, \\beta_t | x, y) \\]\nTo get the gradient \\(\\nabla L(\\alpha,\\beta|x,y)\\), we need the partial derivatives of \\(L\\) with respect to \\(\\alpha\\) and \\(\\beta\\). Since \\(L\\) is just a big sum, it’s easy to calculate the derivatives.\n\\[ \\frac{\\partial L(\\alpha, \\beta)}{\\partial \\alpha} = \\frac{1}{n} \\sum_{i=1}^{n} -2 (y_i - (\\alpha + \\beta x_i)) \\] \\[ \\frac{\\partial L(\\alpha, \\beta)}{\\partial \\beta} = \\frac{1}{n} \\sum_{i=1}^{n} -2x_i (y_i - (\\alpha + \\beta x_i)) \\]\nGreat! We’ve got everything we need to implement gradient descent to train an ordinary least squares model. Everything except data that is.\n\nToy Data\nLet’s make a friendly little linear dataset where \\(\\alpha=-10\\) and \\(\\beta=2\\), i.e.\n\\[ y = -10 + 2x + \\text{noise}\\]\n\nimport numpy as np \n\nalpha_true = -10\nbeta_true = 2\n\nrng = np.random.default_rng(42)\nx = np.linspace(0, 10, 50)\ny = alpha_true + beta_true*x + rng.normal(0, 1, size=x.shape)\n\n\n\n\n\n\n\n\nImplementation\nOur implementation will use a function to compute the gradient of the loss function. Since we have two parameters, we’ll use length-2 arrays to hold their values and their partial derivatives. At each iteration, we update the parameter values by subtracting the rescaled partial derivatives.\n\n\n# linear regression using gradient descent \n\ndef gradient_of_loss(parameters, x, y):\n alpha = parameters[0]\n beta = parameters[1]\n partial_alpha = np.mean(-2*(y - (alpha + beta*x)))\n partial_beta = np.mean(-2*x*(y - (alpha + beta*x)))\n return np.array([partial_alpha, partial_beta])\n\nlearning_rate = 0.02\nparameters = np.array([0.0, 0.0]) # initial values of alpha and beta\n\nfor _ in range(500):\n partial_derivatives = gradient_of_loss(parameters, x, y)\n parameters -= learning_rate * partial_derivatives\n \nparameters\n\narray([-10.07049616, 2.03559051])\n\n\nWe can see the loss function decreasing throughout the 500 iterations.\n\n\n\n\n\nAnd we can visualize the loss function as a contour plot over \\((\\alpha,\\beta)\\) space. The blue points show the trajectory our gradient descent followed as it shimmied from the initial position to the coolest spot in \\((\\alpha, \\beta)\\) space where the loss function is nice and small.\n\n\n\n\n\nOur gradient descent settles in a spot pretty close to \\((-10, 2)\\) in \\((\\alpha,\\beta)\\) space, which gives us the final fitted model below." + "objectID": "posts/hello-pyspark/index.html#pyspark-dataframe-essentials", + "href": "posts/hello-pyspark/index.html#pyspark-dataframe-essentials", + "title": "Hello PySpark!", + "section": "PySpark Dataframe Essentials", + "text": "PySpark Dataframe Essentials\n\nCreating a PySpark dataframe with createDataFrame()\nThe first thing we’ll need is a way to make dataframes. 
createDataFrame() allows us to create PySpark dataframes from python objects like nested lists or pandas dataframes. Notice that createDataFrame() is a method of the spark session class, so we’ll call it from our spark session sparkby saying spark.createDataFrame().\n\n# create pyspark dataframe from nested lists\nmy_df = spark.createDataFrame(\n data=[\n [2022, \"tiger\"],\n [2023, \"rabbit\"],\n [2024, \"dragon\"]\n ],\n schema=['year', 'animal']\n)\n\nLet’s read the seaborn tips dataset into a pandas dataframe and then use it to create a PySpark dataframe.\n\nimport pandas as pd\n\n# load tips dataset into a pandas dataframe\npandas_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')\n\n# create pyspark dataframe from a pandas dataframe\npyspark_df = spark.createDataFrame(pandas_df)\n\n\n\n\n\n\n\nNote\n\n\n\nIn real life when we’re running PySpark on a large-scale distributed system, we would not generally want to use python lists or pandas dataframes to load data into PySpark. Ideally we would want to read data directly from where it is stored on HDFS, e.g. by reading parquet files, or by querying directly from a hive database using spark sql.\n\n\n\n\nPeeking at a dataframe’s contents\nThe default print method for the PySpark dataframe will just give you the schema.\n\npyspark_df\n\nDataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]\n\n\nIf we want to peek at some of the data, we’ll need to use the show() method, which is analogous to the pandas head(). Remember that show() will cause PySpark to execute any operations that it’s been lazily waiting to evaluate, so sometimes it can take a while to run.\n\n# show the first few rows of the dataframe\npyspark_df.show(5)\n\n+----------+----+------+------+---+------+----+\n|total_bill| tip| sex|smoker|day| time|size|\n+----------+----+------+------+---+------+----+\n| 16.99|1.01|Female| No|Sun|Dinner| 2|\n| 10.34|1.66| Male| No|Sun|Dinner| 3|\n| 21.01| 3.5| Male| No|Sun|Dinner| 3|\n| 23.68|3.31| Male| No|Sun|Dinner| 2|\n| 24.59|3.61|Female| No|Sun|Dinner| 4|\n+----------+----+------+------+---+------+----+\nonly showing top 5 rows\n\n\n\n\n[Stage 0:> (0 + 1) / 1]\n\n \n\n\nWe thus encounter our first rude awakening. PySpark’s default representation of dataframes in the notebook isn’t as pretty as that of pandas. But no one ever said it would be pretty, they just said it would be scalable.\nYou can also use the printSchema() method for a nice vertical representation of the schema.\n\n# show the dataframe schema\npyspark_df.printSchema()\n\nroot\n |-- total_bill: double (nullable = true)\n |-- tip: double (nullable = true)\n |-- sex: string (nullable = true)\n |-- smoker: string (nullable = true)\n |-- day: string (nullable = true)\n |-- time: string (nullable = true)\n |-- size: long (nullable = true)\n\n\n\n\n\nSelect columns by name\nYou can select specific columns from a dataframe using the select() method. You can pass either a list of names, or pass names as arguments.\n\n# select some of the columns\npyspark_df.select('total_bill', 'tip')\n\n# select columns in a list\npyspark_df.select(['day', 'time', 'total_bill'])\n\n\n\nFilter rows based on column values\nAnalogous to the WHERE clause in SQL, and the query() method in pandas, PySpark provides a filter() method which returns only the rows that meet the specified conditions. Its argument is a string specifying the condition to be met for rows to be included in the result. 
You specify the condition as an expression involving the column names and comparison operators like <, >, <=, >=, == (equal), and ~= (not equal). You can specify compound expressions using and and or, and you can even do a SQL-like in to check if the column value matches any items in a list.\n\n## compare a column to a value\npyspark_df.filter('total_bill > 20')\n\n# compare two columns with arithmetic\npyspark_df.filter('tip > 0.15 * total_bill')\n\n# check equality with a string value\npyspark_df.filter('sex == \"Male\"')\n\n# check equality with any of several possible values\npyspark_df.filter('day in (\"Sat\", \"Sun\")')\n\n# use \"and\" \npyspark_df.filter('day == \"Fri\" and time == \"Lunch\"')\n\nIf you’re into boolean indexing with the brackets, PySpark does support that too, but I encourage you to use filter() instead. Check out my rant about why you shouldn’t use boolean indexing for the details. The TLDR is that filter() requires less typing, makes your code more readable and portable, and it allows you to chain method calls together using dot chains.\nHere’s the boolean indexing equivalent of the last example from above.\n\n# using boolean indexing\npyspark_df[(pyspark_df.day == 'Fri') & (pyspark_df.time == 'Lunch')]\n\nI know, it looks horrendous, but not as horrendous as the error message you’ll get if you forget the parentheses.\n\n\nAdd new columns to a dataframe\nYou can add new columns which are functions of the existing columns with the withColumn() method.\n\nimport pyspark.sql.functions as f\n\n# add a new column using col() to reference other columns\npyspark_df.withColumn('tip_percent', f.col('tip') / f.col('total_bill'))\n\nNotice that we’ve imported the pyspark.sql.functions module. This module contains lots of useful functions that we’ll be using all over the place, so it’s probably a good idea to go ahead and import it whenever you’re using PySpark. BTW, it seems like folks usually import this module as f or F. In this example we’re using the col() function, which allows us to refer to columns in our dataframe using string representations of the column names.\nYou could also achieve the same result using the dot to reference the other columns, but this requires us to type the dataframe name over and over again, which makes it harder to reuse this code on different dataframes or in dot chains.\n\n# add a new column using the dot to reference other columns (less recommended)\npyspark_df.withColumn('tip_percent', pyspark_df.tip / pyspark_df.total_bill)\n\nIf you want to apply numerical transformations like exponents or logs, use the built-in functions in the pyspark.sql.functions module.\n\n# log \npyspark_df.withColumn('log_bill', f.log(f.col('total_bill')))\n\n# exponent\npyspark_df.withColumn('bill_squared', f.pow(f.col('total_bill'), 2))\n\nYou can implement conditional assignment like SQL’s CASE WHEN construct using the when() function and the otherwise() method.\n\n# conditional assignment (like CASE WHEN)\npyspark_df.withColumn('is_male', f.when(f.col('sex') == 'Male', True).otherwise(False))\n\n# using multiple when conditions and values\npyspark_df.withColumn('bill_size', \n f.when(f.col('total_bill') < 10, 'small')\n .when(f.col('total_bill') < 20, 'medium')\n .otherwise('large')\n)\n\nRemember that since PySpark dataframes are immutable, calling withColumns() on a dataframe returns a new dataframe. 
If you want to persist the result, you’ll need to make an assignment.\npyspark_df = pyspark_df.withColumns(...)\n\n\nGroup by and aggregate\nPySpark provides a groupBy() method similar to the pandas groupby(). Just like in pandas, we can call methods like count() and mean() on our grouped dataframe, and we also have a more flexible agg() method that allows us to specify column-aggregation mappings.\n\n\n# group by and count\npyspark_df.groupBy('time').count().show()\n\n+------+-----+\n| time|count|\n+------+-----+\n|Dinner| 176|\n| Lunch| 68|\n+------+-----+\n\n\n\n\n\n# group by and specify column-aggregation mappings with agg()\npyspark_df.groupBy('time').agg({'total_bill': 'mean', 'tip': 'max'}).show()\n\n+------+--------+------------------+\n| time|max(tip)| avg(total_bill)|\n+------+--------+------------------+\n|Dinner| 10.0| 20.79715909090909|\n| Lunch| 6.7|17.168676470588235|\n+------+--------+------------------+\n\n\n\nIf you want to get fancier with your aggregations, it might just be easier to express them using hive syntax. Read on to find out how.\n\n\nRun Hive SQL on dataframes\nOne of the mind-blowing features of PySpark is that it allows you to write hive SQL queries on your dataframes. To take a PySpark dataframe into the SQL world, use the createOrReplaceTempView() method. This method takes one string argument which will be the dataframes name in the SQL world. Then you can use spark.sql() to run a query. The result is returned as a PySpark dataframe.\n\n\n# put pyspark dataframe in SQL world and query it\npyspark_df.createOrReplaceTempView('tips')\nspark.sql('select * from tips').show(5)\n\n+----------+----+------+------+---+------+----+\n|total_bill| tip| sex|smoker|day| time|size|\n+----------+----+------+------+---+------+----+\n| 16.99|1.01|Female| No|Sun|Dinner| 2|\n| 10.34|1.66| Male| No|Sun|Dinner| 3|\n| 21.01| 3.5| Male| No|Sun|Dinner| 3|\n| 23.68|3.31| Male| No|Sun|Dinner| 2|\n| 24.59|3.61|Female| No|Sun|Dinner| 4|\n+----------+----+------+------+---+------+----+\nonly showing top 5 rows\n\n\n\nThis is awesome for a couple of reasons. First, it allows us to easily express any transformations in hive syntax. If you’re like me and you’ve already been using hive, this will dramatically reduce the PySpark learning curve, because when in doubt, you can always bump a dataframe into the SQL world and simply use hive to do what you need. Second, if you have a hive deployment, PySpark’s SQL world also has access to all of your hive tables. This means you can write queries involving both hive tables and your PySpark dataframes. 
It also means you can run hive commands, like inserting into a table, directly from PySpark.\nLet’s do some aggregations that might be a little trickier to do using the PySpark built-in functions.\n\n\n# run hive query and save result to dataframe\ntip_stats_by_time = spark.sql(\"\"\"\n select\n time\n , count(*) as n \n , avg(tip) as avg_tip\n , percentile_approx(tip, 0.5) as med_tip\n , avg(case when tip > 3 then 1 else 0 end) as pct_tip_gt_3\n from \n tips\n group by 1\n\"\"\")\n\ntip_stats_by_time.show()\n\n+------+---+------------------+-------+-------------------+\n| time| n| avg_tip|med_tip| pct_tip_gt_3|\n+------+---+------------------+-------+-------------------+\n|Dinner|176| 3.102670454545455| 3.0|0.44886363636363635|\n| Lunch| 68|2.7280882352941176| 2.2|0.27941176470588236|\n+------+---+------------------+-------+-------------------+" }, { - "objectID": "posts/get-down-with-gradient-descent/index.html#wrapping-up", - "href": "posts/get-down-with-gradient-descent/index.html#wrapping-up", - "title": "Get Down with Gradient Descent", + "objectID": "posts/hello-pyspark/index.html#visualization-with-pyspark", + "href": "posts/hello-pyspark/index.html#visualization-with-pyspark", + "title": "Hello PySpark!", + "section": "Visualization with PySpark", + "text": "Visualization with PySpark\nThere aren’t any tools for visualization included in PySpark. But that’s no problem, because we can just use the toPandas() method on a PySpark dataframe to pull data back into pandas. Once we have a pandas dataframe, we can happily build visualizations as usual. Of course, if your PySpark dataframe is huge, you wouldn’t want to use toPandas() directly, because PySpark will attempt to read the entire contents of its huge dataframe into memory. Instead, it’s best to use PySpark to generate aggregations of your data for plotting or to pull only a sample of your full data into pandas.\n\n# read aggregated pyspark dataframe into pandas for plotting\nplot_pdf = tip_stats_by_time.toPandas()\nplot_pdf.plot.bar(x='time', y=['avg_tip', 'med_tip']);" + }, + { + "objectID": "posts/hello-pyspark/index.html#wrapping-up", + "href": "posts/hello-pyspark/index.html#wrapping-up", + "title": "Hello PySpark!", "section": "Wrapping Up", - "text": "Wrapping Up\nThere you have it, gradient descent explained with a fresh new analogy having nothing whatsoever to do with foggy mountains, plus an implemented example fitting a linear model. While we often see gradient descent used to train models by performing an optimization in parameter space, as in generalized linear models and neural networks, there are other ways to use this powerful technique to train models. In particular, we’ll soon see how our beloved gradient boosting tree models use gradient descent in prediction space, rather than parameter space. Stay tuned for that mind bender in a future post." + "text": "Wrapping Up\nSo that’s a wrap on our crash course in working with PySpark. You now have a good idea of what pyspark is and how to get started manipulating dataframes with it. Stay tuned for a future post on PySpark’s companion ML library MLlib. In the meantime, may no dataframe be too large for you ever again." 
}, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "objectID": "posts/xgboost-from-scratch/index.html", + "href": "posts/xgboost-from-scratch/index.html", + "title": "XGBoost from Scratch", "section": "", - "text": "Cold water cascades over the rocks in Erwin, Tennessee.\nFriends, this is going to be an epic post! Today, we bring together all the ideas we’ve built up over the past few posts to nail down our understanding of the key ideas in Jerome Friedman’s seminal 2001 paper: “Greedy Function Approximation: A Gradient Boosting Machine.” In particular, we’ll summarize the highlights from the paper, and we’ll build an in-house python implementation of his generic gradient boosting algorithm which can train with any differentiable loss function. What’s more, we’ll go ahead and take our generic gradient boosting machine for a spin by training it with several of the most popular loss functions used in practice.\nAre you freaking stoked or what?\nSweet. Let’s do this." + "text": "A weathered tree reaches toward the sea at Playa Mal País\nWell, dear reader, it’s that time again, time for us to do a seemingly unnecessary scratch build of a popular algorithm that most people would simply import from the library without a second thought. But readers of this blog are not most people. Of course you know that when we do scratch builds, it’s not for the hell of it, it’s for the purpose of demystification. To that end, today we are going to implement XGBoost from scratch in python, using only numpy and pandas.\nSpecifically we’re going to implement the core statistical learning algorithm of XGBoost, including most of the key hyperparameters and their functionality. Our implementation will also support user-defined custom objective functions, meaning that it can perform regression, classification, and whatever exotic learning tasks you can dream up, as long as you can write down a twice-differentiable objective function. We’ll refrain from implementing some simple features like column subsampling which will be left to you, gentle reader, as exercises. In terms of tree methods, we’re going to implement the exact tree-splitting algorithm, leaving the sparsity-aware method (used to handle missing feature values) and the approximate method (used for scalability) as exercises or maybe topics for future posts.\nAs always, if something is unclear, try backtracking through the previous posts on gradient boosting and decision trees to clarify your intuition. We’ve already built up all the statistical and computational background needed to make sense of this scratch build. Here are the most important prerequisite posts:\nGreat, let’s do this." }, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedman-2001-tldr", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedman-2001-tldr", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", - "section": "Friedman 2001: TL;DR", - "text": "Friedman 2001: TL;DR\nI’ve mentioned this paper a couple of times before, but as far as I can tell, this is the origin of gradient boosting; it is therefore, a seminal work worth reading. You know what, I think you might like to pick up the paper and read it yourself. 
Like many papers, there is a lot of scary looking math in the first few pages, but if you’ve been following along on this blog, you’ll find that it’s actually totally approachable. This is the kind of thing that cures imposter syndrome, so give it a shot. That said, here’s the TL;DR as I see it.\nThe first part of the paper introduces the idea of fitting models by doing gradient descent in function space, an ingenious idea we spent an entire post demystifying earlier. Friedman goes on to introduce the generic gradient boost algorithm, which works with any differentiable loss function, as well as specific variants for minimizing absolute error, Huber loss, and binary deviance. In terms of hyperparameters, he points out that the learning rate can be used to reduce overfitting, while increased tree depth can help capture more complex interactions among features. He even discusses feature importance and partial dependence methods for interpreting fitted gradient boosting models.\nFriedman concludes by musing about the advantages of gradient boosting with trees. He notes some key advantages afforded by the use of decision trees including no need to rescale input data, robustness against irrelevant input features, and elegant handling of missing feature values. He points out that gradient boosting manages to capitalize on the benefits of decision trees while minimizing their key weakness (crappy accuracy). I think this offers a great insight into why gradient boosting models have become so widespread and successful in practical ML applications." + "objectID": "posts/xgboost-from-scratch/index.html#the-xgboost-model-class", + "href": "posts/xgboost-from-scratch/index.html#the-xgboost-model-class", + "title": "XGBoost from Scratch", + "section": "The XGBoost Model Class", + "text": "The XGBoost Model Class\nWe begin with the user-facing API for our model, a class called XGBoostModel which will implement gradient boosting and prediction. To be more consistent with the XGBoost library, we’ll pass hyperparameters to our model in a parameter dictionary, so our init method is going to pull relevant parameters out of the dictionary and set them as object attributes. Note the use of python’s defaultdict so we don’t have to worry about handling key errors if we try to access a parameter that the user didn’t set in the dictionary.\n\nimport math\nimport numpy as np \nimport pandas as pd\nfrom collections import defaultdict\n\n\nclass XGBoostModel():\n '''XGBoost from Scratch\n '''\n \n def __init__(self, params, random_seed=None):\n self.params = defaultdict(lambda: None, params)\n self.subsample = self.params['subsample'] \\\n if self.params['subsample'] else 1.0\n self.learning_rate = self.params['learning_rate'] \\\n if self.params['learning_rate'] else 0.3\n self.base_prediction = self.params['base_score'] \\\n if self.params['base_score'] else 0.5\n self.max_depth = self.params['max_depth'] \\\n if self.params['max_depth'] else 5\n self.rng = np.random.default_rng(seed=random_seed)\n\nThe fit method, based on our classic GBM, takes a feature dataframe, a target vector, the objective function, and the number of boosting rounds as arguments. 
The user-supplied objective function should be an object with loss, gradient, and hessian methods, each of which takes a target vector and a prediction vector as input; the loss method should return a scalar loss score, the gradient method should return a vector of gradients, and the hessian method should return a vector of hessians.\nIn contrast to boosting in the classic GBM, instead of computing residuals between the current predictions and the target, we compute gradients and hessians of the loss function with respect to the current predictions, and instead of predicting residuals with a decision tree, we fit a special XGBoost tree booster (which we’ll implement in a moment) using the gradients and hessians. I’ve also added row subsampling by drawing a random subset of instance indices and passing them to the tree booster during each boosting round. The rest of the fit method is the same as the classic GBM, and the predict method is identical too.\n\ndef fit(self, X, y, objective, num_boost_round, verbose=False):\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n self.boosters = []\n for i in range(num_boost_round):\n gradients = objective.gradient(y, current_predictions)\n hessians = objective.hessian(y, current_predictions)\n sample_idxs = None if self.subsample == 1.0 \\\n else self.rng.choice(len(y), \n size=math.floor(self.subsample*len(y)), \n replace=False)\n booster = TreeBooster(X, gradients, hessians, \n self.params, self.max_depth, sample_idxs)\n current_predictions += self.learning_rate * booster.predict(X)\n self.boosters.append(booster)\n if verbose: \n print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')\n \ndef predict(self, X):\n return (self.base_prediction + self.learning_rate \n * np.sum([booster.predict(X) for booster in self.boosters], axis=0))\n\nXGBoostModel.fit = fit\nXGBoostModel.predict = predict \n\nAll we have to do now is implement the tree booster." }, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedmans-generic-gradient-boosting-algorithm", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#friedmans-generic-gradient-boosting-algorithm", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", - "section": "Friedman’s Generic Gradient Boosting Algorithm", - "text": "Friedman’s Generic Gradient Boosting Algorithm\nLet’s take a closer look at Friedman’s original gradient boost algorithm, Alg. 1 in Section 3 of the paper (translated into the notation we’ve been using so far).\nLike last time, we have training data \\((\\mathbf{y}, \\mathbf{X})\\) where \\(\\mathbf{y}\\) is a length-\\(n\\) vector of target values, and \\(\\mathbf{X}\\) is an \\(n \\times p\\) matrix with \\(n\\) observations of \\(p\\) features. 
We also have a differentiable loss function \\(L(\\mathbf{y}, \\mathbf{\\hat{y}}) = \\sum_{i=1}^n l(y_i, \\hat{y}_i)\\), a “learning rate” hyperparameter \\(\\eta\\), and a fixed number of model iterations \\(M\\).\nAlgorithm: gradient_boost\\((\\mathbf{X},\\mathbf{y},L,\\eta, M)\\) returns: model \\(F_M\\)\n\nLet base model \\(F_0(\\mathbf{x}) = c\\), where \\(c = \\text{argmin}_{c} \\sum_{i=1}^n l(y_i, c)\\)\nfor \\(m\\) = \\(0\\) to \\(M-1\\):\n     Let “pseudo-residual” vector \\(\\mathbf{r}_m = -\\nabla_{\\mathbf{\\hat{y}}_m} L(\\mathbf{y},\\mathbf{\\hat{y}}_m)\\)\n     Train decision tree regressor \\(h_m(\\mathbf{X})\\) to predict \\(\\mathbf{r}_m\\) (minimizing squared error)\n     foreach terminal leaf node \\(t \\in h_m\\):\n          Let \\(v = \\text{argmin}_v \\sum_{i \\in t} l(y_i, F_m(\\mathbf{x}_i) + v)\\)\n          Set terminal leaf node \\(t\\) to predict value \\(v\\)\n     \\(F_{m+1}(\\mathbf{X}) = F_{m}(\\mathbf{X}) + \\eta h_m(\\mathbf{X})\\)\nReturn composite model \\(F_M\\)\n\nBy now, most of this is already familiar to us. We begin by setting the base model \\(F_0\\) equal to the constant prediction value that minimizes the loss over all examples in the training dataset (line 1). Then we begin the boosting iterations (line 2), each time computing the negative gradients of the loss with respect to the current model predictions (known as the pseudo residuals) (line 3). We then fit our next decision tree regressor to predict the pseudo residuals (line 4).\nThen we encounter something new on lines 5-7. When we fit a vanilla decision tree regressor to predict pseudo residuals, we’re using mean squared error as the loss function to train the tree. As you might imagine, this works well when the global loss function is also squared error. But if we want to use a global loss other than squared error, there is an additional trick we can use to further increase the composite model’s accuracy. The idea is to continue using squared error to train each decision tree, keeping its structure and split conditions but altering the predicted value in each leaf to help minimize the global loss function. Instead of using the mean target value as the prediction for each node (as we would do when minimizing squared error), we use a numerical optimization method like line search to choose the constant value for that leaf that leads to the best overall loss. This is the same thing we did in line 1 of the algorithm to set the base prediction, but here we choose the optimal prediction for each terminal node of the newly trained decision tree." + "objectID": "posts/xgboost-from-scratch/index.html#the-xgboost-tree-booster", + "href": "posts/xgboost-from-scratch/index.html#the-xgboost-tree-booster", + "title": "XGBoost from Scratch", + "section": "The XGBoost Tree Booster", + "text": "The XGBoost Tree Booster\nThe XGBoost tree booster is a modified version of the decision tree that we built in the decision tree from scratch post. Like the decision tree, we recursively build a binary tree structure by finding the best split rule for each node in the tree. The main difference is the criterion for evaluating splits and the way that we define a leaf’s predicted value. Instead of being functions of the target values of the instances in each node, the criterion and predicted values are functions of the instance gradients and hessians. 
Thus we need only make a couple of modifications to our previous decision tree implementation to create the XGBoost tree booster.\n\nInitialization and Inserting Child Nodes\nMost of the init method is just parsing the parameter dictionary to assign parameters as object attributes. The one notable difference from our decision tree is in the way we define the node’s predicted value. We define self.value according to equation 5 of the XGBoost paper, a simple function of the gradient and hessian values of the instances in the current node. Of course the init also goes on to build the tree via the maybe insert child nodes method. This method is nearly identical to the one we implemented for our decision tree. So far so good.\n\nclass TreeBooster():\n \n def __init__(self, X, g, h, params, max_depth, idxs=None):\n self.params = params\n self.max_depth = max_depth\n assert self.max_depth >= 0, 'max_depth must be nonnegative'\n self.min_child_weight = params['min_child_weight'] \\\n if params['min_child_weight'] else 1.0\n self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0\n self.gamma = params['gamma'] if params['gamma'] else 0.0\n self.colsample_bynode = params['colsample_bynode'] \\\n if params['colsample_bynode'] else 1.0\n if isinstance(g, pd.Series): g = g.values\n if isinstance(h, pd.Series): h = h.values\n if idxs is None: idxs = np.arange(len(g))\n self.X, self.g, self.h, self.idxs = X, g, h, idxs\n self.n, self.c = len(idxs), X.shape[1]\n self.value = -g[idxs].sum() / (h[idxs].sum() + self.reg_lambda) # Eq (5)\n self.best_score_so_far = 0.\n if self.max_depth > 0:\n self._maybe_insert_child_nodes()\n\n def _maybe_insert_child_nodes(self):\n for i in range(self.c): self._find_better_split(i)\n if self.is_leaf: return\n x = self.X.values[self.idxs,self.split_feature_idx]\n left_idx = np.nonzero(x <= self.threshold)[0]\n right_idx = np.nonzero(x > self.threshold)[0]\n self.left = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[left_idx])\n self.right = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[right_idx])\n\n @property\n def is_leaf(self): return self.best_score_so_far == 0.\n\n def _find_better_split(self, feature_idx):\n pass\n\n\n\nSplit Finding\nSplit finding follows the exact same pattern that we used in the decision tree, except we keep track of gradient and hessian stats instead of target value stats, and of course we use the XGBoost gain criterion (equation 7 from the paper) for evaluating splits.\n\ndef _find_better_split(self, feature_idx):\n x = self.X.values[self.idxs, feature_idx]\n g, h = self.g[self.idxs], self.h[self.idxs]\n sort_idx = np.argsort(x)\n sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]\n sum_g, sum_h = g.sum(), h.sum()\n sum_g_right, sum_h_right = sum_g, sum_h\n sum_g_left, sum_h_left = 0., 0.\n\n for i in range(0, self.n - 1):\n g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]\n sum_g_left += g_i; sum_g_right -= g_i\n sum_h_left += h_i; sum_h_right -= h_i\n if sum_h_left < self.min_child_weight or x_i == x_i_next:continue\n if sum_h_right < self.min_child_weight: break\n\n gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))\n + (sum_g_right**2 / (sum_h_right + self.reg_lambda))\n - (sum_g**2 / (sum_h + self.reg_lambda))\n ) - self.gamma/2 # Eq(7) in the xgboost paper\n if gain > self.best_score_so_far: \n self.split_feature_idx = feature_idx\n self.best_score_so_far = gain\n self.threshold = (x_i + x_i_next) / 2\n 
\nTreeBooster._find_better_split = _find_better_split\n\n\n\nPrediction\nPrediction works exactly the same as in our decision tree, and the methods are nearly identical.\n\ndef predict(self, X):\n return np.array([self._predict_row(row) for i, row in X.iterrows()])\n\ndef _predict_row(self, row):\n if self.is_leaf: \n return self.value\n child = self.left if row[self.split_feature_idx] <= self.threshold \\\n else self.right\n return child._predict_row(row)\n\nTreeBooster.predict = predict \nTreeBooster._predict_row = _predict_row" }, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#implementation", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#implementation", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", - "section": "Implementation", - "text": "Implementation\nI did some (half-assed) searching on the interweb for an implementation of GBM that allows the user to provide a custom loss function, and you know what? I couldn’t find anything. If you find another implementation, post in the comments so we can learn from it too.\nSince we need to modify the values predicted by our decision trees’ terminal nodes, we’ll want to brush up on the scikit-learn decision tree structure before we get going. You can see explanations of all the necessary decision tree hacks in this notebook.\n\nimport numpy as np\nfrom sklearn.tree import DecisionTreeRegressor \nfrom scipy.optimize import minimize\n\nclass GradientBoostingMachine():\n '''Gradient Boosting Machine supporting any user-supplied loss function.\n \n Parameters\n ----------\n n_trees : int\n number of boosting rounds\n \n learning_rate : float\n learning rate hyperparameter\n \n max_depth : int\n maximum tree depth\n '''\n \n def __init__(self, n_trees, learning_rate=0.1, max_depth=1):\n self.n_trees=n_trees; \n self.learning_rate=learning_rate\n self.max_depth=max_depth;\n \n def fit(self, X, y, objective):\n '''Fit the GBM using the specified loss function.\n \n Parameters\n ----------\n X : ndarray of size (number observations, number features)\n design matrix\n \n y : ndarray of size (number observations,)\n target values\n \n objective : loss function class instance\n Class specifying the loss function for training.\n Should implement two methods:\n loss(labels: ndarray, predictions: ndarray) -> float\n negative_gradient(labels: ndarray, predictions: ndarray) -> ndarray\n '''\n \n self.trees = []\n self.base_prediction = self._get_optimal_base_value(y, objective.loss)\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n for _ in range(self.n_trees):\n pseudo_residuals = objective.negative_gradient(y, current_predictions)\n tree = DecisionTreeRegressor(max_depth=self.max_depth)\n tree.fit(X, pseudo_residuals)\n self._update_terminal_nodes(tree, X, y, current_predictions, objective.loss)\n current_predictions += self.learning_rate * tree.predict(X)\n self.trees.append(tree)\n \n def _get_optimal_base_value(self, y, loss):\n '''Find the optimal initial prediction for the base model.'''\n fun = lambda c: loss(y, c)\n c0 = y.mean()\n return minimize(fun=fun, x0=c0).x[0]\n \n def _update_terminal_nodes(self, tree, X, y, current_predictions, loss):\n '''Update the tree's predictions according to the loss function.'''\n # terminal node id's\n leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]\n # compute leaf for each sample in ``X``.\n leaf_node_for_each_sample = tree.apply(X)\n for leaf in leaf_nodes:\n 
samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]\n y_in_leaf = y.take(samples_in_this_leaf, axis=0)\n preds_in_leaf = current_predictions.take(samples_in_this_leaf, axis=0)\n val = self._get_optimal_leaf_value(y_in_leaf, \n preds_in_leaf,\n loss)\n tree.tree_.value[leaf, 0, 0] = val\n \n def _get_optimal_leaf_value(self, y, current_predictions, loss):\n '''Find the optimal prediction value for a given leaf.'''\n fun = lambda c: loss(y, current_predictions + c)\n c0 = y.mean()\n return minimize(fun=fun, x0=c0).x[0]\n \n def predict(self, X):\n '''Generate predictions for the given input data.'''\n return (self.base_prediction \n + self.learning_rate \n * np.sum([tree.predict(X) for tree in self.trees], axis=0))\n\nIn terms of design, we implement a class for the GBM with scikit-like fit and predict methods. Notice in the below implementation that the fit method is only 10 lines long, and corresponds very closely to Friedman’s gradient boost algorithm from above. Most of the complexity comes from the helper methods for updating the leaf values according to the specified loss function.\nWhen the user wants to call the fit method, they’ll need to supply the loss function they want to use for boosting. We’ll make the user implement their loss (a.k.a. objective) function as a class with two methods: (1) a loss method taking the labels and the predictions and returning the loss score and (2) a negative_gradient method taking the labels and the predictions and returning an array of negative gradients." + "objectID": "posts/xgboost-from-scratch/index.html#the-complete-xgboost-from-scratch-implementation", + "href": "posts/xgboost-from-scratch/index.html#the-complete-xgboost-from-scratch-implementation", + "title": "XGBoost from Scratch", + "section": "The Complete XGBoost From Scratch Implementation", + "text": "The Complete XGBoost From Scratch Implementation\nHere’s the entire implementation which produces a usable XGBoostModel class with fit and predict methods.\n\nclass XGBoostModel():\n '''XGBoost from Scratch\n '''\n \n def __init__(self, params, random_seed=None):\n self.params = defaultdict(lambda: None, params)\n self.subsample = self.params['subsample'] \\\n if self.params['subsample'] else 1.0\n self.learning_rate = self.params['learning_rate'] \\\n if self.params['learning_rate'] else 0.3\n self.base_prediction = self.params['base_score'] \\\n if self.params['base_score'] else 0.5\n self.max_depth = self.params['max_depth'] \\\n if self.params['max_depth'] else 5\n self.rng = np.random.default_rng(seed=random_seed)\n \n def fit(self, X, y, objective, num_boost_round, verbose=False):\n current_predictions = self.base_prediction * np.ones(shape=y.shape)\n self.boosters = []\n for i in range(num_boost_round):\n gradients = objective.gradient(y, current_predictions)\n hessians = objective.hessian(y, current_predictions)\n sample_idxs = None if self.subsample == 1.0 \\\n else self.rng.choice(len(y), \n size=math.floor(self.subsample*len(y)), \n replace=False)\n booster = TreeBooster(X, gradients, hessians, \n self.params, self.max_depth, sample_idxs)\n current_predictions += self.learning_rate * booster.predict(X)\n self.boosters.append(booster)\n if verbose: \n print(f'[{i}] train loss = {objective.loss(y, current_predictions)}')\n \n def predict(self, X):\n return (self.base_prediction + self.learning_rate \n * np.sum([booster.predict(X) for booster in self.boosters], axis=0))\n \nclass TreeBooster():\n \n def __init__(self, X, g, h, params, max_depth, idxs=None):\n 
self.params = params\n self.max_depth = max_depth\n assert self.max_depth >= 0, 'max_depth must be nonnegative'\n self.min_child_weight = params['min_child_weight'] \\\n if params['min_child_weight'] else 1.0\n self.reg_lambda = params['reg_lambda'] if params['reg_lambda'] else 1.0\n self.gamma = params['gamma'] if params['gamma'] else 0.0\n self.colsample_bynode = params['colsample_bynode'] \\\n if params['colsample_bynode'] else 1.0\n if isinstance(g, pd.Series): g = g.values\n if isinstance(h, pd.Series): h = h.values\n if idxs is None: idxs = np.arange(len(g))\n self.X, self.g, self.h, self.idxs = X, g, h, idxs\n self.n, self.c = len(idxs), X.shape[1]\n self.value = -g[idxs].sum() / (h[idxs].sum() + self.reg_lambda) # Eq (5)\n self.best_score_so_far = 0.\n if self.max_depth > 0:\n self._maybe_insert_child_nodes()\n\n def _maybe_insert_child_nodes(self):\n for i in range(self.c): self._find_better_split(i)\n if self.is_leaf: return\n x = self.X.values[self.idxs,self.split_feature_idx]\n left_idx = np.nonzero(x <= self.threshold)[0]\n right_idx = np.nonzero(x > self.threshold)[0]\n self.left = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[left_idx])\n self.right = TreeBooster(self.X, self.g, self.h, self.params, \n self.max_depth - 1, self.idxs[right_idx])\n\n @property\n def is_leaf(self): return self.best_score_so_far == 0.\n \n def _find_better_split(self, feature_idx):\n x = self.X.values[self.idxs, feature_idx]\n g, h = self.g[self.idxs], self.h[self.idxs]\n sort_idx = np.argsort(x)\n sort_g, sort_h, sort_x = g[sort_idx], h[sort_idx], x[sort_idx]\n sum_g, sum_h = g.sum(), h.sum()\n sum_g_right, sum_h_right = sum_g, sum_h\n sum_g_left, sum_h_left = 0., 0.\n\n for i in range(0, self.n - 1):\n g_i, h_i, x_i, x_i_next = sort_g[i], sort_h[i], sort_x[i], sort_x[i + 1]\n sum_g_left += g_i; sum_g_right -= g_i\n sum_h_left += h_i; sum_h_right -= h_i\n if sum_h_left < self.min_child_weight or x_i == x_i_next:continue\n if sum_h_right < self.min_child_weight: break\n\n gain = 0.5 * ((sum_g_left**2 / (sum_h_left + self.reg_lambda))\n + (sum_g_right**2 / (sum_h_right + self.reg_lambda))\n - (sum_g**2 / (sum_h + self.reg_lambda))\n ) - self.gamma/2 # Eq(7) in the xgboost paper\n if gain > self.best_score_so_far: \n self.split_feature_idx = feature_idx\n self.best_score_so_far = gain\n self.threshold = (x_i + x_i_next) / 2\n \n def predict(self, X):\n return np.array([self._predict_row(row) for i, row in X.iterrows()])\n\n def _predict_row(self, row):\n if self.is_leaf: \n return self.value\n child = self.left if row[self.split_feature_idx] <= self.threshold \\\n else self.right\n return child._predict_row(row)" }, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#testing-our-model", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#testing-our-model", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", - "section": "Testing our Model", - "text": "Testing our Model\nLet’s test drive our custom-loss-ready GBM with a few different loss functions! 
We’ll compare it to the scikit-learn GBM to sanity check our implementation.\n\nfrom sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier\n\nrng = np.random.default_rng()\n\n# test data\ndef make_test_data(n, noise_scale):\n x = np.linspace(0, 10, 500).reshape(-1,1)\n y = (np.where(x < 5, x, 5) + rng.normal(0, noise_scale, size=x.shape)).ravel()\n return x, y\n \n# print model loss scores\ndef print_model_loss_scores(obj, y, preds, sk_preds):\n print(f'From Scratch Loss = {obj.loss(y, pred):0.4}')\n print(f'Scikit-Learn Loss = {obj.loss(y, sk_pred):0.4}')\n\n\nMean Squared Error\nMean Squared Error (a.k.a. Least Squares) loss produces estimates of the mean target value conditioned on the feature values. Here’s the implementation.\n\nx, y = make_test_data(500, 0.4)\n\n\n# from scratch GBM\nclass SquaredErrorLoss():\n '''User-Defined Squared Error Loss'''\n \n def loss(self, y, preds):\n return np.mean((y - preds)**2)\n \n def negative_gradient(self, y, preds):\n return y - preds\n \n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, SquaredErrorLoss())\npred = gbm.predict(x)\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='squared_error')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(SquaredErrorLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.168\nScikit-Learn Loss = 0.168\n\n\n\n\n\n\n\n\n\nMean Absolute Error\nMean Absolute Error (a.k.a.Least Absolute Deviations) loss produces estimates of the median target value conditioned on the feature values. Here’s the implementation.\n\nx, y = make_test_data(500, 0.4)\n\n\n\n# from scratch GBM\nclass AbsoluteErrorLoss():\n '''User-Defined Absolute Error Loss'''\n \n def loss(self, y, preds):\n return np.mean(np.abs(y - preds))\n \n def negative_gradient(self, y, preds):\n return np.sign(y - preds)\n\n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, AbsoluteErrorLoss())\npred = gbm.predict(x)\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='absolute_error')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(AbsoluteErrorLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.3225\nScikit-Learn Loss = 0.3208\n\n\n\n\n\n\n\n\n\nQuantile Loss\nQuantile loss yields estimates of a given quantile of the target variable conditioned on the features. 
Here’s my implementation.\n\nx, y = make_test_data(500, 1)\n\n\n\n# from scratch GBM\nclass QuantileLoss():\n '''Quantile Loss\n \n Parameters\n ----------\n alpha : float\n quantile to be estimated, 0 < alpha < 1\n '''\n \n def __init__(self, alpha):\n if alpha < 0 or alpha >1:\n raise ValueError('alpha must be between 0 and 1')\n self.alpha = alpha\n \n def loss(self, y, preds):\n e = y - preds\n return np.mean(np.where(e > 0, self.alpha * e, (self.alpha - 1) * e))\n \n def negative_gradient(self, y, preds):\n e = y - preds \n return np.where(e > 0, self.alpha, self.alpha - 1)\n\ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, QuantileLoss(alpha=0.9))\npred = gbm.predict(x) \n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingRegressor(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='quantile', alpha=0.9)\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict(x)\n\n\nprint_model_loss_scores(QuantileLoss(alpha=0.9), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.1853\nScikit-Learn Loss = 0.1856\n\n\n\n\n\n\n\n\n\nBinary Cross Entropy Loss\nThe previous losses are useful for regression problems, where the target is numeric. But we can also solve classification problems, simply by swapping in an appropriate loss function. Here we’ll implement binary cross entropy, a.k.a. binary deviance, a.k.a. negative binomial log likelihood (sometimes abusively called log loss). One thing to remember is that, as with logistic regression, our model is actually predicting the log odds ratio, not the probability of the positive class. Thus we use expit transformations (the inverse of logit) whenever probabilities are needed, e.g., when predicting the probability that an observation belongs to the positive class.\n\n# make categorical test data\n\ndef expit(t):\n return np.exp(t) / (1 + np.exp(t))\n\nx = np.linspace(-3, 3, 500)\np = expit(x)\ny = rng.binomial(1, p, size=p.shape)\nx = x.reshape(-1,1)\n\n\n# from scratch GBM\nclass BinaryCrossEntropyLoss():\n '''Binary Cross Entropy Loss\n \n Note that the predictions should be log odds ratios.\n '''\n \n def __init__(self):\n self.expit = lambda t: np.exp(t) / (1 + np.exp(t))\n \n def loss(self, y, preds):\n p = self.expit(preds)\n return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))\n \n def negative_gradient(self, y, preds):\n p = self.expit(preds)\n return y / p - (1 - y) / (1 - p)\n\n \ngbm = GradientBoostingMachine(n_trees=10,\n learning_rate=0.5,\n max_depth=1)\ngbm.fit(x, y, BinaryCrossEntropyLoss())\npred = expit(gbm.predict(x))\n\n\n# scikit-learn GBM\nsk_gbm = GradientBoostingClassifier(n_estimators=10,\n learning_rate=0.5,\n max_depth=1,\n loss='log_loss')\nsk_gbm.fit(x, y)\nsk_pred = sk_gbm.predict_proba(x)[:, 1]\n\n\nprint_model_loss_scores(BinaryCrossEntropyLoss(), y, pred, sk_pred)\n\nFrom Scratch Loss = 0.6379\nScikit-Learn Loss = 0.6403" + "objectID": "posts/xgboost-from-scratch/index.html#testing", + "href": "posts/xgboost-from-scratch/index.html#testing", + "title": "XGBoost from Scratch", + "section": "Testing", + "text": "Testing\nLet’s take this baby for a spin and benchmark its performance against the actual XGBoost library. 
We use the scikit learn California housing dataset for benchmarking.\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.model_selection import train_test_split\n \nX, y = fetch_california_housing(as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, \n random_state=43)\n\nLet’s start with a nice friendly squared error objective function for training. We should probably have a future post all about how to define custom objective functions in XGBoost, but for now, here’s how I define squared error.\n\nclass SquaredErrorObjective():\n def loss(self, y, pred): return np.mean((y - pred)**2)\n def gradient(self, y, pred): return pred - y\n def hessian(self, y, pred): return np.ones(len(y))\n\nHere I use a more or less arbitrary set of hyperparameters for training. Feel free to play around with tuning and trying other parameter combinations yourself.\n\nimport xgboost as xgb\n\nparams = {\n 'learning_rate': 0.1,\n 'max_depth': 5,\n 'subsample': 0.8,\n 'reg_lambda': 1.5,\n 'gamma': 0.0,\n 'min_child_weight': 25,\n 'base_score': 0.0,\n 'tree_method': 'exact',\n}\nnum_boost_round = 50\n\n# train the from-scratch XGBoost model\nmodel_scratch = XGBoostModel(params, random_seed=42)\nmodel_scratch.fit(X_train, y_train, SquaredErrorObjective(), num_boost_round)\n\n# train the library XGBoost model\ndtrain = xgb.DMatrix(X_train, label=y_train)\ndtest = xgb.DMatrix(X_test, label=y_test)\nmodel_xgb = xgb.train(params, dtrain, num_boost_round)\n\nLet’s check the models’ performance on the held out test data to benchmark our implementation.\n\npred_scratch = model_scratch.predict(X_test)\npred_xgb = model_xgb.predict(dtest)\nprint(f'scratch score: {SquaredErrorObjective().loss(y_test, pred_scratch)}')\nprint(f'xgboost score: {SquaredErrorObjective().loss(y_test, pred_xgb)}')\n\nscratch score: 0.2434125759558149\nxgboost score: 0.24123239765807963\n\n\nWell, look at that! Our scratch-built SGBoost is looking pretty consistent with the library. Go us!" }, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#wrapping-up", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#wrapping-up", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", + "objectID": "posts/xgboost-from-scratch/index.html#wrapping-up", + "href": "posts/xgboost-from-scratch/index.html#wrapping-up", + "title": "XGBoost from Scratch", "section": "Wrapping Up", - "text": "Wrapping Up\nWoohoo! We did it! We finally made it through Friedman’s paper in its entirety, and we implemented the generic gradient boosting algorithm which works with any differentiable loss function. If you made it this far, great job, gold star! By now you hopefully have a pretty solid grasp on gradient boosting, which is good, because soon we’re going to dive into the modern Newton descent gradient boosting frameworks like XGBoost. Onward!" + "text": "Wrapping Up\nI’d say this is a pretty good milestone for us here at Random Realizations. We’ve been hammering away at the various concepts around gradient boosting, leaving a trail of equations and scratch-built algos in our wake. Today we put all of that together to create a legit scratch build of XGBoost, something that would have been out of reach for me before we embarked on this journey together over a year ago. To anyone with the patience to read through this stuff, cheers to you! I hope you’re learning and enjoying this as much as I am." 
}, { - "objectID": "posts/gradient-boosting-machine-with-any-loss-function/index.html#references", - "href": "posts/gradient-boosting-machine-with-any-loss-function/index.html#references", - "title": "How to Implement a Gradient Boosting Machine that Works with Any Loss Function", - "section": "References", - "text": "References\nFriedman’s 2001 paper: Greedy Function Approximation: A Gradient Boosting Machine" + "objectID": "posts/xgboost-from-scratch/index.html#reader-exercises", + "href": "posts/xgboost-from-scratch/index.html#reader-exercises", + "title": "XGBoost from Scratch", + "section": "Reader Exercises", + "text": "Reader Exercises\nIf you want to take this a step further and deepen your understanding and coding abilities, let me recommend some exercises for you.\n\nImplement column subsampling. XGBoost itself provides column subsampling by tree, by level, and by node. Try implementing by tree first, then try adding by level or by node as well. These should be pretty straightforward to do.\nImplement sparsity aware split finding for missing feature values (Algorithm 2 in the XGBoost paper). This will be a little more involved, since you’ll need to refactor and modify several parts of the tree booster class." }, { - "objectID": "posts/hello-world/index.html", - "href": "posts/hello-world/index.html", - "title": "Hello World! And Why I’m Inspired to Start a Blog", + "objectID": "posts/consider-the-decision-tree/index.html", + "href": "posts/consider-the-decision-tree/index.html", + "title": "Consider the Decision Tree", "section": "", - "text": "Matt raises his arms in joy at the world.!\nWell, I’ve been thinking about getting this blog started for months now. I guess a combination of inertia, up-front investment in blogging platform selection/setup, and spending a little too much time writing and rewriting the first content post has drawn out the period from initial inspiration to making the blog a reality. Needless to say, I’m pretty excited to finally get things going.\nBefore we dive headlong into the weeds of ML algorithms, statistical methods, and whatever I happen to be learning and teaching at the moment, I figured it would be good to articulate why I’ve felt inspired to get started blogging in the first place. Hopefully this will serve the dual purpose of clarifying my intentions and introducing a vastly underappreciated concept in data science that I hope to weave through the posts to come." + "text": "A California cypress tree abides in silence on Alameda Beach.\nAh, the decision tree. It’s an underrated and often overlooked hero of modern statistical learning. Trees aren’t particularly powerful learning algorithms on their own, but when utilized as building blocks in larger ensemble models like random forest and gradient boosted trees, they can achieve state of the art performance in many practical applications. Since we’ve been focusing on gradient boosting ensembles lately, let’s take a moment to consider the humble decision tree itself. This post gives a high-level intuition for how trees work, an opinionated list of their key strengths and weaknesses, and some perspective on why ensembling makes them truly shine.\nOnward!" }, { - "objectID": "posts/hello-world/index.html#learning", - "href": "posts/hello-world/index.html#learning", - "title": "Hello World! 
And Why I’m Inspired to Start a Blog", - "section": "Learning", - "text": "Learning\nThe initial inception about blogging probably originated from some comments about learning that Jeremy Howard makes in the Practical Deep Learning course from fastai. During one of the lectures, he mentions that it’s a great idea to start blogging. To paraphrase Jeremy:\n\nThe thing I really love about blogging is that it helps you learn; by writing things down, you synthesize your ideas.\n\nBeautiful. That definitely rings true for me. I tend to take notes and play around with code when learning new concepts anyway. One of my key hypotheses about this blogging experiment is that making the effort to transform those notes into blog posts will help me learn more effectively." + "objectID": "posts/consider-the-decision-tree/index.html#classification-and-regression-trees", + "href": "posts/consider-the-decision-tree/index.html#classification-and-regression-trees", + "title": "Consider the Decision Tree", + "section": "Classification and Regression Trees", + "text": "Classification and Regression Trees\nA Decision tree is a type of statistical model that takes features or covariates as input and yields a prediction as output. The idea of the decision tree as a statistical learning tool traces back to a monograph published in 1984 by Breiman, Freidman, Olshen, and Stone called “Classification and Regression Trees” (a.k.a. CART). As the name suggests, trees come in two main varieties: classification trees which predict discrete class labels (e.g. DecisionTreeClassifier) and regression trees which predict numeric values (e.g. DecisionTreeRegressor).\nAs I mentioned earlier, tree models are not very powerful learners on their own. You might find that an individual tree model is useful for creating a simple and highly interpretable model in specific situations, but in general, trees tend to shine most as building blocks in more complex algorithms. These composite models are called ensembles, and the most important tree ensembles are random forest and gradient boosted trees. While random forest uses either regression or classification trees depending on the type of target, gradient boosting can use regression trees to solve both classification and regression tasks." }, { - "objectID": "posts/hello-world/index.html#teaching", - "href": "posts/hello-world/index.html#teaching", - "title": "Hello World! And Why I’m Inspired to Start a Blog", - "section": "Teaching", - "text": "Teaching\nAh, teaching. Yes, sometimes it’s that thing that takes time away from your research, forcing you to sit alone in a windowless room squinting at hand-written math on a fat stack of homework assignments. But sometimes it actually involves interacting with students, endeavoring to explain a concept, and watching them light up when they get it. The latter manifestation of teaching was one of my favorite things about grad school and academia in general. While I certainly still get to do some teaching as an industry data scientist, I could see myself returning to a more teaching-centric gig somewhere off in the future. Thus we have our second key hypothesis about the blogging experiment, that the writing will entertain my inclination to teach." 
+ "objectID": "posts/consider-the-decision-tree/index.html#regression-tree-in-action", + "href": "posts/consider-the-decision-tree/index.html#regression-tree-in-action", + "title": "Consider the Decision Tree", + "section": "Regression Tree in Action", + "text": "Regression Tree in Action\nLet’s have a closer look at regression trees by training one on the diabetes dataset from scikit learn. According to the documentation:\n\nTen baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.\n\nFirst we load the data. To make our lives easier, we’ll just use two features: average blood pressure (bp) and the first blood serum measurement (s1) to predict the target. I’ll rescale the features to make the values easier for me to read, but it won’t affect our tree–more on that later.\n\nimport numpy as np \nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\ncolor_palette = \"viridis\"\n\n\nfrom sklearn.datasets import load_diabetes\n\nX, y = load_diabetes(as_frame=True, return_X_y=True)\n\nX = 100 * X[['bp', 's1']]\n\n\n\n\n\n\nLet’s grow a tree to predict the target given values of blood pressure and blood serum.\n\nfrom sklearn.tree import DecisionTreeRegressor\n\ntree = DecisionTreeRegressor(max_depth=2)\ntree.fit(X,y);\n\n\n\n\n\n\nTo make predictions using our fitted tree, we start at the root node (which is at the top), and we work our way down moving left if our feature is less than the split threshold and to the right if it’s greater than the split threshold. For example let’s predict the target for a new case with bp= 1 and s1 = 5. Since our blood pressure of 1 is less than 2.359, we move to the left child node. Here, since our serum of 5 is greater than the threshold at 0.875, we move to the right child node. This node has no further children, and thus we return its predicted value of 155.343.\n\ntree.predict(pd.DataFrame({'bp': 1, 's1': 5}, index=[0]))\n\narray([155.34313725])\n\n\nLet’s overlay these splits on our feature scatterplot to see how the tree has partitioned the feature space.\n\n\n\n\n\nThe tree has managed to carve out regions of feature space where the target values tend to be similar within each region, e.g. we have low target values in the bottom left partition and high target values in the far right region.\nLet’s take a look at the regression surface predicted by our tree. Since the tree predicts the exact same value for all instances in a given partition, the surface has only four distinct values.\n\n\n\n\n\nFabulous, now that we’ve seen a tree in action, let’s talk about trees’ key strengths and weaknesses." }, { - "objectID": "posts/hello-world/index.html#contributing", - "href": "posts/hello-world/index.html#contributing", - "title": "Hello World! And Why I’m Inspired to Start a Blog", - "section": "Contributing", - "text": "Contributing\nWorking in the field of data science today is a bit like standing in front of a massive complimentary all-you-can-learn buffet. There is an abundance of free material out on the interwebs for learning pretty much anything in data science from hello world python tutorials to research papers on cutting-edge deep learning techniques. I’ve personally benefited from many a blog post that helped me unpack a new concept or get started using a new tool. 
And let’s not forget the gigantic cyber warehouse full of freely available open source software tools that volunteer developers have straight-up donated to humanity.\nI realize that up to now, I’ve simply been consuming all of this free goodness without giving anything substantive back in return. Well then, it’s time to start evening the score. Which brings us to key hypothesis number three, that through these blog posts, I might be able to create something helpful, thereby being of service to a community that has freely given so much to me." + "objectID": "posts/consider-the-decision-tree/index.html#why-trees-are-awesome", + "href": "posts/consider-the-decision-tree/index.html#why-trees-are-awesome", + "title": "Consider the Decision Tree", + "section": "Why trees are awesome", + "text": "Why trees are awesome\nTrees are awesome because they are easy to use, and trees are easy to use because they are robust, require minimal data preprocessing, and can learn complex relationships without user intervention.\n\nFeature Scaling\nTrees owe their minimal data preprocessing requirements and their robustness to the fact that split finding is controlled by the sort order of the input feature values, rather than the values themselves. This means that trees are invariant to the scaling of input features, which in turn means that we don’t need to fuss around with carefully rescaling all the numeric features before fitting a tree. It also means that trees tend to work well even if features are highly skewed or contain outliers.\n\n\nCategoricals\nSince trees just split data based on numeric feature values, we can easily handle most categorical features by using integer encoding. For example we might encode a size feature with small = 1, medium = 2, and large = 3. This works particularly well with ordered categories, because partitioning is consistent with the category semantics. It can also work well even if the categories have no order, because with enough splits a tree can carve each category into its own partition.\n\n\nMissing Values\nIt’s worth calling out that different implementations of the decision tree handle missing feature values in different ways. Notably, scikit-learn handles them by throwing an error and telling you not to pull such shenanigans.\nValueError: Input contains NaN, infinity or a value too large for dtype('float32').\nOn the other hand, XGBoost supports an elegant way to make use of missing values, which we will discuss more in a later post.\n\n\nInteractions\nFeature interactions can also be learned automatically. An interaction means that the effect of one feature on the target differs depending on the value of another feature. For example, the effect of some drug may depend on whether or not the patient exercises. After a tree splits on exercise, it can naturally learn the correct drug effects for both exercisers and non-exercisers. This intuition extends to higher-order interactions as well, as long as the tree has enough splits to parse the relationships.\n\n\nFeature Selection\nBecause trees choose the best feature and threshold value at each split, they essentially perform automatic feature selection. This is great because even if we throw a lot of irrelevant features at a decision tree, it will simply tend not to use them for splits. 
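The scale-invariance claim above is easy to check directly: fit the same tree on raw and rescaled copies of the features and compare predictions. A small sketch assuming the same diabetes features used earlier in the post (the factor of 100 is arbitrary):

# Illustrative sketch: rescaling features does not change a tree's predictions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(as_frame=True, return_X_y=True)
X = X[['bp', 's1']]

tree_raw = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=2, random_state=0).fit(100 * X, y)

# The split thresholds differ by the factor of 100, but the partitions,
# and therefore the predictions, are the same; this should print True.
print(np.allclose(tree_raw.predict(X), tree_scaled.predict(100 * X)))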
Similarly, if two or more features are highly correlated or even redundant, the tree will simply choose one or the other when making each split; having both in the model will not cause catastrophic instability as it could in a linear model.\n\n\nFeature-Target Relationship\nFinally, it is possible for trees to discover complex nonlinear feature-target relationships without the need for user-specification of the relationships. This is because trees use local piecewise constant approximations without making any parametric assumptions. With enough splits, the tree can approximate arbitrary feature-target relationships." }, { - "objectID": "posts/hello-world/index.html#live-long-and-prosper-blog", - "href": "posts/hello-world/index.html#live-long-and-prosper-blog", - "title": "Hello World! And Why I’m Inspired to Start a Blog", - "section": "Live Long and Prosper, Blog", - "text": "Live Long and Prosper, Blog\nPhew, there it is, the original source of inspiration for this blogging experiment, and three reasons I think it might be a good idea. The astute reader will have noticed that these three assertions have been formulated as hypotheses which are to be tested in the laboratory of experience. And thus, we also have our first glimpse of the scientific method, an underrated concept that is going to help us put the science back in data science.\nWith that, blog, I christen thee, Random Realizations." + "objectID": "posts/consider-the-decision-tree/index.html#why-trees-are-not-so-awesome", + "href": "posts/consider-the-decision-tree/index.html#why-trees-are-not-so-awesome", + "title": "Consider the Decision Tree", + "section": "Why trees are not so awesome", + "text": "Why trees are not so awesome\nThe main weakness of the decision tree is that, on its own, it tends to have poor predictive performance compared to other algorithms. The main reasons for this are the tendency to overfit and prediction quantization issues.\n\nOverfitting\nIf we grow a decision tree until each leaf has exactly one instance in it, we will have simply memorized the training data, and our model will not generalize well. Basically the only defense against overfitting is to reduce the number of leaf nodes in the tree, either by using hyperparameters to stop splitting earlier or by removing certain leaf nodes after growing a deep tree. The problem here is that some of the benefits of trees, like ability to approximate arbitrary target patterns and ability to learn interaction effects, depend on having enough splits for the task. We can sometimes find ourselves in a situation where we cannot learn these complex relationships without overfitting the tree.\n\nQuantization\nBecause regression trees use piecewise constant functions to approximate the target, prediction accuracy can deteriorate near split boundaries. For example, if the target is increasing with the feature, a tree might tend to underpredict the target on the left side of split boundaries and overpredict on the right side of split boundaries.\n\n\n\n\n\n\n\nExtrapolation\nBecause they are trained by partitioning the feature space in a training dataset, trees cannot intelligently extrapolate beyond the data on which they are trained. For example if we query a tree for predictions beyond the greatest feature value encountered in training, it will just return the prediction corresponding to the largest in-sample feature values.\n\n\n\n\n\n\n\nThe Dark Side of Convenience\nFinally, there is always a price to pay for convenience. 
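The extrapolation limitation described above can be seen with a contrived example (the data below is invented for illustration): beyond the largest feature value seen in training, every query returns the right-most leaf's prediction.

# Illustrative sketch: constant predictions outside the training range.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_train = np.linspace(0, 10, 200).reshape(-1, 1)  # toy feature on [0, 10]
y_train = 3.0 * x_train.ravel()                   # target keeps increasing

tree = DecisionTreeRegressor(max_depth=4).fit(x_train, y_train)

print(tree.predict([[5.0]]))                             # sensible inside the range
print(tree.predict([[12.0]]), tree.predict([[100.0]]))   # identical beyond it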
While trees can work well even with a messy dataset containing outliers, redundant features, and thoughtlessly encoded categoricals, we will rarely achieve the best performance under these conditions. Taking the time to deal with outliers, removing redundant information, purposefully choosing appropriate categorical encodings, and building an understanding of the data will often lead to much better results." + }, + { + "objectID": "posts/consider-the-decision-tree/index.html#how-ensembling-makes-trees-shine", + "href": "posts/consider-the-decision-tree/index.html#how-ensembling-makes-trees-shine", + "title": "Consider the Decision Tree", + "section": "How ensembling makes trees shine", + "text": "How ensembling makes trees shine\nWe can go a long way toward addressing the issues of overfitting and prediction quantization by using trees as building blocks in larger algorithms called tree ensembles, the most popular examples being random forest and gradient boosted trees. A tree ensemble is a collection of different individual tree models whose predictions are averaged to generate an overall prediction.\nEnsembling helps address overfitting because even if each individual tree is overfitted, the average of their individual noisy predictions will tend to be more stable. Think of it in terms of the bias variance tradeoff, where bias refers to a model’s failure to capture certain patterns and variance refers to how different a model prediction would be if the model were trained on a different sample of training data. Since the ensemble is averaging over the predictions of all the individual models, training it on a different sample of training data would change the individual models predictions, but their overall average prediction will tend to remain stable. Thus, ensembling helps reduce the effects of overfitting by reducing model variance without increasing bias.\nEnsembling also helps address prediction quantization issues. While each individual tree’s predictions might express large jumps in the regression surface, averaging many different trees’ predictions together effectively generates a surface with more partitions and smaller jumps between them. This provides a smoother approximation of the feature-target relationship." + }, + { + "objectID": "posts/consider-the-decision-tree/index.html#wrapping-up", + "href": "posts/consider-the-decision-tree/index.html#wrapping-up", + "title": "Consider the Decision Tree", + "section": "Wrapping Up", + "text": "Wrapping Up\nWell, there you go, that’s my take on the high-level overview of the decision tree and its main strengths and weaknesses. As we’ve seen, ensembling allows us to keep the conveniences of the decision tree while mitigating its core weakness of relatively weak predictive power. This is why tree ensembles are so popular in practical applications. We glossed over pretty much all details of how trees actually do their magic, but fear not, next time we’re going to get rowdy and build one of these things from scratch." 
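A bare-bones sketch of the averaging idea described in the ensembling entry above: fit many deep trees on bootstrap resamples and average their predictions (a hand-rolled bagging loop for illustration only, not the post's code and not a replacement for RandomForestRegressor):

# Illustrative sketch: averaging bootstrap-trained trees stabilizes predictions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Each unpruned tree is badly overfit on its own; the ensemble prediction
# is the average over all trees, which varies much less between resamples.
ensemble_pred = np.mean([tree.predict(X) for tree in trees], axis=0)
print(ensemble_pred[:5])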
}, { "objectID": "posts/blogging-with-quarto-and-jupyter/index.html", @@ -648,7 +697,7 @@ "href": "archive.html", "title": "Archive", "section": "", - "text": "Blogging with Quarto and Jupyter: The Complete Guide\n\n\n\n\n\n\n\n\n\nSep 6, 2023\n\n\n\n\n\n\n\n\nRandom Realizations Resurrected\n\n\n\n\n\n\n\n\n\nAug 2, 2023\n\n\n\n\n\n\n\n\nXGBoost from Scratch\n\n\n\n\n\n\n\n\n\nMay 7, 2022\n\n\n\n\n\n\n\n\nXGBoost Explained\n\n\n\n\n\n\n\n\n\nMar 13, 2022\n\n\n\n\n\n\n\n\nDecision Tree from Scratch\n\n\n\n\n\n\n\n\n\nDec 13, 2021\n\n\n\n\n\n\n\n\nConsider the Decision Tree\n\n\n\n\n\n\n\n\n\nDec 12, 2021\n\n\n\n\n\n\n\n\nHow to Implement a Gradient Boosting Machine that Works with Any Loss Function\n\n\n\n\n\n\n\n\n\nOct 23, 2021\n\n\n\n\n\n\n\n\nHello PySpark!\n\n\n\n\n\n\n\n\n\nJun 22, 2021\n\n\n\n\n\n\n\n\nHow Gradient Boosting Does Gradient Descent\n\n\n\n\n\n\n\n\n\nApr 27, 2021\n\n\n\n\n\n\n\n\nGet Down with Gradient Descent\n\n\n\n\n\n\n\n\n\nJan 22, 2021\n\n\n\n\n\n\n\n\nHow to Build a Gradient Boosting Machine from Scratch\n\n\n\n\n\n\n\n\n\nDec 8, 2020\n\n\n\n\n\n\n\n\nThe 80/20 Pandas Tutorial\n\n\n\n\n\n\n\n\n\nNov 25, 2020\n\n\n\n\n\n\n\n\nHello World! And Why I’m Inspired to Start a Blog\n\n\n\n\n\n\n\n\n\nNov 22, 2020\n\n\n\n\n\n\nNo matching items" + "text": "XGBoost for Regression in Python\n\n\n\n\n\n\n\n\n\nOct 25, 2023\n\n\n\n\n\n\n\n\nBlogging with Quarto and Jupyter: The Complete Guide\n\n\n\n\n\n\n\n\n\nSep 6, 2023\n\n\n\n\n\n\n\n\nRandom Realizations Resurrected\n\n\n\n\n\n\n\n\n\nAug 2, 2023\n\n\n\n\n\n\n\n\nXGBoost from Scratch\n\n\n\n\n\n\n\n\n\nMay 7, 2022\n\n\n\n\n\n\n\n\nXGBoost Explained\n\n\n\n\n\n\n\n\n\nMar 13, 2022\n\n\n\n\n\n\n\n\nDecision Tree from Scratch\n\n\n\n\n\n\n\n\n\nDec 13, 2021\n\n\n\n\n\n\n\n\nConsider the Decision Tree\n\n\n\n\n\n\n\n\n\nDec 12, 2021\n\n\n\n\n\n\n\n\nHow to Implement a Gradient Boosting Machine that Works with Any Loss Function\n\n\n\n\n\n\n\n\n\nOct 23, 2021\n\n\n\n\n\n\n\n\nHello PySpark!\n\n\n\n\n\n\n\n\n\nJun 22, 2021\n\n\n\n\n\n\n\n\nHow Gradient Boosting Does Gradient Descent\n\n\n\n\n\n\n\n\n\nApr 27, 2021\n\n\n\n\n\n\n\n\nGet Down with Gradient Descent\n\n\n\n\n\n\n\n\n\nJan 22, 2021\n\n\n\n\n\n\n\n\nHow to Build a Gradient Boosting Machine from Scratch\n\n\n\n\n\n\n\n\n\nDec 8, 2020\n\n\n\n\n\n\n\n\nThe 80/20 Pandas Tutorial\n\n\n\n\n\n\n\n\n\nNov 25, 2020\n\n\n\n\n\n\n\n\nHello World! 
And Why I’m Inspired to Start a Blog\n\n\n\n\n\n\n\n\n\nNov 22, 2020\n\n\n\n\n\n\nNo matching items" }, { "objectID": "about.html", diff --git a/sitemap.xml b/sitemap.xml index a4f7fa9..d1c9292 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,70 +2,74 @@ https://randomrealizations.com/gradient-boosting-series.html - 2023-09-06T08:42:25.920Z + 2023-09-18T12:36:28.632Z https://randomrealizations.com/ - 2023-09-06T08:42:24.970Z + 2023-09-18T12:36:27.918Z https://randomrealizations.com/posts/xgboost-explained/ - 2023-09-06T08:42:23.748Z + 2023-09-18T12:36:26.685Z https://randomrealizations.com/posts/random-realizations-resurrected/ - 2023-09-06T08:42:22.712Z + 2023-09-18T12:36:25.726Z https://randomrealizations.com/posts/decision-tree-from-scratch/ - 2023-09-06T08:42:21.791Z + 2023-09-18T12:36:24.877Z - https://randomrealizations.com/posts/consider-the-decision-tree/ - 2023-09-06T08:42:20.407Z + https://randomrealizations.com/posts/xgboost-for-regression-in-python/ + 2023-09-18T12:36:23.472Z - https://randomrealizations.com/posts/xgboost-from-scratch/ - 2023-09-06T08:42:19.625Z + https://randomrealizations.com/posts/hello-world/ + 2023-09-18T12:36:22.012Z - https://randomrealizations.com/posts/hello-pyspark/ - 2023-09-06T08:42:17.993Z + https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/ + 2023-09-18T12:36:21.268Z + + + https://randomrealizations.com/posts/get-down-with-gradient-descent/ + 2023-09-18T12:36:20.524Z https://randomrealizations.com/posts/8020-pandas-tutorial/ - 2023-09-06T08:42:17.339Z + 2023-09-18T12:36:19.440Z - https://randomrealizations.com/posts/get-down-with-gradient-descent/ - 2023-09-06T08:42:18.457Z + https://randomrealizations.com/posts/hello-pyspark/ + 2023-09-18T12:36:20.069Z - https://randomrealizations.com/posts/gradient-boosting-machine-with-any-loss-function/ - 2023-09-06T08:42:19.194Z + https://randomrealizations.com/posts/xgboost-from-scratch/ + 2023-09-18T12:36:21.712Z - https://randomrealizations.com/posts/hello-world/ - 2023-09-06T08:42:19.959Z + https://randomrealizations.com/posts/consider-the-decision-tree/ + 2023-09-18T12:36:22.520Z https://randomrealizations.com/posts/blogging-with-quarto-and-jupyter/ - 2023-09-06T08:42:21.254Z + 2023-09-18T12:36:24.343Z https://randomrealizations.com/posts/how-gradient-boosting-does-gradient-descent/ - 2023-09-06T08:42:22.425Z + 2023-09-18T12:36:25.447Z https://randomrealizations.com/posts/gradient-boosting-machine-from-scratch/ - 2023-09-06T08:42:23.170Z + 2023-09-18T12:36:26.164Z https://randomrealizations.com/archive.html - 2023-09-06T08:42:24.220Z + 2023-09-18T12:36:27.165Z https://randomrealizations.com/about.html - 2023-09-06T08:42:25.195Z + 2023-09-18T12:36:28.132Z