Numeric predictors draft, part 1 #9

Merged
merged 13 commits into from
Feb 10, 2024
15 changes: 15 additions & 0 deletions _freeze/chapters/numeric-predictors/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "b546b97ff8f9d0477978fae463702934",
"result": {
"engine": "knitr",
"markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter mostly focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nThe next few chapters concern preprocessing and feature engineering tools that mostly affect the predictors. As previously noted, the training set data are used to estimate parameters; this is also true for preprocessing parameters. All of these computations use the training set. At no point do we re-estimate parameters when new data are encountered. \n\nFor example, a standardization tool that centers and scales the data is introduced in the next section. The mean and standard deviation are computed from the training set for each column being standardized. When the training set, test set, or any future data are standardized, it uses these statistics derived from the training set. Any model fit that uses these standardized predictors would want new samples being predicted to have the same reference distribution. \n\nSuppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (since there is a single value and we would divide by zero). \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTo start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. 
\n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\begin{align}\n \text{skewness} &= \frac{1}{(n-1)v^{3/2}} \sum_{i=1}^n (x_i-\overline{x})^3 \notag \\\n \text{where}\quad v &= \frac{1}{(n-1)}\sum_{i=1}^n (x_i-\overline{x})^2 \notag\n\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond to a right skew, and negative values to a left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are two samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier.\" The Cambridge Dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as the logarithm or the square root, the latter being a better choice when the skewness is not drastic and the data contain zeros. A simple visualization of the data can be enough to make this choice. The problem arises when there are many numeric predictors; it may be inefficient to visually inspect each predictor and make a subjective judgment on what, if any, transformation function to apply. 
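For reference, the skewness statistic above can be computed directly in base R. This is a small illustrative function, not code from the book; the simulated log-normal sample is only there to produce a clearly positive value.

```r
# Sample skewness as defined above: the third central moment scaled by
# v^(3/2), both computed with an (n - 1) denominator.
skewness <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  v <- sum((x - x_bar)^2) / (n - 1)
  sum((x - x_bar)^3) / ((n - 1) * v^(3 / 2))
}

# A right-skewed example: values drawn from a log-normal distribution
set.seed(1)
skewness(rlnorm(1000))   # positive, indicating right skew
```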
\n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\lambda = 1.0$\n- square ($x^2$) via $\lambda = 2.0$\n- logarithmic ($\log{x}$) via $\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\sqrt{x}$) via $\lambda = 0.5$\n- inverse square root ($1/\sqrt{x}$) via $\lambda = -0.5$\n- inverse ($1/x$) via $\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. The transformed version of the variable is:\n\n$$\nx^* =\n\begin{cases} \lambda^{-1}(x^\lambda-1) & \text{if $\lambda \ne 0$,}\n\\[3pt]\n\log(x) &\text{if $\lambda = 0$.}\n\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sum of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\begin{cases}\n\lambda^{-1}\left[(x + 1)^\lambda-1\right] & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\[3pt]\n\log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\[3pt]\n-(2 - \lambda)^{-1}\left[(-x + 1)^{2 - \lambda}-1\right] & \text{if $\lambda \ne 2$ and $x < 0$,} \\[3pt]\n-\log(-x + 1) &\text{if $\lambda = 2$ and $x < 0$.} \n\end{cases}\n$$\n\nIn either case, the $\lambda$ parameter is estimated via maximum likelihood. \n\nIn practice, these two transformations should probably be limited to predictors with acceptable density. For example, the transformation may not be appropriate for a predictor with only a few unique values. A threshold of five or so unique values might be a reasonable rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to constrain the estimates to a suitable range. Also, the estimate will never be exactly zero. Implementations usually apply a log transformation when $\hat{\lambda}$ is within some range of zero (say between $\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\hat{\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\hat{\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\n\nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\lambda$. 
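The Yeo-Johnson formula is straightforward to apply once $\lambda$ is known. The sketch below is a minimal, self-contained implementation for a fixed $\lambda$; estimating $\lambda$ by maximum likelihood is a separate step handled by existing R packages (e.g., recipes or bestNormalize). The value 0.15 plugged in here is the lot-area estimate mentioned above, and the `eps` tolerance is an illustrative choice mirroring the note about treating $\hat{\lambda}$ values within a small range of the boundary as the limiting cases.

```r
# Yeo-Johnson transformation for a fixed lambda, following the piecewise
# definition above. Lambda is assumed to have been estimated elsewhere.
yeo_johnson <- function(x, lambda, eps = 0.01) {
  pos <- x >= 0
  out <- numeric(length(x))
  if (abs(lambda) < eps) {
    out[pos] <- log(x[pos] + 1)                              # lambda ~ 0 branch
  } else {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  }
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log(-x[!pos] + 1)                          # lambda ~ 2 branch
  } else {
    out[!pos] <- -((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

yeo_johnson(c(-5, 0, 10, 1000), lambda = 0.15)
```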
\n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Modulus | $$x^* = \begin{cases} sign(x)\lambda^{-1}\left[(|x|+1)^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt]\nsign(x) \log{(|x|+1)} &\text{if $\lambda = 0$}\n\end{cases}$$ | @john1980alternative |\n| Bickel-Doksum | $$x^* = \lambda^{-1}\left[sign(x)|x|^\lambda - 1\right]\quad\text{if $\lambda \neq 0$}$$ | @bickel1981analysis |\n| Glog / Gpower | $$x^* = \begin{cases} \lambda^{-1}\left[({x+ \sqrt{x^2+1}})^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt]\n\log({x+ \sqrt{x^2+1}}) &\text{if $\lambda = 0$}\n\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Neglog | $$x^* = sign(x) \log{(|x|+1)}$$ | @whittaker2005neglog |\n| Dual | $$x^* = (2\lambda)^{-1}\left[x^\lambda - x^{-\lambda}\right]\quad\text{if $\lambda \neq 0$}$$ | @yang2006modified |\n\nExamples of other families of transformations for dense numeric predictors. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This result can be seen in the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data that emulates the true normalizing function, where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution were Gaussian with a mean of zero and a standard deviation of one.\n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values are large. 
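The percentile transformation can be sketched with the empirical cumulative distribution function in base R. The `lot_area_train` values below are made up for illustration; the important behavior is that the function is estimated from the training set and that new values outside the training range automatically map to zero or one. (The ORQ method itself is implemented in R packages such as bestNormalize.)

```r
# Percentile transformation estimated from the training set via the
# empirical CDF. New values beyond the training range map to 0 or 1.
lot_area_train <- c(4500, 7200, 8450, 9600, 11250, 14260, 21000, 215000)
percentile_fn <- ecdf(lot_area_train)

percentile_fn(lot_area_train)     # roughly uniform on [0, 1]
percentile_fn(c(1000, 500000))    # outside the training range -> 0 and 1
```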
See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two other standardization methods are more commonly used. \n\nThe first is centering and scaling (as previously mentioned). To convert to a common scale, the mean ($\bar{x}$) and standard deviation ($\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \bar{x}) / \hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what should be done with these binary features? They should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \frac{x - \min(x)}{\max(x) - \min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living area predictor is range transformed. Notice that the shapes of the distributions across panels (a), (b), and (c) are the same; only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$-dimensional unit hypersphere. 
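A short R sketch of range standardization, using a hypothetical `year_built_train` vector: the minimum and maximum come from the training set, and the `clip` argument implements the optional truncation of new values described above.

```r
# Range standardization with training-set min/max and optional clipping
range_scale <- function(x, train_min, train_max, clip = TRUE) {
  out <- (x - train_min) / (train_max - train_min)
  if (clip) out <- pmin(pmax(out, 0), 1)
  out
}

year_built_train <- c(1872, 1950, 1998, 2010)
rng <- range(year_built_train)

range_scale(year_built_train, rng[1], rng[2])
range_scale(c(1860, 2015), rng[1], rng[2])   # new data outside the range are clipped
```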
This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\frac{x_{ij}}{\sqrt{\sum^{p}_{j=1} x_{ij}^2}}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in Lot Area, Gross Living Area, or both. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of @fig-ames-lot-living-area shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n",
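A minimal sketch of the spatial sign calculation, assuming the predictors have already been centered and scaled: each row is divided by its Euclidean norm, so every sample lands on the unit circle/sphere. The data here are simulated, not the Ames predictors.

```r
# Spatial sign: divide each row of a matrix of standardized predictors by
# its Euclidean norm, projecting the samples onto the unit (hyper)sphere.
spatial_sign <- function(x) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))
  sweep(x, 1, norms, "/")
}

# Two standardized predictors with one extreme sample
set.seed(2)
dat <- scale(cbind(lot_area = c(rnorm(20), 10), living_area = c(rnorm(20), 8)))
transformed <- spatial_sign(dat)
rowSums(transformed^2)   # every sample now has unit length
```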
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
1 change: 1 addition & 0 deletions _quarto.yml
Original file line number Diff line number Diff line change
@@ -53,6 +53,7 @@ book:
- chapters/whole-game.qmd
- part: "Preparation"
- chapters/initial-data-splitting.qmd
- chapters/numeric-predictors.qmd
- part: "Optimization"
- part: "Classification"
- part: "Regression"