From 20d0113d8137562041227f59bf8a09cfc8c7462e Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Thu, 30 Nov 2023 18:31:02 -0500 Subject: [PATCH 01/10] initial work --- .../execute-results/html.json | 15 + _quarto.yml | 1 + chapters/numeric-predictors.qmd | 325 ++++++++++++++++++ 3 files changed, 341 insertions(+) create mode 100644 _freeze/chapters/numeric-predictors/execute-results/html.json create mode 100644 chapters/numeric-predictors.qmd diff --git a/_freeze/chapters/numeric-predictors/execute-results/html.json b/_freeze/chapters/numeric-predictors/execute-results/html.json new file mode 100644 index 0000000..a3e3fd2 --- /dev/null +++ b/_freeze/chapters/numeric-predictors/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "830a9f2d4280a1a950b90da5245fc721", + "result": { + "engine": "knitr", + "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## Individual transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTwo classes of transformations will be considered here: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\n### Resolving skewness\n\nFor example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. 
\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. \n\nbox-cox\n\n$$\nx^* =\n\\begin{cases} \\frac{x^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$,}\n\\\\\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nyeo-johnson\n\n$$\nx^* =\n\\begin{cases}\n\\frac{(x + 1)^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\\n-\\frac{(-x + 1)^{2 - \\lambda}-1}{2 - \\lambda} & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nMaximum likelihood is also used to estimate the $\\lambda$ parameter.\n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. 
A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). Also, on occasion, the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range.\n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce a value of $\\hat{\\lambda} = 0.15$.. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero).\n\nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. Otherwise, there are a few common main approaches that are used. \n\ncentering/scaling\n\nWhen centering and scaling, what should be done with predictors converted from categorical predictors to binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. 
While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and the centered and scaled version.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=80%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n## Group transformations\n\nTODO more here\n\n### Spatial Sign {#sec-spatial-sign}\n\n\n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. 
\n\nlast panel: spatial sign transformation\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index e5eedf1..ef27ede 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -51,6 +51,7 @@ book: - chapters/whole-game.qmd - part: "Preparation" - chapters/initial-data-splitting.qmd + - chapters/numeric-predictors.qmd - part: "Optmization" - part: "Classification" - part: "Regression" diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd new file mode 100644 index 0000000..818c10f --- /dev/null +++ b/chapters/numeric-predictors.qmd @@ -0,0 +1,325 @@ +--- +knitr: + opts_chunk: + cache.path: "../_cache/transformations/" +--- + +# Transforming Numeric Predictors {#sec-numeric-predictors} + +```{r} +#| label: transformations-setup +#| include: false + +source("../R/_common.R") + +# ------------------------------------------------------------------------------ + +library(tidymodels) +library(embed) +library(bestNormalize) +library(patchwork) +library(ggforce) + +# ------------------------------------------------------------------------------ +# set options + +tidymodels_prefer() +theme_set(theme_transparent()) +set_options() +``` + +```{r} +#| label: ames-split +#| include: false +source("../R/setup_ames.R") +``` + +As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. + +We'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors "in place" but altered. + + +## When are transformations estimated and applied? + +and using what data + +note about not re-estimating; use a single data point and scaling as an example. + +## Individual transformations + +Many transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. + +some based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit? 
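+
+To make the fixed transformations mentioned above concrete, here is a minimal sketch (using made-up proportions rather than the Ames data) of the arcsine square root and logit functions:
+
+```{r}
+#| label: fixed-transformation-sketch
+# Hypothetical proportions; the arcsine square root needs values in [0, 1] and
+# the logit needs values strictly between 0 and 1.
+p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
+
+asin(sqrt(p))     # arcsine square root transformation
+log(p / (1 - p))  # logit transformation; stats::qlogis(p) is equivalent
+```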
+ +Two classes of transformations will be considered here: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). + +### Resolving skewness + +For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). There are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. + +One might infer that "samples far beyond the mainstream of the data" is synonymous with the term "outlier"; The Cambridge dictionary defines an outlier as + +> a person, thing, or fact that is very different from other people, things, or facts [...] + +or + +> a place that is far from the main part of something + +These statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources. + +The @nist describes them as + +> an observation that lies an abnormal distance from other values in a random sample from a population + +In our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of "houses in Ames, Iowa." These values are genuine, just extreme. + +This, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance. 
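+
+As a small illustration of that sensitivity, the sketch below (with made-up values rather than the lot areas themselves) shows how one extreme point can dominate a quantity that squares the data, such as the sample variance:
+
+```{r}
+#| label: extreme-value-sketch
+# Five ordinary values plus a single extreme value
+x <- c(2, 3, 4, 5, 6, 100)
+
+var(x)                          # the sample variance is dominated by the 100
+
+squared_dev <- (x - mean(x))^2
+squared_dev / sum(squared_dev)  # the extreme point's share of the squared deviations
+```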
+ +```{r} +#| label: ames-lot-area-calcs +#| warning: false +lot_area_raw <- + ames_train %>% + ggplot(aes(Lot_Area)) + + geom_histogram(bins = 30, col = "white", fill = "#8E195C", alpha = 1 / 2) + + geom_rug(alpha = 1 / 2, length = unit(0.04, "npc"), linewidth = 1.2) + + labs(x = "Lot Area", title = "(a) original") + +lot_area_yj_rec <- + recipe(~ Lot_Area, data = ames_train) %>% + step_YeoJohnson(Lot_Area) %>% + prep() + +lot_area_bc_rec <- + recipe(~ Lot_Area, data = ames_train) %>% + step_BoxCox(Lot_Area) %>% + prep() + +yj_est <- lot_area_yj_rec %>% tidy(number = 1) %>% pluck("value") +bc_est <- lot_area_bc_rec %>% tidy(number = 1) %>% pluck("value") +bc_skew <- lot_area_bc_rec %>% bake(new_data = NULL) %>% pluck("Lot_Area") %>% e1071::skewness() + +lot_area_yj <- + lot_area_yj_rec %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area)) + + geom_rug(alpha = 1 / 2, length = unit(0.04, "npc"), linewidth = 1.2) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Lot Area", title = "(b) Box-Cox/Yeo-Johnson") + +lot_area_norm <- + recipe(~ Lot_Area, data = ames_train) %>% + step_orderNorm(Lot_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area)) + + geom_rug(alpha = 1 / 2, length = unit(0.04, "npc"), linewidth = 1.2) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Lot Area", title = "(d) ordered quantile normalization") + +lot_area_pctl <- + recipe(~ Lot_Area, data = ames_train) %>% + step_percentile(Lot_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area)) + + geom_rug(alpha = 1 / 2, length = unit(0.04, "npc"), linewidth = 1.2) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Lot Area", title = "(c) percentile") +``` + +```{r} +#| label: fig-ames-lot-area +#| fig-width: 8 +#| fig-height: 5.5 +#| out-width: "80%" +#| fig-cap: "Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations." +(lot_area_raw + lot_area_yj) / (lot_area_pctl + lot_area_norm) +``` + +One way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. + +box-cox + +$$ +x^* = +\begin{cases} \frac{x^\lambda-1}{\lambda} & \text{if $\lambda \ne 0$,} +\\ +log(x) &\text{if $\lambda = 0$.} +\end{cases} +$$ + +yeo-johnson + +$$ +x^* = +\begin{cases} +\frac{(x + 1)^\lambda-1}{\lambda} & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\ +log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\ +-\frac{(-x + 1)^{2 - \lambda}-1}{2 - \lambda} & \text{if $\lambda \ne 2$ and $x < 0$,} \\ +-log(-x + 1) &\text{if $\lambda = 2$ and $x < 0$.} +\end{cases} +$$ + +Maximum likelihood is also used to estimate the $\lambda$ parameter. + +In practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. 
A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). Also, on occasion, the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. + +For the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce a value of $\hat{\lambda} = `r round(yj_est, 3)`$.. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of `r signif(bc_skew, 3)` (much closer to zero). + +Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. + +Numeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate. + +Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. + + +### Standardizing to a common scale + +Another goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between `r min(ames_train$Year_Built)` and `r max(ames_train$Year_Built)`. Another, the number of bathrooms, ranges from `r min(ames_train$Baths)` to `r max(ames_train$Baths)`. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale. + +The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. Otherwise, there are a few common main approaches that are used. + +centering/scaling + +When centering and scaling, what should be done with predictors converted from categorical predictors to binary features? 
These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. + +@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. + +```{r} +#| label: fig-standardization +#| fig-cap: "The original gross living area data and the centered and scaled version." +#| fig-width: 9 +#| fig-height: 3 +#| out-width: "80%" +gross_area_raw <- + ames_train %>% + ggplot(aes(Gr_Liv_Area)) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Gross Living Area") + + geom_rug(alpha = 1 / 2, length = unit(0.02, "npc")) + +gross_area_norm <- + recipe(~ Gr_Liv_Area, data = ames_train) %>% + step_normalize(Gr_Liv_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Gr_Liv_Area)) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Gross Living Area", y = "") + + geom_rug(alpha = 1 / 2, length = unit(0.02, "npc")) + + +gross_area_range <- + recipe(~ Gr_Liv_Area, data = ames_train) %>% + step_range(Gr_Liv_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Gr_Liv_Area)) + + geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + labs(x = "Gross Living Area", y = "") + + geom_rug(alpha = 1 / 2, length = unit(0.02, "npc")) + +gross_area_raw + gross_area_norm + gross_area_range +``` + +Another common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via + +$$ +x^* = \frac{x - \min(x)}{\max(x) - \min(x)} +$$ + +When new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes. 
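+
+As a rough sketch of the bookkeeping involved (hypothetical values, not the Ames columns), the statistics are estimated from the training set and then reused for new samples; in the code above, `step_normalize()` and `step_range()` follow the same estimate-then-apply pattern inside a recipe:
+
+```{r}
+#| label: standardization-sketch
+train_x <- c(912, 1180, 1456, 1743, 2210)  # hypothetical training-set values
+new_x   <- c(850, 1500, 2500)              # new values, partly outside the training range
+
+# centering and scaling with the training-set mean and standard deviation
+(new_x - mean(train_x)) / sd(train_x)
+
+# range standardization to [0, 1], clipping new values that fall outside
+rng <- range(train_x)
+pmin(pmax((new_x - rng[1]) / (rng[2] - rng[1]), 0), 1)
+```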
+ +## Group transformations + +TODO more here + +### Spatial Sign {#sec-spatial-sign} + + + +$$ +x^*_{ij}=\frac{x_{ij}}{\sum^{P}_{j=1} x_{ij}^2} +$$ + +```{r} +#| label: ames-lot-living-area-calc +two_areas_rec <- + recipe(~ Lot_Area + Gr_Liv_Area, data = ames_train) %>% + step_mutate( + location = ifelse(Lot_Area > 30000 | Gr_Liv_Area > 3500, "'outlying'", "mainstream") + ) %>% + prep() + +data_cols <- c(rgb(0.27, 0.59, 0.15), rgb(0, 0, 0, 1/5)) + +two_areas_raw <- + two_areas_rec %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area, Gr_Liv_Area)) + + geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + + labs(x = "Lot Area", y = "Gross Living Area") + + scale_color_manual(values = data_cols) + + scale_size_manual(values = c(3, 1)) + + coord_fixed(ratio = 45) + +two_areas_norm <- + two_areas_rec %>% + step_orderNorm(Lot_Area, Gr_Liv_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area, Gr_Liv_Area)) + + geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + + labs(x = "Lot Area", y = "Gross Living Area") + + scale_color_manual(values = data_cols) + + scale_size_manual(values = c(3, 1)) + + coord_equal() + + theme(axis.title.y = element_blank()) + +two_areas_ss <- + two_areas_rec %>% + step_normalize(Lot_Area, Gr_Liv_Area) %>% + step_spatialsign(Lot_Area, Gr_Liv_Area) %>% + prep() %>% + bake(new_data = NULL) %>% + ggplot(aes(Lot_Area, Gr_Liv_Area)) + + geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + + labs(x = "Lot Area", y = "Gross Living Area") + + scale_color_manual(values = data_cols) + + scale_size_manual(values = c(3, 1 /2)) + + coord_equal() + + theme(axis.title.y = element_blank()) +``` + +@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least `r sum(bake(two_areas_rec, new_data = NULL)$location == "'outlying'")` samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. + +The second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. + +last panel: spatial sign transformation + +```{r} +#| label: fig-ames-lot-living-area +#| fig-cap: "Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations." 
+#| fig-width: 8 +#| fig-height: 3 +#| out-width: "100%" + +two_areas_raw + two_areas_norm + two_areas_ss + + plot_layout(guides = "collect") & + theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt")) +``` + +## Feature Extraction and Embeddings + + +### Linear Projection Methods {#sec-linear-feature-extraction} + + + +### Nonlinear Techniques {#sec-nonlinear-feature-extraction} + + + +## Chapter References {.unnumbered} + From 05d7d95747afb4224bbf3065e42c11492720f035 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Fri, 1 Dec 2023 15:00:50 -0500 Subject: [PATCH 02/10] standardization text --- chapters/numeric-predictors.qmd | 91 ++++++++++++++++++++++----------- 1 file changed, 62 insertions(+), 29 deletions(-) diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index 818c10f..5662b96 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -55,25 +55,14 @@ Two classes of transformations will be considered here: those that resolve distr ### Resolving skewness -For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). There are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. +The skew of a distribution can be quantified using the skewness statistic: -One might infer that "samples far beyond the mainstream of the data" is synonymous with the term "outlier"; The Cambridge dictionary defines an outlier as - -> a person, thing, or fact that is very different from other people, things, or facts [...] - -or - -> a place that is far from the main part of something - -These statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources. - -The @nist describes them as - -> an observation that lies an abnormal distance from other values in a random sample from a population - -In our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of "houses in Ames, Iowa." These values are genuine, just extreme. - -This, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance. +$$\begin{align} + skewness &= \frac{1}{(n-1)v^{3/2}} \sum_{1=1}^n (x_i-\overline{x})^3 \notag \\ + \text{where}\quad v &= \frac{1}{(n-1)}\sum_{1=1}^n (x_i-\overline{x})^2 \notag +\end{align} +$$ +where values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. 
For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). There are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. ```{r} #| label: ames-lot-area-calcs @@ -137,9 +126,52 @@ lot_area_pctl <- (lot_area_raw + lot_area_yj) / (lot_area_pctl + lot_area_norm) ``` + +One might infer that "samples far beyond the mainstream of the data" is synonymous with the term "outlier"; The Cambridge dictionary defines an outlier as + +> a person, thing, or fact that is very different from other people, things, or facts [...] + +or + +> a place that is far from the main part of something + +These statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources. + +The @nist describes them as + +> an observation that lies an abnormal distance from other values in a random sample from a population + +In our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of "houses in Ames, Iowa." These values are genuine, just extreme. + +This, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance. + One way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. -box-cox +@Box1964p3648 defined a power family of transformations that use a single parameter, $\lambda$, for different methods: + +:::: {.columns} + +::: {.column width="10%"} +::: + +::: {.column width="40%"} +- no transformation via $\lambda = 1.0$ +- square ($x^2$) via $\lambda = 2$ +- logarithmic ($\log{x}$) via $\lambda = 0.0$ +::: + +::: {.column width="40%"} +- square root ($\sqrt{x}$) via $\lambda = 0.5$ +- inverse square root ($1/\sqrt{x}$) via $\lambda = -0.5$ +- inverse ($1/x$) via $\lambda = -1.0$ +::: + +::: {.column width="10%"} +::: + +:::: + +and others in between. 
The transformed version of the variable is: $$ x^* = @@ -149,7 +181,7 @@ log(x) &\text{if $\lambda = 0$.} \end{cases} $$ -yeo-johnson +Their paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors. @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: $$ x^* = @@ -161,11 +193,11 @@ log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\ \end{cases} $$ -Maximum likelihood is also used to estimate the $\lambda$ parameter. +In either case, maximum likelihood is also used to estimate the $\lambda$ parameter. -In practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). Also, on occasion, the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. +In practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\hat{\lambda}$ is within some range of zero (say between $\pm 0.01$)^[If you've never seen it, the "hat" notation (e.g. $\hat{\lambda}$) indicates an estimate of some unknown parameter.]. -For the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce a value of $\hat{\lambda} = `r round(yj_est, 3)`$.. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of `r signif(bc_skew, 3)` (much closer to zero). +For the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\hat{\lambda} = `r round(yj_est, 3)`$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of `r signif(bc_skew, 3)` (much closer to zero). However, there are still outlying points. Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. @@ -173,16 +205,17 @@ Numeric predictors can be converted to their percentiles, and these data, inhere Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. 
It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. 
+
+In @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed.
 
 ### Standardizing to a common scale 
 
 Another goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between `r min(ames_train$Year_Built)` and `r max(ames_train$Year_Built)`. Another, the number of bathrooms, ranges from `r min(ames_train$Baths)` to `r max(ames_train$Baths)`. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale.
 
-The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. Otherwise, there are a few common main approaches that are used. 
+The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used.
 
-centering/scaling
+First is centering and scaling. To convert to a common scale, the mean ($\bar{x}$) and standard deviation ($\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \bar{x}) / \hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. 
 
-When centering and scaling, what should be done with predictors converted from categorical predictors to binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. 
+In the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what should be done with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. 
 
 @fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. 
 

From c3b5d4982cad0b50073566b9cb07849366e5d551 Mon Sep 17 00:00:00 2001
From: Max Kuhn
Date: Fri, 1 Dec 2023 16:12:43 -0500
Subject: [PATCH 03/10] spatial sign details

---
 chapters/numeric-predictors.qmd | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd
index 5662b96..48f995d 100644
--- a/chapters/numeric-predictors.qmd
+++ b/chapters/numeric-predictors.qmd
@@ -45,13 +45,15 @@ and using what data
 
 note about not re-estimating; use a single data point and scaling as an example. 
 
-## Individual transformations
+## General Transformations
 
 Many transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. 
 
 some based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit? 
 
-Two classes of transformations will be considered here: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). 
+To start, we'll consider two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale).
+
+After these, an example of a _group_ transformation is described. 
### Resolving skewness @@ -156,7 +158,7 @@ One way to resolve skewness is to apply a transformation that makes the data mor ::: {.column width="40%"} - no transformation via $\lambda = 1.0$ -- square ($x^2$) via $\lambda = 2$ +- square ($x^2$) via $\lambda = 2.0$ - logarithmic ($\log{x}$) via $\lambda = 0.0$ ::: @@ -221,10 +223,10 @@ In the next chapter, methods are discussed to convert categorical predictors to ```{r} #| label: fig-standardization -#| fig-cap: "The original gross living area data and the centered and scaled version." +#| fig-cap: "The original gross living area data and two standardized versions." #| fig-width: 9 #| fig-height: 3 -#| out-width: "80%" +#| out-width: "92%" gross_area_raw <- ames_train %>% ggplot(aes(Gr_Liv_Area)) + @@ -264,18 +266,16 @@ $$ When new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes. -## Group transformations - -TODO more here - ### Spatial Sign {#sec-spatial-sign} - +Some transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: $$ x^*_{ij}=\frac{x_{ij}}{\sum^{P}_{j=1} x_{ij}^2} $$ +Notice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. + ```{r} #| label: ames-lot-living-area-calc two_areas_rec <- @@ -290,12 +290,12 @@ data_cols <- c(rgb(0.27, 0.59, 0.15), rgb(0, 0, 0, 1/5)) two_areas_raw <- two_areas_rec %>% bake(new_data = NULL) %>% - ggplot(aes(Lot_Area, Gr_Liv_Area)) + + ggplot(aes(Lot_Area/1000, Gr_Liv_Area)) + geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + - labs(x = "Lot Area", y = "Gross Living Area") + + labs(x = "Lot Area (thousands)", y = "Gross Living Area") + scale_color_manual(values = data_cols) + scale_size_manual(values = c(3, 1)) + - coord_fixed(ratio = 45) + coord_fixed(ratio = 1/25) two_areas_norm <- two_areas_rec %>% @@ -329,7 +329,6 @@ two_areas_ss <- The second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. 
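+
+Before looking at the last panel, here is a minimal sketch of the spatial sign's row-wise calculation (hypothetical values; the figure below relies on `step_normalize()` followed by `step_spatialsign()` instead): each standardized row is divided by its Euclidean norm so that every sample lands on the unit circle.
+
+```{r}
+#| label: spatial-sign-sketch
+set.seed(1)
+dat <- data.frame(x1 = rnorm(5), x2 = rnorm(5, sd = 3))
+dat[5, ] <- c(10, 30)                   # make one row an artificial outlier
+
+scaled <- scale(dat)                    # center and scale each column
+ss <- scaled / sqrt(rowSums(scaled^2))  # project each row onto the unit circle
+rowSums(ss^2)                           # every row now has unit length
+```
+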
-last panel: spatial sign transformation ```{r} #| label: fig-ames-lot-living-area @@ -343,11 +342,14 @@ two_areas_raw + two_areas_norm + two_areas_ss + theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt")) ``` +The panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. + ## Feature Extraction and Embeddings ### Linear Projection Methods {#sec-linear-feature-extraction} +spatial sign for robustness ### Nonlinear Techniques {#sec-nonlinear-feature-extraction} From bc486bfec4c3265c698983576ba1289edf456543 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 5 Dec 2023 15:36:20 -0500 Subject: [PATCH 04/10] add some extra transformation functions --- .../execute-results/html.json | 4 +- chapters/numeric-predictors.qmd | 40 +++++++- includes/references.bib | 96 +++++++++++++++++++ 3 files changed, 135 insertions(+), 5 deletions(-) diff --git a/_freeze/chapters/numeric-predictors/execute-results/html.json b/_freeze/chapters/numeric-predictors/execute-results/html.json index a3e3fd2..5bc585a 100644 --- a/_freeze/chapters/numeric-predictors/execute-results/html.json +++ b/_freeze/chapters/numeric-predictors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "830a9f2d4280a1a950b90da5245fc721", + "hash": "1a5c4797d90f8899717bd848f5c0c55b", "result": { "engine": "knitr", - "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## Individual transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTwo classes of transformations will be considered here: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). 
\n\n### Resolving skewness\n\nFor example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. 
\n\nbox-cox\n\n$$\nx^* =\n\\begin{cases} \\frac{x^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$,}\n\\\\\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nyeo-johnson\n\n$$\nx^* =\n\\begin{cases}\n\\frac{(x + 1)^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\\n-\\frac{(-x + 1)^{2 - \\lambda}-1}{2 - \\lambda} & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nMaximum likelihood is also used to estimate the $\\lambda$ parameter.\n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). Also, on occasion, the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range.\n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce a value of $\\hat{\\lambda} = 0.15$.. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero).\n\nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. 
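To see why a common scale matters for distance calculations, the short sketch below uses two made-up houses (hypothetical values, not rows from the Ames training set) and compares the Euclidean distance before and after centering and scaling each predictor.

```{r}
# Two hypothetical houses: year built and number of bathrooms
house_a <- c(year_built = 1950, baths = 1)
house_b <- c(year_built = 2005, baths = 4)

# The raw distance is driven almost entirely by the year difference
sqrt(sum((house_a - house_b)^2))

# Standardize with made-up training set means and standard deviations; both
# predictors now contribute on comparable scales
train_mean <- c(year_built = 1971, baths = 2.2)
train_sd   <- c(year_built = 30,   baths = 1.1)

scaled_a <- (house_a - train_mean) / train_sd
scaled_b <- (house_b - train_mean) / train_sd
sqrt(sum((scaled_a - scaled_b)^2))
```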
See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. Otherwise, there are a few common main approaches that are used. \n\ncentering/scaling\n\nWhen centering and scaling, what should be done with predictors converted from categorical predictors to binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and the centered and scaled version.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=80%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n## Group transformations\n\nTODO more here\n\n### Spatial Sign {#sec-spatial-sign}\n\n\n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. 
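The orderNorm transformation used in that panel comes from the bestNormalize package loaded by this chapter; the base R sketch below shows the core idea on simulated right-skewed predictors by mapping ranks to standard normal quantiles. It is only a rough stand-in for the full ORQ procedure (the packaged version, e.g. `orderNorm()`, also handles new data points via interpolation).

```{r}
# Simulated right-skewed predictors (stand-ins for lot area and living area)
set.seed(2)
sim <- data.frame(
  lot_area    = rlnorm(500, meanlog = 9.1, sdlog = 0.6),
  living_area = rlnorm(500, meanlog = 7.3, sdlog = 0.4)
)

# Rank-based normal scores: convert each value to its rank, then to the
# corresponding quantile of a standard normal distribution
normal_scores <- function(x) qnorm((rank(x) - 0.5) / length(x))

sim_norm <- data.frame(lapply(sim, normal_scores))

# Both columns are now approximately standard normal, so the extreme values
# are pulled in toward the bulk of the data
summary(sim_norm)
```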
\n\nlast panel: spatial sign transformation\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", + "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTo start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. \n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\\begin{align}\n skewness &= \\frac{1}{(n-1)v^{3/2}} \\sum_{1=1}^n (x_i-\\overline{x})^3 \\notag \\\\\n \\text{where}\\quad v &= \\frac{1}{(n-1)}\\sum_{1=1}^n (x_i-\\overline{x})^2 \\notag\n\\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. 
The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. \n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\\lambda = 1.0$\n- square ($x^2$) via $\\lambda = 2.0$\n- logarithmic ($\\log{x}$) via $\\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\\sqrt{x}$) via $\\lambda = 0.5$\n- inverse square root ($1/\\sqrt{x}$) via $\\lambda = -0.5$\n- inverse ($1/x$) via $\\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. The transformed version of the variable is:\n\n$$\nx^* =\n\\begin{cases} \\frac{x^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$,}\n\\\\\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\\lambda$ that minimizes the residual sums of squared errors. 
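Before turning to how this family is co-opted for predictors, a small base R sketch applies the formula above for a few fixed values of $\lambda$ to simulated positive data. In practice $\lambda$ is estimated rather than chosen by hand; the values and object names below are purely illustrative.

```{r}
# Box-Cox transformation for a fixed, user-supplied lambda
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

set.seed(3)
x <- rlnorm(1000, meanlog = 9, sdlog = 0.8)  # simulated positive, right-skewed values

# A few members of the family: log, square-root-like, and inverse-like
head(box_cox(x, lambda = 0))
head(box_cox(x, lambda = 0.5))
head(box_cox(x, lambda = -1))
```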
In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\\begin{cases}\n\\frac{(x + 1)^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\\n-\\frac{(-x + 1)^{2 - \\lambda}-1}{2 - \\lambda} & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nIn either case, maximum likelihood is also used to estimate the $\\lambda$ parameter. \n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\\hat{\\lambda}$ is within some range of zero (say between $\\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\\hat{\\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\\hat{\\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\\lambda$. 
\n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Bickel-Docksum | $$x^* = \\lambda^{-1}\\left(sign(x)|x| - 1\\right)\\quad\\text{if $\\lambda \\neq 0$}$$ | @bickel1981analysis |\n| Dual | $$x^* = (2\\lambda)^{-1}(x^\\lambda - x^{-\\lambda})\\quad\\text{if $\\lambda \\neq 0$}$$ | @yang2006modified |\n| Glog / Gpower | $$x^* = \\begin{cases} \\lambda^{-1}\\left[({x+ \\sqrt{x^2+1}})^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\n\\log({x+ \\sqrt{x^2+1}}) &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Modulus | $$x^* = \\begin{cases} sign(x)\\lambda^{-1}\\left[(|x|+1)^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\nsign(x) \\log{(|x|+1)} &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @john1980alternative |\n| Neglog | $$x^* = sign(x) \\log{(|x|+1)}$$ | @whittaker2005neglog |\n\nExamples of other families of transformations. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. \n\nFirst is centering and scaling. To convert to a common scale, the mean ($\\bar{x}$) and standard deviation ($\\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \\bar{x}) / \\hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. 
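A minimal base R sketch of this convention (with made-up values): a dense numeric predictor and a binary indicator column are standardized the same way, so the indicator still takes exactly two values, one negative and one positive, whose magnitudes reflect how unbalanced it is.

```{r}
set.seed(4)
dat <- data.frame(
  living_area = round(rlnorm(10, meanlog = 7.3, sdlog = 0.3)),  # simulated values
  central_air = c(0, 1, 1, 1, 0, 1, 1, 1, 0, 1)                 # hypothetical indicator
)

# Center and scale every column, the indicator included
standardized <- as.data.frame(scale(dat))
lapply(standardized, unique)

# If the predictors are only scaled (not centered), one suggestion is to divide
# indicator columns by two standard deviations instead of one
dat$central_air / (2 * sd(dat$central_air))
```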
\n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. 
The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index 48f995d..14f6d76 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -18,7 +18,6 @@ library(tidymodels) library(embed) library(bestNormalize) library(patchwork) -library(ggforce) # ------------------------------------------------------------------------------ # set options @@ -45,7 +44,7 @@ and using what data note about not re-estimating; use a single data point and scaling as an example. -## General Rransformations +## General Transformations Many transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. @@ -183,7 +182,7 @@ log(x) &\text{if $\lambda = 0$.} \end{cases} $$ -Their paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors. @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: +Their paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: $$ x^* = @@ -207,6 +206,41 @@ Numeric predictors can be converted to their percentiles, and these data, inhere Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. +There are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\lambda$. 
+ + +:::: {.columns} + +::: {.column width="15%"} +::: + +::: {.column width="70%"} + +::: {#tbl-transforms} + +| Name | Equation | Source | +|------------------|:--------------------------------------------------------------:|:----------------------:| +| Bickel-Docksum | $$x^* = \lambda^{-1}\left(sign(x)|x| - 1\right)\quad\text{if $\lambda \neq 0$}$$ | @bickel1981analysis | +| Dual | $$x^* = (2\lambda)^{-1}(x^\lambda - x^{-\lambda})\quad\text{if $\lambda \neq 0$}$$ | @yang2006modified | +| Glog / Gpower | $$x^* = \begin{cases} \lambda^{-1}\left[({x+ \sqrt{x^2+1}})^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt] +\log({x+ \sqrt{x^2+1}}) &\text{if $\lambda = 0$} +\end{cases}$$ | @durbin2002variance, @kelmansky2013new | +| Modulus | $$x^* = \begin{cases} sign(x)\lambda^{-1}\left[(|x|+1)^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt] +sign(x) \log{(|x|+1)} &\text{if $\lambda = 0$} +\end{cases}$$ | @john1980alternative | +| Neglog | $$x^* = sign(x) \log{(|x|+1)}$$ | @whittaker2005neglog | + +Examples of other families of transformations. + +::: + +::: + +::: {.column width="15%"} +::: + +:::: + In @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. ### Standardizing to a common scale diff --git a/includes/references.bib b/includes/references.bib index b8694b1..783446b 100644 --- a/includes/references.bib +++ b/includes/references.bib @@ -4893,3 +4893,99 @@ @article{kaufman2012leakage year={2012}, publisher={ACM New York, NY, USA} } + +@article{whittaker2005neglog, + title={The neglog transformation and quantile regression for the analysis of a large credit scoring database}, + author={Whittaker, J and Whitehead, C and Somers, M}, + journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + volume={54}, + number={5}, + pages={863-878}, + year={2005}, + publisher={Oxford University Press} +} + +@article{manly1976exponential, + title={Exponential data transformations}, + author={Manly, B}, + journal={Journal of the Royal Statistical Society Series {D}: The Statistician}, + volume={25}, + number={1}, + pages={37-42}, + year={1976}, + publisher={Oxford University Press} +} + +@article{feng2016note, + title={A note on automatic data transformation}, + author={Feng, Q and Hannig, J and Marron, JS}, + journal={Stat}, + volume={5}, + number={1}, + pages={82-87}, + year={2016}, + publisher={Wiley Online Library} +} + +@article{kelmansky2013new, + title={A new variance stabilizing transformation for gene expression data analysis}, + author={Kelmansky, D and Mart{\'\i}nez, E and Leiva, V}, + journal={Statistical Applications in Genetics and Molecular Biology}, + volume={12}, + number={6}, + pages={653-666}, + year={2013}, + publisher={De Gruyter} +} + +@article{durbin2002variance, + title={A variance-stabilizing transformation for gene-expression microarray data}, + author={Durbin, B and Hardin, J and Hawkins, D and Rocke, D}, + journal={Bioinformatics}, + volume={18}, + year={2002} +} + +@article{yang2006modified, + title={A modified family of power transformations}, + author={Yang, Z}, + journal={Economics Letters}, + volume={92}, + number={1}, + pages={14--19}, + year={2006}, + publisher={Elsevier} +} + +@article{bickel1981analysis, + title={An analysis of transformations revisited}, + author={Bickel, P and Doksum, K}, + journal={Journal of the American Statistical Association}, + volume={76}, + number={374}, + pages={296-311}, + year={1981}, + publisher={Taylor \& Francis} +} + +@article{asar2017estimating, + 
title={Estimating {Box-Cox} power transformation parameter via goodness-of-fit tests}, + author={Asar, O and Ilk, O and Dag, O}, + journal={Communications in Statistics-Simulation and Computation}, + volume={46}, + number={1}, + pages={91-105}, + year={2017}, + publisher={Taylor \& Francis} +} + +@article{john1980alternative, + title={An alternative family of transformations}, + author={John, J and Draper, N}, + journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + volume={29}, + number={2}, + pages={190--197}, + year={1980}, + publisher={Oxford University Press} +} \ No newline at end of file From 4fb3d08b8477f6d76af077f93f451e2ff065a7ee Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 5 Dec 2023 19:57:07 -0500 Subject: [PATCH 05/10] small updates to equations --- .../execute-results/html.json | 4 +-- chapters/numeric-predictors.qmd | 33 ++++++++++--------- 2 files changed, 19 insertions(+), 18 deletions(-) diff --git a/_freeze/chapters/numeric-predictors/execute-results/html.json b/_freeze/chapters/numeric-predictors/execute-results/html.json index 5bc585a..1cae1dc 100644 --- a/_freeze/chapters/numeric-predictors/execute-results/html.json +++ b/_freeze/chapters/numeric-predictors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "1a5c4797d90f8899717bd848f5c0c55b", + "hash": "d46baad36b55214e996321bf7eb1f7d5", "result": { "engine": "knitr", - "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTo start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. 
\n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\\begin{align}\n skewness &= \\frac{1}{(n-1)v^{3/2}} \\sum_{1=1}^n (x_i-\\overline{x})^3 \\notag \\\\\n \\text{where}\\quad v &= \\frac{1}{(n-1)}\\sum_{1=1}^n (x_i-\\overline{x})^2 \\notag\n\\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. 
\n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\\lambda = 1.0$\n- square ($x^2$) via $\\lambda = 2.0$\n- logarithmic ($\\log{x}$) via $\\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\\sqrt{x}$) via $\\lambda = 0.5$\n- inverse square root ($1/\\sqrt{x}$) via $\\lambda = -0.5$\n- inverse ($1/x$) via $\\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. The transformed version of the variable is:\n\n$$\nx^* =\n\\begin{cases} \\frac{x^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$,}\n\\\\\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\\begin{cases}\n\\frac{(x + 1)^\\lambda-1}{\\lambda} & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\\n-\\frac{(-x + 1)^{2 - \\lambda}-1}{2 - \\lambda} & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nIn either case, maximum likelihood is also used to estimate the $\\lambda$ parameter. \n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\\hat{\\lambda}$ is within some range of zero (say between $\\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\\hat{\\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\\hat{\\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. 
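A short base R sketch of the percentile transformation on simulated data (hypothetical values, not the actual lot areas): the empirical cumulative distribution function estimated from the training set converts each value to its percentile, and because `ecdf()` returns values in [0, 1], new samples outside the training range are automatically mapped to zero or one.

```{r}
set.seed(5)
train_lot <- rlnorm(500, meanlog = 9.1, sdlog = 0.6)  # simulated training set values
new_lot   <- c(1000, 12000, 250000)                   # hypothetical new samples

# The training set empirical CDF is the percentile transformation
pctl <- ecdf(train_lot)

hist(pctl(train_lot), breaks = 20)  # roughly uniform on [0, 1]
pctl(new_lot)                       # values far outside the training range map to 0 or 1
```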
This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\\lambda$. \n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Bickel-Docksum | $$x^* = \\lambda^{-1}\\left(sign(x)|x| - 1\\right)\\quad\\text{if $\\lambda \\neq 0$}$$ | @bickel1981analysis |\n| Dual | $$x^* = (2\\lambda)^{-1}(x^\\lambda - x^{-\\lambda})\\quad\\text{if $\\lambda \\neq 0$}$$ | @yang2006modified |\n| Glog / Gpower | $$x^* = \\begin{cases} \\lambda^{-1}\\left[({x+ \\sqrt{x^2+1}})^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\n\\log({x+ \\sqrt{x^2+1}}) &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Modulus | $$x^* = \\begin{cases} sign(x)\\lambda^{-1}\\left[(|x|+1)^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\nsign(x) \\log{(|x|+1)} &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @john1980alternative |\n| Neglog | $$x^* = sign(x) \\log{(|x|+1)}$$ | @whittaker2005neglog |\n\nExamples of other families of transformations. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. \n\nFirst is centering and scaling. 
To convert to a common scale, the mean ($\\bar{x}$) and standard deviation ($\\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \\bar{x}) / \\hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. 
In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", + "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. 
Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTo start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. \n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\\begin{align}\n skewness &= \\frac{1}{(n-1)v^{3/2}} \\sum_{1=1}^n (x_i-\\overline{x})^3 \\notag \\\\\n \\text{where}\\quad v &= \\frac{1}{(n-1)}\\sum_{1=1}^n (x_i-\\overline{x})^2 \\notag\n\\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. 
A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. \n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\\lambda = 1.0$\n- square ($x^2$) via $\\lambda = 2.0$\n- logarithmic ($\\log{x}$) via $\\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\\sqrt{x}$) via $\\lambda = 0.5$\n- inverse square root ($1/\\sqrt{x}$) via $\\lambda = -0.5$\n- inverse ($1/x$) via $\\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. The transformed version of the variable is:\n\n$$\nx^* =\n\\begin{cases} \\lambda^{-1}(x^\\lambda-1) & \\text{if $\\lambda \\ne 0$,}\n\\\\[3pt]\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\\begin{cases}\n\\lambda^{-1}\\left[(x + 1)^\\lambda-1\\right] & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\[3pt]\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\[3pt]\n-(2 - \\lambda)^{-1}\\left[(-x + 1)^{2 - \\lambda}-1\\right] & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\[3pt]\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nIn either case, maximum likelihood is also used to estimate the $\\lambda$ parameter. \n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\\hat{\\lambda}$ is within some range of zero (say between $\\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\\hat{\\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\\hat{\\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\n\nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\\lambda$. 
\n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Modulus | $$x^* = \\begin{cases} sign(x)\\lambda^{-1}\\left[(|x|+1)^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\nsign(x) \\log{(|x|+1)} &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @john1980alternative |\n| Bickel-Docksum | $$x^* = \\lambda^{-1}\\left[sign(x)|x| - 1\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @bickel1981analysis |\n| Glog / Gpower | $$x^* = \\begin{cases} \\lambda^{-1}\\left[({x+ \\sqrt{x^2+1}})^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\n\\log({x+ \\sqrt{x^2+1}}) &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Neglog | $$x^* = sign(x) \\log{(|x|+1)}$$ | @whittaker2005neglog |\n| Dual | $$x^* = (2\\lambda)^{-1}\\left[x^\\lambda - x^{-\\lambda}\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @yang2006modified |\n\nExamples of other families of transformations for dense numeric predictors. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. 
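To see the scale issue numerically, consider a small, purely hypothetical sketch: two houses that differ by 55 years and three bathrooms. On the raw scale, the year difference dominates the Euclidean distance; after centering and scaling (here with invented training-set statistics), both predictors contribute comparably.

```r
# Two hypothetical houses: year built and number of baths (illustrative values).
houses <- data.frame(year_built = c(1950, 2005), baths = c(1, 4))

# Raw distance is essentially just the 55-year gap.
dist(houses)

# Standardize with (made-up) training-set means and standard deviations;
# the two predictors are then on comparable scales.
train_mean <- c(year_built = 1971, baths = 2)
train_sd   <- c(year_built = 30,   baths = 1)
dist(scale(houses, center = train_mean, scale = train_sd))
```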
See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. \n\nFirst is centering and scaling. To convert to a common scale, the mean ($\\bar{x}$) and standard deviation ($\\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \\bar{x}) / \\hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. 
This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index 14f6d76..cca82b3 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -176,8 +176,8 @@ and others in between. 
The transformed version of the variable is: $$ x^* = -\begin{cases} \frac{x^\lambda-1}{\lambda} & \text{if $\lambda \ne 0$,} -\\ +\begin{cases} \lambda^{-1}(x^\lambda-1) & \text{if $\lambda \ne 0$,} +\\[3pt] log(x) &\text{if $\lambda = 0$.} \end{cases} $$ @@ -187,9 +187,9 @@ Their paper defines this as a supervised transformation of a non-negative outcom $$ x^* = \begin{cases} -\frac{(x + 1)^\lambda-1}{\lambda} & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\ -log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\ --\frac{(-x + 1)^{2 - \lambda}-1}{2 - \lambda} & \text{if $\lambda \ne 2$ and $x < 0$,} \\ +\lambda^{-1}\left[(x + 1)^\lambda-1\right] & \text{if $\lambda \ne 0$ and $x \ge 0$,} \\[3pt] +log(x + 1) &\text{if $\lambda = 0$ and $x \ge 0$.} \\[3pt] +-(2 - \lambda)^{-1}\left[(-x + 1)^{2 - \lambda}-1\right] & \text{if $\lambda \ne 2$ and $x < 0$,} \\[3pt] -log(-x + 1) &\text{if $\lambda = 2$ and $x < 0$.} \end{cases} $$ @@ -200,12 +200,7 @@ In practice, these two transformations might be limited to predictors with accep For the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\hat{\lambda} = `r round(yj_est, 3)`$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of `r signif(bc_skew, 3)` (much closer to zero). However, there are still outlying points. -Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. - -Numeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate. -Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. - There are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\lambda$. 
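In practice these transformations are usually estimated as one step of a preprocessing pipeline rather than by hand. A hypothetical sketch using the recipes package (assuming the chapter's `ames_train`/`ames_test` split and the `Sale_Price` outcome) might look like the following; it is only meant to show that $\lambda$ is estimated during `prep()` and then reused.

```r
library(recipes)

# Estimate a Yeo-Johnson lambda for each numeric predictor from the
# training set, then apply the same transformation everywhere.
yj_rec <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  prep()  # lambda values are estimated here

bake(yj_rec, new_data = NULL)       # transformed training set
bake(yj_rec, new_data = ames_test)  # same lambdas applied to new data
```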
@@ -220,17 +215,17 @@ There are numerous other transformations that attempt to make the distribution o | Name | Equation | Source | |------------------|:--------------------------------------------------------------:|:----------------------:| -| Bickel-Docksum | $$x^* = \lambda^{-1}\left(sign(x)|x| - 1\right)\quad\text{if $\lambda \neq 0$}$$ | @bickel1981analysis | -| Dual | $$x^* = (2\lambda)^{-1}(x^\lambda - x^{-\lambda})\quad\text{if $\lambda \neq 0$}$$ | @yang2006modified | -| Glog / Gpower | $$x^* = \begin{cases} \lambda^{-1}\left[({x+ \sqrt{x^2+1}})^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt] -\log({x+ \sqrt{x^2+1}}) &\text{if $\lambda = 0$} -\end{cases}$$ | @durbin2002variance, @kelmansky2013new | | Modulus | $$x^* = \begin{cases} sign(x)\lambda^{-1}\left[(|x|+1)^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt] sign(x) \log{(|x|+1)} &\text{if $\lambda = 0$} \end{cases}$$ | @john1980alternative | +| Bickel-Docksum | $$x^* = \lambda^{-1}\left[sign(x)|x| - 1\right]\quad\text{if $\lambda \neq 0$}$$ | @bickel1981analysis | +| Glog / Gpower | $$x^* = \begin{cases} \lambda^{-1}\left[({x+ \sqrt{x^2+1}})^\lambda-1\right] & \text{if $\lambda \neq 0$,}\\[3pt] +\log({x+ \sqrt{x^2+1}}) &\text{if $\lambda = 0$} +\end{cases}$$ | @durbin2002variance, @kelmansky2013new | | Neglog | $$x^* = sign(x) \log{(|x|+1)}$$ | @whittaker2005neglog | +| Dual | $$x^* = (2\lambda)^{-1}\left[x^\lambda - x^{-\lambda}\right]\quad\text{if $\lambda \neq 0$}$$ | @yang2006modified | -Examples of other families of transformations. +Examples of other families of transformations for dense numeric predictors. ::: @@ -241,6 +236,12 @@ Examples of other families of transformations. :::: +Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. + +Numeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate. + +Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. + In @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. 
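A percentile transformation like the one described above can be sketched with the training set's empirical cumulative distribution function; the code below is a minimal base-R illustration (object names are arbitrary), assuming the chapter's `ames_train` data.

```r
# Estimate the percentile mapping from the training set only.
lot_area_cdf <- ecdf(ames_train$Lot_Area)

# Applied to the training data, the result is roughly uniform on [0, 1].
train_pctl <- lot_area_cdf(ames_train$Lot_Area)

# New values are expressed relative to the training distribution; anything
# below or above the training range maps to 0 or 1 (the truncation rule above).
lot_area_cdf(c(4726, 1e6))
```

Implementations of ORQ normalization are available in R packages such as bestNormalize.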
### Standardizing to a common scale From 2455e94268af466f562098cec8934647b907f5c7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E2=80=98topepo=E2=80=99?= <‘mxkuhn@gmail.com’> Date: Fri, 8 Dec 2023 23:27:14 -0500 Subject: [PATCH 06/10] data usage summary --- .../numeric-predictors/execute-results/html.json | 4 ++-- chapters/numeric-predictors.qmd | 12 +++++++----- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/_freeze/chapters/numeric-predictors/execute-results/html.json b/_freeze/chapters/numeric-predictors/execute-results/html.json index 1cae1dc..cbe785d 100644 --- a/_freeze/chapters/numeric-predictors/execute-results/html.json +++ b/_freeze/chapters/numeric-predictors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "d46baad36b55214e996321bf7eb1f7d5", + "hash": "b546b97ff8f9d0477978fae463702934", "result": { "engine": "knitr", - "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nand using what data\n\nnote about not re-estimating; use a single data point and scaling as an example. \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nsome based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?\n\nTo start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. \n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\\begin{align}\n skewness &= \\frac{1}{(n-1)v^{3/2}} \\sum_{1=1}^n (x_i-\\overline{x})^3 \\notag \\\\\n \\text{where}\\quad v &= \\frac{1}{(n-1)}\\sum_{1=1}^n (x_i-\\overline{x})^2 \\notag\n\\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). 
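For reference, the statistic above is straightforward to compute directly; a small base-R helper (name arbitrary) is shown below and can be applied to a column such as the training-set lot area.

```r
# Sample skewness, following the formula above.
skewness_stat <- function(x) {
  n <- length(x)
  ctr <- x - mean(x)
  v <- sum(ctr^2) / (n - 1)
  sum(ctr^3) / ((n - 1) * v^(3 / 2))
}

# e.g., skewness_stat(ames_train$Lot_Area) for the distribution in panel (a)
```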
There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. \n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\\lambda = 1.0$\n- square ($x^2$) via $\\lambda = 2.0$\n- logarithmic ($\\log{x}$) via $\\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\\sqrt{x}$) via $\\lambda = 0.5$\n- inverse square root ($1/\\sqrt{x}$) via $\\lambda = -0.5$\n- inverse ($1/x$) via $\\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. 
The transformed version of the variable is:\n\n$$\nx^* =\n\\begin{cases} \\lambda^{-1}(x^\\lambda-1) & \\text{if $\\lambda \\ne 0$,}\n\\\\[3pt]\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\\begin{cases}\n\\lambda^{-1}\\left[(x + 1)^\\lambda-1\\right] & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\[3pt]\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\[3pt]\n-(2 - \\lambda)^{-1}\\left[(-x + 1)^{2 - \\lambda}-1\\right] & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\[3pt]\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nIn either case, maximum likelihood is also used to estimate the $\\lambda$ parameter. \n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\\hat{\\lambda}$ is within some range of zero (say between $\\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\\hat{\\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\\hat{\\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\n\nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\\lambda$. 
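To sketch how the $\lambda$ estimate itself can be obtained, the code below maximizes the Box-Cox profile log-likelihood (the normal log-likelihood of the transformed values plus the Jacobian term) over a bounded interval, in line with the advice above to keep the estimate within a suitable range. This is only an illustration; in practice an existing implementation (for example, a recipes step or `MASS::boxcox()`) would be used, and `ames_train` is assumed to be the chapter's training set.

```r
# Profile log-likelihood for the Box-Cox transformation of a positive vector.
box_cox_loglik <- function(lambda, x) {
  n <- length(x)
  z <- if (abs(lambda) < 1e-6) log(x) else (x^lambda - 1) / lambda
  sigma2 <- mean((z - mean(z))^2)
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

# Maximize over a bounded range to avoid divergent estimates; the result
# should be close to the 0.15 value quoted above for the lot area.
optimize(box_cox_loglik, interval = c(-3, 3), x = ames_train$Lot_Area,
         maximum = TRUE)$maximum
```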
\n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Modulus | $$x^* = \\begin{cases} sign(x)\\lambda^{-1}\\left[(|x|+1)^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\nsign(x) \\log{(|x|+1)} &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @john1980alternative |\n| Bickel-Docksum | $$x^* = \\lambda^{-1}\\left[sign(x)|x| - 1\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @bickel1981analysis |\n| Glog / Gpower | $$x^* = \\begin{cases} \\lambda^{-1}\\left[({x+ \\sqrt{x^2+1}})^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\n\\log({x+ \\sqrt{x^2+1}}) &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Neglog | $$x^* = sign(x) \\log{(|x|+1)}$$ | @whittaker2005neglog |\n| Dual | $$x^* = (2\\lambda)^{-1}\\left[x^\\lambda - x^{-\\lambda}\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @yang2006modified |\n\nExamples of other families of transformations for dense numeric predictors. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. 
See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. \n\nFirst is centering and scaling. To convert to a common scale, the mean ($\\bar{x}$) and standard deviation ($\\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \\bar{x}) / \\hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) are the same — only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. 
This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sum^{P}_{j=1} x_{ij}^2}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", + "markdown": "---\nknitr:\n opts_chunk:\n cache.path: \"../_cache/transformations/\"\n---\n\n\n# Transforming Numeric Predictors {#sec-numeric-predictors}\n\n\n\n\n\n\n\nAs mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. \n\nWe'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter mostly focuses on transformations that leave the predictors \"in place\" but altered. \n\n\n## When are transformations estimated and applied? \n\nThe next few chapters concern preprocessing and feature engineering tools that mostly affect the predictors. 
As previously noted, the training set data are used to estimate parameters; this is also true for preprocessing parameters. All of these computations use the training set. At no point do we re-estimate parameters when new data are encountered. \n\nFor example, a standardization tool that centers and scales the data is introduced in the next section. The mean and standard deviation are computed from the training set for each column being standardized. When the training set, test set, or any future data are standardized, it uses these statistics derived from the training set. Any model fit that uses these standardized predictors would expect new samples being predicted to have the same reference distribution. \n\nSuppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (since there is a single value and we would divide by zero). \n\n## General Transformations\n\nMany transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. \n\nSome transformations are based on convention or scientific knowledge. Others, such as the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or the logit, are traditionally used for proportions.\n\nTo start, we'll consider two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). \n\nAfter these, an example of a _group_ transformation is described. \n\n### Resolving skewness\n\nThe skew of a distribution can be quantified using the skewness statistic: \n\n$$\\begin{align}\n skewness &= \\frac{1}{(n-1)v^{3/2}} \\sum_{i=1}^n (x_i-\\overline{x})^3 \\notag \\\\\n \\text{where}\\quad v &= \\frac{1}{(n-1)}\\sum_{i=1}^n (x_i-\\overline{x})^2 \\notag\n\\end{align}\n$$\nwhere values near zero indicate a symmetric distribution, positive values correspond to a right skew, and negative values to a left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of 13.5). There are 2 samples in the training set that sit far beyond the mainstream of the data. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area for houses in Ames, IA. 
The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations.](../figures/fig-ames-lot-area-1.svg){#fig-ames-lot-area fig-align='center' width=80%}\n:::\n:::\n\n\n\nOne might infer that \"samples far beyond the mainstream of the data\" is synonymous with the term \"outlier\"; The Cambridge dictionary defines an outlier as\n\n> a person, thing, or fact that is very different from other people, things, or facts [...]\n\nor \n\n> a place that is far from the main part of something\n\nThese statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources.\n\nThe @nist describes them as \n\n> an observation that lies an abnormal distance from other values in a random sample from a population\n\nIn our experience, researchers are quick to label (and discard) extreme data points as outliers. Often, especially when the sample size is not large, these data points are not abnormal but belong to a highly skewed distribution. They are ordinary in a distributional sense. That is the most likely case here; some houses in Ames have very large lot areas, but they certainly fall under the definition of \"houses in Ames, Iowa.\" These values are genuine, just extreme.\n\nThis, by itself, is okay. However, suppose that this column is used in a calculation that involves squaring values, such as Euclidean distance or the sample variance. Extreme values in a skewed distribution can influence some predictive models and cause them to place more emphasis on these predictors^[The field of robust techniques is predicated on making statistical calculations insensitive to these types of data points.]. When the predictor is left in its original form, the extreme samples can end up degrading a model's predictive performance.\n\nOne way to resolve skewness is to apply a transformation that makes the data more symmetric. There are several methods to do this. The first is to use a standard transformation, such as logarithmic or the square root, the latter being a better choice when the skewness is not drastic, and the data contains zeros. A simple visualization of the data can be enough to make this choice. The problem is when there are many numeric predictors; it may be inefficient to visually inspect each predictor to make a subjective judgment on what if any, transformation function to apply. \n\n@Box1964p3648 defined a power family of transformations that use a single parameter, $\\lambda$, for different methods: \n\n:::: {.columns}\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"40%\"}\n- no transformation via $\\lambda = 1.0$\n- square ($x^2$) via $\\lambda = 2.0$\n- logarithmic ($\\log{x}$) via $\\lambda = 0.0$\n:::\n\n::: {.column width=\"40%\"}\n- square root ($\\sqrt{x}$) via $\\lambda = 0.5$\n- inverse square root ($1/\\sqrt{x}$) via $\\lambda = -0.5$\n- inverse ($1/x$) via $\\lambda = -1.0$\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::::\n\nand others in between. The transformed version of the variable is:\n\n$$\nx^* =\n\\begin{cases} \\lambda^{-1}(x^\\lambda-1) & \\text{if $\\lambda \\ne 0$,}\n\\\\[3pt]\nlog(x) &\\text{if $\\lambda = 0$.}\n\\end{cases}\n$$\n\nTheir paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\\lambda$ that minimizes the residual sums of squared errors. 
In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: \n\n$$\nx^* =\n\\begin{cases}\n\\lambda^{-1}\\left[(x + 1)^\\lambda-1\\right] & \\text{if $\\lambda \\ne 0$ and $x \\ge 0$,} \\\\[3pt]\nlog(x + 1) &\\text{if $\\lambda = 0$ and $x \\ge 0$.} \\\\[3pt]\n-(2 - \\lambda)^{-1}\\left[(-x + 1)^{2 - \\lambda}-1\\right] & \\text{if $\\lambda \\ne 2$ and $x < 0$,} \\\\[3pt]\n-log(-x + 1) &\\text{if $\\lambda = 2$ and $x < 0$.} \n\\end{cases}\n$$\n\nIn either case, maximum likelihood is also used to estimate the $\\lambda$ parameter. \n\nIn practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\\hat{\\lambda}$ is within some range of zero (say between $\\pm 0.01$)^[If you've never seen it, the \"hat\" notation (e.g. $\\hat{\\lambda}$) indicates an estimate of some unknown parameter.]. \n\nFor the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\\hat{\\lambda} = 0.15$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of 0.114 (much closer to zero). However, there are still outlying points.\n\n\nThere are numerous other transformations that attempt to make the distribution of a variable more Gaussian. @tbl-transforms shows several more, most of which are indexed by a transformation parameter $\\lambda$. \n\n\n:::: {.columns}\n\n::: {.column width=\"15%\"}\n:::\n\n::: {.column width=\"70%\"}\n\n::: {#tbl-transforms}\n\n| Name | Equation | Source |\n|------------------|:--------------------------------------------------------------:|:----------------------:|\n| Modulus | $$x^* = \\begin{cases} sign(x)\\lambda^{-1}\\left[(|x|+1)^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\nsign(x) \\log{(|x|+1)} &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @john1980alternative |\n| Bickel-Docksum | $$x^* = \\lambda^{-1}\\left[sign(x)|x| - 1\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @bickel1981analysis |\n| Glog / Gpower | $$x^* = \\begin{cases} \\lambda^{-1}\\left[({x+ \\sqrt{x^2+1}})^\\lambda-1\\right] & \\text{if $\\lambda \\neq 0$,}\\\\[3pt]\n\\log({x+ \\sqrt{x^2+1}}) &\\text{if $\\lambda = 0$}\n\\end{cases}$$ | @durbin2002variance, @kelmansky2013new |\n| Neglog | $$x^* = sign(x) \\log{(|x|+1)}$$ | @whittaker2005neglog |\n| Dual | $$x^* = (2\\lambda)^{-1}\\left[x^\\lambda - x^{-\\lambda}\\right]\\quad\\text{if $\\lambda \\neq 0$}$$ | @yang2006modified |\n\nExamples of other families of transformations for dense numeric predictors. \n\n:::\n \n:::\n\n::: {.column width=\"15%\"}\n:::\n\n:::: \n \nSkewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. 
For example, for the original lot area data, the 0.1 percentile is 4,726 square feet, which means that 10{{< pct >}} of the training set has lot areas less than 4,726 square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively.\n\nNumeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.\n\nAdditionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where \"normalization\" literally maps the data to a standard normal distribution. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one.\n \nIn @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. \n \n### Standardizing to a common scale \n\nAnother goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between 1872 and 2010. Another, the number of bathrooms, ranges from 0 to 5. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale.\n\nThe previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. \n\nFirst is centering and scaling (as previously mentioned). To convert to a common scale, the mean ($\\bar{x}$) and standard deviation ($\\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \\bar{x}) / \\hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. \n\nIn the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. 
Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. \n\n@fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![The original gross living area data and two standardized versions.](../figures/fig-standardization-1.svg){#fig-standardization fig-align='center' width=92%}\n:::\n:::\n\n\nAnother common approach is range standardization. Based on the training set, a predictor's minimum and maximum values are computed, and the data are transformed to a `[0, 1]` scale via\n\n$$\nx^* = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}\n$$\n\nWhen new data are outside the training set range, they can either be clipped to zero/one or allowed to go slightly beyond the intended range. The nice feature of this approach is that the range of the raw numeric predictors matches the range of any indicator variables created from previously categorical predictors. However, this does not imply that the distributional properties are the same (e.g., mean and variance) across predictors. Whether this is an issue depends on the model being used downstream. @fig-standardization(c) shows the result when the gross living area predictor is range transformed. Notice that the shape of the distributions across panels (a), (b), and (c) is the same — only the scale of the x-axis changes.\n\n### Spatial Sign {#sec-spatial-sign}\n\nSome transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: \n\n$$\nx^*_{ij}=\\frac{x_{ij}}{\\sqrt{\\sum^{p}_{j=1} x_{ij}^2}}\n$$\n\nNotice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. \n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least 29 samples appear farther away from most of the data in Lot Area, Gross Living Area, or both. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. \n\nThe second panel of the figure shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. 
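As a minimal base-R sketch of the computation (assuming the chapter's `ames_train` data and illustrative column names), the spatial sign can be obtained by standardizing the predictors and then dividing each row by its Euclidean norm so that every sample has unit length.

```r
# Standardize the predictors, then scale each row to unit length.
X <- scale(as.matrix(ames_train[, c("Lot_Area", "Gr_Liv_Area")]))
X_sp_sign <- X / sqrt(rowSums(X^2))

# Every sample now sits on the unit circle/hypersphere:
summary(rowSums(X_sp_sign^2))
```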
\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Lot area (x) versus gross living area (y) in raw format as well as with order-norm and spatial sign transformations.](../figures/fig-ames-lot-living-area-1.svg){#fig-ames-lot-living-area fig-align='center' width=100%}\n:::\n:::\n\n\nThe panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. \n\n## Feature Extraction and Embeddings\n\n\n### Linear Projection Methods {#sec-linear-feature-extraction}\n\nspatial sign for robustness\n\n\n### Nonlinear Techniques {#sec-nonlinear-feature-extraction}\n\n\n\n## Chapter References {.unnumbered}\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index cca82b3..dccfd36 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -35,14 +35,16 @@ source("../R/setup_ames.R") As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. -We'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors "in place" but altered. +We'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter mostly focuses on transformations that leave the predictors "in place" but altered. ## When are transformations estimated and applied? -and using what data +The next few chapters concern preprocessing and feature engineering tools that mostly affect the predictors. As previously noted, the training set data are used to estimate parameters; this is also true for preprocessing parameters. All of these computations use the training set. At no point do we re-estimate parameters when new data are encountered. -note about not re-estimating; use a single data point and scaling as an example. +For example, a standardization tool that centers and scales the data is introduced in the next section. The mean and standard deviation are computed from the training set for each column being standardized. When the training set, test set, or any future data are standardized, it uses these statistics derived from the training set. Any model fit that uses these standardized predictors would want new samples being predicted to have the same reference distribution. 
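The sketch below (plain R, with arbitrary object names) shows the pattern being described: the statistics are computed once from the training set and then reused, unchanged, for any new data. With a training mean of 5.0 and standard deviation of 1.0, a new value of 3.7 becomes (3.7 - 5.0)/1.0 = -1.3; the range standardization discussed later in the chapter follows the same pattern.

```r
# Estimate the preprocessing statistics from the training set only...
train_mean  <- mean(ames_train$Lot_Area)
train_sd    <- sd(ames_train$Lot_Area)
train_range <- range(ames_train$Lot_Area)

# ...and reuse them verbatim for any new sample.
standardize   <- function(x) (x - train_mean) / train_sd
to_unit_range <- function(x) {
  pmin(pmax((x - train_range[1]) / diff(train_range), 0), 1)  # clip to [0, 1]
}

new_lot_area <- c(5000, 250000)   # hypothetical new samples
standardize(new_lot_area)
to_unit_range(new_lot_area)
```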
+ +Suppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (since there is a single value and we would divide by zero). ## General Transformations @@ -114,7 +116,7 @@ lot_area_pctl <- bake(new_data = NULL) %>% ggplot(aes(Lot_Area)) + geom_rug(alpha = 1 / 2, length = unit(0.04, "npc"), linewidth = 1.2) + - geom_histogram(bins = 20, col = "white", fill = "#8E195C", alpha = 1 / 2) + + geom_histogram(binwidth = 0.05, col = "white", fill = "#8E195C", alpha = 1 / 2) + labs(x = "Lot Area", title = "(c) percentile") ``` @@ -250,7 +252,7 @@ Another goal for transforming individual predictors is to convert them to a comm The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. -First is centering and scaling. To convert to a common scale, the mean ($\bar{x}$) and standard deviation ($\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \bar{x}) / \hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. +First is centering and scaling (as previously mentioned). To convert to a common scale, the mean ($\bar{x}$) and standard deviation ($\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \bar{x}) / \hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. In the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. 
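A toy example (with an invented prevalence) makes the point about binary indicator columns: after centering and scaling there are still exactly two distinct values, and dividing by two standard deviations, as suggested by @twosd, is a one-line variant.

```r
# A hypothetical 0/1 indicator with 25% ones.
ind <- rep(c(0, 1), times = c(75, 25))

centered_scaled <- (ind - mean(ind)) / sd(ind)
unique(round(centered_scaled, 3))  # two values: one negative, one positive

# If predictors are only being scaled, divide indicators by two SDs instead:
ind / (2 * sd(ind))
```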
From 82e31f7352e04fa6a3edba358dd85316e827ff15 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Mon, 11 Dec 2023 08:29:53 -0500 Subject: [PATCH 07/10] update references with links --- chapters/numeric-predictors.qmd | 2 +- includes/references.bib | 5003 ------------------------------ includes/references_linked.bib | 131 + includes/references_original.bib | 132 + 4 files changed, 264 insertions(+), 5004 deletions(-) delete mode 100644 includes/references.bib diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index dccfd36..26e3cbe 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -184,7 +184,7 @@ log(x) &\text{if $\lambda = 0$.} \end{cases} $$ -Their paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in the same manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: +Their paper defines this as a supervised transformation of a non-negative outcome ($y$) in a linear regression model. They find a value of $\lambda$ that minimizes the residual sums of squared errors. In our case, we can co-opt this method to use for unsupervised transformations of non-negative predictors (in a similar manner as @asar2017estimating). @yeojohnson extend this method by allowing the data to be negative via a slightly different transformation: $$ x^* = diff --git a/includes/references.bib b/includes/references.bib deleted file mode 100644 index 674b3b3..0000000 --- a/includes/references.bib +++ /dev/null 
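For the Box-Cox/Yeo-Johnson discussion in the qmd hunk above, here is a hedged sketch of the unsupervised, predictor-only usage with the recipes package (the `ames_train`/`ames_test` objects and `Sale_Price` outcome are assumptions): a separate $\lambda$ is estimated for each selected column from the training set and then held fixed when new data are processed.

```r
library(recipes)

yj_rec <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  prep()   # lambda estimated per column from the training set

tidy(yj_rec, number = 1)             # inspect the estimated lambda values
bake(yj_rec, new_data = ames_test)   # apply the fixed lambdas to new data
```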
and Cocchi, M.}, - Booktitle = {3D QSAR in Drug Design}, - Editor = {Kubinyi, H.}, - Pages = {523-550}, - Publisher = {Kluwer Academic Publishers}, - Title = {PLS-Partial Least-Squares Projections to Latent Structures}, - Volume = {1}, - Year = {1993} -} - - -@Inproceedings{Wold1995, - Address = {Weinheim}, - Author = {Wold, S}, - Booktitle = {Chemometric Methods in Molecular Design}, - Editor = {van de Waterbeemd, Han}, - Pages = {195-218}, - Publisher = {VCH}, - Title = {PLS for Multivariate Linear Modeling}, - Year = {1995} -} - - -@Book{KohonenSOM1995, - Author = {Kohonen, Teuvo}, - Publisher = {Springer}, - Title = {Self-Organizing Maps}, - Year = {1995} -} - - -@Article{Melssen2006wx, - Author = {Melssen, W. and Wehrens, R and Buydens, L.}, - Journal = {Chemometrics and Intelligent Laboratory Systems}, - Number = {2}, - Pages = {99-113}, - Title = {Supervised Kohonen Networks for Classification Problems}, - Volume = {83}, - Year = {2006} -} - - -@Book{TLarose2006vf, - Author = {Larose, Danel}, - Publisher = {Wiley}, - Title = {Data Mining Methods and Models}, - Year = {2006} -} - - -@Techreport{HastiePr90, - Author = {Trevor Hastie and Daryl Pregibon}, - Institution = {AT and T Bell Laboratories Technical Report}, - Title = {Shrinking Trees}, - Year = {1990} -} - - -@Inproceedings{Drummond2000uo, - Author = {Drummond, C and Holte, R.}, - Booktitle = {Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, - Pages = {198-207}, - Title = {Explicitly Representing Expected Cost: An Alternative to ROC Representation}, - Year = {2000} -} - - -@Article{Altman1994uv, - Author = {Altman, D and Bland, J}, - Journal = {British Medical Journal}, - Number = {6948}, - Pages = {188}, - Title = {Diagnostic Tests 3: Receiver Operating Characteristic Plots.}, - Volume = {309}, - Year = {1994} -} - - -@Article{McClish, - Author = {D McClish}, - Journal = {Medical Decision Making}, - Pages = {190-195}, - Title = {Analyzing a Portion of the ROC Curve}, - Volume = {9}, - Year = {1989} -} - - -@Article{Hand2001us, - Author = {Hand, D and Till, R}, - Journal = {Machine Learning}, - Number = {2}, - Pages = {171-186}, - Title = {A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems}, - Volume = {45}, - Year = {2001} -} - - -@Article{21414208, - Author = {Robin, Xavier and Turck, Natacha and Hainard, Alexandre and Tiberti, Natalia and Lisacek, Frederique and Sanchez, Jean-Charles and Muller, Markus}, - Journal = {BMC Bioinformatics}, - Number = {1}, - Pages = {77}, - Title = {{pROC}: an open-source package for R and {S+} to analyze and compare ROC curves}, - Volume = {12}, - Year = {2011} -} - - -@Article{Cohen1960, - Author = {Cohen, J}, - Journal = {Educational and Psychological Measurement}, - Pages = {37-46}, - Title = {A Coefficient of Agreement for Nominal Data}, - Volume = {20}, - Year = {1960} -} - - -@Article{Hall2004, - Author = {Hall, P and Hyndman, R and Fan, Y}, - Journal = {Biometrika}, - Pages = {743-750}, - Title = {Nonparametric Confidence Intervals for Receiver Operating Characteristic Curves}, - Volume = {91}, - Year = {2004} -} - - -@Book{Agresti2002vi, - Author = {Agresti, Alan}, - Publisher = {Wiley-Interscience}, - Title = {Categorical Data Analysis}, - Year = {2002} -} - - -@Article{delong1988com, - Author = {DeLong, E and DeLong, D and Clarke-Pearson, D}, - Journal = {Biometrics}, - Number = {3}, - Pages = {837-45}, - Title = {Comparing the Areas Under Two Or More Correlated Receiver Operating 
Characteristic Curves: A Nonparametric Approach.}, - Volume = {44}, - Year = {1988}} -} - - -@Article{Venkatraman2000ek, - Author = {Venkatraman, E}, - Journal = {Biometrics}, - Number = {4}, - Pages = {1134-1138}, - Title = {A Permutation Test to Compare Receiver Operating Characteristic Curves}, - Volume = {56}, - Year = {2000} -} - - -@Article{Hanley1982uc, - Author = {Hanley, J and McNeil, B}, - Journal = {Radiology}, - Number = {1}, - Pages = {29-36}, - Title = {The Meaning and Use of the Area under a Receiver Operating (ROC) Curvel Characteristic}, - Volume = {143}, - Year = {1982} -} - - -@Article{Pepe, - Author = {Pepe, Margaret S. and Longton, Gary and Janes, Holly}, - Journal = {Stata Journal}, - Number = {1}, - Pages = {1-16}, - Title = {Estimation and Comparison of Receiver Operating Characteristic Curves}, - Volume = {9}, - Year = {2009} -} - - -@Inproceedings{Lachiche2003vp, - Author = {Lachiche, N. and Flach, P.}, - Booktitle = {Proceedings of the Twentieth International Conference on Machine Learning}, - Pages = {416-424}, - Title = {Improving Accuracy and Cost of Two-Class and Multi-Class Probabilistic Classifiers using ROC Curves}, - Volume = {20}, - Year = {2003} -} - - -@Article{Li2008hh, - Author = {Li, Jialiang and Fine, Jason P.}, - Journal = {Biostatistics}, - Month = {}, - Number = {3}, - Pages = {566-576}, - Title = {ROC Analysis with Multiple Classes and Multiple Tests: Methodology and Its Application in Microarray Studies}, - Volume = {9}, - Year = {2008} -} - - -@Article{Fisher1936, - Author = {Fisher, R.A.}, - Journal = {Annals of Eugenics}, - Number = {2}, - Pages = {179-188}, - Title = {The Use of Multiple Measurements in Taxonomic Problems}, - Volume = {7}, - Year = {1936} -} - - -@Article{Welch1939, - Author = {Welch, B}, - Journal = {Biometrika}, - Pages = {218-220}, - Title = {Note on Discriminant Functions}, - Volume = {31}, - Year = {1939} -} - - -@Article{Berntsson1986, - Author = {Berntsson, P. and Wold, S.}, - Journal = {Quantitative Structure-Activity Relationships}, - Pages = {45-50}, - Title = {Comparison Between X-ray Crystallographic Data and Physiochemical Parameters with Respect to Their Information About the Calcium Channel Antagonist Activity of 4-Phenyl-1,4-Dihydropyridines}, - Volume = {5}, - Year = {1986} -} - - -@Incollection{Dunn1990, - Author = {Dunn, W and Wold, S.}, - Booktitle = {Comprehensive Medicinal Chemistry}, - Editor = {Hansch, C. and Sammes, P.G. 
and Taylor, J.B.}, - Pages = {691-714}, - Publisher = {Pergamon Press, Oxford}, - Title = {Pattern Recognition Techniques in Drug Design}, - Year = {1990} -} - - -@Techreport{Siegelwk, - Author = {Siegel, E}, - Institution = {Prediction Impact Inc.}, - Title = {Uplift Modeling: Predictive Analytics Can't Optimize Marketing Decisions Without It}, - Year = {2011} -} - - -@Article{Quinlan1996vq, - Author = {Quinlan, R}, - Journal = {Journal of Artificial Intelligence Research}, - Pages = {77-90}, - Title = {Improved use of continuous attributes in {C4.5}}, - Volume = {4}, - Year = {1996} -} - - -@Article{Holte1993wl, - Author = {Holte, R}, - Journal = {Machine Learning}, - Number = {1}, - Pages = {63-90}, - Title = {Very Simple Classification Rules Perform Well On Most Commonly Used Datasets}, - Volume = {11}, - Year = {1993} -} - - -@Article{Cohen1995ti, - Author = {Cohen, W}, - Journal = {Proceedings of the Twelfth International Conference on Machine Learning}, - Pages = {115-123}, - Title = {Fast Effective Rule Induction}, - Year = {1995} -} - - -@Article{Frank1998uy, - Author = {Frank, E and Witten, I}, - Journal = {Proceedings of the Fifteenth International Conference on Machine Learning}, - Pages = {144-151}, - Title = {Generating Accurate Rule Sets Without Global Optimization}, - Year = {1998} -} - - -@Article{Liu2007b, - Author = {Liu, Y and Rayens, W}, - Journal = {Computational Statistics}, - Pages = {189-208}, - Title = {PLS and Dimension Reduction for Classification}, - Year = {2007} -} - - -@Manual{earthManual, - Author = {Stephen Milborrow}, - Publisher = {CRAN}, - Title = {Notes On the {earth} Package}, - Year = {2012}, - Url = {http://cran.r-project.org/package=earth} -} - - -@Book{gams, - Author = {Hastie, T. and Tibshirani, R.}, - Publisher = {Chapman \& Hall/CRC}, - Title = {Generalized Additive Models}, - Year = {1990} -} - - -@Techreport{Osuna1997to, - Author = {Osuna, E and Freund, R. and Girosi, F.}, - Institution = {MIT Artificial Intelligence Laboratory}, - Title = {Support Vector Machines: Training and Applications}, - Year = {1997} -} - - -@Article{Hastie1996, - Author = {Hastie, T and Tibshirani, R}, - Journal = {Journal of the Royal Statistical Society. Series B}, - Pages = {155-176}, - Title = {Discriminant Analysis by Gaussian Mixtures}, - Year = {1996} -} - - -@Article{Dempster1977, - Author = {Dempster, A and Laird, N and Rubin, D}, - Journal = {Journal of the Royal Statistical Society. Series B}, - Pages = {1-38}, - Title = {Maximum Likelihood from Incomplete Data via the EM Algorithm}, - Year = {1977} -} - - -@Article{Castaldi2011hq, - Author = {Castaldi, P and Dahabreh, I and Ioannidis, J}, - Journal = {Briefings in Bioinformatics}, - Number = {3}, - Pages = {189-202}, - Title = {An Empirical Assessment of Validation Practices for Molecular Classifiers}, - Volume = {12}, - Year = {2011} -} - - -@Article{Dupuy2007tv, - Author = {Dupuy, A. 
and Simon, R}, - Journal = {Journal of the National Cancer Institute}, - Number = {2}, - Pages = {147-157}, - Title = {Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting}, - Volume = {99}, - Year = {2007} -} - - -@Article{Varma2006ch, - Author = {Varma, Sudhir and Simon, Richard}, - Journal = {BMC Bioinformatics}, - Number = {1}, - Pages = {91}, - Title = {Bias in Error Estimation When Using Cross-Validation for Model Selection}, - Volume = {7}, - Year = {2006} -} - - -@Article{Boulesteix2009is, - Author = {Boulesteix, A and Strobl, C}, - Journal = {BMC Medical Research Methodology}, - Number = {1}, - Pages = {85}, - Title = {Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction}, - Volume = {9}, - Year = {2009} -} - - -@Article{Kline2005il, - Author = {Kline, Douglas M and Berardi, Victor L}, - Journal = {Neural Computing and Applications}, - Number = {4}, - Pages = {310-318}, - Title = {Revisiting Squared-Error and Cross-Entropy Functions for Training Neural Network Classifiers}, - Volume = {14}, - Year = {2005} -} - - -@Inproceedings{Bridle1990, - Author = {Bridle, J}, - Booktitle = {Neurocomputing: Algorithms, Architectures and Applications}, - Pages = {227-236}, - Publisher = {Springer-Verlag}, - Title = {Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition}, - Year = {1990} -} - - -@Incollection {NonparametricDensityEstimation, - Author = {Hardle, Wolfgang and Werwatz, Axel and M\"{u}ller, Marlene and Sperlich, Stefan and Härdle, Wolfgang and Werwatz, Axel and M\"{u}ller, Marlene and Sperlich, Stefan}, - Booktitle = {Nonparametric and Semiparametric Models}, - Pages = {39-83}, - Publisher = {Springer Berlin Heidelberg}, - Title = {Nonparametric Density Estimation}, - Year = {2004} -} - - -@Book{VRmass, - Author = {Venables, W. and B. Ripley}, - Publisher = {Springer}, - Title = {Modern Applied Statistics with {S}}, - Year = {2002} -} - - -@Article{Breiman1998p123, - Author = {Breiman, L}, - Journal = {The Annals of Statistics}, - Pages = {123-140}, - Title = {Arcing Classifiers}, - Volume = {26}, - Year = {1998} -} - - -@Article{Freund1996p148, - Author = {Freund, Y. and Schapire, R}, - Journal = {Machine Learning: Proceedings of the Thirteenth International Conference}, - Pages = {148-156}, - Title = {Experiments with a New Boosting Algorithm}, - Year = {1996} -} - - -@Article{Bauer1999p105, - Author = {Bauer, E. and Kohavi, R.}, - Journal = {Machine Learning}, - Pages = {105-142}, - Title = {An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants}, - Volume = {36}, - Year = {1999} -} - - -@Incollection{Johnson2007p7, - Author = {Johnson, K. and Rayens, W}, - Booktitle = {Pharmaceutical Statistics Using {SAS}: A Practical Guide}, - Editor = {Dmitrienko, A. and Chuang-Stein, C. 
and D'Agostino, R.}, - Pages = {7-43}, - Publisher = {Cary, NC: SAS Institute Inc.}, - Title = {Modern Classification Methods for Drug Discovery}, - Year = {2007} -} - - -@Article{RossQuinlan1989wg, - Author = {Quinlan, R and Rivest, R}, - Journal = {Information and computation}, - Number = {3}, - Pages = {227-248}, - Title = {Inferring Decision Trees Using the Minimum Description Length Principle}, - Volume = {80}, - Year = {1989} -} - - -@Article{Chan2004wv, - Author = {Chan, K and Loh, W}, - Journal = {Journal of Computational and Graphical Statistics}, - Number = {4}, - Pages = {826-852}, - Title = {LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees}, - Volume = {13}, - Year = {2004} -} - - -@Article{Frank1998wg, - Author = {Frank, E and Wang, Y and Inglis, S and Holmes, G}, - Journal = {Machine Learning}, - Title = {Using Model Trees for Classification}, - Year = {1998} -} - - -@Article{menze2011oblique, - Author = {Menze, B. and Kelm, B. and Splitthoff, D. and Koethe, U. and Hamprecht, F.}, - Journal = {Machine Learning and Knowledge Discovery in Databases}, - Pages = {453-469}, - Publisher = {Springer}, - Title = {On Oblique Random Forests}, - Year = {2011} -} - - -@Article{Shannon1948wk, - Author = {Shannon, C}, - Journal = {The Bell System Technical Journal}, - Month = {}, - Number = {3}, - Pages = {379-423}, - Title = {A Mathematical Theory of Communication}, - Volume = {27}, - Year = {1948} -} - - -@Article{Molinaro2010eo, - Author = {Molinaro, A and Lostritto, K and Van Der Laan, M}, - Journal = {Bioinformatics}, - Month = {}, - Number = {10}, - Pages = {1357-1363}, - Title = {{partDSA}: Deletion/Substitution/Addition Algorithm for Partitioning the Covariate Space in Prediction}, - Volume = {26}, - Year = {2010} -} - - -@Techreport{evtree, - Author = {Thomas Grubinger and Achim Zeileis and Karl-Peter Pfeiffer}, - Institution = {Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universitat Innsbruck}, - Month = {October}, - Number = {2011-20}, - Title = {{evtree}: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R}, - Type = {Working Paper}, - Year = {2011} -} - - -@Article{Ruczinski2003it, - Author = {Ruczinski, Ingo and Kooperberg, Charles and Leblanc, Michael}, - Journal = {Journal of Computational and Graphical Statistics}, - Month = {}, - Number = {3}, - Pages = {475-511}, - Title = {Logic Regression}, - Volume = {12}, - Year = {2003} -} - - -@Article{Ozuysal2010wy, - Author = {Ozuysal, M. and Calonder, M. and Lepetit, V. 
and Fua, P.}, - Journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, - Number = {3}, - Pages = {448-461}, - Title = {Fast Keypoint Recognition Using Random Ferns}, - Volume = {32}, - Year = {2010} -} - - -@Book{Dillon1984, - Address = {New York}, - Author = {Dillon, W and Goldstein, M.}, - Publisher = {Wiley}, - Title = {Multivariate Analysis: Methods and Applications}, - Year = {1984} -} - - -@Book{Wallace2005ck, - Author = {Wallace, C}, - Month = {}, - Publisher = {Springer-Verlag}, - Title = {Statistical and Inductive Inference by Minimum Message Length}, - Year = {2005} -} - - -@Book{Cover2006ub, - Author = {Cover, Thomas M and Thomas, Joy A}, - Publisher = {Wiley-Interscience}, - Title = {Elements of Information Theory}, - Year = {2006} -} - - -@Inproceedings{Quinlan1996uf, - Author = {Quinlan, R}, - Booktitle = {In Proceedings of the Thirteenth National Conference on Artificial Intelligence}, - Title = {Bagging, Boosting, and {C4.5}}, - Year = {1996} -} - - -@Inproceedings{Kohavi1996uf, - Author = {Kohavi, R}, - Booktitle = {Proceedings of the second international conference on knowledge discovery and data mining}, - Title = {Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid}, - Volume = {7}, - Year = {1996} -} - - -@Article{Youdencs, - Author = {Youden, W}, - Journal = {Cancer}, - Number = {1}, - Pages = {32-35}, - Title = {Index for Rating Diagnostic Tests}, - Volume = {3}, - Year = {1950} -} - - -@Article{Ewald2006uc, - Author = {Ewald, B.}, - Journal = {Journal of clinical epidemiology}, - Number = {8}, - Pages = {798-801}, - Title = {Post Hoc Choice of Cut Points Introduced Bias to Diagnostic Research}, - Volume = {59}, - Year = {2006} -} - - -@Article{Veropoulos1999wc, - Author = {Veropoulos, K. and Campbell, C and Cristianini, N}, - Journal = {Proceedings of the International Joint Conference on Artificial Intelligence}, - Pages = {55-60}, - Title = {Controlling the Sensitivity of Support Vector Machines}, - Volume = {1999}, - Year = {1999} -} - - -@Techreport{Therneau1997va, - Author = {Therneau, T and Atkinson, E}, - Institution = {Mayo Foundation}, - Number = {61}, - Pages = {1-52}, - Title = {An Introduction to Recursive Partitioning using the {rpart} Routines}, - Year = {1997} -} - - -@Article{Chawla2002ty, - Author = {Chawla, N and Bowyer, K and Hall, L and Kegelmeyer, W}, - Journal = {Journal of Artificial Intelligence Research}, - Number = {1}, - Pages = {321-357}, - Title = {SMOTE: Synthetic Minority Over-Sampling Technique}, - Volume = {16}, - Year = {2002} -} - - -@Inproceedings{Ling1998ua, - Author = {Ling, C and Li, C}, - Booktitle = {Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining}, - Pages = {73-79}, - Title = {Data Mining for Direct Marketing: Problems and solutions}, - Year = {1998} -} - - -@Article{VanDerPutten2004wf, - Author = {Van Der Putten, P. 
and Van Someren, M}, - Journal = {Machine Learning}, - Number = {1}, - Pages = {177-195}, - Title = {A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000}, - Volume = {57}, - Year = {2004} -} - - -@Article{Batista2004wi, - Author = {Batista, G and Prati, R and Monard, M}, - Journal = {ACM SIGKDD Explorations Newsletter}, - Number = {1}, - Pages = {20-29}, - Title = {A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data}, - Volume = {6}, - Year = {2004} -} - - -@Article{Burez2009ta, - Author = {Burez, J and Van den Poel, D}, - Journal = {Expert Systems with Applications}, - Number = {3}, - Pages = {4626-4636}, - Title = {Handling Class Imbalance In Customer Churn Prediction}, - Volume = {36}, - Year = {2009} -} - - -@Article{Jeatrakul2010td, - Author = {Jeatrakul, P and Wong, K and Fung, C}, - Journal = {Neural Information Processing. Models and Applications}, - Pages = {152-159}, - Title = {Classification of Imbalanced Data By Combining the Complementary Neural Network and SMOTE Algorithm}, - Year = {2010} -} - - -@Techreport{Weiss2001wo, - Author = {Weiss, G and Provost, F}, - Institution = {Department of Computer Science, Rutgers University}, - Number = {ML-TR-44}, - Title = {The Effect of Class Distribution On Classifier Learning: An Empirical Study}, - Year = {2001} -} - - -@Inproceedings{VanHulse2007tg, - Author = {Van Hulse, J and Khoshgoftaar, T and Napolitano, A}, - Booktitle = {Proceedings of the 24$^{th}$ International Conference On Machine learning}, - Pages = {935-942}, - Title = {Experimental Perspectives On Learning From Imbalanced Data}, - Year = {2007} -} - - -@Inproceedings{Richardson2007uu, - Author = {Richardson, M. and Dominowska, E. and Ragno, R.}, - Booktitle = {Proceedings of the 16$^{th}$ International Conference on the World Wide Web}, - Pages = {521-530}, - Title = {Predicting Clicks: Estimating the Click-Through Rate for New Ads}, - Year = {2007} -} - - -@Article{Artis2002tq, - Author = {Artis, Manuel and Ayuso, Mercedes and Guillen, Montserrat}, - Journal = {The Journal of Risk and Insurance}, - Number = {3}, - Pages = {325-340}, - Title = {Detection of Automobile Insurance Fraud with Discrete Choice Models and Misclassified Claims}, - Volume = {69}, - Year = {2002} -} - - -@Book{johnson2001applied, - Author = {Johnson, R and Wichern, D}, - Publisher = {Prentice Hall}, - Title = {Applied Multivariate Statistical Analysis}, - Year = {2001} -} - - -@Article{Ting2002wn, - Author = {Ting, K}, - Journal = {IEEE Transactions on Knowledge and Data Engineering}, - Number = {3}, - Pages = {659-665}, - Title = {An Instance-Weighting Method to Induce Cost-Sensitive Trees}, - Volume = {14}, - Year = {2002} -} - - -@Article{Cleveland1988ci, - Author = {Cleveland, W and Devlin, S}, - Journal = {Journal of the American Statistical Association}, - Pages = {596-610}, - Title = {Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting}, - Year = {1988} -} - - -@Article{Bland2000ur, - Author = {Bland, J and Altman, D}, - Journal = {British Medical Journal}, - Number = {7247}, - Pages = {1468}, - Title = {The Odds Ratio}, - Volume = {320}, - Year = {2000} -} - - -@Book{good00, - Publisher = {}, - Title = {}, - Year = {2000} -} - - -@Article{RobnikSikonja1997ui, - Author = {Robnik-Sikonja, M and Kononenko, I}, - Journal = {Proceedings of the Fourteenth International Conference on Machine Learning}, - Pages = {296-304}, - Title = {An Adaptation of {Relief} for Attribute Estimation in Regression}, 
- Year = {1997} -} - - -@Article{type3, - Author = {Kimball, A}, - Journal = {Journal of the American Statistical Association}, - Pages = {133-142}, - Title = {Errors of the Third Kind in Statistical Consulting}, - Volume = {52}, - Year = {1957} -} - - -@Article{Lo2002ub, - Author = {Lo, V}, - Journal = {ACM SIGKDD Explorations Newsletter}, - Number = {2}, - Pages = {78-86}, - Title = {The True Lift Model: A Novel Data Mining Approach To Response Modeling in Database Marketing}, - Volume = {4}, - Year = {2002} -} - - -@Techreport{Radcliffe2011vt, - Author = {Radcliffe, N and Surry, P}, - Institution = {Stochastic Solutions}, - Title = {Real-World Uplift Modelling With Significance-Based Uplift Trees}, - Year = {2011} -} - - -@Article{Rzepakowskivl, - Author = {Rzepakowski, P. and Jaroszewicz, S.}, - Journal = {Journal of Telecommunications and Information Technology}, - Pages = {43-50}, - Title = {Uplift Modeling in Direct Marketing}, - Volume = {2}, - Year = {2012} -} - - -@Article{deLeon2006bj, - Author = {de Leon, Mony and Klunk, William}, - Journal = {The Lancet Neurology}, - Month = {}, - Number = {3}, - Pages = {198-199}, - Title = {Biomarkers for the Early Diagnosis of Alzheimer's Disease}, - Volume = {5}, - Year = {2006} -} - - -@Article{Hampel2010va, - Author = {Hampel, H and Frank, R and Broich, K and Teipel, S and Katz, R and Hardy, J and Herholz, K and Bokde, A and Jessen, F and Hoessler, Y}, - Journal = {Nature Reviews Drug Discovery}, - Number = {7}, - Pages = {560-574}, - Title = {Biomarkers for Alzheimer's Disease: Academic, Industry and Regulatory Perspectives}, - Volume = {9}, - Year = {2010} -} - - -@Article{Bu2009kg, - Author = {Bu, Guojun}, - Journal = {Nature Reviews Neuroscience}, - Month = {}, - Number = {5}, - Pages = {333-344}, - Title = {Apolipoprotein E and Its Receptors in Alzheimer's Disease: Pathways, Pathogenesis and Therapy}, - Volume = {10}, - Year = {2009} -} - - -@Article{Guyon2003uc, - Author = {Guyon, I and Elisseeff, A}, - Journal = {The Journal of Machine Learning Research}, - Pages = {1157-1182}, - Title = {An Introduction to Variable and Feature Selection}, - Volume = {3}, - Year = {2003} -} - - -@Article{Saeys2007wc, - Author = {Saeys, Y and Inza, I and Larranaga, P}, - Journal = {Bioinformatics}, - Number = {19}, - Pages = {2507-–2517}, - Title = {A Review of Feature Selection Techniques in Bioinformatics}, - Volume = {23}, - Year = {2007} -} - - -@Article{Gauge, - Author = {Montgomery, D and Runger, G}, - Journal = {Quality Engineering}, - Number = {1}, - Pages = {115-135}, - Title = {Gauge Capability and Designed Experiments. 
Part I: Basic Methods}, - Volume = {6}, - Year = {1993} -} - - -@Book{davison1983, - Author = {Davison, M}, - Publisher = {John Wiley and Sons, Inc.}, - Title = {Multidimensional Scaling}, - Year = {1983} -} - - -@Article{Olden2000uf, - Author = {Olden, J and Jackson, D}, - Journal = {Ecoscience}, - Number = {4}, - Pages = {501-510}, - Title = {Torturing Data for the Sake of Generality: How Valid Are Our Regression Models?}, - Volume = {7}, - Year = {2000} -} - - -@Article{Derksen1992wm, - Author = {Derksen, S and Keselman, H}, - Journal = {British Journal of Mathematical and Statistical Psychology}, - Number = {2}, - Pages = {265-282}, - Title = {Backward, Forward and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables}, - Volume = {45}, - Year = {1992} -} - - -@Article{Whittingham2006vq, - Author = {Whittingham, M and Stephens, P and Bradbury, R and Freckleton, R}, - Journal = {Journal of Animal Ecology}, - Number = {5}, - Pages = {1182-1189}, - Title = {Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?}, - Volume = {75}, - Year = {2006} -} - - -@Article{Hall1997wp, - Author = {Hall, M and Smith, L}, - Journal = {International Conference on Neural Information Processing and Intelligent Information Systems}, - Pages = {855-858}, - Title = {Feature Subset Selection: A Correlation Based Filter Approach}, - Year = {1997} -} - - -@Article{Henderson1981wx, - Author = {Henderson, H and Velleman, P}, - Journal = {Biometrics}, - Pages = {391-411}, - Title = {Building Multiple Regression Models Interactively}, - Year = {1981} -} - - -@Article{Schmidberger2009tx, - Author = {Schmidberger, M and Morgan, M and Eddelbuettel, D and Yu, H and Tierney, L and Mansmann, U}, - Journal = {Journal of Statistical Software}, - Number = {1}, - Title = {State-of-the-Art in Parallel Computing with R}, - Volume = {31}, - Year = {2009} -} - - -@Article{Hand2006vl, - Author = {Hand, D}, - Journal = {Statistical Science}, - Number = {1}, - Pages = {1-15}, - Title = {Classifier Technology and the Illusion of Progress}, - Volume = {21}, - Year = {2006} -} - - -@Book{Holland1975, - Address = {Ann Arbor, MI}, - Author = {Holland, J}, - Publisher = {University of Michigan Press}, - Title = {Adaptation in Natural and Artificial Systems}, - Year = {1975} -} - - -@Book{Goldberg1989, - Address = {Boston}, - Author = {Goldberg, D}, - Publisher = {Addison-Wesley}, - Title = {Genetic Algorithms in Search, Optimization, and Machine Learning}, - Year = {1989} -} - - -@Book{Holland1992, - Address = {Cambridge, MA}, - Author = {Holland, J}, - Publisher = {MIT Press}, - Title = {Adaptation in Natural and Artificial Systems}, - Year = {1992} -} - - -@Article{Lavine2002p161, - Author = {Lavine, B and Davidson, C and Moores, A}, - Journal = {Chemometrics and Intelligent Laboratory Systems}, - Number = {1}, - Pages = {161-171}, - Title = {Innovative Genetic Algorithms for Chemoinformatics}, - Volume = {60}, - Year = {2002} -} - - -@Article{Bhanu2003p591, - Author = {Bhanu, B and Lin, Y}, - Journal = {Image and Vision Computing}, - Pages = {591-608}, - Title = {Genetic Algorithm Based Feature Selection for Target Detection in SAR Images}, - Volume = {21}, - Year = {2003} -} - - -@Article{Min2006p652, - Author = {Min, S and Lee, J and Han, I}, - Journal = {Expert Systems with Applications}, - Number = {3}, - Pages = {652-660}, - Title = {Hybrid Genetic Algorithms and Support Vector Machines for Bankruptcy Prediction}, - Volume = {31}, - Year = {2006} -} - - 
-@Article{Huang2012p65, - Author = {Huang, C and Chang, B and Cheng, D and Chang, C}, - Journal = {International Journal of Fuzzy Systems}, - Number = {1}, - Pages = {65-75}, - Title = {Feature Selection and Parameter Optimization of a Fuzzy-Based Stock Selection Model Using Genetic Algorithms}, - Volume = {14}, - Year = {2012} -} - - -@Book{Armitage1994, - Address = {Oxford}, - Author = {Armitage, P and Berry, G}, - Edition = {3rd}, - Publisher = {Blackwell Scientific Pubilcations}, - Title = {Statistical Methods in Medical Research}, - Year = {1994} -} - - -@Article{Bland1995cv, - Author = {Bland, J and Altman, D}, - Journal = {British Medical Journal}, - Month = {}, - Number = {6973}, - Pages = {170-170}, - Title = {Statistics Notes: Multiple Significance Tests: The Bonferroni Method}, - Volume = {310}, - Year = {1995} -} - - -@Article{Breiman1996vz, - Author = {Breiman, L}, - Journal = {Machine Learning}, - Number = {1}, - Pages = {41-47}, - Title = {Technical Note: Some Properties of Splitting Criteria}, - Volume = {24}, - Year = {1996} -} - - -@Article{Efron1997tc, - Author = {Efron, B and Tibshirani, R}, - Journal = {Journal of the American Statistical Association}, - Number = {438}, - Pages = {548-560}, - Title = {Improvements on Cross-Validation: The 632+ Bootstrap Method}, - Volume = {92}, - Year = {1997} -} - - -@Inproceedings{Netzeva2005wc, - Author = {Netzeva, T and Worth, A and Aldenberg, T and Benigni, R and Cronin, M and Gramatica, P and Jaworska, J and Kahn, S and Klopman, G and Marchant, C}, - Booktitle = {The Report and Recommendations of European Centre for the Validation of Alternative Methods Workshop 52}, - Number = {2}, - Pages = {1-19}, - Title = {Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships}, - Volume = {33}, - Year = {2005} -} - - -@Article{Kansy1998p1007, - Author = {Kansy, M. and Senner, F. 
and Gubernator, K.}, - Journal = {Journal of Medicinal Chemistry}, - Pages = {1007-1010}, - Title = {Physiochemical High Throughput Screening: Parallel Artificial Membrane Permeation Assay in the Description of Passive Absorption Processes}, - Volume = {41}, - Year = {1998} -} - - -@Article{Duhigg2012, - Author = {Duhigg, C.}, - Journal = {The New York Times}, - Month = {Feb 16}, - Title = {How Companies Learn Your Secrets}, - Year = {2012}, - Url = {http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html} -} - - - -@Article{Ambroise2002p1493, - Author = {C Ambroise and G McLachlan}, - Journal = {Proceedings of the National Academy of Sciences}, - Number = {10}, - Pages = {6562-6566}, - Title = {Selectiaon Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data}, - Volume = {99}, - Year = {2002} -} - - -@Article{Molinaro2005p47, - Author = {A Molinaro}, - Journal = {Bioinformatics}, - Number = {15}, - Pages = {3301-3307}, - Title = {Prediction Error Estimation: A Comparison of Resampling Methods}, - Volume = {21}, - Year = {2005} -} - - -@Article{Martin1996p52, - Author = {J Martin and D Hirschberg}, - Journal = {Department of Informatics and Computer Science Technical Report}, - Title = {Small Sample Statistics for Classification Error Rates {I}: Error Rate Measurements}, - Year = {1996} -} - - -@Article{Hawkins2003p2906, - Author = {D Hawkins and S Basak and D Mills}, - Journal = {Journal of Chemical Information and Computer Sciences}, - Number = {2}, - Pages = {579-586}, - Title = {Assessing Model Fit by Cross-Validation}, - Volume = {43}, - Year = {2003} -} - - -@Article{Willett1999p8, - Author = {P Willett}, - Journal = {Journal of Computational Biology}, - Month = {Jan}, - Number = {3}, - Pages = {447-457}, - Title = {Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds}, - Volume = {6}, - Year = {1999} -} - - -@Article{Clark1997p1352, - Author = {R Clark}, - Journal = {Journal of Chemical Information and Computer Sciences}, - Number = {6}, - Pages = {1181-1188}, - Title = {OptiSim: An extended dissimilarity delection method for finding diverse representative subsets}, - Volume = {37}, - Year = {1997} -} - - -@Article{Martin2012hr, - Author = {Martin, T and Harten, P and Young, D and Muratov, E and Golbraikh, A and Zhu, H and Tropsha, A}, - Journal = {Journal of Chemical Information and Modeling}, - Number = {10}, - Pages = {2570-2578}, - Title = {Does rational selection of training and test sets Improve the outcome of {QSAR} modeling?}, - Volume = {52}, - Year = {2012} -} - - -@Book{fes, - Author = {Kuhn, M and Johnson, K}, - Publisher = {CRC Press}, - Title = {{\href{https://bookdown.org/max/FES}{Feature Engineering and Selection}}: {A Practical Approach for Predictive Models}}, - Year = {2019} -} - - -@Article{gower, - Author = {J Gower}, - Journal = {Biometrics}, - Number = {4}, - Pages = {857-871}, - Publisher = {Wiley}, - Title = {A general coefficient of similarity and some of its properties}, - Volume = {27}, - Year = {1971} -} - - -@Article{bioinformaticsbtg484, - Author = {Baggerly, K and Morris, J and Coombes, K}, - Journal = {Bioinformatics}, - Month = {01}, - Number = {5}, - Pages = {777-785}, - Title = {Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments}, - Volume = {20}, - Year = {2004} -} - - -@Article{ames, - Author = {De Cock, D.}, - Journal = {Journal of Statistics Education}, - Number = {3}, - Publisher = {Taylor & Francis}, - Title = {{Ames, Iowa}: 
Alternative to the {Boston} housing data as an end of semester regression project}, - Volume = {19}, - Year = {2011} -} - - -@Book{tmwr, - Author = {M Kuhn and J Silge}, - Publisher = {O'Reilly Media, Inc.}, - Title = {\href{https://www.tmwr.org}{{Tidy Modeling with {R}}}}, - Year = {2022} -} - - -@Article{ruberg, - Author = {S Ruberg}, - Journal = {Journal of the American Statistical Association}, - Number = {407}, - Pages = {816-822}, - Title = {Contrasts for Identifying the Minimum Effective Dose}, - Volume = {84}, - Year = {1989} -} - - -@Article{estimable, - Author = {RK Elswick and C Gennings and V Chinchilli and K Dawson}, - Journal = {The American Statistician}, - Number = {1}, - Pages = {51-53}, - Title = {A Simple Approach for Finding Estimable Functions in Linear Models}, - Volume = {45}, - Year = {1991} -} - - -@Inproceedings{weinberger2009feature, - Author = {Weinberger, K and Dasgupta, A and Langford, J and Smola, A and Attenberg, J}, - Booktitle = {Proceedings of the 26th Annual International Conference on Machine Learning}, - Organization = {ACM}, - Pages = {1113-1120}, - Title = {Feature hashing for large scale multitask learning}, - Year = {2009} -} - - -@Article{MicciBarreca2001, - Author = {Micci-Barreca, D}, - Journal = {ACM SIGKDD Explorations Newsletter}, - Number = {1}, - Pages = {27-32}, - Title = {A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems}, - Volume = {3}, - Year = {2001} -} - - -@Article{Cerda2022, - Author = {Cerda, P and Varoquaux, G}, - Journal = {IEEE Transactions on Knowledge and Data Engineering}, - Number = {3}, - Pages = {1164-1176}, - Title = {Encoding High-Cardinality String Categorical Variables}, - Volume = {34}, - Year = {2022} -} - - -@Article{pargent2022regularized, - Author = {Pargent, F and Pfisterer, F and Thomas, J and Bischl, B}, - Journal = {Computational Statistics}, - Pages = {1-22}, - Publisher = {Springer}, - Title = {Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features}, - Year = {2022} -} - - -@Book{Rethinking, - Author = {R McElreath}, - Publisher = {Chapman and Hall/CRC}, - Title = {Statistical Rethinking: A {Bayesian} Course with Examples in {R} and {Stan}}, - Year = {2015} -} - - -@Book{kruschke2014doing, - Author = {Kruschke, J}, - Publisher = {Academic Press}, - Title = {Doing {Bayesian} Data Analysis: A tutorial with {R}, {JAGS}, and {Stan}}, - Year = {2014} -} - - -@Book{gelman1995bayesian, - Author = {Gelman, A and Carlin, J and Stern, H and Rubin, D}, - Publisher = {Chapman and Hall/CRC}, - Title = {Bayesian Data Analysis}, - Year = {2013} -} - - -@Article{bolker2009generalized, - Author = {Bolker, B and Brooks, M and Clark, C and Geange, S and Poulsen, J and Stevens, H and White, JS}, - Journal = {Trends in Ecology and Evolution}, - Number = {3}, - Pages = {127-135}, - Publisher = {Elsevier}, - Title = {Generalized linear mixed models: A practical guide for ecology and evolution}, - Volume = {24}, - Year = {2009} -} - - -@Book{stroup2012generalized, - Author = {Stroup, W}, - Publisher = {CRC press}, - Title = {Generalized linear mixed models: Modern concepts, methods and applications}, - Year = {2012} -} - - -@Book{BayesRules, - Author = {Johnson, A and Ott, M and Dogucu, M}, - Publisher = {Chapman and Hall/CRC}, - Title = {Bayes Rules!: An introduction to applied Bayesian modeling}, - Year = {2022} -} - - -@Book{EfronHastie2016a, - Author = {Efron, B and Hastie, T}, - Publisher = {Cambridge 
University Press}, - Title = {Computer age statistical inference}, - Year = {2016} -} - - -@Article{Provost1998p1185, - Author = {F Provost and T Fawcett and R Kohavi}, - Journal = {Proceedings of the Fifteenth International Conference on Machine Learning}, - Pages = {445-453}, - Title = {The Case Against Accuracy Estimation for Comparing Induction Algorithms}, - Year = {1998} -} - - -@Article{Brown2006wp, - Author = {Brown, C and Davis, H}, - Journal = {Chemometrics and Intelligent Laboratory Systems}, - Number = {1}, - Pages = {24-38}, - Title = {Receiver Operating Characteristics Curves and Related Decision Measures: A Tutorial}, - Volume = {80}, - Year = {2006} -} - - -@Article{Fawcett2006gr, - Author = {Fawcett, T}, - Journal = {Pattern Recognition Letters}, - Number = {8}, - Pages = {861-874}, - Title = {An Introduction to ROC Analysis}, - Volume = {27}, - Year = {2006} -} - - -@Article{GORODKIN2004367, - Author = {J. Gorodkin}, - Journal = {Computational Biology and Chemistry}, - Number = {5}, - Pages = {367-374}, - Title = {Comparing two {K}-category assignments by a {K}-category correlation coefficient}, - Volume = {28}, - Year = {2004} -} - - -@Article{MATTHEWS1975442, - Author = {B.W. Matthews}, - Journal = {Biochimica et Biophysica Acta (BBA) - Protein Structure}, - Number = {2}, - Pages = {442-451}, - Title = {Comparison of the predicted and observed secondary structure of T4 phage lysozyme}, - Volume = {405}, - Year = {1975} -} - - -@Article{bella2013effect, - Author = {Bella, A and Ferri, C and Hernandez-Orallo, J and Ramirez-Quintana, Maria J}, - Journal = {Applied intelligence}, - Number = {4}, - Pages = {566-585}, - Publisher = {Springer}, - Title = {On the effect of calibration in classifier combination}, - Volume = {38}, - Year = {2013} -} - -@Inproceedings{allikivi2019non, - Author = {Allikivi, M and Kull, M}, - Booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases}, - Organization = {Springer}, - Pages = {103-120}, - Title = {Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification}, - Year = {2019} -} - - -@Article{pava, - Author = {A Miriam and H Brunk and G Ewing and W Reid and E Silverman}, - Journal = {The Annals of Mathematical Statistics}, - Number = {4}, - Pages = {641-647}, - Publisher = {Institute of Mathematical Statistics}, - Title = {An Empirical Distribution Function for Sampling with Incomplete Information}, - Volume = {26}, - Year = {1955} -} - - -@Inproceedings{isotonic, - Address = {New York, NY, USA}, - Author = {Zadrozny, B and Elkan, C}, - Booktitle = {Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, - Pages = {694–699}, - Publisher = {Association for Computing Machinery}, - Title = {Transforming Classifier Scores into Accurate Multiclass Probability Estimates}, - Year = {2002}, - Numpages = {6} -} - - -@Inproceedings{betacal, - Author = {Kull, M and Filho, T and Flach, P}, - Booktitle = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics}, - Editor = {Singh, A and Zhu, J}, - Pages = {623-631}, - Series = {Proceedings of Machine Learning Research}, - Title = {Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers}, - Volume = {54}, - Year = {2017} -} - - -@Inproceedings{johansson21a, - Author = {Johansson, U and Lofstrom, T and Bostrom, H}, - Booktitle = {Proceedings of the Tenth Symposium on Conformal and 
Probabilistic Prediction and Applications}, - Editor = {Carlsson, L and Luo, Z and Cherubin, G and An Nguyen, K}, - Month = {Sep}, - Pages = {111-130}, - Series = {Proceedings of Machine Learning Research}, - Title = {Calibrating multi-class models}, - Volume = {152}, - Year = {2021} -} - - -@Inproceedings{NIPS2001abdbeb4d, - Author = {Zadrozny, B}, - Booktitle = {Advances in Neural Information Processing Systems}, - Editor = {T Dietterich and S Becker and Z Ghahramani}, - Pages = {}, - Publisher = {MIT Press}, - Title = {Reducing multiclass to binary by coupling probability estimates}, - Volume = {14}, - Year = {2001} -} - - -@Inproceedings{NEURIPS2021bbc92a64, - Author = {Zhao, S and Kim, M and Sahoo, R and Ma, T and Ermon, S}, - Booktitle = {Advances in Neural Information Processing Systems}, - Editor = {M Ranzato and A Beygelzimer and Y Dauphin and P Liang and J Wortman Vaughan}, - Pages = {22313-22324}, - Publisher = {Curran Associates, Inc.}, - Title = {Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration}, - Volume = {34}, - Year = {2021} -} - - -@Article{platt1999probabilistic, - Author = {Platt, J}, - Journal = {Advances in Large Margin Classifiers}, - Number = {3}, - Pages = {61-74}, - Publisher = {Cambridge, MA}, - Title = {Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods}, - Volume = {10}, - Year = {1999} -} - - -@Article{Piersma2004, - Author = {Piersma, AG and Genschow, E and Verhoef, A and Spanjersberg, M and Brown, N and Brady, M and Burns, A and Clemann, N and Seiler, A and Spielmann, H}, - Journal = {Alternatives to Laboratory Animals}, - Pages = {275-307}, - Title = {Validation of the Postimplantation Rat Whole-embryo Culture Test in the International ECVAM Validation Study on Three In Vitro Embryotoxicity Tests}, - Volume = {32}, - Year = {2004} -} - - -@Article{Gupta2006tn, - Author = {Gupta, S and Hanssens, D and Hardie, B and Kahn, W and Kumar, V and Lin, N and Ravishanker, N and Sriram, S}, - Journal = {Journal of Service Research}, - Number = {2}, - Pages = {139-155}, - Title = {Modeling Customer Lifetime Value}, - Volume = {9}, - Year = {2006} -} - - -@Article{brier1950verification, - Author = {Brier, G}, - Journal = {Monthly Weather Review}, - Number = {1}, - Pages = {1-3}, - Title = {Verification of forecasts expressed in terms of probability}, - Volume = {78}, - Year = {1950} -} - -@book{nist, - author = {{National Institute of Standards and Technology}}, - editor = {Croarkin, Carroll and Tobias, Paul and Filliben, James J. 
and Hembree, Barry and Guthrie, William and Trutna, Ledi and Prins, Jack}, - publisher = {NIST/SEMATECH}, - title = {{NIST/SEMATECH e-Handbook of Statistical Methods}}, - url = {http://www.itl.nist.gov/div898/handbook/}, - year = {2012} - } - -@article{yeojohnson, - author = {I Yeo and R Johnson}, - journal = {Biometrika}, - number = {4}, - pages = {954-959}, - title = {A New Family of Power Transformations to Improve Normality or Symmetry}, - volume = {87}, - year = {2000} - } - -@article{ORQ, - author = {R Peterson and J Cavanaugh}, - title = {Ordered quantile normalization: a semiparametric transformation built for the cross-validation era}, - journal = {Journal of Applied Statistics}, - volume = {47}, - number = {13-15}, - pages = {2312-2327}, - year = {2020}, - publisher = {Taylor & Francis} - } - -@article{twosd, - author = {Gelman, A}, - title = {Scaling regression inputs by dividing by two standard deviations}, - journal = {Statistics in Medicine}, - volume = {27}, - number = {15}, - pages = {2865-2873}, - year = {2008} - } - -@article{ruberg, - author = {S Ruberg}, - title = {Contrasts for Identifying the Minimum Effective Dose}, - journal = {Journal of the American Statistical Association}, - volume = {84}, - number = {407}, - pages = {816-822}, - year = {1989} - } - -@article{estimable, - author = { RK Elswick and C Gennings and V Chinchilli and K Dawson }, - title = {A Simple Approach for Finding Estimable Functions in Linear Models}, - journal = {The American Statistician}, - volume = {45}, - number = {1}, - pages = {51-53}, - year = {1991} - } - -@inproceedings{weinberger2009feature, - title={Feature hashing for large scale multitask learning}, - author={Weinberger, K and Dasgupta, A and Langford, J and Smola, A and Attenberg, J}, - booktitle={Proceedings of the 26th Annual International Conference on Machine Learning}, - pages={1113-1120}, - year={2009}, - organization={ACM} - } - -@article{MicciBarreca2001, - author = {Micci-Barreca, D}, - title = {A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems}, - journal = {ACM SIGKDD Explorations Newsletter}, - volume = {3}, - number = {1}, - year = {2001}, - pages = {27-32} - } - -@article{Cerda2022, - author={Cerda, P and Varoquaux, G}, - journal={IEEE Transactions on Knowledge and Data Engineering}, - title={Encoding High-Cardinality String Categorical Variables}, - year={2022}, - volume={34}, - number={3}, - pages={1164-1176}} - -@article{pargent2022regularized, - title={Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features}, - author={Pargent, F and Pfisterer, F and Thomas, J and Bischl, B}, - journal={Computational Statistics}, - pages={1-22}, - year={2022}, - publisher={Springer} - } - -@book{Rethinking, - title={Statistical Rethinking: A {Bayesian} Course with Examples in {R} and {Stan}}, - author={R McElreath}, - year={2015}, - publisher={Chapman and Hall/CRC} - } - -@book{kruschke2014doing, - title={Doing {Bayesian} Data Analysis: A tutorial with {R}, {JAGS}, and {Stan}}, - author={Kruschke, J}, - year={2014}, - publisher={Academic Press} - } - -@book{gelman1995bayesian, - title={Bayesian Data Analysis}, - author={Gelman, A and Carlin, J and Stern, H and Rubin, D}, - year={2013}, - publisher={Chapman and Hall/CRC} - } - -@article{bolker2009generalized, - title={Generalized linear mixed models: A practical guide for ecology and evolution}, - author={Bolker, B and Brooks, M and Clark, C and Geange, S 
and Poulsen, J and Stevens, H and White, JS}, - journal={Trends in Ecology and Evolution}, - volume={24}, - number={3}, - pages={127-135}, - year={2009}, - publisher={Elsevier} - } - -@book{stroup2012generalized, - title={Generalized linear mixed models: Modern concepts, methods and applications}, - author={Stroup, W}, - year={2012}, - publisher={CRC press} - } - -@book{BayesRules, - title={Bayes Rules!: An introduction to applied Bayesian modeling}, - author={Johnson, A and Ott, M and Dogucu, M}, - year={2022}, - publisher={Chapman and Hall/CRC} - } - -@Book{EfronHastie2016a, - author = {Efron, B and Hastie, T}, - title = {Computer age statistical inference}, - year = {2016}, - publisher = {Cambridge University Press} - } - -@article{Ambroise2002p1493, - Author = {C Ambroise and G McLachlan}, - Journal = {Proceedings of the National Academy of Sciences}, - Number = {10}, - Pages = {6562-6566}, - Title = {Selection bias in gene extraction on the basis of microarray gene-expression data}, - Volume = {99}, - Year = {2002}} - -@article{kennard1969computer, - title={Computer aided design of experiments}, - author={Kennard, R W and Stone, L A}, - journal={Technometrics}, - volume={11}, - number={1}, - pages={137-148}, - year={1969} -} - -@article{szekely2013energy, - title={Energy statistics: {A} class of statistics based on distances}, - author={Székely, G J and Rizzo, M L}, - journal={Journal of Statistical Planning and Inference}, - volume={143}, - number={8}, - pages={1249-1272}, - year={2013} -} - - -@article{vakayil2022data, - title={Data twinning}, - author={Vakayil, A and Joseph, V R}, - journal={Statistical Analysis and Data Mining: {The} ASA Data Science Journal}, - volume={15}, - number={5}, - pages={598-610}, - year={2022} -} - -@article{mahoney2023assessing, - title={Assessing the performance of spatial cross-validation approaches for models of spatially structured data}, - author={Mahoney, M J and Johnson, L K and Silge, J and Frick, H and Kuhn, M and Beier, C M}, - journal={arXiv}, - year={2023} -} - -@book{ruppert2003semiparametric, - title={Semiparametric Regression}, - author={Ruppert, D and Wand, M and Carroll, R}, - number={12}, - year={2003}, - publisher={Cambridge university press} -} - -@book{arnold2019computational, - title={A Computational Approach to Statistical Learning}, - author={Arnold, T and Kane, M and Lewis, B}, - year={2019}, - publisher={Chapman and Hall/CRC} -} - -@book{wood2006generalized, - title={Generalized Additive Models: An introduction with {R}}, - author={Wood, S}, - year={2006}, - publisher={chapman and Hall/CRC} -} - -@article{Breimansplines, - author = {L Breiman}, - journal = {Statistical Science}, - number = {4}, - pages = {442-445}, - publisher = {Institute of Mathematical Statistics}, - title = {Monotone regression splines in action (Comment)}, - volume = {3}, - year = {1988} -} - -@article{liu1999multivariate, - title={Multivariate analysis by data depth: {D}escriptive statistics, graphics and inference}, - author={Liu, R and Parelius, J and Singh, K}, - journal={The Annals of Statistics}, - volume={27}, - number={3}, - pages={783-858}, - year={1999} -} - -@inproceedings{tukey1975mathematics, - title={Mathematics and the Picturing of Data}, - author={Tukey, J}, - booktitle={{Proceedings of the International Congress of Mathematicians}}, - volume={2}, - pages={523-531}, - year={1975} -} - - -@article{tibshirani2003class, - title={Class prediction by nearest shrunken centroids, with applications to {DNA} microarrays}, - author={Tibshirani, R 
and Hastie, T and Narasimhan, B and Chu, G}, - journal={Statistical Science}, - pages={104-117}, - year={2003} -} - -@article{wangImprovedCentroids, - author = {Wang, S and Zhu, J}, - title = {Improved centroids estimation for the nearest shrunken centroid classifier}, - journal = {Bioinformatics}, - volume = {23}, - number = {8}, - pages = {972-979}, - year = {2007}, - month = {03} -} - -@article{efron2009empirical, - title={Empirical {Bayes} estimates for large-scale prediction problems}, - author={Efron, B}, - journal={Journal of the American Statistical Association}, - volume={104}, - number={487}, - pages={1015-1028}, - year={2009} -} - -@article{sinnott1984virtues, - title={Virtues of the {Haversine}}, - author={Sinnott, R}, - journal={Sky and Telescope}, - volume={68}, - number={2}, - pages={158}, - year={1984} -} - -@article{mozharovskyi2015classifying, - title={Classifying real-world data with the {DD}$\alpha$-procedure}, - author={Mozharovskyi, P and Mosler, K and Lange, T}, - journal={Advances in Data Analysis and Classification}, - volume={9}, - number={3}, - pages={287-314}, - year={2015}, - publisher={Springer} -} - -@article{ghosh2005data, - title={On data depth and distribution-free discriminant analysis using separating surfaces}, - author={Ghosh, A K and Chaudhuri, P}, - journal={Bernoulli}, - volume={11}, - number={1}, - pages={1-27}, - year={2005} -} - -@article{linCCC, - author = {L Lin}, - journal = {Biometrics}, - number = {1}, - pages = {255-268}, - title = {A concordance correlation coefficient to evaluate reproducibility}, - volume = {45}, - year = {1989} - } - -@ARTICLE{Buettner1997bt, -title = {Problems in defining cutoff points of continuous prognostic factors: {E}xample of tumor thickness in primary cutaneous melanoma}, -author = {Buettner, P and Garbe, C and Guggenmoos-Holzmann, I}, -journal = {Journal of Clinical Epidemiology}, -publisher = {Elsevier}, -volume = {50}, -number = {11}, -pages = {1201-1210}, -year = {1997} -} - - -@ARTICLE{Fernandes2019na, -title = {Why quantitative variables should not be recoded as categorical}, -author = {Fernandes, A and Malaquias, C and Figueiredo, D and da Rocha, E and Lins, R}, -journal = {Journal of applied mathematics and physics}, -volume = {7}, -number = {7}, -pages = {1519-1530}, -year = {2019} -} - -@ARTICLE{Kuss2013zi, -title = {The danger of dichotomizing continuous variables: A visualization}, -author = {Kuss, O}, -journal = {Teaching Statistics}, -volume = {35}, -number = {2}, -pages = {78-79}, -year = {2013} -} - -@ARTICLE{MacCallum2002ox, -title = {On the practice of dichotomization of quantitative variables}, -author = {MacCallum, R C and Zhang, S and Preacher, K J and Rucker, D}, -journal = {Psychological Methods}, -volume = {7}, -number = {1}, -pages = {19-40}, -year = {2002}, -} - -@ARTICLE{Naggara2011xu, -title = {Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms}, -author = {Naggara, O and Raymond, J and Guilbert, F and Roy, D and Weill, A and Altman, D G}, -journal = {AJNR. 
American Journal of Neuroradiology}, -volume = {32}, -number = {3}, -pages = {437-440}, -year = {2011} -} - -@ARTICLE{Altman1991ro, -title = {Categorising continuous variables}, -author = {Altman, D G}, -journal = {British Journal of Cancer}, -volume = {64}, -number = {5}, -pages = {975}, -year = {1991} -} - -@ARTICLE{VanWalraven2008ne, -title = {Leave 'em alone - {W}hy continuous variables should be analyzed as such}, -author = {van Walraven, C and Hart, R}, -journal = {Neuroepidemiology}, -volume = {30}, -number = {3}, -pages = {138-139}, -year = {2008} -} - -@ARTICLE{Owen2005do, -title = {Why carve up your continuous data?}, -author = {Owen, S and Froman, R}, -journal = {Research in Nursing and Health}, -volume = {28}, -number = {6}, -pages = {496-503}, -year = {2005} -} - -@ARTICLE{Maxwell1993ig, -title = {Bivariate median splits and spurious statistical significance}, -author = {Maxwell, S and Delaney, H}, -journal = {Psychological Bulletin}, -volume = {113}, -number = {1}, -pages = {181-190}, -year = {1993} -} - -@ARTICLE{Fedorov2009jy, -title = {Consequences of dichotomization}, -author = {Fedorov, V and Mannino, F and Zhang, R}, -journal = {Pharmaceutical Statistics}, -volume = {8}, -number = {1}, -pages = {50-61}, -year = {2009} -} - -@ARTICLE{Cohen1983mn, -title = {The Cost of dichotomization}, -author = {Cohen, J}, -journal = {Applied Psychological Measurement}, -volume = {7}, -number = {3}, -pages = {249-253}, -year = {1983} -} - -@ARTICLE{Bennette2012ua, -title = {Against quantiles: {C}ategorization of continuous variables in epidemiologic research, and its discontents}, -author = {Bennette, C and Vickers, A}, -journal = {BMC Medical Research Methodology}, -volume = {12}, -pages = {21}, -year = {2012} -} - -@ARTICLE{BarnwellMenard2015xa, -title = {Effects of categorization method, regression type, and variable distribution on the inflation of {Type-I} error rate when categorizing a confounding variable}, -author = {Barnwell-Menard, JL and Li, Q and Cohen, A}, -journal = {Statistics in Medicine}, -volume = {34}, -number = {6}, -pages = {936-949}, -year = {2015} -} - -@ARTICLE{Faraggi1996id, -title = {A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis}, -author = {Faraggi, D and Simon, R}, -journal = {Statistics in medicine}, -volume = {15}, -number = {20}, -pages = {2203-2213}, -year = {1996} -} - -@ARTICLE{Altman2006gn, -title = {The cost of dichotomising continuous variables}, -author = {Altman, D G and Royston, P}, -journal = {BMJ}, -volume = {332}, -number = {7549}, -pages = {1080}, -year = {2006} -} - -@ARTICLE{Altman1998vs, -title = {Suboptimal analysis using 'optimal' cutpoints}, -author = {Altman, D G}, -journal = {British Journal of Cancer}, -volume = {78}, -number = {4}, -pages = {556-557}, -year = {1998} -} - -@ARTICLE{Taylor2002jj, -title = {Bias and efficiency Loss due to categorizing an explanatory variable}, -author = {Taylor, J M G and Yu, M}, -journal = {Journal of Multivariate Analysis}, -volume = {83}, -number = {1}, -pages = {248-263}, -year = {2002} -} - -@ARTICLE{Altman1991ec, -title = {Categorising continuous variables}, -author = {Altman, D G}, -journal = {British Journal of Cancer}, -volume = {64}, -number = {5}, -pages = {975}, -year = {1991} -} - -@ARTICLE{Irwin2003mp, -title = {Negative consequences of dichotomizing continuous predictor variables}, -author = {Irwin, J R and McClelland, G H}, -journal = {Journal of Marketing Research}, -volume = {40}, -number = {3}, -pages = {366-371}, -year = 
{2003} -} - -@ARTICLE{Royston2006md, -title = {Dichotomizing continuous predictors in multiple regression: a bad idea}, -author = {Royston, P and Altman, D G and Sauerbrei, W}, -journal = {Statistics in Medicine}, -volume = {25}, -number = {1}, -pages = {127-141}, -year = {2006} -} - -@ARTICLE{Altman1994oa, -title = {Dangers of using "optimal" cutpoints in the evaluation of prognostic factors}, -author = {Altman, D G and Lausen, B and Sauerbrei, W and Schumacher, M}, -journal = {Journal of the National Cancer Institute}, -volume = {86}, -number = {11}, -pages = {829}, -year = {1994} -} - -@ARTICLE{Kenny2013cf, -title = {Inflation of correlation in the pursuit of drug-likeness}, -author = {Kenny, P W and Montanari, C A}, -journal = {Journal of Computer-Aided Molecular Design}, -volume = {27}, -number = {1}, -pages = {1-13}, -year = {2013} -} - -@article{harrell2017regression, - title={Regression Modeling Strategies}, - author={Harrell, FF}, - journal={Bios}, - volume={330}, - number={2018}, - pages={14}, - year={2017}, - publisher={Springer} -} - -@article{pettersson2016quantitative, - title={Quantitative assessment of the impact of fluorine substitution on {P-}glycoprotein {(P-gp)} mediated efflux, permeability, lipophilicity, and metabolic stability}, - author={Pettersson, M and Hou, XiXnjun and Kuhn, M and Wager, T T and Kauffman, G W and Verhoest, P R}, - journal={Journal of Medicinal Chemistry}, - volume={59}, - number={11}, - pages={5284-5296}, - year={2016} -} - -@article{Bone1992p2926, - Author = {R Bone and R Balk and F Cerra and R Dellinger and A Fein and W Knaus and R Schein and W Sibbald}, - title = {Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis}, - Journal = {Chest}, - Number = {6}, - Pages = {1644-1655}, - Volume = {101}, - Year = {1992}} - -@article{subsemble, - author = {S Sapp and M van der Laan and J Canny}, - title = {Subsemble: an ensemble method for combining subset-specific algorithm fits}, - journal = {Journal of Applied Statistics}, - volume = {41}, - number = {6}, - pages = {1247-1259}, - year = {2014} -} - -@article{ANTONIO201941, -title = {Hotel booking demand datasets}, -journal = {Data in Brief}, -volume = {22}, -pages = {41-49}, -year = {2019}, -issn = {2352-3409}, -author = {N Antonio and A {de Almeida} and L Nunes} -} - -@Online{tagrisso2023, - Author = {}, - Month = {}, - Title = {Tagrisso plus chemotherapy demonstrated strong improvement in progression-free survival for patients with EGFR-mutated advanced lung cancer in FLAURA2 Phase III trial}, - Year = {2023}, - Url = {https://www.astrazeneca.com/media-centre/press-releases/2023/tagrisso-plus-chemo-improved-pfs-in-lung-cancer.html} -} - -@article{altorki2021neoadjuvant, - Author = {Altorki, Nasser K and McGraw, Timothy E and Borczuk, Alain C and Saxena, Ashish and Port, Jeffrey L and Stiles, Brendon M and Lee, Benjamin E and Sanfilippo, Nicholas J and Scheff, Ronald J and Pua, Bradley B and others}, - Title = {Neoadjuvant durvalumab with or without stereotactic body radiotherapy in patients with early-stage non-small-cell lung cancer: a single-centre, randomised phase 2 trial}, - Journal = {The Lancet Oncology}, - Volume = {22}, - Number = {6}, - Pages = {824-835}, - Year = {2021}, - Publisher = {Elsevier} -} - -@article{singh2017suppressive, - title = {Suppressive drug combinations and their potential to combat antibiotic resistance}, - author = {Singh, Nina and Yeh, Pamela J}, - journal = {The Journal of Antibiotics}, - volume = {70}, - number = 
{11}, - pages = {1033-1042}, - year = {2017}, - publisher = {Nature Publishing Group} -} - -@article{mokhtari2017combination, - title = {Combination therapy in combating cancer}, - author = {Mokhtari, Reza Bayat and Homayouni, Tina S and Baluch, Narges and Morgatskaya, Evgeniya and Kumar, Sushil and Das, Bikul and Yeger, Herman}, - journal = {Oncotarget}, - volume = {8}, - number = {23}, - pages = {38022-38043}, - year = {2017}, - publisher = {Impact Journals, LLC} -} - - -@Book{oathbringer, - Author = {B Sanderson}, - Publisher = {Tor Books}, - Title = {Oathbringer}, - Year = {2017} -} - -@article{meijering2012cell, - title={Cell segmentation: 50 years down the road}, - author={Meijering, E}, - journal={IEEE Signal Processing Magazine}, - volume={29}, - number={5}, - pages={140-145}, - year={2012}, - publisher={IEEE} -} - -@book{hvitfeldt2021supervised, - title={\href{https://smltar.com}{{Supervised Machine Learning for Text Analysis in {R}}}}, - author={Hvitfeldt, E and Silge, J}, - year={2021}, - publisher={CRC Press} -} - -@book{arnold2019computational, - title={{A Computational Approach to Statistical Learning}}, - author={Arnold, T and Kane, M and Lewis, B}, - year={2019}, - publisher={CRC Press} -} - -@book{bishop2006pattern, - title={{Pattern Recognition and Machine Learning}}, - author={Bishop, C M and Nasrabadi, N M}, - volume={4}, - number={4}, - year={2006}, - publisher={Springer} -} - -@book{goodfellow2016deep, - title={\href{https://www.deeplearningbook.org}{{Deep Learning}}}, - author={Goodfellow, I and Bengio, Y and Courville, A}, - year={2016}, - publisher={MIT press} -} - -@Book{perkins2010, - author = {Perkins, D}, - title = {{Making Learning Whole}}, - year = {2010}, - publisher = {Wiley} - } - -@book{davison1997bootstrap, - title={{Bootstrap Methods and Their Application}}, - author={Davison, A and Hinkley, D}, - year={1997}, - publisher={Cambridge University Press} -} - - -@book{holmes2018modern, - title={\href{https://web.stanford.edu/class/bios221/book/}{{Modern Statistics for Modern Biology}}}, - author={Holmes, S and Huber, W}, - year={2018}, - publisher={Cambridge University Press} -} - -@misc{Yucells, -author = {Yu, W, and Lee, HK, and Hariharan, S, and Bu, WY and Ahmed, S }, -year = {2007}, % date taken from image metadata -title = {{\href{https://doi.org/doi:10.7295/W9CCDB6843}{CCDB:6843}}, mus musculus, Neuroblastoma.} -} - -@article{simonyan2014very, - title={Very deep convolutional networks for large-scale image recognition}, - author={Simonyan, K and Zisserman, A}, - journal={arXiv}, - year={2014} -} - -@article{mcelfresh2023neural, - title={When Do Neural Nets Outperform Boosted Trees on Tabular Data?}, - author={McElfresh, D and Khandagale, S and Valverde, J and Prasad, V and Ramakrishnan, G and Goldblum, M and White, C}, - journal={arXiv}, - year={2023} -} - -@book{apm, - title={{Applied Predictive Modeling}}, - author={Kuhn, M and Johnson, K}, - year={2013}, - publisher={Springer} -} - -@article{borisov2022deep, - title={Deep neural networks and tabular data: A survey}, - author={Borisov, V and Leemann, T and Se{\ss}ler, K and Haug, J and Pawelczyk, M and Kasneci, G}, - journal={IEEE Transactions on Neural Networks and Learning Systems}, - year={2022}, - publisher={IEEE} -} - -@book{ChemoinformaticsBook, -author = {Engel, T, and Gasteiger, J}, -address = {Weinheim}, -publisher = {Wiley-VCH}, -title = {{Chemoinformatics : A Textbook }}, -year = {2018}, -} - -@book{udl2023, - title={\href{https://udlbook.github.io/udlbook/}{{Understanding Deep 
Learning}}}, - author={Prince, S}, - year={2023}, - publisher={MIT press} -} - - -@article{kapoor2023leakage, - title={Leakage and the reproducibility crisis in machine-learning-based science}, - author={Kapoor, S and Narayanan, A}, - journal={Patterns}, - volume={4}, - number={9}, - year={2023}, - publisher={Elsevier} -} - -@article{kaufman2012leakage, - title={Leakage in data mining: Formulation, detection, and avoidance}, - author={Kaufman, S and Rosset, S and Perlich, C and Stitelman, O}, - journal={ACM Transactions on Knowledge Discovery from Data}, - volume={6}, - number={4}, - pages={1-21}, - year={2012}, - publisher={ACM New York, NY, USA} -} - -@article{whittaker2005neglog, - title={The neglog transformation and quantile regression for the analysis of a large credit scoring database}, - author={Whittaker, J and Whitehead, C and Somers, M}, - journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, - volume={54}, - number={5}, - pages={863-878}, - year={2005}, - publisher={Oxford University Press} -} - -@article{manly1976exponential, - title={Exponential data transformations}, - author={Manly, B}, - journal={Journal of the Royal Statistical Society Series {D}: The Statistician}, - volume={25}, - number={1}, - pages={37-42}, - year={1976}, - publisher={Oxford University Press} -} - -@article{feng2016note, - title={A note on automatic data transformation}, - author={Feng, Q and Hannig, J and Marron, JS}, - journal={Stat}, - volume={5}, - number={1}, - pages={82-87}, - year={2016}, - publisher={Wiley Online Library} -} - -@article{kelmansky2013new, - title={A new variance stabilizing transformation for gene expression data analysis}, - author={Kelmansky, D and Mart{\'\i}nez, E and Leiva, V}, - journal={Statistical Applications in Genetics and Molecular Biology}, - volume={12}, - number={6}, - pages={653-666}, - year={2013}, - publisher={De Gruyter} -} - -@article{durbin2002variance, - title={A variance-stabilizing transformation for gene-expression microarray data}, - author={Durbin, B and Hardin, J and Hawkins, D and Rocke, D}, - journal={Bioinformatics}, - volume={18}, - year={2002} -} - -@article{yang2006modified, - title={A modified family of power transformations}, - author={Yang, Z}, - journal={Economics Letters}, - volume={92}, - number={1}, - pages={14--19}, - year={2006}, - publisher={Elsevier} -} - -@article{bickel1981analysis, - title={An analysis of transformations revisited}, - author={Bickel, P and Doksum, K}, - journal={Journal of the American Statistical Association}, - volume={76}, - number={374}, - pages={296-311}, - year={1981}, - publisher={Taylor \& Francis} -} - -@article{asar2017estimating, - title={Estimating {Box-Cox} power transformation parameter via goodness-of-fit tests}, - author={Asar, O and Ilk, O and Dag, O}, - journal={Communications in Statistics-Simulation and Computation}, - volume={46}, - number={1}, - pages={91-105}, - year={2017}, - publisher={Taylor \& Francis} -} - -@article{john1980alternative, - title={An alternative family of transformations}, - author={John, J and Draper, N}, - journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, - volume={29}, - number={2}, - pages={190--197}, - year={1980}, - publisher={Oxford University Press} -} - -@software{Boykis_What_are_embeddings_2023, -author = {Boykis, V}, -doi = {10.5281/zenodo.8015029}, -month = jun, -title = {{What are embeddings?}}, -url = {https://github.com/veekaybee/what_are_embeddings}, -version = {1.0.1}, -year = {2023} -} - - diff 
--git a/includes/references_linked.bib b/includes/references_linked.bib index 2e611d8..8d682ae 100644 --- a/includes/references_linked.bib +++ b/includes/references_linked.bib @@ -1,3 +1,11 @@ +@Article{Box1964p3648, + author = {GEP Box and D Cox}, + journal = {Journal of the Royal Statistical Society. Series B (Methodological)}, + pages = {211-252}, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=An+Analysis+of+Transformations&as_ylo=1964&as_yhi=1964&btnG=}{An Analysis of Transformations}}, + year = {1964}, +} + @Article{kennard1969computer, title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Computer+aided+design+of+experiments&as_ylo=1969&as_yhi=1969&btnG=}{Computer aided design of experiments}}, author = {R W Kennard and L A Stone}, @@ -19,6 +27,28 @@ @Article{gower year = {1971}, } +@Article{john1980alternative, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=An+alternative+family+of+transformations&as_ylo=1980&as_yhi=1980&btnG=}{An alternative family of transformations}}, + author = {J John and N Draper}, + journal = {Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + volume = {29}, + number = {2}, + pages = {190--197}, + year = {1980}, + publisher = {Oxford University Press}, +} + +@Article{bickel1981analysis, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=An+analysis+of+transformations+revisited&as_ylo=1981&as_yhi=1981&btnG=}{An analysis of transformations revisited}}, + author = {P Bickel and K Doksum}, + journal = {Journal of the American Statistical Association}, + volume = {76}, + number = {374}, + pages = {296-311}, + year = {1981}, + publisher = {Taylor \& Francis}, +} + @Book{Bishop1995, address = {Oxford}, author = {C Bishop}, @@ -81,6 +111,16 @@ @Article{Willett1999p8 year = {1999}, } +@Article{yeojohnson, + author = {I Yeo and R Johnson}, + journal = {Biometrika}, + number = {4}, + pages = {954-959}, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=A+New+Family+of+Power+Transformations+to+Improve+Normality+or+Symmetry&as_ylo=2000&as_yhi=2000&btnG=}{A New Family of Power Transformations to Improve Normality or Symmetry}}, + volume = {87}, + year = {2000}, +} + @Article{Ambroise2002p1493, author = {C Ambroise and G McLachlan}, journal = {Proceedings of the National Academy of Sciences}, @@ -91,6 +131,14 @@ @Article{Ambroise2002p1493 year = {2002}, } +@Article{durbin2002variance, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=A+variance+stabilizing+transformation+for+gene+expression+microarray+data&as_ylo=2002&as_yhi=2002&btnG=}{A variance-stabilizing transformation for gene-expression microarray data}}, + author = {B Durbin and J Hardin and D Hawkins and D Rocke}, + journal = {Bioinformatics}, + volume = {18}, + year = {2002}, +} + @Article{Hawkins2003p2906, author = {D Hawkins and S Basak and D Mills}, journal = {Journal of Chemical Information and Computer Sciences}, @@ -129,6 +177,17 @@ @Article{Molinaro2005p47 year = {2005}, } +@Article{whittaker2005neglog, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=The+neglog+transformation+and+quantile+regression+for+the+analysis+of+a+large+credit+scoring+database&as_ylo=2005&as_yhi=2005&btnG=}{The neglog transformation and quantile regression for the analysis of a large credit scoring database}}, + author = {J Whittaker and C Whitehead and M Somers}, + journal = {Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + 
volume = {54}, + number = {5}, + pages = {863-878}, + year = {2005}, + publisher = {Oxford University Press}, +} + @Book{bishop2006pattern, title = {{Pattern Recognition and Machine Learning}}, author = {C M Bishop and N M Nasrabadi}, @@ -138,12 +197,43 @@ @Book{bishop2006pattern publisher = {Springer}, } +@Article{yang2006modified, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=A+modified+family+of+power+transformations&as_ylo=2006&as_yhi=2006&btnG=}{A modified family of power transformations}}, + author = {Z Yang}, + journal = {Economics Letters}, + volume = {92}, + number = {1}, + pages = {14--19}, + year = {2006}, + publisher = {Elsevier}, +} + +@Article{Serneels, + author = {S Serneels and E De Nolf and P Van Espen}, + journal = {Journal of Chemical Information and Modeling}, + number = {3}, + pages = {1402-1409}, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Spatial+Sign+Preprocessing+A+Simple+Way+to+Impart+Moderate+Robustness+to+Multivariate+Estimators&as_ylo=2006&as_yhi=2006&btnG=}{Spatial Sign Preprocessing: A Simple Way to Impart Moderate Robustness to Multivariate Estimators}}, + volume = {46}, + year = {2006}, +} + @Misc{Yucells, author = {W Yu and HK Lee and S Hariharan and WY Bu and S Ahmed}, year = {2007}, % date taken from image metadata title = {{\href{https://doi.org/doi:10.7295/W9CCDB6843}{CCDB:6843}}, mus musculus, Neuroblastoma.}, } +@Article{twosd, + author = {A Gelman}, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Scaling+regression+inputs+by+dividing+by+two+standard+deviations&as_ylo=2008&as_yhi=2008&btnG=}{Scaling regression inputs by dividing by two standard deviations}}, + journal = {Statistics in Medicine}, + volume = {27}, + number = {15}, + pages = {2865-2873}, + year = {2008}, +} + @Book{perkins2010, author = {D Perkins}, title = {{Making Learning Whole}}, @@ -182,6 +272,14 @@ @Article{Martin2012hr year = {2012}, } +@Book{nist, + editor = {C Croarkin and P Tobias and J Filliben and B Hembree and W Guthrie and L Trutna and J Prins}, + publisher = {NIST/SEMATECH}, + title = {{NIST/SEMATECH e-Handbook of Statistical Methods}}, + url = {http://www.itl.nist.gov/div898/handbook/}, + year = {2012}, +} + @Article{szekely2013energy, title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Energy+statistics+A+class+of+statistics+based+on+distances&as_ylo=2013&as_yhi=2013&btnG=}{Energy statistics: {A} class of statistics based on distances}}, author = {G J Székely and M L Rizzo}, @@ -199,6 +297,17 @@ @Book{apm publisher = {Springer}, } +@Article{kelmansky2013new, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=A+new+variance+stabilizing+transformation+for+gene+expression+data+analysis&as_ylo=2013&as_yhi=2013&btnG=}{A new variance stabilizing transformation for gene expression data analysis}}, + author = {D Kelmansky and E Martínez and V Leiva}, + journal = {Statistical Applications in Genetics and Molecular Biology}, + volume = {12}, + number = {6}, + pages = {653-666}, + year = {2013}, + publisher = {De Gruyter}, +} + @Article{simonyan2014very, title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Very+deep+convolutional+networks+for+large+scale+image+recognition&as_ylo=2014&as_yhi=2014&btnG=}{Very deep convolutional networks for large-scale image recognition}}, author = {K Simonyan and A Zisserman}, @@ -227,6 +336,17 @@ @Book{oathbringer year = {2017}, } +@Article{asar2017estimating, + title = 
{\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Estimating+Box+Cox+power+transformation+parameter+via+goodness+of+fit+tests&as_ylo=2017&as_yhi=2017&btnG=}{Estimating {Box-Cox} power transformation parameter via goodness-of-fit tests}}, + author = {O Asar and O Ilk and O Dag}, + journal = {Communications in Statistics-Simulation and Computation}, + volume = {46}, + number = {1}, + pages = {91-105}, + year = {2017}, + publisher = {Taylor \& Francis}, +} + @Book{ChemoinformaticsBook, author = {T Engel and J Gasteiger}, address = {Weinheim}, @@ -263,6 +383,17 @@ @Book{fes year = {2019}, } +@Article{ORQ, + author = {R Peterson and J Cavanaugh}, + title = {\href{https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Ordered+quantile+normalization+a+semiparametric+transformation+built+for+the+cross+validation+era&as_ylo=2020&as_yhi=2020&btnG=}{Ordered quantile normalization: a semiparametric transformation built for the cross-validation era}}, + journal = {Journal of Applied Statistics}, + volume = {47}, + number = {13-15}, + pages = {2312-2327}, + year = {2020}, + publisher = {Taylor & Francis}, +} + @Book{hvitfeldt2021supervised, title = {\href{https://smltar.com}{{Supervised Machine Learning for Text Analysis in {R}}}}, author = {E Hvitfeldt and J Silge}, diff --git a/includes/references_original.bib b/includes/references_original.bib index 358c5aa..7b21668 100644 --- a/includes/references_original.bib +++ b/includes/references_original.bib @@ -348,3 +348,135 @@ @unpublished{Boykis_What_are_embeddings_2023 note = {version 1.0.1}, year = {2023} } + +@book{nist, + editor = {Croarkin, C and Tobias, P and Filliben, J and Hembree, B and Guthrie, W and Trutna, L and Prins, J}, + publisher = {NIST/SEMATECH}, + title = {{NIST/SEMATECH e-Handbook of Statistical Methods}}, + url = {http://www.itl.nist.gov/div898/handbook/}, + year = {2012} +} + +@Article{Box1964p3648, + Author = {{GEP} Box and D Cox}, + Journal = {Journal of the Royal Statistical Society. 
Series B (Methodological)}, + Pages = {211-252}, + Title = {An Analysis of Transformations}, + Year = {1964} +} + +@article{asar2017estimating, + title={Estimating {Box-Cox} power transformation parameter via goodness-of-fit tests}, + author={Asar, O and Ilk, O and Dag, O}, + journal={Communications in Statistics-Simulation and Computation}, + volume={46}, + number={1}, + pages={91-105}, + year={2017}, + publisher={Taylor \& Francis} +} + +@article{yeojohnson, + author = {I Yeo and R Johnson}, + journal = {Biometrika}, + number = {4}, + pages = {954-959}, + title = {A New Family of Power Transformations to Improve Normality or Symmetry}, + volume = {87}, + year = {2000} + } + +@article{john1980alternative, + title={An alternative family of transformations}, + author={John, J and Draper, N}, + journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + volume={29}, + number={2}, + pages={190--197}, + year={1980}, + publisher={Oxford University Press} +} + +@article{bickel1981analysis, + title={An analysis of transformations revisited}, + author={Bickel, P and Doksum, K}, + journal={Journal of the American Statistical Association}, + volume={76}, + number={374}, + pages={296-311}, + year={1981}, + publisher={Taylor \& Francis} +} + +@article{durbin2002variance, + title={A variance-stabilizing transformation for gene-expression microarray data}, + author={Durbin, B and Hardin, J and Hawkins, D and Rocke, D}, + journal={Bioinformatics}, + volume={18}, + year={2002} +} + +@article{yang2006modified, + title={A modified family of power transformations}, + author={Yang, Z}, + journal={Economics Letters}, + volume={92}, + number={1}, + pages={14--19}, + year={2006}, + publisher={Elsevier} +} + +@article{whittaker2005neglog, + title={The neglog transformation and quantile regression for the analysis of a large credit scoring database}, + author={Whittaker, J and Whitehead, C and Somers, M}, + journal={Journal of the Royal Statistical Society Series {C}: Applied Statistics}, + volume={54}, + number={5}, + pages={863-878}, + year={2005}, + publisher={Oxford University Press} +} + +@article{kelmansky2013new, + title={A new variance stabilizing transformation for gene expression data analysis}, + author={Kelmansky, D and Mart{\'\i}nez, E and Leiva, V}, + journal={Statistical Applications in Genetics and Molecular Biology}, + volume={12}, + number={6}, + pages={653-666}, + year={2013}, + publisher={De Gruyter} +} + +@article{ORQ, + author = {R Peterson and J Cavanaugh}, + title = {Ordered quantile normalization: a semiparametric transformation built for the cross-validation era}, + journal = {Journal of Applied Statistics}, + volume = {47}, + number = {13-15}, + pages = {2312-2327}, + year = {2020}, + publisher = {Taylor & Francis} + } + +@article{twosd, + author = {Gelman, A}, + title = {Scaling regression inputs by dividing by two standard deviations}, + journal = {Statistics in Medicine}, + volume = {27}, + number = {15}, + pages = {2865-2873}, + year = {2008} + } + +@Article{Serneels, + Author = {S Serneels and E De Nolf and P Van Espen}, + Journal = {Journal of Chemical Information and Modeling}, + Number = {3}, + Pages = {1402-1409}, + Title = {Spatial Sign Preprocessing: A Simple Way to Impart Moderate Robustness to Multivariate Estimators}, + Volume = {46}, + Year = {2006}} +} + From b7575ac499aa498b4eeffb1180c110209e97acde Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Mon, 15 Jan 2024 12:16:21 -0500 Subject: [PATCH 08/10] Apply suggestions from code review suggestion 
from @krz --- chapters/numeric-predictors.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index 26e3cbe..b5efdf0 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -61,8 +61,8 @@ After these, an example of a _group_ transformation is described. The skew of a distribution can be quantified using the skewness statistic: $$\begin{align} - skewness &= \frac{1}{(n-1)v^{3/2}} \sum_{1=1}^n (x_i-\overline{x})^3 \notag \\ - \text{where}\quad v &= \frac{1}{(n-1)}\sum_{1=1}^n (x_i-\overline{x})^2 \notag + skewness &= \frac{1}{(n-1)v^{3/2}} \sum_{i=1}^n (x_i-\overline{x})^3 \notag \\ + \text{where}\quad v &= \frac{1}{(n-1)}\sum_{i=1}^n (x_i-\overline{x})^2 \notag \end{align} $$ where values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). There are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. From f11919041bbf862394e9269542fd1367b03920cb Mon Sep 17 00:00:00 2001 From: topepo Date: Sat, 3 Feb 2024 12:21:34 -0500 Subject: [PATCH 09/10] changes based on more review --- chapters/numeric-predictors.qmd | 64 ++++++++++++++------------------- 1 file changed, 27 insertions(+), 37 deletions(-) diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index b5efdf0..216e584 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -33,9 +33,9 @@ set_options() source("../R/setup_ames.R") ``` -As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to predict the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The previous chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. +As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to explin the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The next chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. -We'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter mostly focuses on transformations that leave the predictors "in place" but altered. +We'll begin with operations that only involve one predictor at a time before moving on to group transformations. Later, in @sec-add-remove-features, procedures on numeric predictors are described that create additional predictor columns from a single column, such as basis function expansions. However, this chapter focuses on transformations that leave the predictors "in place" but altered. 
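To make the skewness statistic corrected in the hunk above concrete, here is a minimal base-R sketch of the same calculation. This is an illustration only; the chapter's inline code uses `e1071::skewness()`, whose default moment convention divides by $n$ rather than $n-1$ and so can give slightly different values.

```r
# Sketch of the skewness statistic shown above (with the corrected i = 1 index).
skewness_stat <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  v <- sum((x - x_bar)^2) / (n - 1)          # the sample variance term
  sum((x - x_bar)^3) / ((n - 1) * v^(3 / 2)) # third central moment scaled by v^(3/2)
}

set.seed(1)
skewness_stat(rlnorm(1000))  # right-skewed data: a large positive value
skewness_stat(rnorm(1000))   # symmetric data: a value near zero
```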
## When are transformations estimated and applied? @@ -44,28 +44,31 @@ The next few chapters concern preprocessing and feature engineering tools that m For example, a standardization tool that centers and scales the data is introduced in the next section. The mean and standard deviation are computed from the training set for each column being standardized. When the training set, test set, or any future data are standardized, it uses these statistics derived from the training set. Any model fit that uses these standardized predictors would want new samples being predicted to have the same reference distribution. -Suppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (since there is a single value and we would divide by zero). +Suppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (which means we try to divide with a zero standard deviation). ## General Transformations -Many transformations that involve a single predictor change the distribution of the data. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. +Many transformations that involve a single predictor change the data distribution. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. -some based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit? +TODO some based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit? -To start, we'll consier two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). +To start, we'll consider two classes of transformations for individual predictors: those that resolve distributional skewness and those that convert each predictor to a common distribution (or scale). After these, an example of a _group_ transformation is described. -### Resolving skewness +### Resolving asymmetry and skewness -The skew of a distribution can be quantified using the skewness statistic: +An asymmetric statistical distribution is one in which the probability of a sample occurring is not symmetric around the center of the distribution (e.g., the mean). 
For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. There is a much higher likelihood of the lot area being lower than the mean (or median) lot size. There are fewer large lots than there are proportionally smaller lots. And, in a few cases, the lot sizes can be extremely large. + +The skew of a distribution indicates the direction and magnitude of the asymmetry. It can be quantified using the skewness statistic: $$\begin{align} skewness &= \frac{1}{(n-1)v^{3/2}} \sum_{i=1}^n (x_i-\overline{x})^3 \notag \\ \text{where}\quad v &= \frac{1}{(n-1)}\sum_{i=1}^n (x_i-\overline{x})^2 \notag \end{align} $$ -where values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. For example, @fig-ames-lot-area (panel a) shows the training set distribution of the lot area of houses in Ames. The data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). There are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. + +where values near zero indicate a symmetric distribution, positive values correspond a right skew, and negative values left skew. The lot size data are significantly right-skewed (with a skewness value of `r signif(e1071::skewness(ames_train$Lot_Area), 3)`). As previously mentioned, there are `r sum(ames_train$Lot_Area > 100000)` samples in the training set that sit far beyond the mainstream of the data. ```{r} #| label: ames-lot-area-calcs @@ -138,7 +141,7 @@ or > a place that is far from the main part of something -These statements imply that outliers belong to a different distribution than the bulk of the data, perhaps due to a typographical error or an incorrect merging of data sources. +These statements imply that outliers belong to a different distribution than the bulk of the data. For example, a typographical error or an incorrect merging of data sources could be the cause. The @nist describes them as @@ -160,11 +163,11 @@ One way to resolve skewness is to apply a transformation that makes the data mor ::: {.column width="40%"} - no transformation via $\lambda = 1.0$ - square ($x^2$) via $\lambda = 2.0$ -- logarithmic ($\log{x}$) via $\lambda = 0.0$ +- square root ($\sqrt{x}$) via $\lambda = 0.5$ ::: ::: {.column width="40%"} -- square root ($\sqrt{x}$) via $\lambda = 0.5$ +- logarithmic ($\log{x}$) via $\lambda = 0.0$ - inverse square root ($1/\sqrt{x}$) via $\lambda = -0.5$ - inverse ($1/x$) via $\lambda = -1.0$ ::: @@ -242,7 +245,7 @@ Skewness can also be resolved using techniques related to distributional percent Numeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate. -Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. 
@fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. +Additionally, the original predictor data can be coerced to a specific probability distribution. @ORQ define the Ordered Quantile (ORQ) normalization procedure. It estimates a transformation of the data to emulate the true normalizing function where "normalization" literally maps the data to a standard normal distribution. In other words, we can coerce the original distribution to a near exact replica of a standard normal. @fig-ames-lot-area (panel d) illustrates the result for the lot area. In this instance, the resulting distribution is precisely what would be seen if the true distribution was Gaussian with zero mean and a standard deviation of one. In @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of predictors is discussed. @@ -250,11 +253,11 @@ In @sec-spatial-sign below, another tool for attenuating outliers in _groups_ of Another goal for transforming individual predictors is to convert them to a common scale. This is a pre-processing requirement for some models. For example, a _K_-nearest neighbors model computes the distances between data points. Suppose Euclidean distance is used with the Ames data. One predictor, the year a house was built, has training set values ranging between `r min(ames_train$Year_Built)` and `r max(ames_train$Year_Built)`. Another, the number of bathrooms, ranges from `r min(ames_train$Baths)` to `r max(ames_train$Baths)`. If these raw data were used to compute the distance, the value would be inappropriately dominated by the year variable simply because its values were large. See TODO appendix for a summary of which models require a common scale. -The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two standardization methods are commonly used. +The previous section discussed two transformations that automatically convert predictors to a common distribution. The percentile transformation generates values roughly uniformly distributed on the `[0, 1]` scale, and the ORQ transformation results in predictors with standard normal distributions. However, two other standardization methods are commonly used. First is centering and scaling (as previously mentioned). To convert to a common scale, the mean ($\bar{x}$) and standard deviation ($\hat{s}$) are computed from the training data and the standardized version of the data is $x^* = (x - \bar{x}) / \hat{s}$. The shape of the original distribution is preserved; only the location and scale are modified to be zero and one, respectively. -In the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. 
While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. +In the next chapter, methods are discussed to convert categorical predictors to a numeric format. The standard tool is to create a set of columns consisting of zeros and ones called _indicator_ or _dummy variables_. When centering and scaling, what should we do with these binary features? These should be treated the same as the dense numeric predictors. The result is that a binary column will still have two unique values, one positive and one negative. The values will depend on the prevalence of the zeros and ones in the training data. While this seems awkward, it is required to ensure each predictor has the same mean and standard deviation. Note that if the predictor set is _only_ scaled, @twosd suggests that the indicator variables be divided by two standard deviations instead of one. @fig-standardization(b) shows the results of centering and scaling the gross living area predictor from the Ames data. Note that the shape of the distribution does not change; only the magnitude of the values is different. @@ -305,13 +308,13 @@ When new data are outside the training set range, they can either be clipped to ### Spatial Sign {#sec-spatial-sign} -Some transformations involve multiple predictors. The next section describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: $$ -x^*_{ij}=\frac{x_{ij}}{\sum^{P}_{j=1} x_{ij}^2} $$ +Some transformations involve multiple predictors. An upcoming chapter describes a specific class of simultaneous _feature extraction_ transformations. Here, we will focus on the spatial sign transformation [@Serneels]. This method, which requires $p$ standardized predictors as inputs, projects the data points onto a $p$ dimensional unit hypersphere. This makes all of the data points equally distant from the center of the hypersphere, thereby eliminating all potential outliers. The equation is: $$ +x^*_{ij}=\frac{x_{ij}}{\sqrt{\sum\limits^{p}_{j=1} x_{ij}^2}} $$ -Notice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns are now combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation. +Notice that all of the predictors are simultaneously modified and that the calculations occur in a row-wise pattern. Because of this, the individual predictor columns become combinations of the other columns and now reflect more than the individual contribution of the original predictors. In other words, after this transformation is applied, if any individual predictor is considered important, its significance should be attributed to all of the predictors used in the transformation.
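As a rough illustration of the row-wise calculation described above, the following base-R sketch standardizes a small set of hypothetical predictors and projects each row onto the unit hypersphere; the data and object names are made up for illustration.

```r
# Sketch: spatial sign transformation of standardized predictors.
spatial_sign <- function(X) {
  # X: numeric matrix of standardized predictors (rows = samples, columns = predictors)
  norms <- sqrt(rowSums(X^2))   # Euclidean norm of each row
  sweep(X, 1, norms, "/")       # divide every value in a row by that row's norm
}

set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 2))  # hypothetical standardized predictors
X_ss <- spatial_sign(X)
summary(rowSums(X_ss^2))                  # every row now has unit length
```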
```{r} #| label: ames-lot-living-area-calc @@ -331,7 +334,7 @@ two_areas_raw <- geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + labs(x = "Lot Area (thousands)", y = "Gross Living Area") + scale_color_manual(values = data_cols) + - scale_size_manual(values = c(3, 1)) + + scale_size_manual(values = c(3, 1 / 2)) + coord_fixed(ratio = 1/25) two_areas_norm <- @@ -343,7 +346,7 @@ two_areas_norm <- geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + labs(x = "Lot Area", y = "Gross Living Area") + scale_color_manual(values = data_cols) + - scale_size_manual(values = c(3, 1)) + + scale_size_manual(values = c(3, 1 / 2)) + coord_equal() + theme(axis.title.y = element_blank()) @@ -357,12 +360,12 @@ two_areas_ss <- geom_point(aes(col = location, pch = location, size = location), alpha = 1 / 2) + labs(x = "Lot Area", y = "Gross Living Area") + scale_color_manual(values = data_cols) + - scale_size_manual(values = c(3, 1 /2)) + + scale_size_manual(values = c(3, 1 / 2)) + coord_equal() + theme(axis.title.y = element_blank()) ``` -@fig-ames-lot-living-area shows predictors from the Ames data. In these data, at least `r sum(bake(two_areas_rec, new_data = NULL)$location == "'outlying'")` samples appear farther away from most of the data in either Lot Area and/or Gross Living Area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. +@fig-ames-lot-living-area shows predictors from the Ames data. In these data, we somewhat arbitrarily labeled `r sum(bake(two_areas_rec, new_data = NULL)$location == "'outlying'")` samples as being "far away" from most of the data in either lot area and/or gross living area. Each of these predictors may follow a right-skewed distribution, or there is some other characteristic that is associated with these samples. Regardless, we would like to transform these predictors simultaneously. The second panel of the data shows the same predictors _after_ an orderNorm transformation. Note that, after this operation, the outlying values appear less extreme. @@ -374,24 +377,11 @@ The second panel of the data shows the same predictors _after_ an orderNorm tran #| fig-height: 3 #| out-width: "100%" -two_areas_raw + two_areas_norm + two_areas_ss + - plot_layout(guides = "collect") & - theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt")) +((two_areas_raw + two_areas_norm + two_areas_ss) + + plot_layout(guides = "collect")) ``` The panel on the right shows the data after applying the spatial sign. The data now form a circle centered at (0, 0) where the previously flagged instances are no longer distributionally abnormal. The resulting bivariate distribution is quite jarring when compared to the original. However, these new versions of the predictors can still be important components in a machine-learning model. 
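For intuition about the orderNorm step used in the figure above, a rank-based approximation of ordered quantile normalization [@ORQ] can be sketched in a few lines of base R. This is a simplification; the full procedure also deals with ties and maps out-of-range new data via interpolation.

```r
# Rough sketch of ordered quantile ("orderNorm"-style) normalization:
# replace each value by the standard normal quantile of its rank.
orq_sketch <- function(x) {
  qnorm((rank(x) - 0.5) / length(x))
}

z <- orq_sketch(ames_train$Lot_Area)  # assumes the chapter's `ames_train` data is loaded
c(mean(z), sd(z))                     # approximately 0 and 1, regardless of the original shape
```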
-## Feature Extraction and Embeddings - - -### Linear Projection Methods {#sec-linear-feature-extraction} - -spatial sign for robustness - - -### Nonlinear Techniques {#sec-nonlinear-feature-extraction} - - - ## Chapter References {.unnumbered} From 35cc83307fcb9ab49cb0b450881bdf974bd57f03 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Fri, 9 Feb 2024 21:05:19 -0600 Subject: [PATCH 10/10] Apply suggestions from code review Co-authored-by: kjell-stattenacity --- chapters/numeric-predictors.qmd | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/chapters/numeric-predictors.qmd b/chapters/numeric-predictors.qmd index 216e584..63770e8 100644 --- a/chapters/numeric-predictors.qmd +++ b/chapters/numeric-predictors.qmd @@ -33,22 +33,32 @@ set_options() source("../R/setup_ames.R") ``` -As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to explin the outcome effectively. There is also the need to properly encode/format the data based on the model's mathematical requirements (i.e., pre-processing). The next chapter described techniques for categorical data, and in this chapter, we do the same for quantitative predictors. +Data that are available for modeling are often collected passively without the specific purpose of being used for building a predictive model. As an example, the Ames Housing data contains a wealth of information on houses in Ames, Iowa. But this available data may not contain the most relevant measurements for predicting house price. This may be due to the fact that important predictors were not measured. Or, it may be because the predictors we have collected are not in the best form to allow models to uncover the relationship between the predictors and the response. +As mentioned previously, feature engineering is the process of representing your predictor data so that the model has to do the least amount of work to explain the outcome effectively. A tool of feature engineering is predictor transformations. Some models also need predictors to be transformed to meet the model's mathematical requirements (i.e., pre-processing). In this chapter we will review transformations for quantitative predictors. +We will begin by describing transformations that are applied to one predictor at a time that yield a revised form of the predictor (one in, one out). Then we will explore transformations that can be applied to a group of predictors and yield an informative summary of the group (many in, some out). Later, in @sec-add-remove-features, we will examine techniques for expanding a single predictor to many predictors (one in, many out). Let's begin by understanding some general data characteristics that need to be addressed via feature engineering and when transformations should be applied. -## When are transformations estimated and applied? -The next few chapters concern preprocessing and feature engineering tools that mostly affect the predictors.
As previously noted, the training set data are used to estimate parameters; this is also true for preprocessing parameters. All of these computations use the training set. At no point do we re-estimate parameters when new data are encountered. +## What are Problematic Characteristics, and When Should Transformations be Applied? + +Common problematic characteristics that occur across individual predictors are: + +* skewed or unusually shaped distributions, +* sample(s) that have extremely large or small values, and +* vastly disparate scales. + + Some models, like those that are tree-based, are able to tolerate these characteristics. However, these characteristics can detrimentally affect most other models. Techniques used to address these problems generally involve transformation parameters. For example, to place the predictors on the same scale, we would subtract the mean of a predictor from a sample and then divide by the standard deviation. This is known as standardizing and will be discussed in the next section. +What data should be used to estimate the mean and standard deviation? Recall that the training data set was used to estimate model parameters. Similarly, we will use the training data to estimate transformation parameters. When the test set or any future data set is standardized, the process will use the estimates from the training data set. Any model fit that uses these standardized predictors would want new samples being predicted to have the same reference distribution. Suppose that a predictor column had an underlying Gaussian distribution with a sample mean estimate of 5.0 and a sample standard deviation of 1.0. Suppose a new sample has a predictor value of 3.7. For the training set, this new value lands around the 10th percentile and would be standardized to a value of -1.3. The new value is relative to the training set distribution. Also note that, in this scenario, it would be impossible to standardize using a recomputed standard deviation for the new sample (which means we try to divide with a zero standard deviation). +Now let's review transformations that leave the predictor "in-place", but altered. + ## General Transformations -Many transformations that involve a single predictor change the data distribution. What would a problematic distribution look like? Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. +Many transformations that involve a single predictor change the data distribution. Most predictive models do not place specific parametric assumptions on the predictor variables (e.g., require normality), but some distributions might facilitate better predictive performance than others. TODO some based on convention or scientific knowledge. Others like the arc-sin (ref The arcsine is asinine: the analysis of proportions in ecology) or logit?
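A tiny sketch of the point above about estimating and reusing transformation parameters, repeating the hypothetical 5.0 / 1.0 / 3.7 example; the object names are illustrative only.

```r
# Sketch: estimate centering/scaling statistics from the training set once,
# then apply those same statistics to the test set or any new sample.
train_mean <- 5.0                        # hypothetical training-set estimates
train_sd   <- 1.0

standardize <- function(x, center, scale) (x - center) / scale

standardize(3.7, train_mean, train_sd)   # -1.3, i.e., relative to the training distribution

# A single new value provides no usable standard deviation of its own,
# so it cannot be rescaled in isolation (sd() of one value is NA in R).
sd(3.7)
```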
@@ -128,7 +138,7 @@ lot_area_pctl <- #| fig-width: 8 #| fig-height: 5.5 #| out-width: "80%" -#| fig-cap: "Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson transformations (b), percentile (c), and ordered quantile normalization (d) transformations." +#| fig-cap: "Lot area for houses in Ames, IA. The raw data (a) are shown along with transformed versions using the Yeo-Johnson (b), percentile (c), and ordered quantile normalization (d) transformations." (lot_area_raw + lot_area_yj) / (lot_area_pctl + lot_area_norm) ``` @@ -201,7 +211,7 @@ $$ In either case, maximum likelihood is also used to estimate the $\lambda$ parameter. -In practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may only be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\hat{\lambda}$ is within some range of zero (say between $\pm 0.01$)^[If you've never seen it, the "hat" notation (e.g. $\hat{\lambda}$) indicates an estimate of some unknown parameter.]. +In practice, these two transformations might be limited to predictors with acceptable density. For example, the transformation may not be appropriate for a predictor with a few unique values. A threshold of five or so unique values might be a proper rule of thumb (see the discussion in @sec-near-zero-var). On occasion the maximum likelihood estimates of $\lambda$ diverge to huge values; it is also sensible to use values within a suitable range. Also, the estimate will never be absolute zero. Implementations usually apply a log transformation when the $\hat{\lambda}$ is within some range of zero (say between $\pm 0.01$)^[If you've never seen it, the "hat" notation (e.g. $\hat{\lambda}$) indicates an estimate of some unknown parameter.]. For the lot area predictor, the Box-Cox and Yeo-Johnson techniques both produce an estimate of $\hat{\lambda} = `r round(yj_est, 3)`$. The results are shown in @fig-ames-lot-area (panel b). There is undoubtedly less right-skew, and the data are more symmetric with a new skewness value of `r signif(bc_skew, 3)` (much closer to zero). However, there are still outlying points. @@ -241,7 +251,7 @@ Examples of other families of transformations for dense numeric predictors. :::: -Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. +Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. 
For example, for the original lot area data, the 0.1 percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0.0, 0.5, and 1.0 percentiles, respectively. +Skewness can also be resolved using techniques related to distributional percentiles. A percentile is a value with a specific proportion of data below it. For example, for the original lot area data, the 10th percentile is `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet, which means that 10{{< pct >}} of the training set has lot areas less than `r format(quantile(ames_train$Lot_Area, prob = .1), big.mark = ",")` square feet. The minimum, median, and maximum are the 0, 50th and 100th percentiles, respectively. Numeric predictors can be converted to their percentiles, and these data, inherently between zero and one, are used in their place. Probability theory tells us that the distribution of the percentiles should resemble a uniform distribution. This results from the transformed version of the lot area shown in @fig-ames-lot-area (panel c). For new data, values beyond the range of the original predictor data can be truncated to values of zero or one, as appropriate.
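The percentile transformation described above can be sketched with the empirical cumulative distribution function, estimated on the training set and then reused for any new data. This assumes the chapter's `ames_train` data frame is available; the object name below is just for illustration.

```r
# Sketch: percentile transformation via the training-set empirical CDF.
lot_area_pctl_fn <- ecdf(ames_train$Lot_Area)

# Training values map to roughly uniform values on [0, 1].
summary(lot_area_pctl_fn(ames_train$Lot_Area))

# New samples are evaluated with the same training-set function; values outside
# the training range are effectively truncated to 0 or 1.
lot_area_pctl_fn(c(0, 1e7))
```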