fixed spellings
behrica committed Feb 10, 2024
1 parent 206e9e3 commit 22dc23b
Showing 1 changed file with 51 additions and 38 deletions.
89 changes: 51 additions & 38 deletions notebooks/prepare_for_ml.clj
@@ -11,14 +11,14 @@
;;One typical problem in machine learning is `classification`,
;;so learning how to categorize data in different categories.
;;Sometimes data in this format is also called "qualitative data"
;;or data having `discrete` values

;;or data having `discrete` values.
;;
;; These categories are often expressed in Clojure as
;; values of type `String` or `keyword`
;;
;; In `dataset` it is the `Column` which has specific support for
;; categorical data

;; categorical data.
;;
;; Creating a column out of categorical data looks like this:
(require '[tech.v3.dataset.column :as col]
'[tech.v3.dataset :as ds])
@@ -29,12 +29,13 @@

;; Printing the var shows its "type" as being `keyword`
column-x
;; and printing its metadata show that it got marked as `categorical`
;; and printing its metadata shows that it got marked as `categorical`
(meta column-x)

;; The column is therefore using its metadata to store important information,
;; and it is important to get used to look at it for the case of debugging
;; issues.
;; The column is therefore using its metadata to store important
;; information, and it is important to get used to looking at it
;; when debugging issues.
;;
;; The same happens, when creating a `dataset` which is a seq
;; of columns
;;
@@ -48,13 +49,14 @@ categorical-ds
meta
(vals categorical-ds))

;; ### Express categorical variables in numeric space
;; Most machine learining models can only work on numerical values,
;; ### Transform categorical variables to numerical space
;; Most machine learning models can only work on numerical values,
;; both for features and the target variable.
;; So usually we need to transform categorical data into a numeric representation,
;; so each category need to be converted to a number.
;; These numbers have no meaning for the users, so often we need to
;; convert back into
;; So usually we need to transform categorical data into a numeric
;; representation, so each category needs to be converted to a number.
;;
;; These numbers often have no meaning for the users,
;; so we often need to convert them back into
;; String / keyword space later on.
;;
;; Namespace `tech.v3.dataset.categorical`
@@ -63,18 +65,20 @@ categorical-ds
;; ### Transform categorical column into a numerical column
(require '[tech.v3.dataset.categorical :as ds-cat])

;; These function operate on a single column, but expect a dataset and
;; These functions operate on a single column, but expect a dataset and
;; a column name as input.
;;
;; We use them to find a mapping from string/keyword to a
;; We use them to calculate a mapping from string/keyword to a
;; numerical space (0 ... x) like this

(ds-cat/fit-categorical-map categorical-ds :x)

;; This maps value in the order of occurrence in the column to
;; This maps the values in their order of occurrence in the column to
;; 0 .. 1
;; This is a bit dangerous, as the mapping is decided by "row order",
;; which could change or be different on other subset of the data
;; which could change or be different on other subsets of the data, like
;; test/train splits
;;
;; So it is preferred to be specified explicitly.

(def x-mapping (ds-cat/fit-categorical-map categorical-ds :x [:a :b]))
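
;; A minimal sketch, in plain Clojure, of why fitting by "row order" is
;; fragile. `fit-by-occurrence` and `fit-explicit` are hypothetical helpers
;; for illustration only; they are not part of `tech.v3.dataset` and merely
;; mimic the idea of a category -> number mapping.

```clojure
;; Hypothetical helper: assign integer codes in order of first occurrence.
(defn fit-by-occurrence [values]
  (zipmap (distinct values) (range)))

;; Hypothetical helper: assign integer codes following an explicit order.
(defn fit-explicit [categories]
  (zipmap categories (range)))

;; Two subsets of the same data (for example a train and a test split)
;; whose first rows happen to differ.
(def train-column [:a :b :a :b])
(def test-column  [:b :a :b :b])

;; Fitting by occurrence gives each subset its own, conflicting mapping:
;; :a is coded 0 in the first and 1 in the second.
(def train-mapping (fit-by-occurrence train-column))
(def test-mapping  (fit-by-occurrence test-column))

;; An explicit category order yields the same mapping for every subset.
(def fixed-mapping (fit-explicit [:a :b]))
```

;; With an explicit order the code for `:a` no longer depends on which
;; row comes first, so the mapping can be reused across subsets.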
@@ -103,14 +107,17 @@ numerical-categorical-data
;;are different for two columns (for whatever reasons), it is not given
;;that the column cell value like `0` means the same in both columns.
;;Columns which have categorical maps should never be compared via
;;'clojure.core/=' as this will ignore the categorical maps.
;;`clojure.core/=` as this will ignore the categorical maps.
;; (unless we are sure that the categorical maps in both are **the same**)
;; They should be converted back to their original space and then compared.
;; This is especially important for comparing `prediction` and `true value`
;; in machine learning for metric calculations.

;; See the following example to illustrate this.

;; ### Incorrect comparisons
;; In this ds the two columns are clearly different (the opposite even)
;; In the following the two columns are clearly different
;; (the opposite even)
(def ds-with-different-cat-maps
(->
(ds/->dataset {:x-1 [:a :b :a :b :b :b]
@@ -121,7 +128,7 @@ numerical-categorical-data
(:x-1 ds-with-different-cat-maps)
(:x-2 ds-with-different-cat-maps)

;; By using default `/categorical->number` we get different categorical
;; By using default `categorical->number` we get different categorical
;; maps, having different :lookup-tables
(meta (:x-1 ds-with-different-cat-maps))
(meta (:x-2 ds-with-different-cat-maps))
@@ -142,18 +149,18 @@ numerical-categorical-data
(:x-2 reverted-ds-with-different-cat-maps)


;; and can know compare them correctly as :false
;; and now they compare correctly as `false`
(=
(:x-1 reverted-ds-with-different-cat-maps)
(:x-2 reverted-ds-with-different-cat-maps))

;; So it should be as well avoid to transform mapped columns
;; So transforming mapped columns to other representations
;; which lose the mappings, like tensors or primitive arrays,
;; or even sequences, should be avoided as well
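
;; The pitfall can be reproduced without any library. The following is
;; a sketch in plain Clojure with made-up data; `encode` and `decode` are
;; hypothetical helpers, not functions of `tech.v3.dataset.categorical`.

```clojure
(require '[clojure.set :as set])

;; Two lookup tables that disagree on what 0 and 1 mean.
(def mapping-1 {:a 0 :b 1})
(def mapping-2 {:b 0 :a 1})

;; Hypothetical helper: replace each category by its code.
(defn encode [mapping values]
  (mapv mapping values))

;; Hypothetical helper: replace each code by its category.
(defn decode [mapping codes]
  (mapv (set/map-invert mapping) codes))

;; The two columns hold opposite values ...
(def encoded-1 (encode mapping-1 [:a :b :a]))
(def encoded-2 (encode mapping-2 [:b :a :b]))

;; ... yet their codes are identical, so comparing the raw numbers
;; wrongly reports the columns as equal.
(def codes-equal? (= encoded-1 encoded-2))

;; Decoding each column with its own mapping first gives the right answer.
(def values-equal?
  (= (decode mapping-1 encoded-1)
     (decode mapping-2 encoded-2)))
```

;; Here `codes-equal?` is true although the underlying categories differ,
;; while `values-equal?` is false.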

;; ### Better use same and fixed mapping
;; This issue can be avoided by specifying the mapping to use, as being
;; {:a 0 :b 1}
;; ### Use the same and fixed mapping
;; This issue can be avoided by explicitly specifying the mapping
;; to be used, for example {:a 0 :b 1}
(def ds-with-same-cat-maps
(->
(ds/->dataset {:x-1 [:a :b :a :b :b :b]
@@ -191,7 +198,7 @@ numerical-categorical-data
;;For some models / use cases the categorical data need to be converted
;;in the so called `one-hot` format.
;;In this format every categorical column is expanded into one column per category, and
;;the each one-hot column can only have 0 and 1 values.
;;then each one-hot column can only have 0 and 1 values.
;;
(def one-hot-map-x (ds-cat/fit-one-hot categorical-ds :x))
(def one-hot-map-y (ds-cat/fit-one-hot categorical-ds :y))
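
;; To make the `one-hot` idea concrete, here is a sketch in plain Clojure.
;; `one-hot` and `one-hot->values` are hypothetical helpers written for
;; this illustration; they do not reproduce the data structures that
;; `ds-cat/fit-one-hot` returns.

```clojure
;; Hypothetical helper: build one 0/1 vector per category.
(defn one-hot [categories values]
  (into {}
        (for [category categories]
          [category (mapv #(if (= % category) 1 0) values)])))

;; Hypothetical helper: recover the original values by finding,
;; for each row, the category whose vector holds a 1.
(defn one-hot->values [one-hot-columns]
  (let [categories (keys one-hot-columns)
        row-count  (count (first (vals one-hot-columns)))]
    (mapv (fn [row]
            (first (filter #(= 1 (get-in one-hot-columns [% row]))
                           categories)))
          (range row-count))))

;; One input column with two categories becomes two 0/1 columns.
(def one-hot-example (one-hot [:a :b] [:a :b :a]))
```

;; Each row has exactly one 1 across the generated columns, which is
;; what makes the conversion back to categories possible.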
@@ -211,16 +218,17 @@ categorical-ds

one-hot-ds

;; There are similar functions to convert this format back
;; There are similar functions to convert this format back.
;;

;; ## Features and inference target in a dataset



;; A dataset for machine learning has always two groups of columns.
;; A dataset for supervised machine learning always has two groups of
;; columns.
;; They can either be the `features` or the `inference targets`.
;; The goal of learining is to find the relation ship between
;; The goal of the learning is to find the relationship between
;; the two groups
;; and therefore be able to `predict` inference targets from features.
;; Sometimes the features are called `X` and the targets `y`.
@@ -231,22 +239,26 @@ one-hot-ds
:x-2 [1 0 1]
:y [:a :a :b]}))

;; we need to mark explicitely which columns are `features` and which are `target`
;; in order to be able to use it later for machine learning in `metamorph.ml`
;; we need to mark explicitly which columns are `features` and which are
;; `targets` in order to be able to use the dataset later for
;; machine learning in `metamorph.ml`
;;
;; As normally only one or a few columns are inference targets,
;; we can simply mark those and the rest is regarded as features.
;; we can simply mark those and the other columns are regarded as features.

(require '[tech.v3.dataset.modelling :as ds-mod])
(def modelled-ds
(-> ds
(ds-mod/set-inference-target :y))) ; works as well with seq
(ds-mod/set-inference-target :y)))
;; (works as well with a seq)


;; This is marked as well in the column metadata.
(-> modelled-ds :y meta)
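
;; The idea of marking targets through metadata can be sketched in plain
;; Clojure. Everything below is an assumption made for illustration: the
;; columns are ordinary vectors, `mark-target`, `target-names` and
;; `feature-names` are hypothetical helpers, and the metadata key
;; `:inference-target?` is chosen here as an example, not taken from the
;; library documentation.

```clojure
;; Columns as plain vectors; vectors can carry metadata.
(def example-columns
  {:x-1 [0 1 0]
   :x-2 [1 0 1]
   :y   [:a :a :b]})

;; Hypothetical helper: flag one column as inference target in its metadata.
(defn mark-target [columns column-name]
  (update columns column-name vary-meta assoc :inference-target? true))

;; Hypothetical helper: names of all columns flagged as targets.
(defn target-names [columns]
  (set (for [[column-name column] columns
             :when (:inference-target? (meta column))]
         column-name)))

;; Hypothetical helper: every column that is not a target is a feature.
(defn feature-names [columns]
  (set (remove (target-names columns) (keys columns))))

(def marked-columns (mark-target example-columns :y))
```

;; Only `:y` had to be marked; the remaining columns count as features
;; by default, and the column values themselves are unchanged.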


;; There are several functions to get information on features and inference targets:
;; There are several functions to get information on features and
;; inference targets:

(ds-mod/feature-ecount modelled-ds)

@@ -262,7 +274,7 @@ one-hot-ds
;; Very often we need to both transform and model a dataset for
;; classification, so we
;; combine the ->numeric transformation of categorical vars
;; and the marking of inference target
;; and the marking of inference targets.
(def ds-ready-for-train
(->
{:x-1 [0 1 0]
@@ -278,8 +290,9 @@ one-hot-ds
ds-ready-for-train

;; Such a dataset is ready for training as it
;; only contains numerical variables (having the categorical map in place for easy converting back, if needed)
;; and the inference target is marked,
;; only contains numerical variables which have the categorical maps
;; in place for easy converting back, if needed.
;; The inference target is marked as well,
;; as we can see in the metadata:
;;
(map meta (vals ds-ready-for-train))