---
output: github_document
---
```{r, include = FALSE}
# knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
<img align="right" src="https://raw.githubusercontent.com/quayau/fxtract/master/man/figures/hexagon.svg?sanitize=true" width="125px">
[![Build Status](https://travis-ci.org/QuayAu/fxtract.svg?branch=master)](https://travis-ci.org/QuayAu/fxtract)
[![codecov](https://codecov.io/gh/QuayAu/fxtract/branch/master/graph/badge.svg)](https://codecov.io/gh/QuayAu/fxtract)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/QuayAu/fxtract?branch=master&svg=true)](https://ci.appveyor.com/project/QuayAu/fxtract)
[![CRAN](https://www.r-pkg.org/badges/version/fxtract)](https://cran.r-project.org/package=fxtract)
[![cran checks](https://cranchecks.info/badges/worst/fxtract)](https://cran.r-project.org/web/checks/check_results_fxtract.html)
[![Downloads](http://cranlogs.r-pkg.org/badges/grand-total/fxtract)](https://cran.r-project.org/package=fxtract)
# fxtract
Feature extraction is a crucial step for tackling machine learning problems.
Many machine learning problems start with complex (often timestamped) raw data with many grouped variables (e.g., heart rate measurements of many patients, or GPS data for analyzing the movements of many study participants). Often, this raw data cannot be used directly by machine learning algorithms; user-defined features must be extracted first. Examples include the heart rate variability of a patient or the maximum distance traveled by a participant in a GPS study.
Because machine learning applications differ widely, and with them the raw datasets and the features that need to be calculated, we do not supply any automated features.
`fxtract` assists you in the feature extraction process by handling the data wrangling, while still letting you define and extract your own features.
![](man/figures/fxtract_main.svg)
The user only needs to define functions that take a dataset as input and return a named vector (or list) with the desired features as output. All of the data wrangling (calculating the features for each ID and collecting the results in one final dataframe) is handled by `fxtract`.
This package works with very large datasets and many different IDs. The main functionality is written in [R6](https://r6.r-lib.org/articles/Introduction.html), and parallelization is available via [future](https://cran.r-project.org/package=future).
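To make the contract concrete, here is a minimal base R sketch of the split, apply, and combine steps described above. It does not use `fxtract` and is not its implementation; the names `sepal_features`, `per_id`, and `features` are hypothetical and chosen only for illustration.

```r
# Hypothetical feature function: takes the data of ONE ID and
# returns a named vector of features.
sepal_features = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}

# Split the raw data into one dataset per ID (here: one per species).
per_id = split(iris, iris$Species)

# Apply the feature function to each ID and build one row per ID.
rows = lapply(names(per_id), function(id) {
  data.frame(Species = id, as.list(sepal_features(per_id[[id]])))
})

# Collect all rows in one final data frame.
features = do.call(rbind, rows)
features
```

This is the kind of bookkeeping the package is meant to take over, so that only the feature function itself has to be written by hand.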
See the [tutorial](https://quayau.github.io/fxtract/) on how to use this package.
# Installation
For the release version use:
```{r, eval = FALSE}
install.packages("fxtract")
```
For the development version use [devtools](https://cran.r-project.org/package=devtools):
```{r, eval = FALSE}
devtools::install_github("QuayAu/fxtract")
```
### Why not just use ``dplyr`` or other packages?
At first glance, it may look like we just rewrote the ``summarize()`` functionality of ``dplyr``.
Similar functionality is also provided by the ``aggregate()`` function from the base ``stats`` package.
For small datasets and a few (easy to calculate) features, using ``fxtract`` may indeed be overkill (and slower, too).
However, this package was especially designed for projects with large datasets, many IDs, and many different feature functions. `fxtract` streamlines the process of loading datasets and adding feature functions. Once your dataset (with all IDs) becomes too big for memory, or if some feature functions fail on some IDs, using our package can save you many lines of code.
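For comparison, the small-data case can be handled with `aggregate()` alone. The following sketch computes the mean and standard deviation of `Sepal.Length` per species of the built-in `iris` data; the variable name `res` is just an illustration.

```r
# One call covers the simple case: a single in-memory dataset and
# one easy-to-calculate feature function.
res = aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# aggregate() stores the two statistics in a single matrix column;
# flatten it into ordinary columns.
res = do.call(data.frame, res)
res
```

This approach assumes the whole dataset fits in memory and that the function succeeds for every group, which are exactly the assumptions that stop holding in the larger projects described above.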
```{r, echo = FALSE}
unlink("fxtract_files", recursive = TRUE)
```
# Usage
```{r, message = FALSE, warning = FALSE, results = "hide"}
library(fxtract)
# user-defined function:
fun = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}
# R6 object:
xtractor = Xtractor$new("xtractor")
xtractor$add_data(iris, group_by = "Species")
xtractor$add_feature(fun)
xtractor$calc_features()
xtractor$results
```
```{r, echo = FALSE}
xtractor$results
```
```{r, echo = FALSE}
unlink("fxtract_files", recursive = TRUE)
```
# Features
* Unit-tested functions.
* Extracting features from raw data of many different IDs with the R6 class `Xtractor`:
    * No more code bloat thanks to R6.
    * Very large datasets are supported, since data is only read into RAM when needed. Minimum requirement: the individual dataset for each ID must be small enough to be read into memory.
    * Features are calculated for each ID individually and can be parallelized with the `future` package.
    * If one feature throws an error on one ID, the whole process is not stopped, as it would be in a traditional R script. The remaining features are still calculated.
    * Individual features can be deleted or updated easily.
    * Calculation of features can be done in parallel and the process can be monitored. It is also possible to stop the calculation and resume it at a later time.
    * Results can be easily retrieved in one final dataframe.
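
The error isolation mentioned in the list can be pictured with a short base R sketch. This is an illustration of the general idea using `tryCatch()`, not the package's internal code; `fragile_feature`, `outcomes`, and `results` are hypothetical names, and the failure is simulated.

```r
# Hypothetical feature function that fails for one ID.
fragile_feature = function(data) {
  if (data$Species[1] == "versicolor") stop("simulated failure")
  c(max_petal_width = max(data$Petal.Width))
}

# One dataset per ID.
per_id = split(iris, iris$Species)

# Evaluate each ID separately; an error on one ID is recorded
# instead of aborting the loop.
outcomes = lapply(per_id, function(data) {
  tryCatch(
    list(value = fragile_feature(data), error = NA_character_),
    error = function(e) {
      list(value = c(max_petal_width = NA_real_), error = conditionMessage(e))
    }
  )
})

# Collect values and error messages in one data frame.
results = data.frame(
  Species = names(outcomes),
  max_petal_width = vapply(outcomes, function(o) unname(o$value), numeric(1)),
  error = vapply(outcomes, function(o) o$error, character(1)),
  row.names = NULL
)
results
```

The failing ID ends up with a missing value and a stored error message, while the other IDs are calculated as usual.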