Data Automation Pipeline is a collection of documented functions created from my day-to-day problem solving for processing data in R, covering exploration, cleaning, transformation, sampling, summarization, visualization, and predictive modeling, plus other utilities that support data analytics work. It spans 40k+ lines of code with 100+ functions organized into 40+ sections of code blocks. It is the large library I have built through years of studying Data Analytics & Science, and it is my go-to file for specific data analytics tasks. Note: some functions may not work, either because they have not been fully implemented or because updates to R and its libraries over time have affected code compatibility.
1) Data Utilities and Preprocessing
This section is helpful for getting started before doing data analytics. It covers helper functions designed for: installing packages, reshaping data between long and wide format by specified keys, analyzing the distribution of the data, finding missing data quickly, sampling with a variety of methods, finding color palettes for data visualization, forming combinations and permutations of data elements, reproducing hard-coded snippets with new parameter values, merging two separate factor variables into a single new one, automatically recoding numeric variables into factors with defined ranges, and, lastly, learning a new package by extracting information on all of its functions into a data frame to see how they should be used (see the sketch below), and more.
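As an illustration of the package-learning helper mentioned above, here is a minimal base-R sketch (not the library's own function) that lists a package's functions and their arguments in a data frame:

```r
# Minimal sketch, base R only: list every function in a package's namespace
# together with its formal argument names.
package_functions <- function(pkg) {
  ns   <- getNamespace(pkg)
  objs <- ls(ns)
  fns  <- objs[vapply(objs, function(x) is.function(get(x, envir = ns)), logical(1))]
  data.frame(
    func = fns,
    args = vapply(fns, function(f) paste(names(formals(get(f, envir = ns))), collapse = ", "), character(1)),
    stringsAsFactors = FALSE
  )
}

head(package_functions("stats"))
```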
2) Common and Advanced Plotting
This section contains ready-made functions for a variety of visualizations that can be built from data, supporting both univariate and multivariate inputs. Included plots are: density plot, scatter plot, correlation plot, partition matrix, forest plot, tree percent plot, radar chart, sunburst plot, wind rose plot, Venn diagram, UpSet diagram, and circular packing plot.
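For orientation, here is a minimal ggplot2 sketch (using the built-in iris data purely for illustration) of the two simplest plot families above, a density plot and a scatter plot:

```r
# Minimal sketch with ggplot2: one univariate and one multivariate plot.
library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)                # univariate: density per group

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()                             # multivariate: scatter plot
```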
3) Plotting with Highcharter
This section contains ready-made functions for generating plots with Highcharter, which emphasizes interactivity and theme personalization compared to the previous section of common and advanced plotting. Included plot types are: Sankey dependency wheel, timeline, variwide, heatmap, and side-by-side bar.
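As a starting point, here is a minimal sketch assuming the highcharter package is installed; hchart() with hcaes() turns a data frame into an interactive chart (the wrappers in this section build more elaborate charts on top of this):

```r
# Minimal highcharter sketch: an interactive scatter plot grouped by cylinder.
library(highcharter)

df <- mtcars
df$cyl <- factor(df$cyl)

hchart(df, "scatter", hcaes(x = wt, y = mpg, group = cyl)) |>   # base R pipe (R >= 4.1)
  hc_title(text = "Weight vs. MPG (interactive)")
```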
4) Statistical Testing
This section contains ready-to-use statistical tests for data, and it is also helpful for running experimental A/B tests to uncover which variant performs significantly better. Included tests are: ANOVA, MANOVA, Binomial Proportion test, Multinomial Proportion test, Proportion Confidence test, Univariate Mean test, Paired Mean test, Independent Mean test, Categoric Mean test, nonparametric group mean tests, Contingency Table test, Sphericity test, Power test, and Maximally Selected Statistics.
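To show the flavor of these wrappers, here is a minimal sketch using only base R tests on built-in and simulated data (for illustration only):

```r
# One-way ANOVA: does mean Sepal.Length differ across Species?
summary(aov(Sepal.Length ~ Species, data = iris))

# Independent two-sample mean test (a classic A/B comparison).
set.seed(1)
a <- rnorm(100, mean = 10,   sd = 2)
b <- rnorm(100, mean = 10.5, sd = 2)
t.test(a, b)

# Proportion test: conversions out of visitors in two variants.
prop.test(x = c(120, 150), n = c(1000, 1000))
```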
5) Icon and Emoji Plotting (Also uses FontAwesome)
This section contains functions that make use of icons, pictures, and emojis when plotting data. It relies on ggwaffle and on FontAwesome to supply icons and emojis. This can help create more distinctive visualizations and improve storytelling when used sparingly.
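As a very small stand-in for the idea (it does not use ggwaffle or FontAwesome themselves, just plain Unicode glyphs with ggplot2; rendering depends on the fonts available on your system):

```r
# Minimal sketch: plotting Unicode emoji as labels with plain ggplot2.
library(ggplot2)

df <- data.frame(x = 1:3, y = c(2, 5, 3),
                 icon = c("\U0001F34E", "\U0001F34C", "\U0001F347"))

ggplot(df, aes(x, y, label = icon)) +
  geom_text(size = 10) +
  theme_minimal()
```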
6) Quality Control Analysis
This section contains analysis related to quality control. It emphasizes boundaries derived from the analyzed distribution of the data; these boundaries are the main foundation of quality control, under the assumption that an outlier is a critical data point that needs further evaluation in an organization. Examples include fraudulent activities involving money, and processes with tight control parameters such as temperature checks for a server room, internet data transfer rates, or rain density to anticipate potential floods in a specific area, and many similar cases.
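The simplest version of such a boundary is the classic three-sigma rule; here is a minimal base-R sketch (the simulated temperature vector is made up for illustration):

```r
# Flag points outside mean +/- 3 standard deviations as out-of-control.
set.seed(1)
temps <- c(rnorm(100, mean = 21, sd = 0.5), 24)   # simulated server-room temperatures

ucl <- mean(temps) + 3 * sd(temps)                # upper control limit
lcl <- mean(temps) - 3 * sd(temps)                # lower control limit

which(temps > ucl | temps < lcl)                  # indices of out-of-control points
```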
7) Benford Analysis
This section contains analysis related to numerical patterns. A numerical pattern in this context means treating numeric data as strings in order to examine how digits occur within a collection of strings. For instance, a dataset may contain numerical codes such as employee IDs, manager IDs, transaction IDs, or numerical passwords, each consisting of a set of digits. By checking numerical patterns, records that share an identical pattern can be labeled and grouped together into new data features that may uncover hidden statistical meaning.
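A minimal base-R sketch of the Benford side of this idea, comparing observed first-digit frequencies against Benford's expected proportions (the `amounts` vector is simulated for illustration):

```r
set.seed(1)
amounts <- rlnorm(5000, meanlog = 5, sdlog = 1.5)        # made-up positive amounts

# First significant digit of each value.
first_digit <- floor(amounts / 10^floor(log10(amounts)))

observed <- as.numeric(table(factor(first_digit, levels = 1:9))) / length(first_digit)
benford  <- log10(1 + 1 / (1:9))                          # Benford's law proportions

round(rbind(observed = observed, benford = benford), 3)
```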
8) Association Rule Mining
This section contains analysis of associations in categorical multivariate data. It commonly helps uncover the most to least frequent transactional patterns across a company's products and goods, which can support decisions that drive sales toward sets of products prepared to interest clients and customers. The algorithms used for association rule mining in this section are Apriori and Eclat.
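A minimal sketch, assuming the arules package and its bundled Groceries transactions, showing both algorithms named above:

```r
library(arules)
data("Groceries")

# Apriori: mine rules and show the strongest by lift.
rules <- apriori(Groceries, parameter = list(supp = 0.005, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))

# Eclat: mine frequent itemsets (rules can be derived from them afterwards).
itemsets <- eclat(Groceries, parameter = list(supp = 0.01))
inspect(head(itemsets))
```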
9) Cluster Analysis
This section contains a variety of cluster analysis methods divided into several subsections: connectivity-based clustering (hierarchical clustering) such as dendrograms and the AGNES and DIANA methods; centroid-based clustering such as K-Means, K-Medoids, and K-Medians; distribution-based clustering, which involves Expectation-Maximization (EM) clustering; density-based clustering, which involves DBSCAN with a search for the optimal epsilon parameter; and subspace clustering (not practical here due to memory issues and very expensive computation).
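As a minimal base-R sketch of two of these families, hierarchical (connectivity-based) and K-Means (centroid-based) clustering on the scaled iris measurements:

```r
x <- scale(iris[, 1:4])

hc <- hclust(dist(x), method = "ward.D2")     # hierarchical clustering
plot(hc)                                      # dendrogram
cutree(hc, k = 3)                             # cut into 3 clusters

km <- kmeans(x, centers = 3, nstart = 25)     # centroid-based clustering
table(km$cluster, iris$Species)
```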
10) Bayesian and Markov Chain Model Analysis
This section contains functions to perform Bayesian and Markov chain analysis, since both methods model a dataset with probabilities. These models help uncover and measure the likelihood of upcoming events from recorded event patterns. Included models are: Naïve Bayes, AODE, Bayesian Belief Network, and Markov Chain.
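A minimal sketch of the first of these, assuming the e1071 package (the AODE, Bayesian network, and Markov chain models in this section come from other packages):

```r
library(e1071)

fit <- naiveBayes(Species ~ ., data = iris)   # class-conditional probability model
predict(fit, head(iris))                      # class predictions from posteriors
```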
11) Decision Tree Models
This section contains functions to fit various classification trees on a dataset of recorded properties, procedures, and measurements, uncovering a set of effective rules that can be used to reevaluate current procedures or to classify data based on common patterns found in the properties. It contains: classification trees with rpart, classification trees with the binary Iterative Dichotomiser (ID3) approach, classification trees with the C4.5 and C5.0 algorithms, classification trees with the CHAID (Chi-squared Automatic Interaction Detection) algorithm, single-split decision stumps (a base model for other models), classification trees with the M5 and LMT (Logistic Model Trees) algorithms from RWeka, conditional inference trees, and classification trees with projection pursuit.
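A minimal sketch of the simplest variant, assuming the rpart package (the other algorithms listed above come from their own packages such as C50, RWeka, and party):

```r
library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")
printcp(tree)                               # complexity table for pruning decisions
predict(tree, head(iris), type = "class")   # rules applied to new rows
```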
12) Exploratory Analysis with EFA and SEM Modeling
This section contains functions to perform Exploratory Factor Analysis (EFA), which is used to find the best manifest variables from the components, cumulative variance, and loading values; these are the parameters used to build a SEM (Structural Equation Model) that captures the relationships between latent variables and the variable to be predicted. EFA can also be used for dimensionality reduction and hypothesis testing.
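A minimal base-R sketch of the EFA step (the SEM step would then be specified in a package such as lavaan, which is not shown here):

```r
# factanal() estimates loadings; the cumulative variance explained helps decide
# how many factors to keep before writing the SEM specification.
efa <- factanal(mtcars, factors = 2, rotation = "varimax")
efa$loadings                     # loadings and cumulative variance per factor
```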
13) Time Series Analysis
This section covers data analytics for time series, whether univariate or multivariate. It contains time series transformation; decomposition into essential properties such as trend, seasonal, random, and fixed patterns; statistical tests for time series; and forecasting methods such as ARIMA, ETS, Prophet, Holt-Winters, exponential smoothing, Fourier models, and the MARSS model for multivariate time series.
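A minimal base-R sketch of the decomposition and one of the forecasting methods (the ARIMA/ETS/Prophet/MARSS models in this section come from their own packages):

```r
# Classical decomposition into trend / seasonal / random components.
plot(decompose(AirPassengers))

# Holt-Winters forecast, 12 months ahead.
fit <- HoltWinters(AirPassengers)
predict(fit, n.ahead = 12)
```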
14) Panel Data Analysis
This section covers analysis of panel data, which contains a time-bound variable and categorical variables that are monitored continuously. The focus is on statistical tests related to panel data: the Hausman test, the F test for time-fixed effects, the Breusch-Pagan Lagrange multiplier (LM) test for random effects, tests for cross-sectional dependence/contemporaneous correlation, serial correlation tests, the ADF test (Augmented Dickey-Fuller), and the BP test (heteroscedasticity), all to ensure that the panel data model is unbiased and of better quality. This section also includes modeling a heterogeneous model for panel data.
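A minimal sketch, assuming the plm package and its bundled Grunfeld panel, of fitting fixed- and random-effects models and choosing between them with the Hausman test:

```r
library(plm)
data("Grunfeld", package = "plm")

fixed  <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "within")
random <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "random")

phtest(fixed, random)              # Hausman test: fixed vs. random effects
```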
15) Dimensionality Reduction Analysis
This section covers most of the common dimensionality reduction techniques, whose purpose is to reduce the number of dataset features while preserving each data point's uniqueness relative to its neighbors. It contains: dim_reduction_feature_selection, dim_reduction_stepwise, dim_reduction_component_based, and dim_reduction_fast_2D_mapping. These functions mostly handle selecting the best features, clustering, and projecting data abstractions into 2D, and the section is focused on data exploration.
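A minimal base-R sketch of the component-based idea, using PCA to obtain a 2D projection for exploration:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)                              # variance explained per component

proj <- pca$x[, 1:2]                      # first two components as a 2D projection
plot(proj, col = iris$Species)
```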
16) Gravity Model Analysis
This section covers a less widely known model called the Gravity Model. With it, an analyst can model spatial phenomena such as international and global trade, migration patterns, transportation flows, and other settings where interactions between locations matter. The interaction between two locations is modeled as proportional to the product of their sizes and inversely proportional to the distance between them.
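The standard way to estimate that relationship is the log-linear form; here is a minimal sketch with base R lm(), where the `trade` data frame and its columns (flow, size_origin, size_dest, distance) are hypothetical and shown only to illustrate the specification:

```r
# log(flow_ij) = b0 + b1*log(size_i) + b2*log(size_j) + b3*log(dist_ij) + e,
# with b3 expected to be negative (interaction decays with distance).
fit <- lm(log(flow) ~ log(size_origin) + log(size_dest) + log(distance), data = trade)
summary(fit)
```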
17) Local Outlier Factor Analysis
This section covers functions specially designed to handle and spot outliers in a more advanced way than the common approach of using aggregation. The Local Outlier Factor algorithm computes an outlier score for each data point; the scores are then used to label each point as an outlier or not.
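A minimal sketch, assuming the dbscan package (whose LOF argument names have changed across versions; recent releases use `minPts`):

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])
scores <- lof(x, minPts = 10)             # LOF score per observation
which(scores > 1.5)                       # candidate outliers (the threshold is a judgment call)
```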
18) Fuzzy Rule Based System Machine Learning
This section covers functions to build machine learning models based on fuzzy rules. At the moment they are not practical due to expensive computation and memory drain during execution; they are included only to show that this kind of machine learning exists.
19) Instance based Classification
This section covers functions to build classification models that require no distributional assumptions or hypotheses before use. It covers classification with KNN (K-Nearest Neighbors), SOM (Self-Organizing Maps), Learning Vector Quantization (LVQ), and LOESS (Locally Weighted Scatterplot Smoothing).
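A minimal sketch of the first of these, assuming the class package (SOM models come from packages such as kohonen):

```r
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

pred <- knn(train, test, cl = iris$Species[idx], k = 5)   # instance-based classification
table(pred, iris$Species[-idx])
```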
20) Person-Item Parameter Model
This section covers functions to perform modeling with the concept of "person-item parameters", which is used extensively in psychometrics, particularly in item response theory (IRT). The idea of this model is that each individual's responses to test items are modeled as a function of their latent trait level (e.g., ability) and the parameters of the test items (e.g., difficulty). The model then estimates the person parameters and the item parameters simultaneously, using maximum likelihood or Bayesian estimation methods.
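A minimal sketch, assuming the ltm package and its bundled LSAT response data, of a Rasch model that estimates item difficulties and person abilities together:

```r
library(ltm)

fit <- rasch(LSAT)                 # item (difficulty) parameters
summary(fit)
factor.scores(fit)                 # person (ability) estimates per response pattern
```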
21) Social Network Analysis
This section covers functions for visualization and graph interpretation of social networks or other interconnected networks that are rich in nodes and edges. Related datasets can be found at http://snap.stanford.edu/data/#socnets which contains a variety of social networks of different sizes and contexts.
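A minimal sketch, assuming the igraph package, of building a graph from an edge list and inspecting basic centrality:

```r
library(igraph)

edges <- data.frame(from = c("A", "A", "B", "C"), to = c("B", "C", "C", "D"))
g <- graph_from_data_frame(edges, directed = FALSE)

degree(g)                          # node degree centrality
betweenness(g)                     # betweenness centrality
plot(g)
```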
22) Machine Learning Automation with Caret
This section covers functions for automatically fitting multiple machine learning models provided by the caret library in R, ranging over 100+ algorithms, each with its own hyperparameters, and allowing users to cross-validate across them. The key difference of caret over MLR is that caret focuses on providing a unified interface for model training, tuning, and evaluation.
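A minimal sketch, assuming the caret package: a single train() call handles resampling and tuning, and swapping the `method` string switches the underlying algorithm:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)              # 5-fold cross-validation
fit  <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
fit$results                                                  # cross-validated performance per tuning value
```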
23) Machine Learning Automation with MLR
This section covers functions for automatically fitting multiple machine learning models provided by the MLR library in R, ranging over 20+ algorithms, each with its own hyperparameters, and allowing users to cross-validate across them. The key difference of MLR over caret is that MLR and its extensions emphasize modularity, extensibility, and reproducibility.
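A minimal sketch, assuming the classic mlr package, where the task/learner/resample split is what gives it its modularity:

```r
library(mlr)

task    <- makeClassifTask(data = iris, target = "Species")
learner <- makeLearner("classif.rpart")
res     <- resample(learner, task, resampling = makeResampleDesc("CV", iters = 5))
res$aggr                           # aggregated cross-validation performance
```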
24) Various Regression Analysis
This section covers functions to fit various regression models, along with related utilities such as normalization and standardization methods to reduce data variance before modeling. Included regression models are: Linear Regression, Polynomial Regression, Logistic/Probit Regression, Quantile Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, Principal Component Regression, Partial Least Squares Regression, Support Vector Regression, Ordinal Regression, Poisson Regression, Negative Binomial Regression, Quasi-Poisson Regression, Cox Regression, Tobit Regression, Stepwise Regression, Multivariate Adaptive Regression Splines, LOESS Regression (nonparametric local regression), and LARS Regression (Least Angle Regression), with examples included for each of the 20 regression types.
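A minimal sketch of three of these families: ordinary least squares and logistic regression with base R, and ridge/lasso with the glmnet package (alpha = 0 is ridge, alpha = 1 is lasso):

```r
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)                 # linear regression
fit_log <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # logistic regression

library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg
fit_ridge <- cv.glmnet(x, y, alpha = 0)                     # cross-validated ridge regression
coef(fit_ridge, s = "lambda.min")
```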
25) Neural Network & Deep Learning with Tensorflow and Keras
This section covers functions to make predictions with neural networks and deep learning using TensorFlow and Keras. It also includes a utility to build a time series matrix by incorporating N lags of a univariate time series, which can then be trained with a Keras deep learning model with LSTM layers to produce predictions from learned time series patterns.
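A minimal sketch, assuming the keras package with a working TensorFlow backend: build a lagged matrix with base R embed(), reshape it into the 3D array an LSTM expects, and fit a small model (this is an illustration of the idea, not the library's own time series utility):

```r
library(keras)

series <- as.numeric(scale(AirPassengers))
lags   <- 12
m      <- embed(series, lags + 1)              # each row: y_t, y_{t-1}, ..., y_{t-lags}

y <- m[, 1]
x <- array(m[, -1], dim = c(nrow(m), lags, 1)) # samples x timesteps x features

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(lags, 1)) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(x, y, epochs = 20, batch_size = 16, verbose = 0)
```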
26) Extreme Value Analysis
This section covers functions to perform Extreme Value Analysis (EVA) on a dataset. Extreme value analysis is a branch of statistics that deals with the statistical modeling and analysis of extreme or rare events, such as floods, hurricanes, stock market crashes, or unusually high or low temperatures. The goal of EVA is to understand the behavior of extreme events, estimate their probabilities, and model their distributions.
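A minimal sketch of the block-maxima approach: extract block maxima with base R, then, assuming the evd package, fit a generalized extreme value (GEV) distribution by maximum likelihood (the series here is a stand-in, used only for illustration):

```r
obs    <- as.numeric(AirPassengers)            # stand-in series, 12 "years" of 12 values
blocks <- rep(1:12, each = 12)
maxima <- tapply(obs, blocks, max)             # block maxima

library(evd)
fit <- fgev(maxima)                            # GEV fit: location, scale, shape
fit
```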
27) SVM Classifier
This section covers functions to classify data using SVM, which uses support vectors to draw clear boundaries between classes. It also includes helper functions such as an overlapping-density check, which measures the overlap between variables before modeling the data with an SVM, because modeling can be inefficient when there is a high density of overlap between variables. The SVMs built in this section use the SVM implementations from e1071 and kernlab.
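A minimal sketch, assuming the e1071 package, of fitting an SVM classifier with a radial kernel and checking its confusion matrix:

```r
library(e1071)

fit  <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
pred <- predict(fit, iris)
table(pred, iris$Species)
```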
28) Sparklyr Functional Interfaces
This section covers functions to operate Spark from R. Spark is used here to support big data operations; at a minimum, this section provides vector filtering, cross-tabulation in batches, NA-value checks, unique column counts, and group-by summarization across big data.
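A minimal sketch, assuming sparklyr with a local Spark installation: copy a data frame into Spark, summarize by group with dplyr verbs, and collect the small result back into R:

```r
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```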
29) Calendar Time Conversion
This section is a collection of self-made calendar converter algorithms that convert standard (Gregorian) calendar data into foreign calendars. Currently available are converters for the Chinese, Julian, Hebrew, Islamic, Persian/Afghan/Kurdish, Indian, French, Ethiopian, Badi, Balinese day, Balinese Pawukon, and Javanese Pasaran calendars. Do note that these converters may produce some inaccuracies due to limited usage and testing. The algorithms are inspired by Wikipedia and the available literature on calendar systems.
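Most such conversions pivot through an integer day count; here is a minimal base-R sketch (an illustration of the common building block, not one of the library's converters) that turns a Gregorian date into a Julian Day Number:

```r
gregorian_to_jdn <- function(year, month, day) {
  a <- (14 - month) %/% 12
  y <- year + 4800 - a
  m <- month + 12 * a - 3
  day + (153 * m + 2) %/% 5 + 365 * y + y %/% 4 - y %/% 100 + y %/% 400 - 32045
}

gregorian_to_jdn(2000, 1, 1)   # 2451545
```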
30) Cross Tabulation Analysis
This section covers several methods for cross-tabulating a dataset. Cross tabulation is a technique used to summarize the relationship between two or more categorical variables by tabulating the frequency or proportion of observations in each combination of categories. Three approaches are adapted here: the crosstab routine by Dr Paul Williamson, crosstabs with a chi-square test of independence, and crosstabs with tabyl, with the last two approaches being more visually oriented.
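A minimal sketch of the last two approaches: a base-R cross tabulation with a chi-square test of independence, plus the same table with tabyl (assuming the janitor package is installed):

```r
tab <- xtabs(~ cyl + gear, data = mtcars)
tab
chisq.test(tab)                    # test of independence between cyl and gear

library(janitor)
tabyl(mtcars, cyl, gear)           # the same two-way table, tidier output
```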
31) Multi-Criteria Decision Making (MCDM) Analysis
This section covers a distinctive data analysis technique, MCDM analysis. MCDM is an approach to select the best products and offers available by matching them against user criteria. Features can be assigned positive or negative weights in the MCDM calculation to find products and offers more precisely through a ranking system. Included MCDM methods are: basic MCDM, AHP (Analytic Hierarchy Process), ANP (AHP in super-matrix form), and other meta algorithms such as MULTIMOORA, TOPSIS, VIKOR, and WASPAS.
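A minimal base-R sketch of the "basic MCDM" idea (simple additive weighting): normalize each criterion, treat price as a cost criterion, apply weights, and rank (the offers and weights are made up for illustration):

```r
offers  <- data.frame(price = c(200, 250, 180), quality = c(7, 9, 6), speed = c(5, 8, 7))
weights <- c(price = 0.5, quality = 0.3, speed = 0.2)

norm <- offers
norm$price   <- min(offers$price) / offers$price      # cost criterion: lower is better
norm$quality <- offers$quality / max(offers$quality)  # benefit criteria: higher is better
norm$speed   <- offers$speed / max(offers$speed)

score <- as.matrix(norm) %*% weights                  # weighted sum per offer
rank(-score)                                          # 1 = best offer
```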
32) Species by Community and Environment Analysis
This section covers analysis of ecological, community, and environment datasets, supported directly by the vegan library. These datasets have their own distinct analyses, for example: analyzing species abundance data, community composition, and diversity metrics; calculating various measures of ecological diversity, including richness, evenness, and diversity indices; and comparing ecological communities and testing for differences between samples or groups.
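A minimal sketch, assuming the vegan package and its bundled dune community data:

```r
library(vegan)
data(dune)

diversity(dune, index = "shannon")        # Shannon diversity per site
specnumber(dune)                          # species richness per site
vegdist(dune, method = "bray")            # Bray-Curtis community dissimilarities
```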
33) Spatial/GIS Related Analysis
This section covers spatial and GIS-related analysis. At the moment it is designed for fast geographic/map visualization of datasets.
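A minimal sketch, assuming the leaflet package, of an interactive map with a couple of illustrative coordinates:

```r
library(leaflet)

pts <- data.frame(lng = c(106.85, 107.61), lat = c(-6.21, -6.91),
                  label = c("Jakarta", "Bandung"))

leaflet(pts) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~label)
```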
34) Text Data/Sentiment Analysis
This section covers NLP analysis in the form of text and sentiment analysis. At the moment it provides ways to extract sentiments and tokens from text using positive and negative sentiment lexicons for English.
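A minimal sketch, assuming the tidytext and dplyr packages: tokenize a text column and label tokens with the Bing positive/negative lexicon (the two example sentences are made up):

```r
library(dplyr)
library(tidytext)

docs <- data.frame(id = 1:2,
                   text = c("the service was great and fast",
                            "the product arrived broken and late"))

docs %>%
  unnest_tokens(word, text) %>%                       # one token per row
  inner_join(get_sentiments("bing"), by = "word") %>% # attach positive/negative labels
  count(id, sentiment)
```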
35) Event Log Data Analysis
This section covers event log analysis supported by the bupaR library, which is built to discover, monitor, and improve processes within organizations and can also be used to improve an organization's SOPs. As a side note: event logs are structured datasets that record sequences of events or activities, along with timestamps and additional attributes such as case IDs, activity names, and resource IDs. They are typically collected from information systems, workflow management systems, or other sources of operational data within organizations.
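A minimal sketch, assuming bupaR's simple_eventlog() helper for data frames with only case, activity, and timestamp columns (the toy log below is made up for illustration):

```r
library(bupaR)

log_df <- data.frame(
  case      = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("register", "review", "approve", "register", "approve"),
  timestamp = as.POSIXct("2024-01-01") + c(0, 3600, 7200, 0, 5400)
)

log <- simple_eventlog(log_df, case_id = "case",
                       activity_id = "activity", timestamp = "timestamp")
activities(log)                    # activity frequencies in the log
```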
36) Feature Enrichment and Optimization for Regression and Classification Tasks
This section covers feature enrichment and feature-set optimization. For example, moving_value_statistics is a function that expands the columns by aggregating a given feature over a moving window: if the window size is 10, the feature is aggregated over rows 1 to 10, then 2 to 11, 3 to 12, and so on until the last row (see the sketch after these lists). While this reduces the number of rows by the window size, the enriched feature can contribute a distinctive distribution to the model. There is also flexible feature selection for the model variables that will be used in regression and classification tasks,
for example, regression utilities: pick_starting_variable, pick_transform_starting_variable, selective_transformation_variable, drop_all_possible_variable, pick_starting_variable_ordinal, pick_transform_starting_variable_ordinal, selective_transform_variable_ordinal;
and classification utilities, for example: check_overlap_reduction, pick_transform_overlap_reduction, continue_pick_transform_overlap_reduction, selective_transformation_overlap_reduction.
These utilities are designed to continuously search for the best combination of variables for both regression and classification tasks in as little time as possible.
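As a small stand-in for the moving-window idea behind moving_value_statistics (that function belongs to this library; only base R and the zoo package are assumed here):

```r
library(zoo)

x <- mtcars$mpg
rolling_mean <- rollapply(x, width = 10, FUN = mean, align = "right")
rolling_sd   <- rollapply(x, width = 10, FUN = sd,   align = "right")

length(x) - length(rolling_mean)   # the result is shorter by width - 1 rows
```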
37) Maximum Likelihood Estimation for various distributions
This section covers functions to find the maximum likelihood estimates (MLE) of various distributions. Maximum Likelihood Estimation is a statistical technique for estimating a distribution's parameters from data, which can then be used to simulate numerical patterns from the fitted parameters (see the sketch after these lists). The section covers the two major distribution types for MLE: continuous and discrete.
For continuous distributions: finding the MLE of the Normal, Lognormal, Beta, Exponential, Gamma, Weibull, Uniform, Cauchy, Chi-Square, F, and Logistic distributions.
For discrete distributions: finding the MLE of the Poisson, Geometric, Binomial, Negative Binomial, Hypergeometric, Wilcoxon, and Signrank distributions.
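A minimal base-R sketch of MLE for one continuous case (the Normal distribution), maximizing the log-likelihood with optim(); MASS::fitdistr() offers the same idea pre-packaged for many of the distributions listed above:

```r
set.seed(1)
x <- rnorm(500, mean = 3, sd = 2)

# Negative log-likelihood; the sd is optimized on the log scale to keep it positive.
negloglik <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))

fit <- optim(par = c(0, 0), fn = negloglik)
c(mean = fit$par[1], sd = exp(fit$par[2]))   # estimates close to mean = 3, sd = 2
```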
38) ZigZag Computing with Economic/Forex Data
This section documents functions for computing ZigZag patterns on financial asset data. The ZigZag modeling in this section is powered by local minima/maxima extraction, which smooths the ZigZag by finding maxima and minima within local neighborhoods and reduces noise by filtering on a ZigZag size that varies from 1% to 10% change. After the ZigZag has been computed, there is also a utility to summarize the available ZigZags and track the next bull or bear ZigZag by aggregating the time of each previous ZigZag to estimate when the next one is coming.
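A minimal base-R sketch of the local minima/maxima extraction the ZigZag is built on (an illustration of the idea, not the library's own routine): a point is a local maximum (minimum) if it is the largest (smallest) value within a centered window:

```r
local_extrema <- function(x, window = 5) {
  half <- window %/% 2
  idx  <- (half + 1):(length(x) - half)
  is_max <- sapply(idx, function(i) x[i] == max(x[(i - half):(i + half)]))
  is_min <- sapply(idx, function(i) x[i] == min(x[(i - half):(i + half)]))
  list(maxima = idx[is_max], minima = idx[is_min])
}

set.seed(7)
price <- cumsum(rnorm(200))        # simulated price path
local_extrema(price, window = 11)
```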
39) Pivot Point/Pivot Range Computing with Economic/Forex Data
This section documents functions for in-depth Pivot Point calculations on financial asset data. Pivot Points are a technical analysis tool used to determine intraday or swing levels based on the previous timeframe's movement. They can be calculated from different perspectives and with different methods; included here are: Standard Pivot Points R1-R3, S1-S3; Fibonacci Pivot Points R1-R5, S1-S5; Demark Pivot Points R1 and S1; Camarilla Pivot Points R1-R3, S1-S3; and the Central Pivot Range (CPR). A sketch of the standard formulas follows the note below.
As an extra note, here is a ranking of the Pivot Points from the widest to the shallowest gaps, based on experimentation:
Camarilla Pivot Points -> Standard Pivot Point -> Demark Pivot Points -> Fibonacci Pivot Points -> Central Pivot Range
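A minimal base-R sketch of the Standard Pivot Point levels from the previous period's high, low, and close (the other variants above use different formulas; the example prices are made up):

```r
pivot_standard <- function(high, low, close) {
  pp <- (high + low + close) / 3
  c(
    S3 = low - 2 * (high - pp),
    S2 = pp - (high - low),
    S1 = 2 * pp - high,
    PP = pp,
    R1 = 2 * pp - low,
    R2 = pp + (high - low),
    R3 = high + 2 * (pp - low)
  )
}

pivot_standard(high = 1.1050, low = 1.0950, close = 1.1000)
```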
40) Web Scrape Economic Indicators
This section documents functions for obtaining historical financial instrument data, historical event/news data, and sentiment data that are useful references when beginning experiments with financial assets. It gathers data by web scraping with Selenium from investing.com, forexfactory.com, and myfxbook.com, cross-referencing event news information and trader sentiment for specific financial instruments.
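As a simplified stand-in for the scraping idea, here is a minimal rvest sketch for a static page (the library itself drives Selenium for dynamic pages such as investing.com; the URL and selector below are placeholders, not the real ones used):

```r
library(rvest)

page   <- read_html("https://example.com/economic-calendar")   # placeholder URL
tables <- html_table(html_elements(page, "table"))             # extract all HTML tables
head(tables[[1]])
```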