E. F. Haghish edited this page Oct 7, 2021 · 4 revisions

Version: 0.0.3
Cite: Haghish, E. F. (2021). Integrating R machine learning algorithms in Stata using rcall 3.0

Imputing missing data using k-nearest neighbors algorithm (kNN)

The program uses nearest neighbor averaging to impute missing data. The command calls the impute.knn function from the impute R package (Hastie et al., 2021) and embeds it in a Stata program using the rcall package (Haghish, 2019). kNN is a powerful and very fast imputation method. It is especially useful for large datasets, where multiple imputation or Random Forest imputation is not feasible because of the computational resources required.

More importantly, the command is also meant to be a tutorial for Stata developers, showing how to embed R in Stata and how to document Stata packages in Markdown using the markdoc package. To learn more, fork this repository on GitHub, read the source code, and if you find it interesting, contribute to its development or documentation.

Syntax

knn [varlist], k(10) rowmax(real 0.5) colmax(real 0.8)

Options

The main options are the following:

Option Description
k Number of neighbors to be used in the imputation (default=10)
rowmax The maximum proportion of missing data allowed in any row (default 0.5, i.e. 50%; see below)
colmax The maximum proportion of missing data allowed in any column (default 0.8, i.e. 80%; see below)

For any row in which the share of missing values exceeds rowmax, the overall mean of each variable is used for imputation. However, if the share of missing values in any variable exceeds colmax, an error is returned. In that case, drop the variable or increase colmax.
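Before running the command, it can help to see how close the data are to these limits. The following is a minimal sketch that uses only built-in Stata commands on the auto dataset. The variable name nmiss is a hypothetical helper, and the option values passed to knn are simply the defaults written out as proportions, following the Syntax line above.

```stata
. sysuse auto, clear
. * missing values per variable (compare with colmax)
. misstable summarize price-foreign
. * number of missing values per observation (compare with rowmax)
. egen nmiss = rowmiss(price-foreign)
. tabulate nmiss
. * remove the helper so it is not passed to knn
. drop nmiss
. knn price-foreign, k(5) rowmax(0.5) colmax(0.8)
```

Dividing the counts by the number of observations (for a variable) or by the number of variables in the varlist (for an observation) gives the proportions that colmax and rowmax refer to.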

Description

under development ... feel free to contribute on GitHub

Installation

The impute R package is hosted on Bioconductor and should not be installed from CRAN. Run the following code to install the dependencies from within Stata:

        . rcall: install.packages("BiocManager", repos="http://cran.uk.r-project.org")
        . rcall: BiocManager::install("impute")
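To confirm that the installation worked, one option is to ask R for the installed version of impute. This is a sketch that assumes rcall is already installed and can locate R on your system. packageVersion() is a function shipped with R and raises an error if the package is not installed.

```stata
. rcall: print(packageVersion("impute"))
```

If this call returns an error, repeating the two installation commands above is a reasonable first step.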

Example

Here is an example of doing missing data imputation with knn. The imputed dataset will be loaded in Stata automatically.

Example 1

        . webuse mheart5 
        . knn , k(5)

Example 2

        . qui sysuse auto, clear
        . qui replace foreign = . in 1
        . qui replace foreign = . in 15
        . qui replace length = . in 68
        . knn price-foreign
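After example 2 has run, the result can be inspected with built-in Stata commands. The sketch below assumes that the imputed dataset keeps the original observation order, so that observations 1, 15, and 68 are still the ones that were set to missing. This is an assumption, not something the documentation above states.

```stata
. * count should be 0 if both variables were fully imputed
. count if missing(foreign) | missing(length)
. * inspect the values that replaced the missing entries
. list foreign length if inlist(_n, 1, 15, 68)
```

Because the method averages over neighbors, the imputed values of a binary variable such as foreign may not be exactly 0 or 1, so it is worth checking them before further analysis.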

Author

E. F. Haghish
Department of Psychology
University of Oslo
haghish@uio.no
machinelearning homepage
Package Updates on Twitter


This help file was dynamically produced by MarkDoc Literate Programming package
