knn
Version: 0.0.3
Cite: Haghish, E. F. (2021). Integrating R machine learning algorithms in Stata using rcall 3.0
The program uses nearest neighbor averaging to impute missing data. The command calls the impute.knn function from the impute R package (Hastie et al., 2021) and embeds it in a Stata program using the rcall package (Haghish, 2019). kNN is a powerful and extremely fast imputation method. It is especially useful for large datasets, where multiple imputation or Random Forest imputation methods are not feasible because of the excessive computational resources they need.
More importantly, the command is also meant to be a tutorial for Stata developers, showing how to embed R into Stata and how to document Stata packages with Markdown, using the markdoc package. To learn more, fork this repository on GitHub, read the source code, and if you find it interesting, contribute to its development or documentation.
knn [varlist], k(10) rowmax(real 0.5) colmax(real 0.8)
The main options are the following:
Option | Description |
---|---|
k | Number of neighbors to be used in the imputation (default=10) |
rowmax | The maximum percent missing data allowed in any row (default 50%, see below) |
colmax | The maximum percent missing data allowed in any column (default 80%, see below) |
For any row with more than rowmax percent missing values, the overall mean of each variable is used for imputation. However, if any variable has more than colmax percent missing values, an error is returned. You may drop these variables or increase colmax.
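Before running knn, it can help to check how much missing data each variable and each observation has, so that the colmax and rowmax thresholds do not come as a surprise. The following is a minimal sketch using only built-in Stata commands. The auto dataset, the variable names nmiss_row and pct_miss, and the local macro names are illustrative assumptions and are not part of the knn package.

```stata
sysuse auto, clear

* Percent missing per variable (compare against colmax)
foreach v of varlist price-foreign {
    quietly count if missing(`v')
    local pct = 100 * r(N) / _N
    display as text "`v'" _col(16) as result %6.1f `pct' " % missing"
}

* Percent missing per observation (compare against rowmax)
unab vars : price-foreign
local nvars : word count `vars'
egen nmiss_row = rowmiss(`vars')
generate pct_miss = 100 * nmiss_row / `nvars'
summarize pct_miss
```

Variables that exceed the column threshold can then be dropped from the varlist passed to knn, or colmax can be raised, as described above.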
under development ... feel free to contribute on GitHub
The impute R package is hosted on Bioconductor and should not be installed from CRAN. Run the following code to install the dependencies within Stata:
. rcall: install.packages("BiocManager", repos="http://cran.uk.r-project.org")
. rcall: BiocManager::install("impute")
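After installation, one way to confirm that R can find the package is to ask R for its version from within Stata. This is a small sketch that assumes rcall passes the R code through in the same way as the installation commands above; packageVersion is a standard R function, and the call should fail with an R error if impute is not installed.

```stata
* Ask R to report the installed version of the impute package
rcall: print(packageVersion("impute"))
```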
Here is an example of doing missing data imputation with knn. The imputed dataset will be loaded in Stata automatically.
example 1
. webuse mheart5
. knn , k(5)
example 2
. qui sysuse auto, clear
. qui replace foreign = . in 1
. qui replace foreign = . in 15
. qui replace length = . in 68
. knn price-foreign
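To confirm that the imputation filled the gaps, one option is to count the missing values in a variable before and after the call. This is a minimal sketch, assuming that knn replaces the data in memory with the imputed dataset as described above and that the impute R package is installed. The local macro names before and after are illustrative.

```stata
sysuse auto, clear
replace length = . in 68

* Missing values in length before imputation
quietly count if missing(length)
local before = r(N)

knn price-foreign

* Missing values in length after imputation
quietly count if missing(length)
local after = r(N)

display "missing before: `before'   missing after: `after'"
```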
E. F. Haghish
Department of Psychology
University of Oslo
haghish@uio.no
machinelearning homepage
Package Updates on Twitter
This help file was dynamically produced by MarkDoc Literate Programming package