Skip to content

blakebullwinkel/datathin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

datathin

This is a Python implementation of data thinning, a method proposed by Neufeld et al.. The code closely follows the original R version and uses a similar syntax. For example, to thin a vector of Poisson distributed data, simply call the datathin function, specifying the family and epsilon parameters:

from datathin.thinning import datathin

X = datathin(data, family="poisson", epsilon=[0.3, 0.7])

Here, X is a three-dimensional numpy array, and the folds of data are stored in the third dimension. I.e.,

X_train = X[:,:,0]
X_test = X[:,:,1]

Note that 1) data == X_train + X_test, and 2) the correlation between X_train and X_test is approximately zero. For more examples of how to thin data following other distributions, see test_helpers.py.

This Python project is very new, and contributions are very much welcome! If you find any bugs, feel free to open a GitHub issue or a pull request. Other helpful contributions include:

  • Documentation (we currently have none)
  • Optimizations that make the datathin function run faster
  • Additional tests, particularly for different input shapes

Installation

datathin is not yet available on PyPI and must be installed directly from this repo. First, initialize and activate a conda environment.

conda create -n datathin-dev python=3.10
conda activate datathin-dev

In the same directory as the pyproject.toml file, you can install dependencies (numpy is the only one) by running:

cd $GIT_PROJECT_HOME
pip install .

OR to install datathin in editable mode for development purposes, run:

pip install -e .[dev]

The [dev] suffix installs additional dependencies for contributors, including pytest.

Citation

If you found this package useful, please cite the original data thinning paper.

@article{JMLR:v25:23-0446,
  author  = {Anna Neufeld and Ameer Dharamshi and Lucy L. Gao and Daniela Witten},
  title   = {Data Thinning for Convolution-Closed Distributions},
  journal = {Journal of Machine Learning Research},
  year    = {2024},
  volume  = {25},
  number  = {57},
  pages   = {1--35},
  url     = {http://jmlr.org/papers/v25/23-0446.html}
}

About

Data thinning in Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages