High-Performance Descriptive Statistics and Hypothesis Tests in C++20
Statistics help us analyze and interpret data. High-performance statistical algorithms help us analyze and interpret a lot of data. Most environments provide convenient helper functions to calculate basic statistics. Scistats aims to provide high-performance statistical algorithms with an easy and familiar interface. All algorithms can run sequentially or in parallel, depending on how much data you have.
Scistats extends the numeric facilities of the standard library to include statistics that work with iterators and ranges. This means you can do things like:
std::vector<int> v{/*...*/};
float m = scistats::mean(v);
or
std::vector<int> v{/*...*/};
float s = scistats::stddev(v); // standard deviation
or
std::vector<int> v1{/*...*/};
std::vector<int> v2{/*...*/};
float c = scistats::cov(v1, v2); // covariance
or
std::vector<int> v{/*...*/};
float p = scistats::t_test(v); // student's t hypothesis test
All algorithms allow execution policies and iterators. So you can do
std::vector<int> v{/*...*/};
float m = scistats::mean(scistats::execution::par, v);
to calculate your average in parallel. Or
std::vector<int> v{/*...*/};
float m = scistats::mean(scistats::execution::seq, v);
to explicitly tell scistats you don't want that to be calculated in parallel. If no execution policy is provided, scistats will choose a policy according to the input size.
As usual, you can also work directly with iterators, so
std::vector<int> v{/*...*/};
float m = scistats::mean(v.begin(), v.end());
also works.
Note that, when needed, the result type gets promoted to `float`. If the result for a given statistic needs to be floating point, scistats will always promote an integer input type to a corresponding floating type large enough to keep the results without losing precision.
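The promotion rule above can be pictured with a small standard-library sketch. It is not the scistats implementation: `promoted_t` and `mean_sketch` are hypothetical names, and mapping every integral type to `double` is an assumption about what "large enough" means.

```cpp
#include <iterator>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical helper: integral value types map to double, floating-point
// types are left unchanged. The rule scistats applies may differ.
template <class T>
using promoted_t = std::conditional_t<std::is_integral_v<T>, double, T>;

// Arithmetic mean over an iterator pair, accumulated in the promoted type.
template <class It>
promoted_t<typename std::iterator_traits<It>::value_type>
mean_sketch(It first, It last) {
    using result_type = promoted_t<typename std::iterator_traits<It>::value_type>;
    const auto n = std::distance(first, last);
    if (n == 0) {
        return result_type{0}; // assumption: an empty range yields zero
    }
    const result_type total = std::accumulate(first, last, result_type{0});
    return total / static_cast<result_type>(n);
}

// Range overload that forwards to the iterator version.
template <class Range>
auto mean_sketch(const Range& r) {
    return mean_sketch(std::begin(r), std::end(r));
}
```

With a `std::vector<int>` as input, `mean_sketch(v)` returns a `double`, so the mean of `{1, 2, 4}` is not truncated to an integer.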
With ranges:
using namespace scistats;
// ...
mean(x);
With iterators:
mean(x.begin(), x.end());
You can run any algorithm in parallel by changing the execution policy:
mean(execution::seq, x);
mean(execution::par, x);
If no execution policy is provided, scistats will infer the best execution policy according to the input data.
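One way to picture how a policy might be inferred from the input size is a simple size threshold. The sketch below is only an illustration: `policy_sketch`, `parallel_threshold_sketch`, and the cutoff value are hypothetical, and the "parallel" branch merely uses `std::reduce` on the calling thread so the example stays free of threading dependencies.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical policy tags standing in for execution::seq and execution::par.
enum class policy_sketch { seq, par };

// Assumed cutoff; a real library would tune this per algorithm and machine.
constexpr std::size_t parallel_threshold_sketch = 100000;

// Pick a policy from the number of elements alone.
constexpr policy_sketch choose_policy_sketch(std::size_t n) {
    return n >= parallel_threshold_sketch ? policy_sketch::par
                                          : policy_sketch::seq;
}

// Sum with an explicit policy.
double sum_sketch(policy_sketch p, const std::vector<double>& x) {
    if (p == policy_sketch::par) {
        // std::reduce may regroup the additions, which is the property a
        // parallel implementation relies on. This overload still runs on
        // the calling thread.
        return std::reduce(x.begin(), x.end(), 0.0);
    }
    return std::accumulate(x.begin(), x.end(), 0.0);
}

// Sum without a policy: infer one from the input size.
double sum_sketch(const std::vector<double>& x) {
    return sum_sketch(choose_policy_sketch(x.size()), x);
}
```

Keeping the policy as the first argument mirrors the calling convention shown above, so the one-argument overload is just a thin dispatcher.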
Other functions to measure central tendency are:
Function | Description |
---|---|
`mean(x)` | Arithmetic mean |
`median(x)` | Median |
`mode(x)` | Mode |
To calculate the standard deviation of a data set:
stddev(x);
If you already know the mean `m`, you can make calculations faster with:
stddev(x,m);
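Passing a known mean saves one pass over the data, as the following sketch suggests. It assumes the sample standard deviation (denominator `n - 1`); whether scistats uses `n` or `n - 1` is not stated here, and `stddev_sketch` is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Sample standard deviation (n - 1 denominator) given a precomputed mean m.
// Only one pass over the data is needed.
double stddev_sketch(const std::vector<double>& x, double m) {
    if (x.size() < 2) {
        return 0.0; // assumption: fewer than two values have no spread
    }
    double sum_sq = 0.0;
    for (double v : x) {
        sum_sq += (v - m) * (v - m);
    }
    return std::sqrt(sum_sq / static_cast<double>(x.size() - 1));
}

// Without a known mean, an extra pass computes it first.
double stddev_sketch(const std::vector<double>& x) {
    if (x.size() < 2) {
        return 0.0;
    }
    const double m = std::accumulate(x.begin(), x.end(), 0.0) /
                     static_cast<double>(x.size());
    return stddev_sketch(x, m);
}
```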
Other functions to measure dispersion are:
Function | Description |
---|---|
`var(x)` | Variance |
`stddev(x)` | Standard Deviation |
`min(x)` | Minimum Value |
`max(x)` | Maximum Value |
`bounds(x)` | Minimum and Maximum Values |
`percentile(x,p)` | Calculate the `p`-th percentile |
To calculate the covariance of two data sets:
cov(x,y);
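As a reference for what this computes, here is a minimal sketch of the sample covariance. The `n - 1` denominator and the name `cov_sketch` are assumptions for illustration, not a description of the scistats internals.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sample covariance of two equally sized data sets (n - 1 denominator).
double cov_sketch(const std::vector<double>& x, const std::vector<double>& y) {
    if (x.size() != y.size()) {
        throw std::invalid_argument("cov_sketch: x and y must have the same size");
    }
    const std::size_t n = x.size();
    if (n < 2) {
        return 0.0; // assumption: undefined cases are reported as zero
    }
    double mean_x = 0.0;
    double mean_y = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= static_cast<double>(n);
    mean_y /= static_cast<double>(n);
    // Accumulate the products of the deviations from each mean.
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += (x[i] - mean_x) * (y[i] - mean_y);
    }
    return acc / static_cast<double>(n - 1);
}
```

For `x = {1, 2, 3}` and `y = {2, 4, 6}` the deviations multiply to 2, 0, and 2, so the result is 4 / 2 = 2.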
To get the probability density of `x` in a normal distribution:
norm_pdf(x);
To get the cumulative probability of `x` in a normal distribution:
norm_cdf(x);
To get the value `x` that has a cumulative probability `p` in a normal distribution:
norm_inv(p);
Probability | Cumulative | Inverse | Description |
---|---|---|---|
`norm_pdf(x)` | `norm_cdf(x)` | `norm_inv(p)` | Normal distribution |
`t_pdf(x,df)` | `t_cdf(x,df)` | `t_inv(p,df)` | Student's T distribution |
where `df` is the degrees of freedom in the probability distribution.
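For the normal distribution, the three functions can be sketched with the standard library alone. The sketch assumes the standard normal (mean 0, standard deviation 1), since the calls above take no other parameters; the `_sketch` names are hypothetical, and the bisection used for the inverse is a simple stand-in rather than the algorithm scistats uses.

```cpp
#include <cmath>
#include <limits>

// Density of the standard normal distribution at x.
double norm_pdf_sketch(double x) {
    constexpr double inv_sqrt_2pi = 0.3989422804014327;
    return inv_sqrt_2pi * std::exp(-0.5 * x * x);
}

// Cumulative probability P(X <= x), written with the complementary
// error function for accuracy in the lower tail.
double norm_cdf_sketch(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Inverse of the cumulative function, found by bisection on [-40, 40].
// Returns NaN when p is outside the open interval (0, 1).
double norm_inv_sketch(double p) {
    if (!(p > 0.0 && p < 1.0)) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    double lo = -40.0;
    double hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (norm_cdf_sketch(mid) < p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}
```

Bisection is slow compared with a rational approximation, but it makes the relationship explicit: the inverse is the `x` at which the cumulative function reaches `p`.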
To test the hypothesis that the values in `x` come from a distribution whose mean is zero:
t_test(x);
To test the hypothesis that the values in `x` and `y` have the same mean:
t_test(x,y);
For a paired test:
t_test_paired(x,y);
To get a confidence interval for these tests:
t_test_interval(x);
t_test_interval(x,y);
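The statistic behind the one-sample test can be sketched as follows. This is an illustration under stated assumptions, not the scistats implementation: `one_sample_t_sketch` and `t_result_sketch` are hypothetical names, and the sketch stops at the statistic and its degrees of freedom, because turning them into a p-value requires the Student's T cumulative function.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct t_result_sketch {
    double t;       // test statistic
    std::size_t df; // degrees of freedom, n - 1
};

// One-sample t statistic for the null hypothesis "population mean == mu0".
// If all values are identical, the standard error is zero and the
// statistic is not finite.
t_result_sketch one_sample_t_sketch(const std::vector<double>& x,
                                    double mu0 = 0.0) {
    const std::size_t n = x.size();
    if (n < 2) {
        return {std::numeric_limits<double>::quiet_NaN(), 0};
    }
    double sum = 0.0;
    for (double v : x) {
        sum += v;
    }
    const double mean = sum / static_cast<double>(n);
    double sum_sq = 0.0;
    for (double v : x) {
        sum_sq += (v - mean) * (v - mean);
    }
    // Sample standard deviation and standard error of the mean.
    const double s = std::sqrt(sum_sq / static_cast<double>(n - 1));
    const double standard_error = s / std::sqrt(static_cast<double>(n));
    return {(mean - mu0) / standard_error, n - 1};
}
```

For `x = {1, 2, 3, 4, 5}` the mean is 3 and the standard error is about 0.7071, which gives a statistic of about 4.2426 with 4 degrees of freedom.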
Given (i) the probability `P(E|H)=likelihood` of the evidence `E` given the hypothesis `H`, (ii) the prior probability `p_hypothesis` of hypothesis `H`, and (iii) the prior probability `p_evidence` of evidence `E`, we can calculate the probability `P(H|E)` of a hypothesis `H` given the evidence `E` with:
bayes_theorem(likelihood, p_hypothesis, p_evidence)
Given `P(E|H)` and `P(E|not H)`, we can calculate the Bayes factor:
bayes_factor(p_evidence_given_h, p_evidence_given_not_h)
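Both quantities are one-line formulas: P(H|E) = P(E|H) P(H) / P(E), and the Bayes factor is the ratio P(E|H) / P(E|not H). A minimal sketch, with hypothetical `_sketch` names and no input validation:

```cpp
// Posterior probability P(H|E) from the likelihood P(E|H), the prior P(H),
// and the probability of the evidence P(E).
double bayes_theorem_sketch(double likelihood, double p_hypothesis,
                            double p_evidence) {
    return likelihood * p_hypothesis / p_evidence;
}

// Bayes factor: how much more likely the evidence is under H than under not H.
double bayes_factor_sketch(double p_evidence_given_h,
                           double p_evidence_given_not_h) {
    return p_evidence_given_h / p_evidence_given_not_h;
}
```

As a worked example, take P(E|H) = 0.9, P(H) = 0.01, and P(E|not H) = 0.05. Then P(E) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585, the posterior is 0.009 / 0.0585 ≈ 0.154, and the Bayes factor is 0.9 / 0.05 = 18.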
To sum the elements of a range in parallel:
sum(execution::par, x)
Or let `scistats` infer if it is worth doing it in parallel:
sum(x)
Function | Description |
---|---|
`sum` | summation |
`prod` | product |
The header `scistats/math/constants.h` defines a number of useful constants as `constexpr` functions:
Function | Description | Approximate Value |
---|---|---|
`pi` | The constant pi | 3.14159 |
`epsilon(scale)` | A tiny number (machine epsilon) for a given scale and type | epsilon(1.) = 2.22045e-16 |
`inf` | The number representing infinity | inf |
`min` | Smallest positive normalized number | 2.22507e-308 |
`max` | Largest finite number | 1.79769e+308 |
`NaN` | The number representing "not a number" | nan |
`e` | Euler's number, the base of the natural logarithm | 2.71828 |
`euler` | Euler–Mascheroni constant (Euler's gamma) | 0.577216 |
`log2_e` | The base-2 logarithm of e | 1.4427 |
`log10_e` | The base-10 logarithm of e | 0.434294 |
`sqrt2` | The square root of two | 1.41421 |
`sqrt1_2` | The square root of one-half | 0.707107 |
`sqrt3` | The square root of three | 1.73205 |
`pi_2` | Pi divided by two | 1.5708 |
`pi_4` | Pi divided by four | 0.785398 |
`sqrt_pi` | The square root of pi | 1.77245 |
`two_sqrt_pi` | Two divided by the square root of pi | 1.12838 |
`one_by_pi` | The reciprocal of pi (1./pi) | 0.31831 |
`two_by_pi` | Twice the reciprocal of pi | 0.63662 |
`ln10` | The natural logarithm of ten | 2.30259 |
`ln2` | The natural logarithm of two | 0.693147 |
`lnpi` | The natural logarithm of pi | 1.14473 |
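Defining constants as `constexpr` function templates lets the caller pick the floating-point type and have the value computed at compile time. The sketch below shows the pattern with hypothetical `_sketch` names; the exact signatures in `scistats/math/constants.h` may differ, and the linear scaling in `epsilon_sketch` is an assumption consistent with `epsilon(1.) = 2.22045e-16` for `double`.

```cpp
#include <limits>

// The constant pi in the requested floating-point type.
template <class T = double>
constexpr T pi_sketch() {
    return static_cast<T>(3.141592653589793238462643383279502884L);
}

// The square root of two in the requested floating-point type.
template <class T = double>
constexpr T sqrt2_sketch() {
    return static_cast<T>(1.414213562373095048801688724209698079L);
}

// Machine epsilon scaled by the magnitude of `scale` (assumed behavior).
template <class T>
constexpr T epsilon_sketch(T scale) {
    return std::numeric_limits<T>::epsilon() * (scale < T{0} ? -scale : scale);
}
```

Because the functions are `constexpr`, an expression such as `pi_sketch<float>()` can appear in a `static_assert` or initialize another `constexpr` value.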
Some helper functions:
Function | Description |
---|---|
Numeric | |
`abs` | absolute value (for floating point and integers) |
`almost_equal` | check if two numbers are almost the same |
`is_odd` | check if integer is odd |
`is_even` | check if integer is even |
Trigonometric | |
`acot` | inverse cotangent |
`cot` | cotangent |
Special | |
`beta` | beta function |
`beta_inc` | incomplete beta function |
`beta_inc_inv` | inverse of the incomplete beta function |
`beta_inc_inv_upper` | inverse of the upper incomplete beta function |
`beta_inc_upper` | upper incomplete beta function |
`betaln` | natural logarithm of the beta function |
`erfinv` | inverse error function |
`gammaln` | natural logarithm of the gamma function |
`tgamma` | gamma function |
`xinbta` | xinbta |
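To make the numeric helpers concrete, here is a sketch of how an "almost equal" comparison and the parity checks could look. The names, the default tolerance factor, and the scaling rule are all assumptions for illustration; the tolerance scistats applies is not documented here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// True when a and b differ by no more than a few machine epsilons,
// scaled by the larger magnitude (with a floor of 1 so values near
// zero are compared with an absolute tolerance).
bool almost_equal_sketch(double a, double b, double tolerance_factor = 4.0) {
    const double scale = std::max({1.0, std::fabs(a), std::fabs(b)});
    return std::fabs(a - b) <=
           tolerance_factor * std::numeric_limits<double>::epsilon() * scale;
}

// Parity checks; the remainder test also works for negative integers.
constexpr bool is_odd_sketch(long long n) { return n % 2 != 0; }
constexpr bool is_even_sketch(long long n) { return n % 2 == 0; }
```

A comparison like this is what makes `0.1 + 0.2` and `0.3` count as the same number even though their binary representations differ in the last bit.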
To measure the time between two operations:
double t1 = tic();
// your operations
double t2 = toc();
To measure the time it takes to run a function:
double t = timeit([](){
// Your function...
});
To create a mini-benchmark measuring the time it takes to run a function:
std::vector<double> t = minibench([](){
// Your function...
});
std::cout << "Mean: " << mean(t) << std::endl;
std::cout << "Standard Deviation: " << stddev(t) << std::endl;
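Timing helpers of this kind are usually thin wrappers around a monotonic clock. The sketch below uses `std::chrono::steady_clock` and reports seconds; the names, the unit, and the default number of runs are assumptions, not the scistats definitions.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time a single call to f, in seconds, using a monotonic clock.
template <class F>
double timeit_sketch(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Run f several times and collect one timing per run, so the caller can
// summarize the samples with statistics such as the mean.
template <class F>
std::vector<double> minibench_sketch(F&& f, std::size_t runs = 10) {
    std::vector<double> timings;
    timings.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        timings.push_back(timeit_sketch(f));
    }
    return timings;
}
```

Returning the raw samples instead of a single number is what allows the mean and standard deviation to be reported as in the example above.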
To generate a random integer between `a` and `b` with a reasonable random number generator:
randi(a,b)
To generate a random number from a normal distribution:
randn()
To generate a random number from a uniform distribution between `a` and `b`:
rand(a,b)
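These helpers map naturally onto the `<random>` header. The sketch below is one possible shape, with hypothetical `_sketch` names; the choice of `std::mt19937`, the seeding, and the inclusive bounds of the integer version are assumptions.

```cpp
#include <random>

// One generator per thread, seeded once from the system entropy source.
inline std::mt19937& generator_sketch() {
    static thread_local std::mt19937 g{std::random_device{}()};
    return g;
}

// Random integer in the closed interval [a, b].
int randi_sketch(int a, int b) {
    return std::uniform_int_distribution<int>(a, b)(generator_sketch());
}

// Random number from the standard normal distribution.
double randn_sketch() {
    return std::normal_distribution<double>(0.0, 1.0)(generator_sketch());
}

// Random number from a uniform distribution between a and b.
double rand_sketch(double a, double b) {
    return std::uniform_real_distribution<double>(a, b)(generator_sketch());
}
```

Sharing one generator avoids reseeding on every call, which is a common source of poorly distributed values.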
Some functions we plan to implement are:
- Math
- Parallel Arithmetic
- Constants
- Mini-benchmarks
- Random Number Generators
- Descriptive statistics
- Central tendency
- Dispersion
- Correlation
- Hypothesis Tests
- Regression Models
- Classification
- Clustering
- Data processing

To build Scistats, you need:
- C++20
- CMake 3.14+
Instructions: Linux/Ubuntu/GCC
Check your GCC version:
g++ --version
The output should be something like:
g++-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
If you see a version before GCC-10, update it with
sudo apt update
sudo apt install gcc-10
sudo apt install g++-10
Once you have installed a newer version of GCC, you can link it to `update-alternatives`. For instance, if you have GCC-7 and GCC-10, you can link them with:
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 10
You can now use `update-alternatives` to set your default `gcc` and `g++` to a more recent version:
sudo update-alternatives --config g++
sudo update-alternatives --config gcc
Also check your CMake version:
cmake --version
If it's older than CMake 3.14, update it with
sudo apt upgrade cmake
or download the most recent version from cmake.org.
Later, when running CMake, make sure you are using GCC-10 or higher by appending the following options:
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 -DCMAKE_CXX_COMPILER=/usr/bin/g++-10
Instructions: macOS/Clang
Check your Clang version:
clang --version
The output should have something like
Apple clang version 11.0.0
If you see a version before Clang 11, update LLVM+Clang:
curl --output clang.tar.xz -L https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz
mkdir clang
tar -xvJf clang.tar.xz -C clang
cd clang/clang+llvm-11.0.0-x86_64-apple-darwin
sudo cp -R * /usr/local/
Update CMake with
brew upgrade cmake
or download the most recent version from cmake.org.
If the last command fails because you don't have Homebrew on your computer, you can install it with
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
or you can follow the instructions in https://brew.sh.
Instructions: Windows/MSVC
- Make sure you have a recent version of Visual Studio
- Download Git from https://git-scm.com/download/win and install it
- Download CMake from https://cmake.org/download/ and install it
You can see the dependencies in `source/CMakeLists.txt`.
This will build the examples in the `build/examples` directory:
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2"
cmake --build . --parallel 2 --config Release
- Replace `--parallel 2` with `--parallel <number of cores in your machine>`
- On Windows, replace `-O2` with `/O2`
- On Linux, you might need `sudo` for this last command
This will install Scistats on your system:
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2" -DBUILD_EXAMPLES=OFF -DBUILD_TESTS=OFF
cmake --build . --parallel 2 --config Release
cmake --install .
- Replace `--parallel 2` with `--parallel <number of cores in your machine>`
- On Windows, replace `-O2` with `/O2`
- On Linux, you might need `sudo` for this last command
If you have the library installed, you can call
find_package(Scistats)
from your CMake build script.
When creating your executable, link the library to the targets you want:
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
Add this header to your source files:
#include <scistats/scistats.h>
You can use Scistats directly in CMake projects without installing it. Check if you have CMake 3.14+ installed:
cmake -version
Clone the whole project
git clone https://github.com/alandefreitas/scistats/
and add the subdirectory to your CMake project:
add_subdirectory(scistats)
When creating your executable, link the library to the targets you want:
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
You can now add the scistats headers to your source files.
However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. Otherwise, we can get ODR errors in larger projects.
Check if you have CMake 3.14+ installed:
cmake -version
Install CPM.cmake and then:
CPMAddPackage(
NAME scistats
GITHUB_REPOSITORY alandefreitas/scistats
GIT_TAG origin/master # or whatever tag you want
)
# ...
target_link_libraries(my_target PUBLIC scistats)
You can now add the scistats headers to your source files.
However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. You can use:
option(CPM_USE_LOCAL_PACKAGES "Try `find_package` before downloading dependencies" ON)
to let CPM.cmake do that for you. Otherwise, we can get ODR errors in larger projects.
If you want to use it in another build system, you can either install the library (Section Binary Packages or Section Installing Scistats from Source) or rewrite the build script yourself.
If you want to rewrite the build script, your project needs to 1) include the headers, and 2) link with the dependencies described in `source/CMakeLists.txt`.
There are many ways in which you can contribute to this library:
- Testing the library in new environments
- Contributing with interesting examples
- Contributing with new statistics
- Finding problems in this documentation
- Finding bugs in general
- Whatever idea seems interesting to you
If contributing with code, please leave the pedantic mode ON (`-DBUILD_WITH_PEDANTIC_WARNINGS=ON`), and don't forget cppcheck and clang-format.
If contributing to the documentation, please edit `README.md` directly, as the files in `./docs` are automatically generated with mdsplit.
Alan De Freitas |
Rcpsilva |