Conditional mutual information (CMI) is a model-free measure of the statistical dependence between two variables X and Y, conditioned on a third variable Z:

I(X; Y | Z) = Σ_{x,y,z} p(x, y, z) log[ p(x, y | z) / (p(x | z) p(y | z)) ]
In many real-life applications, one may need to control for more than one variable. For this purpose, I implemented a generalized CMI estimator that can calculate CMIs of the form

I(X; Y | Z_1, ..., Z_n)
for discrete random variables. If your dataset contains continuous fields, I suggest discretizing them first, e.g. by binning into quantiles, as sketched below.
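A minimal sketch of quantile binning with pandas (the file name and the continuous 'price' column are hypothetical):

import pandas as pd

data = pd.read_csv('your_path.csv')
# Bin 'price' into 4 quantile bins; labels=False returns integer bin codes,
# which are then cast to float as the estimator expects.
data['price'] = pd.qcut(data['price'], q=4, labels=False).astype(float)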
If you have time-series data, you may want to assess the impact of past states of X on current states of Y while also adjusting for past states of Y, i.e. estimate quantities such as

I(X_{t-k}; Y_t | Y_{t-1}, ..., Y_{t-m})

One way to build the lagged variables with pandas is sketched below.
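This sketch assumes a DataFrame with columns 'x' and 'y'; the column names and file are hypothetical:

import pandas as pd

data = pd.read_csv('your_path.csv')
# x_lag1 holds X_{t-1} and y_lag1 holds Y_{t-1} for each row t.
data['x_lag1'] = data['x'].shift(1)
data['y_lag1'] = data['y'].shift(1)
data = data.dropna()  # the first row has no lagged values
# 'x_lag1' and 'y' can then be passed as x and y, with 'y_lag1' as the
# conditioning variable z.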
If you're on macOS/Linux/Unix, run the ./setup.sh file to build the shared library. If you're on Windows, you can access a free hosted Linux runtime through Google Colab.
Here's an example of how CMI values can be estimated from Pandas DataFrames. The code needs to be placed in the project's root directory. Ensure all variables are of float type.
import pandas as pd
import cmipy

# Load the dataset; every column passed to the estimator must be float.
data = pd.read_csv('your_path.csv')
# Estimate I(price; area | location, bathrooms) together with a p-value.
cmi, p_val = cmipy.cond_mutual_info(data=data, x='price', y='area', z=['location', 'bathrooms'], p_samples=10000)
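If some of these columns are stored as integers or categorical codes (a sketch, assuming the column names above), they can be cast to float beforehand:

cols = ['price', 'area', 'location', 'bathrooms']
data[cols] = data[cols].astype(float)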
In addition to Pandas DataFrames, NumPy arrays are also supported out-of-the-box. Again, ensure all variables are of float type.
import numpy as np
import cmipy

# Draw 10,000 samples of three independent discrete variables with 4 states
# each, cast to float as the estimator expects.
x = np.random.randint(0, 4, size=(10000, 1)).astype(float)
y = np.random.randint(0, 4, size=(10000, 1)).astype(float)
z = np.random.randint(0, 4, size=(10000, 1)).astype(float)
# base=2.0 reports the CMI in bits.
cmi, p_val = cmipy.cond_mutual_info(x=x, y=y, z=z, p_samples=10000, base=2.0)
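As a quick sanity check (a sketch; the exact numbers depend on the estimator), y can be made a noisy copy of x, in which case the estimated CMI should be clearly positive and the p-value small:

import numpy as np
import cmipy

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=(10000, 1)).astype(float)
z = rng.integers(0, 4, size=(10000, 1)).astype(float)
noise = rng.integers(0, 4, size=(10000, 1)).astype(float)
# y equals x 80% of the time and is uniform noise otherwise, so X and Y stay
# dependent even after conditioning on the independent Z.
y = np.where(rng.random(size=(10000, 1)) < 0.8, x, noise)
cmi, p_val = cmipy.cond_mutual_info(x=x, y=y, z=z, p_samples=10000, base=2.0)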