RuntimeWarning: divide by zero encountered in divide when using evaluate_causal_model #1213

Open
newbietogitdotcom opened this issue Jun 22, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@newbietogitdotcom

Describe the bug
All columns in my data are numeric, and the data contains no null, zero, or infinite values. It also has no duplicate values, but I still keep getting this warning:

"Evaluating causal mechanisms...: 50%|█████ | 10/20 [00:06<00:06, 1.55it/s]/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/dowhy/gcm/divergence.py:84: RuntimeWarning: divide by zero encountered in divide
result = np.sum((d / n) * np.log(nu / rho)) + np.log(m / (n - 1))
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/dowhy/gcm/divergence.py:84: RuntimeWarning: divide by zero encountered in divide
result = np.sum((d / n) * np.log(nu / rho)) + np.log(m / (n - 1))
Evaluating causal mechanisms...: 100%|██████████| 20/20 [00:17<00:00, 1.16it/s]
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/dowhy/gcm/divergence.py:84: RuntimeWarning: divide by zero encountered in divide
result = np.sum((d / n) * np.log(nu / rho)) + np.log(m / (n - 1))"

I also get this error:

"name": "RuntimeError",
"message": "Got a non-finite KL divergence! This can happen if both data sets have overlapping elements. Since these are normally removed by this method, double check whether the arrays are numeric.",

Versions/3.10/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Got a non-finite KL divergence! This can happen if both data sets have overlapping elements. Since these are normally removed by this method, double check whether the arrays are numeric.
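
The expression in the warning has the usual form of a k-nearest-neighbour KL divergence estimate, where `rho` is typically each point's distance to its k-th neighbour within the first sample and `nu` is its distance to the k-th neighbour in the second sample. The sketch below is a NumPy-only illustration of that form, not DoWhy's actual implementation; the function name `knn_kl_divergence` and the brute-force distance computation are assumptions made for this example. It suggests one way the warning can arise: a repeated value in the first sample gives `rho = 0`, so `nu / rho` divides by zero and the result is non-finite.

```python
import numpy as np

def knn_kl_divergence(x, y, k=1):
    """Sketch of a k-nearest-neighbour KL estimate between samples x ~ P and y ~ Q."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n, d = x.shape
    m = y.shape[0]

    # Pairwise Euclidean distances (brute force; fine for small samples).
    dist_xx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dist_xy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

    # rho: distance to the k-th neighbour within x. After sorting, column 0
    # is the zero distance of each point to itself, so column k is used.
    rho = np.sort(dist_xx, axis=1)[:, k]
    # nu: distance to the k-th neighbour in y.
    nu = np.sort(dist_xy, axis=1)[:, k - 1]

    return np.sum((d / n) * np.log(nu / rho)) + np.log(m / (n - 1))

# Synthetic data, not the data from this issue.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = rng.normal(size=(60, 1))
clean = knn_kl_divergence(x, y)

# Repeat one row of x: its nearest-neighbour distance within x becomes 0.
x_dup = np.vstack([x, x[:1]])
with np.errstate(divide="ignore"):  # silence the "divide by zero" warning
    broken = knn_kl_divergence(x_dup, y)

print(np.isfinite(clean), np.isfinite(broken))
```

Under these assumptions, values that repeat within a single column are enough to trigger the problem, even when no full row of the data set is duplicated.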


@newbietogitdotcom newbietogitdotcom added the bug Something isn't working label Jun 22, 2024
@bloebp
Member

bloebp commented Jun 24, 2024

Hi, does your data have any columns that contain only a constant value?

@newbietogitdotcom
Author

Hi @bloebp, thank you for replying to my post.

No, it does not have any column with a constant value. Please find more information about my data below:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 22 columns):

 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Date    29 non-null     dbdate
 1   ET      29 non-null     float64
 2   EOT     29 non-null     float64
 3   DU      29 non-null     float64
 4   OD      29 non-null     Int64
 5   ONTD    29 non-null     float64
 6   ST      29 non-null     Int64
 7   UT      29 non-null     Int64
 8   OT      29 non-null     Int64
 9   TT      29 non-null     Int64
 10  THT     29 non-null     Int64
 11  SS      29 non-null     float64
 12  MPH     29 non-null     float64
 13  OA      29 non-null     float64
 14  LCA     29 non-null     float64
 15  OTP     29 non-null     float64
 16  DT      29 non-null     float64
 17  DST     29 non-null     Int64
 18  PM      29 non-null     float64
 19  BC      29 non-null     float64
 20  IC      29 non-null     float64
 21  TIP     29 non-null     float64
dtypes: Int64(7), dbdate(1), float64(14)
memory usage: 5.3 KB

Below is the count of unique values per column:

Date 29
ET 29
EOT 29
DU 23
OD 29
ONTD 7
ST 29
UT 28
OT 29
TT 29
THT 29
SS 3
MPH 25
OA 3
LCA 29
OTP 29
DT 27
DST 10
PM 16
BC 29
IC 29
TIP 29
dtype: int64
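
Several of these counts are below the 29 rows (SS and OA have only 3 unique values, for example), so those columns do contain repeated values even if no complete row is duplicated. A minimal sketch of that per-column check follows. It uses small synthetic arrays rather than the data from this issue, and the helper name `columns_with_repeats` is hypothetical.

```python
import numpy as np

def columns_with_repeats(columns):
    """Return names of columns whose number of unique values is below the row count."""
    return [
        name
        for name, values in columns.items()
        if np.unique(np.asarray(values)).size < len(values)
    ]

# Synthetic stand-in data, not the data from this issue.
rng = np.random.default_rng(0)
data = {
    "continuous": rng.normal(size=29),          # continuous draws, all distinct
    "few_levels": rng.integers(0, 3, size=29),  # only 3 possible values in 29 rows
}

repeated = columns_with_repeats(data)
print(repeated)
```

Columns flagged this way are natural candidates to inspect first when a distance-based estimate reports a zero distance.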

@bloebp
Member

bloebp commented Jun 24, 2024

Ok interesting, is there any chance you can provide some artificially generated data that reproduces this issue? I can take a closer look then.

@ZippyCom

ZippyCom commented Aug 18, 2024

I confirm this issue is still present in the latest release. I managed to resolve it locally by setting assume_unique to False on line 64 of gcm/divergence.py. According to the NumPy docs: "If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False."
I am new to the DoWhy library, so could someone more experienced check whether this change has unintended functional consequences? Thanks.
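
To illustrate what the quoted NumPy option controls, here is a small sketch using `np.setdiff1d`, one of the NumPy set routines that accepts `assume_unique`. Whether line 64 of gcm/divergence.py calls this exact function is an assumption here, not something confirmed in this thread. With `assume_unique=False`, NumPy de-duplicates both inputs before computing the difference; with `assume_unique=True`, the caller is responsible for guaranteeing that the inputs contain no repeats. The helper `has_duplicates` is hypothetical.

```python
import numpy as np

a = np.array([3.0, 1.0, 3.0, 2.0, 4.0])  # contains a repeated value
b = np.array([2.0, 5.0])

# Default behaviour: both inputs are de-duplicated and the result is sorted.
diff = np.setdiff1d(a, b, assume_unique=False)
print(diff)  # [1. 3. 4.]

def has_duplicates(arr):
    """True if the 1-D array contains any value more than once."""
    return np.unique(arr).size < arr.size

# Only pass assume_unique=True once the precondition has been verified.
safe_to_assume = not (has_duplicates(a) or has_duplicates(b))
print(safe_to_assume)  # False
```

Because the default path removes repeats from the first input as well, switching the flag can change which samples reach later computations, which is one reason to run the unit tests after the change.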

@bloebp
Copy link
Member

bloebp commented Aug 19, 2024

Thanks for checking on this! I am not sure if there was a particular reason why this was set to True. If the unit tests pass, I think it is fine if we just set this to False then.

Can you run the unit tests and, if they pass, do you want to open a PR to change it?

@ZippyCom

@bloebp I don't have permissions, so feel free to do it.
