Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix astype in coordinates matching when sampling data frames #126

Merged
merged 6 commits into from
Aug 20, 2024

Conversation

acocac
Copy link
Member

@acocac acocac commented Aug 9, 2024

The PR fixes the sampling strategy when locations are np.ndarray. It also amends the docs of the sample_df function.

Fixes #125

@tom-andersson
Copy link
Collaborator

This looks like a great catch! Two comments:

  1. Please could you add a unit test in test_task_loader that tries to sample with a different dtype, run it localy with pytest, and confirm that it fails with InvalidSamplingStrategyError when your fix is commented out, and passes after the fix, then add it to this PR
  2. run ‘black’ for style to pass

@acocac
Copy link
Member Author

acocac commented Aug 13, 2024

@tom-andersson if I understand well, the proposed unit test should test a Dataframe to sample with different dtypes in the x1 and x2 dimensions. There's already a test checking invalid sampling strategies, so should I create another one for invalid sampled dataframes?

@tom-andersson
Copy link
Collaborator

@acocac I wouldn't put the unit test in test_invalid_sampling_strat, because we actually expect that there will not be any errors raised when the x1/x2 dtype is different from that of the DataFrame being sampled. So I would add a new unit test like test_different_dtype_when_sampling_offgrid_data_at_specific_numpy_locs. Here is an example where we just run the TaskLoader and add a comment that no error should be raised. And here is an example where we sample a toy DataFrame at [0, 0], so you can use sampling_strat=np.zeros((2, 1), dtype=wrong_dtype) and then try task = tl("2020-01-01", sampling_strat, sampling_strat).

That unit test should fail before your fix, and pass afterwards.

@acocac
Copy link
Member Author

acocac commented Aug 14, 2024

@tom-andersson thanks for the suggestions. I've added the unit test, however, I've found the example of sampling the toy data frame at [0,0] isn't the most useful to indicate the effect of forcing x1/x2 dtype. I think the problem relates to inaccuracies in the floats with values between 0 and 1 and their impact in np.in1d

The proposed test actually fails with/without the fix, so I'd appreciate if you can have a look at the test 🙏

@tom-andersson tom-andersson marked this pull request as draft August 20, 2024 10:14
@tom-andersson tom-andersson marked this pull request as ready for review August 20, 2024 12:10
@tom-andersson
Copy link
Collaborator

tom-andersson commented Aug 20, 2024

@acocac the unit test was failing because of a bug in the way the DataFrame is being sampled, which triggered an InvalidSamplingStrategy error. I'll fix this in a separate commit.

Regarding the type issue, you were not actually passing the TaskLoader a numpy array with a different dtype to the index values. I fixed that, and then found that the unit test was still failing even after casting the dtype. Thinking more about this, I think it is not safe to do casting, as it doesn't guarantee that coords of different dtypes will match. I've decided to instead raise an informative error for the user if the dtypes differ, rather than casting under the hood, and I've updated the unit test to reflect this.

Happy to merge this once tests are passing. Thanks again for finding this!

@tom-andersson
Copy link
Collaborator

@all-contributors please add @acocac for code, tests

Copy link
Contributor

@tom-andersson

I've put up a pull request to add @acocac! 🎉

@tom-andersson tom-andersson merged commit f5c1a92 into alan-turing-institute:main Aug 20, 2024
4 checks passed
@acocac
Copy link
Member Author

acocac commented Aug 20, 2024

@acocac the unit test was failing because of a bug in the way the DataFrame is being sampled, which triggered an InvalidSamplingStrategy error. I'll fix this in a separate commit.

Regarding the type issue, you were not actually passing the TaskLoader a numpy array with a different dtype to the index values. I fixed that, and then found that the unit test was still failing even after casting the dtype. Thinking more about this, I think it is not safe to do casting, as it doesn't guarantee that coords of different dtypes will match. I've decided to instead raise an informative error for the user if the dtypes differ, rather than casting under the hood, and I've updated the unit test to reflect this.

Happy to merge this once tests are passing. Thanks again for finding this!

The proposed solution sounds reasonable. I agree it could be safer to do casting from the user-side. The warning message helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential bug: GreedyAlgorithm fails for Oracle-related Acquisition Functions
2 participants