Fix astype in coordinates matching when sampling data frames #126

acocac · 2024-08-09T13:11:27Z

The PR fixes the sampling strategy when locations are np.ndarray. It also amends the docs of the sample_df function.

Fixes #125

tom-andersson · 2024-08-09T14:25:39Z

This looks like a great catch! Two comments:

Please could you add a unit test in test_task_loader that tries to sample with a different dtype, run it localy with pytest, and confirm that it fails with InvalidSamplingStrategyError when your fix is commented out, and passes after the fix, then add it to this PR
run ‘black’ for style to pass

acocac · 2024-08-13T15:53:15Z

@tom-andersson if I understand well, the proposed unit test should test a Dataframe to sample with different dtypes in the x1 and x2 dimensions. There's already a test checking invalid sampling strategies, so should I create another one for invalid sampled dataframes?

tom-andersson · 2024-08-13T16:52:32Z

@acocac I wouldn't put the unit test in test_invalid_sampling_strat, because we actually expect that there will not be any errors raised when the x1/x2 dtype is different from that of the DataFrame being sampled. So I would add a new unit test like test_different_dtype_when_sampling_offgrid_data_at_specific_numpy_locs. Here is an example where we just run the TaskLoader and add a comment that no error should be raised. And here is an example where we sample a toy DataFrame at [0, 0], so you can use sampling_strat=np.zeros((2, 1), dtype=wrong_dtype) and then try task = tl("2020-01-01", sampling_strat, sampling_strat).

That unit test should fail before your fix, and pass afterwards.

acocac · 2024-08-14T11:03:14Z

@tom-andersson thanks for the suggestions. I've added the unit test, however, I've found the example of sampling the toy data frame at [0,0] isn't the most useful to indicate the effect of forcing x1/x2 dtype. I think the problem relates to inaccuracies in the floats with values between 0 and 1 and their impact in np.in1d

The proposed test actually fails with/without the fix, so I'd appreciate if you can have a look at the test 🙏

… if numpy array dtype is wrong

…re casting errors)

tom-andersson · 2024-08-20T12:12:00Z

@acocac the unit test was failing because of a bug in the way the DataFrame is being sampled, which triggered an InvalidSamplingStrategy error. I'll fix this in a separate commit.

Regarding the type issue, you were not actually passing the TaskLoader a numpy array with a different dtype to the index values. I fixed that, and then found that the unit test was still failing even after casting the dtype. Thinking more about this, I think it is not safe to do casting, as it doesn't guarantee that coords of different dtypes will match. I've decided to instead raise an informative error for the user if the dtypes differ, rather than casting under the hood, and I've updated the unit test to reflect this.

Happy to merge this once tests are passing. Thanks again for finding this!

…data is gridded

tom-andersson · 2024-08-20T12:54:25Z

@all-contributors please add @acocac for code, tests

allcontributors · 2024-08-20T12:54:34Z

@tom-andersson

I've put up a pull request to add @acocac! 🎉

acocac · 2024-08-20T13:17:05Z

@acocac the unit test was failing because of a bug in the way the DataFrame is being sampled, which triggered an InvalidSamplingStrategy error. I'll fix this in a separate commit.

Regarding the type issue, you were not actually passing the TaskLoader a numpy array with a different dtype to the index values. I fixed that, and then found that the unit test was still failing even after casting the dtype. Thinking more about this, I think it is not safe to do casting, as it doesn't guarantee that coords of different dtypes will match. I've decided to instead raise an informative error for the user if the dtypes differ, rather than casting under the hood, and I've updated the unit test to reflect this.

Happy to merge this once tests are passing. Thanks again for finding this!

The proposed solution sounds reasonable. I agree it could be safer to do casting from the user-side. The warning message helps!

acocac added 2 commits August 9, 2024 13:51

add astype in coordinates matching

ef99d9d

fix doc in sample_df

740e71c

acocac requested a review from tom-andersson August 9, 2024 13:18

add test for different dtype when sampling offgrid data

ce9913b

tom-andersson marked this pull request as draft August 20, 2024 10:14

tom-andersson added 2 commits August 20, 2024 13:03

Remove casting from numpy TaskLoader sampling and instead raise error…

d4953f7

… if numpy array dtype is wrong

Add helpful suggestion in numpy sampling error message (eg if there a…

85326d2

…re casting errors)

tom-andersson marked this pull request as ready for review August 20, 2024 12:10

Use .loc for numpy sampling, fixes sampling extra values when pandas …

b74f97e

…data is gridded

allcontributors bot mentioned this pull request Aug 20, 2024

docs: add acocac as a contributor for code, and test #128

Merged

tom-andersson merged commit f5c1a92 into alan-turing-institute:main Aug 20, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix astype in coordinates matching when sampling data frames #126

Fix astype in coordinates matching when sampling data frames #126

acocac commented Aug 9, 2024

tom-andersson commented Aug 9, 2024

acocac commented Aug 13, 2024

tom-andersson commented Aug 13, 2024

acocac commented Aug 14, 2024 •

edited

Loading

tom-andersson commented Aug 20, 2024 •

edited

Loading

tom-andersson commented Aug 20, 2024

allcontributors bot commented Aug 20, 2024

acocac commented Aug 20, 2024

Fix astype in coordinates matching when sampling data frames #126

Fix astype in coordinates matching when sampling data frames #126

Conversation

acocac commented Aug 9, 2024

tom-andersson commented Aug 9, 2024

acocac commented Aug 13, 2024

tom-andersson commented Aug 13, 2024

acocac commented Aug 14, 2024 • edited Loading

tom-andersson commented Aug 20, 2024 • edited Loading

tom-andersson commented Aug 20, 2024

allcontributors bot commented Aug 20, 2024

acocac commented Aug 20, 2024

acocac commented Aug 14, 2024 •

edited

Loading

tom-andersson commented Aug 20, 2024 •

edited

Loading