Passing only one weights array for cross-correlations of unequal sizes #180

lgarrison · 2019-05-16T19:14:27Z

General information

Corrfunc version: v2.2

Issue description

Passing only one weights array when doing a cross-correlation of particle sets of different sizes causes a RuntimeError.

The cause is that the Python wrappers create a uniform weights array for the second set of particles if only the first was provided, but it is created with size equal to the first particle set.

Minimal failing example

import numpy as np
from Corrfunc.theory.DD import DD

N_d = 10000
Lbox = 250.
N_r = N_d*5

x_DM_jack = np.random.uniform(0.,Lbox,N_d)                                                                  
y_DM_jack = np.random.uniform(0.,Lbox,N_d)                                                                  
z_DM_jack = np.random.uniform(0.,Lbox,N_d)

x_rand_jack = np.random.uniform(0.,Lbox,N_r)                                                                  
y_rand_jack = np.random.uniform(0.,Lbox,N_r)                                                                  
z_rand_jack = np.random.uniform(0.,Lbox,N_r)

w_DM_jack = np.full_like(x_DM_jack,1.)                                                                        
w_DM_jack[:] = np.random.uniform(1.,2.,N_d)

bins = np.logspace(-1,1.,25)

autocorr = 0                                                                                 
results = DD(autocorr,nthreads=16,binfile=bins,                                            
                           X1=x_rand_jack, Y1=y_rand_jack, Z1=z_rand_jack,
                           X2=x_DM_jack, Y2=y_DM_jack, Z2=z_DM_jack,weights2=w_DM_jack,                  
                           boxsize=Lbox,periodic=False)

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
error: ERROR: the last dimension of `weights` must match the number of positions.  Instead found n_weights=10000, nx=50000

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-fac2f1ff5530> in <module>
     23                            X1=x_rand_jack, Y1=y_rand_jack, Z1=z_rand_jack,
     24                            X2=x_DM_jack, Y2=y_DM_jack, Z2=z_DM_jack,weights2=w_DM_jack,
---> 25                            boxsize=Lbox,periodic=False)

~/anaconda3/lib/python3.6/site-packages/Corrfunc-2.3.0-py3.6-linux-x86_64.egg/Corrfunc/theory/DD.py in DD(autocorr, nthreads, binfile, X1, Y1, Z1, weights1, periodic, X2, Y2, Z2, weights2, verbose, boxsize, output_ravg, xbin_refine_factor, ybin_refine_factor, zbin_refine_factor, max_cells_per_dim, enable_min_sep_opt, c_api_timer, isa, weight_type)
    246     if extn_results is None:
    247         msg = "RuntimeError occurred"
--> 248         raise RuntimeError(msg)
    249     else:
    250         extn_results, api_time = extn_results

RuntimeError: RuntimeError occurred

Thanks to @boryanah for finding this and providing the failing example!

The text was updated successfully, but these errors were encountered:

manodeep · 2019-05-16T22:17:45Z

What is the expected behaviour here? If there was only a scalar passed, then I can understand numpy-style dimension broadcasting. But if there is only one weights array passed, then we should raise a ValueError and request that the user pass another weights array of shape (num_weights, ND2) with the values all set to 1.0 for PAIR_PRODUCT (or whatever would be equivalent for their weight type for custom weighting schemes). The num_weights for the second weights array should match that of the first array.

lgarrison · 2019-05-16T22:36:11Z

The expected behavior is that the Python wrapper automatically provides such an array. Indeed, this is the current documented behavior (but only works as intended when the two sets are of equal size). I think it's sensible to raise a ValueError if the weighting scheme is not `pair_product`.

…

On Thu, May 16, 2019 at 6:17 PM Manodeep Sinha ***@***.***> wrote: What is the expected behaviour here? If there was only a scalar passed, then I can understand numpy-style dimension broadcasting. But if there is only one weights array passed, then we should raise a ValueError and request that the user pass another weights array of shape (num_weights, ND2) with the values all set to 1.0 for PAIR_PRODUCT (or whatever would be equivalent for their weight type for custom weighting schemes). The num_weights for the second weights array should match that of the first array. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=ABLA7S6LKY4ENT7BMXAEHNTPVXMQTA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTGS5I#issuecomment-493250933>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABLA7S4QJRWRZAR4BIJ4G3LPVXMQTANCNFSM4HNPNDWA> .

manodeep · 2019-05-16T22:40:40Z

I am not sure how to automatically provide the array for the second set of particles. Say, the weights1 is full of uniform random numbers - how do you generate an equivalent array for the second dataset?
Are you saying to generate an array of the correct shape only when weights1 contains only 1 element (i.e., weights1 is a scalar)?

lgarrison · 2019-05-16T22:42:14Z

I just meant generate an array filled with 1s.

…

On Thu, May 16, 2019 at 6:40 PM Manodeep Sinha ***@***.***> wrote: I am not sure how to automatically provide the array for the second set of particles. Say, the weights1 is full of uniform random numbers - how do you generate an equivalent array for the second dataset? Are you saying to generate an array of the correct shape *only* when weights1 contains only 1 element (i.e., weights1 is a scalar)? — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=ABLA7S6M5DYUOEDT6RP6AUDPVXPGRA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTHZUY#issuecomment-493255891>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABLA7S4EALGZ55QB2WBYHWDPVXPGRANCNFSM4HNPNDWA> .

boryanah · 2019-05-16T22:43:48Z

Hi! I also wonder: I am writing my own estimator to convert DD, DR, RR for non-periodic BC into cf, but I wonder about whether there are any normalization factors when you have weights (I thought there shouldn't but without them I don't seem to be getting the right answer) -- this also doesn't seem to be in convert_3d_counts_to_cf(), but I may be wrong. Thank you for helping. Best, Boryana On Thu, May 16, 2019 at 6:36 PM Lehman Garrison <notifications@github.com> wrote:

…

The expected behavior is that the Python wrapper automatically provides such an array. Indeed, this is the current documented behavior (but only works as intended when the two sets are of equal size). I think it's sensible to raise a ValueError if the weighting scheme is not `pair_product`. On Thu, May 16, 2019 at 6:17 PM Manodeep Sinha ***@***.***> wrote: > What is the expected behaviour here? If there was only a scalar passed, > then I can understand numpy-style dimension broadcasting. But if there is > only one weights array passed, then we should raise a ValueError and > request that the user pass another weights array of shape (num_weights, > ND2) with the values all set to 1.0 for PAIR_PRODUCT (or whatever would > be equivalent for their weight type for custom weighting schemes). The > num_weights for the second weights array should match that of the first > array. > > — > You are receiving this because you were assigned. > Reply to this email directly, view it on GitHub > < #180?email_source=notifications&email_token=ABLA7S6LKY4ENT7BMXAEHNTPVXMQTA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTGS5I#issuecomment-493250933 >, > or mute the thread > < https://github.com/notifications/unsubscribe-auth/ABLA7S4QJRWRZAR4BIJ4G3LPVXMQTANCNFSM4HNPNDWA > > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=AKJFGGBXUTRIO7GWOWRU2T3PVXOVZA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTHSYI#issuecomment-493255009>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKJFGGAPSBTVWUTGBEPVQNTPVXOVZANCNFSM4HNPNDWA> .

manodeep · 2019-05-16T22:52:36Z

@lgarrison Ahh okay. So a matrix of the correct shape, filled with 1's for the PAIR_PRODUCT case. ValueError with the "Please supply weights" for all other weighting schemes...

@boryanah Yeah - we really don't have an utility to compute the weighted correlation function (or even check that weights have been used and warn the user). This has been open for a while now in issue #139

boryanah · 2019-05-16T23:33:11Z

Thanks for the reference! I am really sorry if this is very obvious, but which pair counts should be multiplied by the average in what is fed to the convert function in the case of different weights for R and D and pair_product type of weighting. I tried: cf = convert_3d_counts_to_cf(N_d, N_d, N_r, N_r, DD*w_avg, DR*w_avg, DR*w_avg, RR) where w_avg is the mean weight of the data (mean weight of rand is 1), but it didn't give me the right answer

…

On Thu, May 16, 2019 at 6:52 PM Manodeep Sinha ***@***.***> wrote: @lgarrison <https://github.com/lgarrison> Ahh okay. So a matrix of the correct shape, filled with 1's for the PAIR_PRODUCT case. ValueError with the "Please supply weights" for all other weighting schemes... @boryanah <https://github.com/boryanah> Yeah - we really don't have an utility to compute the weighted correlation function (or even check that weights have been used and warn the user). This has been open for a while now in issue #139 <#139> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=AKJFGGBZZEUORBFGSCYK4P3PVXQTLA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTILJA#issuecomment-493258148>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKJFGGCRUBC2KNGJUPAYGJLPVXQTLANCNFSM4HNPNDWA> .

manodeep · 2019-05-17T03:20:25Z

@boryanah You have to first compute the average weights on the dataset1 (and dataset2 for cross-correlations) with avgweight := sum(weights)/N_particles. In each radial bin, the total sum of weights is weightsum := weightavg * num_pairs_in_that_bin (where the weightavg is in itself a sum of the product of the two weights). Now, you have to figure out what would have been the weightsum if the points were randomly distributed -- so multiply by the number of points in one catalog with the density, volume, and average weight of the other catalog.

@lgarrison One thing I realise is that our rpavg is currently completely unweighted. We should probably also have an option to say return_weighted_rpavg -- where the math will be rp += weight*r within the kernels (instead of rp += r) and then divided by weightavg * num_pairs_in_bin at the end. (I have a vague memory of we discussing this previously, and deciding against weighted rpavg for backwards compatibility)

manodeep · 2019-05-17T03:24:41Z

@boryanah And then use the different weightsum for each DD, DR, and RR and combine with Landy-Szalay to get the weighted average. @lgarrison Will you please check that I have not omitted steps (or completely mis-stated) ? :)

lgarrison · 2019-05-17T03:58:46Z

That sounds right to me! It seems like we should be returning weightsum instead of weightavg; it was a poor choice on my part. I guess it couldn't hurt to add the latter to the results struct. And good point about rpavg. One option would be to return both, although that would be extra work in the kernels.

…

On Thu, May 16, 2019 at 11:24 PM Manodeep Sinha ***@***.***> wrote: @boryanah <https://github.com/boryanah> And then use the different weightsum for each DD, DR, and RR and combine with Landy-Szalay to get the weighted average. @lgarrison <https://github.com/lgarrison> Will you please check that I have not omitted steps (or completely mis-stated) ? :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=ABLA7S4MDRZCOAZ6RDZ2LYLPVYQPTA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTT4FA#issuecomment-493305364>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABLA7SZDD7RFLW6UGUBDTIDPVYQPTANCNFSM4HNPNDWA> .

boryanah · 2019-05-17T04:07:56Z

Dear Manodeep, Thanks a lot! This is what I understood (I think I got 2 wrong) Say 1 is Data; 2 is Random. We are trying to compute the factor that should multiply RD before it gets into the LS estimator. This factor will be F 1) N1,2 = number of particles Np1,2 = number of pairs per bin (vector) w_avg1.2 = "weight average" as returned by the DD function when computing DD(XYZ1,XYZ2, weights1,weights2) (also vector) sum1,2 = w_avg1,2*Np1,2 (vector) sum of the products of weights in a given bin 2) We also have mean weights: w_mean1,2 = sum(weights1,2)/N1,2 To compute the sum of weights in the case of random for dataset 1 with weights for 2, we do sum_rand1 = N1 * [ N2/Vol * Vol * w_mean2 ] = N1* N2* w_mean2 (I think that I misunderstood this, maybe it is w_mean1*w_mean2*N1*N2 for same volume?) 3) Here you say multiply the sum of weights per bin, sum1, by what would have been the sum of weights for the catalog if random, which is sum_rand1, so F1 = sum_rand1*sum1 4) We substitute DR1 and DR2 in LS (or convert 3d to cf) with DR1*F1 and DR2*F2 This doesn't seem symmetric, so I guess this is DR1 and there is also DR2 which should go into the final estimator. Best, Boryana

…

On Thu, May 16, 2019 at 11:24 PM Manodeep Sinha ***@***.***> wrote: @boryanah <https://github.com/boryanah> And then use the different weightsum for each DD, DR, and RR and combine with Landy-Szalay to get the weighted average. @lgarrison <https://github.com/lgarrison> Will you please check that I have not omitted steps (or completely mis-stated) ? :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=AKJFGGCMG7XD7RIUGVMMVGTPVYQPVA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTT4FA#issuecomment-493305364>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKJFGGGJXAQUUJS7QK4IPR3PVYQPVANCNFSM4HNPNDWA> .

boryanah · 2019-05-17T04:21:11Z

I suspect that for 2) sum_avg1 = N1* [N2/Vol Vol_bin w_mean2] and it is thus a vector and Vol is the total volume while Vol_bin is only volume of the bin and for 3) maybe we want F1 = sum_rand1/sum1 I am really sorry for bothering you. On Fri, May 17, 2019 at 12:07 AM Hadzhiyska, Boryana < boryana.hadzhiyska@cfa.harvard.edu> wrote:

…

Dear Manodeep, Thanks a lot! This is what I understood (I think I got 2 wrong) Say 1 is Data; 2 is Random. We are trying to compute the factor that should multiply RD before it gets into the LS estimator. This factor will be F 1) N1,2 = number of particles Np1,2 = number of pairs per bin (vector) w_avg1.2 = "weight average" as returned by the DD function when computing DD(XYZ1,XYZ2, weights1,weights2) (also vector) sum1,2 = w_avg1,2*Np1,2 (vector) sum of the products of weights in a given bin 2) We also have mean weights: w_mean1,2 = sum(weights1,2)/N1,2 To compute the sum of weights in the case of random for dataset 1 with weights for 2, we do sum_rand1 = N1 * [ N2/Vol * Vol * w_mean2 ] = N1* N2* w_mean2 (I think that I misunderstood this, maybe it is w_mean1*w_mean2*N1*N2 for same volume?) 3) Here you say multiply the sum of weights per bin, sum1, by what would have been the sum of weights for the catalog if random, which is sum_rand1, so F1 = sum_rand1*sum1 4) We substitute DR1 and DR2 in LS (or convert 3d to cf) with DR1*F1 and DR2*F2 This doesn't seem symmetric, so I guess this is DR1 and there is also DR2 which should go into the final estimator. Best, Boryana On Thu, May 16, 2019 at 11:24 PM Manodeep Sinha ***@***.***> wrote: > @boryanah <https://github.com/boryanah> And then use the different > weightsum for each DD, DR, and RR and combine with Landy-Szalay to get > the weighted average. @lgarrison <https://github.com/lgarrison> Will you > please check that I have not omitted steps (or completely mis-stated) ? :) > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#180?email_source=notifications&email_token=AKJFGGCMG7XD7RIUGVMMVGTPVYQPVA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTT4FA#issuecomment-493305364>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AKJFGGGJXAQUUJS7QK4IPR3PVYQPVANCNFSM4HNPNDWA> > . >

manodeep · 2019-05-17T04:39:32Z

Hey @boryanah - no worries at all. Unfortunately, I have a deadline right now but if I have not responded by Monday, please ping me again!

And thank you for catching the bug in the first place!

boryanah · 2019-05-17T04:40:29Z

Thanks a lot! Good luck!! I appreciate it!

…

On Fri, May 17, 2019 at 12:39 AM Manodeep Sinha ***@***.***> wrote: Hey @boryanah <https://github.com/boryanah> - no worries at all. Unfortunately, I have a deadline right now but if I have not responded by Monday, please ping me again! And thank you for catching the bug in the first place! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=AKJFGGHWFOITATYIGER3KCTPVYZILA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVTWWLI#issuecomment-493316909>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKJFGGCLQKNHCCC44CNYPFDPVYZILANCNFSM4HNPNDWA> .

…s. Also fixes weights reference leak. Closes #180 and #181.

manodeep · 2019-05-21T02:57:46Z

@boryanah Thanks for your patience. Would it be possible to add in some lines of code to demonstrate what you are doing?

boryanah · 2019-05-22T02:44:36Z

Hi, I will try to write it up soon, but I have switched to something else for the moment! Best, Boryana

…

On Mon, May 20, 2019 at 10:57 PM Manodeep Sinha ***@***.***> wrote: @boryanah <https://github.com/boryanah> Thanks for your patience. Would it be possible to add in some lines of code to demonstrate what you are doing? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#180?email_source=notifications&email_token=AKJFGGBM2D3QMQ6QLKRI7F3PWNQKVA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODV2TLWI#issuecomment-494220761>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKJFGGFCJU7JSTMNGEJHUU3PWNQKVANCNFSM4HNPNDWA> .

lgarrison added the bug label May 16, 2019

lgarrison added this to the v2.3.0 milestone May 16, 2019

lgarrison self-assigned this May 16, 2019

lgarrison added a commit that referenced this issue May 20, 2019

Fix automatic uniform weight arrays and broadcasting of scalar weight…

dca4265

…s. Also fixes weights reference leak. Closes #180 and #181.

manodeep closed this as completed in 91e40fc May 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Passing only one weights array for cross-correlations of unequal sizes #180

Passing only one weights array for cross-correlations of unequal sizes #180

lgarrison commented May 16, 2019 •

edited

Loading

manodeep commented May 16, 2019

lgarrison commented May 16, 2019 via email

manodeep commented May 16, 2019

lgarrison commented May 16, 2019 via email

boryanah commented May 16, 2019 via email

manodeep commented May 16, 2019

boryanah commented May 16, 2019 via email

manodeep commented May 17, 2019

manodeep commented May 17, 2019

lgarrison commented May 17, 2019 via email

boryanah commented May 17, 2019 via email

boryanah commented May 17, 2019 via email

manodeep commented May 17, 2019

boryanah commented May 17, 2019 via email

manodeep commented May 21, 2019

boryanah commented May 22, 2019 via email

Passing only one weights array for cross-correlations of unequal sizes #180

Passing only one weights array for cross-correlations of unequal sizes #180

Comments

lgarrison commented May 16, 2019 • edited Loading

General information

Issue description

Minimal failing example

manodeep commented May 16, 2019

lgarrison commented May 16, 2019 via email

manodeep commented May 16, 2019

lgarrison commented May 16, 2019 via email

boryanah commented May 16, 2019 via email

manodeep commented May 16, 2019

boryanah commented May 16, 2019 via email

manodeep commented May 17, 2019

manodeep commented May 17, 2019

lgarrison commented May 17, 2019 via email

boryanah commented May 17, 2019 via email

boryanah commented May 17, 2019 via email

manodeep commented May 17, 2019

boryanah commented May 17, 2019 via email

manodeep commented May 21, 2019

boryanah commented May 22, 2019 via email

lgarrison commented May 16, 2019 •

edited

Loading