Align fractional sequences per group conversion methods #1614

Open
victorlin opened this issue Aug 29, 2024 · 0 comments

victorlin commented Aug 29, 2024

Probabilistic sampling is necessary when targeting a fractional number of sequences per group. This is possible in two scenarios:

  1. Uniform sampling when the number of groups exceeds the total number of requested sequences

  2. Weighted sampling when this expression evaluates to a fractional number:

    $$ \frac{\text{weight}}{\sum \text{weights}} \times \text{total requested sequences} $$
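
As a rough illustration of how both scenarios produce a fractional target, here is a minimal sketch. The function name `fractional_targets` and the example weights are hypothetical and not taken from the codebase; only the formula above is from this issue.

```python
def fractional_targets(weights, total_requested):
    """Return weight / sum(weights) * total_requested for each group."""
    weight_sum = sum(weights.values())
    return {
        group: weight / weight_sum * total_requested
        for group, weight in weights.items()
    }


# Scenario 1: uniform sampling with more groups (12) than requested sequences (10).
# Every group gets the same fractional target.
uniform_target = 10 / 12

# Scenario 2: weighted sampling with hypothetical weights.
# The targets differ per group and are generally not whole numbers.
example_weights = {"A": 1.0, "B": 2.0, "C": 4.0}
weighted_targets = fractional_targets(example_weights, total_requested=10)

print(uniform_target)
print(weighted_targets)
```

In the weighted case the targets still sum to the total requested, but each group has its own fractional value.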

Each scenario implements its own way of converting the fractional number to a whole number that can be used for sampling.

  1. Poisson sampling method: sample from a Poisson distribution whose mean $\lambda$ is the fractional number of sequences per group, which is constant across all groups.

    max_sizes_per_group[group] = random_generator.poisson(target_group_size)

  2. Probabilistic rounding method: round the number probabilistically by adding a random number drawn uniformly from [0, 1) and truncating the decimal part, so a value rounds up with probability equal to its fractional part. The Poisson sampling method does not work for weighted sampling because the fractional number of sequences per group is not guaranteed to be constant across all groups.

    weights[TARGET_SIZE_COLUMN] = (weights[TARGET_SIZE_COLUMN].add(pd.Series(rng.random(len(weights))))).astype(int)
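
To compare the two conversions side by side, here is a minimal NumPy sketch. The helper `probabilistic_round`, the seed, and the sample size are assumptions made for illustration; this is not the project's actual code path.

```python
import numpy as np


def probabilistic_round(targets, rng):
    """Add a uniform draw from [0, 1) to each target and truncate.

    Assumes non-negative targets, where truncation equals floor.
    """
    targets = np.atleast_1d(np.asarray(targets, dtype=float))
    return (targets + rng.random(targets.shape)).astype(int)


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Many repeated draws for one fractional target, to compare the distributions.
targets = np.full(100_000, 10 / 12)
poisson_draws = rng.poisson(targets)
rounded_draws = probabilistic_round(targets, rng)

print("Poisson values seen: ", np.unique(poisson_draws))
print("Rounding values seen:", np.unique(rounded_draws))
print("Poisson mean: ", poisson_draws.mean())
print("Rounding mean:", rounded_draws.mean())

# Rounding also handles per-group targets that differ, as in weighted sampling.
print(probabilistic_round([1.43, 2.86, 5.71], rng))
```

Both conversions have the target as their mean, but rounding only ever returns the whole number just below or just above the target, while Poisson draws can land further away.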

Proposed change

Replace the Poisson sampling method with the probabilistic rounding method. A notable difference: with the Poisson sampling method, there is some chance of drawing 2 or more sequences for a group. That would not happen with the probabilistic rounding method, because a target below 1 can only round to 0 or 1. I think that is fine and even preferred because it avoids the possibility of under-sampling.

This can be explained through an example: `--group-by month` with 12 months and `--subsample-max-sequences 10`. This means the fractional number of sequences per group should be¹ $\frac{10}{12} \approx 0.83$. For each of the 12 months we need a whole number of sequences per group.

Probabilistic rounding would be: 0.83 has an 83% chance of rounding to 1 and a 17% chance of rounding to 0.
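
For comparison, the Poisson side of the same example can be worked out from the standard Poisson probability mass function. The sketch below is my own arithmetic rather than something stated in this issue; it assumes $\lambda = \frac{10}{12}$ and treats the 12 months as independent draws.

```python
import math


def poisson_pmf(k, lam):
    """Probability of drawing exactly k from a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 10 / 12
n_groups = 12

p_zero = poisson_pmf(0, lam)           # roughly 0.43
p_one = poisson_pmf(1, lam)            # roughly 0.36
p_two_or_more = 1 - p_zero - p_one     # roughly 0.20

# Both methods give an expected total of 10 sequences across the 12 months,
# but the spread of that total differs.
poisson_total_variance = n_groups * lam                # Poisson variance equals its mean
rounding_total_variance = n_groups * lam * (1 - lam)   # each group is a 0/1 draw

print(p_zero, p_one, p_two_or_more)
print(poisson_total_variance, rounding_total_variance)
```

Under these assumptions a month ends up with 0 sequences about 43% of the time with Poisson sampling versus 17% with probabilistic rounding, and the total number of sequences varies much less with rounding.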

@victorlin victorlin self-assigned this Aug 29, 2024
@victorlin victorlin added the proposal Proposals that warrant further discussion label Aug 29, 2024