Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support tuples as sample IDs in Bootstrapping #362

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ellabarkan
Copy link
Collaborator

@ellabarkan ellabarkan commented Aug 11, 2024

The following things were updated to support sample ids presented as tuples:

  1. Added static method "_convert_tuples" to support converting input list of tuples to np.array format.
  2. Updated the managing of bootstrapping indexes to support indexing using tuples

@ellabarkan ellabarkan requested a review from mosheraboh August 11, 2024 16:02
@@ -887,6 +887,23 @@ def reset(self) -> None:
self._metric.reset()
return super().reset()

@staticmethod
def _convert_tuples(ids: List[Tuple[str, int]]) -> np.ndarray:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you assuming tuple of str and int?
Can we do something more general?

Copy link
Collaborator Author

@ellabarkan ellabarkan Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx. The method of convert is supporting tuple of everything of any length, I need to change the interface definition

@ellabarkan ellabarkan requested a review from mosheraboh August 20, 2024 06:13

ids = np.array(ids, dtype=dtype_tuple)
ids = [tuple(x) for x in ids]
return ids
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you returning here a list of tuples or numpy array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When applying only line 903 the results numpy array includes tuples but they have a problem of being hashable and I was getting error in the line :
permutation = [original_ids_pos[sample_id] for sample_id in required_ids]
TypeError: unhashable type: 'writeable void-scalar'

The suggestion for the fix I found was to updated their types to tuple that is supported in the list and not in array. If I am trying to convert them after to np.array again the tuples are broken to separate elements and we having 2D array

@@ -920,7 +941,11 @@ def eval(
stratum_filter = stratum_id == stratum
n_stratum = sum(stratum_filter)
random_sample = rnd.randint(0, n_stratum, size=n_stratum)
sampled_ids[stratum_filter] = ids[stratum_filter][random_sample]

flt_indx = np.where(stratum_filter)[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You needed to move to "for loop implementation" even though you converted it ids to numpy array?

@sivanravidos
Copy link
Collaborator

@ellabarkan @mosheraboh
I have the same issue I think when calling GroupAnalysis or Filter, who uses the Collector
in line 410:
permutation = [original_ids_pos[sample_id] for sample_id in required_ids]

original_ids_pos has tuples and required_ids has ndarrays.

I added this hack just to use the code in the meantime
permutation = [original_ids_pos[(sample_id[0], int(sample_id[1]))] for sample_id in required_ids]

Can we add solving this too to this PR and also finalizing it?


Screenshot 2024-11-18 at 20 26 56

@sivanravidos
Copy link
Collaborator

@mosheraboh I checked and on this branch I get the same exception on line 410, this solution doesn't fix the get method

@sivanravidos
Copy link
Collaborator

sivanravidos commented Nov 26, 2024

@mosheraboh Ella and I discussed again, the problem is as follows:

  • we use Fuse wrapper that wraps dataset ids as tuples (dataset_name, id)
    (dataset_wrap_seq_to_dict.py line 104)
  • Tuple arrays are problematic for some operations, you can't apply boolean filters on them,
    this happens in several places, specificaly in CI, Filter and GroupAnalysis so we need to fix everywhere
  • see an alternative fix using list comprehension instead of filtering in branch sample_id_fix (I reverted my prev suggestion to this). This fixes GroupAnalysis only:
    https://github.com/BiomedSciAI/fuse-med-ml/compare/group_analyis_configuration..sample_id_fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants