Sampling architecture #81

Hsankesara · 2024-02-14T13:06:48Z

Added user and data sampling mechanisms. The user sampling mechanism can select users by fraction, by count, or by ID. The data sampling mechanisms can select data within time ranges, by count, or by fraction.
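As a rough illustration of the three user sampling options described above, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `sample_users`, its signature, the `seed` parameter, and the rounding behaviour for fractions are all assumptions made for the example. Only the method name `userid` and the `count` and `userids` keys mirror the `config.yaml` excerpt in the review below.

```python
import random
from typing import List, Optional, Sequence


def sample_users(
    user_ids: Sequence[str],
    method: str,
    *,
    fraction: Optional[float] = None,
    count: Optional[int] = None,
    userids: Optional[Sequence[str]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: pick a subset of users by fraction, count, or ID."""
    rng = random.Random(seed)  # seeded so that a sample can be reproduced
    if method == "fraction":
        if fraction is None or not 0 <= fraction <= 1:
            raise ValueError("fraction must be between 0 and 1")
        # Assumption: the number of users is rounded to the nearest integer.
        k = round(len(user_ids) * fraction)
        return rng.sample(list(user_ids), k)
    if method == "count":
        if count is None or count < 0:
            raise ValueError("count must be a non-negative integer")
        # Assumption: asking for more users than exist returns all of them.
        k = min(count, len(user_ids))
        return rng.sample(list(user_ids), k)
    if method == "userid":
        # Keep only the requested IDs that are actually present,
        # preserving the order of the input list.
        wanted = set(userids or [])
        return [u for u in user_ids if u in wanted]
    raise ValueError(f"unknown sampling method: {method}")
```

For example, `sample_users(["a", "b", "c", "d"], "userid", userids=["c", "a"])` keeps only the listed IDs that are present. In the real pipeline the same decision would presumably be applied before any data is read, so that unselected users are never loaded.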

afolarin

LGTM. Consider making the suggested changes either now or in the next iteration.

afolarin · 2024-05-17T12:55:05Z

config.yaml

+    ## TODO: For future
+    #data_sampling:
+        ## Possible methods:  time, count, fraction
+        ## starttime and endtime format is dd-mm-yyyy hh:mm:ss in UTC timezone


Would it be useful to be able to specify an array of ranges? That way, if you wanted a single range you could specify just that, and if you wanted a sequence of ranges, that could also be provided.

Good point, will add that.
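One possible shape for the array-of-ranges idea, sketched in plain Python under stated assumptions. The `starttime` and `endtime` keys and the `dd-mm-yyyy hh:mm:ss` UTC format come from the config comment above. The helper names (`parse_utc`, `normalise_ranges`, `filter_by_time`), the `time` row key, and the choice of inclusive bounds are hypothetical and may differ from what the PR implements.

```python
from collections.abc import Mapping
from datetime import datetime, timezone

TIME_FORMAT = "%d-%m-%Y %H:%M:%S"  # dd-mm-yyyy hh:mm:ss, interpreted as UTC


def parse_utc(text):
    """Parse a config timestamp and attach the UTC timezone."""
    return datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)


def normalise_ranges(spec):
    """Accept a single range mapping or a list of them.

    Returns a list of (start, end) datetime pairs, so that callers
    never need to care which of the two forms the config used.
    """
    ranges = [spec] if isinstance(spec, Mapping) else list(spec)
    parsed = []
    for item in ranges:
        start = parse_utc(item["starttime"])
        end = parse_utc(item["endtime"])
        if start > end:
            raise ValueError(f"starttime is after endtime: {item}")
        parsed.append((start, end))
    return parsed


def filter_by_time(rows, spec, time_key="time"):
    """Keep rows whose timestamp falls inside any range (bounds inclusive)."""
    ranges = normalise_ranges(spec)
    return [
        row
        for row in rows
        if any(start <= row[time_key] <= end for start, end in ranges)
    ]
```

Normalising a single range into a one-element list keeps the existing single-range config valid while letting a sequence of ranges use the same code path.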

afolarin · 2024-05-17T13:49:59Z

config.yaml

+        #    count: 2
+        #method: userid
+        #config:
+        #    userids:


Do we want to standardise on subjectID or userID? For historical reasons, SubjectID is the name we chose for the main ID we use on the platform (with UserID potentially being introduced later when we have the self-enrollment portal). It may not cause too much confusion, as this is more on the analysis side, but I'd point it out, as it is probably sensible to be consistent.

I could change it to subjectId. I used userid because the ID column in the output files is key.userID.

Hsankesara · 2024-06-04T14:31:02Z

@afolarin added the code for multiple time ranges. Please let me know if that looks good to you.

afolarin

LGTM

Hsankesara added 11 commits February 8, 2024 16:12

Added custom data reading module that can be used independently

2d884cd

added more possible time columns to preprocess_time_data

68eabe3

Added user sampling mechanism in radarpipeline

857ea89

Added data sampling as well

ffd69ad

minor code changes

bf47487

Added data_sampline methods by: time, count & fraction

b4bc9d2

minor import changes

9659798

Added feature to access spark-cluster

c52281c

Updated tests

77bd070

added tests to test different sampling configs

a542b55

added pipeline tests for samplings

6fc67c4

Hsankesara requested a review from afolarin May 1, 2024 10:38

afolarin approved these changes May 20, 2024

Hsankesara added 3 commits June 3, 2024 13:20

changed data_type to source_type in config.yaml

8636032

Added multiple time ranges compatibility with data sampling

4ce5b82

Added tests for multiple time ranges

0351724

Hsankesara requested a review from afolarin June 4, 2024 14:30

afolarin approved these changes Jun 5, 2024

Hsankesara merged commit d420b75 into updating_schema_inference Jun 6, 2024

Hsankesara deleted the sampling_architecture branch June 6, 2024 09:27
