Updating Schema reading procedures and refactoring #78

Hsankesara · 2024-01-16T16:27:32Z

Added custom data reading module that can be used independently

afolarin

LGTM

afolarin · 2024-05-17T09:24:03Z

config.yaml

@@ -4,7 +4,7 @@ project:
    version: mock_version

 input:
-    data_type: local # couldbe mock, local, sftp, s3
+    data_type: mock # couldbe mock, local, sftp, s3


Would this be better as data_source or source_type? data_type is more specific to the data itself, whereas this, I think, relates more to the source of the data.
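For illustration, a minimal sketch of how a config loader might accept the proposed source_type key while still tolerating the older data_type key. The helper name resolve_source_type and the fallback behaviour are assumptions and are not taken from radarpipeline; only the four allowed values come from the comment in the config.yaml diff above.

```python
# Hypothetical helper; not part of radarpipeline.
ALLOWED_SOURCES = {"mock", "local", "sftp", "s3"}  # values listed in the config.yaml comment


def resolve_source_type(input_cfg: dict) -> str:
    """Return the configured source, preferring the new key name."""
    # Assumption: the old `data_type` key is still honoured as a fallback.
    source = input_cfg.get("source_type", input_cfg.get("data_type"))
    if source is None:
        raise KeyError("input section needs a 'source_type' entry")
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unsupported source_type: {source!r}")
    return source
```

Whether a fallback is worth keeping depends on how many existing configs still use the old key; dropping it would make the rename a breaking change.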

afolarin · 2024-05-17T10:03:19Z

radarpipeline/io/downloader.py

SFTP (though rsync might be useful for a restart function) and S3 (implemented?) probably cover a lot of cases. I don't really want to support every method here, as we can't support the long tail of the distribution. It should probably be the user's responsibility to expose the remote data through network mounts, local copies, etc.

Agreed, I think SFTP and S3 would cover the majority of cases. Anything else would need to be sorted by the user. I haven't implemented S3 yet but it's on my TODO list.
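The scope agreed above (a small fixed set of remote sources, with everything else left to the user) could be sketched as a dispatch table. All names below are hypothetical and do not come from radarpipeline/io/downloader.py; the S3 entry is a placeholder because, per this thread, S3 is not implemented yet.

```python
from typing import Callable, Dict

# Hypothetical sketch; none of these names come from radarpipeline.


def fetch_sftp(target_dir: str) -> str:
    # Placeholder: a real version would copy the remote files into target_dir.
    return target_dir


def fetch_s3(target_dir: str) -> str:
    # The thread notes S3 is not implemented yet, so fail loudly.
    raise NotImplementedError("S3 download is not implemented yet")


DOWNLOADERS: Dict[str, Callable[[str], str]] = {
    "sftp": fetch_sftp,
    "s3": fetch_s3,
}


def prepare_input(source_type: str, path: str) -> str:
    """Return a local directory that the readers can consume."""
    if source_type in ("local", "mock"):
        # Already on disk, or on a network mount the user has set up.
        return path
    try:
        downloader = DOWNLOADERS[source_type]
    except KeyError:
        raise ValueError(
            f"{source_type!r} is not supported; mount or copy the data locally"
        ) from None
    return downloader(path)
```

Keeping the table small makes the unsupported case explicit: any other transport is handled outside the pipeline and then read as a local source.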

afolarin · 2024-05-17T11:13:47Z

radarpipeline/io/ingestion.py

+logger = logging.getLogger(__name__)
+
+
+class CustomDataReader():


What is the distinction here between ingestion and reader?

Nothing; ingestion is just the filename. I tried to keep the names exposed to the user simple and straightforward, which is why I named it that way.

Hsankesara added 16 commits January 10, 2024 15:25

Updated data reading procedure

faef729

refactoring

5ae2bae

refactor to make all modules more spark dependent

fba2d75

Begin refactoring pipeline

6f21791

refactor Spark Reader mechanism

a8b0582

minor refactor in reader.py

26fc86c

Added and updated tests + upgraded spark version to 3.5.0

25dd237

resolved linting errors

979ae94

minor refactoring of the data preprocessing function

c0ebb67

minor error correction

85d4a1c

updated setup.py

e443cfc

updated paramiko dependency

38e8612

updated modules version due to security vulnerabilities.

d10bbfb

updated pandas and numpy versions

d762039

updated test expected output files

9cedc21

updated tests

f2d8faa

Hsankesara force-pushed the updating_schema_inference branch from ca53ac0 to f2d8faa Compare January 23, 2024 09:52

solved test issue caused due of timezone setting in spark

a50a70e

Hsankesara marked this pull request as ready for review January 24, 2024 14:12

Hsankesara requested a review from afolarin January 24, 2024 14:12

Hsankesara added 10 commits January 25, 2024 16:49

resolved error caused when data for an user is empty

111de11

Added custom data reading module that can be used independently

2d884cd

added more possible time columns to preprocess_time_data

68eabe3

Added user sampling mechanism in radarpipeline

857ea89

Added data sampling as well

ffd69ad

minor code changes

bf47487

Added data_sampline methods by: time, count & fraction

b4bc9d2

minor import changes

9659798

Merge pull request #80 from RADAR-base/simplifying_data_reader

aa82e88

Added custom data reading module that can be used independently

Added feature to access spark-cluster

c52281c

Hsankesara added 3 commits April 30, 2024 10:37

Updated tests

77bd070

added tests to test different sampling configs

a542b55

added pipeline tests for samplings

6fc67c4

afolarin approved these changes May 17, 2024

Hsankesara added 4 commits June 3, 2024 13:20

changed data_type to source_type in config.yaml

8636032

Added multiple time ranges compatibility with data sampling

4ce5b82

Added tests for multiple time ranges

0351724

Merge pull request #81 from RADAR-base/sampling_architecture

d420b75

Sampling architecture

Hsankesara merged commit 4b6f56d into dev Jun 6, 2024
3 checks passed

Hsankesara deleted the updating_schema_inference branch June 6, 2024 10:51
