-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating Schema reading procedures and refactoring #78
Conversation
Hsankesara
commented
Jan 16, 2024
•
edited
Loading
edited
- Updated data reading procedure
- refactor to make all modules more spark dependent
- Added and updated tests + upgraded spark version to 3.5.0
ca53ac0
to
f2d8faa
Compare
Added custom data reading module that can be used independently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
config.yaml
Outdated
@@ -4,7 +4,7 @@ project: | |||
version: mock_version | |||
|
|||
input: | |||
data_type: local # couldbe mock, local, sftp, s3 | |||
data_type: mock # couldbe mock, local, sftp, s3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this be better as data_source
or source_type
data type is more specific to the data, this I think relates more to the source of the data
radarpipeline/io/downloader.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While SFTP (though rsync might be useful for restart function) and S3 (implemented?) probably cover a lot of cases, I don't want to really support every method here as we can't support the long tail of the distribution. It should probably be the user's responsibility to provide a way to expose the remote data with network mounts, local copies, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think SFTP and S3 would cover the majority of cases. Anything else would need to be sorted by the user. I haven't implemented S3 yet but it's on my TODO list.
logger = logging.getLogger(__name__) | ||
|
||
|
||
class CustomDataReader(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the distinction here between ingestion and reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing, I think ingestion is the filename. I tried to keep the function names which are exposed to the user simple and straight. That's why I named it that way.