Updating Schema reading procedures and refactoring #78

Merged: Hsankesara merged 34 commits into dev from updating_schema_inference on Jun 6, 2024

Conversation

Hsankesara force-pushed the updating_schema_inference branch from ca53ac0 to f2d8faa on January 23, 2024 09:52
Hsankesara marked this pull request as ready for review on January 24, 2024 14:12
Hsankesara requested a review from afolarin on January 24, 2024 14:12
Member

afolarin left a comment

LGTM

config.yaml Outdated
@@ -4,7 +4,7 @@ project:
   version: mock_version

 input:
-  data_type: local # couldbe mock, local, sftp, s3
+  data_type: mock # couldbe mock, local, sftp, s3
Member

Would this be better as data_source or source_type? data_type reads as describing the data itself, whereas I think this setting relates to where the data comes from.
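
If the key were renamed, existing configs could keep working for a while by accepting both names. A minimal sketch, assuming the input section has already been parsed into a dict; the helper `resolve_source_type` and the choice of `source_type` as the new key are hypothetical and not code from this PR:

```python
import warnings


def resolve_source_type(input_cfg: dict) -> str:
    """Return the configured source, preferring the proposed key name.

    Hypothetical helper: `source_type` is the name suggested in review,
    `data_type` is the key currently used in config.yaml.
    """
    if "source_type" in input_cfg:
        return input_cfg["source_type"]
    if "data_type" in input_cfg:
        # Old key still works, but nudges the user towards the new name.
        warnings.warn(
            "input.data_type is deprecated; use input.source_type",
            DeprecationWarning,
            stacklevel=2,
        )
        return input_cfg["data_type"]
    raise KeyError("input section needs a source_type entry")
```

Calling `resolve_source_type({"data_type": "mock"})` would return the old value while emitting a deprecation warning, so the rename would not break existing config files immediately.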

Member

SFTP (though rsync might be useful for a restart function) and S3 (implemented?) probably cover a lot of cases. I don't really want to support every method here, as we can't cover the long tail of the distribution. It should probably be the user's responsibility to expose the remote data through network mounts, local copies, etc.

Member Author

Agreed, I think SFTP and S3 would cover the majority of cases. Anything else would need to be sorted by the user. I haven't implemented S3 yet but it's on my TODO list.
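
One way to keep the supported set small and explicit is to validate the configured value up front. A minimal sketch, assuming the four values listed in the config.yaml comment; `SUPPORTED_SOURCES`, `IMPLEMENTED_SOURCES` and `validate_source` are hypothetical names, and treating only s3 as unimplemented is an assumption drawn from the comment above:

```python
# Values taken from the config.yaml comment: mock, local, sftp, s3.
SUPPORTED_SOURCES = ("mock", "local", "sftp", "s3")
# Assumption: s3 is planned but not implemented yet, per the discussion.
IMPLEMENTED_SOURCES = ("mock", "local", "sftp")


def validate_source(source: str) -> str:
    """Return `source` unchanged if it can be used, otherwise raise."""
    if source not in SUPPORTED_SOURCES:
        raise ValueError(
            f"Unsupported source {source!r}; expected one of {SUPPORTED_SOURCES}. "
            "Expose other remote data through a network mount or a local copy "
            "and use 'local'."
        )
    if source not in IMPLEMENTED_SOURCES:
        raise NotImplementedError(f"Source {source!r} is not implemented yet")
    return source
```

Failing early with a message that points to network mounts or local copies would put the long tail of transfer methods in the user's hands, as suggested in the thread.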

logger = logging.getLogger(__name__)


class CustomDataReader():
Member

What is the distinction here between ingestion and reader?

Member Author

Nothing. I think ingestion is the filename. I tried to keep the function names that are exposed to the user simple and straightforward, which is why I named it that way.

Hsankesara merged commit 4b6f56d into dev on Jun 6, 2024
3 checks passed
Hsankesara deleted the updating_schema_inference branch on June 6, 2024 10:51
2 participants