Please add support for a project file #82

Open
jbutcher21 opened this issue Sep 13, 2021 · 0 comments

Customers often have multiple files to load. G2Loader allows the user to list all of the files to be loaded in a single "project" file that looks like this:

cat demo/truth/project.json
{
"DATA_SOURCES": [
{"DATA_SOURCE": "CUSTOMERS", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set1-data.csv"},
{"DATA_SOURCE": "WATCHLIST", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set2-data.csv"}
]
}

The stream-producer should then pick up these files in the order listed and load them onto the queue.
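As a rough sketch of what reading the project file in order could look like (this is not stream-producer code; `read_project` and `ProjectEntry` are made-up names, and resolving relative file names against the project file's directory is an assumption):

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class ProjectEntry(NamedTuple):
    data_source: str
    file_format: str
    file_name: str


def read_project(project_path: str) -> Iterator[ProjectEntry]:
    """Yield the entries of a project file in the order they are listed."""
    project_file = Path(project_path)
    with project_file.open(encoding="utf-8") as handle:
        project = json.load(handle)
    for entry in project["DATA_SOURCES"]:
        # Assumption: a relative FILE_NAME is relative to the directory
        # that contains the project file.
        file_name = Path(entry["FILE_NAME"])
        if not file_name.is_absolute():
            file_name = project_file.parent / file_name
        yield ProjectEntry(
            entry["DATA_SOURCE"],
            entry["FILE_FORMAT"].upper(),
            str(file_name),
        )
```

Because JSON arrays keep their order, iterating over `DATA_SOURCES` gives the files in the order the user wrote them.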

Ideally, wildcards should be allowed as well, like so:
{"DATA_SOURCE": "SAYARI", "FILE_FORMAT": "JSON", "FILE_NAME": "/sayari/mapped/*.json"},
Note: Sayari has hundreds of files to be loaded.
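Wildcard expansion could be sketched with Python's standard `glob` module. `expand_entry` is a hypothetical helper, and sorting the matches is an assumption made only to keep the load order predictable:

```python
import glob
from typing import Dict, List


def expand_entry(entry: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one project entry into one entry per matching file."""
    pattern = entry["FILE_NAME"]
    if not any(ch in pattern for ch in "*?["):
        # No wildcard: keep the entry as-is so that a missing file is
        # reported by validation instead of silently disappearing.
        return [dict(entry)]
    # Assumption: matches are loaded in sorted order.
    return [{**entry, "FILE_NAME": path} for path in sorted(glob.glob(pattern))]
```

Each expanded entry keeps the original `DATA_SOURCE` and `FILE_FORMAT`, so the rest of the loader would not need to know a wildcard was used.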

There should be some validation of these files, such as:

  1. Can the file be opened and read?
  2. Does it contain recognizable JSON or CSV?
  3. Does the data source exist in the configuration? (future)
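Checks 1 and 2 could look roughly like the sketch below. `check_file` is a hypothetical name, and the code assumes UTF-8 text and, for JSON, one object per line. Check 3 is left out because it needs access to the configuration.

```python
import csv
import json
from typing import Optional


def check_file(file_name: str, file_format: str) -> Optional[str]:
    """Return None if the file looks loadable, else a short error message."""
    fmt = file_format.upper()
    if fmt not in ("CSV", "JSON"):
        return f"unsupported FILE_FORMAT {file_format!r}"
    # Check 1: can it be opened and read?
    try:
        with open(file_name, encoding="utf-8", newline="") as handle:
            first_line = handle.readline()
    except OSError as err:
        return f"cannot be opened: {err}"
    except UnicodeDecodeError:
        return "cannot be read as UTF-8 text"
    if not first_line.strip():
        return "file is empty or starts with a blank line"
    # Check 2: does it contain recognizable JSON or CSV?
    if fmt == "JSON":
        try:
            record = json.loads(first_line)
        except ValueError:
            return "not recognizable as JSON"
        if not isinstance(record, dict):
            return "not recognizable as JSON"
    else:
        header = next(csv.reader([first_line]))
        if any(not name.strip() for name in header):
            return "not recognizable as CSV"
    return None
```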

G2Loader does this currently: it validates the first 100 records of every file before it loads any, so that you don't process the first two files only to find out that the third one doesn't exist or can't be opened.
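The "check everything first, then load" behavior could be sketched as follows. This is not G2Loader's actual code: `sample_records` and `preflight` are made-up names, and JSON files are assumed to hold one object per line.

```python
import csv
import json
from itertools import islice
from typing import Dict, Iterable, List

SAMPLE_SIZE = 100  # the first 100 records, as described above


def sample_records(file_name: str, file_format: str, limit: int = SAMPLE_SIZE) -> int:
    """Parse up to `limit` records and return how many were read."""
    with open(file_name, encoding="utf-8", newline="") as handle:
        if file_format.upper() == "CSV":
            records = csv.DictReader(handle)
        elif file_format.upper() == "JSON":
            # Assumption: one JSON object per line.
            records = (json.loads(line) for line in handle if line.strip())
        else:
            raise ValueError(f"unsupported FILE_FORMAT {file_format!r}")
        return sum(1 for _ in islice(records, limit))


def preflight(entries: Iterable[Dict[str, str]]) -> List[str]:
    """Check every file up front and return all problems found."""
    problems = []
    for entry in entries:
        try:
            if sample_records(entry["FILE_NAME"], entry["FILE_FORMAT"]) == 0:
                problems.append(f"{entry['FILE_NAME']}: no records found")
        except (OSError, ValueError, csv.Error) as err:
            problems.append(f"{entry['FILE_NAME']}: {err}")
    return problems
```

Loading would start only if `preflight` returns an empty list, so a missing third file is reported before the first record is queued.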

Future: G2Loader does the three checks above plus the following:
4. Counts the mapped and unmapped attributes it finds.
5. Checks for and notes common mapping errors, such as incomplete addresses.
6. Reports a set of errors, warnings, and info messages, along with recommendations and suggestions.
7. Publishes a report that can be exported to show to others.
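Item 4 could be approximated as below. This is only an illustration: `count_attributes` is a hypothetical helper, and the set of attribute names that count as "mapped" is passed in by the caller rather than taken from the real configuration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def count_attributes(
    records: Iterable[Dict[str, object]],
    known_attributes: Set[str],
) -> Tuple[Counter, Counter]:
    """Count how often each mapped and unmapped attribute has a value."""
    mapped, unmapped = Counter(), Counter()
    for record in records:
        for name, value in record.items():
            if value is None or value == "":
                continue  # empty values are not counted
            # Assumption: attribute names are compared case-insensitively.
            target = mapped if name.upper() in known_attributes else unmapped
            target[name.upper()] += 1
    return mapped, unmapped
```

Feeding it the sampled records from each file would give per-file counts that a report could then summarize.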

You can see this testing analysis by typing the following in an sshd container for a single file ...
./G2Loader.py -T -f demo/truth/truthset-person-v1-set1-data.csv/?data_source=CUSTOMERS

or for a project file ...
./G2Loader.py -T -p demo/truth/project.json

That's it.

The only other solution is to keep using G2Loader.

github-actions bot added the triage label on Sep 13, 2021
jamietypovsky removed the triage label on Sep 16, 2021