Customers often have multiple files to load. G2Loader allows the user to place the full list of files to be loaded in a single "project" file that looks like this:
cat demo/truth/project.json
{
"DATA_SOURCES": [
{"DATA_SOURCE": "CUSTOMERS", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set1-data.csv"},
{"DATA_SOURCE": "WATCHLIST", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set2-data.csv"}
]
}
The stream-producer should then pick up these files in order and load them onto the queue.
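A minimal sketch of how the stream-producer could consume such a project file, using the key names from the example above. The queue here is a plain `queue.Queue` standing in for whatever queue interface the real stream-producer uses:

```python
import json
from queue import Queue  # stand-in for the stream-producer's real queue

# Same shape as demo/truth/project.json above.
PROJECT_JSON = """
{
  "DATA_SOURCES": [
    {"DATA_SOURCE": "CUSTOMERS", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set1-data.csv"},
    {"DATA_SOURCE": "WATCHLIST", "FILE_FORMAT": "CSV", "FILE_NAME": "truthset-person-v1-set2-data.csv"}
  ]
}
"""

def enqueue_project(project_text, queue):
    """Parse a project file and enqueue each listed file, in order."""
    project = json.loads(project_text)
    for source in project["DATA_SOURCES"]:
        # Each entry carries the data source name, format, and file path.
        queue.put((source["DATA_SOURCE"], source["FILE_FORMAT"], source["FILE_NAME"]))

q = Queue()
enqueue_project(PROJECT_JSON, q)
first = q.get()
```

In practice the project text would come from `open("demo/truth/project.json").read()` rather than an inline string; the point is only that listed order is preserved on the queue.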
Ideally, wildcards should be allowed as well, like so:
{"DATA_SOURCE": "SAYARI", "FILE_FORMAT": "JSON", "FILE_NAME": "/sayari/mapped/*.json"},
Note: Sayari has hundreds of files to be loaded.
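Wildcard support could be as simple as expanding each FILE_NAME through the standard library's glob module before enqueuing. The helper name below is hypothetical; sorting keeps the expansion order deterministic:

```python
import glob

def expand_file_name(file_name):
    """Expand a FILE_NAME that may contain shell-style wildcards into a sorted list of paths."""
    if any(ch in file_name for ch in "*?["):
        # Wildcard present: return every matching file in a stable order.
        return sorted(glob.glob(file_name))
    # No wildcard: treat the value as a single literal path.
    return [file_name]
```

A FILE_NAME like "/sayari/mapped/*.json" would then fan out into one queue entry per matching file, all sharing the same DATA_SOURCE and FILE_FORMAT.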
There should be some validation of these files, such as:
1. Can it be opened and read?
2. Does it contain recognizable JSON or CSV?
3. Does the data source exist in the configuration? (future)
G2Loader does this currently: it validates the first 100 records of every file before it loads any of them, so you don't go through processing the first two files only to find out that the third one doesn't even exist or can't be opened.
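A sketch of that up-front check, under stated assumptions: the function name is hypothetical, and JSON files are assumed to be JSON Lines (one record per line), which may not match every source:

```python
import csv
import json

def validate_head(path, file_format, limit=100):
    """Check that a file opens and that its first `limit` records parse as the declared format.

    Returns (ok, message). Run this for every file in the project
    before loading any of them, so a bad third file fails fast.
    """
    try:
        with open(path, newline="") as f:
            if file_format.upper() == "JSON":
                # Assumption: one JSON document per line (JSON Lines).
                for i, line in enumerate(f):
                    if i >= limit:
                        break
                    json.loads(line)
            else:
                # CSV: let the csv module complain about malformed rows.
                reader = csv.reader(f)
                for i, _row in enumerate(reader):
                    if i >= limit:
                        break
    except (OSError, json.JSONDecodeError, csv.Error) as err:
        return False, str(err)
    return True, "ok"
```

Check 3 (does the data source exist in the configuration) would need a lookup against the Senzing configuration and is left out here.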
Future: G2Loader performs the three checks above plus the following:
4. Counts the mapped and unmapped attributes it finds.
5. Checks for and notates common mapping errors, like incomplete addresses.
6. Classifies findings as errors, warnings, and info, with recommendations and suggestions.
7. Publishes a report that can be exported to show to others.
You can see this testing analysis by typing the following in an sshd container. For a single file:
./G2Loader.py -T -f demo/truth/truthset-person-v1-set1-data.csv/?data_source=CUSTOMERS
Or for a project file:
./G2Loader.py -T -p demo/truth/project.json
That's it.
The only other solution is to keep using G2Loader.