Nucleus Requirements: Questions and Answers
- Who are the actors of the system (both human users and external systems)? (Data Engineer, Data Consumer, Data Provider, Search Service?) Engineering Node and Discipline Node Data Engineers (or Operations Engineers, depending on the context, but we can group those together). These "Data Engineers" are responsible for the end-to-end data flow and management of the archive: receiving the data from the data provider, validating the received data, ingesting the data into the Registry, releasing the data to the public, and pushing the data into the PDS deep archive at NSSDCA. A sketch of this flow follows below.
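As an illustration only (not part of the requirements), the end-to-end flow described above could be modeled as a linear pipeline of stages. The stage names and function signatures below are hypothetical; they simply mirror the steps listed in the answer.

```python
# Hypothetical sketch of the end-to-end archive flow described above.
# Stage names mirror the answer; none of these functions exist in Nucleus.

def receive(delivery_dir: str) -> list[str]:
    """Pick up a delivery from the data provider's staging area."""
    ...

def validate(products: list[str]) -> list[str]:
    """Run validation on the received products; return the valid ones."""
    ...

def ingest(products: list[str]) -> None:
    """Load validated products into the Registry."""
    ...

def release(products: list[str]) -> None:
    """Make the ingested products publicly available."""
    ...

def deep_archive(products: list[str]) -> None:
    """Push the released products to the PDS deep archive at NSSDCA."""
    ...

def run_pipeline(delivery_dir: str) -> None:
    products = receive(delivery_dir)
    valid = validate(products)
    ingest(valid)
    release(valid)
    deep_archive(valid)
```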
- Are these ETL processes real-time or batch ETL? Nucleus should support both. Ideally, the tools and services utilizing this pipeline would be robust enough to be run in parallel; however, we cannot make that assumption in the event we plug in legacy tools and services.
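A minimal sketch of how a plug-in contract might express both modes, assuming a hypothetical `PipelinePlugin` interface (not an existing Nucleus API). The `parallel_safe` flag reflects the caveat above that legacy tools may not be safe to run concurrently.

```python
from abc import ABC, abstractmethod
from typing import Iterable

class PipelinePlugin(ABC):
    """Hypothetical plug-in contract supporting both batch and (near) real-time use."""

    # Legacy tools that cannot run concurrently would set this to False,
    # and the orchestrator would then serialize their executions.
    parallel_safe: bool = True

    @abstractmethod
    def process_batch(self, items: list[str]) -> None:
        """Process a fixed batch of items (batch ETL)."""

    def process_one(self, item: str) -> None:
        """Process a single item as it arrives (real-time ETL).

        Default: wrap the item in a batch of one, so a batch-only
        legacy tool can still participate in a real-time pipeline.
        """
        self.process_batch([item])

    def process_stream(self, items: Iterable[str]) -> None:
        """Consume an ongoing stream by delegating to process_one."""
        for item in items:
            self.process_one(item)
```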
- If these ETL jobs are batches, what is the definition/scope of a batch? A batch should be defined within the particular plug-in module. For instance, for our current Validate, we found it worked best in batches of roughly 100. But other EN or DN components may require the entire data set as a single batch in order to process the data.
NOTE: I would consider real-time vs. batch processing a should-have or could-have requirement, not a must-have. If Nucleus requires plug-in components to be able to support real-time processing, then so be it.
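For example (hypothetical configuration, not an existing Nucleus format), the batch scope could be declared per plug-in, with the roughly 100-product batches that worked well for Validate and a whole-dataset batch for a component that needs everything at once. The plug-in name `dn_tool_x` is a placeholder.

```python
# Hypothetical per-plugin batching configuration (illustration only).
BATCHING = {
    "validate":  {"mode": "fixed", "batch_size": 100},  # ~100 products per batch worked well
    "dn_tool_x": {"mode": "whole_dataset"},             # needs the entire data set at once
}

def make_batches(plugin: str, items: list[str]) -> list[list[str]]:
    """Split the work for a plug-in according to its declared batch scope."""
    cfg = BATCHING[plugin]
    if cfg["mode"] == "whole_dataset":
        return [items]
    size = cfg["batch_size"]
    return [items[i:i + size] for i in range(0, len(items), size)]
```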
- It seems the Directory Listener software can trigger data pipelines. How can the Directory Listener software know when to trigger a pipeline? Not sure what this software is.
- How can the Directory Listener software know that data is staged by the Instrument Teams in the Mission Staging Area?
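Purely as an illustration of how such a listener could work (the software itself is not identified above), the sketch below polls a hypothetical staging directory and triggers a pipeline only when a delivery is complete, signaled here by an assumed "done" marker file written last by the data provider. The path and marker name are placeholders.

```python
import time
from pathlib import Path

STAGING_AREA = Path("/mission/staging")  # hypothetical Mission Staging Area path
DONE_MARKER = "delivery.complete"        # assumed marker written last by the data provider

def poll_staging_area(trigger_pipeline, interval_s: int = 60) -> None:
    """Trigger a pipeline for each delivery directory that contains a 'done' marker."""
    seen: set[Path] = set()
    while True:
        for marker in STAGING_AREA.glob(f"*/{DONE_MARKER}"):
            delivery = marker.parent
            if delivery not in seen:
                seen.add(delivery)
                trigger_pipeline(delivery)  # e.g. submit a pipeline run for this delivery
        time.sleep(interval_s)
```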
- Other than the above trigger by the Directory Listener software, are there any other ways that these pipelines are triggered? Are those triggers manual or automatic?
- If automatically triggered, based on which events?
- Are there any scheduled pipelines?
- Is there any preferred way to communicate data between components? E.g.:
  - Through events
  - RESTful API calls
  - Through a file system
- Or is it up to the software design to select the suitable data communication method as required? (One pattern combining these options is sketched below.)
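One common pattern that combines the options above, shown only as an illustration: components exchange small event messages (or REST payloads) that carry references to data on a shared file system, rather than the data itself. The function and field names below are hypothetical.

```python
import json

def make_handoff_event(pipeline_run: str, product_paths: list[str]) -> str:
    """Build a small message that points at data on a shared file system.

    The same payload could be published as an event (e.g. to Kafka) or
    POSTed to a RESTful endpoint; the bulk data stays on the file system.
    """
    return json.dumps({
        "pipeline_run": pipeline_run,
        "products": product_paths,  # file paths/URIs, not file contents
    })
```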
- What is the usual size of one unit of data to be processed?
- What is the maximum data volume of a unit to be processed?
- What is the average data volume of a unit to be processed?
- What is the maximum number of records to be processed per day?
- What is the expected availability of the data pipelines? 24/7?
In AIMS_Briefing_Work_Remaining.pptx, the user is expected to select one of the following processing paths:
* On-premise
* Cloud (ad-hoc)
* Cloud (pre-uploaded backups)
- Who or what process will make this selection?
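If that selection becomes a pipeline input, it could be captured as simply as an enumerated parameter; the enum and entry point below are hypothetical and only mirror the three paths listed in the briefing.

```python
from enum import Enum

class ProcessingPath(Enum):
    """The three processing paths listed in AIMS_Briefing_Work_Remaining.pptx."""
    ON_PREMISE = "on-premise"
    CLOUD_AD_HOC = "cloud-ad-hoc"
    CLOUD_PRE_UPLOADED = "cloud-pre-uploaded-backups"

def start_run(delivery: str, path: ProcessingPath = ProcessingPath.ON_PREMISE) -> None:
    """Hypothetical entry point where the selection (by a user or a rule) is supplied."""
    ...
```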
If Nucleus is going to be a generic data pipeline system:
- What are the sources to be supported from the source types below? (A combined sketch of source and destination adapters follows these lists.)
  - Files: then what are the file types to be supported?
  - Records in databases: then what are the databases to be supported?
  - Data retrieved from RESTful APIs
  - Data received as events (e.g., Kafka events)
- What are the destinations to be supported from the destination types below?
  - Files: then what are the file types to be supported?
  - Records in databases: then what are the databases to be supported?
  - Data posted to RESTful APIs
  - Data published as events (e.g., Kafka events)
  - Elasticsearch
- Any preferred programming languages for scripting?
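To make the question concrete, here is a hedged sketch of what generic source and destination adapters for the listed types could look like. The interfaces and class names are assumptions, not an existing Nucleus API, and Python is used only because it is the language of the other sketches, not as an answer to the scripting-language question.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Source(ABC):
    """A pluggable data source: files, database records, REST responses, or events."""
    @abstractmethod
    def read(self) -> Iterable[Any]:
        """Yield units of data to be processed."""

class Destination(ABC):
    """A pluggable destination: files, databases, REST endpoints, events, Elasticsearch."""
    @abstractmethod
    def write(self, records: Iterable[Any]) -> None:
        """Persist or publish processed units of data."""

# Example concrete adapters (hypothetical):
class FileSource(Source):
    def __init__(self, paths: list[str]) -> None:
        self.paths = paths

    def read(self) -> Iterable[str]:
        for path in self.paths:
            with open(path, encoding="utf-8") as f:
                yield f.read()

class RestDestination(Destination):
    def __init__(self, url: str) -> None:
        self.url = url

    def write(self, records: Iterable[Any]) -> None:
        import requests  # assumes the 'requests' package is available
        for record in records:
            requests.post(self.url, json=record, timeout=30)
```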
- What are the expected ways to add pluggable components to a data pipeline? (A configuration-file example is sketched below.)
  - Through a user interface?
  - Through configuration files?
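If the configuration-file route were chosen, a pipeline definition could be as simple as an ordered list of plug-in names and settings. The structure below is a hypothetical example (shown as the equivalent Python structure), not a defined Nucleus format, and the URL is a placeholder.

```python
# Hypothetical pipeline definition loaded from a configuration file.
PIPELINE_CONFIG = {
    "name": "archive-delivery",
    "steps": [
        {"plugin": "receive",      "params": {"staging_area": "/mission/staging"}},
        {"plugin": "validate",     "params": {"batch_size": 100}},
        {"plugin": "ingest",       "params": {"registry_url": "https://registry.example"}},
        {"plugin": "deep_archive", "params": {}},
    ],
}

def build_pipeline(config: dict, registry: dict) -> list:
    """Resolve each configured step to a registered plug-in callable plus its params."""
    return [(registry[step["plugin"]], step["params"]) for step in config["steps"]]
```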
- Should there be any user roles and permissions assigned for users/executors of the pipelines?
- If so, where do we maintain these user-role-permission mappings?
- Should there be an authentication and authorization process?
- What are the backup requirements?
- What are the data purging requirements?
- Are there any restrictions on selecting implementation technologies based on software licenses (open source vs. commercial licenses)?
- What deployment setups are to be supported (on-premise, cloud, etc.)?
- What are the logging and auditing requirements?
- Should we send any alerts based on the completion status of data pipelines?
- If so, which mechanisms should be used to send these alerts (e.g., email, text messages)?
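If email were chosen, a minimal completion alert could look like the sketch below; the SMTP host, sender address, and recipients are placeholders, and the hook would be called with the final pipeline status.

```python
import smtplib
from email.message import EmailMessage

def send_completion_alert(pipeline: str, status: str, recipients: list[str]) -> None:
    """Email a pipeline completion/failure alert (illustrative placeholders only)."""
    msg = EmailMessage()
    msg["Subject"] = f"[Nucleus] Pipeline {pipeline} finished with status: {status}"
    msg["From"] = "nucleus-alerts@example.org"      # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"Pipeline '{pipeline}' completed with status '{status}'.")
    with smtplib.SMTP("smtp.example.org") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)
```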