Nucleus Requirements: Questions and Answers
- Who are the actors of the system (both human users and external systems)? (Data Engineer, Data Consumer, Data Provider, Search Service?) Engineering Node and Discipline Node Data Engineers (or Operations Engineers, depending on the context, but we can group those together). These "Data Engineers" are responsible for the end-to-end data flow and management of the archive: receiving the data from the data provider, validating the received data, ingesting the data into the Registry, releasing the data to the public, and pushing the data into the PDS deep archive at NSSDCA. A sketch of this flow follows below.
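As an illustration only (not part of the requirements), the end-to-end flow described above could be modeled as a linear pipeline of stages. The stage names and function signatures below are hypothetical; they simply mirror the steps listed in the answer.

```python
# Hypothetical sketch of the end-to-end archive flow described above.
# Stage names mirror the answer; none of these functions exist in Nucleus.

def receive(delivery_dir: str) -> list[str]:
    """Pick up a delivery from the data provider's staging area."""
    ...

def validate(products: list[str]) -> list[str]:
    """Run validation on the received products; return the valid ones."""
    ...

def ingest(products: list[str]) -> None:
    """Load validated products into the Registry."""
    ...

def release(products: list[str]) -> None:
    """Make the ingested products publicly available."""
    ...

def deep_archive(products: list[str]) -> None:
    """Push the released products to the PDS deep archive at NSSDCA."""
    ...

def run_pipeline(delivery_dir: str) -> None:
    products = receive(delivery_dir)
    valid = validate(products)
    ingest(valid)
    release(valid)
    deep_archive(valid)
```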
- Are these ETL processes real-time or batch ETL? Nucleus should support both. Ideally, the tools and services utilizing this pipeline would be robust enough to be run in parallel; however, we cannot make that assumption in the event we plug in legacy tools and services.
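A minimal sketch of how a plug-in contract might express both modes, assuming a hypothetical `PipelinePlugin` interface (not an existing Nucleus API). The `parallel_safe` flag reflects the caveat above that legacy tools may not be safe to run concurrently.

```python
from abc import ABC, abstractmethod
from typing import Iterable

class PipelinePlugin(ABC):
    """Hypothetical plug-in contract supporting both batch and (near) real-time use."""

    # Legacy tools that cannot run concurrently would set this to False,
    # and the orchestrator would then serialize their executions.
    parallel_safe: bool = True

    @abstractmethod
    def process_batch(self, items: list[str]) -> None:
        """Process a fixed batch of items (batch ETL)."""

    def process_one(self, item: str) -> None:
        """Process a single item as it arrives (real-time ETL).

        Default: wrap the item in a batch of one, so a batch-only
        legacy tool can still participate in a real-time pipeline.
        """
        self.process_batch([item])

    def process_stream(self, items: Iterable[str]) -> None:
        """Consume an ongoing stream by delegating to process_one."""
        for item in items:
            self.process_one(item)
```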
- If these ETL jobs are batches, what is the definition/scope of a batch? A batch should be defined within the particular plug-in module. For instance, for our current Validate, we found it worked best in batches of roughly 100. But other EN or DN components may require the entire data set as a single batch in order to process the data.
NOTE: I would consider real-time vs. batch processing a should-have or could-have requirement, not a must-have. If Nucleus requires plug-in components to be able to support real-time processing, then so be it.
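For example (hypothetical configuration, not an existing Nucleus format), the batch scope could be declared per plug-in, with the roughly 100-product batches that worked well for Validate and a whole-dataset batch for a component that needs everything at once. The plug-in name `dn_tool_x` is a placeholder.

```python
# Hypothetical per-plugin batching configuration (illustration only).
BATCHING = {
    "validate":  {"mode": "fixed", "batch_size": 100},  # ~100 products per batch worked well
    "dn_tool_x": {"mode": "whole_dataset"},             # needs the entire data set at once
}

def make_batches(plugin: str, items: list[str]) -> list[list[str]]:
    """Split the work for a plug-in according to its declared batch scope."""
    cfg = BATCHING[plugin]
    if cfg["mode"] == "whole_dataset":
        return [items]
    size = cfg["batch_size"]
    return [items[i:i + size] for i in range(0, len(items), size)]
```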
- It seems the Directory Listener software can trigger data pipelines. How can the Directory Listener software know when to trigger a pipeline? Not sure what this software is.
- How can the Directory Listener software know that data is staged by the Instrument Teams in the Mission Staging Area?
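Purely as an illustration of how such a listener could work (the software itself is not identified above), the sketch below polls a hypothetical staging directory and triggers a pipeline only when a delivery is complete, signaled here by an assumed "done" marker file written last by the data provider. The path and marker name are placeholders.

```python
import time
from pathlib import Path

STAGING_AREA = Path("/mission/staging")  # hypothetical Mission Staging Area path
DONE_MARKER = "delivery.complete"        # assumed marker written last by the data provider

def poll_staging_area(trigger_pipeline, interval_s: int = 60) -> None:
    """Trigger a pipeline for each delivery directory that contains a 'done' marker."""
    seen: set[Path] = set()
    while True:
        for marker in STAGING_AREA.glob(f"*/{DONE_MARKER}"):
            delivery = marker.parent
            if delivery not in seen:
                seen.add(delivery)
                trigger_pipeline(delivery)  # e.g. submit a pipeline run for this delivery
        time.sleep(interval_s)
```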
- Other than the above trigger by the Directory Listener software, are there any other ways that these pipelines are triggered? Are those triggers manual or automatic?
- If automatically triggered, based on which events?
- Are there any scheduled pipelines?
- Is there any preferred way to communicate data between components? E.g.:
  - Through events
  - RESTful API calls
  - Through a file system
- Or is it up to the software design to select the suitable data communication method as required? (One pattern combining these options is sketched below.)
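One common pattern that combines the options above, shown only as an illustration: components exchange small event messages (or REST payloads) that carry references to data on a shared file system, rather than the data itself. The function and field names below are hypothetical.

```python
import json

def make_handoff_event(pipeline_run: str, product_paths: list[str]) -> str:
    """Build a small message that points at data on a shared file system.

    The same payload could be published as an event (e.g. to Kafka) or
    POSTed to a RESTful endpoint; the bulk data stays on the file system.
    """
    return json.dumps({
        "pipeline_run": pipeline_run,
        "products": product_paths,  # file paths/URIs, not file contents
    })
```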
- What is the usual size of one unit of data to be processed?
- What is the maximum data volume of a unit to be processed?
- What is the average data volume of a unit to be processed?
- What is the maximum number of records to be processed per day?
- What is the expected availability of the data pipelines? 24/7?
In AIMS_Briefing_Work_Remaining.pptx, the user is expected to select one of the following processing paths:
* On-premise
* Cloud (ad-hoc)
* Cloud (pre-uploaded backups)
- Who or what process will make this selection?
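If that selection becomes a pipeline input, it could be captured as simply as an enumerated parameter; the enum and entry point below are hypothetical and only mirror the three paths listed in the briefing.

```python
from enum import Enum

class ProcessingPath(Enum):
    """The three processing paths listed in AIMS_Briefing_Work_Remaining.pptx."""
    ON_PREMISE = "on-premise"
    CLOUD_AD_HOC = "cloud-ad-hoc"
    CLOUD_PRE_UPLOADED = "cloud-pre-uploaded-backups"

def start_run(delivery: str, path: ProcessingPath = ProcessingPath.ON_PREMISE) -> None:
    """Hypothetical entry point where the selection (by a user or a rule) is supplied."""
    ...
```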
If Nucleus is going to be a generic data pipeline system:
- What are the sources to be supported from the source types below? (A combined sketch of source and destination adapters follows these lists.)
  - Files: then what are the file types to be supported?
  - Records in databases: then what are the databases to be supported?
  - Data retrieved from RESTful APIs
  - Data received as events (e.g., Kafka events)
- What are the destinations to be supported from the destination types below?
  - Files: then what are the file types to be supported?
  - Records in databases: then what are the databases to be supported?
  - Data posted to RESTful APIs
  - Data published as events (e.g., Kafka events)
  - Elasticsearch
- Any preferred programming languages for scripting?
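To make the question concrete, here is a hedged sketch of what generic source and destination adapters for the listed types could look like. The interfaces and class names are assumptions, not an existing Nucleus API, and Python is used only because it is the language of the other sketches, not as an answer to the scripting-language question.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Source(ABC):
    """A pluggable data source: files, database records, REST responses, or events."""
    @abstractmethod
    def read(self) -> Iterable[Any]:
        """Yield units of data to be processed."""

class Destination(ABC):
    """A pluggable destination: files, databases, REST endpoints, events, Elasticsearch."""
    @abstractmethod
    def write(self, records: Iterable[Any]) -> None:
        """Persist or publish processed units of data."""

# Example concrete adapters (hypothetical):
class FileSource(Source):
    def __init__(self, paths: list[str]) -> None:
        self.paths = paths

    def read(self) -> Iterable[str]:
        for path in self.paths:
            with open(path, encoding="utf-8") as f:
                yield f.read()

class RestDestination(Destination):
    def __init__(self, url: str) -> None:
        self.url = url

    def write(self, records: Iterable[Any]) -> None:
        import requests  # assumes the 'requests' package is available
        for record in records:
            requests.post(self.url, json=record, timeout=30)
```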
- What are the expected ways to add pluggable components to a data pipeline? (A configuration-file example is sketched below.)
  - Through a user interface?
  - Through configuration files?
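If the configuration-file route were chosen, a pipeline definition could be as simple as an ordered list of plug-in names and settings. The structure below is a hypothetical example (shown as the equivalent Python structure), not a defined Nucleus format, and the URL is a placeholder.

```python
# Hypothetical pipeline definition loaded from a configuration file.
PIPELINE_CONFIG = {
    "name": "archive-delivery",
    "steps": [
        {"plugin": "receive",      "params": {"staging_area": "/mission/staging"}},
        {"plugin": "validate",     "params": {"batch_size": 100}},
        {"plugin": "ingest",       "params": {"registry_url": "https://registry.example"}},
        {"plugin": "deep_archive", "params": {}},
    ],
}

def build_pipeline(config: dict, registry: dict) -> list:
    """Resolve each configured step to a registered plug-in callable plus its params."""
    return [(registry[step["plugin"]], step["params"]) for step in config["steps"]]
```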
- Should there be any user roles and permissions assigned for users/executors of the pipelines?
- If so, where do we maintain these user-role-permission mappings?
- Should there be an authentication and authorization process?
- What are the backup requirements?
- What are the data purging requirements?
- Are there any restrictions on selecting implementation technologies based on software licenses (open source vs. commercial licenses)?
- What deployment setups are to be supported (on-premise, cloud, etc.)?
- What are the logging and auditing requirements?
- Should we send any alerts based on the completion status of data pipelines?
- If so, which mechanisms should be used to send these alerts (e.g., email, text messages)?
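If email were chosen, a minimal completion alert could look like the sketch below; the SMTP host, sender address, and recipients are placeholders, and the hook would be called with the final pipeline status.

```python
import smtplib
from email.message import EmailMessage

def send_completion_alert(pipeline: str, status: str, recipients: list[str]) -> None:
    """Email a pipeline completion/failure alert (illustrative placeholders only)."""
    msg = EmailMessage()
    msg["Subject"] = f"[Nucleus] Pipeline {pipeline} finished with status: {status}"
    msg["From"] = "nucleus-alerts@example.org"      # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"Pipeline '{pipeline}' completed with status '{status}'.")
    with smtplib.SMTP("smtp.example.org") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)
```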