The Kids First Data Portal will need a service to manage which indices are made available for display, faceted and text-based search, and data exploration.
For these indices to be managed in a way that allows for growth and consistency, a schema is needed for naming them.
As the volume of data searchable through the Kids First Data Resource Portal (DRP) grows, an ever-increasing number of Elasticsearch indices will be in use at any given time.
To meet the feature requirements of the DRP, the data is sharded into multiple indices based on the study that the data comes from.
As the number of studies and entities tracked by the portal grows, the number of indices in use will grow as #studies * #entities. If we also account for indices that serve different purposes, such as centric indices for faceted search and text indices for text search, the potential growth becomes #studies * #entities * #types. Managing the publication of these indices through the manipulation of aliases will become a large and important task, one that will be greatly aided by establishing a schema for naming an Elasticsearch index.
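To illustrate the alias manipulation mentioned above, the following is a minimal sketch of how a publish step might build the body of a single Elasticsearch `POST /_aliases` request that moves an alias from one release's index to the next. The helper name `alias_swap_actions`, the alias `participant_centric`, and the index names are hypothetical and chosen only for illustration; the sketch builds the request body and does not contact a cluster.

```python
import json


def alias_swap_actions(alias, old_indices, new_indices):
    """Build the body of one POST /_aliases request.

    All actions in a single request are applied together, so readers of
    the alias do not see a moment where it points at no index.
    """
    actions = [{"remove": {"index": name, "alias": alias}} for name in old_indices]
    actions += [{"add": {"index": name, "alias": alias}} for name in new_indices]
    return {"actions": actions}


# Hypothetical names: one study shard moving from release re1 to release re2.
body = alias_swap_actions(
    alias="participant_centric",
    old_indices=["participant_centric_sd_abc123_re1"],
    new_indices=["participant_centric_sd_abc123_re2"],
)
print(json.dumps(body, indent=2))
```

With a predictable naming schema, the lists of old and new indices can be derived from the release identifier alone rather than maintained by hand.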
The schema for index naming in terms of a Context-Free Grammar is as follows:
<index_name> --> <entity>_<index_type>_<shard_prefix>_<shard_id>_<release_id>
<entity> --> "^[a-z]*$"
<index_type> --> centric | text | entity
<shard_prefix> --> "^[a-z]{0,2}$"
<shard_id> --> "^[a-zA-Z0-9]*$"
<release_id> --> "^[a-zA-Z0-9]*$"
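As a rough sketch of how this grammar could be checked in code, the five rules can be combined into one regular expression with named groups. The names `INDEX_NAME`, `IndexName`, and `parse_index_name` are assumptions made for this example, as is the sample index name; none of them come from an existing service.

```python
import re
from typing import NamedTuple, Optional

# The field patterns from the grammar, joined by literal underscores.
# No field pattern allows "_", so the split into five parts is unambiguous.
INDEX_NAME = re.compile(
    r"(?P<entity>[a-z]*)"
    r"_(?P<index_type>centric|text|entity)"
    r"_(?P<shard_prefix>[a-z]{0,2})"
    r"_(?P<shard_id>[a-zA-Z0-9]*)"
    r"_(?P<release_id>[a-zA-Z0-9]*)"
)


class IndexName(NamedTuple):
    entity: str
    index_type: str
    shard_prefix: str
    shard_id: str
    release_id: str


def parse_index_name(name: str) -> Optional[IndexName]:
    """Return the parsed fields, or None if the name does not fit the schema."""
    match = INDEX_NAME.fullmatch(name)
    return IndexName(**match.groupdict()) if match else None


# Hypothetical index name used only to exercise the parser.
parsed = parse_index_name("participant_centric_sd_abc123_re1")
print(parsed)
print(parse_index_name("participant_facets_sd_abc123_re1"))  # unknown index type
```

Two properties of the grammar as written show up here. Because the patterns use `*`, empty fields are accepted, so a stricter variant might use `+`. Also, Elasticsearch only accepts lowercase index names, so uppercase characters allowed in the shard and release identifiers would need to be lowercased before an index is created.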
The primary drawback is that if a need arises to generate versioned indices that are feature-specific and do not conform to the entity/shard design, our schema will be unable to describe these indices.
The schema is simply an aid that helps keep the source of truth about what an index describes with the data itself. An alternative would be to build a service that tracks metadata about releases and studies and acts as the source of truth for which index pertains to a particular dataset or use case.
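To make the alternative more concrete, here is a minimal in-memory sketch of what such a metadata-tracking service might store and answer. Everything in it is hypothetical: the `IndexRecord` fields, the `IndexRegistry` class, and the example index names are assumptions for illustration, and a real service would need persistent storage and an API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IndexRecord:
    """Metadata kept alongside an index instead of being encoded in its name."""
    index_name: str
    study_id: str
    release_id: str
    entity: str
    purpose: str


class IndexRegistry:
    """Toy stand-in for a service that acts as the source of truth."""

    def __init__(self) -> None:
        self._records: Dict[str, IndexRecord] = {}

    def register(self, record: IndexRecord) -> None:
        self._records[record.index_name] = record

    def find(self, **criteria: str) -> List[IndexRecord]:
        # Keep records whose fields equal every given criterion.
        return [
            record for record in self._records.values()
            if all(getattr(record, key) == value for key, value in criteria.items())
        ]


registry = IndexRegistry()
registry.register(IndexRecord(
    index_name="participant_centric_sd_abc123_re1",
    study_id="abc123", release_id="re1", entity="participant", purpose="centric",
))
# A feature-specific index whose name does not follow the schema.
registry.register(IndexRecord(
    index_name="gene_search_v2",
    study_id="abc123", release_id="re1", entity="gene", purpose="feature",
))

matches = registry.find(study_id="abc123", purpose="feature")
print([record.index_name for record in matches])
```

The trade-off is that index names no longer have to carry their own description, at the cost of operating and trusting an additional component.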
- Release ID format needs to be formalized with CHOP