An Apache Spark RDD implementation for time series processing - based on Chronix.
- A ChronixRDD is a collection of univariate time series. Each series has its own vector of timestamps; the series are not aligned on one common vector of timestamps.
- Time series are multi-dimensional. Each time series is associated with one or more dimensions, and its identity is the combination of some of its dimension values.
- ChronixRDD has its own storage engine based on Solr Cloud and the Chronix format. The time series data is therefore stored space-efficiently and sharded, and it can be queried with low-level queries that perform predicate pushdown.

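
The data model described above can be pictured with a small plain-Java sketch. It is only an illustration under stated assumptions: the names `TimeSeriesSketch`, `Series`, and `identityOf`, as well as the example dimensions `host`, `metric`, and `unit`, are hypothetical and are not taken from the Chronix API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; these types are not part of the Chronix API.
final class TimeSeriesSketch {

    /** One univariate series: its own timestamps, one value per timestamp, plus dimensions. */
    static final class Series {
        final long[] timestamps;               // epoch millis, owned by this series alone
        final double[] values;                 // values[i] was observed at timestamps[i]
        final Map<String, String> dimensions;  // assumed example keys: host, metric, unit

        Series(long[] timestamps, double[] values, Map<String, String> dimensions) {
            if (timestamps.length != values.length) {
                throw new IllegalArgumentException("timestamps and values must have equal length");
            }
            this.timestamps = timestamps;
            this.values = values;
            this.dimensions = dimensions;
        }
    }

    /** Builds the identity of a series from a chosen subset of its dimensions. */
    static String identityOf(Series series, List<String> identityDimensions) {
        return identityDimensions.stream()
                .map(name -> name + "=" + series.dimensions.get(name))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, String> dims = new LinkedHashMap<>();
        dims.put("host", "web-01");
        dims.put("metric", "heap.used");
        dims.put("unit", "MB");

        // Irregularly spaced timestamps: no common time grid is required.
        Series series = new Series(
                new long[] {1000L, 1700L, 4200L},
                new double[] {512.0, 530.5, 498.0},
                dims);

        // Only host and metric form the identity; unit is a descriptive dimension.
        System.out.println(identityOf(series, Arrays.asList("host", "metric")));
        // prints "host=web-01,metric=heap.used"
    }
}
```

In this sketch the identity uses only a subset of the dimensions, which mirrors the statement that the identity of a time series is the combination of some of its dimension values.
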
How does Chronix Spark compare to Spark-TS?
- Spark-TS provides no dedicated time series storage; it relies on the Spark persistence mechanisms instead. This leads to less efficient storage usage and fewer opportunities to optimize performance via predicate pushdown.
- In contrast to Spark-TS, Chronix does not align all time series values on one vector of timestamps. This gives greater flexibility in time series aggregation.
- Chronix provides multi-dimensional time series, which is very useful for data warehousing and APM.
- Chronix supports Datasets, as this will be an important Spark API in the near future. However, Chronix currently doesn't support an IndexedRowMatrix for SparkML.
- Chronix is written purely in Java. There is no explicit support for Python or Scala yet.
- Chronix does not support a ZonedTime, as this would make it considerably more complicated.
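
The point about non-aligned timestamps can be illustrated with a minimal plain-Java sketch. It assumes a simple per-series mean over fixed-width time buckets; the class `BucketAggregationSketch` and the method `meanPerBucket` are hypothetical names chosen for this example and are neither Chronix nor Spark-TS code.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is plain Java, not Chronix or Spark-TS code.
final class BucketAggregationSketch {

    /**
     * Averages one series into fixed-width time buckets, keyed by bucket start.
     * Only buckets that contain at least one point appear in the result.
     */
    static TreeMap<Long, Double> meanPerBucket(long[] timestamps, double[] values, long bucketMillis) {
        // Per bucket: element 0 holds the running sum, element 1 the point count.
        TreeMap<Long, double[]> sumAndCount = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            long bucketStart = Math.floorDiv(timestamps[i], bucketMillis) * bucketMillis;
            double[] acc = sumAndCount.computeIfAbsent(bucketStart, k -> new double[2]);
            acc[0] += values[i];
            acc[1] += 1;
        }
        TreeMap<Long, Double> means = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        // Two series sampled at unrelated instants and at different rates.
        long[] timestampsA = {0L, 400L, 900L, 1500L};
        double[] valuesA = {1.0, 2.0, 3.0, 4.0};
        long[] timestampsB = {250L, 2100L};
        double[] valuesB = {10.0, 20.0};

        System.out.println(meanPerBucket(timestampsA, valuesA, 1000L)); // prints {0=2.0, 1000=4.0}
        System.out.println(meanPerBucket(timestampsB, valuesB, 1000L)); // prints {0=10.0, 2000=20.0}
    }
}
```

Because each series carries its own timestamps, the bucket width can be chosen at aggregation time and per series, without first resampling every series onto a shared time grid.
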