Massively Distributed Indexing of Time Series for Apache Spark

Find Out More

Work in progress...



Synthetic datasets

In this use-case, each object of dataset is identified with an ID and consist of 256 values. At each time point, a random walk generator draws a random number from a Gaussian distribution N(0,1), then adds the value of the last number to the new number. The number of time series varies from 50 million to 500 million depending on the experiment.

Finance dataset

Finance dataset, StockR, is based on historical finance data downloaded using the Yahoo Finance API that contains end-of-day quotes of over 40000 stock symbols for the period from Jan 2010 to Mar 2018. The dataset contains the price returns (fractional change of price from one day to another) of the quotes

Seismic dataset

The seismic dataset, Seismic, was obtained from the IRIS Seismic Data Access repository . It contains seismic instrument recording from thousands of stations worldwide and consists of 40 million data series of 200 values each.

Astronomy Data

The astronomy dataset, Astro, represents celestial objects. The dataset consists of 104 million data series of size 256.

Brain MRI Data

The neuroscience dataset, SALD, represents MRI data, including 209 million data series of size 128.

Computer Vision

The image processing dataset, Deep1B , contains 279 million Deep1B vectors of size 96 extracted from the last layers of a convolutional neural network.


Random dataset: contains 200 million time series each again of length 256, representing "white noise". Each point is randomly drawn from a Gaussian distribution N(0,1).

We show below set of charts to demonstrate the Spark-parSketch performance (vs. parallelized Linear Search algorithm), in terms of response time and precision. Bar chart compares time performance of two approaches. Two line charts depict the precision of similarity search as top candidates (up to 10) for selected query in a given dataset.

Parameters: The user can observe the tool performance on a range of input datasets ( Input Time Series ), alongside with various scale of query dataset ( Queries Number ). To proceed throught separate Query in the selected Dataset user may use the bottom slide bar ( Query # ).

The dropdown Cell size demonstrate how some parameters of spark-parSketch tool, related to the sketch-based algorithm, could impact the precision accuracy.

Another dropdown Search allows to choose between approximate and exact algorithms to find correlations with DPiSAX approach.


Input Time Series :
Batch Size (number of queries) :
Search 4 :

Response Time

Quality Metrics

per Batch of Queries:

Quality Ratio1

parSketch:  N/A


parSketch:  N/A

Mean Average Precision 3

parSketch:  N/A


QR1 per Query # :  0

parSketch:  N/A       DPiSAX:  N/A

Parallel Linear Search


Cell Size / Grid Granularity (p)5 :


1 Quality Ratio is defined to be correlation of the 10th time series found by a particular method divided by the correlation of the 10th closest time series found by direct computation of correlation.

2 Recall is calculated as a fraction of relevant items in the top 10 time series found by particular method over the top 10 time series found by direct correlations.

3 Mean Average Precision which considers the order of top 10 time series found by particular search method over ranked sequence of time series returned by direct correlations.

4 the parameter affects only DPiSAX data (time and precision)

5 the parameter affects only parSketch data (time and precision)

Experiments were conducted on a cluster of 16 compute nodes with two 8 cores Intel Xeon E5-2630 v3 CPUs, 128 GB RAM, 2x558GB capacity storage per node. The cluster is running under Hadoop version 2.7, Spark v. 2.4.0 and PostgreSQL v. 9.4 as a relational database system.

The video of demonstration use-case is available here .



Oleksandra Levchenko

Boyan Kolev

Reza Akbarinia

Florent Masseglia

Dennis Shasha

Themis Palpanas