\documentclass{article} % For LaTeX2e
\usepackage{nips14submit_e,times}
\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
\usepackage{authblk}
\usepackage{listings}
\usepackage{color}
\definecolor{lightgray}{rgb}{1,1,1}
\definecolor{darkgray}{rgb}{.4,.4,.4}
\definecolor{redstrings}{rgb}{0.64,0.08,0.08}
\definecolor{blue}{rgb}{0,0,1}
\definecolor{greencomments}{rgb}{0,0.5,0}
\definecolor{cyan}{rgb}{0.0,0.6,0.6}
\renewcommand{\lstlistingname}{Code fragment}
\lstdefinelanguage{JavaScript}{
keywords={break, case, catch, continue, debugger, default, delete, do, else, finally, for, function, if, in, instanceof, new, return, switch, this, throw, try, typeof, var, void, while, with, null},
keywordstyle=\color{blue}\bfseries,
ndkeywords={class, export, boolean, throw, implements, import, this},
ndkeywordstyle=\color{blue}\bfseries,
identifierstyle=\color{black},
sensitive=false,
comment=[l]{//},
morecomment=[s]{/*}{*/},
commentstyle=\color{greencomments}\ttfamily,
stringstyle=\color{redstrings}\ttfamily,
morestring=[b]',
morestring=[b]"
}
\lstset{
language=JavaScript,
backgroundcolor=\color{lightgray},
extendedchars=true,
basicstyle=\footnotesize\ttfamily,
showstringspaces=false,
showspaces=false,
numbers=left,
numberstyle=\footnotesize,
numbersep=9pt,
tabsize=2,
breaklines=true,
showtabs=false,
captionpos=b
}
\setcounter{Maxaffil}{2}
\title{QMiner: Data Analytics Platform for Processing Streams of Structured and Unstructured Data}
%\thanks {Use thanks if u you need it}
%for single author (just remove % characters)
\author[1,3]{\rm Blaz Fortuna}
\author[1]{\rm Jan Rupnik}
\author[1]{\rm Janez Brank}
\author[2,3]{\rm Carolina Fortuna}
\author[1]{\rm Viktor Jovanoski}
\author[1]{\rm Mario Karlovcec}
\author[1]{\rm Blaz Kazic}
\author[1]{\rm Klemen Kenda}
\author[1]{\rm Gregor Leban}
\author[1]{\rm Andrej Muhic}
\author[1]{\rm Blaz Novak}
\author[2]{\rm Jost Novljan}
\author[1]{\rm Miha Papler}
\author[1]{\rm Luis Rei}
\author[1]{\rm Blaz Sovdat}
\author[1]{\rm Luka Stopar}
\author[1]{\rm Marko Grobelnik}
\author[1]{\rm Dunja Mladenic}
\affil[1]{Artificial Intelligence Laboratory, Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia.}
\affil[2]{Department of Communication Systems, Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia.}
\affil[3]{Internet Based Communication Networks and Services, University of Gent, Gaston Crommerlaan 8, Gent, Belgium.}
\affil[ ]{\textit {\{firstname.lastname\}@ijs.si}}
\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}
\nipsfinalcopy
\begin{document}
\maketitle
\begin{abstract}
QMiner is an open source analytics platform for performing large scale data analysis, written in C++ and exposed via a JavaScript API. The paper presents five main design elements, which focus on storage, online and real-time processing, and fast prototyping, and which give QMiner unique advantages as a data analytics platform for processing streams of structured and unstructured data. These design elements are incorporated in a five-layer architecture consisting of 1) storage and indexing, 2) stream aggregate, 3) feature extractor, 4) linear algebra and 5) analytics layers. The functionality of the platform is demonstrated by three representative usage examples containing code fragments: text classification, time series prediction and community detection in graphs.\end{abstract}
\section{Introduction}
QMiner was incrementally developed within several EU Framework Programme research projects in the areas of text, web, and stream mining over the last couple of years. The main requirements for QMiner were derived from the need for rapid development of solutions focused on user interactivity and the ability to operate on large data sets (hundreds of gigabytes) in real time on high-end commodity hardware. These goals and constraints resulted in a unique set of features that we implemented in QMiner.
The applications are implemented in JavaScript, making it easy for novice users to get started. Using the JavaScript API, the user can easily compose complex data processing pipelines and integrate them with other systems via RESTful web services. The core system is based on a C++ library and can be included into custom C++ projects, thus providing them with stream processing and data analytics capabilities. Event Registry~\cite{eventRegistry} is one example of a system built around the QMiner storage and analytics modules.
QMiner is available as an open source project on GitHub under the AGPL licence. The repository contains the source code, an introductory guide, complete documentation of QMiner's JavaScript APIs and examples showcasing its various uses.
The paper is organised as follows. Section \ref{sec:design} describes the main design elements of the platform while Section \ref{sec:architecture} focuses on the architecture of the platform. Section \ref{sec:examples} details three representative usage examples which showcase how one can quickly build machine-learning applications using the tool. Finally, Section \ref{sec:conclusions} concludes the paper.
\section{Key design elements}
\label{sec:design}
In this section, we provide an overview of the main design elements which give QMiner unique advantages as a data analytics platform for processing streams of structured and unstructured data.
\textbf{Connecting storage, indexing and analytics:} QMiner stores and indexes the data in a way that makes the implemented machine learning methods more scalable. Computing feature vectors from stored records aims at reducing duplication of data in the process and relies heavily on the integrated indexing and its features (e.g. probabilistic joins) to speed up data transformations. Data schemas provide types for directly storing vectors and sparse vectors as part of records.
\textbf{Multi-modal data support:} QMiner provides native support for handling and learning from unstructured data such as text or graphs. Among the supported data operations are full text search, aggregation over text fields, creation and storage of bag-of-words vectors directly in the data stores, and full integration of the Stanford graph analysis library SNAP \cite{snap} in the C++ and JavaScript APIs. Additionally, several time series processing tools such as resampling, merging, computing moments, extrapolation, interpolation and discretization are implemented.
\textbf{Streaming processing:} QMiner provides building blocks for processing and learning from streams of text documents, website logs or numeric streams. For example, pipeline-able stream aggregates for maintaining aggregate statistics over the stream, and machine learning algorithms for learning from streams are available.
\textbf{Probabilistic joins between tables:} For some operations, computing a complete join between tables is not necessary. For example, to compute the gender distribution of visitors of a particular web page, we do not really need a complete list of visitors for that particular web page. What suffices is a statistically representative sample. QMiner supports several sampling techniques for achieving this, and integrates the support for probabilistic joins in the query language and feature extractors.
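To make the idea concrete, the following self-contained sketch (plain JavaScript, independent of the QMiner API) estimates the distribution of a field from a fixed-size uniform sample drawn with reservoir sampling. It is only an illustration under stated assumptions: the helper names \texttt{makeRng}, \texttt{reservoirSample} and \texttt{estimateDistribution}, the record layout and the synthetic visitor data are hypothetical, and the sampling techniques actually used inside QMiner may differ.

\begin{lstlisting}
// Small seeded pseudo-random generator (linear congruential),
// used only to make this sketch reproducible.
function makeRng(seed) {
  var state = seed >>> 0;
  return function () {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Reservoir sampling (Algorithm R): a uniform sample of k items,
// obtained in a single pass with O(k) memory.
function reservoirSample(items, k, rng) {
  var sample = [];
  for (var i = 0; i < items.length; i++) {
    if (i < k) {
      sample.push(items[i]);
    } else {
      var j = Math.floor(rng() * (i + 1));
      if (j < k) { sample[j] = items[i]; }
    }
  }
  return sample;
}

// Estimate the distribution of a field from the sampled records only.
function estimateDistribution(records, field, k, rng) {
  var sample = reservoirSample(records, k, rng);
  var counts = {};
  sample.forEach(function (rec) {
    counts[rec[field]] = (counts[rec[field]] || 0) + 1;
  });
  var dist = {};
  Object.keys(counts).forEach(function (key) {
    dist[key] = counts[key] / sample.length;
  });
  return dist;
}

// Hypothetical visitors of one web page: 30% "F", 70% "M".
var visitors = [];
for (var v = 0; v < 10000; v++) {
  visitors.push({ id: v, gender: v % 10 < 3 ? "F" : "M" });
}
var dist = estimateDistribution(visitors, "gender", 500, makeRng(42));
console.log(dist); // proportions should be close to 0.3 and 0.7
\end{lstlisting}

With 500 sampled records out of 10,000, the standard error of each estimated proportion is roughly two percentage points, which is often sufficient for this kind of aggregate question.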
\textbf{Efficiency and fast prototyping:} All main components of the library are implemented in C++ and are optimized for memory usage and computational speed. This includes the data store, indexing, feature extractors, linear algebra, stream processing components, and several machine learning and graph mining algorithms (the latter based on the SNAP library). Most of the core functionality is exposed in the JavaScript API, which enables fast prototyping and deployment as RESTful services, since QMiner can operate as an HTTP server.
\section{Architecture}
\label{sec:architecture}
The core blocks of the system provide support for data storage, preprocessing and analytics and the architecture is organised in the following five layers:
\textbf{Storage and indexing layer:} The basic storage abstractions are similar to the ones existing in databases: the data is organized around stores (tables), one data point inside a store corresponds to a record (row or instance) and one record consists of one or more fields (columns or features). The index layer provides support for indexing and retrieving records by implementing several types of indexes: (a) inverted index \cite[section 6.5]{knuth1998taocp3}, used to index discrete values and free text, and (b) geo-spatial index, which can be used to index geographic locations represented as longitude and latitude pairs. There are several additional indices, currently under implementation, which will extend the system: (c) B-tree, used to index linearly ordered data types, such as numbers and dates, and (d) locality sensitive hashing \cite{har2012approximate}, used to answer nearest neighbour queries on high-dimensional data such as sparse vectors. Indices can be utilized through a JSON-based query language. The indices are used by the Analytics layer for various tasks, such as describing and extracting features, and for sampling.
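As a minimal sketch of how an inverted index answers free-text queries, the following plain JavaScript fragment maps each token to a sorted list of record IDs and evaluates a conjunctive query by intersecting these lists. It does not use or reproduce the QMiner index or its JSON-based query language; the function names and the three sample records are hypothetical.

\begin{lstlisting}
// Split text into lower-case alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
    return t.length > 0;
  });
}

// Build an inverted index: token -> sorted array of record IDs.
function buildInvertedIndex(records, field) {
  var index = Object.create(null);
  records.forEach(function (rec, recId) {
    tokenize(rec[field]).forEach(function (token) {
      if (!index[token]) { index[token] = []; }
      var postings = index[token];
      // Records arrive in increasing ID order, so the list stays
      // sorted; only repeated tokens of the same record are skipped.
      if (postings[postings.length - 1] !== recId) {
        postings.push(recId);
      }
    });
  });
  return index;
}

// Intersect two sorted posting lists in linear time.
function intersect(a, b) {
  var out = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) { i++; }
    else { j++; }
  }
  return out;
}

// Conjunctive (AND) query over all tokens of the query string.
function search(index, query) {
  var tokens = tokenize(query);
  if (tokens.length === 0) { return []; }
  var result = (index[tokens[0]] || []).slice();
  for (var i = 1; i < tokens.length; i++) {
    result = intersect(result, index[tokens[i]] || []);
  }
  return result;
}

// Hypothetical records; the record ID is the array position.
var movies = [
  { Title: "Unnatural Selection",
    Plot: "Corpses are found with organs missing" },
  { Title: "Natural Selection", Plot: "A study of missing links" },
  { Title: "The Found Reel", Plot: "Organs of a theatre are found" }
];
var plotIndex = buildInvertedIndex(movies, "Plot");
console.log(search(plotIndex, "found organs")); // [ 0, 2 ]
console.log(search(plotIndex, "missing"));      // [ 0, 1 ]
\end{lstlisting}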
\textbf{Stream aggregates:} Aggregates are algorithms that take a stream of records and maintain aggregate statistics of the stream. The aggregates connect to a store, and update their state as new records are added to the store. QMiner contains a large collection of stream aggregates, e.g. statistical moments, exponential moving average, resampling and merging of streams, interpolation and extrapolation. In addition, new stream aggregates can be implemented directly in the JavaScript layer.
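The following sketch illustrates what such incrementally updated aggregates can look like. It is written in plain JavaScript and is not the QMiner implementation: the constructors \texttt{EmaAggregate} and \texttt{MomentsAggregate} are hypothetical, the exponential moving average uses one common time-aware formulation (weight $1 - e^{-\Delta t / \tau}$ for a time constant $\tau$), and the moments are maintained with Welford's algorithm.

\begin{lstlisting}
// Exponential moving average over a timestamped stream.
// `interval` acts as a time constant in milliseconds.
function EmaAggregate(interval) {
  this.interval = interval;
  this.value = null;    // current EMA, null until the first record
  this.lastTime = null; // timestamp of the last record seen
}

// Update the state with one record { Time: ms, Value: number }.
EmaAggregate.prototype.onAdd = function (rec) {
  if (this.value === null) {
    this.value = rec.Value;
  } else {
    var dt = rec.Time - this.lastTime;
    var alpha = 1 - Math.exp(-dt / this.interval);
    this.value += alpha * (rec.Value - this.value);
  }
  this.lastTime = rec.Time;
  return this.value;
};

// Running mean and variance (Welford's algorithm).
function MomentsAggregate() {
  this.n = 0;
  this.mean = 0;
  this.m2 = 0; // sum of squared deviations from the mean
}

MomentsAggregate.prototype.onAdd = function (rec) {
  this.n += 1;
  var delta = rec.Value - this.mean;
  this.mean += delta / this.n;
  this.m2 += delta * (rec.Value - this.mean);
};

MomentsAggregate.prototype.variance = function () {
  return this.n > 1 ? this.m2 / (this.n - 1) : 0;
};

// Feed a small hypothetical stream to both aggregates.
var ema = new EmaAggregate(60000);
var moments = new MomentsAggregate();
var stream = [
  { Time: 0, Value: 2 },
  { Time: 10000, Value: 4 },
  { Time: 20000, Value: 6 }
];
stream.forEach(function (rec) {
  ema.onAdd(rec);
  moments.onAdd(rec);
});
console.log(moments.mean, moments.variance()); // 4 4
console.log(ema.value); // lags behind the latest value of 6
\end{lstlisting}

Both aggregates keep a constant amount of state per stream, which is what allows them to be updated on every record added to a store.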
\textbf{Feature extractors:} A feature extractor maps records to linear algebra vectors (both dense and sparse vectors are supported). The core framework for feature extractors enables constructing and gluing several types of feature extractors and automatically performs all the necessary bookkeeping. Feature extractors include: text feature extractors (bag of words with optional stemming, stop-word removal, n-gram extraction, hashing), numeric feature extractors, element indicator vectors and set indicator vectors (multi-label). In addition, when working with the JavaScript API, custom feature extractors can be implemented as JavaScript functions.
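The bookkeeping involved in gluing extractors together can be sketched as follows. This is a plain JavaScript illustration under our own assumptions, not the QMiner feature space: \texttt{hashToken}, \texttt{textExtractor}, \texttt{numericExtractor} and \texttt{featureSpace} are hypothetical names, the text extractor is a hashed bag of words using the 32-bit FNV-1a hash, and the \texttt{Year} field is invented for the example.

\begin{lstlisting}
// Hash a token into one of `dim` buckets (32-bit FNV-1a).
function hashToken(token, dim) {
  var h = 2166136261;
  for (var i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % dim;
}

// Text extractor: hashed bag of words with term counts.
function textExtractor(field, dim) {
  return {
    dim: dim,
    extract: function (rec) {
      var part = {}; // sparse: local index -> count
      rec[field].toLowerCase().split(/[^a-z0-9]+/)
        .forEach(function (tok) {
          if (tok.length === 0) { return; }
          var idx = hashToken(tok, dim);
          part[idx] = (part[idx] || 0) + 1;
        });
      return part;
    }
  };
}

// Numeric extractor: copies one numeric field into one dimension.
function numericExtractor(field) {
  return {
    dim: 1,
    extract: function (rec) { return { 0: rec[field] }; }
  };
}

// Feature space: concatenates extractors and tracks index offsets.
function featureSpace(extractors) {
  return {
    dim: extractors.reduce(function (s, e) { return s + e.dim; }, 0),
    ftrVec: function (rec) {
      var vec = [], offset = 0;
      extractors.forEach(function (e) {
        var part = e.extract(rec);
        Object.keys(part).forEach(function (k) {
          vec.push([offset + Number(k), part[k]]);
        });
        offset += e.dim;
      });
      vec.sort(function (a, b) { return a[0] - b[0]; });
      return vec; // sorted list of [index, value] pairs
    }
  };
}

var space = featureSpace([
  numericExtractor("Year"),
  textExtractor("Plot", 16)
]);
var vec = space.ftrVec({
  Year: 1996, Plot: "organs missing, organs found"
});
console.log(space.dim); // 17
console.log(vec[0]);    // [ 0, 1996 ]
\end{lstlisting}

Hashing keeps the dimensionality fixed regardless of vocabulary size, at the price of occasional collisions between tokens.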
\textbf{Linear algebra:} The library provides efficient representation and manipulation of linear algebra structures (dense and sparse vectors and matrices) directly in JavaScript. The library includes algorithms for solving linear systems and computing several matrix decompositions. Randomized algorithms enable solving large scale problems, such as singular value decompositions~\cite{tropp} of large structured or sparse matrices. QMiner can optionally be compiled with optimized Basic Linear Algebra libraries such as Intel MKL, providing an additional boost in performance.
\textbf{Analytics:} The analytics components combine all components in order to implement various popular machine learning algorithms for clustering, dimensionality reduction, classification, regression and active-learning. The implementations can be easily applied directly to data stored in the storage layer. The analytics components are either implemented purely in C++ or alternatively in JavaScript, by using core C++ modules (such as linear algebra) to retain efficiency.
\section{Examples}
\label{sec:examples}
QMiner lets you get from data to working models exposed through a web service API in less than an hour. Most applications can be scripted fully in the JavaScript layer, taking advantage of the performance of components implemented in C++ and libraries such as Intel MKL. We demonstrate the system on three different tasks: text classification, time series prediction and community detection on graphs.
\subsection{Text processing and classification}
The example in this section demonstrates how to extract features from a movie dataset and build classification models for predicting movie genres. For a more advanced application of the platform we refer the reader to the Event Registry\cite{eventRegistry}.
First, we import the analytics module which provides machine learning algorithms. Next, we load the data and enumerate features we would like to extract from the data set. For each feature, we need to specify its type (for example, \texttt{text}), its source (for example, \texttt{Movies}), and its field (for example, \texttt{Plot}). The source is the \emph{data store} from which we want to extract the feature. We can now run a batch learner on the whole data set for training. The learner then returns the model for predicting movie genres. In the example, the model is stored in the \texttt{genreModel} variable. Finally, the model is used to predict the genre of a new, previously unseen, movie.
\begin{lstlisting}[caption={Example text mining: storage, feature extraction and classification.}]
// `analytics` refers to the imported analytics module described
// in the text; `Movies` is a handle to the Movies store.
var Movies = qm.store("Movies");
// Load the data set into the store.
qm.load.jsonFile(Movies, "./sandbox/movies/movies.json");
// Declare the features we will use to build classification models.
var genreFeatures = [
  { type: "constant", source: "Movies" },
  { type: "text", source: "Movies", field: "Title" },
  { type: "text", source: "Movies", field: "Plot" },
  { type: "join", source: { store: "Movies", join: "Director" } }
];
// Create a model for the Genres field by training on all movies.
var genreModel = analytics.newBatchModel(
  Movies.recs, genreFeatures, Movies.field("Genres"));
// Predict the genres of a new movie.
var newMovie = qm.store("Movies").newRec({
  Title: "Unnatural Selection",
  Plot: "When corpses are found with organs missing ...",
  Genres: ["Horror", "Sci-Fi"],
  Director: { Name: "Baggs Bill", Gender: "Male" }
});
var predictedGenre = genreModel.predict(newMovie);
\end{lstlisting}
\subsection{Time series processing}
The following example demonstrates how to perform time series prediction. For a more advanced example for online energy production forecasting, we refer the reader to \cite{energypredict}. The example uses stream aggregates (rounded rectangles in Figure \ref{fig:timeSeries}) to (a) resample incoming time series into an equally spaced time series, and (b) compute two exponential moving averages (EMAs). To learn the prediction model, it constructs a feature vector containing the current numeric value of the time series, the two EMAs and the date features (time of day, day of week, month). The feature vector is sent to the incremental linear regression model. A one minute delay operator is applied to the input variables, which corresponds to a one minute prediction horizon.
The example involves two \emph{data stores}: \texttt{Raw} and \texttt{Resampled}. The records from the \texttt{Raw} data store represent time series measurement pairs (timestamp, value) and \texttt{Resampled} stores the resampled time series, together with the results of two smoothers. Each time a record is added to the \texttt{Raw} store, the resampler is updated. The resampler may insert a record in the \texttt{Resampled} store based on the interpolation used. Each time a record is added to \texttt{Resampled} several operations are executed: computation of two smoothers, update of the delay operator, enrichment of the resampled record with the results of the smoothers, and the update of the linear regression model.
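The two preprocessing ideas described above, resampling onto an equally spaced grid and pairing delayed inputs with the current value, can be sketched in plain JavaScript as follows. This is an illustration under our own assumptions rather than the behaviour of the QMiner resampler: the functions \texttt{resampleLinear} and \texttt{delayedPairs} are hypothetical, linear interpolation is assumed, and the measurements are invented.

\begin{lstlisting}
// Resample an irregular series onto a regular grid using linear
// interpolation. `points` is an array of { Time, Value } sorted by
// Time; `step` is the grid spacing in milliseconds.
function resampleLinear(points, step) {
  var out = [];
  if (points.length === 0) { return out; }
  var last = points[points.length - 1].Time;
  var i = 0;
  var t = Math.ceil(points[0].Time / step) * step; // first grid point
  for (; t <= last; t += step) {
    // Advance to the segment [points[i], points[i + 1]] containing t.
    while (i + 1 < points.length && points[i + 1].Time < t) { i++; }
    var p = points[i];
    var q = points[Math.min(i + 1, points.length - 1)];
    var w = q.Time === p.Time ? 0 : (t - p.Time) / (q.Time - p.Time);
    out.push({ Time: t, Value: p.Value + w * (q.Value - p.Value) });
  }
  return out;
}

// Pair each resampled record with the record `lag` steps earlier,
// so a model can learn to predict the current value from delayed
// inputs. With a 10 s grid, a lag of 6 is a one minute horizon.
function delayedPairs(series, lag) {
  var pairs = [];
  for (var i = lag; i < series.length; i++) {
    pairs.push({ input: series[i - lag], target: series[i].Value });
  }
  return pairs;
}

// Hypothetical raw measurements at irregular timestamps (ms).
var raw = [
  { Time: 0, Value: 10 },
  { Time: 5000, Value: 20 },
  { Time: 25000, Value: 60 },
  { Time: 45000, Value: 100 }
];
var resampled = resampleLinear(raw, 10000);
console.log(resampled.map(function (r) { return r.Value; }));
// [ 10, 30, 50, 70, 90 ]
var pairs = delayedPairs(resampled, 2);
console.log(pairs.length); // 3
\end{lstlisting}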
\begin{figure}[h]
\begin{center}
\includegraphics[width=0.8\textwidth]{timeSeries3.PNG}
\end{center}
\caption{Example time series processing architecture}\label{fig:timeSeries}
\end{figure}
\begin{lstlisting}[caption=Example time series processing]
// Get handles to the two data stores.
var Raw = qm.store("Raw");
var Resampled = qm.store("Resampled");
// Create and connect stream aggregates to the stores.
// These correspond to the stream aggregates shown in Figure 1.
var sr = Raw.addStreamAggr({ name: "Resamp10s", type: "resampler",
  outStore: "Resampled" /*, resampling parameters */ });
var s0 = Resampled.addStreamAggr({ name: "tick",
  type: "timeSeriesTick" /*, aggregate parameters */ });
var s1 = Resampled.addStreamAggr({ name: "EMA1m", type: "ema",
  interval: 60000 /*, aggregate parameters */ });
var s2 = Resampled.addStreamAggr({ name: "EMA10m", type: "ema",
  interval: 600000 /*, aggregate parameters */ });
var s3 = Resampled.addStreamAggr({ name: "delay",
  type: "recordBuffer", size: 6 });
// Declare features from the resampled time series.
var ftrSpace = analytics.newFeatureSpace([
  { type: "numeric", source: "Resampled", field: "Value" },
  { type: "numeric", source: "Resampled", field: "Ema1" },
  { type: "numeric", source: "Resampled", field: "Ema2" },
  { type: "multinomial", source: "Resampled", field: "Time",
    datetime: true }
]);
// Initialize the linear regression model.
var linreg = analytics.newRecLinReg(/* model parameters */);
// Add the stream aggregate that enriches the record with the current
// EMAs, computes its feature vector and updates the model.
var s4 = Resampled.addStreamAggr({
  onAdd: function (rec) {
    // Enrich the record with the latest EMAs.
    rec.Ema1 = s1.getFlt();
    rec.Ema2 = s2.getFlt();
    // Get the ID of the record from a minute ago.
    var trainRecId = s3.val.last;
    // Compute the feature vector of that record.
    var ftrVec = ftrSpace.ftrVec(Resampled[trainRecId]);
    // Update the regression model: the input is delayed
    // by 1 minute and the output is the current value.
    linreg.learn(ftrVec, rec.Value);
  }
});
\end{lstlisting}
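To illustrate how an incremental linear regression can be updated one record at a time, the sketch below implements recursive least squares in plain JavaScript. We assume here, without having confirmed it against the QMiner sources, that this is the kind of update performed by the model in the fragment above; the constructor \texttt{RecLinReg} and its parameters (forgetting factor and initial covariance scale) are hypothetical and are not the QMiner API.

\begin{lstlisting}
// Recursive least squares for y ~ w . x.
// `forget` is the forgetting factor (1.0 means no forgetting) and
// `initVar` scales the initial inverse covariance matrix P.
function RecLinReg(dim, forget, initVar) {
  this.forget = forget;
  this.w = [];
  this.P = [];
  for (var i = 0; i < dim; i++) {
    this.w.push(0);
    var row = [];
    for (var j = 0; j < dim; j++) { row.push(i === j ? initVar : 0); }
    this.P.push(row);
  }
}

RecLinReg.prototype.predict = function (x) {
  var s = 0;
  for (var i = 0; i < x.length; i++) { s += this.w[i] * x[i]; }
  return s;
};

RecLinReg.prototype.learn = function (x, y) {
  var n = x.length, Px = [], i, j;
  // Px = P x
  for (i = 0; i < n; i++) {
    Px.push(0);
    for (j = 0; j < n; j++) { Px[i] += this.P[i][j] * x[j]; }
  }
  // denom = forget + x' P x
  var denom = this.forget;
  for (i = 0; i < n; i++) { denom += x[i] * Px[i]; }
  // Correct the weights along the gain vector Px / denom.
  var err = y - this.predict(x);
  for (i = 0; i < n; i++) { this.w[i] += (Px[i] / denom) * err; }
  // P = (P - Px Px' / denom) / forget  (P is symmetric)
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      this.P[i][j] =
        (this.P[i][j] - Px[i] * Px[j] / denom) / this.forget;
    }
  }
};

// Learn y = 3 + 2 t from a noise-free stream; the first feature
// is a constant, analogous to the "constant" feature type.
var model = new RecLinReg(2, 1.0, 1e6);
for (var t = 0; t < 10; t++) {
  model.learn([1, t], 3 + 2 * t);
}
console.log(model.predict([1, 10])); // close to 23
\end{lstlisting}

Each update costs a number of operations quadratic in the feature dimension and does not revisit earlier records, which is what makes this family of models suitable for streams.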
\subsection{Graph analysis}
QMiner supports handling of graph data by providing a JavaScript API to the integrated SNAP library \cite{snap}. The script in Code fragment 3 is an example of loading a graph, computing communities and visualizing the results.
After importing the required modules for graph analysis and visualization, an edge-list graph file is loaded into a new undirected graph. The graph represents the co-authoring network of Slovenian researchers from biotechnical science in 2013. The nodes of the graph represent researchers, while edges between nodes indicate co-authoring of at least one publication. Real networks often have community structure with several regions of internally densely connected nodes. Communities in the co-authoring network can indicate groups of strongly connected researchers from a research lab or a similar organization, or simply groups of scientists with a common research interest. The number and size of communities, and the connectivity between them, reveal the cohesiveness of a network. Following the simplification of the graph by removing weakly connected nodes (with degree 3 or less), the Clauset-Newman-Moore \cite{clauset-newman-moore} community detection algorithm is applied to compute the communities. In the final step, the graph is plotted using a force-directed layout, with node colors corresponding to communities. The complete script in Code fragment 3 is written in JavaScript, but some of the components, such as the community detection algorithm, execute in the highly optimized C++ SNAP code.
\begin{figure}[h]
\begin{center}
\includegraphics[scale=0.20]{communities_labeled.jpg}
\caption{Communities of co-authoring graph}
\end{center}
\end{figure}
\begin{lstlisting}[caption=Example graph analysis]
// Load an edge-list file into a new undirected graph.
var g = snap.newUGraph("biotechnology_2013.edg");
// Simplify the graph by removing all nodes with degree <= 3.
g = snap.removeNodes(g, 3);
// Compute communities using the Clauset-Newman-Moore algorithm.
var CmtyCNM = snap.communityDetection(g, "cnm");
// Plot the graph with node colors corresponding to communities.
viz.drawGraph(g, "graphCNM.html", { "color": CmtyCNM });
\end{lstlisting}
The results of the analysis show that the co-authoring graph has 12 communities, labeled from 1 to 12 in Figure 2. Three smaller communities (10, 11 and 12) are completely disconnected, while the rest form a connected component. By supporting SNAP, QMiner enables rich graph analysis in an easy-to-use JavaScript interface, which can be especially useful when combined with the supported scalable machine learning methods.
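The degree-based simplification and the notion of connected components used in this example can be illustrated without SNAP. The plain JavaScript sketch below is not the SNAP implementation and does not perform community detection: the function names are hypothetical, the toy edge list is invented, and we assume that low-degree nodes are removed in a single pass based on their degree in the input graph.

\begin{lstlisting}
// Build an undirected adjacency structure from an edge list.
function buildGraph(edges) {
  var adj = new Map();
  function touch(n) {
    if (!adj.has(n)) { adj.set(n, new Set()); }
    return adj.get(n);
  }
  edges.forEach(function (e) {
    if (e[0] === e[1]) { touch(e[0]); return; } // ignore self-loops
    touch(e[0]).add(e[1]);
    touch(e[1]).add(e[0]);
  });
  return adj;
}

// Remove every node whose degree in the input graph is <= maxDegree
// (single pass; degrees are not recomputed after removal).
function removeLowDegreeNodes(adj, maxDegree) {
  var keep = new Set();
  adj.forEach(function (nbrs, n) {
    if (nbrs.size > maxDegree) { keep.add(n); }
  });
  var out = new Map();
  keep.forEach(function (n) {
    var nbrs = new Set();
    adj.get(n).forEach(function (m) {
      if (keep.has(m)) { nbrs.add(m); }
    });
    out.set(n, nbrs);
  });
  return out;
}

// Connected components via breadth-first search.
function connectedComponents(adj) {
  var seen = new Set(), comps = [];
  adj.forEach(function (unused, start) {
    if (seen.has(start)) { return; }
    var comp = [], queue = [start];
    seen.add(start);
    while (queue.length > 0) {
      var n = queue.shift();
      comp.push(n);
      adj.get(n).forEach(function (m) {
        if (!seen.has(m)) { seen.add(m); queue.push(m); }
      });
    }
    comps.push(comp);
  });
  return comps;
}

// Toy graph: two triangles and one pendant node (7).
var toy = buildGraph([
  [1, 2], [2, 3], [3, 1],
  [4, 5], [5, 6], [6, 4],
  [7, 1]
]);
var pruned = removeLowDegreeNodes(toy, 1);
var comps = connectedComponents(pruned);
console.log(comps); // [ [ 1, 2, 3 ], [ 4, 5, 6 ] ]
\end{lstlisting}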
\section{Conclusion}
\label{sec:conclusions}
This paper presents QMiner, an open source analytics platform for performing large scale data analysis. The system is built from five architectural modules based on five key design elements, including: connecting storage, indexing and analytics, scalability of implemented machine learning methods, multimodal data support, processing of streaming data, probabilistic joins between tables and fast as well as scalable prototyping.
The core of the platform is implemented in C++ and has a JavaScript layer on top of it which makes it more user friendly without sacrificing performance. As illustrated with the three representative examples of text classification, time series processing and graph analysis, complete workflows can be implemented with simple scripts.
\subsubsection*{Acknowledgments}
The research leading to these results has received funding from several European Union Seventh Framework Programme (FP7/2007-2013) grants including Sophocles (ICT-317534-FET), X-LIKE (ICT-257790-STREP) and SYMPHONY Collaborative project (FP7-ICT-611875).
\bibliographystyle{unsrt}
\bibliography{qminer}
\end{document}