# Data analysis interaction record analysis methodology
Overview: This document describes a high-level methodology for analyzing interaction records from data analysis systems. The methodology is implemented in the `lupe` module. At a high level, the analysis proceeds through the steps listed under Steps below.
While this methodology should be general enough to use with interaction records from many data analysis systems, we will use interaction records from Splunk as our primary concrete example.
Materials:
- Code: This methodology is implemented in the `lupe` module.
- Example notebook: We have provided an example query data set in `example/data` and an analysis of that data set using the `lupe` module in an IPython notebook, `example/example.ipynb`. You can run the notebook or simply follow along by opening `example/example.html`.
- Paper: This methodology was used with Splunk queries to develop the results published in our LISA 2014 paper: "Analyzing Log Analysis: An Empirical Study of User Log Mining." S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz. Large Installation System Administration Conference (LISA), November 2014. It may be helpful to have that paper on hand as you follow along.
Steps:
1. Gathering the data
2. Parsing, organizing, and storing the data
3. Getting an overview
4. Imposing high-level categorization to understand tasks
5. Understanding workflows via Markov diagrams and common subsequences
6. Clustering to understand sub-tasks
Step one is to gather interaction records from a data analysis system. In the context of data analysis, interaction records usually represent sequences of transformations users apply to their data, as well as actions that modify program state. Typically, the easiest place to find interaction records is in one of the system logs.
The exact format and semantics of the data will vary depending on the system, but you can usually expect to find the following components (a minimal sketch of such a record follows this list):
- timestamp
- user
- events or queries, each of which has:
  - action
  - parameters
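
To make this concrete, here is a minimal sketch (in Python) of how such a record might be represented in memory. The class and field names are illustrative assumptions for this document, not a schema defined by `lupe` or Splunk.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Stage:
    """One action (e.g., an SPL command) plus its parameters."""
    action: str
    parameters: List[str] = field(default_factory=list)

@dataclass
class InteractionRecord:
    """A single logged query or event, attributed to a user."""
    timestamp: datetime
    user: str
    stages: List[Stage] = field(default_factory=list)

# Hypothetical example record for the query: search "error" | stats count by status
record = InteractionRecord(
    timestamp=datetime(2014, 11, 1, 12, 0, 0),
    user="alice",
    stages=[
        Stage(action="search", parameters=['"error"']),
        Stage(action="stats", parameters=["count", "by", "status"]),
    ],
)
```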
In Splunk, users transform their data via Splunk Processing Language (SPL) queries. An example Splunk query is:

```
search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc
```
These queries are logged in Splunk in `audit.log`. For more information on how to obtain the queries in your personal Splunk installation, see the instructions in the README for the `queryutils` package.
In the example, the queries are provided in the `example/data` directory.
Once the data has been collected, it's important to be able to parse the interaction records into sequences of operations that have been applied to data, along with their parameters or arguments.
- To parse Splunk queries, download `splparser`.
- For a collection of utilities that make it easier to organize and analyze Splunk queries, download `queryutils`.
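
If you want to experiment before installing those packages, a naive split on pipe characters is often enough to get a rough (command, arguments) sequence out of an SPL query. This sketch is only an approximation; it does not handle quoting, subsearches, or macros, which is what `splparser` is for.

```python
from typing import List, Tuple

def naive_parse(query: str) -> List[Tuple[str, List[str]]]:
    """Split an SPL query into (command, arguments) stages at pipe boundaries.

    This ignores quoting and nested subsearches; use splparser for real parsing.
    """
    stages = []
    for stage in query.split("|"):
        tokens = stage.strip().split()
        if not tokens:
            continue
        command, args = tokens[0], tokens[1:]
        stages.append((command, args))
    return stages

print(naive_parse('search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc'))
# [('search', ['"error"']), ('stats', ['count', 'by', 'status']),
#  ('lookup', ['statuscodes', 'status', 'OUTPUT', 'statusdesc'])]
```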
In the example, steps 1 and 2 show how to set up and initialize a database to hold the query data.
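
The `queryutils` package manages this database for you. Purely as an illustration of the kind of storage involved, a minimal SQLite layout might look like the sketch below; the table and column names here are assumptions for this example, not the `queryutils` schema.

```python
import sqlite3

# Hypothetical minimal schema for holding users and their raw queries.
conn = sqlite3.connect("queries.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS queries (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER REFERENCES users(id),
    timestamp REAL,      -- UNIX epoch seconds
    text      TEXT       -- the raw SPL query string
);
""")
conn.commit()
```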
Now that you have collected the data and have a means of parsing that data into timestamped sequences of transformations applied to data, the next step is to get an overview of this data. In this phase, the goal is to summarize the basic properties of the data set. Some questions we might ask in this phase include:
- What are the main conceptual entities in my data set? For example, these may be queries, commands, events, stages, pipelines, sessions, users, and so on.
- How many users are there? How are user arrivals distributed? What is the user interarrival rate?
- How many queries or analysis sessions are there? How are query interarrivals distributed? What is the query interarrival rate?
- How is query length or analysis session length distributed?
- Do queries or analysis sessions have other attributes, such as type, or data source analyzed? How many are there of each type?
- How many different types of tasks or actions are there? How are the frequencies of these tasks statistically distributed?
- How many tasks or actions are there per user, query, or analysis session?
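
As one small example of this kind of overview, the sketch below (reusing the hypothetical `InteractionRecord` objects from the earlier sketch) computes per-user query counts and query interarrival times.

```python
from collections import Counter
from statistics import mean, median

def overview(records):
    """Basic summary: query counts per user and query interarrival times in seconds."""
    queries_per_user = Counter(r.user for r in records)

    timestamps = sorted(r.timestamp for r in records)
    interarrivals = [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]

    print("users:", len(queries_per_user))
    print("queries:", len(records))
    print("queries per user (top 5):", queries_per_user.most_common(5))
    if interarrivals:
        print("mean / median interarrival (s):", mean(interarrivals), median(interarrivals))
```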
In our LISA 2014 paper analyzing Splunk data, the relevant tables that provide answers to some of these questions can be found in:
- Table 1
- Table 2 (Additional information was not included for privacy reasons.)
In the example, step 3 provides some overview information about the example query data. Feel free to add your own additional overview analysis code.
As you might observe, the actions (e.g., query commands or event types) in data systems usually do not map one-to-one to the tasks users perform. Some commands are heavily overloaded in terms of functionality; in other cases, functionality belonging to one conceptual task type is spread across many actions. Therefore, simply examining the frequency distribution of actions is often not very informative for understanding what tasks users perform in the system. For example, plotting the count of each command yields a graph that is hard to make sense of, because of the sheer number of commands and the extreme skew in the distribution.
Therefore, the next step is to understand what tasks users are performing. Some questions you might ask during this phase include:
- How are the individual tasks statistically distributed?
- What are the most common tasks users perform? What are the least common?
To answer these questions, it is usually necessary to create a taxonomy that categorizes each action into a task or transformation type. This usually involves a fair amount of expert judgement and hand coding via content analysis.
Our taxonomy for Splunk can be found in Table 3 of our LISA 2014 paper. We based this taxonomy on an extended version of the operators provided in the relational algebra.
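
In code, such a taxonomy often reduces to a lookup table from action name to transformation category. The partial mapping below is an illustrative approximation in the spirit of Table 3, not a reproduction of it; consult the paper for the authoritative categorization.

```python
# Illustrative, partial mapping from SPL command to transformation category.
# See Table 3 of the LISA 2014 paper for the full taxonomy.
TAXONOMY = {
    "search":    "Filter",
    "where":     "Filter",
    "head":      "Filter",
    "stats":     "Aggregate",
    "timechart": "Aggregate",
    "eval":      "Augment",
    "rex":       "Augment",
    "lookup":    "Join",
    "sort":      "Reorder",
    "table":     "Project",
}

def classify(command: str) -> str:
    """Map a command to its transformation category, defaulting to 'Other'."""
    return TAXONOMY.get(command.lower(), "Other")

print([classify(c) for c in ["search", "stats", "lookup"]])
# ['Filter', 'Aggregate', 'Join']
```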
When we use this taxonomy to classify stages of queries, we can produce the bar chart shown in Figure 2, which provides a much more informative picture of the tasks users are performing than the raw command frequency distribution described above.
To produce this graph on your own Splunk queries, run the code in `lupe/transformations/barchart.py`. The command to run would be:

```
python lupe/transformations/barchart.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT -q QUERY_TYPE
```
In the example, step 4 shows how to run this code using the `lupe` module programmatically.
Once you have an understanding of the frequency with which users perform various high-level tasks, you might next ask:
- How are sequences of tasks statistically distributed? What type of tasks usually come first? What comes last? What tasks typically follow a given other task?
- How many tasks do users typically apply in a given query or session? What are some common subsequences of tasks?
To answer these questions, we can use Markov diagrams to understand how users transition from one task to another within certain blocks of work (e.g., within a query, or a session).
We can then visually examine common paths through this graph, as well as see what longer, more complex sequences of tasks look like by computing common subsequences.
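
Building a first-order Markov diagram amounts to counting transitions between consecutive task labels (with synthetic start and end states) and normalizing each row. Here is a minimal sketch, assuming each query or session has already been reduced to a list of category labels:

```python
from collections import defaultdict

def transition_probabilities(task_sequences):
    """Estimate P(next task | current task) from lists of task labels.

    Synthetic START/END states capture what users do first and last.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in task_sequences:
        path = ["START"] + list(sequence) + ["END"]
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1

    probabilities = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probabilities[current] = {nxt: n / total for nxt, n in successors.items()}
    return probabilities

example = [["Filter", "Aggregate"], ["Filter", "Augment", "Aggregate"]]
print(transition_probabilities(example)["Filter"])
# {'Aggregate': 0.5, 'Augment': 0.5}
```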
Examples of Markov diagrams based on the task transition frequencies in thousands of queries can be found in Figures 3, 4, and 5 of our LISA 2014 paper.
To create these Markov diagrams on your Splunk queries, run the code in `lupe/statemachines/compute.py`. The command to run would be:

```
python lupe/statemachines/compute.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT -t TYPE -r THRESHOLD -q QUERY_TYPE
```
Common subsequences of tasks are listed in Table 4 of our LISA 2014 paper.
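
A simple way to approximate this kind of analysis is to count contiguous n-grams of task labels, as in the sketch below; note that this is a simplification of the subsequence analysis used in the paper.

```python
from collections import Counter

def common_subsequences(task_sequences, length=3, top=10):
    """Count contiguous task n-grams of the given length across all sequences."""
    ngrams = Counter()
    for sequence in task_sequences:
        for i in range(len(sequence) - length + 1):
            ngrams[tuple(sequence[i:i + length])] += 1
    return ngrams.most_common(top)
```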
In the example, step 5 shows how to run this code using the `lupe` module programmatically.
Lastly, now that we understand at a high level what types of transformation pipelines are applied in log analysis, we may want more detail about exactly what these transformations entail. We will focus on the most common transformation types. For example, we might ask:
- What are the different ways in which Filter, Aggregate, and Augment transformations are applied, and how are these different ways distributed?
- Can we identify higher-level tasks and activities by identifying related clusters of transformations? Do these clusters allow us to identify common workflow patterns? What can we infer about the user’s information needs from these groups?
- How well do the commands in the Splunk query language map to the tasks users are trying to perform? What implications do the clusters we find have on data transformation language design?
To answer these questions, we:
- Parse each query.
- Extract the stages consisting of the given transformation type.
- Convert the stages into feature vectors.
- Optionally project these feature vectors down to a lower dimensional space using PCA.
- Project these features further down into two dimensions, using t-SNE, to allow visualization of the clusters.
- Manually identify and label clusters in the data.
- Estimate cluster sizes by labeling a random sample of stages in the data set.
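
The dimensionality-reduction portion of this pipeline might look like the sketch below, assuming scikit-learn is available and that the stages of a given transformation type have already been converted into a numeric feature matrix `X` (for example, bag-of-words counts over stage arguments).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_stages(X, pca_components=50, random_state=0):
    """Reduce stage feature vectors with PCA, then embed them in 2-D with t-SNE."""
    X = np.asarray(X, dtype=float)
    # Optional PCA step: only worthwhile when there are more features than components.
    if X.shape[1] > pca_components:
        X = PCA(n_components=pca_components, random_state=random_state).fit_transform(X)
    return TSNE(n_components=2, random_state=random_state).fit_transform(X)

# Typical usage, once X has been built from parsed stages:
#   embedding = project_stages(X)
#   plt.scatter(embedding[:, 0], embedding[:, 1], s=5)   # then identify and label clusters by hand
```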
Examples of the clusters we can discover with this methodology can be found in Figures 6, 7, and 8 of our LISA 2014 paper.
In the example, step 6 shows how to run this code using the `lupe` module programmatically.