
Data analysis interaction record analysis methodology


Methodology:

Overview: This document describes a high-level methodology for analyzing interaction records from data analysis systems. This methodology is implemented in the lupe module. At a high level, the steps taken to analyze query data are summarized by the following figure:

[Figure: SPL query analysis flowchart]

While this methodology should be general enough to use with interaction records from many data analysis systems, we will use interaction records from Splunk as our primary concrete example.

Materials:

  1. Code: This methodology is implemented in the lupe module.
  2. Example notebook: We have provided an example query data set in example/data and an analysis of that data set using the lupe module in an IPython notebook: example/example.ipynb. You can run the notebook or simply follow along by opening example/example.html.
  3. Paper: This methodology was used with Splunk queries to develop the results published in our LISA 2014 paper: Analyzing Log Analysis: An Empirical Study of User Log Mining. S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz. Large Installation System Administration Conference (LISA). November 2014. It may be helpful to have that paper on hand as you follow along.

Steps:

  1. Gathering the data
  2. Parsing, organizing, and storing the data
  3. Getting an overview
  4. Imposing high-level categorization to understand tasks
  5. Understanding workflows via Markov diagrams and common subsequences
  6. Clustering to understand sub-tasks

Gathering the data

Step one is to gather interaction records from a data analysis system. In the context of data analysis, interaction records usually represent sequences of transformations users apply to their data, as well as actions that modify program state. Typically, the easiest place to find interaction records is in one of the system's logs.

The exact format and semantics of the data will vary depending on the system, but you can usually expect to find the following components (a hypothetical record is sketched after the list):

  • timestamp
  • user
  • events or queries, each of which has:
    • action
    • parameters
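
For concreteness, a single interaction record with these components might look like the following. This is a hypothetical, simplified example; real field names and formats vary from system to system.

```python
# A hypothetical, simplified interaction record; real field names and formats
# vary from system to system.
record = {
    "timestamp": "2014-11-24 16:02:11",       # when the action was taken
    "user": "analyst_7",                      # who took it
    "action": "stats",                        # the operation applied to the data
    "parameters": ["count", "by", "status"],  # arguments to that operation
}
```
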
In Splunk

In Splunk, users transform their data via Splunk Processing Language (SPL) queries. An example Splunk query is: search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc. This query filters events containing the string "error", counts those events grouped by their status code, and then appends a human-readable description of each status code from a lookup table.

These queries are logged by Splunk in audit.log. For more information on how to obtain the queries from your own Splunk installation, see the instructions in the README for the queryutils package.

Example

In the example, the queries are provided in the example/data directory.

Parsing, organizing, and storing the data

Once the data has been collected, it's important to be able to parse the interaction records into sequences of operations that have been applied to data, along with their parameters or arguments.

In Splunk

  • To parse Splunk queries, download splparser (a minimal usage sketch follows this list)
  • For a collection of utilities that make it easier to organize and analyze Splunk queries, download queryutils
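
As a rough illustration of the parsing step, the sketch below assumes splparser exposes a top-level parse function as described in its README; the structure of the returned parse tree may differ across versions.

```python
# Minimal sketch: parse one SPL query into a parse tree with splparser.
# Assumes splparser exposes a top-level parse() function as described in its
# README; the structure of the returned parse tree may differ across versions.
import splparser

query = 'search "error" | stats count by status'
parsetree = splparser.parse(query)

# Here we simply print the tree; walking it to pull out each stage's command
# and arguments is the next step.
print(parsetree)
```
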
Example

In the example, steps 1 and 2 show how to set up and initialize a database to hold the query data.
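
If you are adapting this step to interaction records from another system, a minimal storage sketch might look like the following. This is a hypothetical SQLite schema for illustration only; it is not the schema that queryutils creates.

```python
# Hypothetical storage sketch: a minimal SQLite schema for parsed query records.
# Illustrative only; this is not the schema that queryutils sets up.
import sqlite3

conn = sqlite3.connect("queries.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS queries (
    id        INTEGER PRIMARY KEY,
    user      TEXT,
    timestamp TEXT,
    text      TEXT
);
CREATE TABLE IF NOT EXISTS stages (
    query_id  INTEGER REFERENCES queries(id),
    position  INTEGER,   -- index of the stage within the query pipeline
    command   TEXT,      -- e.g. "search", "stats", "lookup"
    arguments TEXT
);
""")
conn.commit()
```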

Getting an overview

Now that you have collected the data and have a means of parsing that data into timestamped sequences of transformations applied to data, the next step is to get an overview of this data. In this phase, the goal is to summarize the basic properties of the data set. Some questions we might ask in this phase include:

  • What are the main conceptual entities in my data set? For example, these may be queries, commands, events, stages, pipelines, sessions, users, and so on.
  • How many users are there? How are user arrivals distributed? What is the user interarrival rate?
  • How many queries or analysis sessions are there? How are query interarrivals distributed? What is the query interarrival rate?
  • How is query length or analysis session length distributed?
  • Do queries or analysis sessions have other attributes, such as type, or data source analyzed? How many are there of each type?
  • How many different types of tasks or actions are there? How are the frequencies of these tasks statistically distributed?
  • How many tasks or actions are there per user, query, or analysis session?
In Splunk

In our LISA 2014 paper analyzing Splunk data, the relevant tables that provide answers to some of these questions can be found in:

  • Table 1
  • Table 2 (Additional information was not included for privacy reasons.)
Example

In the example, step 3 provides some overview information about the example query data. Feel free to add your own additional overview analysis code.
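
For instance, a minimal sketch of a few of the overview statistics listed above might look like the following, assuming the query records have been loaded into a pandas DataFrame; the column names here are hypothetical, not the ones queryutils uses.

```python
# Minimal sketch of a few overview statistics, assuming the query records are
# in a pandas DataFrame with (hypothetical) columns:
#   "timestamp" (datetime), "user" (string), "query" (string).
import pandas as pd

def overview(df: pd.DataFrame) -> None:
    df = df.sort_values("timestamp")

    # How many users and queries are there?
    print("users:  ", df["user"].nunique())
    print("queries:", len(df))

    # How are query interarrivals distributed?
    interarrivals = df["timestamp"].diff().dropna()
    print("median interarrival:", interarrivals.median())

    # How many queries are there per user?
    print("median queries per user:", df.groupby("user").size().median())

    # How is query length (number of pipeline stages) distributed?
    stages_per_query = df["query"].str.count(r"\|") + 1
    print("median stages per query:", stages_per_query.median())
```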

Imposing high-level categorization to understand tasks

The actions (e.g., query commands or event types) in data analysis systems often do not map one-to-one to the tasks users perform. Some commands are heavily overloaded in terms of functionality. In other cases, functionality belonging to one conceptual task type may be spread across many actions. Therefore, simply examining the frequency distribution of actions is often not very informative for understanding what tasks users perform in the system. For example, the graph below shows the distribution of counts for each command. Because of the number of commands and the extreme skew in the distribution, it is hard to make sense of.

[Figure: Command frequencies]

Therefore, the next step is to understand what tasks users are performing. Some questions you might ask during this phase include:

  • How are the individual tasks statistically distributed?
  • What are the most common tasks users perform? What are the least common?

To answer these questions, it is usually necessary to create a taxonomy that categorizes each action into a task or transformation type. This usually involves a fair amount of expert judgement and hand coding via content analysis.
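
Once such a taxonomy exists, applying it is straightforward. The sketch below uses a deliberately tiny, hypothetical command-to-category mapping for illustration; it is not the full taxonomy from Table 3 of the paper.

```python
# Minimal sketch: map each command to a transformation category and tally the
# resulting task frequencies. The mapping here is a tiny, hypothetical
# illustration, not the full taxonomy from Table 3 of the LISA 2014 paper.
from collections import Counter

TAXONOMY = {
    "search": "Filter",
    "where":  "Filter",
    "stats":  "Aggregate",
    "eval":   "Augment",
    "fields": "Project",
    "sort":   "Reorder",
}

def categorize(commands):
    """Map a sequence of command names to transformation categories."""
    return [TAXONOMY.get(cmd, "Other") for cmd in commands]

# Example: commands extracted from the stages of parsed queries.
commands = ["search", "stats", "eval", "sort"]
print(Counter(categorize(commands)))
# Counter({'Filter': 1, 'Aggregate': 1, 'Augment': 1, 'Reorder': 1})
```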

In Splunk

Our taxonomy for Splunk can be found in Table 3 of our LISA 2014 paper. We based this taxonomy on an extended version of the operators provided by the relational algebra.

When we use this taxonomy to classify stages of queries, we can produce the bar chart shown in Figure 2, which provides a much more informative picture of the tasks users are performing than the graph pictured above.

[Figure 2]

To produce this graph on your own Splunk queries, run the code in lupe/transformations/barchart.py. The command to run would be:

python lupe/transformations/barchart.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT -q QUERY_TYPE

Example

In the example, step 4 shows how to run this code using the lupe module programmatically.

Understanding workflows via Markov diagrams and common subsequences

Once you have an understanding of the frequency with which users perform various high-level tasks, you might next ask:

  • How are sequences of tasks statistically distributed? What type of tasks usually come first? What comes last? What tasks typically follow a given other task?
  • How many tasks do users typically apply in a given query or session? What are some common subsequences of tasks?

To answer these questions, we can use Markov diagrams to understand how users transition from one task to another within certain blocks of work (e.g., within a query, or a session).

We can then visually examine common paths through this graph, as well as see what longer, more complex sequences of tasks look like by computing common subsequences.
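
A minimal sketch of both computations, over task-label sequences (one sequence per query or session), follows; the sequences and labels here are hypothetical placeholders.

```python
# Minimal sketch: first-order transition probabilities (the edges of a Markov
# diagram) and counts of contiguous task subsequences. The task sequences
# below are hypothetical placeholders.
from collections import Counter

sequences = [
    ["START", "Filter", "Aggregate", "END"],
    ["START", "Filter", "Augment", "Aggregate", "END"],
    ["START", "Filter", "Aggregate", "Reorder", "END"],
]

# Count transitions between consecutive tasks, then normalize each source
# state's outgoing counts into probabilities.
transitions = Counter()
for seq in sequences:
    transitions.update(zip(seq, seq[1:]))
outgoing = Counter()
for (src, _dst), count in transitions.items():
    outgoing[src] += count
probabilities = {edge: count / outgoing[edge[0]] for edge, count in transitions.items()}

# Count contiguous subsequences of tasks of a given length.
def subsequences(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

common = Counter(sub for seq in sequences for sub in subsequences(seq, 3))
print(probabilities)
print(common.most_common(3))
```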

In Splunk

Examples of Markov diagrams based on the task transition frequencies in thousands of queries can be found in Figures 3, 4, and 5 of our LISA 2014 paper.

[Figure 3]

To create these Markov diagrams on your Splunk queries, run the code in lupe/statemachines/compute.py. The command to run would be:

python lupe/statemachines/compute.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT -t TYPE -r THRESHOLD -q QUERY_TYPE

Common subsequences of tasks are listed in Table 4 of our LISA 2014 paper.

Example

In the example, step 5 shows how to run this code using the lupe module programmatically.

Clustering to understand sub-tasks

Lastly, now that we understand at a high level what types of transformation pipelines are applied in log analysis, we may want more detail about exactly what these transformations entail. We will focus on the most common transformation types. For example, we might ask:

  • What are the different ways in which Filter, Aggregate, and Augment transformations are applied, and how are these different ways distributed?
  • Can we identify higher-level tasks and activities by identifying related clusters of transformations? Do these clusters allow us to identify common workflow patterns? What can we infer about the user’s information needs from these groups?
  • How well do the commands in the Splunk query language map to the tasks users are trying to perform? What implications do the clusters we find have on data transformation language design?

To answer these questions, we take the following steps (steps 3-5 are sketched in code after the list):

  1. Parse each query.
  2. Extract the stages consisting of the given transformation type.
  3. Convert the stages into feature vectors.
  4. Optionally project these feature vectors down to a lower dimensional space using PCA.
  5. Project these features further down into two dimensions, to allow visualization of the clusters, using t-SNE.
  6. Manually identify and label clusters in the data.
  7. Estimate cluster sizes by labeling a random sample of stages in the data set.
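
A minimal sketch of steps 3 through 5 follows, using scikit-learn for PCA and t-SNE; the featurization here is a hypothetical placeholder, and the lupe module's own features and parameters may differ.

```python
# Minimal sketch of steps 3-5: turn stages into feature vectors, optionally
# reduce with PCA, then project to two dimensions with t-SNE for visualization.
# The featurization is a hypothetical placeholder; lupe's own features and
# parameters may differ.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def featurize(stage: str):
    """Toy featurization of one stage: a few hypothetical indicators."""
    tokens = stage.split()
    return [
        len(tokens),                 # number of tokens in the stage
        stage.count("="),            # number of key=value arguments
        float("by" in tokens),       # whether a group-by clause appears
    ]

def project_stages(stages, pca_components=3, perplexity=5.0):
    X = np.array([featurize(s) for s in stages], dtype=float)
    X = PCA(n_components=min(pca_components, X.shape[1])).fit_transform(X)  # step 4
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(X)     # step 5

# The resulting 2-D points can be plotted, and clusters identified, labeled by
# hand, and size-estimated from a random sample (steps 6 and 7).
```
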
In Splunk

An example of the clusters we can discover with this methodology can be found in Figure 6, Figure 7, and Figure 8 of our LISA 2014 paper.

Example

In the example, step 6 shows how to run this code using the lupe module programmatically.