Data analysis interaction record analysis methodology

jaytlin edited this page Nov 10, 2014 · 36 revisions

Methodology:

This document describes a high-level methodology for analyzing interaction records from data analysis systems.

While we believe this methodology is general enough to use with interaction records from many data analysis systems, we will use interaction records from Splunk as our primary concrete example.

This methodology was used with Splunk queries to develop the results published in our LISA 2014 paper:

Analyzing Log Analysis: An Empirical Study of User Log Mining. S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz. Large Installation System Administration Conference (LISA). November 2014.

It may be helpful to have that paper on hand as you follow along.

The methodology is as follows:

  1. Gathering the data
  2. Parsing, organizing, and storing the data
  3. Getting an overview
  4. Imposing high-level categorization to understand tasks
  5. Understanding workflows via Markov diagrams and common subsequences
  6. Creating second-level categorization via clustering to understand sub-tasks

Gathering the data

Step one is to gather interaction records from a data analysis system. In the context of data analysis, interaction records usually represent sequences of transformations users apply to their data, as well as actions to modify program state. Typically, the easiest place to find interaction records is in one of the system logs.

The exact format and semantics of the data will vary depending on the system, but usually you can expect to find the following components:

  • timestamp
  • user
  • events or queries, each of which has:
    • action
    • parameters
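The record components above can be sketched as a simple data structure. This is a minimal illustration, not a schema from any particular system; the field names and example values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class InteractionRecord:
    """One logged user interaction with a data analysis system."""
    timestamp: datetime
    user: str
    action: str                       # e.g., a query command or event type
    parameters: dict = field(default_factory=dict)

# Hypothetical example record
record = InteractionRecord(
    timestamp=datetime(2014, 11, 10, 12, 0, 0),
    user="alice",
    action="search",
    parameters={"terms": "error"},
)
print(record.user, record.action)
```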
In Splunk

In Splunk, users transform their data via Splunk Processing Language (SPL) queries. An example Splunk query is:

```
search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc
```

These queries are logged in Splunk in audit.log. For more information on how to obtain the queries from your own Splunk installation, see the instructions in the README for the queryutils package.

Parsing, organizing, and storing the data

Once the data has been collected, it's important to be able to parse the interaction records into sequences of operations that have been applied to data, along with their parameters or arguments.
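As a rough illustration of this step, a pipelined query can be decomposed into a sequence of (operation, arguments) stages. The sketch below naively splits on the pipe character; a real parser (such as splparser, for SPL) handles quoting and the full grammar, so this is only an approximation for illustration.

```python
def parse_pipeline(query):
    """Naively split a pipelined query string into (command, args) stages.

    Caveat: splitting on '|' breaks if a pipe appears inside quotes;
    a real parser handles the full grammar.
    """
    stages = []
    for stage in query.split("|"):
        tokens = stage.strip().split(None, 1)
        if not tokens:
            continue
        command = tokens[0]
        args = tokens[1] if len(tokens) > 1 else ""
        stages.append((command, args))
    return stages

query = 'search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc'
for command, args in parse_pipeline(query):
    print(command, "->", args)
```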

In Splunk
  • To parse Splunk queries, download splparser
  • For a collection of utilities that make it easier to organize and analyze Splunk queries, download queryutils

Getting an overview

Now that you have collected the data and have a means of parsing that data into timestamped sequences of transformations applied to data, the next step is to get an overview of this data. In this phase, the goal is to summarize the basic properties of the data set. Some questions we might ask in this phase include:

  • What are the main conceptual entities in my data set? For example, these may be queries, commands, events, stages, pipelines, sessions, users, and so on.
  • How many users are there? How are user arrivals distributed? What is the user interarrival rate?
  • How many queries or analysis sessions are there? How are query interarrivals distributed? What is the query interarrival rate?
  • How is query length or analysis session length distributed?
  • Do queries or analysis sessions have other attributes, such as type, or data source analyzed? How many are there of each type?
  • How many different types of tasks or actions are there? How are the frequencies of these tasks statistically distributed?
  • How many tasks or actions are there per user, query, or analysis session?
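Several of the questions above reduce to simple summary statistics over the parsed records. The sketch below computes user counts, interarrival times, and a query-length distribution over a tiny hypothetical data set; the record layout is an assumption for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (timestamp in seconds, user, number of stages in query)
records = [
    (0, "alice", 3),
    (30, "bob", 1),
    (45, "alice", 5),
    (120, "bob", 2),
]

users = {user for _, user, _ in records}
timestamps = sorted(t for t, _, _ in records)
# Interarrival time = gap between consecutive query arrivals
interarrivals = [b - a for a, b in zip(timestamps, timestamps[1:])]
# Distribution of query length, measured in number of stages
length_dist = Counter(n for _, _, n in records)

print("users:", len(users))
print("mean interarrival (s):", mean(interarrivals))
print("query length distribution:", dict(length_dist))
```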
In Splunk

In our LISA 2014 paper analyzing Splunk data, the relevant tables that provide answers to some of these questions can be found in:

  • Table 1
  • Table 2 (Additional information was not included for privacy reasons.)

Imposing high-level categorization to understand tasks

The actions (e.g., query commands or event types) in data analysis systems often do not map one-to-one to the tasks users perform. Some commands are heavily overloaded in terms of functionality; in other cases, functionality belonging to one conceptual task type is spread across many actions. Therefore, simply examining the frequency distribution of actions is often not very informative for understanding what tasks users perform in the system.

TODO: Include example figure here.

Therefore, the next step is to understand what tasks users are performing. Some questions you might ask during this phase include:

  • How are the individual tasks statistically distributed?
  • What are the most common tasks users perform? What are the least common?

To answer these questions, it is usually necessary to create a taxonomy that categorizes each action into a task or transformation type. This usually involves a fair amount of expert judgement and hand coding via content analysis.
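In code, applying such a taxonomy amounts to a lookup table from actions to task categories. The category names and mappings below are invented for illustration; the taxonomy actually used for Splunk is given in Table 3 of the paper.

```python
from collections import Counter

# Hypothetical taxonomy mapping concrete actions to high-level task types.
TAXONOMY = {
    "search": "Filter",
    "where": "Filter",
    "stats": "Aggregate",
    "chart": "Aggregate",
    "eval": "Transform",
    "lookup": "Augment",
}

def classify(actions):
    """Map a sequence of actions to task categories and count them.

    Unknown actions fall into an 'Other' bucket for later hand coding.
    """
    return Counter(TAXONOMY.get(a, "Other") for a in actions)

task_counts = classify(["search", "stats", "lookup", "frobnicate"])
print(dict(task_counts))
```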

In Splunk

Our taxonomy for Splunk can be found in Table 3 in our LISA 2014 paper. We based this taxonomy on an extended version of the operators provided in the relational algebra.

When we use this taxonomy to classify stages of queries, we can produce the bar chart shown in Figure 2, which provides a much more informative picture of the tasks users are performing than the graph pictured above.

To produce this graph on your own Splunk queries, run the code in lupe/transformations/barchart.py. The command to run would be:

```
python lupe/transformations/barchart.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT_PATH -q QUERY_TYPE
```

For example, the command to generate Figure 2 in the paper is:

```
python lupe/transformations/barchart.py -s postgresdb -U lupe -P lupe -D lupe -o results/fig2 -q scheduled
```

Understanding workflows via Markov diagrams and common subsequences

Once you have an understanding of the frequency with which users perform various high-level tasks, you might next ask:

  • How are sequences of tasks statistically distributed? What type of tasks usually come first? What comes last? What tasks typically follow a given other task?
  • How many tasks do users typically apply in a given query or session? What are some common subsequences of tasks?

To answer these questions, we can use Markov diagrams to understand how users transition from one task to another within certain blocks of work (e.g., within a query, or a session).

We can then visually examine common paths through this graph, as well as see what longer, more complex sequences of tasks look like by computing common subsequences.
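Estimating the edge weights of such a Markov diagram is a matter of counting pairwise task transitions and normalizing. The sketch below does this over a few hypothetical task sequences, using START and END markers to capture which tasks begin and end a block of work (e.g., a query or session).

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities between tasks."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad with START/END so first and last tasks are represented
        padded = ["START"] + list(seq) + ["END"]
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    # Normalize each row of counts into probabilities
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }

# Hypothetical task sequences, one per session
sessions = [
    ["Filter", "Aggregate"],
    ["Filter", "Transform", "Aggregate"],
    ["Filter", "Aggregate"],
]
probs = transition_probabilities(sessions)
print(probs["START"])   # every session starts with Filter
print(probs["Filter"])  # Filter is followed by Aggregate 2/3 of the time
```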

In Splunk

Examples of Markov diagrams based on the task transition frequencies in thousands of queries can be found in Figures 3, 4, and 5 of our LISA 2014 paper.

To create these Markov diagrams on your Splunk queries, run the code in lupe/statemachines/main.py. The command to run would be:

```
python lupe/statemachines/main.py -s SOURCE -U USER -P PASSWORD -D DATABASE -o OUTPUT_PATH -t TYPE -r THRESHOLD -q QUERY_TYPE
```

For example, the command to generate Figure 4 in the paper is:

```
python lupe/statemachines/main.py -s postgresdb -U lupe -P lupe -D lupe -o results/fig4 -t solaris3-web-access -r 0.0 -q scheduled
```

Common subsequences of tasks are listed in Table 4 of our LISA 2014 paper.
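One simple way to surface common subsequences is to count contiguous n-grams over the task sequences. This is only a sketch of the general technique over hypothetical data, not the exact procedure used to produce Table 4.

```python
from collections import Counter

def common_subsequences(sequences, length):
    """Count contiguous task subsequences (n-grams) of a given length."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[tuple(seq[i:i + length])] += 1
    return counts

# Hypothetical task sequences, one per session
sessions = [
    ["Filter", "Aggregate", "Augment"],
    ["Filter", "Aggregate"],
    ["Filter", "Transform", "Aggregate"],
]
pairs = common_subsequences(sessions, 2)
print(pairs.most_common(2))
```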

Creating second-level categorization via clustering to understand sub-tasks

TODO: Fill in this section
