Data analysis interaction record analysis methodology
This document describes a high-level methodology for analyzing interaction records from data analysis systems.
While we believe this methodology is general enough to use with interaction records from many data analysis systems, we will use interaction records from Splunk as our primary concrete example.
This methodology was used with Splunk queries to develop the results in a paper published in LISA 2014:
Analyzing Log Analysis: An Empirical Study of User Log Mining. S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz. Large Installation System Administration Conference (LISA). November 2014.
The methodology is as follows:
- Gathering the data
- Parsing, organizing, and storing the data
- Getting an overview
- Imposing high-level categorization to understand tasks
- Understanding workflows via Markov diagrams and common subsequences
- Creating second-level categorization via clustering to understand sub-tasks
Step one is to gather interaction records from a data analysis system. In the context of data analysis, interaction records usually represent sequences of transformations users apply to their data, as well as actions to modify program state. Typically, the easiest place to find interaction records is in one of the system logs.
In Splunk, users transform their data via Splunk Processing Language (SPL) queries. An example Splunk query is:

```
search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc
```

This query retrieves events containing the string "error", counts those events by their status code, and then appends a human-readable description of each status code from a lookup table.
These queries are logged in Splunk's audit.log. For more information on how to obtain the queries from your own Splunk installation, see the instructions in the README for the queryutils package.
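For illustration, here is a minimal Python sketch of pulling raw query strings out of an audit log. Both the file path and the `search='...'` pattern are assumptions about a typical installation and log format; see the queryutils README for a fuller approach.

```python
import re

# Hypothetical path; adjust to your installation
# (typically $SPLUNK_HOME/var/log/splunk/audit.log).
AUDIT_LOG = "/opt/splunk/var/log/splunk/audit.log"

# Assumption: audit events embed the query text as search='...'.
SEARCH_RE = re.compile(r"search='([^']*)'")

def extract_queries(path):
    """Yield the raw SPL query string from each audit event that has one."""
    with open(path) as f:
        for line in f:
            match = SEARCH_RE.search(line)
            if match:
                yield match.group(1)

if __name__ == "__main__":
    for query in extract_queries(AUDIT_LOG):
        print(query)
```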
Once the data has been collected, it's important to be able to parse the interaction records into sequences of operations that have been applied to data, along with their parameters or arguments.
- To parse Splunk queries, download splparser (a usage sketch follows this list)
- For a collection of utilities that make it easier to organize and analyze Splunk queries, download queryutils
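As a quick illustration, here is a minimal sketch of parsing a single query with splparser. It assumes splparser exposes a top-level parse function returning a parse tree; see its README for the exact API.

```python
import splparser

# Parse one SPL query into a tree of transformations and their arguments.
query = 'search "error" | stats count by status'
parsetree = splparser.parse(query)

# Printing the tree shows the sequence of stages applied to the data,
# which is the unit of analysis in the later steps.
print(parsetree)
```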
Now that you have collected the data and can parse it into timestamped sequences of transformations, the next step is to get an overview of the data set. In this phase, the goal is to summarize its basic properties; a sketch of how some of these summaries might be computed appears below. Some questions we might ask in this phase include:
- How many users are there? How are user arrivals distributed? What is the user interarrival rate?
- How many queries or analysis sessions are there? How are query interarrivals distributed? What is the query interarrival rate?
- How is query length or analysis session length distributed?
- Do queries or analysis sessions have other attributes, such as type, or data source analyzed? How many are there of each type?
- How many different types of actions are there? How are the frequencies of these actions statistically distributed?
- How many actions are there per user, query, or analysis session?
Relevant paper data:
- Table 1
- Table 2
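To make this step concrete, here is a minimal pandas sketch of computing a few of these summary statistics. It assumes the extracted queries have been loaded into a hypothetical CSV file with columns timestamp, user, and query:

```python
import pandas as pd

# Hypothetical input: one row per query with columns timestamp, user, query.
df = pd.read_csv("queries.csv", parse_dates=["timestamp"])

print("users:", df["user"].nunique())
print("queries:", len(df))

# Query interarrival times, in seconds between consecutive queries.
interarrivals = df["timestamp"].sort_values().diff().dt.total_seconds().dropna()
print(interarrivals.describe())

# Query length, roughly measured as the number of pipeline stages.
stages_per_query = df["query"].str.count(r"\|") + 1
print(stages_per_query.describe())

# Queries per user.
print(df.groupby("user").size().describe())
```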
- How are the individual data transformations statistically distributed?
- What are the most common transformations users perform? What are the least common? (A rough way to count these is sketched below.)
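As a rough illustration of the last two questions, the sketch below counts transformation frequencies by naively splitting each query on the pipe character and taking the first token of each stage as the transformation name; a splparser parse tree gives a more reliable decomposition.

```python
from collections import Counter

def stage_names(query):
    """Yield the (approximate) transformation name of each pipeline stage."""
    for stage in query.split("|"):  # note: quoted pipes will fool this split
        tokens = stage.strip().split()
        if tokens:
            yield tokens[0].lower()

queries = [
    'search "error" | stats count by status',
    'search "error" | head 10',
]  # in practice, the full set of extracted queries

counts = Counter()
for query in queries:
    counts.update(stage_names(query))

print("most common:", counts.most_common(10))
print("least common:", counts.most_common()[-10:])
```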