Data analysis interaction record analysis methodology
This document describes a high-level methodology for analyzing interaction records from data analysis systems.
While we believe this methodology is general enough to use with interaction records from many data analysis systems, we will use interaction records from Splunk as our primary concrete example.
This methodology was used with Splunk queries to develop the results in a paper published in LISA 2014:
Analyzing Log Analysis: An Empirical Study of User Log Mining. S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz. Large Installation System Administration Conference (LISA). November 2014.
The methodology is as follows:
- Gathering the data
- Parsing, organizing, and storing the data
- Getting an overview
- Imposing high-level categorization to understand tasks
- Understanding workflows via Markov diagrams and common subsequences
- Creating second-level categorization via clustering to understand sub-tasks
Step one is to gather interaction records from a data analysis system. In the context of data analysis, interaction records usually represent sequences of transformations users apply to their data, as well as actions to modify program state. Typically, the easiest place to find interaction records is in one of the system logs.
In Splunk, users transform their data via Splunk Processing Language (SPL) queries. An example Splunk query is:

```
search "error" | stats count by status | lookup statuscodes status OUTPUT statusdesc
```

This query retrieves events containing the string "error", counts those events by their status code, and then appends a human-readable description of each status code from a lookup table.
These queries are logged in Splunk's audit.log. For more information on how to obtain the queries from your own Splunk installation, see the instructions in the README for the queryutils package.
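For illustration, here is a minimal Python sketch of pulling raw query strings out of an audit log. Both the file path and the `search='...'` pattern are assumptions about a typical installation and log format; see the queryutils README for a fuller approach.

```python
import re

# Hypothetical path; adjust to your installation
# (typically $SPLUNK_HOME/var/log/splunk/audit.log).
AUDIT_LOG = "/opt/splunk/var/log/splunk/audit.log"

# Assumption: audit events embed the query text as search='...'.
SEARCH_RE = re.compile(r"search='([^']*)'")

def extract_queries(path):
    """Yield the raw SPL query string from each audit event that has one."""
    with open(path) as f:
        for line in f:
            match = SEARCH_RE.search(line)
            if match:
                yield match.group(1)

if __name__ == "__main__":
    for query in extract_queries(AUDIT_LOG):
        print(query)
```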
Once the data has been collected, it's important to be able to parse the interaction records into sequences of operations that have been applied to data, along with their parameters or arguments.
- To parse Splunk queries, download splparser (a usage sketch follows this list)
- For a collection of utilities that make it easier to organize and analyze Splunk queries, download queryutils
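As a quick illustration, here is a minimal sketch of parsing a single query with splparser. It assumes splparser exposes a top-level parse function returning a parse tree; see its README for the exact API.

```python
import splparser

# Parse one SPL query into a tree of transformations and their arguments.
query = 'search "error" | stats count by status'
parsetree = splparser.parse(query)

# Printing the tree shows the sequence of stages applied to the data,
# which is the unit of analysis in the later steps.
print(parsetree)
```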
Now that you have collected the data and can parse it into timestamped sequences of transformations, the next step is to get an overview of the data set. In this phase, the goal is to summarize its basic properties; a sketch of how some of these summaries might be computed appears below. Some questions we might ask in this phase include:
- How many users are there? How are user arrivals distributed? What is the user interarrival rate?
- How many queries or analysis sessions are there? How are query interarrivals distributed? What is the query interarrival rate?
- How is query length or analysis session length distributed?
- Do queries or analysis sessions have other attributes, such as type, or data source analyzed? How many are there of each type?
- How many different types of actions are there? How are the frequencies of these actions statistically distributed?
- How many actions are there per user, query, or analysis session?
Relevant paper data:
- Table 1
- Table 2
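To make this step concrete, here is a minimal pandas sketch of computing a few of these summary statistics. It assumes the extracted queries have been loaded into a hypothetical CSV file with columns timestamp, user, and query:

```python
import pandas as pd

# Hypothetical input: one row per query with columns timestamp, user, query.
df = pd.read_csv("queries.csv", parse_dates=["timestamp"])

print("users:", df["user"].nunique())
print("queries:", len(df))

# Query interarrival times, in seconds between consecutive queries.
interarrivals = df["timestamp"].sort_values().diff().dt.total_seconds().dropna()
print(interarrivals.describe())

# Query length, roughly measured as the number of pipeline stages.
stages_per_query = df["query"].str.count(r"\|") + 1
print(stages_per_query.describe())

# Queries per user.
print(df.groupby("user").size().describe())
```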
- How are the individual data transformations statistically distributed?
- What are the most common transformations users perform? What are the least common? (A rough way to count these is sketched below.)
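As a rough illustration of the last two questions, the sketch below counts transformation frequencies by naively splitting each query on the pipe character and taking the first token of each stage as the transformation name; a splparser parse tree gives a more reliable decomposition.

```python
from collections import Counter

def stage_names(query):
    """Yield the (approximate) transformation name of each pipeline stage."""
    for stage in query.split("|"):  # note: quoted pipes will fool this split
        tokens = stage.strip().split()
        if tokens:
            yield tokens[0].lower()

queries = [
    'search "error" | stats count by status',
    'search "error" | head 10',
]  # in practice, the full set of extracted queries

counts = Counter()
for query in queries:
    counts.update(stage_names(query))

print("most common:", counts.most_common(10))
print("least common:", counts.most_common()[-10:])
```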