A Framework For Analyzing Database Access Patterns Within A Company

We have developed a framework for analyzing database access patterns within a company, based on the SQL queries its employees submit. Our goal is to uncover previously unknown similarities between analysts. These findings can lead to new collaborations within the company and give management a tool to help maximize workforce efficiency.

Components and Features:

  • Parser - takes a dataset of SQL queries and builds a global alphabet from the tokens found in the queries.
  • Encoding process - represents each SQL query as an ordered tree and labels each node with its entry in the global alphabet.
  • Fast Subtree Kernel - a graph kernel we created to calculate the pairwise similarity of ordered trees.
  • User Similarity Algorithm - takes the collections of queries submitted by a pair of users and uses the Fast Subtree Kernel to compute the overall similarity between the two users.
  • Visualization - a series of visualizations of our data, created using open-source visualization tools.
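
To make the flow between the first four components concrete, here is a minimal Python sketch. It is not the project's code. The tokenizer regex, the clause-based tree shape, the `<ROOT>` label, and every function name are assumptions made for illustration. The kernel shown is a plain count of matching complete subtrees, normalized to the range 0 to 1, standing in for the Fast Subtree Kernel, whose exact definition is given in the paper. Averaging pairwise query scores is likewise only one plausible way to combine them into a user-level similarity.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical choices: which keywords open a new clause, and how to split tokens.
CLAUSES = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "HAVING", "LIMIT"}
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9.]*|\d+|[^\sA-Za-z_0-9]")


def tokenize(query):
    """Split a SQL string into upper-cased tokens."""
    return [tok.upper() for tok in TOKEN_RE.findall(query)]


def build_alphabet(queries):
    """Parser step: map every distinct token in the dataset to an integer id."""
    alphabet = {"<ROOT>": 0}  # reserved label for the tree root (assumption)
    for query in queries:
        for tok in tokenize(query):
            alphabet.setdefault(tok, len(alphabet))
    return alphabet


def encode(query, alphabet):
    """Encoding step: build an ordered tree as nested (label, children) tuples.

    Simplified shape: root -> clause nodes -> the tokens of each clause, in order.
    """
    clauses = []
    for tok in tokenize(query):
        label = alphabet[tok]
        if tok in CLAUSES or not clauses:
            clauses.append((label, []))           # start a new clause node
        else:
            clauses[-1][1].append((label, ()))    # leaf under the current clause
    return (alphabet["<ROOT>"], tuple((lab, tuple(kids)) for lab, kids in clauses))


def subtrees(tree):
    """Yield every complete subtree of an ordered tree."""
    yield tree
    for child in tree[1]:
        yield from subtrees(child)


def kernel(t1, t2):
    """Count pairs of identical complete subtrees shared by the two trees."""
    c1, c2 = Counter(subtrees(t1)), Counter(subtrees(t2))
    return sum(n * c2[s] for s, n in c1.items())


def normalized_kernel(t1, t2):
    """Scale the kernel so that a tree compared with itself scores 1.0."""
    return kernel(t1, t2) / math.sqrt(kernel(t1, t1) * kernel(t2, t2))


def user_similarity(queries_a, queries_b, alphabet):
    """Average the pairwise query scores; both query lists must be non-empty."""
    trees_a = [encode(q, alphabet) for q in queries_a]
    trees_b = [encode(q, alphabet) for q in queries_b]
    scores = [normalized_kernel(a, b) for a, b in product(trees_a, trees_b)]
    return sum(scores) / len(scores)


# Made-up example queries for three hypothetical users.
alice = ["SELECT name FROM accounts WHERE balance > 100"]
bob = ["select name from accounts where balance > 500"]
carol = ["SELECT id FROM trades"]

alphabet = build_alphabet(alice + bob + carol)
sim_alice_bob = user_similarity(alice, bob, alphabet)
sim_alice_carol = user_similarity(alice, carol, alphabet)
```

Because the trees are nested tuples, each subtree is hashable and can be counted directly, which keeps the kernel to one pass over each tree plus a dictionary lookup per distinct subtree. In this toy data, the two queries that differ only in a literal score higher than the two that touch different tables.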

Pipeline

Contributors

Related Research

Any work that uses this code should cite the following paper:

R. Searles, W. Wang, L. Xu, W. Killian, J. Cavazos. "The Similarity Graph: Analyzing Database Access Patterns Within A Company."

Abstract:
A company’s database can reveal a lot about its employees, and that information can be used to manage the workforce, assign tasks, and create collaborations more effectively. This paper proposes a framework to build a similarity graph between analysts within a company using the SQL queries they write. We show how we can represent SQL queries in graph form, and we propose a method that can be used for calculating pairwise similarity on these graphs. In order to guarantee that this method will scale to large systems, we accelerated our algorithm using OpenMP. The results we obtained revealed many behavioral similarities between employees from different business units within the company, and we achieved an almost linear speedup, with a minimum of 75% efficiency across all cores.

Individuals

Acknowledgements

This work was funded in part by JPMorgan Chase & Co.

© 2012-2014 University of Delaware