Peter Bailis Research Overview

A New DAWN for Data-Intensive Systems

Sustained decreases in the cost of data storage have enabled users to capture increasing volumes of data. In turn, this data is enabling increasingly powerful analytics-based decision-making that continues to transform every aspect of society, from transportation to housing and medicine. However, many of today's most valuable analytics -- e.g., for prediction, recommendation, and root cause analyses -- are powered by expensive, bespoke, and often ad hoc data engineering and machine learning efforts and are therefore restricted to the best-trained, best-funded organizations.

The success of the relational database offers an existence proof of an alternative approach, in the form of reusable, modular, and high-performance data-driven tools that make powerful analytics accessible to and cost-effective for a broad spectrum of users. However, due to the scale, heterogeneity, and statistical nature of the core techniques powering many of today's data analytics tasks, relational databases alone are no longer sufficient to facilitate democratized access to the most advanced and potentially disruptive analytics functionality.

Research Interests and Goals

My research therefore focuses on the design and implementation of this next generation of data-intensive systems. My goal is to develop abstractions, composable and reusable interfaces, and efficient implementations of advanced analytics in the form of useful, data-intensive software systems. My research group (part of the five-year Stanford DAWN project) is pursuing several data-intensive systems projects at the intersection of large-scale data and machine learning:

MacroBase: Prioritizing Attention in Large-Scale Data

MacroBase is an analytics engine that prioritizes attention in high-volume event-oriented datasets such as application telemetry, user behavior logs, and diagnostic reports. Conventional OLAP analytics tools require users to manually specify dimensions and attributes of interest within these data sources, which can contain millions of populations of potential interest. In contrast, MacroBase leverages both contextual information about event sources (e.g., user ID, device ID, application version) and the scale of data available in many production deployments to automatically identify, aggregate, and highlight populations of interest. This reduces the cognitive burden of common and cumbersome tasks such as application health assessment and root cause analysis. At its core, MacroBase performs combinatorial feature selection via hypothesis testing and is powered by both new query optimization techniques and new sketching algorithms. The system currently runs in production in several deployments spanning online services, event analytics, and manufacturing.
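
To make the core idea concrete, the following is a minimal Python sketch of risk-ratio-based scoring of attribute combinations, in the spirit of MacroBase's explanation operator. This is a simplified illustration, not MacroBase's actual implementation; the function names (`explain`, `risk_ratio`) and parameters are hypothetical, and the real system adds sketching, pruning, and streaming execution on top of this basic scheme.

```python
from collections import Counter
from itertools import combinations

def risk_ratio(outlier_count, outlier_total, inlier_count, inlier_total):
    """Relative risk of an attribute combination: how much more often it
    appears among outlier events than among inlier events."""
    outlier_rate = outlier_count / outlier_total
    inlier_rate = max(inlier_count / inlier_total, 1e-9)  # avoid divide-by-zero
    return outlier_rate / inlier_rate

def explain(outliers, inliers, max_order=2, min_ratio=3.0):
    """Score attribute combinations (up to max_order attributes) by risk
    ratio, keeping those over-represented among outlier events."""
    def combos(event):
        attrs = sorted(event.items())
        for k in range(1, max_order + 1):
            yield from combinations(attrs, k)

    out_counts = Counter(c for e in outliers for c in combos(e))
    in_counts = Counter(c for e in inliers for c in combos(e))

    results = []
    for combo, oc in out_counts.items():
        ratio = risk_ratio(oc, len(outliers), in_counts[combo], len(inliers))
        if ratio >= min_ratio:
            results.append((combo, ratio))
    return sorted(results, key=lambda r: -r[1])
```

For example, given outlier events dominated by one device type, `explain` surfaces that attribute combination with a high risk ratio, which is the kind of population a human operator would otherwise have to find by manual slicing.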

User-Friendly and Efficient Video Analytics at Scale

Video is one of the most rapidly growing sources of data, and the cost to acquire video is at an all-time low. In parallel, the techniques best suited to process this video automatically -- such as deep networks for object detection -- continue to improve in accuracy each year. The resulting bottleneck is computation: state-of-the-art object detectors require a dedicated GPU to run in real time, leading to a differential of more than three orders of magnitude between the cost of data acquisition and the cost of data processing. To bridge this gap, we are developing a new video-based query engine called BlazeIt that can both answer video-based queries in a high-level, user-friendly language and automatically optimize these queries to reduce inference time and expense. BlazeIt is powered by a combination of query optimization techniques that exploit temporal and spatial locality and that perform model search for data-dependent accelerated inference.
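
One way to see how such optimization reduces inference cost is a model cascade: a cheap specialized model screens every frame, and the expensive reference detector runs only where the cheap model is uncertain. The sketch below is a hedged illustration of this general technique, not BlazeIt's actual code; `cascaded_query` and its thresholds are hypothetical, and the models are stand-ins for a lightweight specialized network and a full object detector.

```python
def cascaded_query(frames, cheap_model, expensive_model, lo=0.1, hi=0.9):
    """Return indices of frames containing the target object.

    cheap_model(frame)     -> probability the target object is present
    expensive_model(frame) -> reference-quality boolean answer

    The expensive detector is invoked only on frames where the cheap
    model's confidence falls between the two thresholds."""
    hits = []
    for i, frame in enumerate(frames):
        p = cheap_model(frame)
        if p >= hi:
            hits.append(i)              # confident positive: accept cheaply
        elif p > lo:
            if expensive_model(frame):  # uncertain: fall back to full model
                hits.append(i)
        # p <= lo: confident negative, skip the expensive model entirely
    return hits
```

When the cheap model is confident on most frames, the expensive detector runs on only a small fraction of the video, which is the source of the large end-to-end speedups such systems target.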

Additional Recent Work

Beyond these systems, our recent work studies a range of functionality across the modern analytics stack, including: dimensionality reduction of highly structured time series; improved visualization for monitoring and dashboards; runtime performance analysis of common deep learning tasks; fast sketch-based factorization of sparse models; frugal yet mergeable quantile estimation for data exploration; unsupervised density estimation for efficient classification; large-scale LSH for seismological analysis; and automatic optimization of model serving strategies.

Looking Forward: A Classic Toolbox Shines

In performing the above research, we have found that the design of post-relational data-intensive systems does not necessitate an abandonment of classical data-intensive systems techniques such as declarative interfaces or query planning. Rather, this new class of workloads presents new opportunities for applying these techniques to a broad set of statistically informed problems. For example, we have found that predicate pushdown, cost-based optimization, and cascaded execution shine when applied in many statistical contexts. Just as relational workloads stimulated decades of research into end-to-end query optimization, systems design, and hardware-efficient execution, I believe this next wave of systems holds similar -- and perhaps even greater -- promise for the systems community.
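
As a small illustration of predicate pushdown in a statistical context, consider pushing a cheap relational predicate below an expensive model invocation, so inference runs only on rows that can still satisfy the query. This is a hypothetical sketch (the `run_query` helper and its parameters are not from any of the systems above):

```python
def run_query(rows, predicate, model, threshold=0.5):
    """Answer a query of the form: rows matching `predicate` whose model
    score is at least `threshold`, evaluating the cheap predicate first."""
    candidates = [r for r in rows if predicate(r)]  # pushdown: cheap filter first
    return [r for r in candidates if model(r) >= threshold]
```

The classical optimization is unchanged; what is new is that the expensive operator being deferred is a statistical model rather than a join, so the savings scale with per-row inference cost.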

Users and Collaborators

As a systems researcher interested in emerging, often poorly defined data-intensive workloads and applications, I consider engaging real users critical to my work. All of our research is informed and enabled by use cases and partners on campus and beyond, who contribute feedback, financial support, data, and engineering resources. In addition to these partners, I am fortunate to work with an amazing set of students and research collaborators. Moreover, to facilitate reproducibility, reuse, feedback, and impact, all of our software is publicly available as open source.

January 2018