David M. Andrzejewski

Applied machine learning and (un)natural language processing.


Professional Interests

(also see Publications for writings and talks on these topics)

"If you do not work on an important problem, it's unlikely you'll do important work."
- Richard Hamming

Problem domain: understanding software behavior

Cloud-based software systems should be considered among the most fascinating artifacts of human civilization. They provide the commercial and social infrastructure of modern life, with each of us probably interacting with many of them daily. We (quite reasonably) do so without any consideration of the dizzying complexity of the underlying software: vast interconnected information, control, and communication flows lurking behind the innocuous app icon on our mobile phone screen, continuously evolving to compete in the marketplace.

But somebody must pilot the Ship of Theseus. While software may ostensibly exist entirely in the pure realm of ones and zeros stored in some source code repository, the smooth and safe operation of a live production service is utterly dependent on the continuous attention of human experts working within carefully developed teams and processes.

However, these professionals can neither touch the CPUs nor smell the error messages. To operate our technology, we must use technology. Organizations achieve the extrasensory perception required to infer the state of their systems by instrumenting components to emit terabyte-scale streams of heterogeneous events and numerical measurements. These streams are in turn consumed by a prosthetic nervous system built to integrate and process the signals, transducing them into alerts, summaries, and data visualizations suitable for human consumption.

As the complexity and scale of these systems continue to grow, a bottleneck is emerging at the interface between human operators and this keyhole view into the machine world. It is therefore becoming necessary to investigate how we might push (some of) the higher-level "intelligence" across the divide, using automated methods to provide human users with more relevant information, richer context, and more powerful tools for exploring, formulating, and testing hypotheses about observed system behaviors. Machine learning and data mining technologies are natural candidates for this task.

Relevant tools and techniques

Machine learning

One might expect software system behavior and its associated telemetry to be perfectly well-ordered and predictable. Setting aside the Entscheidungsproblem, the complexity, dynamism, and human-driven nature of these systems mean that, in practice, much of the data is actually noisy or chaotic. This makes a promising environment for machine learning and data mining: we have some domain knowledge about the underlying structure and mechanics of the data-generating process, but the observed signals are contaminated by randomness and noise. Some example families of relevant machine learning approaches and problem formulations here are:

  • time-series modeling
  • clustering and dimensionality reduction
  • partial or implicit supervision
  • structure extraction/induction
  • exploitation of graphical structure
  • anomaly or outlier detection
  • classifiers and their explanations
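To make one of the formulations above concrete, here is a minimal sketch of anomaly detection on a univariate telemetry stream using a rolling z-score. The window size and threshold are illustrative choices, not tuned recommendations, and the signal is synthetic.

```python
# Flag points that deviate sharply from their recent history -- a
# simple z-score formulation of anomaly detection for telemetry.
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A steady signal with one injected spike at index 15.
signal = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 10.0,
          9.9, 10.2, 10.0, 9.8, 10.1, 50.0, 10.0, 9.9, 10.2, 10.1]
print(rolling_zscore_anomalies(signal))  # the spike at index 15 is flagged
```

Real telemetry would of course demand more robust statistics (the spike itself pollutes subsequent windows here), but the structure of the problem is the same.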

Population modeling

Another interesting question is how to pool or combine data across different instances when estimating models. We could consider each entity (e.g., a host machine running some application) to be totally unique and estimate a model for each in complete isolation. Going the other way, we could naively estimate a single model covering all instances. The question of how to use metadata or domain knowledge to best interpolate between these extremes is a rich area for exploration, closely related to Bayesian hierarchical modeling and parameter tying in deep neural networks. Ideas from differential privacy may also be relevant in this context.
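A hedged sketch of that interpolation: estimate a per-entity mean that is shrunk toward the global (pooled) mean, in the spirit of hierarchical modeling. The precision-style weight and the `prior_strength` hyperparameter below are illustrative assumptions, not recommendations.

```python
# Partial pooling: interpolate between per-entity estimates (no pooling)
# and a single global estimate (complete pooling).
from statistics import mean

def partially_pooled_means(groups, prior_strength=5.0):
    """groups: dict mapping entity -> list of observations.
    Returns entity -> estimate shrunk toward the global mean; entities
    with more data are shrunk less."""
    all_obs = [x for obs in groups.values() for x in obs]
    global_mean = mean(all_obs)
    estimates = {}
    for entity, obs in groups.items():
        n = len(obs)
        w = n / (n + prior_strength)  # more data -> trust the entity more
        estimates[entity] = w * mean(obs) + (1 - w) * global_mean
    return estimates

hosts = {"host-a": [120.0] * 50,  # well-observed host
         "host-b": [300.0]}       # a single, possibly noisy observation
est = partially_pooled_means(hosts)
# host-a stays near its own mean; host-b is pulled toward the global mean.
```

The same trade-off shows up whether the "entities" are hosts, services, or customers, and metadata (hardware type, deployment group) suggests *which* entities should share statistical strength.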

Software reliability

The challenges of ensuring that software works as intended can easily exceed the nominal effort and cost of creating that software in the first place, especially as you continue to iterate. Beyond the standard best practices around testing code and instrumenting systems, there are exciting opportunities in this area around functional programming, static typing, and the monitoring and testing of complex data-dependent systems like data pipelines and machine learning models.
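One hedged illustration of testing data-dependent systems: wrapping a pipeline stage with lightweight invariant checks on its inputs and outputs, so bad batches fail fast instead of silently propagating. The field names, bounds, and `enrich` transform are invented for the example.

```python
# Minimal data-pipeline invariant checking: validate a batch of records
# before and after a (hypothetical) transformation stage.
def check_records(records, stage):
    """Fail fast if a batch violates basic invariants."""
    assert records, f"{stage}: empty batch"
    for r in records:
        assert "latency_ms" in r, f"{stage}: missing latency_ms in {r!r}"
        assert r["latency_ms"] >= 0, f"{stage}: negative latency in {r!r}"
    return records

def enrich(records):
    # Hypothetical transform: tag each record as slow or fast.
    return [{**r, "slow": r["latency_ms"] > 500} for r in records]

batch = [{"latency_ms": 120}, {"latency_ms": 900}]
out = check_records(enrich(check_records(batch, "input")), "output")
```

The same pattern scales up to schema validators and statistical checks (distribution drift, null rates) guarding the stages of a production pipeline or a model-training job.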

Approximation algorithms

Resource limitations are an inescapable reality of practical data analytics systems, but surprisingly often it is possible to dramatically expand the operating envelope by accepting some small probability of non-exact results. These techniques are especially appealing where your use case is insensitive to a small approximation error, or where this error is insignificant in comparison to sources of noise or distortion already present in your data.
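A classic instance of this trade is the Bloom filter, sketched minimally below: it answers set-membership queries in fixed memory, with no false negatives but a small, tunable false-positive probability. The sizes chosen here are illustrative, not tuned.

```python
# A tiny Bloom filter: approximate set membership in fixed memory.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for host in ["host-a", "host-b", "host-c"]:
    bf.add(host)
# Added items are always found; absent items are *usually*, but not
# always, rejected -- that small false-positive rate buys fixed memory.
```

Sketches in the same family (HyperLogLog for cardinality, count-min for frequencies, reservoir sampling) make the same bargain: bounded resources in exchange for probabilistic guarantees.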

Product development

How do teams build the right thing, the right way? These are hard problems in general, and can be even trickier on the frontier of novel technologies, applications, or data resources. The effective allocation of scarce effort and attention under uncertainty, coordinated across teams and time zones, is a "grand challenge" problem in its own right.

Prior work

Previously, I worked on partially supervised probabilistic modeling of grouped event count data with latent variables. Specifically, I focused on text mining applications where word count representations of documents are modeled with latent topic models, a class of techniques that exploit word co-occurrence patterns to recover human-meaningful "topics". Often these purely statistical topics are not well aligned with end users' ultimate modeling goals. This motivated my research into mechanisms by which user-provided side information or domain knowledge could help guide statistical topic recovery, and into how the learned topics could then be used in applications such as biomedical research or national security.