PhD Dissertation Defense: Rob Churchill
Title: “Modernizing Topic Models: Accounting for noise, time, sparsity, and domain knowledge”
Data has evolved rapidly since the inception of topic models over twenty years ago. The most popular topic models perform poorly on contemporary data sets such as social media and other short, noisy texts. As new types of data become available in larger volumes, we set out to modernize topic models so that they can handle the noisy, sparse setting of social media, model text over time, and incorporate users’ domain knowledge. In this dissertation, we formalize the concept of noise as it applies to topic modeling, and propose both topic models that deal with noise and preprocessing pipelines that reduce noise before any model is run. We present a survey of unsupervised topic models that traces their evolution from inception to the present day. We propose a new type of topic model, the topic-noise model, which jointly models topic and noise distributions, greatly increasing the quality of topics derived from social media data. We propose semi-supervised topic models that allow for user feedback and leverage a user’s domain knowledge. We propose a dynamic topic-noise model that enables the modeling of temporal text data in social media streams while accounting for noise. Together, these contributions advance the collective capabilities of topic models, and now topic-noise models, and the sharing of the models and code enables future research in the field.