PhD Dissertation Defense: Sajad Sotudeh
Title: “Effective Domain-Specific Text Summarization”
My research goal in this dissertation is to develop neural summarization methods that adapt better to domain-specific corpora, with a focus on four domains: clinical notes, scientific documents, query-focused summarization, and social media. This research is driven by a unifying vision: to push the boundaries of how neural summarization can be tailored to meet the unique demands of diverse domains, thereby enhancing the utility of summarized content across a broad spectrum of applications. To begin with, I propose an ontology-aware abstractive summarization system that incorporates significant clinical terms (i.e., domain knowledge) extracted from medical ontologies. Moreover, I propose a graph-based summarization system that models the relationships between clinical ontologies and incorporates this modeling into the summarizer in order to generate better summaries. These approaches not only highlight the importance of integrating domain-specific knowledge into clinical summarization but also serve as a conceptual bridge to my investigations in other knowledge-intensive areas, including the scientific domain.
Expanding the scope of my research, I explore the adaptation of summarization models to the scientific domain by incorporating a document’s domain-specific characteristics (e.g., sectional information). Part of this work focuses on summarization methods that produce so-called “long/extended” summaries (400-600 words) of scientific papers, since such summaries encompass more salient details. Additionally, I investigate generating multi-perspective scientific summaries (of regular length, 100 words on average) by proposing a topic-aware two-step summarizer, with the intuition that each perspective of a paper is discussed within specific topics and specific sets of the paper’s sentences. This exploration underscores my efforts to enhance the summarization of complex, information-dense documents by leveraging domain-specific features within the scientific domain.
I further explore the query-focused summarization domain, where a summary is tailored to a specific query posed about the source document. To cope with the challenge of accurately identifying salient source regions, I propose a contrastive-learning-based summarizer that distinguishes gold salient segments from the non-gold segments that typically attract the summarizer’s attention. The work in this domain highlights my focus on developing summarization techniques that are responsive not only to the content of the documents but also to the specific informational needs expressed in the query, thereby demonstrating the adaptability of my research across different summarization challenges.
Finally, I focus on the social media domain, generating “extremely short” yet “informative” summaries of users’ posts. To facilitate social media summarization research, I introduce three summarization datasets collected from Reddit communities: two of them (i.e., TLDR9+ and TLDR-HQ) cover the general domain, and the third (i.e., MENTSUM) covers the mental health domain. I then propose a curriculum-guided abstractive summarization system for summarizing mental health posts from social media.
Committee members:
Nazli Goharian (adviser)
Ophir Frieder
Nathan Schneider
Justin Thaler
Arman Cohan (Yale University)