CS Seminar (Data Centric): James Mayfield (Johns Hopkins University)
Recent Advances in Cross-Language Information Retrieval
Research in Cross-Language Information Retrieval (CLIR), in which documents in one language are retrieved using queries in a different language, has seen a revival over the past five years due to funding programs such as IARPA MATERIAL, evaluations such as TREC NeuCLIR, and technical advances such as transformers, multilingual language models, and generative large language models. This talk will describe new CLIR paradigms such as ColBERT-X and SPLADE-X. Training these algorithms has largely relied on machine translation (MT) of the English MS MARCO dataset. Using MT output to train a CLIR system is problematic in a number of ways. The talk will also describe how generative large language models such as GPT-3 are being used to produce CLIR training collections that solve some of these problems.
Bio: Dr. James Mayfield is Principal Computer Scientist at the Johns Hopkins University Applied Physics Laboratory and the Johns Hopkins University Human Language Technology Center of Excellence, and holds an Associate Research Professorship at the Johns Hopkins Computer Science department. Dr. Mayfield, who has authored or co-authored over 100 professional communications, has been the PI or co-PI on programs sponsored by DoD, NASA, DARPA, ARDA and MIPS. Much of his research has examined language-neutral techniques for multilingual and translingual text processing. His current research focuses on cross-language information retrieval. In his spare time he studies card magic.