AI Seminar

Practical Natural Language Processing for Minority Languages

Ben King

Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this
thesis, we present methods for automatically building NLP tools for minority languages with
minimal need for human annotation in these languages. We start with language identification,
the problem of recognizing a text's language in the absence of an explicit label. We
specifically focus on word-level language identification, an understudied variant that is necessary
for processing Web text, and develop highly accurate machine learning methods for this
problem. From there we move on to the problems of part-of-speech (POS) tagging and dependency
parsing. For both problems, we take the approach of adapting tools built for
English and other well-supported languages for use on a minority language that lacks
large annotated corpora. By projecting annotations from many different languages across parallel text, we are able to create accurate tools in the low-resource target language.

Sponsored by

Professor Dragomir R. Radev