Multilingual Complex Named Entity Recognition with Data and Context Augmentation
Extend a pretrained language model (XLM-RoBERTa)
Data augmentation with Google Translate API (MulDA)
Context augmentation with Elasticsearch on Wikipedia (KB-NER)
Chahyon Ku, London Lowmanstone IV, Asal Shavandi, Josh Spitzer-Resnick
University of Minnesota, Twin Cities
Proposal Paper
Proposal Slides
Midterm Paper
Final Paper
Final Slides
Code
Extended XLM-RoBERTa + Conditional Random Field (CRF) baseline with data and context augmentation
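To make the role of the CRF layer concrete, here is a minimal sketch of Viterbi decoding over per-token tag scores. It is an illustration under stated assumptions, not the project's implementation: the tag set, the emission scores (which would come from the XLM-RoBERTa encoder), and the transition scores are all made-up toy values.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for one sentence.

    emissions[t][j]:   score of tag j at token t (e.g. produced by the encoder).
    transitions[i][j]: score of moving from tag i to tag j (CRF parameters).
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this token.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = new_score
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy values (assumptions for illustration only).
TAGS = ["O", "B-PER", "I-PER"]
FORBIDDEN = -1e4
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O: I-PER may not follow O
    [0.0, 0.0, 0.0],        # from B-PER
    [0.0, 0.0, 0.0],        # from I-PER
]
emissions = [
    [2.0, 0.5, 1.0],  # token 0
    [0.1, 1.0, 1.5],  # token 1: I-PER has the highest local score
    [0.1, 0.2, 2.0],  # token 2
]
decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)
```

Picking the best tag for each token independently would label token 1 as I-PER directly after O, which is not a valid BIO sequence. The transition scores let the decoder prefer B-PER there instead.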
Data Augmentation
Translation-based data augmentation (MulDA) using the Google Cloud API, applied to the multilingual dataset.
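The idea of translating labeled sentences while keeping entity labels attached can be sketched as follows. This is a simplified illustration in the spirit of MulDA, not the project's pipeline: entity spans are swapped for placeholders, the sentence is translated, and each entity is translated separately and put back with its label. The `translate` argument, the `__ENTn__` placeholder format, and the `toy_translate` lookup table are hypothetical stand-ins for a real translation service.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, BIO tag)


def augment(tokens: List[Token], translate: Callable[[str], str]) -> List[Token]:
    """Translate a BIO-tagged sentence while preserving entity labels."""
    # 1. Replace each entity span with a placeholder such as __ENT0__.
    words: List[str] = []
    entities: List[Tuple[str, str]] = []  # (surface text, label)
    in_entity = False
    for word, tag in tokens:
        if tag.startswith("B-"):
            words.append(f"__ENT{len(entities)}__")
            entities.append((word, tag[2:]))
            in_entity = True
        elif tag.startswith("I-") and in_entity:
            text, label = entities[-1]
            entities[-1] = (text + " " + word, label)
        else:
            words.append(word)
            in_entity = False

    # 2. Translate the sentence with placeholders left in place.
    translated = translate(" ".join(words))

    # 3. Translate each entity on its own and re-insert it with BIO tags.
    output: List[Token] = []
    for word in translated.split():
        if word.startswith("__ENT") and word.endswith("__"):
            text, label = entities[int(word[5:-2])]
            for k, piece in enumerate(translate(text).split()):
                output.append((piece, ("B-" if k == 0 else "I-") + label))
        else:
            output.append((word, "O"))
    return output


# Hypothetical lookup table standing in for a translation API call.
TOY_TABLE = {
    "__ENT0__ lives in __ENT1__": "__ENT0__ vive en __ENT1__",
    "London": "Londres",
}


def toy_translate(text: str) -> str:
    # Unknown strings are returned unchanged.
    return TOY_TABLE.get(text, text)


source = [("Marie", "B-PER"), ("Curie", "I-PER"),
          ("lives", "O"), ("in", "O"), ("London", "B-LOC")]
augmented = augment(source, toy_translate)
print(augmented)
```

A real translation service may reorder, alter, or drop placeholders, so a practical version would need to check that every placeholder survives translation and discard sentences where one does not.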
Context Augmentation
Following the procedure of KB-NER [1], we index Wikipedia articles in 100 languages with Elasticsearch and use the index to augment the training data with retrieved context.
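A rough sketch of how retrieved context can be attached to an input sentence is shown below. It rests on several assumptions: `retrieve` is a toy word-overlap ranker standing in for an Elasticsearch query, the three documents are invented, and `</s>` is used as the separator because it is XLM-RoBERTa's separator token. The boolean mask marks which tokens belong to the original sentence, since only those carry NER labels.

```python
from typing import List, Tuple


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy stand-in for a search index: rank documents by word overlap."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc)
              for doc in documents]
    scored = [item for item in scored if item[0] > 0]  # drop non-matching docs
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_input(sentence: str, contexts: List[str],
                sep: str = "</s>") -> Tuple[List[str], List[bool]]:
    """Concatenate the sentence with its contexts, separated by `sep`.

    Returns the token list and a mask that is True only for tokens of the
    original sentence.
    """
    tokens = sentence.split()
    is_original = [True] * len(tokens)
    for context in contexts:
        context_tokens = [sep] + context.split()
        tokens += context_tokens
        is_original += [False] * len(context_tokens)
    return tokens, is_original


# Invented documents standing in for indexed Wikipedia paragraphs.
documents = [
    "Dune is a 1965 novel by Frank Herbert",
    "Minnesota is a state in the United States",
    "Herbert wrote several sequels",
]
sentence = "frank herbert wrote dune"
contexts = retrieve(sentence, documents)
tokens, is_original = build_input(sentence, contexts)
print(tokens)
print(is_original)
```

Keeping the mask alongside the tokens makes it straightforward to compute the loss and predictions on the original sentence only, while the encoder still attends to the retrieved text.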