Multilingual Complex Named Entity Recognition with Data and Context Augmentation

Extend pretrained language model (XLM-RoBERTa)
Data augmentation with Google Translate API (MulDA)
Context augmentation with Elasticsearch on Wikipedia (KB-NER)

Chahyon Ku, London Lowmanstone IV, Asal Shavandi, Josh Spitzer-Resnick
University of Minnesota, Twin Cities

Proposal Paper
Proposal Slides
Midterm Paper
Final Paper
Final Slides
Code

We extend an XLM-RoBERTa + Conditional Random Field (CRF) baseline with data and context augmentation.
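To make the role of the CRF layer concrete, here is a minimal pure-Python sketch of Viterbi decoding, the standard way a linear-chain CRF picks the best tag sequence. It is an illustration, not the project's code: `emissions` stands in for the per-token tag scores an encoder such as XLM-RoBERTa would produce, `transitions` for learned tag-to-tag scores, and the toy numbers and three-tag set are made up. It assumes at least one token and omits start and end transition scores.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][k]   : score of tag k at position t (assumed encoder output)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at position t.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy example with tags 0 = O, 1 = B, 2 = I. The O -> I transition is heavily
# penalized, so the decoder prefers B at position 1 even though the emission
# score for I is slightly higher there.
toy_emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5]]
toy_transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path = viterbi_decode(toy_emissions, toy_transitions)  # [0, 1]
```

The point of the CRF on top of the encoder is visible in the toy example: tagging each token independently would yield the invalid sequence O, I, while joint decoding yields O, B.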

Data Augmentation

Translation-based data augmentation (MulDA) with the Google Cloud API, making use of the multilingual dataset.
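One way to keep entity labels aligned through translation, in the spirit of MulDA, is to replace each entity with a placeholder, translate the masked sentence, then translate the entities separately and put them back. The sketch below illustrates that idea only; whether the project follows exactly these steps is not stated here. The `translate` argument is a hypothetical callable standing in for a real translation service, and the placeholder format `__ENT0__` and the toy word-by-word translator are assumptions made so the example runs offline.

```python
from typing import Callable, List, Tuple

Span = Tuple[str, str]  # (entity text, entity label)


def mask_entities(tokens: List[str], tags: List[str]) -> Tuple[str, List[Span]]:
    """Replace each BIO entity span with a numbered placeholder token."""
    masked: List[str] = []
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + label:
                j += 1
            masked.append(f"__ENT{len(spans)}__")
            spans.append((" ".join(tokens[i:j]), label))
            i = j
        else:
            masked.append(tokens[i])
            i += 1
    return " ".join(masked), spans


def translate_labeled(tokens: List[str], tags: List[str],
                      translate: Callable[[str], str]
                      ) -> Tuple[List[str], List[str]]:
    """Translate a BIO-tagged sentence while carrying the labels across."""
    masked, spans = mask_entities(tokens, tags)
    out_tokens: List[str] = []
    out_tags: List[str] = []
    for piece in translate(masked).split():
        index = piece[5:-2]
        is_placeholder = (piece.startswith("__ENT") and piece.endswith("__")
                          and index.isdigit() and int(index) < len(spans))
        if is_placeholder:
            text, label = spans[int(index)]
            entity_tokens = translate(text).split()
            if not entity_tokens:
                continue  # translator returned nothing for this entity
            out_tokens.extend(entity_tokens)
            out_tags.extend(["B-" + label]
                            + ["I-" + label] * (len(entity_tokens) - 1))
        else:
            out_tokens.append(piece)
            out_tags.append("O")
    return out_tokens, out_tags


# Toy word-by-word "translator" (hypothetical; a real system would call a
# translation API here).
TOY_DICTIONARY = {"lives": "vive", "in": "en", "Paris": "París"}


def toy_translate(text: str) -> str:
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split())


new_tokens, new_tags = translate_labeled(
    ["Ada", "Lovelace", "lives", "in", "Paris"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
    toy_translate,
)
# new_tokens: ["Ada", "Lovelace", "vive", "en", "París"]
# new_tags:   ["B-PER", "I-PER", "O", "O", "B-LOC"]
```

A real translator may drop or alter placeholders; this sketch simply treats anything that does not match the placeholder pattern as an ordinary untagged token, so augmented sentences should be filtered for missing entities before use.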

Context Augmentation

Following the procedure of KB-NER [1], we index Wikipedia articles in 100 languages with Elasticsearch and use the index to augment the training data with context information.
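The retrieval step can be sketched as three small pieces: build a search request from the input sentence, pull the matching passages out of the response, and append them to the sentence before it goes to the encoder. The sketch below is a hedged illustration, not the project's pipeline. The `match` query and the `hits.hits[]._source` response layout are standard Elasticsearch, but the field name `text`, the `top_k` value, the word budget, and the choice of `</s>` (XLM-RoBERTa's separator token) as the joiner are assumptions. No server is contacted; the response is hand-written.

```python
from typing import Any, Dict, List


def build_search_body(sentence: str, top_k: int = 5,
                      field: str = "text") -> Dict[str, Any]:
    """Full-text match query using the whole sentence as the query string."""
    return {"size": top_k, "query": {"match": {field: sentence}}}


def extract_contexts(response: Dict[str, Any], field: str = "text") -> List[str]:
    """Collect the stored passage text from each hit, in ranked order."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_source"][field] for hit in hits if field in hit.get("_source", {})]


def augment_with_context(sentence: str, contexts: List[str],
                         sep_token: str = "</s>",
                         max_context_words: int = 100) -> str:
    """Append retrieved passages to the sentence, up to a word budget."""
    parts = [sentence]
    budget = max_context_words
    for context in contexts:
        words = context.split()[:budget]
        if not words:
            break  # budget exhausted (or empty passage)
        parts.append(" ".join(words))
        budget -= len(words)
    return f" {sep_token} ".join(parts)


# Hand-written response in the shape Elasticsearch returns for a search.
fake_response = {
    "hits": {
        "hits": [
            {"_score": 9.1, "_source": {"text": "Paris is the capital of France."}},
            {"_score": 4.2, "_source": {"text": "Paris Hilton is an American media personality."}},
        ]
    }
}

body = build_search_body("She moved to Paris", top_k=2)
augmented = augment_with_context("She moved to Paris",
                                 extract_contexts(fake_response),
                                 max_context_words=8)
```

Only the original sentence's tokens carry NER labels; the appended context gives the encoder extra evidence about the entities but would be excluded from the loss.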