COVNLP: A Multisource COVID-19 Dataset for Natural Language Processing

Olubayo Adekanmbi; Wuraola Fisayo Oyewusi; Warrie Warrie; Adedayo Odukoy; Abimbola Olawale; Opeyemi Osakuade; Mary Salami

Authors

Olubayo Adekanmbi

Data Scientists Network (Data Science Nigeria) Lagos

Wuraola Fisayo Oyewusi

Data Scientists Network (Data Science Nigeria) Lagos

Warrie Warrie

Data Scientists Network (Data Science Nigeria) Lagos

Adedayo Odukoy

Data Scientists Network (Data Science Nigeria) Lagos

Abimbola Olawale

Data Scientists Network (Data Science Nigeria) Lagos

Opeyemi Osakuade

Data Scientists Network (Data Science Nigeria) Lagos

Mary Salami

Data Scientists Network (Data Science Nigeria) Lagos

DOI: https://doi.org/10.21467/proceedings.157.2

Synopsis

In this work, we propose COVNLP, a novel dataset for natural language processing tasks. The openly available dataset consists of 3,199 de-identified peer-to-peer messages, collected from volunteers, that were shared across channels such as WhatsApp, SMS and social media during the COVID-19 pandemic in Nigeria. The messages were labelled by participants at submission and by independent data annotators after submission under three major themes: message genuineness, type and impact. We found that the sources of information participants trusted most during the COVID-19 pandemic were international stations, social media and websites. Of the messages received by volunteers, 31.20% were labelled as having psychological effects such as emotional disturbance, depression, stress and mood alterations. The dataset is available here. As part of our experimentation, we developed a basic machine learning model to classify the messages into misinformation, disinformation and rumour classes. The best-performing algorithm was Logistic Regression with a count vectorizer, with an area under the curve (AUC) of 0.813, compared with the Naive Bayes Classifier (0.716) and the Random Forest Classifier (0.710).
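The classification experiment described above can be illustrated with a minimal sketch. It assumes scikit-learn, and it is not the authors' code. The toy messages and labels are invented for illustration and are not drawn from COVNLP. The synopsis does not say how the multiclass AUC was computed, so the one-vs-rest averaging used here is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, NOT taken from the COVNLP dataset.
messages = [
    "drinking hot water every hour cures the virus",
    "garlic and ginger tea will cure the virus in one day",
    "the virus was created on purpose to sell vaccines",
    "officials are hiding the real numbers on purpose",
    "i heard the market will be closed from tomorrow",
    "someone said the schools will be closed next week",
]
labels = [
    "misinformation",
    "misinformation",
    "disinformation",
    "disinformation",
    "rumour",
    "rumour",
]

# Bag-of-words token counts feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(messages, labels)

# Class probabilities, one column per class in model.classes_ order.
proba = model.predict_proba(messages)

# One-vs-rest multiclass AUC (an assumption; the source does not specify).
# Scoring on the training messages only keeps the sketch short.
auc = roc_auc_score(labels, proba, multi_class="ovr", labels=model.classes_)
print(f"Training-set AUC on toy data: {auc:.3f}")
```

A realistic evaluation would score a held-out split of the real dataset rather than the training messages, so the number printed here is not comparable to the reported 0.813.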