COVNLP: A Multisource COVID-19 Dataset for Natural Language Processing

Authors

Olubayo Adekanmbi
Data Scientists Network (Data Science Nigeria) Lagos
Wuraola Fisayo Oyewusi
Data Scientists Network (Data Science Nigeria) Lagos
Warrie Warrie
Data Scientists Network (Data Science Nigeria) Lagos
Adedayo Odukoy
Data Scientists Network (Data Science Nigeria) Lagos
Abimbola Olawale
Data Scientists Network (Data Science Nigeria) Lagos
Opeyemi Osakuade
Data Scientists Network (Data Science Nigeria) Lagos
Mary Salami
Data Scientists Network (Data Science Nigeria) Lagos

Synopsis

In this work, we propose COVNLP, a novel dataset for natural language processing tasks. The openly available dataset consists of 3,199 de-identified peer-to-peer messages that volunteers shared across channels such as WhatsApp, SMS and social media during the COVID-19 pandemic in Nigeria. The messages were labelled both by participants at submission and by independent data annotators after submission under three major themes: message genuineness, type and impact. We found that the most trusted sources of information for the participants during the COVID-19 pandemic were international stations, social media and websites. Of the messages received by volunteers, 31.20% were labelled as having psychological effects such as emotional disturbance, depression, stress and mood alterations. The dataset is available here. As part of our experimentation, we developed a basic machine learning model to classify the messages into misinformation, disinformation and rumour classes. The best-performing algorithm was Logistic Regression with a count vectorizer, with an area under the curve (AUC) value of 0.813, compared to the Naive Bayes Classifier (0.716) and the Random Forest Classifier (0.710).
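To make the terms in the modelling experiment concrete, the following is a minimal sketch of a count-vector representation, a logistic regression classifier and an AUC computation, written in plain Python. It is not the authors' implementation: the four messages and their labels are invented for illustration, the problem is reduced to a single binary split (misinformation versus everything else) rather than the three classes described above, and the function names, learning rate and epoch count are all assumptions.

```python
import math
from collections import Counter


def count_vectorize(texts):
    """Map each text to raw token counts over a shared, sorted vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vectors, vocab


def predict_proba(x, w, b):
    """Sigmoid of the linear score: estimated probability of the positive class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit binary logistic regression by batch gradient descent (assumed settings)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(y_true, scores):
    """AUC as the chance a positive example outscores a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented toy messages (NOT from COVNLP); 1 = misinformation, 0 = other.
texts = [
    "drinking hot water cures the virus",
    "garlic cures the virus overnight",
    "wash your hands with soap regularly",
    "wear a mask in crowded places",
]
labels = [1, 1, 0, 0]

X, vocab = count_vectorize(texts)
w, b = train_logistic_regression(X, labels)
scores = [predict_proba(x, w, b) for x in X]
auc = roc_auc(labels, scores)
print(f"Training-set AUC on toy data: {auc:.3f}")
```

The AUC here is computed on the same four messages used for training, so it says nothing about generalisation; a meaningful comparison such as the one reported above requires a held-out test split.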

SIAIA22
Published
February 17, 2024
Online ISSN
2582-3922