Trustworthiness of health-related information on Arabic social media
Social media (SM) platforms play a vital role in disseminating health-related information. However, evidence suggests that Twitter posts (i.e., tweets) are often inaccurate; for example, research from Saudi Arabia indicates that 50% of health-related tweets contain inaccurate information. Previous studies also suggest that tweets do not need to be evidence-based or accurate to gain traction, which exacerbates the accuracy concern in the sphere of health information. The goal of the thesis is to develop a framework for automatically determining the accuracy of health-related tweets.
Knowing the accuracy of tweets offers the potential to recommend/promote accurate tweets while identifying/flagging/demoting inaccurate tweets. As a first step, this thesis employed a pilot study to identify possible metrics that may correlate with the accuracy of health-related tweets. The results showed that tweet meta-characteristics have some limited potential in the identification of inaccurate tweets and to inform on their dissemination potential.
The research then built past this work to develop a framework for automatically determining the accuracy of health-related tweets in Arabic. The first step was to develop a model to detect instances of health-related tweets. This was accomplished by determining the best pre-processing techniques for use with traditional machine learning and then developing traditional machine learning classifiers. The model was then compared with state-of-the-art pre-trained word embeddings. The findings from evaluating the pre-processing techniques with traditional machine learning showed that pre-processing techniques perform differently from one algorithm to another. In addition, most pre-processing methods highlighted in the literature were not included in the best combination. Pre-processing techniques specific to the Arabic language are more likely to improve classifier performance than other generalized pre-processing techniques. However, ultimately the deep learning model outperformed the traditional machine learning models, even with optimized pre-processing. After developing a model to detect the health-related tweets, the accuracy of the health-related tweets was to be determined. To develop classifiers for this step, we built data sets labeled “accurate” and “inaccurate.” Two medical doctors labeled each tweet, and the data sets were used to evaluate pre-trained language models and word embeddings, to identify the best model for detecting health-related tweets’ trustworthiness. The results suggest that pre-trained language models perform better than pre-trained word embeddings.
Results from both phases were impressive individually but suffer from the individual inaccuracy slightly when used in combination to detect accurate health tweets. However, we believe that the proposed process to identify health trustworthiness and the findings from these experiments will open the door for further research in this direction and may eventually result in an even more effective automatic prevention of incidents of health misinformation in Arabic
History
Faculty
- Faculty of Science and Engineering
Degree
- Doctoral
First supervisor
Jim BuckleySecond supervisor
Nikola NikolovAlso affiliated with
- LERO - The Irish Software Research Centre
Department or School
- Computer Science & Information Systems