Exploring Sentiment Analysis on Arabic Tweets about the COVID-19 Vaccines

The COVID-19 pandemic has imposed a public health crisis across the world. Global efforts have led to the development and deployment of multiple vaccines, and the success of ending the pandemic relies on people's willingness to be vaccinated. Social media platforms have proved to be a valuable source for studying sentiment and emotion towards COVID-19 vaccination in many languages, although most work has focused on English. People express their opinions and emotions briefly on Twitter, where they can be tracked almost instantaneously. This helps governments, public health officials, and decision-makers understand public opinion towards vaccines. The goal of this research is to investigate public sentiment on COVID-19 vaccines. All Arabic-language tweets mentioning seven vaccines were extracted from Twitter over the seven months from 1 November 2020 to 31 May 2021. A set of Arabic sentiment lexicons was prepared to perform the sentiment analysis. The monthly average sentiment of the tweets was calculated from the collected dataset and compared across vaccines throughout the study period. Out of 5.5 million tweets retrieved using the most frequent keywords and hashtags during the COVID-19 pandemic, only 202,427 tweets were included in the monthly sentiment analysis, because we considered only tweets whose text mentioned a single vaccine name. The distribution of tweets shows that 47.5% of the considered tweets mentioned the Pfizer vaccine. Overall, 64.4% of the tweets are non-negative while 35.6% are negative, with a significant difference in sentiment between the months. We observed an increase in non-negative tweets alongside an increase in negative tweets in May 2021, reflecting the public's rising confidence towards vaccines. Lexicon-based sentiment analysis is a valuable and easy-to-implement technique that can be used to track sentiment regarding COVID-19 vaccines. Analysis of social media data helps public health authorities monitor public opinion, address people's concerns about vaccines, and build individuals' confidence in vaccines.


INTRODUCTION
By mid-August 2021, more than 180 million global COVID-19 cases and 3.5 million deaths had been confirmed [1]. People were encouraged to stay at home, apply social distancing, and do their work remotely. This isolation led many people to post on social media to express their opinions towards the pandemic [2]. In 2020 and 2021, most of the top hashtags and posts on Twitter were related to COVID-19 and its variations [3], shared not just by individuals but also by governments and health authorities announcing policies and news related to the pandemic [4]. Recently, following the approval of several COVID-19 vaccines worldwide, most discussions on social media have concerned vaccine availability, people's acceptance or rejection of vaccination, and the side effects of each vaccine [5]. Computing technologies help to study different aspects of COVID-19, its impact, and people's response to it effectively [6][7][8].
Many studies argue that social media play an important role in public health [9]. Surveillance systems are one component of public health, used to monitor, identify, and evaluate health issues [10], and social media offers a great opportunity for such research. Examples of public health surveillance include monitoring infectious diseases [11] and disease outbreaks [12], tracking public response to health issues [13], detecting target areas for intervention efforts [14], and tracking the spread of fake news about epidemics [15,16].
With over 187 million active users, Twitter serves as a powerful medium for understanding public perception of the COVID-19 vaccines. Twitter data has been used in many types of research to understand public attitudes and discussions about the COVID-19 pandemic. Researchers have investigated many topics related to COVID-19, including qualitative content analysis to understand public communication [17], sentiment analysis and word frequency [18], misinformation detection [19], topic modeling [20][21][22], and social distancing measurement [23].
The key to controlling the spread of the COVID-19 pandemic is the development of vaccines [24][25], while refusal to take the vaccine undermines herd immunity [26]. Traditional surveys conducted to understand people's opinions are costly for several reasons: they are time-consuming, they address a limited range of health topics, and they draw on small-scale data. Therefore, we need to look at the community, not just individuals. Social media, including Twitter, offers a great opportunity to infer public sentiment about the COVID-19 vaccines.
Multiple studies have examined vaccine issues using Twitter data. Twitter has been used to detect community hesitancy regarding other vaccines such as MMR, HPV, and Tdap [26]. It has also been used to analyze vaccine images [27], detect sentiment about HPV [28], and understand the role of Russian trolls in the vaccine debate [29]. Since the start of the COVID-19 pandemic in 2020, Twitter data has been utilized in many studies to understand different issues related to pandemic vaccines, such as the detection of people's opinions about vaccine hesitancy [30,31], sentiment analysis [32], content analysis of COVID-19 tweets [33], fake news and misinformation [34][35][36], pro-vaccine campaigns on Twitter [37], anti-vaccination tweets [38], and people's emotions during the pandemic [39].
The previous studies offer valuable insights into COVID-19 vaccine problems, but with some limitations. In particular, most studies focused on the analysis of tweets in English and a few other languages. This research proposes a method for analyzing collected tweets that helps bridge the gap in coverage of under-resourced languages such as Arabic. It identifies the sentiment of tweets and the major topics discussed to provide insight into public attitudes about the COVID-19 vaccines over time. To the best of our knowledge, this is the first article assessing sentiment towards seven vaccines in Arabic.

METHODOLOGY
This section details the method used in the paper. We propose a methodological framework containing five components, as depicted in Fig. 1.

Data Collection
Arabic tweets posted from November 1, 2020, to May 31, 2021, were retrieved through the Twitter streaming API (Application Programming Interface). We searched for relevant tweets using keywords such as "لقاح" (vaccine), "اللقاح" (the vaccine), "تلقيح" (vaccination), and "لقاحات" (vaccines), together with the name of each vaccine in Arabic. All Arabic tweets posted in that period were collected and then filtered before inclusion in the sentiment analysis. Tweets posted from spam, advertising, and pornographic accounts were excluded. Duplicate tweets from the same user and short tweets containing fewer than three words were removed. This process yielded 180,000 tweets out of 5 million tweets.
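The filtering rules described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the paper reports using Java), and the `Tweet` record, the `filter_tweets` function, and the `blocked_users` set are hypothetical names, not part of the original pipeline. How spam, advertising, and pornographic accounts were identified is not stated in the paper, so the sketch simply assumes a precomputed set of such accounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    user: str   # account that posted the tweet
    text: str   # raw tweet text


def filter_tweets(tweets, blocked_users=frozenset(), min_words=3):
    """Keep tweets that pass the three filters described in the text.

    Assumptions: `blocked_users` is a precomputed set of excluded accounts,
    and two tweets count as duplicates when the same user posted the same
    whitespace-normalized text.
    """
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet.user in blocked_users:
            continue  # excluded account
        words = tweet.text.split()
        if len(words) < min_words:
            continue  # fewer than three words
        key = (tweet.user, " ".join(words))
        if key in seen:
            continue  # duplicate from the same user
        seen.add(key)
        kept.append(tweet)
    return kept
```

The first occurrence of each duplicate is kept, which is one possible choice; the paper does not say which copy was retained.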

Data Pre-processing
Daily tweet counts were measured, and descriptive statistics were computed for the collected tweets. Natural language processing techniques were applied to process, analyze, and visualize the tweet text. The tweets were preprocessed in several cleaning steps, including the removal of URLs, user mentions, non-Arabic words, and punctuation. All analysis was conducted using Java (Apache NetBeans IDE 12.4).
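As a rough illustration of the cleaning steps listed above, the following sketch uses Python regular expressions (the paper's own implementation was in Java and is not shown). The function name `clean_tweet` and the exact patterns are assumptions. In particular, "Arabic" is approximated here as the Unicode range U+0621 to U+0652 (basic letters and diacritics), which may be narrower or wider than what the authors used.

```python
import re

# Assumed patterns; the paper does not give its exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Anything that is not an Arabic letter/diacritic (U+0621-U+0652) or whitespace.
# This also removes Latin words, digits, and Arabic and Latin punctuation.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u0652\s]")


def clean_tweet(text):
    """Remove URLs, user mentions, non-Arabic characters, and punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())
```

URLs and mentions are removed before the character filter so that their fragments are not left behind as stray tokens.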

Sentiment Analysis
In our sentiment analysis, we used multiple lexicons from the www.saifmohammed.com website. The lexicons were combined, and their entries were classified into negative and non-negative (positive and neutral) classes. The final lexicon contains 14,000 words. A rule-based approach was developed to extract the sentiment polarity of each tweet in our dataset: we calculated the difference between each tweet's non-negative and negative sentiment scores. If the difference was less than zero, the tweet was assigned to the negative class; otherwise, it was assigned to the non-negative class.
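The classification rule can be sketched as follows. This is an illustrative Python sketch, not the authors' Java implementation. It assumes that a tweet's score for each class is the number of its tokens found in the corresponding word list, and that the difference is taken as non-negative minus negative; the paper does not spell out either detail. The function name `classify` and the tiny word sets are hypothetical stand-ins for the 14,000-word lexicon.

```python
def classify(tokens, negative_words, non_negative_words):
    """Assign 'negative' or 'non-negative' to a tokenized tweet.

    Assumption: each score is a simple count of lexicon hits.
    """
    negative_score = sum(1 for tok in tokens if tok in negative_words)
    non_negative_score = sum(1 for tok in tokens if tok in non_negative_words)
    difference = non_negative_score - negative_score
    # A difference below zero means negative words dominate; ties and
    # tweets with no lexicon hits fall into the non-negative class.
    return "negative" if difference < 0 else "non-negative"


# Toy lexicon for illustration only.
NEGATIVE = {"خطير", "خوف"}        # "dangerous", "fear"
NON_NEGATIVE = {"جيد", "فعال"}    # "good", "effective"

print(classify("اللقاح خطير".split(), NEGATIVE, NON_NEGATIVE))  # negative
print(classify("اللقاح جيد".split(), NEGATIVE, NON_NEGATIVE))   # non-negative
```

Under this rule a tweet with no lexicon matches is labeled non-negative, which is consistent with neutral words being grouped into the non-negative class.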
To evaluate the usefulness of this approach, we randomly selected a sample of 1000 tweets representing all vaccine categories. Each tweet was manually tagged by two people as having either negative or non-negative sentiment. Of the manually annotated tweets, we compared 745 human-tagged tweets with the results obtained by the rule-based approach. The comparison showed that the highest agreement between human and rule-based annotations was 78%.
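One simple way to compute such an agreement figure is plain percent agreement, sketched below in Python. The paper does not state which agreement measure was used, so this is an assumption, and the function name `percent_agreement` is hypothetical. The labels in the example are made up and are not the study's data.

```python
def percent_agreement(human_labels, predicted_labels):
    """Percentage of items on which the two label sequences agree."""
    if len(human_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not human_labels:
        raise ValueError("label sequences must not be empty")
    matches = sum(h == p for h, p in zip(human_labels, predicted_labels))
    return 100.0 * matches / len(human_labels)


# Made-up example: the two sources agree on 3 of 4 tweets.
human = ["negative", "non-negative", "negative", "negative"]
rule_based = ["negative", "non-negative", "non-negative", "negative"]
print(percent_agreement(human, rule_based))  # 75.0
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would be an alternative, though the paper does not report one.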

Statistical Analysis
We conducted a statistical analysis using Tableau to compare each vaccine's sentiment changes over time and to compare the vaccines with one another each month. As mentioned above, we classified the tweets in our dataset into negative and non-negative classes only. Fig. 2 shows the distribution of tweets over time. The distribution spans 13 months, from May 2020 to May 2021. We notice that the number of tweets in both classes begins to increase in November 2020, which is the starting period for the production of COVID-19 vaccines. Fig. 4 shows the distribution of sentiment for each vaccine. The total number of non-negative tweets (130,326 tweets ≈ 64.4%) is higher than the number of negative tweets (72,101 tweets ≈ 35.6%) for each vaccine.
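The per-vaccine, per-month comparison can be sketched in Python as a simple grouping step. The paper performed this analysis in Tableau and does not define its monthly measure precisely, so the sketch assumes the share of non-negative tweets per (vaccine, month) pair. The function name `monthly_non_negative_share` and the record layout are hypothetical. The last lines recompute the overall percentages from the two totals reported in the text.

```python
from collections import defaultdict


def monthly_non_negative_share(records):
    """Share (%) of non-negative tweets per (vaccine, month).

    `records` is an iterable of (vaccine, "YYYY-MM", label) tuples,
    an assumed layout for already classified tweets.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [non_negative, total]
    for vaccine, month, label in records:
        entry = counts[(vaccine, month)]
        entry[1] += 1
        if label == "non-negative":
            entry[0] += 1
    return {key: 100.0 * nn / total for key, (nn, total) in counts.items()}


# Overall totals as reported in the text.
non_negative, negative = 130_326, 72_101
total = non_negative + negative
print(total)                                 # 202427
print(round(100 * non_negative / total, 1))  # 64.4
print(round(100 * negative / total, 1))      # 35.6
```

The recomputed total of 202,427 matches the number of tweets included in the monthly sentiment analysis.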

CONCLUSION
Lexicon-based sentiment analysis is a valuable technique that is easy to implement. It helps to track sentiment about COVID-19 vaccines. The results in this paper show that Twitter can help indicate public sentiment towards the COVID-19 vaccines. The methodology in this research can be applied to similar data from other social media networks such as Facebook.