Automatic Question Generation Using Semantic Role Labeling for Morphologically Rich Languages

In this paper, a novel approach to automatic question generation (AQG) using semantic role labeling (SRL) for morphologically rich languages is presented. A model for AQG is developed for our native language, Croatian. Croatian is a highly inflected language that belongs to the Balto-Slavic family of languages. The work is divided into two stages. In the first stage we present a novel approach to SRL of texts written in Croatian that uses Conditional Random Fields (CRF). SRL traditionally consists of predicate disambiguation, argument identification and argument classification. After these steps, most approaches use beam search to find the optimal sequence of arguments for a given predicate. We propose an architecture for predicate identification and argument classification in which finding the best sequence of arguments is handled by Viterbi decoding. We enrich the SRL features with attributes designed specifically for this language. Our SRL system achieves an F1 score of 78% in the argument classification step on the Croatian hr500k corpus. In the second stage the proposed SRL model is used to develop an AQG system for question generation from texts written in Croatian. We propose custom templates for AQG that were used to generate a total of 628 questions, which were evaluated by experts who scored every question on a Likert scale. The evaluation showed that 68% of the generated questions could be used for educational purposes. With these results, the proposed AQG system could be implemented inside educational systems such as Intelligent Tutoring Systems.


INTRODUCTION
The process of converting unstructured data into structured "machine-readable" data is a very complex task, which involves the retrieval of essential concepts and their systematic editing from a text written in natural language. Since the manual data annotation is a labor-intensive and time-consuming task, there is motivation to automate it. Concept identification in a particular domain is a very important first step. Identifying concepts and defining their semantic relationships seems challenging, but different natural language understanding techniques have already been developed. This paper focuses on identifying semantics in a text, a task known as Semantic Role Labeling (SRL), the widely studied challenge of recovering predicate-argument structure for natural language words, typically verbs.
This task covers advanced methods of understanding the meaning of a sentence used in many natural language processing tasks such as automatic question generation (AQG), machine translation, knowledge extraction and text summarization. The automatic recognition of semantic roles for the inflective Croatian language is a demanding task due to its free word order. The application of SRL tools can greatly contribute to the development of Intelligent Tutoring Systems (ITS) [1] based on natural language processing. This is a special type of ITS in which the communication with the student is carried in natural language.
There are several systems that try to solve the communication problem in ITSs, and by far the most famous one is AutoTutor [2]. Research shows that AutoTutor has produced learning gains that are about 0.8 standard deviation units above controls that read static instructional materials for an equivalent amount of time [3]. AutoTutor consists of five different components and each of them is responsible for a different stage of learning. This research focuses on Curriculum script repository that defines the content associated with a question or problem.
It consists of ideal answers, expected answers, a set of misconceptions, a set of keywords and synonyms, and markup language rules for the speech and gesture generator, all of which can be easily generated with an authoring tool called the AutoTutor Script Authoring Tool [4]. This means that the process of creating new course material is not fully automated. A system similar to AutoTutor, but built for the Croatian language, is CoLaBTutor (Controlled Language-Based Tutor) [5]. The system uses various natural language processing techniques to enable the learner to communicate in natural language. CoLaBTutor uses a variety of lexical resources and custom-made rules to enable communication in the process of teaching and testing. The time and cost of content development for such a system are much higher than for the traditional teaching material used today. This is the biggest drawback of implementing ITSs, and methodologies to accelerate this process are currently an active topic of research. An SRL component would allow the implementation of an automatic tool for developing instructional content. Such a tool would support automatic generation of sentences and questions, verification of text similarity, and knowledge extraction from text written in natural language. The system would receive unstructured text as input and transform it into content that it could easily manipulate.
The rest of this paper is structured as follows. Section 2 reviews related work on SRL. Section 3 describes the architecture of the proposed SRL system for Croatian. Section 4 presents the SRL evaluation results of our proposed approach. Section 5 proposes an approach for the AQG model. Section 6 presents the results of our qualitative analysis of the generated questions and a comparison with similar AQG systems in other languages, together with suggestions for correcting the errors that our experts noticed during evaluation. Section 7 concludes the paper and gives suggestions for future work.

RELATED WORK

SRL Task Description
The term semantic role denotes the relationships that verbs in a sentence have with other words. Verbs express the semantics of an event, which is described as relational information among the participants in that event, and project a syntactic structure that encodes that information. Verbs are also highly variable, displaying a rich array of semantic and syntactic behavior. Verb classifications help natural language processing systems organize verbs into groups that share basic semantic and syntactic features. Semantic roles describe the conceptual relationships between participants in a given sentence. They capture the basic "Who, What, Where, How, and When" information within a sentence. The concept of semantic roles first appeared in modern linguistics in the late 1960s within case grammar [6]. Case grammar is a system of linguistic analysis that focuses on finding a connection between the valence of the verb and the context in which the verb is located. According to this grammar, the morphological and syntactic structures of all languages are derived from "hidden" semantic categories, not from syntactic categories as stated in the theory of generative grammar.
Semantic roles can be used to assist in various advanced text processing methods such as question generation, plagiarism detection, and summarization of multiple documents [7][8][9][10]. Therefore, it is very important to develop precise methods for machine recognition of semantic roles. Semantic roles provide a layer of abstraction over the syntactic dependencies of words in a sentence. These tags hide information that is insensitive to syntactic changes and provide some level of semantics, which is why this task is often called shallow semantic parsing. Machine recognition of semantic roles within sentences can be viewed through two machine learning approaches.
The first approach treats SRL as a classification task in which a semantic role is determined for each word in a sentence, depending on its predicate. Typically, such an approach uses syntactic information to "learn" semantic roles from text tagged with features. Some of these methods are based on a rich set of features extracted from texts using a body of expert knowledge [11,12], while others extract features automatically using deep learning algorithms [13][14][15]. Such approaches require a large amount of text labeled with semantic roles and are therefore suitable for resource-rich languages. Their main disadvantage is that they are limited to the domain in which they are trained. The second approach presents semantic role labeling as a task of grouping words and sentences [16][17][18][19]. Identifying semantic roles in a text using unsupervised machine learning does not produce results as good as those of methods based on rich lexical resources and classification. Another disadvantage of unsupervised machine learning methods is that they make rigorous assumptions about the data, such as the assumption that the semantic arguments of the predicate remain consistent despite variation in syntactic function. Furthermore, unlike supervised methods, they rely on simple sentential features, which are not adequate for developing an SRL system for a language with very free word order.

SRL Dataset for Croatian Language
Although there are many approaches that use parallel corpora to train multilingual tools for SRL [16,20], they do not include the Croatian language. A corpus annotated with semantic roles in Croatian is currently under construction. The only annotated corpus to be used as training data for supervised machine learning systems [21] was developed within the project Semantic Role Labeling in Slovene and Croatian. The tag set for the Croatian SRL is developed following an approach from Prague Dependency Treebank (PDT) [22]. It contains a total of 87,387 tagged tokens. 3,003 sentences are used for training and 754 sentences for testing. All sentences are annotated with syntactic and semantic tags as shown in Fig. 1.
In the sentence "Kosovo ozbiljno analizira proces privatizacije u svjetlu učestalih pritužbi", all words are labeled with their respective part-of-speech tags according to the MULTEXT-East specification and with dependency relations using Universal Dependencies 1, and the semantic roles are labeled with respect to the predicate "analizira" (Eng. analyze). The semantic arguments are the actor of the action (ACT), the patient of the action (PAT) and the manner in which the action is carried out (REG).

Figure 1
Schematic representation of predicates, semantic frames and syntactic information such as dependency tree and part-of-speech tags in the Croatian language

APPROACH TO SEMANTIC ROLE LABELING OF MORPHOLOGICALLY RICH LANGUAGES
For languages that have a rich set of hand-tagged data, supervised machine learning methods that can "learn" from training data to distinguish individual classes are a logical choice. Most natural language processing tasks, whether syntactic or semantic, can be reduced to labeling words in a sentence. Let us define $W = \{w_1, w_2, \ldots, w_n\}$ as the set of all words in a language, and $O$ as the set of all tag vectors for each sentence. The function that assigns a vector from the set $O$ to each sentence needs to be found using an optimization algorithm. Good results can be achieved by defining a function that extracts a number of features from each word, as shown in Eq. (1).

$$f : w_i \mapsto \big(f_1(w_i), f_2(w_i), \ldots, f_k(w_i)\big), \quad i = 1, \ldots, n \tag{1}$$

The feature function consists of a series of word transformations that extract informative features specific to the machine learning task to be performed. The set of all features is encoded into numerical representations that are suitable for classification algorithms. A general approach to developing an SRL system, set out by [12,23], comprises three steps. The first step is predicate identification, in which the system identifies the predicate of a sentence. The second step is argument identification, in which the system determines whether a given word is semantically related to the predicate. The last step is argument classification, in which, given predicate-argument pairs, the system identifies the label that connects each pair.

1 https://universaldependencies.org/

 
Strategies for SRL can be diverse, and most systems use the three-step approach. In this approach argument identification is implemented as binary classification and argument classification as multi-class classification. The argument identification step filters the words to reduce the number of non-argument labels, which can otherwise cause the argument classification model to overfit. This is the main problem for under-resourced languages that do not have a large corpus of hand-annotated data. SRL in the Croatian language was performed on the corpus described in section 2 using the features shown in Tab. 1. The SRL strategy used in this paper is a model based completely on a sequence tagging approach. In the predicate identification step, CRF is used to find a sequence of binary values that indicate predicate and non-predicate words. We use the handcrafted feature sets defined by [11], modified to take into account morphological information from the words. We do this by removing the Feats feature defined in [11] and adding type, case, gender and number features to every predicate candidate. The same is done in argument classification, where we generate multiple sequences based on the predicate and use morphological information from the given argument candidate.
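The feature construction described above can be illustrated with a minimal sketch. It assumes a hypothetical token representation (a dictionary with `form`, `lemma`, `pos`, `deprel` and optional morphological keys); these key names and the context features are illustrative assumptions and do not reproduce the feature set of Tab. 1.

```python
from typing import Dict, List

# Hypothetical token representation: the key names below are illustrative
# assumptions, not the column names of the corpus.
Token = Dict[str, str]

# Morphological attributes added in place of a single opaque feature string.
MORPH_KEYS = ("type", "case", "gender", "number")


def token_features(sentence: List[Token], i: int) -> Dict[str, str]:
    """Build a feature dictionary for the i-th token of a sentence."""
    tok = sentence[i]
    feats = {
        "word": tok["form"].lower(),
        "lemma": tok["lemma"],
        "pos": tok["pos"],
        "deprel": tok["deprel"],
    }
    # Each morphological attribute becomes its own feature; "NA" marks
    # attributes that do not apply to the token (e.g. case for a verb).
    for key in MORPH_KEYS:
        feats[key] = tok.get(key, "NA")
    # Simple context features: part of speech of the neighbouring tokens.
    feats["prev_pos"] = sentence[i - 1]["pos"] if i > 0 else "BOS"
    feats["next_pos"] = sentence[i + 1]["pos"] if i + 1 < len(sentence) else "EOS"
    return feats


def sentence_features(sentence: List[Token]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]
```

A list of such dictionaries per sentence is the input format that common CRF toolkits accept, which is why the attributes are kept as separate string-valued features rather than merged into one tag.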
There are two types of classification models: independent models and joint models. Joint models find the best overall sequence of arguments for a given predicate, while independent models predict an individual argument label for each argument-predicate pair. Independent models are prone to inconsistencies, so many SRL approaches are based on joint models. This is usually implemented by using beam search over the predicted labels. In this article we use the Viterbi decoding algorithm on the conditional probabilities inferred by the CRF to find the optimal sequence of SRL arguments.
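The difference between independent and joint prediction can be made concrete with a small sketch of Viterbi decoding. It is a generic textbook implementation over additive scores (for example log-potentials), not the code of the proposed system; the function name and the score dictionaries are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def viterbi(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y] is the score of label y at position t and
    transitions[a][b] is the score of moving from label a to label b.
    Scores are additive, e.g. log-potentials.
    """
    if not emissions:
        return []
    # best[t][y]: best score of any path that ends in label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda a: best[-1][a] + transitions[a][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a strongly negative transition score from ACT to ACT, the decoder avoids labeling two adjacent words as ACT even when each of them individually prefers that label, which is the kind of inconsistency that independent models cannot rule out.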
CRF is a type of statistical modeling method used to label data sequences. CRF is said to be the precursor to today's recurrent neural networks. It is a commonly used tool in natural language processing for part-of-speech tagging [24,25], named entity tagging [26], and other sequential tagging tasks. CRF is an undirected graphical model whose vertices can be divided into two disjoint sets. In sequence modeling, the graph of interest is usually a chain. The input set of variables $F$ represents a series of observations, while $T$ represents the hidden states, i.e. the states to be inferred from the input. The conditional dependence between input and output variables is defined through a series of feature functions f(i, …). Each such function defines which input variables determine the probability of each possible value of the output variable $T_i$. The model assigns a numerical weight to each feature and combines the weighted features to determine the tag probabilities for the input values. The parameters $\Omega$ are learned by maximum likelihood estimation of $p(T_i \mid F_i, \Omega)$; for exponential-family distributions this problem can be solved using gradient descent or quasi-Newton methods.
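For reference, the standard linear-chain CRF can be written in the notation of this section as follows. The argument list of the feature functions, $(i, T_{i-1}, T_i, F)$, and the weights $\omega_k \in \Omega$ follow the usual textbook formulation and are an assumption about the exact form intended here.

```latex
% Conditional probability of a tag sequence T = (T_1, ..., T_n)
% given the observations F, with weights \omega_k \in \Omega.
p(T \mid F, \Omega) =
  \frac{1}{Z(F)}
  \exp\Big( \sum_{i=1}^{n} \sum_{k} \omega_k \, f_k(i, T_{i-1}, T_i, F) \Big),
\qquad
Z(F) = \sum_{T'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \omega_k \, f_k(i, T'_{i-1}, T'_i, F) \Big)

% Viterbi decoding returns the most probable sequence; Z(F) does not
% depend on T and can be ignored in the maximization.
T^{*} = \arg\max_{T} \sum_{i=1}^{n} \sum_{k} \omega_k \, f_k(i, T_{i-1}, T_i, F)
```

Because the score decomposes over adjacent pairs of tags, the maximization can be carried out exactly by dynamic programming in time linear in the sentence length.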

EVALUATION OF THE PROPOSED SRL METHOD
We compared the results obtained with the CRF method and custom-defined features against the results obtained with mate-tools [11] using its default German features, since German is the closest to Croatian of all the languages that mate-tools supports. The system was trained and evaluated using the train-test split of the hr500k corpus.
Our system 2 was trained using the L-BFGS method for 100 iterations with both L1 and L2 regularization, using an L1 coefficient of 0.2009 and an L2 coefficient of 0.0284. The CRF was also configured to generate transition features for label transitions that did not occur in the training data but that, judging by the available transitions, are likely to occur. The classifier was evaluated using 3-fold cross-validation, and randomized search was used to find the optimal hyperparameters.
Tab. 2 shows the results obtained with our CRF method. We also tested a linear classifier, but the CRF method gave slightly better results.
2 https://github.com/danielvasic/CroatianSRL.git

We compared the results obtained by our CRF method with the benchmark results obtained by mate-tools 3. The official benchmark results used gold tags for the predicate identification stage. We reran the same tool on the same dataset but included the predictions from the predicate identification stage. The results of our evaluation show that mate-tools achieved an F1 score of 98.53% for the predicate identification stage and an F1 score of 71.87% for the argument classification stage. Compared to mate-tools, the CRF method gave better results, with an increase in F1 score of 0.65% for predicate identification and of 6% for argument classification. This is an improvement over the results of the state-of-the-art SRL models for the Croatian language that have been published recently [27]. The most discriminative features for predicate identification were the PredWord, ChildCaseSet and PredType features, among others. The analysis shows that if the ChildCaseSet feature value is nominative and locative, there is a high probability that the next element of the sequence is the predicate, and if the PredType is main, there is a very high probability that the next word in the sequence is again a predicate. In the argument classification step the DepPath and PosPath features are the most informative, though ArgFeatsType is also highly informative, especially for the identification of TIME, LOC and DUR arguments. In both predicate identification and argument classification, morphological features such as case, number, gender and type are shown to be the most informative for inferring the correct classification.
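As a reminder of how the reported scores are defined, the sketch below computes precision, recall and F1 from gold and predicted label sequences. The micro-averaging over tokens and the null label `"O"` for non-arguments are assumptions made for illustration; the exact averaging used for the reported scores is not specified here. Read as absolute differences, the reported gains correspond to roughly 98.53 + 0.65 = 99.18% for predicate identification and 71.87 + 6 = 77.87% for argument classification.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], predicted: Sequence[str], null_label: str = "O"
) -> Tuple[float, float, float]:
    """Micro-averaged scores over tokens, ignoring the null label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # A prediction is correct only if it matches a non-null gold label.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p and g != null_label)
    n_pred = sum(1 for p in predicted if p != null_label)
    n_gold = sum(1 for g in gold if g != null_label)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```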

AUTOMATIC QUESTION GENERATION USING SEMANTIC ROLE LABELING
Building on the semantic role labeling method developed in this article, we also propose a system for automatic question generation (AQG) for the Croatian language. Unlike English, Croatian is a highly inflected language. In this section, we present a simple model for question generation using semantic roles. Due to the complexity of the Croatian language, building a robust question generation system is a very complex task. We propose rules that can be implemented over the SRL graph, bearing in mind that there is room for improving the quality of the questions using other linguistic annotations such as part-of-speech tags, dependency trees, named entities and coreference resolution. The process of building an AQG system for Croatian starts with mapping the constituent question types to their respective semantic roles. The templates used to generate constituent questions are shown in Tab. 3. The predicate of a sentence defines the action expressed in the sentence, so our proposed system generates questions based on the predicates in the sentence. We use the predicate's arguments to determine which type of constituent question can be generated. If the predicate has the arguments required by a template, the system replaces each argument slot in the template with its respective value. We adapt the approach for question generation from [28] but map the arguments from PropBank-style annotations [29] to Prague Dependency Treebank-style annotations. We also suggest using rich morphological information from part-of-speech tags and the dependency parse to further improve the quality of the questions. Implemented rules, such as retrieving the auxiliary and compound verbs for each predicate, greatly improve the quality of the questions, although there is still room for improvement.
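The template mechanism described above can be sketched as follows. The templates, the role-to-question-word mapping and the frame format are hypothetical and serve only to show the slot-filling logic; the actual templates of the system are those listed in Tab. 3.

```python
import string
from typing import Dict, List

# Hypothetical templates keyed by the semantic role being asked about.
# Slots in braces are filled from the predicate and its other arguments.
TEMPLATES: Dict[str, str] = {
    "ACT": "Tko {predicate} {PAT}?",
    "PAT": "Što {ACT} {predicate}?",
    "LOC": "Gdje {ACT} {predicate} {PAT}?",
    "TIME": "Kada {ACT} {predicate} {PAT}?",
}


def required_slots(template: str) -> List[str]:
    """Names of the slots that a template needs to be filled."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def generate_questions(frame: Dict[str, str]) -> List[Dict[str, str]]:
    """Generate question-answer pairs from one predicate frame.

    frame maps "predicate" and role labels (ACT, PAT, ...) to text spans.
    """
    questions = []
    for role, template in TEMPLATES.items():
        # The asked-about role must be present, since its span is the answer.
        if role not in frame:
            continue
        # Every slot required by the template must be available in the frame.
        if any(slot not in frame for slot in required_slots(template)):
            continue
        questions.append(
            {"role": role, "question": template.format(**frame), "answer": frame[role]}
        )
    return questions
```

For the frame with predicate "analizira", ACT "Kosovo" and PAT "proces privatizacije", this sketch produces an actor question and a patient question and skips the location and time templates. It copies spans verbatim and does not check animacy or adjust inflection, so it always asks "Tko" (who) for the actor.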

EVALUATION OF THE PROPOSED AUTOMATIC QUESTION GENERATION METHOD
A total of 758 sentences from the hr500k test set used for SRL evaluation were used to generate 628 question and answer pairs. The questions were evaluated by three experts, all university-level teachers, because the system is intended for educational use. Each question was evaluated on three levels, grammar, semantics and relevance, as proposed in [9], using a Likert scale from one to five, where the grades are 1) dissatisfied, 2) not satisfied, 3) satisfied, 4) very satisfied and 5) extremely satisfied.
Upon completion we measured inter-evaluator agreement with Fleiss' Kappa. This analysis showed fair consensus between the evaluators' ratings on all three levels of evaluation. We also evaluated the grading results on three different scales. The first scale is the standard five-grade Likert scale, the second is a four-grade scale obtained by merging the last and second-to-last grades, and the third is a three-grade scale obtained by additionally merging the second and third grades. We chose which grades to merge arbitrarily, although other merging schemes could also be used and evaluated. The results of the Fleiss' Kappa evaluation are shown in Tab. 4. The first column shows Fleiss' Kappa for grammar on the five-, four- and three-grade scales. The second column shows the same scores for semantics and the last column shows the results for relevance. These results show that the evaluators have fair to moderate agreement about the grammar, semantics and relevance of the generated questions. The Fleiss' Kappa score increased substantially when the three- and four-grade scales were introduced, especially in the semantics and relevance categories, which suggests that three-level grading may be more appropriate for these categories. In Tab. 5 we show the average ratings for the different types of constituent questions. Most of the questions are actor type questions (72%) and there are very few frequency type questions (0.007%). From the table we can see that location, quantity and time type questions received the best average scores but were rated as less relevant than actor type questions, which were rated as satisfactory. Frequency type questions were rated as grammatically incorrect but satisfactorily understandable and not very relevant. To present the overall system performance we used the average of all ratings of the AQG system. The evaluation results show that, in the experts' judgment, our system generates questions of satisfactory quality. The worst scores on all levels were achieved by RESTR type questions. This could mean that the template for this type of question is badly defined.
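The agreement analysis above can be reproduced with a short sketch of Fleiss' Kappa together with a helper that merges grades into a coarser scale. The function names and the input layout (one list of grades per question, one grade per evaluator) are assumptions made for illustration, and the example mappings reflect one reading of the merging described in the text.

```python
from collections import Counter
from typing import Dict, List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa; ratings[i] holds the grades given to item i, one per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall proportion of each category.
    p_e = sum(
        (sum(cnt[k] for cnt in counts) / (n_items * n_raters)) ** 2
        for k in categories
    )
    if p_e >= 1.0:
        return 1.0  # all ratings fall into a single category
    return (p_bar - p_e) / (1 - p_e)


def merge_grades(
    ratings: Sequence[Sequence[int]], mapping: Dict[int, int]
) -> List[List[int]]:
    """Map grades onto a coarser scale; unmapped grades are kept as they are."""
    return [[mapping.get(g, g) for g in item] for item in ratings]


# Illustrative mappings: four grades (4 and 5 merged) and three grades
# (additionally 2 and 3 merged).
FOUR_GRADES = {5: 4}
THREE_GRADES = {5: 4, 3: 2}
```

Merging adjacent grades removes disagreements that are only one step apart, but it also raises the chance agreement, so the coefficient does not necessarily increase for every data set.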
Our proposed system achieved an average rating of 3.36 for grammar, 3.81 for semantics and 3.80 for relevance. This means that the AQG system could be used to automatically generate questions for educational purposes. In that regard we evaluated the extent to which the generated questions could be used for educational purposes. Questions with a total score higher than 10 are considered useful. This analysis shows that 71.3% of the generated questions could be used for educational purposes. These results show that question generation inside the authoring tools of ITSs can be streamlined with good results. Such a component can reduce the time needed to create course content by generating questions from unstructured texts.
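The usefulness criterion can be expressed as a small sketch. It assumes that each question carries one score per level (for example the mean over the three experts), so that the total ranges from 3 to 15; how the experts' scores were combined into a total is an assumption here. With the reported averages, the mean total is 3.36 + 3.81 + 3.80 = 10.97, slightly above the threshold.

```python
from typing import Dict, List

LEVELS = ("grammar", "semantics", "relevance")
USEFUL_THRESHOLD = 10  # a question is useful if its total score exceeds this


def total_score(scores: Dict[str, float]) -> float:
    """Sum of the scores on the three evaluation levels."""
    return sum(scores[level] for level in LEVELS)


def useful_share(questions: List[Dict[str, float]]) -> float:
    """Fraction of questions whose total score is higher than the threshold."""
    if not questions:
        return 0.0
    useful = sum(1 for q in questions if total_score(q) > USEFUL_THRESHOLD)
    return useful / len(questions)
```

The comparison is strict, so a question with a total of exactly 10 is not counted as useful.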

CONCLUSION
In this article we presented a new approach to SRL of texts written in the Croatian language. We presented the results of a new SRL model that is based on CRF and custom-defined features. This approach showed very competitive results, with an overall increase of 6% in F1 score compared to the benchmark results. Using the SRL component we also presented an AQG model for which we defined custom question templates. Using this model, we generated and qualitatively evaluated 628 questions. The questions were evaluated for use in educational systems by expert university-level teachers. The results of the evaluation show that the system generates questions that are satisfactory for use in educational systems.
In the generated questions we noticed that pronouns greatly affect the quality of the questions. One solution to this problem is the use of a coreference resolution tool. We also noticed that the system has problems generating Who or What type questions. A possible solution is to use a word sense disambiguation (WSD) algorithm such as the Lesk algorithm [30] and to look up WordNet [31] or, in the case of Croatian, CROWN [32] entries to determine the animacy of the argument. Using the auxiliary verbs from the original sentence together with the predicate improves the grammatical quality of the questions. The system generates very unsatisfactory questions when a copular predicate is present. This could be fixed by applying constraints on the dependency tree of the sentence.
The quality of the generated questions could be further improved by using morphosyntactic information, background knowledge resources, named entities and coreference resolution. Compared with the AQG system for English presented in [9], our system performed very well, generating 71.3% of questions that are usable for educational purposes. This is a very good result, taking into account that Croatian is a resource-poor language that is highly inflected compared to English. For this reason, any system that automatically structures textual information is a great contribution to this language and could be used in any system that makes decisions based on knowledge.
One such system is an ITS that receives unstructured text as input and transforms it into content that the system can easily manipulate. With such content, the system can generate questions within the knowledge testing subsystem and generate instructional material within the teaching subsystem. This shows how SRL and AQG in ITSs can be an active topic for research. In future research we will try to improve our AQG system by using other natural language processing tasks and a knowledge base. We also plan to integrate our AQG system into an ITS and evaluate it in a real educational environment.