Establishing a Fusion Model of Attention Mechanism and Generative Adversarial Network to Estimate Students' Attitudes in English Classes

Abstract: With the rapid development of science and technology, artificial intelligence (AI) has been widely applied in various fields, and a new model of AI-aided education has emerged. In the education industry, AI-aided education can save teachers' energy, improve teaching efficiency and help to refine teaching methods. In order to estimate students' attitudes towards English teachers' lectures, this paper proposes an AI-aided feedback system. In the constructed system, DG-Net was used to expand the data set of student images, and the AlphaPose model, combined with an attention mechanism, was used to collect students' listening poses. The whole model provided feedback on students' listening postures in English listening and speaking classes, assisting teachers to estimate students' attitudes through data analysis and realizing AI-aided education in English classes.


INTRODUCTION
Traditional college English teaching is exam-oriented. Both teachers and students focus on how to solve the writing and reading tasks in the exams, leaving little attention for listening and speaking competence. Accordingly, the teaching methods are simple: teachers explain language points, and students memorize them mechanically. As a result, students' interest in English has gradually faded, and the teaching result is compromised. That is because both the teachers and the students forget that English, as a language, is used to communicate. Students who can read and write in English do not necessarily possess the ability to communicate in English. Cultivating English listening and speaking ability is a way to improve students' communicative competence. Only by listening and speaking more can students gradually develop language intuition and acquire the ability to communicate. Therefore, listening and speaking practices have been brought into English classrooms.
However, managing an English listening and speaking class is a demanding task for teachers. It is not possible for a teacher to hold a listening and speaking activity while monitoring every student's response and attitude or recording their preference for any particular teaching method. AI-aided education, an emerging classroom model, can help. This study used AI as an auxiliary feedback tool to estimate students' attitudes by analyzing students' poses in an English listening and speaking classroom. With the application of an attention mechanism and a generative adversarial network, students' attitudes were obtained. The model refined the students' poses and performed data analysis, avoiding the subjective factors of manual analysis. The analysis results are intuitive, visual and accurate.

RELATED RESEARCH
Researchers have developed a number of attention mechanism models. Jaderberg et al. developed Spatial Transformer Networks, which transform spatial information through an attention model to realize image rotation and zoom transformation [1]. Wang et al. showed that the Non-local method could capture the long-range dependence of one pixel on other pixels by calculating the autocorrelation matrix [2]. Du proposed the Interaction-aware Attention model, in which a new loss function based on PCA was designed on the basis of the non-local computation of the covariance matrix to achieve better global interaction between features in the channel dimension [3]. Huang developed CCNet from Non-local, which reduced the calculation of the autocorrelation matrix from all pixels in the image to the pixels on the criss-cross path, thus greatly reducing the amount of calculation [4]. Similarly, to reduce the amount of computation, Li et al. developed the EMA model, which combined the attention mechanism with the EM algorithm, obtaining a set of bases by expectation maximization and running the attention mechanism on that set of bases [5]. Hu et al. put forward a brand-new "feature recalibration" strategy, SENet, which modeled the dependencies among feature channels, giving more weight to effective feature channels and ignoring invalid ones [6]. The attention mechanism proposed by Zhang realized self-adaptive adjustment of channel characteristics by calculating the dependencies among channels [7]. He suggested calculating the channel weights according to the activation values of the channels around the target position [8]. Yu et al. designed a smooth network to select more distinctive features through a channel attention block and global average pooling, in order to solve the intra-class inconsistency in semantic segmentation [9]. Wang added channel-domain attention to target tracking based on offline training; through this mechanism, the channels with better tracking effect were given greater weight, and the noisy channels were directly deleted [10]. Zheng et al. gave the current frame a weight by spatially measuring the relationship between the previous frame and the current frame, which is essentially a channel-domain attention mechanism [11]. Woo et al. put forward the Convolutional Block Attention Module [12]. In the channel domain, this module is basically similar to SENet, first obtaining a one-dimensional vector by channel compression and then operating on that vector; in the spatial domain, the feature maps obtained by max pooling and average pooling are spliced together and then convolved. Cao et al. developed the Global Context (GC) block, which integrated Non-local and SENet [13]. After simplifying Non-local, it integrated channel-domain attention and realized a modeling mode without Query dependence. Fu et al. put forward the Dual Attention Network, which is mainly a fusion variant of CBAM and Non-local [14]. It used the idea of Non-local in the channel domain and the spatial domain respectively, and utilized the autocorrelation matrix to capture long-distance dependence; finally, the attention outputs of the two domains were fused as the final output. The mixed-domain attention mechanism (RANet) proposed by Wang et al. drew on the idea of residual networks: it constructed a residual attention network by stacking attention modules with identity mappings, realizing the training of a deep residual attention network that can easily be extended to hundreds of layers [15]. Huang improved the object detector by using an efficient fine-grained mechanism called Inverted Attention (IA) [16].
The generative adversarial network (GAN) was proposed by Goodfellow in 2014 [17]. With the high-definition images generated through the adversarial contest, it has attracted extensive attention from researchers. Early GANs had unstable factors: in the iteration controlled by the loss function, the loss value often failed to decrease. To solve this problem, Radford et al. proposed the DCGAN network, in which the network structure was improved and a stable GAN was obtained [18]. Arjovsky et al. analyzed theoretically the reasons why GAN training was prone to collapse and proposed WGAN, which replaced the loss function of GAN with the Wasserstein distance and thus further improved the network performance [19]. DG-Net, different from the previous networks, does not need information outside the data set to generate data, thus greatly raising the re-identification baseline and performing stably on many open-source data sets.
Pose estimation is a basic challenge in computer vision. Early single-person pose estimation employed the traditional tree model [20], the random forest [21] and the conditional random field model [22]. With the development of deep learning, the traditional methods have become far less accurate than deep learning models; therefore, DeepPose [23], DNN models [24] and CNN models [25] have been widely used. However, in single-person pose estimation, neither the traditional methods nor the deep learning models can accurately estimate the pose unless the person is correctly located. This problem has been solved by the regional multi-person pose estimation (RMPE) framework [26], which increases the accuracy not only of single-person but also of multi-person pose estimation.

METHOD

Attention Mechanism and Modules
The attention mechanism originated from the study of human vision. Research on the human eye has shown that only the fovea of the retina has the greatest sensitivity. Therefore, in order to use the limited sensitive area of the eye efficiently, people tend to choose the area that deserves the most attention and focus on it. This "weight-selective" processing mechanism of the human body is called the attention mechanism. Relying on it, people can process the immeasurable amount of information they receive through vision in an orderly way. Similarly, in the era of information explosion, the field of deep learning has been bombarded by big data. To cope with excessive data input, the attention mechanism of the human eye has been applied to deep learning, where it has become one of the most popular data processing modules.
Attention mechanisms are classified into item-wise soft attention, item-wise hard attention, location-wise soft attention and location-wise hard attention. Soft attention is static and focuses on space and channels, while hard attention is dynamic and focuses on the process involved. The mechanism of soft attention is subtle: when used in conjunction with a deep learning network model, its weights can be updated in the back propagation of the network model, so it is mainly used in the field of deep learning. According to the attention domain, soft attention can be divided into spatial attention, channel attention and fusion attention. Spatial attention generally performs channel compression on the input feature maps first, which greatly reduces the number of parameters and simplifies the calculation. Channel attention pools the whole input feature map to obtain a one-dimensional vector and carries out feature interaction on this vector to determine the dependencies between channels; different channel attention mechanisms differ in how they handle this feature interaction. Fusion attention involves both spatial attention and channel attention: it processes the spatial domain first and then the channel domain, the channel domain first and then the spatial domain, or the two domains in parallel.
The attention mechanism used in this study is the Convolutional Block Attention Module (CBAM), in which spatial attention and channel attention are combined and coordinated in order to improve the performance of the module.

Channel Attention Module
As Fig. 1 shows, the feature map with the size of S × B × C was input into the channel attention module. The C channels of the feature map were average-pooled and max-pooled to obtain two 1 × 1 × C feature vectors. These vectors passed through a two-layer neural network with shared weights, whose hidden layer used the ReLU activation function. The two resulting feature vectors of the same dimension were added and activated by the Sigmoid function to yield the weight coefficient W_c. Finally, W_c was multiplied with the input to obtain the output of the module.
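The channel attention computation described above can be sketched in a few lines. The following is a minimal NumPy version; the weight shapes and reduction ratio are illustrative assumptions, and a real implementation would use a deep learning framework:

```python
import numpy as np

def channel_attention(x, w0, w1):
    """Channel attention for a feature map x of shape (S, B, C).

    w0 and w1 are the shared two-layer MLP weights (hypothetical shapes:
    w0 maps C -> C//r, w1 maps C//r -> C for a reduction ratio r).
    """
    # Squeeze the spatial dims: average- and max-pool over S x B -> two C-vectors
    avg = x.mean(axis=(0, 1))
    mx = x.max(axis=(0, 1))

    def mlp(v):
        h = np.maximum(0.0, w0 @ v)   # ReLU after the hidden layer
        return w1 @ h                  # second layer, no activation

    # Add the two branch outputs and squash with a Sigmoid to get W_c in (0, 1)
    wc = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))
    return x * wc                      # re-weight every channel of the input
```

Because the two pooled vectors share one MLP, the module adds very few parameters relative to the backbone it is attached to.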

Spatial Attention Module
As Fig. 2 shows, the feature map with the size of S × B × C was input into the spatial attention module, and each pixel of the map was average-pooled and max-pooled along the channel dimension to obtain two feature maps of size S × B. The concatenated maps were convolved and activated by Sigmoid to obtain W_s. Finally, the input was multiplied by W_s to get the module output.
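The spatial branch admits a similarly compact sketch. This NumPy version uses a naive "same" convolution; the kernel size is an assumption (CBAM commonly uses a larger kernel), and a framework convolution would replace the explicit loop:

```python
import numpy as np

def spatial_attention(x, kernel):
    """Spatial attention for a feature map x of shape (S, B, C).

    kernel is a (k, k, 2) convolution kernel applied to the stacked
    channel-wise average and max maps (k is a hypothetical kernel size).
    """
    # Pool along the channel axis -> two S x B maps, stacked as S x B x 2
    avg = x.mean(axis=2)
    mx = x.max(axis=2)
    stacked = np.stack([avg, mx], axis=2)

    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(stacked, ((pad, pad), (pad, pad), (0, 0)))

    # Naive 'same' convolution producing one S x B response map
    S, B = avg.shape
    resp = np.empty((S, B))
    for i in range(S):
        for j in range(B):
            resp[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)

    ws = 1.0 / (1.0 + np.exp(-resp))   # Sigmoid -> weights W_s in (0, 1)
    return x * ws[:, :, None]          # re-weight every spatial position
```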

CBAM
As Fig. 3 shows, CBAM, a mixed-domain attention mechanism, was divided into two independent modules: a channel attention module and a spatial attention module. This mixed mechanism saved model parameters and reduced the amount of calculation, making the use of CBAM more convenient. The formulas are as follows:

W_c = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))
W_s = σ(CONV([AvgPool(F); MaxPool(F)]))

where F is the input of the module, F_avg and F_max are the average-pooled and max-pooled channel vectors, W_0 and W_1 are the weight coefficients of layer 1 and layer 2 of the shared network, σ is the Sigmoid function, [·;·] denotes concatenation, and the weights are combined with the input by element-wise multiplication.

Generative Adversarial Networks (GAN)
A GAN consists of a generator and a discriminator. By learning the features of the real images in the data set, the generator produces "pseudo-images" with high similarity to the real ones, aiming to generate pictures that cannot be told apart by the discriminator. The discriminator tries to distinguish the real images from those produced by the generator. The two networks contest with each other to improve the realism of the generated data. Nowadays DG-Net, a particular kind of GAN, is widely used in various fields.
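The contest between the two networks comes down to a pair of opposed losses. The following is a minimal sketch of the standard (non-saturating) GAN objective in NumPy, not of DG-Net's specific losses:

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for discriminator outputs p in (0, 1)."""
    eps = 1e-12
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps)).mean()

def gan_losses(d_real, d_fake):
    """Standard GAN losses from discriminator outputs.

    d_real: D's scores on real images; d_fake: D's scores on generated images.
    """
    d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)  # D: push real -> 1, fake -> 0
    g_loss = bce(d_fake, 1.0)                     # G: fool D into scoring fake as real
    return d_loss, g_loss
```

Minimizing g_loss while minimizing d_loss is exactly the min-max contest: each network's improvement raises the other's loss until the generated data become hard to distinguish.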

The Generator
The generation module of DG-Net consisted of self-identity generation and cross-identity generation. Self-identity generation occurred when two sample images with the same ID (the same person) passed through the generator (Fig. 4a). The two images differed slightly in characteristics such as clothes, posture and position; in that case, the generated images retained the same ID. Cross-identity generation occurred when two sample images with different IDs (different people) passed through the generator (Fig. 4b). The generator exchanged the features of the samples and provided the original ID with different features.

The Discriminator
As Fig. 5 shows, the sample images of the discriminative learning module first passed through the appearance encoder, and the primary feature was separated from the fine-grained feature. The appearance encoder then mapped the decomposed features to better predict the ID of the sample. The discrimination module used three losses to control the final result, namely the ID loss, the appearance loss and the fine-feature loss. The ID loss is the cross-entropy loss L_id = E[−log p(k | x)], where k is the digital code of the sample ID.
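The ID loss mentioned above is, in effect, a cross-entropy over identity classes. A minimal NumPy sketch, assuming a softmax parameterization over the identity codes:

```python
import numpy as np

def id_loss(logits, k):
    """Cross-entropy ID loss: -log p(k | x) for identity code k.

    logits: unnormalized scores over all identity classes for one sample.
    """
    z = logits - logits.max()            # stabilize the softmax numerically
    log_p = z - np.log(np.exp(z).sum())  # log-softmax over the classes
    return -log_p[k]                     # negative log-probability of the true ID
```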

DG-Net
As Fig. 6 shows, DG-Net innovatively integrated the discriminator into the generator, shared the appearance encoder with the generation module, enabled online learning and benefited both the generator and the discriminator, thus forming a complete adversarial network framework. The inputs to the system are three images, X_j, X_i and X_t; X_i and X_t have the same ID but different features, while X_j has a different ID. X_j passes through the structure module only, X_t passes through the appearance module only, and X_i passes through both the structure and appearance modules. The IDs and the features are encoded and reconstructed to generate new picture data, which serve as online feedback for the appearance encoding. After that, the improved appearance encoder feeds the reconstructed image and the real image together to the discriminator for the min-max contest. There were several losses in the whole network, seven in the generator and two in the discriminator, and the whole network loss was their weighted sum:

L_total = λ_img L_img + λ_code L_code + λ_id L_id + L_adv

where L_img is the reconstruction loss of the same-ID image, L_code is the reconstruction loss of the codes, L_id is the ID loss, L_adv is the adversarial loss, and each λ is a weight coefficient which controls the relation between the different losses.

Pose Analysis
There are two methods of human posture analysis: the top-down method and the bottom-up method. The former first locates the positions of all the people, frames each person, and finally estimates the posture of the person in each frame one by one. The latter begins by locating all the joint points, then connects the joint points to form human bodies, and finally estimates the postures. Each method has its own drawbacks: the results of the former depend exclusively on the localization of the target human bodies, while the latter tends to confuse the joint points of densely distributed human targets.
AlphaPose, which offers both speed and accuracy, is a top-down method that mainly includes three modules: Symmetric STN + SPPE, parametric pose non-maximum suppression (p-NMS) and the pose-guided proposals generator (PGPG). In this study, three technologies, namely the symmetric spatial transformer network (SSTN), the deep proposals generator (DPG) and p-NMS, were used to solve the problem of multi-person posture estimation. The SSTN was added to the structure of single-person pose estimation (SPPE) to optimize the structure and extract high-quality human body regions despite inaccurately located human body frames. p-NMS was used to remove detection redundancy; the structure had its own pose estimation scheme to compare the similarity between poses, optimizing the pose distance parameters with a data-driven method. PGPG was used to enhance the training data by learning the description information of different poses in the output results and simulating the generation process of human body region frames.

Symmetric STN + SPPE
As Fig. 7 shows, this module consisted of the spatial transformer network (STN) [1], the spatial de-transformer network (SDTN) and SPPE. The module estimated the pose from an imprecise input box, and the pose estimation result was then mapped back to the original image. In this way, the position of the input box was constantly adjusted until it became a precise input.

NMS
When the human images were identified by the detector, a lot of potential windows of pedestrians were obtained, and every window was scored. Since each person got multiple windows and most of the windows overlapped heavily, non-maximum suppression (NMS) was needed to select the windows with the highest scores and suppress the windows with low scores. This module was used to eliminate the redundant boxes and accurately locate the human positions.
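The suppression loop described above can be illustrated with a standard greedy box-IoU NMS. Note that AlphaPose's parametric pose NMS scores pose similarity rather than plain box overlap, so this NumPy sketch only shows the generic suppression procedure; the box format and threshold are assumptions:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) window scores.
    Returns the indices of the kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]          # process windows by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top-scoring box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # suppress heavily overlapping windows
    return keep
```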

AlphaPose
As Fig. 8 shows, the RMPE module got the raw frames of the human positions using a target detection algorithm such as YOLO. Then the human postures were detected through the STN + SPPE + SDTN module, and redundant detections were removed through NMS. The PGPG part of the module was finally used to augment the training data set.
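The three-stage pipeline above can be sketched with placeholder components. All the names here are hypothetical stubs standing in for the YOLO detector, the STN + SPPE + SDTN pose network and pose NMS, not the actual AlphaPose API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple      # (x, y, w, h) human proposal from the detector
    score: float

def estimate_poses(frame, detector, sstn_sppe, pose_nms):
    """Top-down pipeline sketch: detect people, estimate each pose, de-duplicate.

    detector, sstn_sppe and pose_nms are injected callables so the flow of
    data between the three stages stays explicit.
    """
    detections = detector(frame)                           # raw human proposals
    poses = [sstn_sppe(frame, d.box) for d in detections]  # one pose per proposal
    return pose_nms(poses)                                 # drop redundant poses
```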

Experimental Environment and Data Set
The experimental data were real pictures of students in English classes. DG-Net was constructed on an OMEN by HP desktop PC 880-p1xx. The computer's processor was an Intel(R) Core(TM) i7-9700 CPU with 16 GB of RAM, equipped with an NVIDIA GeForce RTX 2080 Ti GPU, and the running environment was Python + PyTorch. AlphaPose was run on a Linux server with Ubuntu 16.04, whose processor was an Intel(R) Xeon(R) Silver 4110 CPU, equipped with four NVIDIA GeForce RTX 2080 Ti GPUs, and the running environment was Docker + Anaconda + Python + PyTorch.

Attention Mechanism and DG-Net Fusion Pose Estimation Model
As Fig. 9 shows, the module used AlphaPose to estimate the students' poses. To better cope with the large number of students in an English class, an attention module was added in the process of pose estimation. The attention module was embedded in the AlphaPose network, which reduced the network parameters and the time complexity. For the input data set, the DG-Net network was applied to augment the image data of the English class, solving the problem of the small data set.

Results
As Fig. 10 shows, the network worked well. Although the input image data formed a small sample and could easily cause under-fitting, the addition of DG-Net expanded and enhanced the input data set, which made the whole model fit better. Even when students' bodies overlapped or appeared sideways, the attitude information of each student could still be accurately obtained. The CBAM module, integrated into the AlphaPose network, solved the problem of an overly large and slow network, providing speed support for the real-time monitoring system of subsequent research. In addition to displaying the pose skeletons on the original image, the network also generated a related json file (Fig. 11). The file data, such as the ID of each student, the ID of every image, and the coordinates of each joint, are listed in Tab. 1. There are 17 joint points in the json output. In the json file, all the students' posture skeleton information was output digitally. This enabled manual checks of the accuracy of students' posture estimation on unlabeled data sets, as well as alarms for abnormal student poses.
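A record in such a json file can be regrouped into per-joint triples for downstream analysis. The following Python sketch assumes the layout of Tab. 1 (a list of records keyed image_id, category_id, keypoints and score, with keypoints as a flat [x, y, v] × k list); the exact field names in any given AlphaPose version may differ:

```python
import json

def load_poses(path):
    """Parse a Tab. 1-style json result into per-student joint lists."""
    with open(path) as f:
        records = json.load(f)
    poses = []
    for rec in records:
        kp = rec["keypoints"]
        # Regroup the flat [x, y, v, x, y, v, ...] list into (x, y, v) triples
        joints = [(kp[i], kp[i + 1], kp[i + 2]) for i in range(0, len(kp), 3)]
        poses.append({"image": rec["image_id"], "person": rec["category_id"],
                      "joints": joints, "score": rec["score"]})
    return poses
```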

CONCLUSION
This study proposes a fusion network to estimate students' attitudes in an English listening and speaking class.
It is a new AI-aided language teaching method. The overall accuracy of the attitude estimation model is high even in a complex environment where the number of students is large and body overlap is serious. The proposal is a feasible scheme for a future AI-aided education system. In this study, deep learning is innovatively added to traditional education, so that teachers can capture students' attitudes in class in a convenient, quick and accurate way, thus improving educational methods. However, the model does not work well for students at the edge of the picture or standing sideways; in those cases, the students will be missed or their poses will be lost due to low scores in the NMS process. Therefore, follow-up research will focus on solving these problems.

Figure 1 Channel attention module

Figure 2 Spatial Attention Module

Figure 4 (a) self-identity generation; (b) cross-identity generation

Figure 5 Discriminative RE-ID learning

Figure 9 AAD network model

Figure 11 json file

Table 1 Interpretation of json file elements
Element      Format          Interpretation
image_id     int             id of the image
category_id  int             id of the person
keypoints    [x, y, v] × k   the abscissa x, the ordinate y and the point mark v of each key point; k is the number of keypoints
score        int             score
box          [x, y, w, h]    the abscissa x and ordinate y of the upper left corner of the frame, and the width w and height h of the frame
idx          [0.0]           --------