Image Completion Based on Edge Prediction and Improved Generator

Abstract: Existing image completion algorithms often complete missing regions poorly, produce over-smoothed or structurally chaotic results, and require long training times when processing complex images. To address these problems, a two-stage adversarial image completion model based on edge prediction and an improved generator structure is proposed. First, Canny edge detection is used to extract the edge image of the damaged image, and the edge prediction network predicts and completes the edge information of the missing area. Second, the image completion network takes the predicted edge image as prior information to complete the damaged area, so that the structural information of the completed area is more accurate. Because the heavy use of dilated convolution in the autoencoder structure of existing models incurs an enormous amount of computation, an A-JPU module is designed to preserve the completion quality while speeding up training. Finally, experimental results on the Places2 dataset show that the proposed model achieves higher PSNR and SSIM, and a subjective visual effect closer to the real image, than several other image completion models.


INTRODUCTION
Image completion means that a computer infers the content of the damaged area of an image from the known information in the damaged image. Its purpose is to make the completed image visually close to the real image. However, human vision is keenly sensitive to traces of image completion. Making the completion result realistic enough that the human eye cannot distinguish the completed image from the real image is therefore an essential research topic in image completion.
Traditional image completion methods [1][2][3] are mainly diffusion-based or patch-based. They typically use variational algorithms and patch similarity to search the undamaged area of the image for pixel blocks that are highly similar to the surroundings of the missing area and use them to fill it in. Most of these methods rely on low-level image features, so they can only handle tasks in which simple texture information is missing and have difficulty with complex semantic information in the missing areas. In recent years, with the rapid development of Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN) [4], these deep learning techniques have been adopted in image completion and have brought great progress [5][6][7][8][9].
In 2016, Pathak et al. [5] applied GAN to image completion for the first time. They proposed the Context Encoder, an unsupervised image completion algorithm based on context content, which combines a reconstruction loss and an adversarial loss so that the completed image is structurally coherent and conforms to the semantics of the entire image. However, the CNN could not extract information from distant areas of the image, which led to partially blurred and structurally distorted results. To address partial blurring and poor overall consistency in the completed image, Iizuka et al. [7] added dilated convolution to the autoencoder structure of the generator. They also designed a local discriminator focused on details and trained the generation network with both global and local discriminators, so that the completed image conforms to the global semantics and is sharper in the local area. Moreover, Yu et al. [8] introduced a spatial attention module [11] and proposed a feed-forward generation network with a contextual attention module, which enables the network to select effective information from distant locations. In their method, the occluded input image is passed through a completion network with dilated convolution to obtain the repair result, which is then fed to the global and local discriminators to obtain a high-quality completion.
Image completion based on deep learning has greatly improved quality compared with traditional methods. However, when a large structural area of an image is missing, the above methods cannot recover the structural information well, and their results are not close to the real image. In addition, deep learning-based methods use a large number of dilated convolutions in the autoencoder structure of the generator to improve completion quality, because dilated convolution obtains a larger receptive field without changing the size of the convolution kernel. However, this greatly increases memory usage, which lengthens model training and lowers overall efficiency.
In response to the above problems, an Image Completion Model based on Edge Prediction (ICMEP) with an improved generator is proposed in this paper. The ICMEP model is divided into two stages. In the first stage, the edge prediction network completes the edge information of the missing area in the image. In the second stage, the image completion network completes the damaged area according to the completed edge information, which makes the edges of the completed image clear and reasonable. Finally, an Attention Joint Pyramid Upsampling (A-JPU) module replaces the large number of dilated convolutions that traditional networks use in the autoencoder structure of the completion network. This preserves the completion quality while accelerating model training and improving efficiency.

MODEL STRUCTURE DESIGN
The overall framework of ICMEP is shown in Fig. 1. ICMEP consists of the edge prediction network and the image completion network. The edge prediction network predicts and generates edge structure information for the missing part of the image that is close to the real edges. The image completion network then completes the image based on the damaged image and the edge structure information generated by the edge prediction network, producing a complete image with a realistic visual effect.

Edge Prediction Network
The edge prediction network is composed of an edge generation network (G1) and an edge discrimination network (D1). The network parameter structure diagram is shown in Fig. 2. The edge generation network (G1) uses an autoencoder structure. The convolution kernels of its convolutional layers are 7 × 7, 4 × 4 and 4 × 4. The transposed convolutional layers mirror this setting, and the activation function is ReLU. The intermediate layer between the encoder and the decoder is composed of 4 consecutive residual blocks, which extract deep-level edge texture features and avoid the vanishing gradient problem caused by an overly deep network. The edge discrimination network (D1) is composed of 5 convolutional layers with 4 × 4 convolution kernels and LeakyReLU activation functions. The input of the edge generation network is composed of the binary mask image of the occlusion area, the gray image of the occluded image, and the edge image extracted by the Canny edge detector, so the input size is 256 × 256 × 3. Through alternate training of the edge generation network and the edge discrimination network, the edge prediction image generated by the edge generation network becomes closer to the real edge image.
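The three-channel input described above can be illustrated with a minimal pure-Python sketch. The function name `build_edge_generator_input`, the channel-first layout, and the mask convention (1 marks a missing pixel) are assumptions made for illustration and are not specified in the paper; a real pipeline would operate on 256 × 256 tensors rather than nested lists.

```python
def build_edge_generator_input(gray, edge, mask):
    """Stack masked gray image, masked edge map and mask into 3 channels.

    gray, edge, mask: H x W nested lists of numbers.
    Assumption: mask[y][x] == 1 marks a missing (occluded) pixel.
    Returns a channel-first list of shape 3 x H x W.
    """
    height, width = len(gray), len(gray[0])
    # Zero out the occluded region so only known pixels are visible.
    masked_gray = [[gray[y][x] * (1 - mask[y][x]) for x in range(width)]
                   for y in range(height)]
    masked_edge = [[edge[y][x] * (1 - mask[y][x]) for x in range(width)]
                   for y in range(height)]
    return [masked_gray, masked_edge, mask]


# Toy 2 x 2 example; the top-right pixel is occluded.
gray = [[0.2, 0.8], [0.5, 0.1]]
edge = [[0, 1], [1, 0]]
mask = [[0, 1], [0, 0]]
example = build_edge_generator_input(gray, edge, mask)
print(len(example), len(example[0]), len(example[0][0]))
```

Passing the mask as its own channel lets the generator distinguish pixels that are genuinely dark or edge-free from pixels that are simply unknown.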

Image Completion Network
The image completion network is composed of a generation network (G2) and a discrimination network (D2). The discrimination network (D2) has the same structural parameters as the edge discrimination network D1. The parameter structure diagram of the generation network G2 is shown in Fig. 3. The input of the image completion network is formed by concatenating the occluded image and the edge map generated by the edge prediction network along the channel dimension, so the input size is 256 × 256 × 4.

Figure 3 G2 structure parameter diagram
The generation networks of traditional deep learning-based image completion models generally use a large number of dilated convolutions [8][9][10] as the connection layer between the encoder and the decoder, which yields a larger receptive field without changing the convolution kernel size. Dilated convolution plays an important role in obtaining the final feature map, but a large number of dilated convolutions increases computational complexity and memory usage, resulting in long training times and low overall efficiency. Moreover, most of the semantic information is encoded in the final feature map produced by the dilated convolutional layers, while the fine structural information of the image is lost. This results in inaccurate edge information and a blurred completion.
To reduce the time and memory consumed by dilated convolution, the A-JPU module (Fig. 4), a spatial pyramid structure based on the attention mechanism, has been designed with reference to the structure of the JPU joint up-sampling module [12]. The A-JPU module can replace the traditional dilated convolutional layers to extract image context information from different receptive fields. In the image completion network, five convolutional layers are used for feature extraction, and the feature maps of the last three convolutional layers of the encoding structure are selected as the input of the A-JPU module, so that cross-level multi-scale feature maps are used. In the A-JPU module, dilated convolutional layers with different dilation rates are treated as multi-scale convolutions in order to analyze the feature maps from different layers. Four parallel dilated convolutions with dilation rates of 1, 2, 4 and 8 are designed to obtain deep features at different scales. A convolution kernel with a smaller dilation rate pays more attention to image details. A convolution kernel with a larger dilation rate pays less attention to detail textures, but it can process local spatial information well. The spatial pyramid structure is then combined with the spatial attention module, which is added after each dilated convolutional layer.
After weighting by the spatial attention module, the obtained multi-scale local contextual features are enhanced. To obtain a better completion, the multi-scale features are concatenated in the feature fusion layer. In addition, the local detail texture information is embedded into the spatial information, and the channel attention module is used as a semantic descriptor to compute and select the important channels. Here, average pooling and maximum pooling are used to take the spatial positions of all neurons into account and to highlight the spatial positions with high activation [13]. This extracts global semantic information of the image to constrain the feature semantics from different scales, thereby suppressing useless deep context information. Finally, a complete image is obtained through three transposed convolutional layers. By alternately training the generation network and the discrimination network, the image completed by the generation network is made as close to the real image as possible.
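To make the multi-scale effect of the four dilation rates concrete, the following sketch computes the effective kernel size of a dilated convolution using the standard relation k_eff = k + (k - 1)(d - 1). The 3 × 3 base kernel is an assumption for illustration; the paper does not state the kernel size used inside the A-JPU module.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that keeps the output size equal to the input size
    for stride 1 and an odd kernel size."""
    return dilation * (kernel_size - 1) // 2


KERNEL_SIZE = 3             # assumption: 3 x 3 kernels in each branch
DILATION_RATES = (1, 2, 4, 8)  # the four parallel rates named in the text

extents = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATION_RATES}
paddings = {d: same_padding(KERNEL_SIZE, d) for d in DILATION_RATES}

for d in DILATION_RATES:
    print(f"dilation {d}: covers {extents[d]} x {extents[d]} pixels, "
          f"padding {paddings[d]}")
```

Under this assumption every branch has the same number of weights, yet the branches cover extents from 3 × 3 up to 17 × 17 pixels, which is why parallel rates can act as a multi-scale feature extractor.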
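The channel selection step described above can be sketched as follows. This is a simplified illustration in the spirit of pooling-based channel attention, not the paper's implementation: the function name `channel_attention`, the optional `shared_mlp` argument (an identity placeholder standing in for learned layers), and the sigmoid gating are all assumptions.

```python
import math


def channel_attention(features, shared_mlp=None):
    """Reweight channels using average- and max-pooled descriptors.

    features: list of C channels, each an H x W nested list of floats.
    shared_mlp: hypothetical stand-in for the learned layers applied to
        both descriptors; defaults to the identity for illustration.
    Returns (weights, reweighted_features).
    """
    if shared_mlp is None:
        shared_mlp = lambda descriptor: descriptor
    # Average pooling summarizes every spatial position of a channel.
    avg_desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Max pooling highlights the most strongly activated position.
    max_desc = [max(map(max, ch)) for ch in features]
    avg_out, max_out = shared_mlp(avg_desc), shared_mlp(max_desc)
    # A sigmoid gate turns the combined descriptor into a weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-(a + m)))
               for a, m in zip(avg_out, max_out)]
    reweighted = [[[w * v for v in row] for row in ch]
                  for w, ch in zip(weights, features)]
    return weights, reweighted


# Toy input: one inactive channel and one strongly activated channel.
toy = [[[0.0, 0.0], [0.0, 0.0]],
       [[1.0, 3.0], [1.0, 3.0]]]
weights, reweighted = channel_attention(toy)
print(weights)
```

In this toy case the inactive channel receives a weight of 0.5 and the active channel a weight close to 1, so more strongly activated channels are emphasized relative to weakly activated ones.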

LOSS FUNCTION
The loss function of the ICMEP model consists of two parts: the edge prediction network loss function $L_{G_1}$ and the image completion network loss function $L_{G_2}$.

Edge Prediction Network Loss Function
Let $I$ denote a real image in the training data set, $I_E$ the edge image of $I$, and $I_g$ the gray image of $I$. In the edge prediction network, the input of the edge generation network $G_1$ consists of the grayscale image, the edge image and the mask of the damaged image, and its output is the predicted edge image $E_{pred}$ (Eq. (1)), where $G_1$ is the function represented by the edge generation network. The generated $E_{pred}$ and $I_E$ are jointly input to the edge discrimination network for training, so that it can distinguish whether an input edge image was generated by the edge generation network or is the edge image of the original image. The loss function of the edge prediction network is given in Eq. (2) and is the weighted sum of the adversarial loss and the feature consistency loss. Here $D_1$ is the function represented by the discrimination network in the edge prediction network, $L_{adv1}$ is the adversarial loss in the edge prediction network, and $L_F$ is the feature consistency loss. $\lambda_{adv1}$ and $\lambda_F$ are the balancing hyperparameters of the adversarial loss and the feature consistency loss, set to $\lambda_{adv1} = 0.1$ and $\lambda_F = 1$. The definition of $L_{adv1}$ is given in Eq. (3).
The feature consistency loss $L_F$ is similar in form to the perceptual loss [14,15], which is defined by the $L_2$ distance between the feature map and the activation map of the pre-trained VGG network [16]. However, because the VGG network has not been trained to generate edge information, it cannot produce the required feature maps. Therefore, the feature consistency loss $L_F$ is defined by the $L_1$ distance between the activation maps in each convolutional layer of the edge discrimination network. The specific definition is given in Eq. (4), where $L$ is the last layer of the edge discrimination network.
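Since the equations themselves are not reproduced in this text, the following block gives one plausible written-out form of the weighted sum and of the feature consistency loss, based only on the verbal description above. The notation $D_1^{(i)}$ for the activation map of the $i$-th layer of the edge discrimination network and the per-layer normalization $N_i$ (the number of elements in that activation map) are assumptions and may differ from the paper's Eqs. (2) and (4).

```latex
\begin{align}
L_{G_1} &= \lambda_{adv1} L_{adv1} + \lambda_F L_F,
  \qquad \lambda_{adv1} = 0.1,\ \lambda_F = 1, \\
L_F &= \mathbb{E}\left[ \sum_{i=1}^{L} \frac{1}{N_i}
  \left\lVert D_1^{(i)}(I_E) - D_1^{(i)}(E_{pred}) \right\rVert_1 \right].
\end{align}
```

With $\lambda_F = 1$ and $\lambda_{adv1} = 0.1$, the feature consistency term carries ten times the weight of the adversarial term in this sum.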

Image Completion Network Loss Function
The image completion network takes the damaged color image and the completed edge map as input. Its loss function $L_{G_2}$ combines the reconstruction loss $L_r$, the adversarial loss $L_{adv2}$, the perceptual loss $L_{perc}$ and the style loss, with hyperparameters $\lambda_{adv2} = \lambda_p = 0.1$, $\lambda_r = 1$ and $\lambda_s = 10$. The adversarial loss $L_{adv2}$ is similar to Eq. (3) and is given in Eq. (7). The reconstruction loss $L_r$ is defined by the $L_1$ distance between the real sample and the generated sample. The perceptual loss $L_{perc}$ uses the $L_2$ distance between the activation feature maps of the pre-trained VGG19 network to penalize results that are perceptually dissimilar to the real sample; its definition is given in Eq. (9), where $\phi_i$ is the activation feature map of the $i$-th layer of the pre-trained VGG19 network, and $i$ ranges over the five layers relu1_1, relu2_1, relu3_1, relu4_1 and relu5_1. These activation feature maps are also used to calculate the style loss [17]. The style loss is an effective tool against the checkerboard artifacts caused by the transposed convolutional layers [18]. Its definition is given in Eq. (10).
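The way the four terms and their hyperparameters interact can be sketched in a few lines of Python. The weighted-sum form is inferred from the hyperparameters listed above, and the Gram-matrix helper shows the usual building block of a style loss; the function names and the Gram normalization by C × N are assumptions, not definitions taken from the paper.

```python
# Hyperparameters as stated in the text.
LAMBDA_R, LAMBDA_ADV2, LAMBDA_P, LAMBDA_S = 1.0, 0.1, 0.1, 10.0


def l1_distance(real, generated):
    """Mean absolute difference between two flat lists of pixel values."""
    return sum(abs(a - b) for a, b in zip(real, generated)) / len(real)


def gram_matrix(features):
    """Channel correlation matrix of a C x N feature map.

    Assumption: normalized by C * N, one common convention.
    """
    channels, positions = len(features), len(features[0])
    norm = channels * positions
    return [[sum(a * b for a, b in zip(features[i], features[j])) / norm
             for j in range(channels)]
            for i in range(channels)]


def completion_generator_loss(l_r, l_adv2, l_perc, l_style):
    """Assumed weighted sum of the four loss terms of the completion network."""
    return (LAMBDA_R * l_r + LAMBDA_ADV2 * l_adv2
            + LAMBDA_P * l_perc + LAMBDA_S * l_style)


# Toy values chosen only to show the relative weighting.
reconstruction = l1_distance([0.0, 1.0, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5])
total = completion_generator_loss(0.5, 1.0, 2.0, 0.01)
print(reconstruction, total)
```

The large weight on the style term reflects that style losses computed from Gram matrices tend to be numerically small compared with pixel-level reconstruction losses.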

EXPERIMENTAL RESULTS AND ANALYSIS
The experiments on the ICMEP model are carried out on a host computer with an Intel Core i7-7700 CPU, an Nvidia GTX 1060 6 GB GPU and 16 GB of memory, using CUDA 10.0 and the PyTorch 1.0 deep learning framework. The model is trained on 1.8 million images from 365 scene categories of the Places2 [19] data set at a resolution of 256 × 256, together with 20000 randomly generated irregular mask images from [20] at the same resolution. The training batch size is 4, the optimizer is Adam, and the learning rate is set to $10^{-5}$. The test set is randomly selected from the 18250 validation images that come with the Places2 data set.

Edge Prediction Network Performance Comparison
The ICMEP model is divided into two stages. To demonstrate the effectiveness of the edge prediction network, two models are trained in the same environment on the Places2 dataset with the image completion network of this paper: one uses edge prediction and the other does not. The subjective visual results on the test set are shown in Fig. 5. The comparison shows that, without edge prediction, the edge structure in the completed area is chaotic and blurred, whereas the ICMEP model with edge prediction completes the structural information of the area more accurately and reasonably, with a visual effect closer to the real image. The objective evaluation results are shown in Tab. 1. Compared with the model without edge prediction, using edge prediction increases PSNR by 1.54 dB and SSIM by 0.036. Both the objective indicators and the subjective visual comparison show that the edge prediction network makes the structural information of the completed region more accurate and the visual effect closer to reality.

Figure 5 Edge prediction effect comparison (panels: Input, No edge, Edge prediction)
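For readers unfamiliar with the PSNR figures reported in this section, the following minimal sketch computes PSNR from its standard definition, 10 log10(MAX² / MSE). It assumes 8-bit images (MAX = 255) given as nested lists; the function name `psnr` is illustrative, and an established library implementation would normally be used in practice.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, restored: H x W nested lists of pixel values.
    max_value: largest possible pixel value (255 for 8-bit images).
    """
    squared_errors = [(a - b) ** 2
                      for row_ref, row_res in zip(reference, restored)
                      for a, b in zip(row_ref, row_res)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly one gray level, so MSE = 1.
original = [[10, 20], [30, 40]]
completed = [[11, 21], [31, 41]]
value = psnr(original, completed)
print(round(value, 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of about 1.5 dB corresponds to a reduction of the mean squared error by roughly 30%.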

A-JPU Module Performance Comparison
The autoencoder structure in the image completion network of the ICMEP model uses the new A-JPU module to replace the large number of dilated convolutional layers used in traditional structures, in order to accelerate training and improve efficiency. To verify this experimentally, the second-stage image completion network using the A-JPU module is compared with one using traditional dilated convolution, with both models trained in the same environment. The results are shown in Tab. 2. The data in Tab. 2 show that replacing traditional dilated convolution with the A-JPU module changes the objective evaluation indices very little, indicating that the A-JPU module fulfils the role of dilated convolution. With a batch size of 4, one training iteration takes 1.027 s when the A-JPU module is used as the middle layer of the autoencoder structure, compared with 1.348 s when dilated convolution is used. The completion network using the A-JPU module therefore reduces training time by approximately a quarter with little or no sacrifice in completion quality, which effectively improves efficiency.
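The reported iteration times can be turned into a relative saving and a rough per-epoch estimate with simple arithmetic. The sketch below assumes that one epoch is a single pass over the 1.8 million training images at batch size 4; the paper does not state how many iterations or epochs were actually run, so the hour figures are illustrative only.

```python
# Iteration times reported in the text, in seconds, at batch size 4.
TIME_AJPU = 1.027
TIME_DILATED = 1.348

# Assumption: one epoch is one pass over the 1.8 million training images.
TRAIN_IMAGES = 1_800_000
BATCH_SIZE = 4

relative_saving = (TIME_DILATED - TIME_AJPU) / TIME_DILATED
iterations_per_epoch = TRAIN_IMAGES // BATCH_SIZE
hours_ajpu = iterations_per_epoch * TIME_AJPU / 3600
hours_dilated = iterations_per_epoch * TIME_DILATED / 3600

print(f"relative saving per iteration: {relative_saving:.1%}")
print(f"iterations per epoch: {iterations_per_epoch}")
print(f"hours per epoch: A-JPU {hours_ajpu:.1f}, dilated {hours_dilated:.1f}")
```

The saving per iteration is a little under 24%, which is consistent with the statement that training time is reduced by approximately a quarter.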

Image Completion Effect Evaluation
Contextual Attention (CA) [8], GLCIC [7] and the ICMEP model of this paper are selected for the evaluation of the image completion effect. All models use Places2 as the training data set, and PSNR and SSIM are used as the objective evaluation indicators. Fig. 6 shows the completion results of each comparison model and of the ICMEP model on the test set. From left to right: damaged image, GLCIC model, CA model, original image, and the completion by the ICMEP model of this paper. The comparison shows that the area completed by the GLCIC model suffers from blurred and confused structure, and its completion effect is mediocre. The images completed by the CA model and by the model in this paper are clearer. However, when key structures are missing, such as the roofs in the first and fifth images, the railing in the third image and the hot air balloon in the fourth image, the images completed by the CA model show confused local structure and unreasonable structural information. In contrast, the structural information of the images completed by the model in this paper is clear and reasonable, and the visual effect is closer to the real image. The comparison of subjective visual effects shows that, because the model in this paper completes the damaged image based on the structural information of the missing area predicted by the edge prediction network, the structural information of the completed area is more accurate and reasonable than that of the other two models, and the visual effect is closer to the real image. To further verify the effectiveness of the ICMEP model, Tab. 3 lists the objective evaluation results of the GLCIC model, the CA model and the ICMEP model, computed on the completed images when repairing the same 100 test images under masks of different area ratios.
The data in the table show that the GLCIC model, whose completed areas are generally blurred, has lower evaluation indices than both the CA model and the ICMEP model. When the masked area is small (below 20%), the completion results of the ICMEP model are slightly better than those of the CA model. As the masked area increases, the gap between the evaluation indices of the ICMEP model and the CA model widens, and the ICMEP model obtains higher values of both PSNR and SSIM.
The above analysis and comparison show that, compared with the existing models, the ICMEP model improves the objective evaluation indicators of the completion results. In addition, it effectively alleviates the problem of confused information in the completed image and makes the visual effect closer to the real image.

CONCLUSION
An image completion model, ICMEP, based on edge prediction and an improved generator is proposed in this paper. The model is divided into two parts: the edge prediction network and the image completion network. The experimental results show that the edge prediction network enables the model to predict and handle missing complex structures in the image, and that the completion effect is better than that of a model using only a single image completion network, which confirms the importance of high-frequency edge information in the image completion process. The A-JPU module is used in the image completion network in place of traditional dilated convolution, preserving the completion effect while effectively improving training efficiency. Compared with other existing models, the ICMEP model improves both the subjective visual effect and the objective evaluation indicators of the completion results. However, the results of the ICMEP model are still blurred when the missing areas contain a large number of detailed texture features, which will be studied in future research.