3D Reconstruction of Optical Building Images Based on Improved 3D-R2N2 Algorithm

Three-dimensional reconstruction technology is a key element in the construction of urban geospatial models. To address the current shortcomings of 3D reconstruction algorithms in reconstruction accuracy, convergence of registration results, reconstruction effectiveness, and convergence time, we propose a 3D reconstruction method for optical building images based on an improved 3D-R2N2 algorithm. The method inputs preprocessed optical remote sensing images into a Convolutional Neural Network (CNN) with dense connections for encoding, converting them into a low-dimensional feature matrix, and adds a residual connection between every two convolutional layers to support a deeper network. Subsequently, 3D Long Short-Term Memory (3D-LSTM) units are used for transitional connections and recurrent learning. Each unit selectively adjusts or maintains its state and accepts the feature vectors computed by the encoder. These data are then passed to a Deep Convolutional Neural Network (DCNN), where each 3D-LSTM hidden unit partially reconstructs the output voxels. The DCNN convolutional layers use equal-sized 3 × 3 × 3 convolution kernels to process and decode the feature data, thereby accomplishing the 3D reconstruction of buildings. In addition, a pyramid pooling layer is introduced between the feature extraction module and the fully connected layer to enhance the performance of the algorithm. Experimental results indicate that, compared to the 3D-R2N2 algorithm, the SFM-enhanced AKAZE algorithm, the AISI-BIM algorithm, and the improved PMVS algorithm, the proposed algorithm improves the reconstruction effect by 5.3%, 7.8%, 7.4%, and 1.0%, respectively. Furthermore, compared to the other algorithms, the proposed algorithm converges better in registration results and requires less reconstruction time.
This research contributes to the enhancement of building 3D reconstruction technology, laying a foundation for future research in deep learning applications in the architectural field.


INTRODUCTION
With the development of modern technology, 3D reconstruction technology [1] for buildings has received increasing attention, and 3D model construction has become a key element of urban geographic spatial data frameworks. This paper focuses on the optical image-based 3D reconstruction of buildings using 3D-R2N2. Many scholars, both domestically and internationally, have conducted research in this area and achieved significant results. Wu et al. [2] were the first to propose 3D Shape-Nets, a voxel-based 3D reconstruction network. However, this network suffered from texture defects, specular reflections, and baseline matching issues. Choy et al. [3] proposed the 3D-R2N2 method, which primarily addresses the problem of object feature matching. However, the accuracy and efficiency of this method are not high. Kanazawa et al. [4] developed a Warp-Net network framework based on convolutional neural networks, which achieved reconstruction quality similar to that of supervised methods. However, the targets reconstructed by this method were distorted. Wu J. et al. [5] proposed the Marr-Net model, which is trained end-to-end on real images, but this algorithm is computationally complex and does not recover finer geometric shapes. Lu Chuan et al. [6] used an improved 2DPCA-SIFT feature-matching algorithm to match feature points in images of ancient buildings from the Qing Dynasty. They then achieved 3D reconstruction through image sequence fusion. This method has good completeness and accuracy, but the reconstruction efficiency is not high. Chen Jiankun et al. [7] proposed a 3D reconstruction method based on deep neural networks, which achieved good results on SAR images, but its 3D reconstruction of optical images is imperfect and cannot fully reflect the continuous structure of the target. Stathopoulou E. K. et al. [8] investigated three commonly used open-source solutions, namely COLMAP, OpenMVG + OpenMVS, and AliceVision, and evaluated their results under diverse large-scale scenarios. The image orientation and dense point cloud generation algorithms were compared and critically evaluated with respect to the completeness and accuracy of the final results. Zhu Pan et al. [9] developed an AISI-BIM 3D reconstruction method, which has high accuracy under AISI network segmentation and produces clearer segmentation boundaries. However, the reconstruction efficiency of this method is not ideal, and it consumes a lot of time and resources. To improve the accuracy of the deep learning algorithm based on a Multi-View Stereo matching network (MVS-Net) in weak texture scenes, Liu D. et al. [10] proposed a novel 2D-3D CNN with spectral-spatial multiscale feature fusion for hyperspectral image classification, which consists of two feature extraction streams, a feature fusion module, and a classification scheme. The authors designed the classification scheme to raise the classification accuracy. Wang A. et al. [11] showed that a 3D reconstruction algorithm based on region growing combined with CMVS-PMVS produces a sufficiently dense point cloud that represents the reconstructed object well. The reconstruction of objects in remote sensing images is highly practical, but this algorithm suits only a limited number of building types and is not suitable for all types of buildings.
If the model relies solely on the 3D-R2N2 algorithm for three-dimensional reconstruction, the following limitations may be observed in the reconstruction results: 1) Accuracy: the outcome of the reconstruction depends on the quantity and quality of the input images. If the number of input images is limited or the image quality is poor, the generated 3D model may be less accurate. 2) Loss of detail: the algorithm may struggle with the details of buildings. Particularly for complex architectural structures, such as arches, towers, and intricate decorations, it may not be able to reconstruct these details accurately. 3) Scale issues: as buildings are generally large, the algorithm may face challenges when handling large-scale 3D models. For instance, it may require a substantial amount of computational resources and time, and improper handling might reduce model quality. 4) Data sparsity: for parts of a building that cannot be directly observed in the images, such as the interior or occluded sections, the algorithm may not be able to reconstruct them effectively. 5) Realism: the 3D models reconstructed by this algorithm might not look highly realistic. Especially for details such as lighting and textures, it may not achieve the desired outcome.
In this study, an improved version of the 3D-R2N2 algorithm is proposed. We first replaced the convolutional layers of the CNN module with densely connected convolutional layers. Subsequently, a pyramid pooling layer was inserted between the feature extraction module and the fully connected layer to enhance the model's performance. The experimental results show that the proposed algorithm improves the reconstruction performance by 5.3%, 7.8%, 7.4%, and 1.0%, respectively, compared to the 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, the AISI-BIM algorithm, and the improved PMVS algorithm. Compared with the 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, and the AISI-BIM algorithm, the proposed algorithm also converges better in registration results and requires less reconstruction time.

BASIC PRINCIPLE OF 3D-R2N2 ALGORITHM
2.1 Overall Framework
This study focuses on the research of heterogeneous buildings. The overall process is shown in Fig. 1.

Figure 1 Overall framework
Preprocessed optical remote sensing images are input to a CNN [12] for encoding to obtain a low-dimensional feature matrix. To strengthen the deep network, a residual connection is added between every two convolutional layers in the CNN, and 3D-LSTM units provide the transitional connection and perform recurrent learning [13]. The 3D-LSTM units are arranged in a three-dimensional grid. Each unit selectively adjusts or maintains its state and accepts the feature vector computed by the encoder, and the resulting data are passed to the DCNN. Finally, each 3D-LSTM hidden unit reconstructs a part of the output voxels, and the DCNN convolutional layers use equal-sized 3 × 3 × 3 convolution kernels to process the feature data carrying building image information and then decode it to obtain the three-dimensional reconstruction of the building.

Image Preprocessing
2.2.1 Image Denoising
Due to the presence of noise and blur in optical images [14], direct 3D reconstruction from optical images can lead to inaccurate results. Therefore, preprocessing of optical images is necessary. In this study, a total variation (TV) model was used to denoise the images. The model treats the observed image as a noise-free image corrupted by additive noise, as shown in Eq. (1).
where f represents the original noise-free image, f0 represents the observed image contaminated by high-frequency noise, s is the noise, which has zero mean, and (x, y) denotes the pixel position in the image. To eliminate the noise, the study uses total variation (TV) minimization. The image denoising problem can be formulated as the minimization problem shown in Eq. (2).
The constraints to be satisfied are shown in Eq. (3) and Eq. (4). The total variation term acts as the regularization term, where λ is the regularization parameter, which balances noise removal against smoothing of the image.
In this research, the TV model is applied to denoise the optical images of anisotropic buildings, suppressing noise while preserving the authentic characteristics of the images. This improves the accuracy of the 3D reconstruction results.
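For reference, the symbols used in this subsection (f, f0, s, Ω, σ², λ) match the standard Rudin-Osher-Fatemi (ROF) total variation model. The block below is a sketch under the assumption that Eqs. (1) to (5) follow that standard form; it is not a transcription of the original equations.

```latex
% Assumed additive noise model (cf. Eq. (1))
f_0(x, y) = f(x, y) + s(x, y)

% Assumed TV minimization problem (cf. Eq. (2))
\min_{f} \int_{\Omega} \lvert \nabla f \rvert \, dx \, dy

% Assumed constraints: zero-mean noise and known noise variance (cf. Eqs. (3)-(4))
\int_{\Omega} f \, dx \, dy = \int_{\Omega} f_0 \, dx \, dy,
\qquad
\frac{1}{\lvert \Omega \rvert} \int_{\Omega} (f - f_0)^2 \, dx \, dy = \sigma^2

% Assumed unconstrained form with regularization parameter \lambda (cf. Eq. (5))
\min_{f} \int_{\Omega} \lvert \nabla f \rvert \, dx \, dy
  + \frac{\lambda}{2} \int_{\Omega} (f - f_0)^2 \, dx \, dy
```

In this form, a larger λ keeps the result closer to the observed image f0, while a smaller λ yields stronger smoothing.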

Image Enhancement
The optical image was enhanced using the histogram equalization method [15]. The input optical image was converted to grayscale and its histogram was computed.
The method stretches the occupied grayscale range over the entire available grayscale range, which amounts to a non-uniform stretching of the input image. The pixels in the input image are then redistributed, as shown in Eq. (6).
where L - 1 = 255, k is an integer gray level, m is the total number of pixels in the image, and p(k) represents the frequency of gray level k.
The cumulative histogram is calculated as shown in Eq. ( 7).

  
where n_i is the number of pixels at the i-th gray level, and P_k represents the cumulative sum of the frequencies of the gray levels i up to k, with k in [0, L - 1]. Then, P_k is scaled and rounded, as shown in Eq. (8).
$k' = \operatorname{int}\left[(L - 1) P_k + 0.5\right]$ (8)
This defines the mapping from k to k', where k is the original gray level, k' is the gray level after equalization, and P_k is the cumulative frequency defined above.
In this study, the histogram equalization stretching method was used to enhance the contrast and details of the optical anisotropic building images. This method improves the visual quality and readability of the images, which supports more accurate image encoding, feature extraction, and 3D reconstruction. Enhancing the less discernible details in dark or bright areas improves the accuracy and quality of the reconstruction.
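The equalization steps above (frequency, cumulative histogram, rounded mapping) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `equalize_histogram` and the synthetic low-contrast test image are assumptions made for the example.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Histogram-equalize an 8-bit grayscale image given as a 2D uint8 array."""
    m = image.size                                       # total number of pixels
    n_k = np.bincount(image.ravel(), minlength=levels)   # pixel count per gray level
    p_k = n_k / m                                        # frequency p(k)
    P_k = np.cumsum(p_k)                                 # cumulative histogram P_k
    # Rounded mapping k -> int[(L - 1) * P_k + 0.5]
    mapping = np.floor((levels - 1) * P_k + 0.5).astype(np.uint8)
    return mapping[image]                                # look up the new level per pixel

# Hypothetical low-contrast image: gray levels confined to [100, 139]
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
enhanced = equalize_histogram(low_contrast)
```

After equalization the occupied gray levels are spread over nearly the full [0, 255] range, which is the contrast stretching described in the text.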

Build 3D-R2N2 Network
Voxel refers mainly to the probability distribution of three-dimensional objects represented as three-dimensional binary variables under the action of deep convolutional belief networks [16]. The obtained probabilities are input into a depth map, and the data are iteratively predicted and filled in using Gibbs sampling [17] to reconstruct the shape voxels of the building target. To improve the effect of building reconstruction, this study uses the voxel-based 3D-R2N2 algorithm for 3D reconstruction. Combining voxels with the 3D-R2N2 algorithm makes the 3D reconstruction result more accurate and mitigates problems such as texture defects and wide-baseline feature matching.
First, the preprocessed images are randomly sampled, and the sampled images are encoded by one of two 2D-CNN variants: a standard feed-forward CNN and a deep residual variant. To match the number of channels after convolution, this study uses 1 × 1 convolutions for the residual connections. Next, the output of the encoder is flattened and fed into a fully connected layer, which maps the output to a 1024-dimensional feature vector, i.e., a low-dimensional feature vector.
The obtained low-dimensional feature vector is input into the 3D-LSTM units, which are arranged in a 3D grid structure. The 3D grid contains N × N × N 3D-LSTM units, where N is the spatial resolution of the 3D-LSTM grid. As shown in Fig. 2, each 3D-LSTM unit receives the feature vector of the encoded image and the hidden state of the previous step. Each unit indexed (i, j, k) has an independent hidden state h_t(i, j, k). The forget gate f_t, input gate i_t, and memory cell s_t control the 3D-LSTM grid according to Eq. (9), Eq. (10), Eq. (11), and Eq. (12).
where * represents the convolution operation. We set n = 4 in our study. A notable characteristic of this structure is the absence of an output gate, as only the final result is output. Removing the redundant output gate reduces the number of parameters. The grid arrangement enables the 3D-LSTM units to handle inconsistencies between the reconstruction region and the true model, allowing each unit to focus on learning a part of the voxel space rather than participating in the reconstruction of the entire space. This structure gives the network locality, allowing it to selectively update predictions of previously occluded parts of the object. The portion of the final output reconstructed by each unit is computed and input into the decoder.
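For reference, the gate updates in the original 3D-R2N2 formulation of Choy et al. [3] take the form below. This is a sketch under the assumption that Eqs. (9) to (12) follow that formulation. The symbols not defined in the text are assumptions introduced here: T(x_t) is the encoder output for image x_t, W, U, and b are learnable parameters, * is 3D convolution, ⊙ is element-wise multiplication, and σ is the sigmoid function. In that formulation f_t acts as the forget gate and there is no output gate.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \, \mathcal{T}(x_t) + U_f * h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i \, \mathcal{T}(x_t) + U_i * h_{t-1} + b_i\right) \\
s_t &= f_t \odot s_{t-1} + i_t \odot \tanh\left(W_s \, \mathcal{T}(x_t) + U_s * h_{t-1} + b_s\right) \\
h_t &= \tanh(s_t)
\end{aligned}
```

Because the recurrent term uses a convolution over neighboring hidden states rather than a full matrix product, each unit depends only on its local neighborhood in the grid, which is the locality property described above.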
After the image sequence {x_1, x_2, ..., x_T} has been input, the 3D-LSTM transfers the hidden state features to the decoder, which uses 3D convolutions, nonlinearities, and 3D transposed convolutions to increase the resolution of the hidden state until the target output resolution is reached. As shown in Fig. 3, similar to the encoder, this study designs a decoding network consisting of 5 convolutions, together with a deep residual version that has 4 residual connections. Once the last layer reaches the target output resolution, a voxel-wise softmax [18] is applied to the final activation, as shown in Eq. (13); the corresponding voxel-wise cross-entropy loss is shown in Eq. (14). This study adopts Intersection over Union (IoU) as the evaluation metric, where larger values are better, as shown in Eq. (15).
where, I(x) is the indicator function and t is the threshold.If the probability is greater than this threshold, then the corresponding voxel exists.
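The decoder output stage described above (voxel-wise softmax, cross-entropy loss, and thresholding) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two-channel layout (empty, occupied), averaging the loss over voxels, the threshold value 0.5, and all function names are choices made for the example rather than details given in the text.

```python
import numpy as np

def voxel_softmax(logits):
    """Softmax over the channel axis of a (2, N, N, N) array.

    Channel 0 is assumed to mean 'empty' and channel 1 'occupied'.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def voxel_cross_entropy(p_occupied, y, eps=1e-12):
    """Mean binary cross-entropy between predicted occupancy p and 0/1 labels y."""
    p = np.clip(p_occupied, eps, 1.0 - eps)               # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 4, 4))        # hypothetical decoder activations
y = rng.integers(0, 2, size=(4, 4, 4))        # hypothetical ground-truth occupancy
probs = voxel_softmax(logits)
loss = voxel_cross_entropy(probs[1], y)
occupied = probs[1] > 0.5                     # voxel exists if probability exceeds t
```

The thresholded grid `occupied` is the binary reconstruction that an IoU computation would compare against the ground truth.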

Improved 3D-R2N2 Model
This study mainly focuses on improving the CNN module in 3D-R2N2. The linearly connected convolutional layers used in the original 3D-R2N2 CNN are replaced with convolutional layers using dense connections. In addition, a pyramid pooling layer is added between the CNN feature extraction module and the fully connected layer. Compared to the old model, the new model trains faster and more stably, its cross-entropy loss declines faster, and the reconstructed 3D models are more accurate.

Compact Module
The first structural module in 3D-R2N2, the Encoder, uses a CNN to extract and encode image features. It includes 12 convolutional layers and 5 residual connections. Although the convolutional part is aided by 5 residual connections, the feature extraction ability of the entire Encoder is still limited. In this paper, while retaining the advantages of 3D-R2N2 as much as possible, we improve and optimize the model's structure and algorithm. The specific improvement is as follows: the original CNN structure in the Encoder module is modified by replacing the conventional connections between convolutional layers with dense connections (DC) [19]. This modification allows a deeper network to be trained without increasing the training difficulty, and it enhances the feature extraction ability of the Encoder while making the network as a whole easier to train. What the network learns recurrently is the features, so the more features are obtained, the more accurate the reconstructed 3D model will be.
The main focus of this study is to add two dense modules to the CNN structure. The CNN structure before and after the improvement is shown in Fig. 4 and Fig. 5, respectively. The conventionally linked convolutional layers in the CNN were replaced by densely linked convolutional layers. As Fig. 5 shows, the improved network adds two modules, containing 6 and 5 convolutional layers of size 3 × 3, respectively. In these convolutional layers, each layer is connected to all subsequent layers. Each layer can therefore access the feature information extracted by the previous layers, and the input of each layer can be represented as x_n, as shown in Eq. (16).
where Y_n represents the transformation applied by layer n, and its argument represents the set of feature information extracted by the preceding layers. Each convolutional layer receives all the feature information extracted by the previous layers, so each layer needs to produce only a few feature maps, unlike other networks that require a large width. At the same time, features and gradients propagate more efficiently under this dense connection scheme, making the improved network easier to train.
In the improved densely connected network, there is also a parameter growth rate v, which controls the number of feature extractions in the network.This allows for more efficient use of the channel number, making the CNN network more efficient and improving encoding efficiency.
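The dense connection pattern and the role of the growth rate v can be illustrated with a toy block. This is a sketch only: a random 1 × 1 channel mixing followed by ReLU stands in for the real 3 × 3 convolutions, and the input size, growth rate value, and the name `dense_block` are assumptions made for the example.

```python
import numpy as np

def dense_block(x0, num_layers, growth_rate, rng):
    """Toy dense block on a (channels, height, width) array.

    Layer n receives the concatenation of x0 and all earlier layer outputs
    and contributes `growth_rate` new feature maps.
    """
    features = [x0]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)        # all previous feature maps
        weights = rng.normal(size=(growth_rate, stacked.shape[0]))
        # Stand-in for a 3x3 convolution: per-pixel channel mixing, then ReLU
        new_maps = np.maximum(np.einsum("oc,chw->ohw", weights, stacked), 0.0)
        features.append(new_maps)                         # adds v feature maps
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 8, 8))                          # hypothetical block input
out = dense_block(x0, num_layers=6, growth_rate=12, rng=rng)
```

With 16 input channels, 6 layers, and v = 12, the block output has 16 + 6 × 12 = 88 channels, which shows how the growth rate bounds the number of feature maps each layer adds.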

Feature Classification Module
The conventional convolutional neural network [20] typically follows the feature extraction module with multiple fully connected layers, which output the classification results. The improvement made in this study is as follows: a spatial pyramid pooling (SPP) [21] layer is inserted between the feature extraction module and the fully connected layer, which solves the problem of varying input image sizes. The SPP layer, as shown in Fig. 6, extracts multi-scale features and improves classification performance by fusing these features in the feature classification module.
As shown in Fig. 6, the SPP layer sits between the feature maps and the fully connected layer. Its operation is described here for a feature map group with a depth of 256 whose height and width are not fixed. In the SPP layer, the leftmost blue box divides each channel of the feature map group into 16 parts, so the feature map group is divided into 16 × 256 parts. The green box and the purple box on the right work similarly and divide the feature map group into 4 × 256 parts and 1 × 256 parts, respectively. After a pooling operation is performed on each part, the feature map group is transformed into a 256 × 21 matrix, which is then flattened into a feature vector of length 5376 before entering the fully connected layer. This ensures that, regardless of the dimensions of the input data, the data are unified to the same dimension after passing through the SPP layer and before entering the fully connected layer, solving the problem of inconsistent sizes of input feature maps caused by saliency segmentation and cropping in sonar image feature extraction module. In addition, unlike regular pooling layers, SPP uses multiple sampling scales. The partitioning of SPP can be set flexibly according to the application requirements. For example, Fig. 6 uses a 3-level pyramid with 4 × 4, 2 × 2, and 1 × 1 sampling grids, respectively. By down-sampling the feature map with multiple window sizes and strides, a set of new feature maps with different scales is obtained. These feature maps are then concatenated to form a new vector that contains both high-level semantic and spatial information. Feeding this feature vector to a fully connected layer can improve the accuracy of target recognition.
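The fixed-length property of the 3-level pyramid described above can be checked with a short NumPy sketch. The text does not state which pooling operation is used, so max pooling is an assumption here, as are the function name `spatial_pyramid_pool` and the two input sizes.

```python
import numpy as np

def spatial_pyramid_pool(feature_maps, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) array over n x n grids for each pyramid level.

    Returns a vector of length C * sum(n * n); H and W must be >= max(levels).
    """
    channels, height, width = feature_maps.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n
        rows = np.linspace(0, height, n + 1).astype(int)
        cols = np.linspace(0, width, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_maps[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))      # one value per channel
    return np.stack(pooled, axis=1).ravel()               # (C, 21) flattened

rng = np.random.default_rng(0)
vec_a = spatial_pyramid_pool(rng.normal(size=(256, 13, 17)))
vec_b = spatial_pyramid_pool(rng.normal(size=(256, 32, 20)))
```

Both inputs yield a vector of length 256 × (16 + 4 + 1) = 5376 despite their different spatial sizes, which is what lets the fully connected layer accept inputs of varying size.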

EXPERIMENTAL RESULTS AND ANALYSIS
3.1 Experimental Environment
The experiments in this study were run on the Windows 10 operating system with 32 GB of memory, and training was performed on a GeForce RTX 3090 SUPRIM X 24G graphics card. The training framework used was TensorFlow.
After data preprocessing, a total of 100 000 optical images were used in the comparative experiments in this study.The dataset consists of 20 different categories, among which 12 were selected as the training samples and 8 were used as the testing samples for this research.

Evaluation Metrics
In this study, the IoU value [22], i.e., the accuracy of 3D reconstruction, was used as the evaluation criterion for experiments 3.3 and 3.6. The evaluation is based on the intersection over union of the 3D reconstruction result and the true model, that is, the ratio of the intersection to the union of the predicted volume and the true volume. Therefore, when dealing with other FFr representations based on surface reconstruction, the reconstructed and true models must first be voxelized, as shown in Eq. (17).
where I(·) denotes the indicator function, V_i^pred represents the i-th voxel of the predicted model, and V_i^gt represents the i-th voxel of the ground truth model. The higher the IoU value, the better the reconstruction accuracy.
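The voxel IoU described above can be computed as in the following sketch. It is a minimal NumPy illustration: the threshold value 0.5, the convention of returning 1.0 when both grids are empty, and the name `voxel_iou` are assumptions made for the example.

```python
import numpy as np

def voxel_iou(pred_prob, gt, threshold=0.5):
    """IoU between a thresholded occupancy prediction and a binary ground truth."""
    pred = pred_prob > threshold              # indicator I(prediction > t)
    gt = gt.astype(bool)                      # indicator of ground-truth occupancy
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                            # both grids empty (assumed convention)
    return float(np.logical_and(pred, gt).sum() / union)

# Hypothetical 4 x 4 x 4 grids: two occupied slabs that overlap in one layer
gt = np.zeros((4, 4, 4), dtype=np.uint8)
gt[:2] = 1                                    # 32 occupied voxels
pred = np.zeros((4, 4, 4))
pred[1:3] = 0.9                               # 32 predicted voxels, 16 shared
iou = voxel_iou(pred, gt)
```

Here the intersection holds 16 voxels and the union 48, so the IoU is 1/3.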
The reconstruction results of the target object may vary with different network structures, and the improved Encoder network structure may also produce reconstruction results that differ from those of the original Encoder. In this experiment, the proposed algorithm improved the Encoder module by adding Inception modules while preserving the residual convolutional network. After the optical images were segmented to a certain extent, residual connections as in ResNet were used to address the vanishing gradient problem and extract more accurate target image features.
The evaluation criterion for experiment 3.4 is the convergence of the matching results, which is calculated using the method shown in Eq. (18).
R1 represents the convergence factor, and represents the vector matrix.The closer R1 is to 0, the better the convergence of the algorithm, and vice versa.

Comparison of the Same Image Reconstruction Accuracy of Different Algorithms
To evaluate the 3D reconstruction accuracy of the proposed algorithm, this experiment compared it with the 3D-R2N2 algorithm, SFM-improved AKAZE algorithm, AISI-BIM algorithm, and improved PMVS algorithm on the same type of optical images, and calculated the corresponding IoU values for different algorithms, as shown in Tab. 1.
As shown in Tab. 1, the proposed algorithm has higher reconstruction accuracy (IoU) than the other algorithms across different types of buildings. The IoU of the proposed algorithm is 5.3% higher than that of the 3D-R2N2 algorithm, 7.8% higher than that of the SFM-improved AKAZE algorithm, 7.9% higher than that of the AISI-BIM algorithm, and 1.0% higher than that of the improved PMVS algorithm. Although the proposed algorithm does not show a significant improvement in reconstruction accuracy over the improved PMVS algorithm, the experiment shows that the improved PMVS algorithm is limited in the building types it can handle, as it lacks data for buildings in Guting, adobe houses, and ancient town buildings.
The IoU values corresponding to (b), (c), (d), (e), and (f) in Fig. 7 are 0.671, 0.650, 0.657, 0.712, and 0.729, respectively, when specific building images are selected. As higher IoU values indicate higher accuracy, it can be seen from Fig. 7 that the proposed algorithm reconstructs 3D details more completely.

Comparison of Convergence of Registration Results for Different Algorithms on the Same Dataset
Convergence is one way to test the stability of an algorithm. In this experiment, convergence tests of registration results were conducted on the 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, the AISI-BIM algorithm, the improved PMVS algorithm, and the proposed algorithm using the same datasets to evaluate the stability of the different algorithms. The results are shown in Tab. 2. According to Tab. 2, the convergence of the proposed algorithm is stronger than that of the other algorithms on optical image datasets of 4000, 17311, and 37780 images. When performing 3D reconstruction on the three datasets, the proposed algorithm is compared with the 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, the AISI-BIM algorithm, and the improved PMVS algorithm. The results show that the proposed algorithm improves the convergence of registration results by 1.6%, 3.2%, 2.0%, and 2.6%, respectively.

Comparison of the Same Image Reconstruction Effect and Convergence Time of Different Algorithms
A qualitative analysis was used in this experiment to evaluate the 3D building reconstruction effect of the proposed algorithm. A teaching building with various architectural features, such as a clock tower and multi-story sections, was chosen as the experimental object. The 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, the AISI-BIM algorithm, the improved PMVS algorithm, and the proposed algorithm were used in sequence for 3D reconstruction of the building images, and the results are shown in Figs. 8 to 12. As can be seen from Figs. 8 to 12, the 3D reconstruction obtained using the proposed algorithm has clearer outlines and better reconstruction details than those of the other algorithms. According to Tab. 3, the convergence time of the five algorithms increases with the data scale, that is, the larger the data scale, the longer the convergence time. At the same data scale, the convergence times of the 3D-R2N2 algorithm, the SFM-improved AKAZE algorithm, the AISI-BIM algorithm, and the improved PMVS algorithm were all longer than that of the proposed algorithm. On average, the proposed algorithm reduces the convergence time by 8.1%, 15.1%, 4.2%, and 6.5%, respectively, compared to these algorithms.

Ablation Experiment
The ablation experiment is one of the methods used to demonstrate the effectiveness of the improved algorithm, as shown in Tab. 4. DC represents the improved dense connection module in CNN, and SPP represents the improved feature classification module.
The experimental results show that when the DC module is applied to improve 3D-R2N2, the reconstruction accuracy increases by 11.1% compared to 3D-R2N2; when the SPP module is applied to improve 3D-R2N2, the reconstruction accuracy increases by 11.39%; when both the DC and SPP modules are applied to improve 3D-R2N2, i.e., the algorithm proposed in this study, the reconstruction accuracy increases by 17.81% compared to 3D-R2N2. The results demonstrate that the proposed algorithm has a significant advantage in reconstruction accuracy.

CONCLUSION
This study is based on the 3D-R2N2 algorithm for three-dimensional reconstruction of optical building images. The principle of the 3D-R2N2 algorithm was analyzed, and the algorithm was improved and compared with traditional 3D reconstruction algorithms. The experimental results show that the improved 3D-R2N2 algorithm has significant advantages over traditional 3D reconstruction algorithms in terms of reconstruction accuracy, convergence of registration results, reconstruction effect, and convergence time. The efficacy of the algorithm itself was verified through ablation experiments. Furthermore, the results of this study can be applied not only to optical building images but also to other types of complex planar stereoscopic optical images for three-dimensional reconstruction. However, many aspects of this study still need improvement, such as the poor performance of the algorithm on buildings with curved surfaces and irregular edges; these are the directions and goals of future research.

where σ² represents the variance of the noise, δ represents the domain of the target region, Ω represents the domain of the image, and (x, y) denotes a pixel point. The above equation is the data fidelity term, which mainly preserves the characteristics of the original image and greatly reduces the distortion caused by image noise. The derived equation is shown as Eq. (5).

Figure 2 LSTM specific structure

The voxel cross-entropy loss function is used to calculate the probability of each voxel. X represents the input image sequence, y(i, j, k) represents the actual ground truth, and p(i, j, k) represents the predicted probability of the voxel.

Figure 3 Decoding network

Let the final output at each voxel (i, j, k) follow a

Figure 4 CNN module of 3D-R2N2 before improvement

Figure 7 Comparison of reconstruction accuracy

Table 2 Convergence comparison of registration results

Table 3 Convergence time of different image data scales

Table 4 Performance comparison results of 3D-R2N2 algorithm before and after improvement