Spatial Stimuli Gradient Based Multifocus Image Fusion Using Multiple Sized Kernels

Abstract: A multi-focus image fusion technique extracts the focused areas from all the source images and combines them into a new image that contains all focused objects. This paper proposes a spatial domain fusion scheme for multi-focus images using multiple sized kernels. First, the source images are pre-processed with a contrast enhancement step; then the soft and hard decision maps are generated by employing a sliding window technique with multiple sized kernels on the gradient images. The hard decision map selects the accurate focus information from the source images, whereas the soft decision map selects the basic focus information and contains a minimum of falsely detected focused/unfocused regions. These decision maps are further processed to compute the final focus map. The gradient images are constructed through a state-of-the-art edge detection technique, the spatial stimuli gradient sketch model, which computes the local stimuli from perceived brightness and hence enhances the essential structural and edge information. Detailed experimental results demonstrate that the proposed multi-focus image fusion algorithm performs better than other well-known state-of-the-art multi-focus image fusion methods, in terms of both subjective visual perception and objective quality evaluation metrics.


INTRODUCTION
Usually, imaging cameras have the limitation of a finite depth of field, which causes a partially focused scene to be acquired through the optical lens [1]. The objects located within the depth of field of a camera have sharp details, whereas the rest of the objects are blurred [2]. Partially focused images often limit performance in various applications, including surveillance, remote sensing, medical imaging and object recognition [1]. Therefore, multiple images are acquired and the complementary information of these images is combined into one single image by using image fusion techniques.
Multifocus image fusion algorithms can be broadly classified into two categories: transform domain algorithms [3] and spatial domain algorithms [4]. A subcategory of transform domain fusion algorithms is the multiscale transform. A multiscale transformation is applied on the source images to obtain multi-resolution representations and decomposed coefficients of low and high frequencies or orientations. Based upon the fusion rule, different sub-band coefficients are combined, and the final fused image is obtained by taking the inverse multiscale transform. Early multiscale transform fusion methods include pyramid decomposition [5], the wavelet transform [6], the complex wavelet transform [7] and the contourlet transform [8]. Some of the latest multiscale transform fusion methods are proposed in [9][10][11][12]. Since these algorithms apply a global fusion technique, they produce poor results for misregistered images [13]. Moreover, these methods generally produce a low contrast fused image [14].
Spatial domain algorithms usually find the saliency map/focus regions in the source images [14] for fusion. Block based fusion methods have been used widely because the focus measure cannot be efficiently represented by a single pixel [15]. The optimum block size is another challenge in these types of methods, because a larger block size may result in a degraded weight calculation, especially for a block containing both focused and unfocused areas.
Usually, spatial domain fusion algorithms are immune to shift variance and, owing to their simplicity, they are the best choice for incorporation in real-time devices. Saliency detection for the focus measure is the basic requirement of spatial domain algorithms. Normally, gradient information is used to detect the focused areas in the image. The Scale Invariant Feature Transform (SIFT) is a well-known algorithm for finding feature vectors that contain gradient information at key points in an image [16]. Dense SIFT (DSIFT), in contrast, finds a 128D feature vector for each pixel in the image, rather than for key points only [17]. The baseline of DSIFT algorithms is the binning of the magnitudes and directions of the gradients. Liu et al. used DSIFT for the determination of an activity map for multi-focus images [18]. DSIFT can also be used for the registration of misaligned source images; however, most of the available multi-focus image fusion datasets are pre-registered. Moreover, activity map determination using DSIFT is computationally expensive compared to some recent gradient based approaches such as the Spatial Stimuli Gradient Sketch Model (SSGSM) proposed by Mathew and James [19].
In this paper, we use a multiple sized kernel technique to determine the focus map. Kernels of two different sizes are used simultaneously to measure the focus activity in each of the source images: the smaller kernel determines the hard decision map and the larger one the soft decision map. The hard decision map has the tendency to differentiate the boundary between focused and unfocused regions, whereas the soft decision map strikes out the outliers falsely detected by the hard decision map. The focus map is finally determined by combining the hard and soft decision maps. Histogram equalization is applied as a preprocessing step to enhance the edge information in the source images.
The remainder of the paper is organized as follows: section 2 describes the detailed fusion scheme of the proposed algorithm. Experimental setup and results are discussed in section 3. Finally, conclusions are drawn in section 4.

PROPOSED FUSION SCHEME
The flow diagram of our proposed fusion scheme is given in Fig. 1. As most of the datasets available for multifocus image fusion are preregistered, we assume that both source images are preregistered in this work. Smooth regions of a source image belong to the unfocused area, whereas a region containing sharp edges forms a part of the focused area. Therefore, for pre-processing, the contrast of both source images is enhanced by using the Nonparametric Modified Histogram Equalization (NMHE) presented in [20]. This helps boost the edge information available in the source images. The fusion scheme proposed in this paper is mainly divided into three steps. First, the local stimuli maps of both preprocessed source images are calculated by using SSGSM. The local stimuli maps are used to compute the activity level maps, which contain the information on the focused areas in both images. Following that, the determined focused areas of both source images, along with the undetermined area, are computed in the coarse decision map by using the focus information in both activity level maps. The next step is to refine the undetermined area of the coarse decision map by using a local focus measure to obtain the final decision map. As a last step, the final decision map is used to obtain the fused image. Histogram equalization is widely used to enhance low contrast images so that the original image can be mapped as closely as possible to a uniform distribution of intensities in the histogram. NMHE is a histogram equalization method which not only enhances the contrast, but also preserves the average brightness of the original image.
We used NMHE as a pre-processing step for the contrast enhancement of the source images. The NMHE algorithm calculates a modified histogram by removing the spikes from the original histogram, i.e. bins whose occurrence is very high compared to the neighbouring intensity levels. The first step of NMHE suppresses these spikes. Afterwards, the algorithm clips and normalizes the resulting histogram and then calculates the cumulative deviation of this transitional modified histogram from the uniform histogram. It then uses this deviation as a weighting factor to construct a final modified histogram that is a weighted mean of the modified histogram and the uniform histogram. Fig. 2 shows the change in gradients after applying NMHE on the "joy", "flower", "clock" and "toy" source images, respectively. Eq. (1) determines a threshold, and the pixels higher than the threshold contribute to the modified histogram.
where p(n|C) is the occurrence probability of the n-th intensity level given the horizontal variation of contrast C. A measure of equalization (M_eq) indicates the non-uniformity of the histogram distribution of an image; it is calculated according to Eq. (2) as the cumulative deviation between the uniform probability density function p_u and the modified clipped histogram h_c computed from the original histogram. M_eq assigns the weights of the absolute uniform histogram and the modified histogram when calculating the redesigned histogram h_NMH as per Eq. (3):

h_NMH(n) = (1 - M_eq) h_c(n) + M_eq p_u(n)

From the CDF of h_NMH, the transformation curve T(n) is obtained through Eq. (4):

T(n) = (L - 1) c(n)

where c(n) is the CDF of h_NMH and L = 256 for an 8-bit image. T(n) is applied to the original image to obtain the contrast-improved image. Contrast enhancement results in improved edges in the source images. The next step of the fusion scheme is to identify the focused areas. Various gradient based techniques have been proposed in the literature, such as DSIFT, SSGSM and the local standard deviation (LSD). LSD computes the deviation in a 3 × 3 neighbourhood, whereas DSIFT finds a 128D feature vector for each pixel of an image, containing 8 orientation bins of gradients in a 4 × 4 grid. The SSGSM is based on two well-known laws: the Weber-Fechner law of perceived brightness and Shepard's similarity law pertaining to neighbourhood similarity [19]. The perceived brightness B_P of an image I is evaluated from Eq. (5):

B_P = a ln(I)

where a is the constant of proportionality.
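The NMHE preprocessing described above can be sketched as follows. This is a simplified, illustrative implementation, not the exact algorithm of [20]: the spike-suppression threshold and the direction of the blending weight used here are assumptions.

```python
import numpy as np

def nmhe_like_enhance(img, clip_k=3.0):
    """Simplified NMHE-style contrast enhancement (illustrative sketch).

    Steps mirror the description in the text: suppress histogram
    spikes, measure the deviation from the uniform histogram, blend
    the modified and uniform histograms, then equalize with the
    blended CDF. The exact spike test and weighting of [20] differ.
    """
    L = 256
    hist = np.bincount(img.ravel(), minlength=L).astype(float)

    # 1) Suppress spikes: clip bins that tower over the rest (assumed test).
    threshold = hist.mean() + clip_k * hist.std()
    h_mod = np.minimum(hist, threshold)
    h_mod /= h_mod.sum()                    # normalized modified histogram

    # 2) Measure of equalization: deviation from the uniform pdf.
    p_u = np.full(L, 1.0 / L)
    m_eq = 0.5 * np.abs(h_mod - p_u).sum()  # 0 = already uniform

    # 3) Blend modified and uniform histograms (Eq. (3); weighting assumed).
    h_nmh = (1.0 - m_eq) * h_mod + m_eq * p_u

    # 4) Equalize using the CDF of the blended histogram (Eq. (4)).
    cdf = np.cumsum(h_nmh)
    T = np.round((L - 1) * cdf).astype(np.uint8)
    return T[img]
```

Because the uniform component always contributes to the blended histogram, the resulting mapping stays close to classical equalization while damping the over-enhancement that isolated histogram spikes would otherwise cause.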
Gradients represent the edges in an image; however, the gradient is a linear operator, so it cannot suppress the noise present in the images. Moreover, the gradient measures distance from the local intensity variability point of view. The method proposed in [19] instead measures perceived similarity, which is capable of screening out noise edges because of its exponential transformation. The dissimilarities of B_P along the x and y axes are calculated as D_x and D_y, respectively, as functions of the corresponding gradients G_x and G_y, as given in Eq. (6). Eq. (7) gives the magnitude of the local stimuli D as follows:

D = sqrt(D_x^2 + D_y^2)

The averaging filter smoothens the sharp edges recovered in the previous step, to avoid strict boundary decisions in the forthcoming step. The output of the Gaussian averaging filter is denoted by G_1 for the first source image and G_2 for the second source image.
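The SSGSM pipeline above can be sketched as follows, under stated assumptions: the exact dissimilarity transform of [19] is not reproduced here (a 1 − exp(−|G|) form stands in for it), and a box filter stands in for the Gaussian averaging filter.

```python
import numpy as np

def local_stimuli_map(img, a=1.0, win=5):
    """Sketch of an SSGSM-style local stimuli map.

    Pipeline per the text: perceived brightness via a Weber-Fechner
    log law (Eq. (5)), gradients of the brightness, an exponential
    (Shepard-style) transform to damp noise edges, gradient
    magnitude (Eq. (7)), then smoothing. The transform below is an
    assumption, not the exact expression of [19].
    """
    I = img.astype(float) + 1.0               # avoid log(0)
    B = a * np.log(I)                         # perceived brightness

    Gy, Gx = np.gradient(B)                   # brightness gradients
    Dx = 1.0 - np.exp(-np.abs(Gx))            # assumed dissimilarity transform
    Dy = 1.0 - np.exp(-np.abs(Gy))
    D = np.hypot(Dx, Dy)                      # magnitude of local stimuli

    # Box averaging stands in for the Gaussian averaging filter.
    pad = win // 2
    Dp = np.pad(D, pad, mode='edge')
    out = np.zeros_like(D)
    for dy in range(win):
        for dx in range(win):
            out += Dp[dy:dy + D.shape[0], dx:dx + D.shape[1]]
    return out / (win * win)
```

A flat region yields a near-zero stimuli response, while a focus boundary yields a strong one, which is exactly the property the activity level maps rely on.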
The next step is the determination of the basic focus maps. We achieved this by finding the definite focused regions of the first source image and the definite unfocused regions of the second source image.
Similarly, the second basic focus map contains the definite focused regions of the second source image and the definite unfocused regions of the first source image. The detailed scheme is described in the following four stages: Stage 1: We used an n × n sliding window approach to make the fusion process shift-invariant. Window W is of size n × n with n = 3, whereas window W' is of a larger size. These two window sizes are chosen to determine the hard and soft decision maps, respectively. Furthermore, the windowing operation is also used to reduce the blocking artefacts in the coarse decision maps D_1 and D_2.
The step size for both pairs of sliding windows is set to one pixel. For a source image with a resolution of P × Q, the total number of sliding window patches for window W is (P − n + 1) × (Q − n + 1). Score matrices M and M' store the focus information of the near and far focused source images computed through windows W and W', respectively. These score matrices are initialized with zeros and have the same size as the source images, in order to store the pairs of near and far focus maps acquired through windows W and W'. The score matrices are further used to determine the focus map for each pixel of the two source images, as shown in Fig. 4. Following this scheme, the focus measure score is computed for all corresponding pixels of both source images. The detailed flow of the focus map determination is shown in Fig. 5. The above classification rule can be summarized in Eq. (9), where n = {1, 2}. Fig. 6 shows the classification results acquired through Eq. (8). Fig. 6a-b show the near and far coarse focus maps F'_1 and F'_2, Fig. 6c-d show the post-processed coarse focus maps and Fig. 6e shows the absolute difference of Fig. 6c-d. The black area along the boy marks the unclassified pixels, which need to be further processed in the next step.
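The stage-1/2 windowing can be sketched as follows. This is a minimal illustration: every window position votes for the source image with the higher summed activity over the window footprint, whereas the paper's actual per-window scoring rule is richer than this majority-style vote.

```python
import numpy as np

def window_vote_maps(G1, G2, n=3):
    """Accumulate per-pixel focus votes with an n-by-n sliding window.

    G1 and G2 are the activity (local stimuli) maps of the two
    source images. Each window position compares their summed
    activity and votes for the sharper image over the whole window
    footprint; the step size is one pixel, so the number of window
    positions is (P - n + 1) * (Q - n + 1).
    """
    P, Q = G1.shape
    M1 = np.zeros((P, Q))                     # votes for source image 1
    M2 = np.zeros((P, Q))                     # votes for source image 2
    n_patches = (P - n + 1) * (Q - n + 1)
    for i in range(P - n + 1):
        for j in range(Q - n + 1):
            s1 = G1[i:i + n, j:j + n].sum()
            s2 = G2[i:i + n, j:j + n].sum()
            if s1 >= s2:
                M1[i:i + n, j:j + n] += 1
            else:
                M2[i:i + n, j:j + n] += 1
    return M1, M2, n_patches
```

Running the same accumulation with the larger window W' produces the second pair of score matrices; thresholding and comparing the two pairs yields the hard and soft decision maps.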
Stage 3: The next step is to identify the unclassified pixels. From the above classification rule, it is clear that if a pixel I_1(i, j) is focused in source image I_1, then the same pixel must be unfocused in source image I_2. The coarse decision maps obtained through the classification rule have a strong tendency to segregate the focused and unfocused areas in both source images, as can be observed in Fig. 6. Fig. 6a-b show the coarse decision maps of the source images obtained from Fig. 3g-h, respectively. Afterwards, morphological operations are applied as a post-processing step on the coarse decision maps to fill the small holes existing within the focused areas and to remove the small focused objects surrounded by unfocused areas. Fig. 6c-d show the images obtained after the morphological operations. Fig. 6e shows the absolute difference of the binary coarse decision maps shown in Fig. 6c and Fig. 6d. The black boundary around the body of the boy in Fig. 6e belongs to the unclassified pixels. Figure 6 Coarse classification results of "joy" source images, a)-b) coarse maps; c)-d) post-processing results and e) absolute difference of both decision maps. Stage 4: The final step before fusion is to refine the unclassified pixels from the last step. These unclassified pixels normally belong to the boundary between the focused and unfocused regions. A local descriptor, the spatial frequency, is used to measure the local focus of an unclassified pixel [21]. The spatial frequency can be calculated as per Eq. (10):

SF = sqrt(RF^2 + CF^2)

where RF is the row frequency and CF is the column frequency, which can be calculated from Eq. (11) and Eq. (12):

RF = sqrt( (1 / (N_1 N_2)) ΣᵢΣⱼ [I(i, j) − I(i, j − 1)]^2 )
CF = sqrt( (1 / (N_1 N_2)) ΣᵢΣⱼ [I(i, j) − I(i − 1, j)]^2 )
where N_1 and N_2 are the total numbers of pixels in the rows and columns of the source image I. The higher the value of the spatial frequency, the higher the focus measure of the corresponding pixel.
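The spatial frequency measure used in Stage 4 follows the common definition of [21] and can be implemented directly:

```python
import numpy as np

def spatial_frequency(block):
    """Spatial frequency of an image block (Eqs. (10)-(12)).

    RF and CF are the RMS of horizontal and vertical first
    differences; SF combines them. A higher SF indicates a
    better-focused block.
    """
    I = block.astype(float)
    N1, N2 = I.shape
    rf = np.sqrt(np.sum((I[:, 1:] - I[:, :-1]) ** 2) / (N1 * N2))  # row frequency
    cf = np.sqrt(np.sum((I[1:, :] - I[:-1, :]) ** 2) / (N1 * N2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)
```

For each unclassified pixel, the SF of a small neighbourhood in each source image is compared, and the pixel is assigned to the image with the higher SF.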

EXPERIMENTAL SETUP
For detailed experiments, we utilized 18 pairs of popular source images, which are shown in Fig. 7. Of these multi-focus image pairs, 10 are greyscale and 8 are colour. These test images have also been used in [34], and a few of them are taken from the Lytro multi-focus dataset [22]. Near focused images are shown on the left side, whereas the images on the right side are far focused.
To prove the effectiveness of our proposed multi-focus image fusion algorithm, we have compared our results with the latest state-of-the-art algorithms. Figure 7 Multi-focus test images used during the experiments. These include DSIFT [18], Discrete Wavelet Transform and Adaptive Block (DWT) [23], Guided Filter Fusion (GFF) [24], Discrete Cosine Transform (DCT) based fusion [25], Quadtree-based Multi-focus Image Fusion (QMIF) [13], Energy of Laplacian based DCT (DCT-EoL) [3] and the boosted random walks-based algorithm with Two-Scale Focus maps (TSF) [26]. Compared to the DSIFT implementation, the proposed algorithm has the following main improvements: a) A multiple sized kernel scheme has been introduced to determine the hard and soft decision maps. The hard decision map differentiates the boundary between near and far focus areas, whereas the soft decision map reduces the falsely detected focused/unfocused regions in the final focus map. b) The dense SIFT used to calculate the activity level map does not identify the coarse decision map properly, owing to the fact that dense SIFT detects many redundant features; therefore, we used SSGSM to determine the gradient features. c) Before calculating the decision maps, a contrast and edge enhancement scheme has been incorporated to support the subsequent image fusion process.
The proposed algorithm and the algorithms used for comparison were executed on an Intel® Core™ i5-3210M processor, 8 GB RAM, 64-bit Windows 10 laptop. All the algorithms were implemented in MATLAB R2014a (8.3.0.532), 64-bit version.

RESULTS AND DISCUSSION

Qualitative Comparison
Multi-focus source images contain both focused and unfocused areas. Hence, a fusion algorithm with the ability to accurately detect and merge the focused areas from the source images will yield a better visual perception. For the qualitative comparison, we evaluate the visual quality of the fusion results shown in Fig. 8. These are the results of different fusion techniques for the pair of pre-registered "Joy" source images shown in Fig. 7p. Three different areas from the boundary of the near and far focused source images are selected, and their magnified versions are displayed above each fusion result for a detailed visual comparison of all the fusion techniques compared during the experiment. Figure 8 Fusion result of "Joy" multi-focus test image, a) QMIF [13]; b) DCT [25]; c) DWT [23]; d) GFF [24]; e) DSIFT [18]; f) DCT-EoL [3]; g) TSF [26] and h) Proposed
It can be seen that GFF and DSIFT produce better fusion results than the rest of the fusion methods, but our proposed method picks up the focused regions from both source images accurately. Noticeable blur can be seen in all three areas for the QMIF, DCT and DWT based fusion schemes. For the DCT-EoL fused image, numerous blocking artefacts are present. In the case of GFF and TSF, the magnified area on the right-hand side is in very clear focus, but the other magnified areas show blur at the boundary of the focused and unfocused areas. Similarly, for the DSIFT fusion results, the magnified area showing the cap is clear, but the magnified area shown on the right side contains some portion of the unfocused part. In our proposed fusion scheme, it can be observed that all three magnified areas contain sharp details of both the near and far focused source images.
Multi-focus source images contain both focused and unfocused areas. Hence, a better fusion algorithm will have the capability to acquire the focused areas from both source images more accurately. To further evaluate the perceptive quality of the proposed fusion algorithm, the difference between the fused image and the far focused "Joy" source image is shown in Fig. 9. Figure 9 Difference of fused image and far focused "Joy" source image, a) QMIF [13]; b) DCT [25]; c) DWT [23]; d) GFF [24]; e) DSIFT [18]; f) DCT-EoL [3]; g) TSF [26] and h) Proposed. It can be observed that the QMIF, DWT, DCT, DCT-EoL and DSIFT based fusion schemes have acquired a few areas of the near focused source image as well, which belong to the unfocused area of the far focused source image. TSF has also acquired some of the unfocused portion of the far focused source image into the fused image. It is clear from the magnified areas of the proposed fusion results that the proposed fusion scheme has outperformed the rest of the techniques in terms of perceptive quality. Fig. 10 shows the fusion results for the "Clock" source images. It can be observed that the proposed fusion algorithm has clearly separated the boundaries of the near and far focused images, whereas the rest of the fusion schemes have acquired some of the out-of-focus regions near the boundaries.
To assess the contribution of each source image to the fused image, we used the Gradient Magnitude Similarity Deviation (GMSD) [27]. Fig. 11c-d show which portions of the near and far focused source images are present in the fusion result of the "Clock" test images. Figure 10 Fusion results of "Clock" source images, a) QMIF [13]; b) DCT [25]; c) DWT [23]; d) GFF [24]; e) DSIFT [18]; f) DCT-EoL [3]; g) TSF [26] and h) Proposed
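The GMSD comparison above can be sketched as follows; this is a simplified version of [27], which uses Prewitt kernels and downsampling, whereas np.gradient and the value of the stabilizing constant c here are assumptions.

```python
import numpy as np

def gmsd(ref, dist, c=170.0):
    """Gradient Magnitude Similarity Deviation (simplified sketch).

    Follows the structure of GMSD [27]: a pointwise
    gradient-magnitude similarity map, then its standard deviation.
    A deviation of 0 means the two images share identical gradient
    structure; larger values mean less of `ref` survives in `dist`.
    """
    def grad_mag(img):
        gy, gx = np.gradient(img.astype(float))
        return np.hypot(gx, gy)

    m1, m2 = grad_mag(ref), grad_mag(dist)
    gms = (2.0 * m1 * m2 + c) / (m1 ** 2 + m2 ** 2 + c)  # 1 where gradients agree
    return gms.std()
```

Comparing each source image against the fused result with this measure indicates which regions of the near and far focused inputs were carried into the fusion.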

Quantitative Comparison
Evaluation of algorithms on the basis of visual quality is insufficient and requires a quantitative comparison. Various fusion quality evaluation metrics have been proposed, but none of them can completely evaluate the fusion quality independently [28]. Some well-known criteria for objective evaluation of fusion quality are explained as follows:

Spatial Structural Similarity (SSS) Q AB/F
SSS is an edge based fusion quality evaluation metric proposed by Xydeas and Petrovic [42, 43]. This metric finds the amount of edge information transferred into the fused image from all the source images. Q^{AB/F} for two source images can be calculated from Eq. (13):

Q^{AB/F} = Σ_x Σ_y [Q^{AF}(x, y) W^A(x, y) + Q^{BF}(x, y) W^B(x, y)] / Σ_x Σ_y [W^A(x, y) + W^B(x, y)]

where Q^{AF}(x, y) represents the edge information transferred from source image A into the fused image F at pixel location (x, y), Q^{BF}(x, y) is defined analogously for source image B, and W^A(x, y) and W^B(x, y) are the weights for that pixel location. A pixel with a higher gradient value influences Q^{AB/F} more than one with a lower gradient value.
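The weighted-average structure of Eq. (13) can be written down directly; computing the per-pixel edge-preservation maps and weights themselves (per Xydeas and Petrovic) is outside this sketch, so they are taken as inputs.

```python
import numpy as np

def q_abf(QAF, QBF, wA, wB):
    """Weighted-average form of the Q^{AB/F} metric (Eq. (13)).

    QAF/QBF are per-pixel edge-preservation maps in [0, 1] and
    wA/wB the gradient-based weights; pixels with larger weights
    (stronger gradients) dominate the score.
    """
    num = (QAF * wA + QBF * wB).sum()
    den = (wA + wB).sum()
    return num / den
```

With perfect edge preservation (both maps equal to one everywhere) the metric evaluates to 1 regardless of the weights, which matches its intended normalization.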

Mutual Information (MI)
MI can be determined from Eq. (14):

MI = Σ_m Σ_n P_if(m, n) log2 [ P_if(m, n) / (P_i(m) P_f(n)) ]

where P_if(m, n) represents the joint probability density distribution of the greyscale intensities of input image i and fused image f, and P_i(m) and P_f(n) represent the marginal probability density distributions of i and f, respectively. The fusion score is the sum of the mutual information between each input image and the fused image. A greater MI shows that the fused image carries more of the information present in the source images [29].
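Eq. (14) can be estimated from the joint intensity histogram of an image pair; the fusion score in the text sums this quantity over both (source, fused) pairs.

```python
import numpy as np

def mutual_information(img1, img2, bins=256):
    """Mutual information between two greyscale images (Eq. (14))."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p_xy = joint / joint.sum()                # joint distribution P_if
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal P_i
    p_y = p_xy.sum(axis=0, keepdims=True)     # marginal P_f
    nz = p_xy > 0                             # avoid log2(0)
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())
```

An image shares maximal information with itself (MI equals its entropy) and zero information with a constant image, which is the behaviour the fusion score exploits.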

Feature Mutual Information (FMI)
FMI is a no-reference performance metric for fusion algorithms which calculates the mutual information of image features, such as edges and gradients [30, 31]. FMI can be computed from Eq. (15), where N represents the number of sliding windows.

Fusion Similarity Metric (FSM) QT
Q_T is a fusion quality metric which measures the similarity in terms of luminance, contrast and structure between the source and fused images. Q_T finds the above similarities in the source and fused images block by block through a sliding window, and can be calculated from the corresponding equation,
where if  is defined as per Eq. (18):

Results Discussion
The proposed fusion scheme for multi-focus image fusion is compared with the latest state-of-the-art fusion algorithms, which include DSIFT, DWT, DCT, GFF, QMIF, DCT-EoL and TSF. The four quality assessment metrics described earlier were used to compare the effectiveness of our proposed fusion scheme.
Tab. 1 shows the values of SSS, which measures the edge information transferred into the fused image from the source images. The overall SSS index of the proposed method is better than that of the other techniques for most of the source images. However, for the "Pepsi", "Joy" and "Toy" source images, our SSS results are approximately equal to those of the TSF method. Our proposed method acquires the edges of the source images to determine the initial fusion map. The edge information in the near and far focused "Pepsi" images is not well defined; hence, in terms of SSS, our proposed method gives results approximately the same as those of TSF. Similarly, for the "Joy" and "Toy" source images there is a minor misregistration between the two source images, while our proposed method specifically targets registered images; therefore, the SSS results of the proposed method for these images are approximately equal to those of TSF. However, for the rest of the evaluation metrics, MI, FSM and FMI, our proposed algorithm has produced significantly better results than TSF for these three source images. It is pertinent to mention here that the SSS results of our proposed technique for the above three images trail TSF by a negligible margin; hence, it may be said that the maximum possible improvement of SSS for these three test images may already have been achieved.
Tab. 2 presents the second quantitative measure, i.e. MI. The MI directly measures the amount of information the fused image shares with the source images. It can be observed from Tab. 2 that our proposed method has a better MI for all the test images except "Temp.". For multifocus image fusion, more information reflects the presence of gradients of higher magnitude in the fused image. However, in Fig. 8a it can be observed that a few blocking artefacts are present at the boundary of the near and far focused images, which increases the MI score of the QMIF method for most of the test images. Specifically, for the MI of the "Temp." fused image, our results are approximately equal to those of QMIF.
The FMI is a no-reference fusion metric which reflects the amount of feature information in the fused image. Tab. 3 shows that the FMI of the proposed method is better than that of all other techniques, except GFF for four test images: "Ball.", "Leop.", "Toy" and "World". From Fig. 9d of the qualitative analysis, it can be observed that the fused images produced by the GFF technique contain some irrelevant information which should not be present there. This additional information is used by FMI when calculating the feature information; therefore, the FMI score of GFF is better for a few of the test images compared to our proposed technique.
The last evaluation metric for the quantitative comparison is the FSM, with results given in Tab. 4. As with the previous evaluation metrics, the overall performance of our proposed fusion scheme is better than that of the competing algorithms. However, for the "Ball.", "Book" and "Lab" images, our FSM results are approximately equal to those of DSIFT, DCT-EoL and GFF, respectively.

CONCLUSION
In this paper, we proposed an SSGSM based multifocus image fusion technique using multiple sized kernels. The source images are first pre-processed using the NMHE histogram equalization method, and then the gradients of these enhanced images are calculated using SSGSM. The basic focus map is determined from the soft and hard decision maps acquired by employing multiple sized kernels. The final decision map is determined by applying morphological operations as a post-processing step. The proposed algorithm is compared with seven other state-of-the-art multi-focus image fusion algorithms on well-known colour and grey multi-focus image datasets. It is concluded that the proposed algorithm demonstrates significant improvement in qualitative results compared to the rest of the algorithms. For the quantitative comparison, four fusion metrics were used, because a single fusion metric cannot show the effectiveness of a fusion scheme. For one fusion metric, our results are approximately equal to those of the best performing technique, but for the rest of the fusion metrics our method performs considerably better than that technique. Therefore, it is concluded that the overall performance of our proposed technique in terms of quantitative measures is better than that of the rest of the techniques.