A Novel Deep Convolutional Neural Network Architecture Based on Transfer Learning for Handwritten Urdu Character Recognition

Abstract: Deep convolutional neural networks (CNNs) have had a major impact on computer vision and set the state of the art by delivering highly accurate classification results. For character recognition, where training images are usually scarce, transfer learning from pre-trained CNNs is often used. In this paper, we propose a novel deep convolutional neural network for handwritten Urdu character recognition based on transfer learning from three pre-trained CNN models. We fine-tune the layers of these pre-trained CNNs so that they extract features capturing both global and local details of the Urdu character structure. The features extracted by the three CNN models are concatenated and used to train two fully connected layers for classification. Experiments are conducted on the UNHD, EMILLE, DBAHCL, and CDB/Farsi datasets, and we achieve 97.18% average recognition accuracy, which outperforms the individual CNNs and numerous conventional classification methods.


INTRODUCTION
Character recognition is one of the biggest challenges in pattern recognition, and with the evolution of deep learning it has made remarkable progress in recent years. Convolutional neural networks (CNNs) have been established as exceptionally powerful deep learning tools for a wide range of computer vision tasks, including character recognition. A CNN learns highly descriptive, hierarchical image features from a large number of training images [1]. However, deep learning requires a massive number of labelled samples, and labelled samples are hard to obtain for many tasks. The most widely used dataset for training CNNs with error backpropagation is ImageNet [2]. This dataset is vast, comprising a large number of pictures collected from multiple image-hosting sites. Training a CNN on ImageNet can take a long time, depending on the processing power of the machine. In the character recognition domain, obtaining data as extensively labelled as ImageNet remains a big challenge. When such large-scale data are not available, training a deep CNN on a small amount of data frequently causes convergence and overfitting problems. In such cases, where the dataset is not large enough to train a CNN from scratch, transfer learning is used. By transferring knowledge from other tasks, CNNs have been shown to learn new classes faster with minimal training data [3]. Additionally, transfer learning is preferable to reducing the network capacity when dealing with a small training dataset [4]. The basic idea of CNN transfer learning is to exploit the strength of networks trained on a large image dataset in a different application area. There are two ways to perform transfer learning. First, use the pretrained CNN model, initialize all the weights randomly, and train the model on the new dataset. Second, use the architecture of the pretrained CNN model but freeze the initial layers to retain their weights and retrain the higher layers on the new dataset.
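The second strategy (freezing the initial layers and retraining the higher ones) can be illustrated with a minimal, framework-free sketch. The `Layer` class, the layer names, the toy weights, and the plain gradient-descent step are all hypothetical and are not taken from the networks used in this paper; they only show that frozen layers keep their weights while the remaining layers are updated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Toy stand-in for a network layer (hypothetical, for illustration only)."""
    name: str
    weights: List[float]
    trainable: bool = True


def freeze_initial_layers(layers: List[Layer], n_frozen: int) -> None:
    """Mark the first `n_frozen` layers as frozen; later layers stay trainable."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= n_frozen


def sgd_step(layers: List[Layer], grads: List[List[float]], lr: float = 0.1) -> None:
    """Apply one gradient-descent update, skipping every frozen layer."""
    for layer, grad in zip(layers, grads):
        if not layer.trainable:
            continue  # frozen layers retain their pretrained weights
        layer.weights = [w - lr * g for w, g in zip(layer.weights, grad)]


# Hypothetical three-layer network with made-up "pretrained" weights.
network = [
    Layer("conv1", [1.0, 2.0]),
    Layer("conv2", [0.5, -0.5]),
    Layer("fc", [0.0, 0.0]),
]
freeze_initial_layers(network, n_frozen=2)
sgd_step(network, grads=[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])

for layer in network:
    print(layer.name, layer.trainable, layer.weights)
```

Only the `fc` layer changes after the update step; the two frozen layers keep the weights they started with.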
In the past two decades, numerous character recognition systems have been presented for scripts such as Arabic, Latin, and Chinese, and pioneering work has also been done on a few Indian scripts such as Devanagari, Bangla, Oriya, and Gurmukhi. Character recognition using transfer learning has been applied effectively to various combinations of scripts [5]; for example, handwritten Devanagari character recognition using layer-wise training of a deep CNN was presented in [6] and achieved favourable results. Results close to the state of the art in Bangla character recognition have been obtained in [7] using a CNN. Ref. [8] presented Latin and Chinese character recognition by transfer learning with a deep CNN. Transfer learning for handwritten historical document recognition has also been presented for a few scripts [9]. Transfer learning using CNNs has also been applied effectively to numeral recognition in Hindi, Arabic, and Bangla [10] and in Oriya, Telugu, and Devanagari [11]. Among the state-of-the-art methods for Chinese character recognition, a semi-supervised transfer learning approach using a CNN has been presented [12]. For the Arabic and Urdu scripts, which are highly cursive, few works using CNNs and transfer learning have been reported. In the few works on Arabic and Urdu script, a CNN is used either as the sole classifier or as part of a composite classifier [13][14]. Convolutional-recursive deep learning has also been applied to Urdu Nastalique character recognition in [15] with quite satisfactory performance. From the literature we observe that CNN-based transfer learning has been applied successfully mostly to printed rather than handwritten Chinese [8], Bangla [7], Arabic [13], and Devanagari scripts. Indeed, by transferring knowledge learned on large datasets, CNNs outperform other methods even with few classes [16][17]. From the literature, we also observe that character recognition of Urdu script using CNNs and transfer learning has not yet been explored.
In this paper, we propose a novel deep CNN architecture for handwritten Urdu character recognition using three different pretrained CNN models, namely AlexNet [18], GoogleNet [19], and ResNet18 [20]. We exploit transfer learning by fine-tuning these three CNNs and concatenating the extracted features, which are then classified using two fully connected layers. We evaluated our proposed network on four standard benchmark datasets, namely UNHD [21], EMILLE [22], CDB/Farsi [23], and DBAHCL [24].
The chief contributions of this work are: (i) exploiting the concatenation of features extracted from three different pretrained CNNs; (ii) constructing a hybrid feature vector that comprises intricate features extracted from cursive Urdu characters; (iii) applying transfer learning to the above-mentioned Urdu datasets; and (iv) achieving a high classification accuracy with the proposed handwritten Urdu character recognition system, surpassing current methods in the literature mainly with respect to recognition accuracy.

DESIGN OF PRETRAINED CNN FOR FEATURE EXTRACTION
We present a handwritten Urdu character recognition system using a novel deep CNN based on transfer learning. In our network, the feature extraction stage is built from three state-of-the-art pretrained CNN models. The extracted features are concatenated and fed to two fully connected layers for classification.
We have adopted three deep CNN models, AlexNet, GoogleNet, and ResNet18, as feature extractors for our Urdu character recognition task. These CNNs are pre-trained on ImageNet, a dataset of non-character images, to provide broad image descriptors. We adapt them by transfer learning to extract discriminative features for our Urdu datasets. The adopted CNNs are briefly described as follows.

AlexNet
AlexNet won the ImageNet LSVRC-2012 competition by a wide margin (an error rate of 15.3% vs. 26.2% for second place). The network is 8 layers deep, is pretrained on more than a million images, and can classify images into 1000 classes. As a result, the network has learned rich feature representations for a wide range of images. AlexNet has achieved good classification results for medical images [25][26][27], biometrics [28], scene classification [29], embedded systems [30], and character recognition [31,32]. The first convolutional layer of AlexNet contains 96 kernels of size 11 × 11 × 3. The width and height of a kernel are normally equal, and its depth matches the number of input channels. The first two convolutional layers are each followed by an overlapping max pooling layer. The last three convolutional layers are connected directly to one another.

Figure 1 AlexNet Architecture
The output then passes through a series of three fully connected layers. ReLU activation is applied after the convolutional and fully connected layers. The last fully connected layer feeds into a softmax classifier with 1000 class labels. The overall architecture of AlexNet is illustrated in Fig. 1.
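As a worked example of the first convolutional layer described above (96 kernels of size 11 × 11 × 3), the sketch below computes its output width and its number of learnable parameters. The 227 × 227 input size is the one this paper reports for AlexNet; the stride of 4 and zero padding are assumptions taken from the standard AlexNet configuration, not from this paper.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_parameter_count(num_kernels: int, kernel_h: int, kernel_w: int,
                         in_channels: int, bias: bool = True) -> int:
    """Number of learnable weights (plus optional biases) in a convolutional layer."""
    per_kernel = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return num_kernels * per_kernel


# Assumed: stride 4 and no padding, as in the standard AlexNet first layer.
out_size = conv_output_size(input_size=227, kernel_size=11, stride=4, padding=0)
params = conv_parameter_count(num_kernels=96, kernel_h=11, kernel_w=11, in_channels=3)

print(out_size)  # 55, i.e. a 55 x 55 x 96 output volume
print(params)    # 34944 = 96 * (11 * 11 * 3 + 1)
```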

GoogleNet
GoogleNet is a convolutional neural network trained on more than a million images from the ImageNet database [19]. GoogleNet won the ILSVRC-2014 classification challenge and has a much deeper architecture than earlier CNN architectures. The network is 22 layers deep and can classify images into 1000 different classes. As a result, the network has learned rich feature representations for a wide variety of images. The network has an image input size of 224-by-224.
The network is built from 9 inception modules, which distinguish it from a conventional CNN that has only a single series configuration. An inception module captures local spatial features well by applying kernels of different sizes in parallel to compute convolutions, and all the convolution results are concatenated in the final layer of the module.
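The idea of applying kernels of different sizes in parallel and concatenating the results can be sketched in plain Python. This is a simplified illustration, not the GoogleNet implementation: it uses one averaging kernel per branch on a single-channel image, and it omits the 1 × 1 reduction layers and the pooling branch of a real inception module. The function names and the toy image are hypothetical.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation that preserves the input height and width."""
    k = len(kernel)  # square kernel with odd side length
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:  # outside pixels count as zero
                        total += image[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out


def mean_kernel(size):
    """Averaging kernel of the given side length (a stand-in for learned weights)."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def inception_like(image, kernel_sizes=(1, 3, 5)):
    """Run one branch per kernel size and stack the branch outputs as channels."""
    return [conv2d_same(image, mean_kernel(k)) for k in kernel_sizes]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # toy 4 x 4 input
channels = inception_like(image)
print(len(channels), len(channels[0]), len(channels[0][0]))  # 3 channels, each 4 x 4
```

Because every branch preserves the spatial size, the branch outputs can be stacked along the channel dimension regardless of kernel size.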
The convolutional layers of the network extract image features, and the last two layers are used for classification. These two layers, 'loss3-classifier' and 'output' in GoogleNet, contain information on how to combine the features that the network extracts into class probabilities, predicted labels, and a loss value. GoogleNet has achieved good classification results for scene classification [33], character recognition [34], weather recognition [35], malware detection [36], and complex recognition [37]. For simplicity, only the inception module of GoogleNet is shown in Fig. 2.

ResNet18
ResNet won 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57% (an ensemble model) and also won 1st place in the ILSVRC and COCO 2015 competitions on the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tasks.

ResNet18 Module
The ResNet18 network converges faster than its plain counterpart, and ResNet18 has the smallest depth among the ResNet variants. ResNet has achieved good recognition results in medical imaging and biometrics [38]. The simplified architecture of ResNet18 is illustrated in Fig. 3.

Proposed Network Design
The three CNN models discussed above have each been applied successfully to character recognition. Different CNN architectures are capable of extracting diverse features from character images. In our proposed network, we integrate the features from these three CNN models to obtain a more discriminative feature representation than a single CNN provides. Fig. 4 illustrates the complete architecture of our proposed network: the features extracted from the three CNNs are concatenated and fed to two fully connected layers for classification.
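The classification head described above (concatenated features fed to two fully connected layers) can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the trained network: the feature sizes, the hidden width, the number of classes, the random weights, and the ReLU between the two layers are all hypothetical choices made for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Random weights and zero biases (stand-ins for trained parameters)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(feature_vectors, fc1, fc2):
    """Concatenate per-network features, then apply two fully connected layers."""
    fused = [x for vec in feature_vectors for x in vec]  # feature concatenation
    hidden = relu(dense(fused, *fc1))                    # assumed ReLU between layers
    return softmax(dense(hidden, *fc2))


rng = random.Random(0)
# Toy feature vectors standing in for the outputs of the three pretrained CNNs.
feats = [[rng.random() for _ in range(n)] for n in (4, 6, 5)]
fc1 = random_layer(rng, n_out=8, n_in=15)  # 15 = 4 + 6 + 5 concatenated features
fc2 = random_layer(rng, n_out=3, n_in=8)   # 3 hypothetical character classes
probs = classify(feats, fc1, fc2)
print(probs)
```

The output is a probability vector over the hypothetical classes; in a real system the two layers would be trained on the concatenated features rather than drawn at random.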

Figure 4 Proposed Network for Urdu Character Recognition
The learning capacity of our network adapts the generic features extracted by the pretrained CNNs to the particular structure of Urdu characters. For character recognition, it is usually advised to fine-tune the pretrained network [35]. However, fine-tuning may cause overfitting, because for a few character classes in the dataset the number of training images may not be sufficient. Therefore, we adopt a straightforward arrangement that improves validation performance without overfitting.

EXPERIMENTAL ANALYSIS

Corpus
Arabic and Urdu use the same script. The script originated with Arabic, and Urdu, Persian, and Pashto have inherited it. Urdu is the national language of Pakistan, and Nastalique is the standard style of writing Urdu script. Officially, Urdu has 58 characters, of which 28 are derived from Arabic. In writing practice, 48 essential characters, 1 dochashmi-hey, and 5 additional characters are used to form every composite letter. For our experiments, we extracted around 5000 characters/ligatures from the UNHD dataset [21] (a benchmark of the Urdu ligatures most frequently used for word formation in Urdu Lugath). From the EMILLE corpus [22] we extracted around 2000 characters; from DBAHCL [24] we chose the characters that are common to the Urdu and Arabic scripts; and similarly, from the CDB/Farsi [23] dataset we chose the characters that are common to Urdu and Farsi. To ensure the objectivity of the experiment, we manually prepared an additional Urdu character dataset with different fonts that are not present in the above datasets.
The image input layer of AlexNet requires input images of size 227 × 227 × 3, while GoogleNet and ResNet18 require 224 × 224 × 3. We therefore used an augmented image datastore to resize the input images dynamically.
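The per-network resizing step can be illustrated with a small, framework-free sketch. It is not the augmented image datastore used in the experiments: nearest-neighbour interpolation, the helper names, and the tiny grayscale glyph are assumptions made for illustration. The ResNet18 size of 224 in the table below is the input size of the standard pretrained ResNet18 and should be adjusted to whatever input layer the network actually has.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grayscale image given as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_three_channels(image):
    """Replicate a grayscale image into three identical channels (H x W x 3)."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in image]


# Input side length per network; the ResNet18 value is an assumption (see lead-in).
INPUT_SIZES = {"AlexNet": 227, "GoogleNet": 224, "ResNet18": 224}

glyph = [[0, 1, 2],
         [3, 4, 5]]  # hypothetical 2 x 3 grayscale character image

batch = {
    name: to_three_channels(resize_nearest(glyph, size, size))
    for name, size in INPUT_SIZES.items()
}

for name, tensor in batch.items():
    print(name, len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Each network receives its own copy of the character image at the size its input layer expects, so the same source image can be passed to all three feature extractors.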

Feature Concatenation
Our proposed network extracts an individual feature vector from each of the pretrained models, as shown in Fig. 4. For a single character image, we extract 4096 features from 'fc7' of AlexNet (the last fully connected layer before the classification layer), 7863 features from inception 5b of GoogleNet, and 3894 features from the res 5b layer of ResNet18. Although all three pretrained models are trained on the same ImageNet database, distinct networks extract diverse features from a character image [40]. We therefore concatenate the features of the three CNNs, combining the information from the different networks into a more discriminative feature. Finally, classification is performed by feeding the concatenated features into two fully connected layers. Fig. 5 shows the Urdu character feature map of each CNN model and the strongest activations from the corresponding layers.
Initially, we compared the classification accuracy of our proposed network with that of the three individual pretrained models. Throughout this investigation, a Zotac NVIDIA GTX 1060 graphics card (CUDA v10.0) was used for fast computation. Tab. 1 shows that the proposed method consistently outperforms the three individual CNN models on the different datasets. Fig. 6 in particular shows that the performance of the three models fluctuates across the various ratios of the individual datasets, which indicates that distinct CNN models extract diverse features from Urdu character images. After our proposed method, AlexNet shows the best classification accuracy on all the datasets compared with GoogleNet and ResNet18. Selecting a single pretrained network that suits all the datasets at hand for an individual transfer learning network is hard. Fig. 7 shows that our proposed network outperforms the other CNNs considered for cursive character recognition. Tab. 2 also shows that our proposed method outperforms the existing conventional methods for Urdu character recognition at all dataset ratios.
Our simulations show that our network attains encouraging accuracy as early as the 2nd epoch. Fig. 8 shows that, on average, AlexNet took around 28 minutes, GoogleNet 230 minutes, ResNet18 175 minutes, and our proposed network 198 minutes to complete training and execution. This is because OCR-GoogleNet spends more time in its inception modules finding features than OCR-AlexNet does. For time-constrained recognition such as online Urdu OCR, our network is more efficient for training Urdu characters. From all the above comparisons (Tab. 2), we conclude that the limitations of the individual networks are overcome by our proposed network, which concatenates the features from different pretrained networks to produce robust and better performance.
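The feature sizes reported in this section determine the length of the concatenated vector and the position each network occupies within it. The short sketch below works through that arithmetic; the dictionary keys and the `concat_layout` helper are hypothetical names, while the three sizes are the ones stated in the text.

```python
# Feature sizes per network as reported in this section.
FEATURE_DIMS = {
    "AlexNet fc7": 4096,
    "GoogleNet inception 5b": 7863,
    "ResNet18 res 5b": 3894,
}


def concat_layout(dims):
    """Return the (start, end) slice of each block in the concatenated vector, plus its length."""
    layout, start = {}, 0
    for name, size in dims.items():
        layout[name] = (start, start + size)
        start += size
    return layout, start


layout, total = concat_layout(FEATURE_DIMS)
for name, (begin, end) in layout.items():
    print(f"{name}: [{begin}, {end})")
print(total)  # 15853 = 4096 + 7863 + 3894
```

Under these sizes, the first fully connected layer of the classifier would take a 15853-dimensional input per character image.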

Figure 8 Execution Time (Minutes) of AlexNet, GoogleNet, ResNet18 and the Proposed Network

CONCLUSION
In this paper, we proposed a novel deep CNN for handwritten Urdu character recognition by transfer learning three pre-trained CNN models (AlexNet, GoogleNet, and ResNet18). We concatenated the features extracted from these individual models and trained two fully connected layers on them. We evaluated our proposed network on four different standard Urdu datasets. To ensure the objectivity of the experiment, we also tested our network on our handcrafted dataset. The results show that, across all the datasets, our proposed network achieved 97.18% average accuracy, which outperforms all the individual CNN models considered for character recognition. Additionally, we compared our proposed network with conventional classification methods, which again shows that our method attains a substantial accuracy gain. Multi-layer hierarchical feature concatenation from multiple CNNs to further improve classification performance will be considered in future work. The proposed system is a step towards an ideal character recognition system for Urdu script.