Development of A Computer Aided Real-Time Interpretation System for Indigenous Sign Language in Nigeria Using Convolutional Neural Network

Abstract—Sign language is the primary method of communication adopted by deaf and hearing-impaired individuals. The indigenous sign language in Nigeria is one area receiving growing interest, with the major challenge being communication between signers and non-signers. Recent advancements in computer vision and deep learning neural networks (DLNN) have led to the exploration of technological concepts for tackling existing challenges. One area with extensive impact from the use of DLNN is the interpretation of hand signs. This study presents an interpretation system for the indigenous sign language in Nigeria. The methodology comprises three key phases: dataset creation, computer vision techniques, and deep learning model development. A multi-class Convolutional Neural Network (CNN) is designed to train on and interpret the indigenous signs in Nigeria. The model is evaluated using a custom-built dataset of selected indigenous words comprising 15,000 image samples. The experimental outcome shows excellent performance from the interpretation system, with accuracy reaching 95.67%.

A technique created by [12] entails YCbCr (Luminance; Chroma: Blue; Chroma: Red) conversion, edge detection, shape enhancement, and CbCr (Chroma: Blue; Chroma: Red) mapping. [13] presented a technique that combines the TSL (Tint, Saturation, Luminance) skin color model and an improved Kalman filter. The technique excludes extraneous skin-colored objects through the improved Kalman filter model, which evaluates the motion and its midpoint location. [14] employed gloves with random distributions of 10 fully saturated colors across 20 patches. Nevertheless, the approach requires regular tuning to keep its operation robust [15], [16].
Fuzzy C-Means (FCM) clustering and a thresholding technique were employed by [17] for segmenting the face and hand. FCM is an iterative clustering method that engages fuzzy partitioning. The algorithm first clusters the image using its color data; a threshold over the range of values associated with skin tone then selects the mean value.
[18] applied a method involving two-step hand segmentation. The method applies canny edge detection on the images first to identify the edges. The second phase involves applying the seeded region growing technique to segment the hand from its background. [19] use morphological operations and HSV (Hue, Saturation, Value) thresholding to segment the hand from its background. Determination of the HSV threshold values is done based on the skin color of the hand before applying it to the images. Morphological operations, including dilation and erosion, are further applied to eliminate noise. [20] used three-hand segmentation algorithms in the development of a recognition system. The three algorithms include the Otsu algorithm, thresholding in YCbCr color space, and the Gaussian Mixture Model (GMM).
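Of the segmentation algorithms surveyed above, the Otsu algorithm used by [20] is simple enough to sketch directly. The snippet below is an illustrative NumPy implementation (not the cited authors' code); the synthetic two-tone image stands in for a real hand frame:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the threshold that maximizes between-class variance (Otsu's method)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w_b, sum_b = 0, 0.0
    for t in range(256):
        w_b += hist[t]                     # background weight
        if w_b == 0:
            continue
        w_f = total - w_b                  # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic two-mode image: dark background with a bright "hand" region
img = np.full((100, 100), 30, dtype=np.uint8)
img[30:70, 30:70] = 200
t = otsu_threshold(img)
mask = (img > t).astype(np.uint8)          # binary hand mask
```

In practice a library routine (e.g. OpenCV's Otsu flag) would be used, but the loop above shows the exhaustive search over thresholds that the algorithm performs.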
[21] employed a heuristic optimization algorithm for the hyperparameter optimization process. The hyperparameters for the CNN structure used are based on the AlexNet model, alongside the outcome of the optimization algorithms. Recognition accuracies of 98.09% and 98.40% were achieved on the two public datasets used for evaluation. [22] used a connected-component analysis algorithm on the hand to segment the fingertips as the Region of Interest (RoI). A gesture detection and recognition model using the segmented fingers was developed via CNN, achieving a 96.2% recognition accuracy.
Over the past years, several techniques have been developed for the different sign languages used in different countries. For the sign language of indigenous words in Nigeria, which appears to be adopted and adapted from American Sign Language (ASL), the interpretation system presented here is designed to address the peculiarities of those indigenous words.

III. BACKGROUND STUDIES
Most recognition systems focus on three major techniques for hand sign analysis: appearance-based, depth-based, and sensor-based systems. The appearance-based analysis of hand signs combines two key stages: i. Detection/Tracking; ii. Recognition.

A. Detection/Tracking
Detection plays a key part in the interpretation system. Selective preprocessing and segmentation are required mainly on the raw input samples to aid feature extraction from random/uniform backgrounds. Tracking techniques involve reading the hand movement and its position [23]. Notable steps to the detection of the hands are: i. Color Space Thresholding; ii. Binarization; iii. Background Removal.

B. Recognition
The primary task for an interpretation system is to identify hand signs correctly in specific positions. Dynamic and static hand signs are the two recognition categories in hand-sign research.
IV. DATA PROCESSING
Fig. 1 shows the framework for the proposed interpretation system. Interpretation systems for sign language combine three key stages: Segmentation, Feature Extraction, and Classification. The segmentation process depends on the capture system employed and focuses on obtaining the hand shape from input streams. The hand tracking and detection process uses the appearance-based method.

A. Data Collection
To create the database for the interpretation system, the selected indigenous signs and their connotations were obtained, with each sign containing several samples. The study employed questionnaires, physical interaction, and indigenous sign illustrations [24]. The appearance-based technique was employed to acquire the needed data; it was chosen because it is user-friendly and a popular option in sign recognition systems. RGB cameras were used to capture multiple image frames of the hand signs needed to create the dataset. Collecting multiple samples per sign helps increase the accuracy of the system, and the collected signs were adequate to create the training, validation, and test datasets. The image data and labels are saved into an SQLite database and served as direct inputs for the machine learning algorithms. The word dataset contains 15 words from the indigenous sign language in Nigeria. Fig. 2 shows some samples from the created dataset. The word dataset has the following features: i. Each class has 1000 image samples from several signers, saved in the .JPG format. ii. The dataset contains a total of 15,000 (1000 × 15) images.

B. Preprocessing
From the raw hand-sign samples with normal backgrounds, the hand area is cropped out of the images; the approach focuses on the hand, the key element needed by the interpretation system. The system converts the image samples from RGB (Red, Green, Blue) to the HSV color space. The images are then converted into binarized images using histogram back-projection with the Otsu approach [25] as the automatic thresholding technique. All images are further resized to 50×50 pixels.
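The preprocessing pipeline above (RGB-to-HSV conversion, thresholding, and resizing to 50×50) can be sketched in plain NumPy. The threshold values and the nearest-neighbour resize below are illustrative assumptions; the paper's actual pipeline uses histogram back-projection with Otsu thresholding:

```python
import numpy as np

def rgb_to_hsv(img):
    """Vectorized RGB -> HSV for float images in [0, 1]; h, s, v all in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    v = img.max(axis=-1)
    c = v - img.min(axis=-1)                       # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    h = np.zeros_like(v)
    m = c > 0
    rm = m & (v == r)
    gm = m & (v == g) & ~rm
    bm = m & (v == b) & ~rm & ~gm
    h[rm] = (((g - b) / np.maximum(c, 1e-12))[rm]) % 6
    h[gm] = ((b - r) / np.maximum(c, 1e-12))[gm] + 2
    h[bm] = ((r - g) / np.maximum(c, 1e-12))[bm] + 4
    return np.stack([h / 6.0, s, v], axis=-1)

def resize_nn(img, size=50):
    """Nearest-neighbour resize to size x size (a stand-in for a library resize)."""
    ys = (np.arange(size) * img.shape[0]) // size
    xs = (np.arange(size) * img.shape[1]) // size
    return img[ys][:, xs]

rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))                  # stand-in for an RGB camera frame
hsv = rgb_to_hsv(frame)
mask = (hsv[..., 1] > 0.2) & (hsv[..., 2] > 0.35)  # illustrative threshold values
binary = resize_nn(mask.astype(np.uint8))          # 50 x 50 binary sample for the CNN
```

In a production pipeline the conversion and thresholding would typically go through OpenCV (`cv2.cvtColor`, `cv2.calcBackProject`, `cv2.threshold` with the Otsu flag); the sketch only makes the stages explicit.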

C. Augmentation
Real-world data tends to be random and exhibits several types of distortion, including rotation, shifting, and more. Data augmentation remains a popular strategy [26] for enhancing performance (with regard to accuracy) on a particular dataset. To prepare the dataset for such situations, image augmentation techniques are applied: the image samples are rotated randomly by angles between 0 and 10 degrees, in both clockwise and anticlockwise directions. The augmentation applied to the system data shows promising results. Fig. 3 shows samples of the augmented images.
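A minimal sketch of the rotation augmentation described above, using scipy.ndimage.rotate with angles drawn uniformly from ±10 degrees (the seed and the random sample image are illustrative):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(img, max_deg=10):
    """Rotate by a random angle in [-max_deg, +max_deg] degrees,
    i.e. up to 10 degrees clockwise or anticlockwise."""
    angle = rng.uniform(-max_deg, max_deg)
    # reshape=False keeps the 50x50 frame size; mode='nearest' fills the corners
    return rotate(img, angle, reshape=False, order=1, mode='nearest')

sample = rng.random((50, 50))                  # stand-in for a binarized 50x50 sample
augmented = np.stack([augment(sample) for _ in range(5)])
```

An equivalent effect is available through Keras's data-augmentation utilities; the SciPy version is shown only because it is self-contained.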

D. Splitting Dataset
The word dataset contains 1000 exemplars for each of its 15 signs. The total number of images in the dataset is 15000. The word dataset is further split into training, validation, and test set. The three sets are split using the ratio of 80:10:10, respectively.
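The 80:10:10 split can be sketched as a shuffled index partition; the stand-in arrays below mirror the dataset's 15 classes of 1000 samples each (the seed and feature width are illustrative):

```python
import numpy as np

def split_dataset(X, y, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    n_val = int(0.1 * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X = np.arange(15000 * 4).reshape(15000, 4)  # stand-in for 15,000 flattened images
y = np.repeat(np.arange(15), 1000)          # 15 word classes, 1000 samples each
(tr, ytr), (va, yva), (te, yte) = split_dataset(X, y)
```

Shuffling before splitting matters here because the samples are stored class by class; a sequential split would leave entire words out of the training set.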

V. DEEP LEARNING MODEL
A deep neural network is engaged in this system to interpret indigenous hand signs; the network used is a Convolutional Neural Network (CNN). The CNN model used, the model training, evaluation, and other methodologies are discussed in this section.

A. Model Design
The system uses the created dataset as input to the CNN to classify the indigenous sign language; the CNN performs feature extraction on the dataset. The CNN implementation and model training were carried out using the Keras library. The architecture comprises the following layers: 2D Convolutional Layer, Max-pooling Layer, Dense Layer, and Output Layer. A small Region of Interest (RoI) was employed for the image samples to avoid misleading performance and overfitting in the learning process.
The CNN model comprises three convolutional layers, two pooling layers, and two fully connected layers, combining the ReLU (Rectified Linear Unit) and SoftMax activations. In addition, the batch-normalization [27] learning approach is employed to avert overfitting. The first convolutional layer takes in the input shape using 8 kernels of size 1 × 1, followed by a batch normalization layer. Convolution is centered around merging two sets of information; in the case of a CNN, these are the kernel function and the input data. Considering a kernel k[m, n] with dimensions [2N + 1, 2N + 1], where (m, n) = (0, 0) is the center of the kernel, the output y[i, j] of applying the kernel to the input x[i, j] is given in (1):

y[i, j] = Σ_{m=−N}^{N} Σ_{n=−N}^{N} k[m, n] · x[i − m, j − n]    (1)

The second convolutional layer follows with 16 kernels of size 1 × 1 and a batch normalization layer, after which a 2D maxpool layer with a (2, 2) pool size is added. The maxpool layer decreases the spatial size of the subsequent feature map by aggregating the generated features. The third convolutional layer has 32 kernels of size 3 × 3 with an additional batch normalization layer, followed by another 2D maxpool layer with a (5, 5) pool size and a stride of 5.
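A Keras sketch of the architecture described above, under stated assumptions: the hidden dense width (128 units) and the single-channel 50×50 input are not specified in the text, and the He normal initializer from the training section is applied here for completeness:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(50, 50, 1), num_classes=15, dense_units=128):
    """CNN sketch: three conv layers with batch normalization, two maxpool
    layers, a ReLU dense layer, and a SoftMax output (dense_units is assumed)."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(8, (1, 1), activation='relu', kernel_initializer='he_normal'),
        layers.BatchNormalization(),
        layers.Conv2D(16, (1, 1), activation='relu', kernel_initializer='he_normal'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_normal'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 5), strides=5),
        layers.Flatten(),
        layers.Dense(dense_units, activation='relu', kernel_initializer='he_normal'),
        layers.Dense(num_classes, activation='softmax'),
    ])

model = build_model()
```

Per Section V-B, the model would then be compiled with SGD and categorical cross-entropy, e.g. `model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])`.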
Next, a flatten layer and a fully connected dense layer are added to accept the output elements. This fully connected layer uses the ReLU activation function, expressed by (3):

f(x) = max(0, x)    (3)
The SoftMax classifier is the final layer in the classification block. The model used the ReLU activation function on every layer and the categorical cross-entropy cost function for loss optimization.
ReLU does not activate neurons whose input values are negative. This makes the model more computationally efficient than the tanh and sigmoid functions, as it creates sparsity. Table I shows the layers, kernel size, and the number of neurons in the model.

B. Model Optimization
Model optimization is applied to make models more efficient and dependable on input data. The deep learning model employs Stochastic Gradient Descent (SGD) as the optimizer when compiling the model. SGD performs a parameter update for every training sample, delivering much faster updates, completing a single update at a time.
Cross-entropy provides a better option for the cost function; for improved classification and prediction outcomes, the categorical cross-entropy cost function is employed in the model.
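To make the SGD-plus-cross-entropy pairing concrete, the NumPy sketch below performs one SGD update on a single sample for a linear SoftMax classifier, a simplified stand-in for the CNN's final layer (the learning rate here is exaggerated for visibility; the paper's value is 0.0001):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(w, x, y_onehot, lr):
    """One SGD update on a single sample. For logits z = w @ x with a SoftMax
    output, the categorical cross-entropy gradient w.r.t. z is (softmax(z) - y)."""
    p = softmax(w @ x)
    grad = np.outer(p - y_onehot, x)     # dL/dw for the linear layer
    return w - lr * grad

rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=(15, 512))   # 15 word classes, 512 features (assumed)
x = rng.random(512)
y = np.eye(15)[3]                            # true class: index 3
loss_before = -np.log(softmax(w @ x)[3])
w = sgd_step(w, x, y, lr=0.1)
loss_after = -np.log(softmax(w @ x)[3])
```

A single update already lowers the cross-entropy loss on that sample, which is exactly the per-sample behavior of SGD that the text describes.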

C. Model Training
In training the model, He normal initialization [28] was applied to initialize every kernel towards achieving quicker convergence. The batch-normalization technique was applied to keep the convolution layers from learning less relevant features and to avoid overfitting to the features.
Batch normalization (BN) is a technique for normalizing activations between connecting layers of deep neural networks to enhance accuracy and training [29]. The addition of the batch-normalization layer aids in the reduction of covariance shift within the network's hidden kernels. The output from the previous layer is normalized via BN using the present batch statistics. This approach stabilizes the weight distribution of the hidden kernel units and adapts to any variation in the data distribution. The BN layer also provides a level of regularization, as weights are scaled depending on the batch mean and variance, which gives the weights stability. Fig. 4 shows the model training flowchart.
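The per-batch normalization described above can be expressed in a few lines of NumPy; gamma and beta are the learnable scale and shift parameters, fixed here for illustration:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations with its own mean and variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
acts = rng.normal(loc=5.0, scale=3.0, size=(32, 16))  # batch of 32, 16 channels
out = batch_norm(acts)
```

After normalization each channel has roughly zero mean and unit variance, which is the "present batch statistics" behavior that Keras's BatchNormalization layer implements during training.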

VI. EXPERIMENT AND RESULT
The model is trained using the training and validation sets, representing 80% and 10% of the dataset distribution, respectively. The neural network performs classification for the indigenous sign language in Nigeria by learning key features for the individual classes in the dataset. The classification model attained excellent performance with no transfer learning. The downside of transfer learning in this setting is that if similar data exist for some classes in a subsequent classification task, the possibility of data leakage would bias the transferred model.
In tackling this issue, the model was initialized via the He initialization for classification tasks. The model was trained for 30 epochs, with experimentation performed on multiple hyperparameters, including regularization and learning rate. The best value found for the initial learning rate is 0.0001. The model also employs a learning-rate schedule that uses step decay to avert divergence and smooth the loss curve: the step decay takes the learning rate and decreases it by a factor of 0.1 after every 5th epoch.
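The step-decay schedule described above reduces to a one-line function:

```python
def step_decay(epoch, initial_lr=0.0001, drop=0.1, epochs_per_drop=5):
    """Learning rate decayed by a factor of 0.1 after every 5th epoch."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rate over the 30 training epochs
schedule = [step_decay(e) for e in range(30)]
```

In Keras this could be wrapped as `LearningRateScheduler(lambda epoch, lr: step_decay(epoch))`; the lambda is needed because the callback passes the current rate as a second argument, which would otherwise shadow `initial_lr`.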
Accuracy, confusion matrix, Recall, Precision, and F1-score were used to evaluate the model performance. Fig. 5 shows the plot of training accuracy against validation accuracy, and Fig. 6 shows the plot of training loss against validation loss; both plots cover 30 epochs. For the multi-class classification model, the accuracy plot shows that the curve becomes smoother after every 5 epochs. This variation comes from the learning rate decay applied after every 5th epoch: decreasing the learning rate by a factor of 0.1 helps the model converge optimally and stabilizes the learning process. The final model developed using the word dataset containing the indigenous sign language in Nigeria achieved 95.67% training accuracy and 95.28% validation accuracy. Table II shows the evaluation outcomes. A confusion matrix (error matrix) for the classification model is shown in Fig. 7; it presents the class-wise performance of the developed model. The multiclass classification model was tested with a total of 1500 test samples (10% of the word dataset), comprising 15 word classes with 100 samples each.
From the confusion matrix in Fig. 7, it was observed that of the 1500 samples:
i. Test samples for Niger, Calabar, Enugu, Imo, Bayelsa, Bauchi, and Igbo were all correctly predicted. The 100 correctly predicted samples for each of these words give a True Positive (TP) value of 100 and a False Negative (FN) value of 0. Calabar, Enugu, Imo, and Bauchi have a False Positive (FP) value of 0, as no sample was incorrectly predicted as these classes. The True Negative (TN) value for each class is calculated as the total sample count (1500) minus the TP, FP, and FN values. Table III shows the TP, FP, FN, and TN values obtained for Niger, Calabar, Enugu, Imo, Bayelsa, Bauchi, and Igbo.

Word      TP    FP   FN   TN
Niger     100   3    0    1397
Calabar   100   0    0    1400
Enugu     100   0    0    1400
Imo       100   0    0    1400
Bayelsa   100   0    0    1400
Bauchi    100   0    0    1400
Igbo      100   2    0    1398

ii. For Ondo, Borno, Abia, Ekiti, Kastina, Kaduna, Anambra, and Are, the outcomes show varying TP, FP, and FN values. Ondo has the lowest TP value, with 68. The highest misclassification count, 31, was recorded between "Ondo" and "Are." The TP, FP, TN, and FN values for classes 1, 6, 7, 8, and 9 are shown in Table IV.
Based on the above observations, the Precision, Recall, F1-score, and Support for all words are shown in Table V. The micro-averaged Precision, Recall, and F1-score are calculated by pooling the TP, FP, and FN counts over all classes. Based on these performance metrics, the classification model for the word dataset has a Precision of 0.96, a Recall of 0.95, and an F1-score of 0.95 (each on a scale of 0 to 1).
The F1-score was used because it emphasizes the false negatives and false positives in its evaluation. Table VI shows the average Precision, Recall, and F1-score of the developed model. These experiments show that the model performed considerably well given the quality of the input data and the training process. The outcomes also validate the robustness and precision of the proposed model in interpreting indigenous sign language in Nigeria.
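The per-class and micro-averaged scores reported above follow directly from the confusion-matrix counts. The sketch below recomputes them from (TP, FP, FN) triples; the first two match Table III (Niger and Calabar), while the third is an illustrative Ondo-like class whose FP count is an assumption:

```python
def metrics(tp, fp, fn):
    """Precision, Recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def micro_average(counts):
    """Micro-averaging pools TP/FP/FN over all classes before computing scores."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return metrics(tp, fp, fn)

per_class = [(100, 3, 0),   # Niger (Table III)
             (100, 0, 0),   # Calabar (Table III)
             (68, 4, 32)]   # illustrative Ondo-like class; FP assumed
micro = micro_average(per_class)
```

Micro-averaging is the appropriate summary here because every class contributes the same number of test samples (100), so pooled counts weight each prediction equally.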

VII. CONCLUSION
This paper presents an efficient real-time interpretation system for the sign language of some indigenous words in Nigeria. A dataset containing word samples from the indigenous sign language in Nigeria was created to train the model. This paper considered hand movement as the key feature, creating static image samples from a continuous video stream subjected to image processing techniques. The model developed uses a convolutional neural network as the classifier; for appearance-based interpretation of signs, the CNN proves to be a strong choice. The proposed model recognizes and interprets individual signs as text, helping to remove the communication gap between signers and non-signers. The method developed achieved excellent interpretation accuracies of 95.67% (training) and 95.28% (validation). The proposed interpretation system using the CNN classification approach achieves a Precision of 0.95, a Recall of 0.95, and an F1-score of 0.95 (on a 0-1 scale). Towards improving the interpretation system, the next step is to gather more data for Nigeria's indigenous sign language. In consideration of approach and practicality, more focus will go into interpreting and recognizing idioms, phrases, and sentences available in the indigenous signs of Nigeria.

ACKNOWLEDGMENT
We acknowledge with deep gratitude the special class of the Redeemed Christian Church of God (RCCG) Kings Sanctuary, Akure, Nigeria, and the staff of the School for the Deaf, Akure, Nigeria, for their valuable assistance for this work.