A Novel Approach to Traffic Density Estimation Using CNNs and Computer Vision

— In modern life we face many problems, one of which is increasingly serious traffic congestion. Its causes include the large volume of vehicles, inadequate and unevenly distributed infrastructure, and ineffective traffic signal control. This calls for methods to optimize traffic flow, especially during peak hours, which in turn requires determining the traffic density on streets and at intersections at each point in time. This paper proposes a novel approach to traffic density estimation using Convolutional Neural Networks (CNNs) and computer vision. Experimental results on the UCSD traffic dataset show that the proposed solution achieved a worst estimation rate of 98.48% and a best estimation rate of 99.01%.


I. INTRODUCTION
Traffic density estimation provides information to Intelligent Transport Systems (ITS) for road planning, intelligent routing, traffic control, scheduling, etc. [1]. Accurate estimation of traffic density is essential for the development of automated signaling and early warning systems, as well as of statistical, planning, and security applications. Furthermore, traffic density data can help drivers choose the optimal way among multiple routes [2].
Earlier methods of estimating traffic density were usually based on vehicle counting. Pang et al. [3] introduced a method to count the number of vehicles when multiple vehicles are present in a traffic image. Chen et al. [4] developed a system for counting vehicles in dark environments using their headlights. Pornpanomchai et al. [5] studied video vision for vehicle counting. Mohana et al. [6] studied vehicle counting in real time. Zhao and Wang [7] studied vehicle counting in mixed traffic zones. Today's traffic management systems use image and video processing techniques to extract information from surveillance videos on roads [8].
With the rapid development of computer vision and digital image processing technology, image- and video-based traffic flow detection systems have become increasingly capable and intelligent. Due to these benefits, image- and video-based vehicle detection technology is becoming more and more important for ITS [9]. The approach proposed by Gupta et al. [10] is based on traffic calculation by comparing two images: a reference image and a real traffic image. Kanojia [11] proposed another method to control traffic signals using image processing techniques: a reference image with no or few cars is first selected, and each actual image is matched against it; based on the matching rate, the traffic lights are controlled.
(Submitted on August 04, 2021. Published on August 31, 2021. Luong Anh Tuan Nguyen, IT Faculty, Vietnam Aviation Academy, Vietnam. Corresponding e-mail: nlatuan@vaa.edu.vn)
In recent years, artificial intelligence and computer vision have made remarkable achievements and are applied in many fields such as security, economy, commerce, and education. Image recognition is an intelligent capability that arises from the combination of computer vision and artificial intelligence.
Deep learning, which developed from neural networks, has produced many outstanding results in computer vision. The Convolutional Neural Network (CNN) is one of the best-known and most widely used deep learning models in this field.
The combination of artificial intelligence and computer vision has made traffic density estimation based on image processing techniques increasingly effective. Celil Ozkurt and Fatih Camci [12] proposed a method to estimate traffic density and classify vehicles using neural networks. Julian Nubert et al. [13] estimated traffic density with convolutional neural networks (CNNs); the goal of that project was to introduce a machine learning application aimed at improving the quality of life of Singaporeans. R. Kaviani et al. [14] proposed a method to estimate traffic density based on the Topic model, a method that is quite popular in natural language processing.
Building on these related works, the authors propose a traffic density estimation model for traffic video clips based on computer vision combined with convolutional neural networks (CNNs). The experimental results on the UCSD traffic dataset show that the proposed solution achieved a worst estimation rate of 98.48% and a best estimation rate of 99.01%.
The remainder of the paper is organized as follows. Section II presents the design and implementation of the proposed solution. In Section III, the numerical results of the experiments are illustrated. Finally, Section IV concludes the paper and outlines future work.

II. PROPOSED SOLUTION
The system architecture, which comprises three modules, is shown in Fig. 1. The first module, "Cut and Discard Similar Frames", cuts the input traffic video clip into frames and then discards adjacent similar frames.
The second module, "Pre-Processing", uses the OpenCV library and an image segmentation algorithm [15] to remove non-vehicle objects from traffic images.
The third module, the CNN module, estimates traffic density from the output image of the second module.

A. "Cut and Discard Similar Frames" Module
This module cuts the input traffic video clip into frames, where each frame is a traffic image. In practice, adjacent frames show almost no change in content, so to reduce the processing time of the entire system, frames with similar content must be discarded.
This module is implemented with the following steps:
1) Convert frames K and K+1 from color images to grayscale images using the averaging method (R + G + B)/3.
2) Compute the histogram of the grayscale image of frames K and K+1, then convert each histogram into a one-dimensional array of 256 elements, called vector K and vector K+1.
3) Calculate the similarity between vector K and vector K+1 using the following correlation coefficient:

r = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / √(Σᵢ(Xᵢ − X̄)² · Σᵢ(Yᵢ − Ȳ)²)

where X = (X₁, X₂, …, Xₙ) is vector K, Y = (Y₁, Y₂, …, Yₙ) is vector K+1, and N = 256. For these histograms this correlation coefficient takes values from 0 to 1. If it is greater than or equal to 0.95, frames K and K+1 are considered similar and frame K+1 is discarded.
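The three steps above can be sketched in NumPy as follows (a minimal illustration of the grayscale conversion, 256-bin histogram, and correlation check; the paper's actual implementation details beyond these steps are not specified):

```python
import numpy as np

def to_grayscale(frame):
    # Step 1: averaging method (R + G + B) / 3.
    return frame[..., :3].mean(axis=2)

def gray_histogram(gray):
    # Step 2: 256-bin histogram of pixel intensities, as a 1-D vector.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return hist.astype(float)

def correlation(x, y):
    # Step 3: correlation coefficient between the two histogram vectors.
    xm, ym = x - x.mean(), y - y.mean()
    denom = np.sqrt((xm ** 2).sum() * (ym ** 2).sum())
    return float((xm * ym).sum() / denom) if denom else 1.0

def is_similar(frame_k, frame_k1, threshold=0.95):
    # Frame K+1 is discarded when the correlation is >= 0.95.
    hist_k = gray_histogram(to_grayscale(frame_k))
    hist_k1 = gray_histogram(to_grayscale(frame_k1))
    return correlation(hist_k, hist_k1) >= threshold
```

Iterating `is_similar` over consecutive frames of a clip keeps only frames whose content has actually changed, which is what reduces the processing load of the later modules.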

B. "Pre-Processing" Module
The purpose of this module is to detect vehicle objects in traffic images, retaining the vehicle content and filtering out non-vehicle content. The output of this module is therefore an image containing only vehicle objects, with non-vehicle objects discarded, and it serves as the input to the CNN module. This module uses the OpenCV library and an image segmentation algorithm to remove non-vehicle objects from traffic images. Fig. 2 illustrates the input image (a) and output image (b) of this module.
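The effect of this module can be illustrated with a simple background-subtraction sketch in NumPy (an assumed stand-in for the segmentation algorithm [15], whose details are not given here): pixels that differ from an empty-road background image are kept as candidate vehicle pixels, everything else is blacked out.

```python
import numpy as np

def extract_vehicles(frame, background, threshold=30):
    # Per-pixel absolute difference from the empty-road background,
    # taking the maximum over the color channels.
    diff = np.abs(frame.astype(int) - background.astype(int)).max(axis=2)
    # Pixels that changed enough are treated as vehicle pixels.
    mask = diff > threshold
    # Output keeps only vehicle pixels; non-vehicle regions become black.
    output = np.zeros_like(frame)
    output[mask] = frame[mask]
    return output
```

The `background` image and the `threshold` value are illustrative assumptions; the real module relies on OpenCV and the cited segmentation algorithm to produce the vehicle-only image fed to the CNN.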

C. CNN Module
This module is shown in Fig. 3 and includes the following layers:
1) Convolutional layer: extracts features from input images. Convolution preserves the relationship between pixels by learning image features over small squares of input data.

2) ReLU layer: ReLU stands for Rectified Linear Unit, a non-linear operation with output f(x) = max(0, x). It introduces non-linearity into the network, which is needed because the patterns to be learned from real-world data are generally non-linear.

4) Fully Connected Layer:
The flatten layer is used to transfer pooled feature map into a one-dimensional vector. The flattened matrix goes through a fully connected layer to classify emotions. In the proposed method, using the softmax activation function in the output layer to represent a categorical distribution over class labels, and obtaining the probabilities of each input element belonging to a label. Fully Connected Layer is described in Fig. 4.
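A single forward pass through these layers can be sketched in NumPy (layer sizes and weights here are illustrative assumptions, not the paper's actual architecture; the three softmax outputs stand for the light/medium/heavy density classes):

```python
import numpy as np

def conv2d(image, kernel):
    # Valid 2-D convolution (cross-correlation, as in most CNN libraries).
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def relu(x):
    # ReLU layer: f(x) = max(0, x).
    return np.maximum(0, x)

def max_pool(x, size=2):
    # Pooling layer: keep the maximum of each size x size block.
    h, w = x.shape[0] - x.shape[0] % size, x.shape[1] - x.shape[1] % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    # Categorical distribution over the class labels.
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(image, kernel, weights, bias):
    # conv -> ReLU -> max-pool -> flatten -> fully connected -> softmax.
    features = max_pool(relu(conv2d(image, kernel)))
    flat = features.ravel()
    return softmax(weights @ flat + bias)
```

In a trained network `kernel`, `weights`, and `bias` are learned parameters; the sketch only shows how the layer types described above compose into class probabilities.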

III. EXPERIMENTAL RESULTS
The experimental dataset was obtained from UCSD [17] and consists of 253 highway traffic video clips collected by the Washington Department of Transportation using two cameras. Each video clip has about 42-52 frames of 320×240 pixels. The clips are divided into three traffic densities (light, medium, and heavy traffic) under overcast, rainy, and clear-sky weather conditions. The dataset is divided into the following two parts:
• Training dataset: 80%
• Testing dataset: 20%
The experiments use 200 video clips, as shown in Table I. The 5-fold cross-validation setup with an 80% training / 20% testing split is shown in Fig. 7. The 5-fold cross-validation experiments, with confusion matrices [18] used to evaluate system performance, are depicted in Fig. 8. The average classification rate obtained over the 5 cross-validation rounds is shown in Fig. 9; the overall average correct classification rate is 98.66%. The estimation rate of each type of traffic density is shown in Fig. 10.
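The 5-fold split over the 200 clips can be sketched as follows (a minimal index-splitting illustration; any shuffling or stratification by density label used in the actual experiments is not specified here):

```python
def five_fold_splits(n_clips=200, k=5):
    # Each round holds out one fold (20% = 40 clips) for testing
    # and trains on the remaining folds (80% = 160 clips).
    indices = list(range(n_clips))
    fold = n_clips // k
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test
```

Averaging the per-round classification rates of these five train/test splits yields the overall rate reported above.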
With the UCSD Traffic dataset [17], the work of [14] achieved the traffic density estimation rates shown in Table II, while the proposed model achieves the rates shown in Table III. From the data in Tables II and III, it can be seen that the estimation rates of [14] are uneven across the density labels, whereas the rates of the proposed model are more even, which shows that the proposed model is more stable.
Moreover, the "Light" estimation rate of the proposed model is slightly lower than that of [14] (99.01% compared to 100%), a small gap that can be remedied in future work, while for the "Medium" and "Heavy" densities the rates of [14] are much lower than those of the proposed model. Fig. 11 depicts the traffic density estimation rates of [14] and of the proposed model on the UCSD Traffic dataset [17].

IV. CONCLUSION
The authors have studied the theoretical basis of image processing, computer vision, image feature extraction methods, the OpenCV library, and models and methods for estimating traffic density from images and traffic video clips. Based on this, the authors proposed a traffic density estimation model using convolutional neural networks (CNNs) and computer vision, with traffic video clips as input. The proposed model was tested on the video-clip dataset from UCSD (University of California San Diego) [17]. The experimental results show that the proposed solution achieved a worst estimation rate of 98.48% and a best estimation rate of 99.01%. The proposed model is more stable than the work of [14]; for the "Medium" and "Heavy" densities, the rates of [14] are much lower than those of the proposed model.
In future work, the authors will experiment with more traffic video clips covering different situations and environments to further evaluate and improve the proposed model.