Improving Object Detection with Contrast Stretching (Part 1/2)
An important part of training neural networks is preprocessing the input. Considerable performance gains can be obtained by carefully examining, cleaning and transforming the input data. In this post we consider how contrast stretching of the input images influences the performance of a Faster R-CNN network for object detection. In this Hackathon the objects are white tents in refugee camps. This post is the first part of our series about contrast stretching. The second part can be found here.
A Faster R-CNN is an object detection model built on a conventional CNN (Convolutional Neural Network), a type of network that is often used to classify images. The Faster R-CNN model extends such a CNN with region proposals: regions of the image in which an object is potentially present. The CNN part then classifies these proposals.
Using a training set of 3,500 image slices, we train the Faster R-CNN model for 10 epochs, each on 1,000 randomly chosen images. We validate the model on a validation set of 2,034 image slices containing 1,976 white tents in total.
Figure 1 shows the mAP (mean Average Precision) plot, together with the rolling mean (window=100) of a measure for the size of the detections (sqrt(Area)). Keep in mind that, to create a mAP curve, the detected boxes are sorted by the probability, according to the model, that the box contains a tent. We notice two things. First, although a lot of tents are found (recall=0.7), we have a lot of false positives (precision=0.2). The sharp drop in the mAP curve at the left side of the graph is especially problematic: it indicates that the model is quite confident that these boxes contain a tent, while in reality they do not contain any object. Second, the model seems to be more confident when assessing larger bounding boxes, as those appear (incorrectly, as just discussed) at the beginning of the mAP curve.
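To make the construction of such a curve concrete, here is a minimal sketch in Python with NumPy. It assumes that every detection has already been labelled as a true or false positive (for example by matching it against ground-truth boxes, a step not shown here); the function names `precision_recall` and `rolling_mean` and the toy numbers are our own illustration, not the code used in the Hackathon.

```python
import numpy as np

def precision_recall(scores, is_true_positive, n_ground_truth):
    """Precision and recall after each detection, most confident first."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by descending score
    hits = np.asarray(is_true_positive, dtype=bool)[order]
    tp = np.cumsum(hits)    # true positives seen so far
    fp = np.cumsum(~hits)   # false positives seen so far
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth
    return precision, recall

def rolling_mean(values, window):
    """Unweighted moving average; returns len(values) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Toy example: four detections (already sorted by score), four real tents.
scores = [0.9, 0.8, 0.7, 0.6]
hits = [False, True, True, False]
areas = np.array([400.0, 100.0, 64.0, 36.0])  # box areas in pixels, same order

precision, recall = precision_recall(scores, hits, n_ground_truth=4)
size_trend = rolling_mean(np.sqrt(areas), window=2)
```

In this toy example the most confident detection is a false positive, so the precision starts at 0, which is the same kind of drop at the left side of the curve that we see in Figure 1.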
We try to improve these results by using contrast stretching. Contrast stretching should theoretically improve the learning ability of the model, as it enhances the contours of objects and emphasizes the difference between object and background. This helps the convolution layers to extract information and features from the images. Contrast stretching an image simply means applying a formula that scales up the differences in pixel values. As we would like to use colored images, we have to scale up three image ‘channels’: the red channel (R), the green channel (G) and the blue channel (B). A common convention is to represent each pixel of each channel with an integer in the range [0,255], where an RGB value of (0,0,0) represents black and (255,255,255) represents white.
As we deal with image slices of 400×400 pixels, this results in three channels of 400×400 pixels. So in total one image slice can be represented by a 400×400×3 matrix with values between 0 and 255. Before we dive into the formulas, one final remark: we are going to split this 400×400×3 matrix into three matrices of 400×400. Then we apply contrast stretching to each color channel (R, G and B) and afterwards we combine the three stretched matrices again into a 400×400×3 matrix.
Stretching can then be done by transforming each pixel value x with a formula like:
$$x' = 255 \cdot \frac{x - a}{b - a}$$
where a represents the lower boundary of the range we want to stretch and b represents the upper boundary. Instead of using the minimum and maximum value in the channel as lower and upper boundary, we use the 2nd and 98th percentile values respectively. An advantage of these percentile values is that they are more robust to outliers: with the minimum and maximum as boundaries, a single pixel with value 0 and a single pixel with value 255 in the image channel would be enough to prevent any contrast stretching. With the 2nd and 98th percentile values we generate more stretching. We have to be careful, though: after stretching this way, all values below the 2nd percentile must be set to 0 and all values above the 98th percentile must be set to 255. This directly shows the disadvantage of stretching with boundaries other than the minimum and maximum value: we lose some information.
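The per-channel procedure described above can be sketched with NumPy as follows. This is an illustration under our own assumptions rather than the exact code we ran: the function names `stretch_channel` and `stretch_image` are hypothetical, the output range is taken to be [0, 255], and a channel whose two percentiles coincide is returned unchanged.

```python
import numpy as np

def stretch_channel(channel, low_pct=2, high_pct=98):
    """Map the [a, b] percentile range of one 2-D channel onto [0, 255]."""
    a, b = np.percentile(channel, [low_pct, high_pct])
    if b <= a:
        # Flat channel: there is no range to stretch (assumed behaviour).
        return channel.astype(np.uint8)
    scaled = (channel.astype(np.float64) - a) * 255.0 / (b - a)
    # Values below a become negative and values above b exceed 255: clip both.
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

def stretch_image(image):
    """Stretch R, G and B separately, then restack into height x width x 3."""
    channels = [stretch_channel(image[:, :, k]) for k in range(image.shape[2])]
    return np.stack(channels, axis=2)

# Synthetic low-contrast slice: all values lie between 90 and 159.
rng = np.random.default_rng(0)
image = rng.integers(90, 160, size=(400, 400, 3), dtype=np.uint8)
stretched = stretch_image(image)
```

After stretching, the synthetic slice uses the full range from 0 to 255, while the pixels outside the percentile boundaries have been clipped, which is exactly the loss of information mentioned above.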
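To quantify the effect of the contrast stretching, we introduce a metric that represents the spread in color within one channel k in one image n: the standard deviation $\sigma_{k,n}$.
This standard deviation is obtained from the variance of one 400×400 matrix. If we denote the pixel at row i and column j of this 400×400 matrix by $x_{i,j}$, we get:
$$\sigma^2_{k,n} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \left(x_{i,j} - \bar{x}\right)^2$$
where $\bar{x}$ is the average pixel value in the matrix, calculated with
$$\bar{x} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} x_{i,j}$$
with I = J = 400. Then $\sigma_{k,n} = \sqrt{\sigma^2_{k,n}}$.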
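As a small worked example of this metric, the sketch below computes the standard deviation of each channel of one image with NumPy. It assumes the population form of the variance (dividing by the number of pixels I·J); the function name `channel_std` and the toy image are hypothetical.

```python
import numpy as np

def channel_std(image):
    """Standard deviation per channel of a height x width x channels image."""
    pixels = image.astype(np.float64)
    mean = pixels.mean(axis=(0, 1))                       # average pixel value per channel
    variance = ((pixels - mean) ** 2).mean(axis=(0, 1))   # population variance per channel
    return np.sqrt(variance)

# Toy 2x2 image in which only the first channel varies.
toy = np.zeros((2, 2, 3), dtype=np.uint8)
toy[:, :, 0] = [[0, 10], [10, 20]]
sigmas = channel_std(toy)
```

For the first channel the mean is 10 and the variance is (100 + 0 + 0 + 100) / 4 = 50, so its standard deviation is √50 ≈ 7.07; the two constant channels have a standard deviation of 0.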
Our total dataset consists of N=58,163 images.
In the upper two image slices in Figure 2 the qualitative effect of 2-98 percentile contrast stretching can be seen.
In the two diagrams below we have plotted the histograms of our metric. For each image slice we have calculated the standard deviation of each of the three image channels. Then we apply the contrast stretching separately to the three channels, resulting in the diagram on the right. As we would like to monitor the effect of separately applying contrast stretching to each channel, we also plotted a histogram for the total image, by taking the average of the three standard deviations (R, G and B, so K=3):
$$\sigma_n = \frac{1}{K} \sum_{k=1}^{K} \sigma_{k,n}$$
As can be seen, the shift in the average standard deviation for each image is representative of the shift per channel. Moreover, a substantial increase in standard deviation can be observed. We can calculate the average standard deviation over all the image slices with
$$\mu_\sigma = \frac{1}{N} \sum_{n=1}^{N} \sigma_n$$
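The two averaging steps can be sketched in a few lines of NumPy: first the mean of the per-channel standard deviations within one image, then the mean of that value over all images. The function names `image_std` and `dataset_mean_std` and the random test slices are our own illustration, not the Hackathon data.

```python
import numpy as np

def image_std(image):
    """Mean of the per-channel standard deviations of one image."""
    return image.std(axis=(0, 1)).mean()

def dataset_mean_std(images):
    """Mean of the per-image value over all images in the dataset."""
    return float(np.mean([image_std(img) for img in images]))

# Five synthetic slices with uniformly random pixel values in [0, 255].
rng = np.random.default_rng(1)
slices = [rng.integers(0, 256, size=(400, 400, 3), dtype=np.uint8) for _ in range(5)]
mu_sigma = dataset_mean_std(slices)
```

Running the same two functions on the dataset before and after stretching would give the two averages that are compared in the text.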
Notice that the standard deviation of the pixel values in an image increases considerably: the average standard deviation of all pixel values within an image slice (400×400 pixels) was $\mu_\sigma = 26$ before contrast stretching and is $\mu_\sigma = 64$ after stretching. The values of the standard deviation also look more normally distributed. Every channel has been stretched with its own 2nd and 98th percentile values (per image slice), but as can be seen the stretching effect is similar for every channel.
When training the model on this pre-processed dataset, we end up with a new mAP curve, see Figure 3.
When comparing Figure 1 to Figure 3, we notice several improvements. First of all, the recall increases from 0.7 to 0.8, which means 80% of the 1,976 tents are detected. The sharp drop in precision at the start of the mAP plot has also been reduced. Together this leads to an improvement of 9% in the mAP score. Furthermore, notice that the rolling mean of the area decreases more gradually. As we do not want the model to be too sensitive to the size of tents, this is an improvement as well. However, smaller detected bounding boxes still come with a faster drop in precision, as can be seen when following the blue graph from recall=0.6 to the right. So some improvement can still be made.