Computer vision has four main tasks:

**1. Object Recognition/Classification**

Classify the object in the image.

**2. Object Detection**

Is there any object in the image that we want to detect? If yes, draw a bounding box around the object.

**3. Object Localization**

Is there any object in the image that we want to detect? If yes, draw a bounding box around the object and report the coordinates of the bounding box.

Here (x1, y1) is the top-left corner of the box and (x2, y2) is the bottom-right corner.
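As a small illustration of the (x1, y1), (x2, y2) convention, the sketch below derives the width, height, area and center of a box. The coordinate values are made up for the example; in image coordinates, y grows downward, so the top-left corner has the smaller y.

```python
# Hypothetical bounding box in pixel coordinates (values are made up).
x1, y1 = 40, 30    # top-left corner
x2, y2 = 120, 90   # bottom-right corner

width = x2 - x1
height = y2 - y1
area = width * height
center = ((x1 + x2) / 2, (y1 + y2) / 2)

print(width, height, area, center)
```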

And finally, the latest one:

**4. Object Segmentation**

By using Mask R-CNN, we can get the exact pixel positions of each object. This kind of development is very important for robotic vision.

Suppose you have a small robot.

We need to instruct the robot to pass between this woman's legs. With Mask R-CNN, the robot knows the exact position of her legs.

This kind of trick cannot be accomplished by object localization, which only gives a bounding box, since we need to know the exact position of each leg.
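The difference between a bounding box and a pixel mask can be sketched with a toy example. The tiny scene below is entirely made up (two one-pixel-wide "legs"); it is not the output of Mask R-CNN, only an illustration of why a mask carries more information than the box around it.

```python
import numpy as np

# Toy 6x7 scene: True marks pixels that belong to the object (two "legs").
mask = np.zeros((6, 7), dtype=bool)
mask[:, 1] = True   # left leg occupies column 1
mask[:, 5] = True   # right leg occupies column 5

# Bounding box derived from the mask: it spans both legs and the gap between them.
rows, cols = np.nonzero(mask)
x1, y1 = int(cols.min()), int(rows.min())
x2, y2 = int(cols.max()), int(rows.max())

# Columns inside the box that contain no object pixel at all.
free_cols = [c for c in range(x1, x2 + 1) if not mask[:, c].any()]
print((x1, y1, x2, y2), free_cols)
```

The box alone says that columns 1 to 5 are occupied, while the mask shows that the columns between the legs are free for the robot to pass through.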

Currently, the most suitable type of neural network for these four tasks is the **convolutional neural network**.

In a previous post, “Cardboard Box Detection using Retinanet (Keras)”, I wrote about training a custom model with keras-retinanet to localize cardboard boxes in an image. **RetinaNet is a convolutional neural network architecture**.

Convolutional neural networks are commonly used in computer vision for object detection, object localization, object recognition, analyzing the depth of image regions, and so on.

This post covers convolutional neural networks in general, including some of the math behind convnets and the convnet architecture, and then continues with the RetinaNet architecture.

**Convolutional Neural Network**

“A convolutional neural network is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNN is an improved version of multilayer perceptron.” Its design is inspired by the human visual cortex.

Basically, a CNN works by extracting a matrix of features from the image and then predicting which class the image belongs to based on these features, using softmax probabilities.
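To make "softmax probabilities" concrete, here is a minimal NumPy sketch. The three class scores are hypothetical; in a real network they would come from the last layer.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])     # hypothetical scores for three classes
probs = softmax(scores)
print(probs, probs.argmax())
```

The class with the highest score also gets the highest probability, and that class is taken as the prediction.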

**Convolutional Neural Network Architecture**

Commonly, a convolutional neural network architecture consists of these layers :

**1. Convolution Layer**

The core idea behind the convolution operation is feature extraction, or in other words filtering. Later, the network tries to match the features extracted from the input image against the features of each class (a class is the name of an object that we want to recognize, e.g. a car).

To see how a convolutional layer operates, take an input image of 6×6 px and a 3×3 px 2D convolution kernel. The kernel moves with a **stride** of 1, starting from the top-left pixel of the image and continuing until it reaches the bottom-right. The kernel is a 3×3 matrix of weights (each element of the matrix is a weight). This convolution operation is used to extract features from the image. The most frequently used kernel type for images is the 2D convolution kernel.

Suppose we have a 6×6 pixel image with RGB color channels.

Suppose we perform a convolution using a 3×3 matrix as the kernel with stride = 1 (the stride defines how many pixels the kernel moves at each step).
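The size of the result can be worked out before doing any arithmetic. The helper below is a small sketch of the usual output-size formula, floor((n + 2p - k) / s) + 1; the function name and its `padding` parameter are my own additions for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width/height of a convolution over an n x n input with a k x k kernel."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(6, 3, stride=1))             # 4
print(conv_output_size(6, 3, stride=1, padding=1))  # 6
```

So a 3×3 kernel with stride 1 over a 6×6 input yields a 4×4 result.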

The following script prints the R, G and B channels of the image as 6×6 matrices taken from a NumPy array:

```python
#!/usr/bin/env python3
from PIL import Image
import numpy as np

# Load the 6x6 image as an array of shape (height, width, channels).
im = Image.open("6px.png")
imgarr = np.array(im)

print("R channel")
print(imgarr[:, :, 0])
print("_" * 30)

print("G channel")
print(imgarr[:, :, 1])
print("_" * 30)

print("B channel")
print(imgarr[:, :, 2])
print("_" * 30)
```

For this example, we perform a stride-1 convolution on the red channel using a 3×3 matrix of weights (a Sobel kernel) with rows [-1, 0, 1], [-2, 0, 2] and [-1, 0, 1].

As the input of the convolution, we use the red channel matrix, whose top-left 3×3 patch has rows [46, 67, 161], [48, 41, 114] and [101, 165, 216].

Here is the arithmetic for the first kernel position:

```python
#!/usr/bin/env python3

# Element-wise products of the top-left 3x3 patch and the kernel, summed up.
res = ((46 * -1) + (67 * 0) + (161 * 1)
       + (48 * -2) + (41 * 0) + (114 * 2)
       + (101 * -1) + (165 * 0) + (216 * 1))

print(res)  # 362
```

The kernel then moves 1 pixel to the right for the next position, and so on; the sliding continues until the last pixel.

The result is called the **convolved feature map matrix**.
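The whole sliding procedure can be sketched in a few lines of NumPy. This is a plain loop version written for clarity, not how a deep learning library implements it, and like those libraries it does not flip the kernel (strictly speaking it is a cross-correlation). Only the top-left 3×3 patch of the red channel is known from the arithmetic above, so the rest of the 6×6 matrix is filled with zeros as a placeholder.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k, c:c + k]
            out[i, j] = np.sum(patch * kernel)
    return out

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Placeholder 6x6 channel: only the top-left 3x3 patch comes from the text above.
red = np.zeros((6, 6), dtype=np.int64)
red[:3, :3] = [[46, 67, 161],
               [48, 41, 114],
               [101, 165, 216]]

feature_map = convolve2d(red, sobel_x)
print(feature_map.shape)   # (4, 4)
print(feature_map[0, 0])   # 362
```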

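Since the convolved feature map is only 4×4 pixels, **padding** can be used to keep the output the same size as the 6×6 input: 1 pixel is added on each side (top, bottom, left and right), i.e. 2 extra pixels along each axis.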
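A minimal sketch of zero padding with NumPy, assuming stride 1 and a 3×3 kernel: padding the 6×6 input by 1 pixel on every side gives an 8×8 array, and the convolution output is 6×6 again. The input values are a stand-in, not the real image.

```python
import numpy as np

image = np.arange(36).reshape(6, 6)    # stand-in 6x6 channel
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

kernel_size = 3
out_size = padded.shape[0] - kernel_size + 1   # output size for stride 1
print(padded.shape, out_size)
```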

**The Linearity**

Algebraically, a convolution operation is a linear combination. To introduce non-linearity, an activation function is needed, so the “ReLU” activation function is applied right after the convolution. If we kept everything linear, there would be no need for deep learning, since the whole network would be just a simple linear function.

Mathematically, ReLU can be defined as f(x) = max(0, x).

After ReLU, every negative value in the previous **convolved feature map matrix** is replaced by 0.
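In NumPy, ReLU is a one-liner. The feature map values below are illustrative only.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

feature_map = np.array([[362, -314],
                        [-20,   15]])   # illustrative values
activated = relu(feature_map)
print(activated)
```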

**Why Is Non-Linearity Needed?**

In math and statistics, non-linear functions are commonly used to model complex problems, whereas a linear equation is simple: given the input of a linear equation, the output can be found by simple algebra. Before an activation function such as ReLU is applied, the convolution operation is just a linear function.

Consider this simple linear equation:

Y = a.x

No matter how many such linear layers we stack, the result is still a single linear function of the input, so the stacked network cannot predict anything that one layer could not. In this situation there is no point in deep learning with many layers; a simple one-layer neural network is enough.
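This collapse of stacked linear layers can be checked numerically. The sketch below uses two small random weight matrices (arbitrary sizes, no biases, chosen only for illustration) and shows that applying them one after the other equals applying their product once.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x      # a single equivalent weight matrix
print(np.allclose(two_layers, one_layer))

# With a ReLU in between, the two layers in general no longer reduce to one matrix.
with_relu = W2 @ np.maximum(0, W1 @ x)
print(with_relu)
```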

By applying the ReLU activation function right after a convolution, we introduce non-linearity, so the system can learn to solve more complex problems.

Real-life image recognition is a complex problem that a computer cannot solve by comparing pixels literally.

For example, suppose we have trained a single-layer neural network on a dataset of cars and a dataset of bat logos:

class 1 is Honda Civic

class 2 is bat logo

Then, if we give it an input image like this (with the same resolution as the images in the dataset):

The computer will be able to give the correct prediction, since it is literally looking at the same image with the same pixel arrangement.

Unfortunately, when we give it this input image:

The computer will not be able to **vote** correctly on whether this is the bat logo class or the Honda Civic class.

To solve this kind of complex problem (the object in the image might be slightly rotated, in a different pose, or in a slightly different form), the neural network needs non-linearity.

With a different pose or a slightly different form, the prediction can no longer be solved by a simple linear regression: Y is no longer a.X, so we need a non-linear equation.

In a convolutional neural network, the weights and biases of the neurons are updated based on the error at the output. This process is known as **back-propagation**. **Activation functions** introduce non-linearity into the system, and because they are differentiable (almost everywhere, in the case of ReLU), their **gradients** can be propagated back together with the **error / loss** to update the **weights** and **biases**.
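A single back-propagation step can be written out by hand for one neuron with a ReLU activation and a squared-error loss. Everything here (the training example, the initial weight and bias, the learning rate) is made up for illustration; a real network repeats this for many weights using automatic differentiation.

```python
x, y = 2.0, 3.0          # one made-up training example
w, b = 0.5, 0.1          # initial weight and bias
lr = 0.05                # learning rate

def forward(w, b):
    z = w * x + b
    return z, max(0.0, z)            # pre-activation and ReLU output

z, y_hat = forward(w, b)
loss_before = (y_hat - y) ** 2

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dy = 2 * (y_hat - y)
dy_dz = 1.0 if z > 0 else 0.0        # derivative of ReLU
w -= lr * dL_dy * dy_dz * x
b -= lr * dL_dy * dy_dz

_, y_hat = forward(w, b)
loss_after = (y_hat - y) ** 2
print(loss_before, loss_after)
```

After the update the loss is smaller than before, which is exactly what the gradient step is meant to achieve.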

**2. Pooling Layer**

Right after the ReLU comes a pooling layer. The pooling layer reduces the spatial size of its input, which in turn reduces the number of parameters and the computational cost of the following layers.

The most commonly used pooling method is max pooling.
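Here is a minimal sketch of 2×2 max pooling with stride 2 in NumPy. It assumes the height and width of the input are divisible by the pool size; the input values are made up.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; assumes height and width are divisible by size."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 6, 5, 1],
              [7, 2, 9, 8],
              [0, 1, 3, 4]])
pooled = max_pool2d(x)
print(pooled)
```

Each 2×2 block is replaced by its largest value, so the 4×4 input shrinks to 2×2.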

**3. Fully Connected Layer**

The fully connected layer holds composite and aggregated information from the previous layers. Before being fed to the fully connected layers, the multi-dimensional outputs of the previous layers are flattened into a one-dimensional vector.

Finally, the prediction (voting) is made using an activation function, e.g. softmax.
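The flatten, fully connected and softmax steps can be sketched with NumPy as follows. The shapes (16 feature maps of 5×5, 10 classes) and the random, untrained weights are assumptions chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(5, 5, 16))   # e.g. 16 pooled maps of 5x5 (made-up values)

flat = feature_maps.reshape(-1)              # 5 * 5 * 16 = 400 values
W = rng.normal(size=(10, flat.size)) * 0.01  # untrained weights for 10 classes
b = np.zeros(10)

logits = W @ flat + b                        # one score per class
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()                    # softmax over the 10 scores
print(flat.shape, probs.argmax())
```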

**Some Examples of CNN Architectures**

**LeNet**

The input of LeNet-5 is a **32×32 px** image. Here is the summary of the LeNet-5 architecture:

Instead of the tanh activation function, we can also use ReLU.

Here is an example of a LeNet implementation in Keras:

```python
import keras
from keras import layers

model = keras.Sequential()

# LeNet-5 uses 5x5 convolution kernels: 32x32x1 -> 28x28x6 -> 14x14x6
model.add(layers.Conv2D(filters=6, kernel_size=(5, 5), activation='tanh', input_shape=(32, 32, 1)))
model.add(layers.AveragePooling2D())

# 14x14x6 -> 10x10x16 -> 5x5x16
model.add(layers.Conv2D(filters=16, kernel_size=(5, 5), activation='tanh'))
model.add(layers.AveragePooling2D())

# Flatten to 400 values, then the three fully connected layers
model.add(layers.Flatten())
model.add(layers.Dense(units=120, activation='tanh'))
model.add(layers.Dense(units=84, activation='tanh'))
model.add(layers.Dense(units=10, activation='softmax'))

model.summary()
```

**VGG16**

The input of VGG16 is a **224×224 px** image. Here is the summary of the VGG16 architecture:

Example implementation of VGG16 in Keras:

```python
#!/usr/bin/env python3
import keras
from keras import layers

model = keras.Sequential()

# Block 1: 224x224x3 -> 112x112x64 (padding='same' on every convolution keeps the spatial size)
model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# Block 2: -> 56x56x128
model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# Block 3: -> 28x28x256
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# Block 4: -> 14x14x512
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# Block 5: -> 7x7x512
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# Classifier: flatten 7*7*512 = 25088 values, two 4096-unit layers, 1000-class softmax
model.add(layers.Flatten())
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(1000, activation='softmax'))

model.summary()
```

**Resnet**

The main purpose of the ResNet architecture is to make it possible to train a convolutional neural network with many layers effectively.

The problem with a deep convolutional neural network is that as we increase the network depth, we run into the vanishing gradient problem. As the network gets deeper, its accuracy saturates or even starts degrading.

ResNet splits a deep network into small chunks (residual blocks) of two or three layers and passes the input of each chunk straight through to its output via a shortcut (skip) connection. The layers inside the chunk then only have to learn the residual, i.e. the desired output minus the input, which is added back to the input at the end of the chunk.
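The shortcut idea can be sketched with NumPy. To keep the example short, the residual branch below uses two small dense layers instead of convolutions, which is a simplification of the real architecture; the function names and weights are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is two small linear layers with a ReLU in between."""
    fx = W2 @ relu(W1 @ x)     # the residual branch F(x)
    return relu(fx + x)        # the shortcut adds the input back

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) = 0, so the block passes its (non-negative) input through.
out = residual_block(x, zeros, zeros)
print(out)
```

Because an untrained or "do nothing" block simply passes its input along, adding more blocks does not have to make the network worse, which is what lets very deep stacks train.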

A Keras implementation of ResNet:

https://github.com/raghakot/keras-resnet/blob/master/resnet.py

**RetinaNet**

The problem with a single-shot detection model such as YOLO is: “there is extreme foreground-background class imbalance problem in one-stage detector.”

RetinaNet introduces the “focal loss” to deal with the extreme foreground-background class imbalance problem in one-stage detectors.

RetinaNet is a single-shot detection model, just like YOLO. In RetinaNet, a commonly used backbone is ResNet-50, on top of which an FPN (Feature Pyramid Network) is added for feature extraction; the network then uses the focal loss to handle the extreme foreground-background class imbalance.
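The focal loss from the referenced paper is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. Below is a minimal NumPy sketch of the binary form; the function name is my own, the defaults alpha = 0.25 and gamma = 2 follow the values the paper reports as working best, and the two example predictions are made up.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted foreground probability, y: 1 (foreground) or 0 (background)."""
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([0, 0])
p = np.array([0.01, 0.6])        # an easy background example and a hard one
losses = focal_loss(p, y)
print(losses)

# With gamma = 0 the modulating factor disappears and only a scaled cross-entropy remains.
scaled_ce = focal_loss(p, y, alpha=0.5, gamma=0.0)
print(scaled_ce)
```

The factor (1 - p_t)^gamma shrinks the loss of easy, well-classified examples (here the background pixel predicted at 0.01), so the many easy background examples no longer dominate training.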

An example implementation of RetinaNet in Keras can be cloned from https://github.com/fizyr/keras-retinanet

Example of custom object detection using Retinanet :

Reference :

https://arxiv.org/abs/1708.02002