YOLO: Introduction | Custom Training

YOLO (You Only Look Once) is an algorithm for identifying the objects in a given image. Let's understand it better by tracing the history of Computer Vision, from CNNs through the object detection algorithms that led to YOLO and its successors.

Computer Vision

Computer Vision is concerned with building a usable understanding of visual environments and their contexts. In other words, computer vision systems make useful decisions about physical objects and scenes based on sensed images.

Some applications of Computer Vision are:

o Face Recognition
o Surveillance
o Biometrics
o Smart Cars

Convolutional Neural Network (CNN)

CNNs are an integral part of applications that support image recognition. They take an input image and output a class (cat, dog, etc.) or a probability distribution over the classes that best describe the image.

For more information on CNNs, please refer to my article on Convolutional Neural Networks.

Object Detection

Object detection focuses on the problem of identifying the different objects in a given image. It accomplishes this task by drawing a bounding box around each identified object.

CNNs alone are not suitable for object detection because the locations of objects in an image are not fixed. Solving the problem naively means applying a CNN to a huge number of sub-images (regions of interest) of the given image, which blows up the computation cost.

R-CNN

Region-based CNN (R-CNN) was one of the first CNN-based deep learning approaches introduced to solve the problem of object detection.

The R-CNN model selects about 2000 regions of interest from a given image using the Selective Search algorithm. A pre-trained CNN then performs a forward pass to extract features from each proposed region. These features are later used to predict the category and bounding box of each proposed region.

R-CNN Model

The main downside of R-CNN is speed: training takes a huge amount of time because the network must classify 2000 region proposals per image.

Fast R-CNN

Fast R-CNN improves on R-CNN by performing the CNN forward computation on the image as a whole.

A selective search algorithm still identifies about 2000 regions of interest of different sizes; these are projected onto the convolutional feature map computed in the previous step. An RoI pooling layer then reshapes the features of each projected region to a common shape, which is fed into fully connected dense layers. From the RoI feature vector, a softmax layer predicts the class of the proposed region, and a regressor predicts the offset values for its bounding box.
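The RoI pooling step can be sketched in a few lines of NumPy. This is an illustrative single-channel version (the real layer pools every channel and works on batches): a region of arbitrary size is divided into a fixed grid of bins, and each bin is max-pooled.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool an arbitrary-sized region of a feature map to a fixed shape.

    feature_map: 2-D array (one channel of the conv feature map)
    roi: (x1, y1, x2, y2) region on the feature map, end-exclusive
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # split the region into out_h x out_w bins and take the max of each bin
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]].max()
    return pooled

fmap = np.arange(36).reshape(6, 6)   # toy 6x6 feature map
print(roi_pool(fmap, (0, 0, 6, 6)))  # any RoI size -> fixed 2x2 output
```

Whatever the RoI's size, the output shape is fixed, which is what lets the pooled features feed a dense layer.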

Fast R-CNN Model

Generating 2000 region proposals per image still limits the speed of the Fast R-CNN algorithm.

Faster R-CNN

Faster R-CNN replaces the selective search used above with a Region Proposal Network (RPN) that learns to identify region proposals during the training of the model. This reduces the number of proposed regions generated while keeping object detection precise.

Faster R-CNN Model


An anchor is a box representing one combination of sliding-window center, scale, and aspect ratio. For example, 3 scales × 3 ratios give k = 9 anchors at each sliding position (the default configuration).

Anchor Boxes / Each Stride

The anchors have width-height ratios of 1:1, 1:2 and 2:1 respectively. The three colors above represent the three scales (sizes): 128×128, 256×256 and 512×512.
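The anchor set above can be generated directly from the scales and ratios. A minimal sketch, assuming each anchor keeps an area of roughly scale² while the width:height ratio varies:

```python
def make_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) pairs for k = len(scales) * len(ratios) anchors."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the anchor area near s*s while the width:height ratio is r
            w = round(s * r ** 0.5)
            h = round(s / r ** 0.5)
            anchors.append((w, h))
    return anchors

print(len(make_anchors()))  # 9 anchors per sliding position
```

The RPN predicts objectness and box offsets for each of these 9 anchors at every sliding position.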

Region Proposal Network (RPN)

The RPN ranks the anchors and proposes those most likely to contain objects.

RPN Network

o Slide a small n × n spatial window over the conv feature map of the entire image.
o At the center of each sliding window, predict multiple region proposals of various scales and ratios simultaneously using anchors.
o Classify each anchor as background or foreground, i.e., the probability that it contains an object.
o Regress the bounding-box coordinates of each anchor.

The above process is repeated until the model loss is within an acceptable range.

YOLO

The previous object detection algorithms do not look at the complete image; they use regions / sub-images to localize objects within the image.

In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for all of these regions simultaneously. As a result, YOLO runs in real time at around 45 frames per second, far faster than the region-based detectors above.


YOLO Model

o We take an image and split it into an S × S grid. Each grid cell predicts M bounding boxes.
o For each bounding box, YOLO outputs class probabilities and offset values; an object is assigned to the grid cell in which its center falls.
o Bounding boxes with a class probability above a threshold are selected and used to locate the objects within the image.
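The size of YOLO's single prediction tensor follows directly from the grid description above. Assuming the original paper's settings (S = 7 grid, M = 2 boxes per cell, C = 20 classes):

```python
def yolo_output_shape(S=7, M=2, C=20):
    # each of the M boxes carries x, y, w, h and a confidence score (5 values);
    # each grid cell additionally predicts C class probabilities
    return (S, S, M * 5 + C)

print(yolo_output_shape())  # (7, 7, 30)
```

One forward pass produces this whole tensor, which is why YOLO needs no per-region computation.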

A limitation of the YOLO algorithm is that it struggles to identify small objects within an image. For example, it can have difficulty detecting a group of small, similar objects (e.g., a flock of birds) because of the spatial constraints of the algorithm: each grid cell can only predict a limited number of boxes.

Custom Data Detection

1. Initial Setup

Follow the steps below to set up the initial environment on your local machine:

$ pip install Cython
$ git clone https://github.com/thtrieu/darkflow.git
$ cd darkflow
$ python3 setup.py build_ext --inplace
$ pip install .

2. Data Crawling

Download 100–200 images of each object you want to detect into ./train/images

Rename the downloaded files in number format (1.jpg, 2.jpg, …) to facilitate image identification during training
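The renaming step can be sketched in shell. The snippet below demonstrates it in a temporary directory with made-up file names; in practice, run the loop inside ./train/images:

```shell
# demo directory standing in for ./train/images
dir=$(mktemp -d)
touch "$dir/platform.jpg" "$dir/station.jpg" "$dir/tube_photo.jpg"

# rename every image to a sequential number: 1.jpg, 2.jpg, ...
i=1
for f in "$dir"/*.jpg; do
    mv "$f" "$dir/$i.jpg"
    i=$((i + 1))
done
ls "$dir"
```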

3. Annotate Data

Annotate each downloaded image using LabelImg and save the annotation in .xml format in ./train/annotations


o pip install labelImg
o cd darkflow/train/images
o labelImg
o File -> Open Dir -> choose "darkflow/train/images"
o File -> Change Save Dir -> choose "darkflow/train/annotations"
o Click on "Create RectBox" to draw boxes around your desired objects in the selected image
o Save each object with the desired label name (e.g., London Underground)
o Repeat the process for all images present in the File List
o Open any of the saved .xml annotation files and verify that its file path points to /darkflow/train/images
o If it does not, redo the LabelImg annotation process above so that the file path is fixed.

The process is a bit tedious, but if the paths are not correct, training the YOLO model won't be successful.

YouTube link: LabelImg Tutorial

4. Config & Weights

Download YOLO configuration (.cfg) and weights from https://pjreddie.com/darknet/yolov2/

Once downloaded, duplicate the config file and rename it.

Make the following changes in the renamed config file. Here, the .cfg file has been renamed based on the number of classes (yolo_1c). Reducing the height and width of the input image results in faster training.
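The edited sections of the config can be sketched as below. The values assume the single-class example in this tutorial: for YOLOv2, the [region] section's classes is set to your class count, and the [convolutional] section directly above it gets filters = num × (classes + 5), i.e., 5 × (1 + 5) = 30 with the default 5 anchors.

```
[net]
# smaller input resolution trains faster (keep it a multiple of 32)
width=416
height=416

...

[convolutional]
# last conv layer: filters = num * (classes + 5) = 5 * (1 + 5)
filters=30

[region]
classes=1
num=5
```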

5. Labels

Update the labels.txt file with the names of the classes being trained:

o $ pwd (verify you are in the darkflow root)
o vi labels.txt
o Add one label per line
o Save
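The steps above amount to writing one class name per line. A one-line sketch (demonstrated in a temporary directory, with a hypothetical single class name; run it in the darkflow root in practice):

```shell
cd "$(mktemp -d)"   # demo directory; in practice use the darkflow root
printf 'london_underground\n' > labels.txt
cat labels.txt
```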

6. Training

Train the YOLO model from the command line using the steps below:

$ pwd
$ python flow --model cfg/yolo-1c.cfg --load bin/yolo.weights --train --annotation train/annotations --dataset train/images --epoch 500

Training: Command Line

Train the YOLO model from a Jupyter Notebook using the steps mentioned below:

Training: Jupyter Notebook

Once training is complete, you will see the updated weights stored in the ckpt folder in the form of .profile checkpoint files (e.g., yolo_1c-3500.profile)

7. Solutions to Common Training Issues

o Path Not Found

Define the full path of your training weights (.weights) and configuration (.cfg) files.

o Offset Error

Go to ./darkflow/utils/loader.py and change the offset value from 16 to 20, i.e., self.offset = 16 → self.offset = 20

o Weight Mismatch

If you still face issues, download the YOLO weights from the Google Drive link below.

8. Testing

Load the trained YOLO model using the steps below in a Jupyter notebook.

'load' refers to the last checkpoint saved during training; here it corresponds to yolo_1c-3500.profile

'threshold' refers to the confidence level (0–1) above which a detected object is reported as a true detection

Load newly trained YOLO model

Test YOLO model on a sample image

Display results using OpenCV
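The three testing steps above can be sketched with darkflow's return_predict API plus OpenCV drawing. The paths, checkpoint number, and default threshold below are illustrative:

```python
def detect(image_path, checkpoint=3500, threshold=0.4):
    """Run the fine-tuned model on one image and draw the detections."""
    # imports inside the function so this sketch loads without darkflow / OpenCV
    import cv2
    from darkflow.net.build import TFNet

    options = {"model": "cfg/yolo-1c.cfg", "load": checkpoint,
               "threshold": threshold}
    tfnet = TFNet(options)

    img = cv2.imread(image_path)
    # return_predict gives [{label, confidence, topleft, bottomright}, ...]
    results = tfnet.return_predict(img)
    for r in results:
        tl = (r["topleft"]["x"], r["topleft"]["y"])
        br = (r["bottomright"]["x"], r["bottomright"]["y"])
        cv2.rectangle(img, tl, br, (0, 255, 0), 2)
        cv2.putText(img, r["label"], tl, cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (0, 255, 0), 2)
    return img, results
```

Passing the checkpoint number as 'load' makes darkflow restore ckpt/yolo_1c-3500 instead of the original .weights file.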


I work in AI at Amazon to help customers search for products on their wishlist by uploading related images.