CNN — The Basics

Jack W.
3 min read · Dec 31, 2021

I am a newbie in the world of AI. I have studied courses on machine learning and am captivated by the power of deep learning, particularly CNNs. Before diving into CNN stuff, let's start with the basic Artificial Neural Network (ANN). Here is a general model representation of a neural network.

What are neural networks by IBM Cloud Education

So far, so good. Just set an optimizer and a loss function based on your task. Then feed inputs into the hidden layers and let the network do the job of finding relationships to predict the target. Not satisfied with the result? Add more layers and nodes, train, and repeat.

It works most of the time… until it doesn't.

A general ANN model does not perform well on image tasks because:

  • Spatial relationships are not captured in an ANN, i.e. the relationships between nearby pixels, such as lines and shapes, are not used.
  • An ANN fails to predict correctly when the object in the image is shifted or rotated.

This is where Convolutional Neural Network (CNN) comes to the rescue.

A CNN can solve these problems because it can:

  • capture spatial relationships
  • recognize patterns even when they are shifted or rotated (translation invariance)

I'd like to share an excerpt from the book AI 2041 (page 59), written by Chen Qiufan and Kai-Fu Lee. It explains how the CNN process is similar to how human eyes work.

Our visual cortex uses many neurons corresponding to sub-regions called receptive fields. These receptive fields identify basic features such as shapes, lines, colors, or angles. These detectors are connected to the neocortex… which stores information hierarchically and processes these receptive fields’ output into more complex scene understanding.

A CNN’s lowest level is a large number of filters, which are applied repeatedly across an image. Each of these filters can see only small contiguous sections of the image, just like the receptive fields… A CNN’s higher layers are hierarchically organized, like the neocortex.

What a CNN sees. Source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks” ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.

Start from the input, which is the image itself. The image is translated into a 2-dimensional matrix in which each cell represents the color and brightness of a pixel. The input is then fed into the first hidden layer, where each node responds to a small region of the input. This is unlike an ANN, where all inputs are fully connected to all nodes in the hidden layers.
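As a small sketch of this idea (not from the article), here is a grayscale image as a NumPy matrix, with one node's "small region" pulled out as a patch:

```python
import numpy as np

# A tiny 5x5 grayscale "image": each cell is a pixel's brightness
# (0 = black, 255 = white). This one contains a vertical white line.
image = np.array([
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
])

# In a CNN, each node in the first hidden layer looks only at a small
# local region (here 3x3) of the input, instead of every pixel at once
# as a fully connected ANN node would.
patch = image[0:3, 0:3]  # the top-left 3x3 receptive field
print(patch.shape)  # (3, 3)
```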

The connectivity pattern between neurons resembles the organization of the animal visual cortex.

Nearby pixels are often locally correlated with one another, and this correlation is captured by applying filters.

So what are these filters anyway? And how can we configure them? You may ask.

The filters are basically arbitrary matrices (in both size and values). A filter acts as a feature extractor; think of a camera filter that manipulates the photo output. In a CNN, the part of the image covered by the filter is multiplied element-wise with the filter and summed to produce a feature value. We don't need to make sense of those values ourselves; it's the model's job to figure them out during training.
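To make this concrete, here is a minimal hand-rolled convolution in NumPy (a sketch, not the article's code). The filter values here are hand-picked to detect vertical edges; in a real CNN they would be learned:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel across the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-picked vertical-edge filter (learned automatically in a real CNN).
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

# 5x5 image: dark on the left, bright on the right — a vertical edge.
image = np.array([
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
# Large magnitudes mark where the edge is; flat regions give 0.
```

The output (the feature map) responds strongly where the dark-to-bright transition sits and is zero over flat regions, which is exactly the "extracted feature" the text describes.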

Generally, the first layers of the network extract low level features, like edges, and deeper levels combine these features into more abstract, and often meaningful shapes…

In the deeper layers, the feature maps are flattened into a 1-dimensional layer that is fully connected, like a traditional ANN.
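This flatten-then-fully-connect step can be sketched in NumPy like so (the feature values and weights below are placeholders; in a trained network the weights are learned):

```python
import numpy as np

# Suppose the last convolutional stage produced a 3x3 feature map.
feature_map = np.array([
    [0.2, 0.8, 0.1],
    [0.5, 0.9, 0.3],
    [0.0, 0.4, 0.7],
])

# Flatten: unroll the 2D map into a 1D vector so it can feed a
# fully connected layer, just like in a traditional ANN.
flat = feature_map.flatten()  # shape (9,)

# A fully connected layer is a weight matrix times that vector plus a bias.
# Random placeholder weights stand in for learned ones here.
rng = np.random.default_rng(0)
weights = rng.standard_normal((2, 9))  # e.g. 2 output classes, 9 inputs
bias = np.zeros(2)
logits = weights @ flat + bias
print(logits.shape)  # (2,)
```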
