UM-CV 7 & 8: CNNs and Their Design Principles
Problem: All of the previous models flatten the input image into a vector, which destroys its spatial structure. CNNs are designed to operate on image data directly.
@Credits: EECS 498.008 | Video Lectures: UM-CV 7 & 8
Personal work for the assignments of the course: github repo.
Notice on Usage and Attribution
These are personal class notes based on the University of Michigan EECS 498.008 / 598.008 course. They are intended solely for personal learning and academic discussion, with no commercial use.
For detailed information, please refer to the complete notice at the end of this document.
Components of CNN
- Convolutional Layers
- Pooling Layers
- Normalization Layers
Convolution Layer
Image: 3x32x32
Filter: 3x5x5. The filter is convolved with the image by sliding it across all spatial positions; at each position we take the dot product between the filter and the 3x5x5 chunk of the image it covers, producing one number. The result is a 1x28x28 activation map.
With multiple filters (say 6), each filter produces its own activation map, so the output of the convolutional layer is a stack of six 28x28 maps, i.e. a 6x28x28 tensor.
Fig: General form of convolutional layer
Fig: Stacking Convolutions
(Typo in the slide: the bias vector b1 should have length 6.)
What do convolutional filters learn?
MLP: each row of the weight matrix is a template matched against the entire (flattened) input, so a fully connected layer learns a bank of whole-image templates.
Convolutional network: the filters are local image templates; first-layer filters typically learn oriented edges and opposing-color blobs.
Fig: Convolutional Filters
Padding
A closer look at spatial dimensions
Fig: Convolutional Filters
Input: 7x7, filter: 3x3 -> output: 5x5. In general: input W, filter K -> output W - K + 1.
Problem: the feature map shrinks with every layer. Solution: pad the input with P zeros on each side, giving an output of size W - K + 1 + 2P. Setting P = (K - 1)/2 ("same" padding) preserves the input size.
Fig: Padding
Receptive Field
Fig: Receptive Field
The term receptive field is used in two ways: the region of the previous layer that an output element depends on (a K x K window), and the region of the original input image that can affect it. Each successive convolution adds K - 1 to the receptive field in the input, so after L layers it is 1 + L(K - 1). For large images, many layers are needed before each output can "see" the whole input; downsampling (strided convolution or pooling) grows the receptive field faster.
Strided Convolutions
Stride: the number of pixels the filter moves at each step. Strided convolutions downsample the output: with input W, filter K, padding P, and stride S, the output size is (W - K + 2P)/S + 1.
Fig: Strided Convolutions
Example:
Input volume: 3x32x32; 10 filters of size 3x5x5, stride 1, padding 2.
Output volume size? Number of parameters? Number of multiply-add operations?
Answer:
Output: spatial size (32 + 2x2 - 5)/1 + 1 = 32, so the output volume is 10x32x32.
Parameters: each filter has 3x5x5 = 75 weights plus 1 bias, so 10 x 76 = 760.
Multiply-adds: each output element requires 3x5x5 = 75 operations, and there are 10x32x32 = 10,240 outputs, so 75 x 10,240 = 768,000.
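A quick way to check this arithmetic is to build the layer in PyTorch (a minimal sketch; the tensor names are illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5,
                 stride=1, padding=2)

x = torch.randn(1, 3, 32, 32)           # one 3x32x32 input image
print(conv(x).shape)                    # torch.Size([1, 10, 32, 32])

num_params = sum(p.numel() for p in conv.parameters())
print(num_params)                       # 10*3*5*5 + 10 = 760
```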
Example: 1x1 convolution
Fig: 1x1 Convolution
Stacking 1x1 convolution layers (with nonlinearities in between) gives an MLP that operates independently at each spatial position, while preserving the spatial structure of the input.
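A minimal sketch of this equivalence in PyTorch (shapes are arbitrary examples): a 1x1 convolution computes the same result as a linear layer applied at every spatial position.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
linear = nn.Linear(64, 32)
linear.weight.data = conv1x1.weight.data.view(32, 64)  # share the weights
linear.bias.data = conv1x1.bias.data

out_conv = conv1x1(x)                                   # (1, 32, 56, 56)
# Move channels last, apply the linear layer per position, move them back.
out_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(out_conv, out_lin, atol=1e-6))     # True
```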
Convolution Summary
- Input: C_in x H x W
- Hyperparameters: number of filters C_out, kernel size K, stride S, padding P
- Weight matrix: C_out x C_in x K x K, giving C_out filters, each of size C_in x K x K
- Bias vector: C_out
- Output: C_out x H_out x W_out
- H_out = (H + 2P - K)/S + 1
- W_out = (W + 2P - K)/S + 1
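These formulas drop straight into a small helper (a sketch; the function name is mine):

```python
# Output-size formula for a convolution: (W + 2P - K) // S + 1.
def conv_out_size(W, K, S, P):
    return (W + 2 * P - K) // S + 1

print(conv_out_size(32, 5, 1, 2))   # 32  (the example above)
print(conv_out_size(7, 3, 1, 0))    # 5
print(conv_out_size(7, 3, 2, 0))    # 3   (with stride 2)
```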
Other types of convolution
So far: 2D convolution. There are other types of convolutions like 1D, 3D, etc.
Fig: 1D Convolution
Fig: 3D Convolution
PyTorch provides 1D, 2D, and 3D convolutional layers (nn.Conv1d, nn.Conv2d, nn.Conv3d).
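A short sketch of the three modules and the input layouts they expect (channel counts and sizes below are arbitrary examples):

```python
import torch
import torch.nn as nn

conv1d = nn.Conv1d(3, 8, kernel_size=5)   # input: (N, C, L)
conv2d = nn.Conv2d(3, 8, kernel_size=5)   # input: (N, C, H, W)
conv3d = nn.Conv3d(3, 8, kernel_size=5)   # input: (N, C, D, H, W)

print(conv1d(torch.randn(1, 3, 32)).shape)          # (1, 8, 28)
print(conv2d(torch.randn(1, 3, 32, 32)).shape)      # (1, 8, 28, 28)
print(conv3d(torch.randn(1, 3, 32, 32, 32)).shape)  # (1, 8, 28, 28, 28)
```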
Pooling Layer
The pooling layer downsamples the input, reducing its spatial dimensions. This reduces the computation performed by later layers and helps control overfitting.
Fig: Pooling Layer
Types of pooling:
- Max pooling: Takes the maximum value in the pooling window.
- Average pooling: Takes the average value in the pooling window.
Pooling introduces invariance to small spatial shifts, and it has no learnable parameters.
Fig: Pooling Summary
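A minimal PyTorch sketch: 2x2 max pooling with stride 2 halves each spatial dimension, with zero parameters (sizes are arbitrary examples).

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 56, 56)
print(pool(x).shape)                                  # (1, 64, 28, 28)
print(sum(p.numel() for p in pool.parameters()))      # 0 learnable params
```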
Convolutional Networks
Classic architecture: repeated blocks of convolution, ReLU, and pooling, then flatten and finish with fully connected layers.
Example: LeNet-5
Layer | Output Size | Weight Size |
---|---|---|
Input | 1 x 28 x 28 | |
Conv (C_out=20, K=5, P=2, S=1) | 20 x 28 x 28 | 20 x 1 x 5 x 5 |
ReLU | 20 x 28 x 28 | |
MaxPool (K=2, S=2) | 20 x 14 x 14 | |
Conv (C_out=50, K=5, P=2, S=1) | 50 x 14 x 14 | 50 x 20 x 5 x 5 |
ReLU | 50 x 14 x 14 | |
MaxPool (K=2, S=2) | 50 x 7 x 7 | |
Flatten | 2450 | |
Linear (2450->500) | 500 | 2450 x 500 |
ReLU | 500 | |
Linear (500->10) | 10 | 500 x 10 |
We tend to decrease the spatial dimensions and increase the number of channels as we go deeper into the network.
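For concreteness, here is a minimal PyTorch sketch of the LeNet-5 variant tabulated above (the class name is mine; training code omitted):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),       # 20 x 14 x 14
            nn.Conv2d(20, 50, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),       # 50 x 7 x 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # 2450
            nn.Linear(50 * 7 * 7, 500), nn.ReLU(),
            nn.Linear(500, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 28, 28)).shape)         # (1, 10)
```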
Training Deep Networks: Batch Normalization
Problem: Deep networks are hard to train. The gradients tend to vanish or explode as we go deeper into the network.
Solution: Batch Normalization
Batch normalization normalizes the activations of each layer, which reduces "internal covariate shift" and makes deeper networks much easier to train.
(Ioffe and Szegedy, ICML 2015, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift)
Fig: Batch Normalization
Fig: Calculating Batch Normalization
Computing the mean and variance over the batch entangles the elements of the batch: each output depends on the other examples, and the quality of the estimates depends on the batch size.
During training, we normalize using the per-batch mean and variance.
During testing, we instead use running averages of the mean and variance accumulated during training.
At test time, batch norm therefore becomes a fixed linear operator and can be fused into the previous layer!
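A small PyTorch sketch of this train/test asymmetry (feature count, batch size, and input distribution are arbitrary examples):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
for _ in range(100):                 # "training": each call uses batch
    bn(torch.randn(32, 4) * 2.0 + 1.0)   # stats and updates running stats

print(bn.running_mean)               # roughly 1.0 per feature
print(bn.running_var)                # roughly 4.0 per feature

bn.eval()                            # test mode: a fixed affine transform
x = torch.randn(8, 4)
y = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
print(torch.allclose(bn(x), y * bn.weight + bn.bias, atol=1e-5))  # True
```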
Batch Normalization for Convolutional Networks
Fig: Batch Normalization for Convolutional Networks
For convolutional inputs of shape N x C x H x W, the mean and variance are computed per channel, over the batch and both spatial dimensions (nn.BatchNorm2d in PyTorch). See the ICML 2015 paper for more details.
Advantages and disadvantages of Batch Normalization:
Advantages:
- Makes deep networks much easier to train.
- Allows higher learning rates and faster convergence.
- Makes networks more robust to initialization.
- Acts as regularization during training.
- Zero overhead at test time: it can be fused into the previous layer.
- Works well with feed-forward networks.
Disadvantages:
- Not well understood theoretically (yet).
- Forcing zero mean and unit variance may not be ideal for all models.
- Behaves differently during training and testing: this is a very common source of bugs!
Instance Normalization
Instance normalization computes the mean and variance per example and per channel, over the spatial dimensions only, so it behaves identically during training and testing.
Fig: Instance Normalization
Fig: Different Normalization Techniques
CNN Architectures
ImageNet Classification Challenge
Fig: ImageNet Classification Challenge
AlexNet: 2012 winner. 8 layers, 60 million parameters.
227×227 inputs; 5 convolutional layers; max pooling; 3 fully connected layers; ReLU nonlinearities. Used "local response normalization" (no longer used). Trained on two GTX 580 GPUs with only 3GB of memory each, so the model was split across the two GPUs.
Fun fact: the AlexNet paper has become one of the most highly cited papers in all of science.
Fig: AlexNet
Layer | Input Size (C) | Input Size (H / W) | Filters | Kernel | Stride | Pad | Output Size (C) | Output Size (H / W) | Memory (KB) | Params (k) | FLOP (M)
---|---|---|---|---|---|---|---|---|---|---|---
conv1 | 3 | 227 | 64 | 11 | 4 | 2 | 64 | 56 | 784 | 23 | 73
pool1 | 64 | 56 | | 3 | 2 | 0 | 64 | 27 | 182 | 0 | 0
conv2 | 64 | 27 | 192 | 5 | 1 | 2 | 192 | 27 | 547 | 307 | 224
pool2 | 192 | 27 | | 3 | 2 | 0 | 192 | 13 | 127 | 0 | 0
conv3 | 192 | 13 | 384 | 3 | 1 | 1 | 384 | 13 | 254 | 664 | 112
conv4 | 384 | 13 | 256 | 3 | 1 | 1 | 256 | 13 | 169 | 885 | 145
conv5 | 256 | 13 | 256 | 3 | 1 | 1 | 256 | 13 | 169 | 590 | 100
pool5 | 256 | 13 | | 3 | 2 | 0 | 256 | 6 | 36 | 0 | 0
flatten | 256 | 6 | | | | | 9216 | | 36 | 0 | 0
fc6 | 9216 | | | | | | 4096 | | 16 | 37,749 | 38
fc7 | 4096 | | | | | | 4096 | | 16 | 16,777 | 17
fc8 | 4096 | | | | | | 1000 | | 4 | 4,096 | 4
conv1: Number of floating point operations (multiply+add) = (number of output elements) * (ops per output element) = (64 * 56 * 56) * (11 * 11 * 3) = 72,855,552 = 73M flops
How is it designed? Trial and error: the architecture is a compromise between memory usage, parameter count, and computational cost.
Fig: AlexNet
ZFNet: A Bigger AlexNet
Similar to AlexNet, but bigger: the first conv layer uses 7x7 filters with stride 2 instead of 11x11 with stride 4 (less aggressive downsampling preserves more information), and the later conv layers use more filters. More trial and error.
VGG: The principles of designing a good network
- All conv are 3x3 stride 1 pad 1
- All max pool are 2x2 stride 2
- Two stacked 3x3 conv layers have the same 5x5 receptive field as a single 5x5 conv, but with fewer parameters and fewer FLOPs, and the ReLU between them adds extra nonlinearity (see the sketch below).
- After pool, double the number of channels
- We want each convolutional layer to have the same computational cost:
- Stage i: input C x 2H x 2W, Conv(3x3, C->C): 4HWC memory, 9C^2 params, 36HWC^2 FLOPs
- Stage i+1: input 2C x H x W, Conv(3x3, 2C->2C): 2HWC memory, 36C^2 params, 36HWC^2 FLOPs
- Halving the spatial size and doubling the channels keeps the FLOPs constant (half the memory, 4x the parameters).
5 convolutional stages followed by FC-FC-FC. In VGG-16 the stages contain 2, 2, 3, 3, 3 conv layers; VGG-19 uses 2, 2, 4, 4, 4.
VGG is much larger than AlexNet: 138 million parameters, and roughly 19.4x the computation of AlexNet!
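A back-of-the-envelope sketch of both design rules in plain Python (bias terms ignored; channel and spatial sizes are arbitrary examples):

```python
# Parameters and multiply-add FLOPs of one conv layer.
def conv_cost(C_in, C_out, K, H, W):
    params = C_out * C_in * K * K
    flops = (C_out * H * W) * (C_in * K * K)   # outputs * ops per output
    return params, flops

C, H, W = 64, 56, 56

# One 5x5 conv vs. two stacked 3x3 convs (same 5x5 receptive field):
print(conv_cost(C, C, 5, H, W))            # (102400, 321126400)
p, f = conv_cost(C, C, 3, H, W)
print((2 * p, 2 * f))                      # (73728, 231211008): cheaper

# Halve the spatial size, double the channels: FLOPs stay constant.
print(conv_cost(C, C, 3, 2 * H, 2 * W)[1])      # 462422016
print(conv_cost(2 * C, 2 * C, 3, H, W)[1])      # 462422016 (same)
```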
Fig: VGG
VGG was developed in academia, by one graduate student and one faculty member.
GoogLeNet: Focus on Efficiency
An efficient convolutional network: 22 layers, but only about 4 million parameters (versus AlexNet's 60 million). (CVPR 2015)
An aggressive stem downsamples the input right at the start of the network (224x224 down to 28x28) using a handful of convolution (including 1x1 and 3x3) and pooling layers, so most of the computation happens at low resolution.
Fig: GoogLeNet
The Inception module is repeated throughout the network. It runs multiple kernel sizes in parallel (1x1, 3x3, and 5x5 convolutions, plus max pooling) and concatenates their outputs along the channel dimension; 1x1 "bottleneck" convolutions inside the module keep the channel counts, and hence the cost, under control. A sketch follows the figure.
Fig: Inception Module
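A minimal PyTorch sketch of such a module (the branch channel counts mirror GoogLeNet's first inception module, but treat them as illustrative):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, 96, kernel_size=1),             # 1x1 bottleneck
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(c_in, 16, kernel_size=1),             # 1x1 bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, 32, kernel_size=1))

    def forward(self, x):
        # Run all branches in parallel, concatenate along channels.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(out.shape)        # (1, 256, 28, 28): 64 + 128 + 32 + 32 channels
```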
Global average pooling at the end: rather than flattening the tensor, the network averages each channel over its spatial dimensions, producing a vector that feeds a single final linear layer. This removes most of the parameters that large fully connected layers would otherwise require.
Fig: Global Average Pooling
Auxiliary classifiers attached at intermediate layers compute additional copies of the classification loss, injecting gradient directly into the middle of the network to ease training. (Networks this deep were hard to train before batch normalization; VGG similarly had to be trained in stages, starting from a shallower network.)
Fig: Auxiliary Classifiers
ResNet (CVPR 2016)
Deeper models do worse than shallow models!
Initial guess: the deeper model is overfitting. In fact, the deeper network's training error is also higher than the shallower network's, so it is underfitting.
Hypothesis: this is an optimization problem. Deeper networks are harder to optimize; in particular, they fail to learn the identity functions in their extra layers that would let them emulate shallower networks.
Solution: change the network so that learning identity functions is easy.
Residual Block: instead of learning H(x) directly, the block outputs H(x) = F(x) + x, so it only needs to learn the residual F(x) = H(x) - x. Setting the weights to zero makes the block compute the identity, so deep networks can easily emulate shallow ones.
Fig: Residual Block
When backpropagating, the additive shortcut passes the gradient straight through to the input (the gradient is copied along the skip connection), which makes optimization much easier.
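A minimal sketch of a basic residual block in PyTorch (batch normalization omitted for brevity; real ResNet blocks include it, and blocks that change resolution need a projection shortcut):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        f = self.conv2(F.relu(self.conv1(x)))
        return F.relu(f + x)    # additive shortcut: output is F(x) + x

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)
```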
Like VGG, the network is divided into stages; the first block of each stage halves the spatial resolution and doubles the number of filters.
Fig: ResNet
Basic Block and Bottleneck Block
The basic block stacks two 3x3 convolutions. The bottleneck block uses a 1x1 convolution to reduce the channel count, a 3x3 convolution at the reduced width, and a 1x1 convolution to restore the channels, giving more layers (and nonlinearity) for similar cost.
Fig: Basic Block and Bottleneck Block
In 2015, ResNet ranked 1st in all five competitions!
MSRA @ ILSVRC & COCO 2015 competitions
- ImageNet Classification
- ImageNet Detection
- ImageNet Localization
- COCO Detection
- COCO Segmentation
Comparing Complexity
Fig: Comparing Complexity
ImageNet 2016 winner: Model Ensembles
ResNeXt
ResNeXt: Aggregated Residual Transformations for Deep Neural Networks (CVPR 2017). It adds "cardinality" as a new design dimension: the bottleneck block is replaced by G parallel, narrower bottleneck branches whose outputs are summed, which is equivalent to a grouped convolution.
Fig: ResNeXt
The annual ImageNet competition was discontinued after 2017; it now lives on as a Kaggle challenge.
DenseNet
Densely Connected Convolutional Networks (CVPR 2017). Within each dense block, every layer is directly connected (by channel-wise concatenation) to all subsequent layers, which strengthens gradient flow and encourages feature reuse.
Fig: DenseNet
MobileNets: Tiny Networks
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (arXiv 2017). MobileNets trade a small amount of accuracy for large savings in parameters and computation by replacing standard convolutions with depthwise separable convolutions; a sketch follows the figure.
Fig: MobileNets
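A sketch of the depthwise separable factorization in PyTorch (channel counts are arbitrary examples); comparing parameter counts shows the savings:

```python
import torch
import torch.nn as nn

c_in, c_out = 64, 128
standard = nn.Conv2d(c_in, c_out, 3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise 3x3
    nn.Conv2d(c_in, c_out, 1),                         # pointwise 1x1
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # 128*64*9 + 128 = 73856
print(count(separable))   # 64*9 + 64 + 128*64 + 128 = 8960
```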
Also related:
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (CVPR 2018)
MobileNetV2: Inverted Residuals and Linear Bottlenecks (CVPR 2018)
ShuffleNetV2: Practical Guidelines for Efficient CNN Architecture Design (ECCV 2018)
Neural Architecture Search (NAS)
A hot topic in deep learning research: automate architecture design by training a "controller" network that proposes child architectures, which are then trained and evaluated. Extremely expensive: the original paper (Zoph and Le, 2017) trained on roughly 800 GPUs.
Fig: Neural Architecture Search
Summary
- Early work (AlexNet -> ZFNet -> VGG) showed that bigger networks work better.
- GoogLeNet was one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global average pooling instead of FC layers).
- ResNet showed us how to train extremely deep networks, limited only by GPU memory; returns began to diminish as networks got bigger.
- After ResNet, efficient networks became central: how can we improve accuracy without increasing complexity? (ResNeXt, DenseNet)
- Lots of tiny networks are aimed at mobile devices: MobileNet, ShuffleNet, etc.
- Neural Architecture Search promises to automate architecture design.
What architecture should I use?
- For most problems, use an off-the-shelf architecture (e.g. ResNet, DenseNet) rather than designing your own.
- If you just care about accuracy, ResNet-50 or ResNet-101 are good choices.
- If you care about efficiency, MobileNet or ShuffleNet are good choices.
Notice on Usage and Attribution
This note is based on the University of Michigan's publicly available course EECS 498.008 / 598.008 and is intended solely for personal learning and academic discussion, with no commercial use.
- Nature of the Notes: These notes include extensive references and citations from course materials to ensure clarity and completeness. However, they are presented as personal interpretations and summaries, not as substitutes for the original course content.
- Original Course Resources: Please refer to the official University of Michigan website for complete and accurate course materials.
- Third-Party Open Access Content: This note may reference Open Access (OA) papers or resources cited within the course materials. These materials are used under their original Open Access licenses (e.g., CC BY, CC BY-SA).
- Proper Attribution: Every referenced OA resource is appropriately cited, including the author, publication title, source link, and license type.
- Copyright Notice: All rights to third-party content remain with their respective authors or publishers.
- Content Removal: If you believe any content infringes on your copyright, please contact me, and I will promptly remove the content in question.
Thanks to the University of Michigan and the contributors to the course for their openness and dedication to accessible education.