UM-CV 15 Object Detection

About 1618 wordsAbout 5 min

2024-12-29

Summary

Slow R-CNN, Fast R-CNN, Faster R-CNN, Single-Stage Object Detection, and their comparison.

@Credits: EECS 498.007 | Video Lecture: UM-CV 5 Neural Networks

Personal work for the assignments of the course: github repo.

Notice on Usage and Attribution

These are personal class notes based on the University of Michigan EECS 498.008 / 598.008 course. They are intended solely for personal learning and academic discussion, with no commercial use.

For detailed information, please refer to the complete notice at the end of this document

Intro

Task Definition

Input: Single RGB image Output： A set of detected objects; for each object predict:

Computer Vision Tasks:Classification, Semantic Segmentation, Object Detection, Instance Segmentation

Category label (from fixed, known set of categories)
Bounding box (four numbers, x, y, width, height)

Challenges

Multiple outputs: Need to output variable numbers of objects per image
Multiple types of output" Need to predict "what" (category label) as well as "where" (bounding box)
Large images: Classification works at 224 x 224; need higher resolution for detection, often 800 x 800 or more
Detecting Multiple Objects: Need different numbers of outputs per image

Region-based CNN (R-CNN)

CVPR 2014

Window Detection

Sliding Window?

Apply a cnn to many different crops of the image, CNN classifies each crop as object or background. And enumerate all possible crops -> too slow.
Region Proposals: Find a small set of boxes that are likely to cover all objects. Often based on heuristics: e.g. look for "blob-like" image regions. This is relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU
Regions of interest (RoI) from a proposal methods. For each RoI, apply a CNN to predict the class of the object in the RoI and refine the bounding box.
Bounding box regression: predict "transform" to correct the RoI: x, y, w, h

Fig: Bounding-box regression

Fig: R-CNN

During training, backpropagate on all the regions of interest (RoIs).

Comparing Boxes: Intersection over Union (IoU)

IoU = Area of overlap / Area of union. Also called "Jaccard similarity" or "Jaccard index".
IoU > 0.5 is "decent". IoU > 0.7 is "good". IoU > 0.9 is "almost perfect".

Overlapping Boxes

Object detectors often output many overlapping detections. Solution: Post-process raw detections using Non-Max Suppression (NMS).

Select next highest-scoring box
Eliminate lower-scoring boxes with IoU > threshold

Fig: Non-Max Suppression

Problem: NMS may eliminate "good" boxes when objects are highly overlapping...

Evaluating Object Detectors: Mean Average Precision (mAP)

Run object detector on all test images (with NMS)
For each category, compute average precision (AP) = area under precision vs recall curve

Fig: Mean Average Precision

mean Average Precision (mAP) = mean of AP over all categories
for "COCO mAP", average over 10 IoU thresholds (0.5 to 0.95) and take average

Fig: COCO mAP

Fast-RCNN

ICCV 2015

Fig: Fast-RCNN

Crop Features: RoI Pool

Input Image: e.g. 3 x 640 x 480 -> CNN (e.g. 512 x 20 x15)

Project and snap RoI to CNN feature map - > Divide into 2x2 grid of (roughly) equal subregions -> Max pool within each subregion (e.g. 512 x 7 x 7)

Region features always the same size even if input regions have different sizes!

Fig: RoI Pool

Problem: Slight misalignment between RoI and CNN grid can cause misalignment in RoI Pooling. Solution: RoI Align

RoI Align

Instead of snapping RoI to CNN grid, interpolate between grid points
More accurate, but more expensive

Fig: RoI Align

The cropping may not perfectly match the original object grid. RoI Align implements a bilinear near-neighbor interpolation to get more accurate cropping. We can consider the image as a real-valued tensor and backpropagate to any points in the image.

Fig: RoI Align

Fast R-CNN vs "Slow" R-CNN (ICCV 2015)

Fig: Fast R-CNN

Faster R-CNN: Learnable Region Proposals

Train a CNN to predict region proposals. NeurIPS 2015

Fig: Faster R-CNN

Region Proposal Network (RPN):

Fig: Region Proposal Network

Anchor box of fixed size at each point in the feature map.
At each point, predict whether the corresponding anchor contains an object or not (per-cell logistic regression, predict scores with conv layer.)
For positive boxes, also predict a box transform to regress from anchor to object box.
Problem: Anchor box may be too small or too large for the object. Solution: K Multi-scale anchors for each point.

Fig: Multi-scale anchors

Jointly with 4 losses:

RPN classification loss: anchor box is object or not
RPN regression loss: predict transform from anchor box to proposal box
Object classification: classify proposals as background or object class
Object regression: predict transform from a proposal box to object box

Fig: Faster R-CNN

Two stages:

First stage: Run once per image

Backbone Network
Region proposal network

Second stage: Run once per region

Crop features: RoI pool/align
Predict object class
Prediction bbox offset

Single-Stage Object Detection

YOLO: You Only Look Once. ECCV 2016
Focal Loss for Dense Object Detection. ICCV 2017

Fig: YOLO

Detection without Anchors: CornerNet (ECCV 2018)

Use a backbone CNN to predict the heatmap of object upper-left corners and lower-right corners.

To match the upper-left and lower-right corners, use a "associative embedding" to predict the offset between the two corners.

Fig: CornerNet

Comparison

Speed/accuracy trade-offs for modern convolutional object detectors CVPR 2017

Takeaways:

Two stage method (e.g. Faster R-CNN) is more accurate but slower
Single stage methods (e.g. YOLO, SSD) are faster but less accurate
Bigger backbones improve performance, but are slower
Nowadays, single stage methods are as good as two-stage methods
Very big models work better
Test-time augmentation pushes numbers up
Big ensembles, more data, etc

Fig: Comparison

Object Detection: Open-Source Code

Don't implement it yourself (Unless you are working on the assignment)

Detectron2 (PyTorch)

Fast/Faster/Mask R-CNN, RetinaNet

Notice on Usage and Attribution

This note is based on the University of Michigan's publicly available course EECS 498.008 / 598.008 and is intended solely for personal learning and academic discussion, with no commercial use.

Nature of the Notes: These notes include extensive references and citations from course materials to ensure clarity and completeness. However, they are presented as personal interpretations and summaries, not as substitutes for the original course content.
Original Course Resources: Please refer to the official University of Michigan website for complete and accurate course materials.
Third-Party Open Access Content: This note may reference Open Access (OA) papers or resources cited within the course materials. These materials are used under their original Open Access licenses (e.g., CC BY, CC BY-SA).
Proper Attribution: Every referenced OA resource is appropriately cited, including the author, publication title, source link, and license type.
Copyright Notice: All rights to third-party content remain with their respective authors or publishers.
Content Removal: If you believe any content infringes on your copyright, please contact me, and I will promptly remove the content in question.

Thanks to the University of Michigan and the contributors to the course for their openness and dedication to accessible education.