Welcome to Chapter 9 of the Computer Vision in Practice tutorial series. In the previous chapters we covered CNN fundamentals and classic architectures; in this chapter we move into a core application area of computer vision: object detection.

Object detection asks the computer to answer the key question "what objects are in the image, and where are they?" Since R-CNN appeared in 2014, object detection has developed at a revolutionary pace. In this chapter we study the principles of two-stage detectors (the R-CNN family) and walk through the complete pipeline from region proposals to feature extraction to classification.


1. Environment

  • Python 3.12+
  • PyTorch 2.2+
  • torchvision 0.17+
  • OpenCV 4.10+
  • NumPy 1.26+

2. Object Detection Overview

2.1 Object Detection vs. Image Classification

Image classification: decide which class the main object in an image belongs to.
Object detection: not only recognize the class, but also localize each object with a bounding box.

Outputs of object detection

  • Bounding box: (x_min, y_min, x_max, y_max) or (x_center, y_center, width, height)
  • Class label: the object's category
  • Confidence score: how confident the model is in the prediction
import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_detection_outputs():
    """Visualize what each vision task outputs"""

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Image classification
    axes[0].text(0.5, 0.5, 'Image classification\nOutput: class label\ne.g. "dog" (0.95)',
                ha='center', va='center', fontsize=14,
                bbox=dict(boxstyle='round', facecolor='lightblue'))
    axes[0].set_title('Image Classification')
    axes[0].axis('off')

    # Object detection
    axes[1].text(0.5, 0.5, 'Object detection\nOutput: box + class + confidence\ne.g. dog [0.95] @ (x,y,w,h)',
                ha='center', va='center', fontsize=14,
                bbox=dict(boxstyle='round', facecolor='lightgreen'))
    axes[1].set_title('Object Detection')
    axes[1].axis('off')

    # Semantic segmentation
    axes[2].text(0.5, 0.5, 'Semantic segmentation\nOutput: a class per pixel\ne.g. dog=1, background=0',
                ha='center', va='center', fontsize=14,
                bbox=dict(boxstyle='round', facecolor='lightyellow'))
    axes[2].set_title('Semantic Segmentation')
    axes[2].axis('off')

    plt.tight_layout()
    plt.show()

    print("\nThe two families of object detectors:")
    print("1. Two-stage: R-CNN, Fast R-CNN, Faster R-CNN")
    print("2. One-stage: YOLO, SSD, RetinaNet")
    print("\nTwo-stage detectors: higher accuracy, slower")
    print("One-stage detectors: faster, slightly lower accuracy (though the gap is closing)")

visualize_detection_outputs()

2.2 Evaluation Metrics for Object Detection

import numpy as np

def iou(box1, box2):
    """Compute the IoU (Intersection over Union) of two boxes"""

    # Box format: [x_min, y_min, x_max, y_max]
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2

    # Intersection rectangle
    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    # No overlap at all
    if inter_x_max < inter_x_min or inter_y_max < inter_y_min:
        return 0.0

    inter_area = (inter_x_max - inter_x_min) * (inter_y_max - inter_y_min)

    # Areas of each box
    box1_area = (x1_max - x1_min) * (y1_max - y1_min)
    box2_area = (x2_max - x2_min) * (y2_max - y2_min)

    # Union area
    union_area = box1_area + box2_area - inter_area

    # IoU
    iou = inter_area / union_area if union_area > 0 else 0.0

    return iou

def calculate_ap(recalls, precisions):
    """
    Compute Average Precision (AP)
    using VOC 2010-style all-point interpolation
    """
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([0.0], precisions, [0.0]))

    # Make the precision envelope monotonically non-increasing
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the areas where recall changes
    indices = np.where(recalls[1:] != recalls[:-1])[0]
    ap = np.sum((recalls[indices + 1] - recalls[indices]) * precisions[indices + 1])

    return ap

# IoU example
print("IoU example:")
box_a = [50, 50, 150, 150]  # [x_min, y_min, x_max, y_max]
box_b = [100, 100, 200, 200]
box_c = [300, 300, 400, 400]  # no overlap

print(f"Box A: {box_a}")
print(f"Box B: {box_b}")
print(f"Box C: {box_c}")
print(f"\nIoU(A, B) = {iou(box_a, box_b):.4f}")  # partial overlap
print(f"IoU(A, C) = {iou(box_a, box_c):.4f}")  # should be 0

print("\n" + "=" * 50)
print("mAP (mean Average Precision):")
print("- Compute AP per class, then average over classes")
print("- AP is the area under the Precision-Recall curve")
print("- The IoU threshold is typically 0.5 (mAP@0.5)")
print("- COCO averages over multiple IoU thresholds (0.5:0.95)")
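To make the AP recipe concrete, here is a self-contained toy calculation. The TP/FP flags and ground-truth count below are invented purely for illustration; the interpolation logic mirrors `calculate_ap` above.

```python
import numpy as np

# Toy example: 5 detections sorted by confidence; 1 = matched a GT box (TP)
tp = np.array([1, 1, 0, 1, 0])
num_gt = 4  # hypothetical number of ground-truth boxes

tp_cum = np.cumsum(tp)                    # cumulative true positives
fp_cum = np.cumsum(1 - tp)                # cumulative false positives
recalls = tp_cum / num_gt                 # recall at each rank
precisions = tp_cum / (tp_cum + fp_cum)   # precision at each rank

# All-point interpolation (VOC 2010 style)
r = np.concatenate(([0.0], recalls, [1.0]))
p = np.concatenate(([0.0], precisions, [0.0]))
for i in range(len(p) - 2, -1, -1):
    p[i] = max(p[i], p[i + 1])            # monotone precision envelope
idx = np.where(r[1:] != r[:-1])[0]        # positions where recall changes
ap = np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
print(f"AP = {ap:.4f}")  # 0.6875 for this toy example
```

Averaging such per-class AP values over all classes yields mAP.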

3. R-CNN: The Beginning of Region Proposals

3.1 How R-CNN Works

R-CNN (Regions with CNN features), proposed by Ross Girshick et al. in 2014, was the pioneering deep-learning approach to object detection.

The R-CNN pipeline

  1. Region proposal: use Selective Search to generate about 2,000 candidate regions
  2. Feature extraction: run a CNN on each region to extract features
  3. Classification: classify each region's features with per-class SVMs
  4. Bounding-box regression: refine the box coordinates
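Step 4, bounding-box regression, learns offsets in a normalized parameterization rather than raw pixel coordinates. A minimal NumPy sketch of the standard regression targets (the example boxes are made up):

```python
import numpy as np

def bbox_regression_targets(proposal, gt):
    """R-CNN style regression targets (t_x, t_y, t_w, t_h).
    proposal, gt: [x_min, y_min, x_max, y_max]"""
    px = (proposal[0] + proposal[2]) / 2   # proposal center x
    py = (proposal[1] + proposal[3]) / 2   # proposal center y
    pw = proposal[2] - proposal[0]         # proposal width
    ph = proposal[3] - proposal[1]         # proposal height

    gx = (gt[0] + gt[2]) / 2
    gy = (gt[1] + gt[3]) / 2
    gw = gt[2] - gt[0]
    gh = gt[3] - gt[1]

    # Center offsets are normalized by proposal size; scales are log-space
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = np.log(gw / pw)
    th = np.log(gh / ph)
    return np.array([tx, ty, tw, th])

targets = bbox_regression_targets([100, 100, 200, 200], [110, 105, 210, 205])
print(targets)  # [0.1, 0.05, 0.0, 0.0]: small shifts, no scale change
```

Normalizing by the proposal size makes the targets invariant to the proposal's absolute scale and location.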
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

class RCNN(nn.Module):
    """A simplified R-CNN for illustration"""

    def __init__(self, num_classes=20):
        super().__init__()
        self.num_classes = num_classes

        # CNN feature extractor (VGG-style)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/2

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/4

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/8

            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/16

            nn.Conv2d(512, 4096, kernel_size=7, padding=0),  # acts as an FC layer: a 7x7 map (112x112 input crop) -> 1x1x4096
            nn.ReLU(inplace=True),
        )

        # Classifier
        self.cls_fc = nn.Linear(4096, num_classes + 1)  # +1 for background

        # Bounding-box regressor
        self.bbox_fc = nn.Linear(4096, 4 * (num_classes + 1))

    def forward(self, x, regions):
        """
        x: input crops (batch_size, 3, H, W); in the real R-CNN, each region
           proposal is cropped from the image, warped to a fixed size, and
           passed through the CNN one at a time
        regions: list of proposals, each [x_min, y_min, x_max, y_max]
        """
        # Extract features from the (already cropped and warped) inputs
        features = self.features(x)
        features = features.view(features.size(0), -1)

        # Classification scores and per-class box refinements
        cls_scores = self.cls_fc(features)
        bbox_preds = self.bbox_fc(features)

        return cls_scores, bbox_preds

print("Problems with R-CNN:")
print("=" * 50)
print("1. Every region proposal passes through the CNN separately")
print("   - 2000 proposals x one forward pass each = very slow")
print("2. Training is multi-stage (CNN -> SVM -> regressor)")
print("3. Inference is extremely slow (roughly 47 s per image)")
print("\nThese problems motivated Fast R-CNN and Faster R-CNN")

3.2 Selective Search Region Proposals

import numpy as np

def selective_search(image, num_proposals=2000):
    """
    Conceptual stand-in for Selective Search.
    In practice, use OpenCV's implementation or a third-party library.
    """
    # The real algorithm is quite involved; this only sketches the idea
    print("Selective Search, in outline:")
    print("1. Start from a graph-based over-segmentation")
    print("2. Compute similarities between neighboring regions")
    print("3. Greedily merge the most similar pair of regions")
    print("4. Repeat until the whole image is one region")

    # Random boxes stand in for proposals (demo only)
    num_proposals = min(num_proposals, 100)
    h, w = image.shape[:2]
    proposals = []

    for _ in range(num_proposals):
        x_min = np.random.randint(0, w // 2)
        y_min = np.random.randint(0, h // 2)
        x_max = np.random.randint(x_min + 10, w)
        y_max = np.random.randint(y_min + 10, h)
        proposals.append([x_min, y_min, x_max, y_max])

    return np.array(proposals)

# The real thing, via OpenCV
def selective_search_opencv(image_path):
    """Selective Search using OpenCV's ximgproc module"""
    import cv2

    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)

    # Slower but more accurate mode
    # ss.switchToSelectiveSearchQuality()

    # Faster but less accurate mode
    ss.switchToSelectiveSearchFast()

    rects = ss.process()
    print(f"Generated {len(rects)} region proposals")

    return rects[:2000]  # keep the top 2000

print("\nRegion proposal methods compared:")
print("=" * 50)
print("Selective Search: groups regions by color, texture, size, and shape")
print("EdgeBoxes: scores boxes by the edge information they enclose")
print("RPN (Faster R-CNN): proposals learned by a neural network")
print("=" * 50)

4. Fast R-CNN: The Breakthrough of End-to-End Training

4.1 RoI Pooling

Fast R-CNN's key innovation is RoI Pooling (Region of Interest Pooling): the whole image passes through the CNN only once, and each RoI's features are then cropped out of the shared feature map.

How RoI Pooling works

  1. Run the whole image through the CNN to get a feature map
  2. For each RoI, find its corresponding region on the feature map
  3. Pool each variable-sized RoI region down to a fixed size (e.g. 7x7)
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoIPool(nn.Module):
    """A straightforward RoI Pooling implementation"""

    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size  # e.g. (7, 7)

    def forward(self, features, rois):
        """
        features: CNN feature map (batch, C, H, W)
        rois: list of RoIs, each [batch_idx, x_min, y_min, x_max, y_max],
              already scaled to feature-map coordinates
        """
        batch_size, c, h, w = features.shape
        output_h, output_w = self.output_size

        # Pool each RoI independently
        pooled_features = []

        for roi in rois:
            batch_idx, x_min, y_min, x_max, y_max = roi
            batch_idx = int(batch_idx)

            # RoI size on the feature map
            roi_h = y_max - y_min
            roi_w = x_max - x_min

            # Size of each pooling bin
            bin_h = roi_h / output_h
            bin_w = roi_w / output_w

            # Fill each output bin
            pooled = torch.zeros(c, output_h, output_w, device=features.device)

            for i in range(output_h):
                for j in range(output_w):
                    # Coordinate range of the current bin
                    y_start = y_min + i * bin_h
                    x_start = x_min + j * bin_w
                    y_end = y_min + (i + 1) * bin_h
                    x_end = x_min + (j + 1) * bin_w

                    # Clamp and round to integers (this rounding is the
                    # quantization error that RoI Align later removes)
                    y_start = max(0, int(y_start))
                    x_start = max(0, int(x_start))
                    y_end = min(h, int(y_end))
                    x_end = min(w, int(x_end))

                    # Skip empty bins
                    if y_end > y_start and x_end > x_start:
                        # Max pooling over the bin
                        pooled[:, i, j] = features[batch_idx, :, y_start:y_end, x_start:x_end].max(dim=2)[0].max(dim=1)[0]

            pooled_features.append(pooled)

        return torch.stack(pooled_features)


# PyTorch's built-in RoIAlign (more accurate)
class RoIAlign(nn.Module):
    """RoI Align uses bilinear interpolation and is more accurate than RoI Pooling"""

    def __init__(self, output_size, spatial_scale, sampling_ratio):
        super().__init__()
        self.output_size = output_size
        self.spatial_scale = spatial_scale
        self.sampling_ratio = sampling_ratio

    def forward(self, features, rois):
        """
        features: (batch, C, H, W)
        rois: (num_rois, 5) [batch_idx, x1, y1, x2, y2]
        """
        # Delegate to torchvision.ops.roi_align
        from torchvision.ops import roi_align

        return roi_align(features, rois,
                        output_size=self.output_size,
                        spatial_scale=self.spatial_scale,
                        sampling_ratio=self.sampling_ratio)

print("RoI Pooling vs RoI Align:")
print("=" * 50)
print("RoI Pooling:")
print("  - Divides the RoI into a fixed grid of bins")
print("  - Takes the max in each bin (with integer rounding)")
print("  - The rounding causes feature misalignment")
print("\nRoI Align:")
print("  - Same fixed grid of bins")
print("  - Samples bin values with bilinear interpolation")
print("  - No quantization error, so higher accuracy")
print("\nIn practice, prefer RoI Align!")
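To see what "bilinear interpolation" means here, a single-channel toy sketch of the sampling step RoI Align relies on (not the full multi-point, multi-channel operation):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x) location via bilinear
    interpolation: the core operation RoI Align uses instead of rounding."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate horizontally on the top and bottom rows, then vertically
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bottom = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bottom * dy

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
print(bilinear_sample(feat, 0.5, 0.5))  # 1.5, the average of the four neighbors
```

RoI Align evaluates several such fractional sample points per bin and pools them, so no coordinate is ever snapped to an integer.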

4.2 The Fast R-CNN Architecture

import torch
import torch.nn as nn

class FastRCNN(nn.Module):
    """The Fast R-CNN architecture"""

    def __init__(self, num_classes=21):
        super().__init__()

        # CNN feature extractor (the early layers of VGG16)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/2

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/4

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/8

            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/16

            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/32
        )

        # RoI pooling (simplified: AdaptiveMaxPool2d pools the whole feature
        # map to 7x7; a real implementation pools each RoI region separately)
        self.roi_pool = nn.AdaptiveMaxPool2d((7, 7))

        # Fully connected layers
        self.fc = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
        )

        # Classification head
        self.cls_head = nn.Linear(4096, num_classes)

        # Bounding-box regression head
        self.bbox_head = nn.Linear(4096, num_classes * 4)

    def forward(self, x, rois):
        """
        x: input images (batch_size, 3, H, W)
        rois: RoIs (num_rois, 5) [batch_idx, x1, y1, x2, y2]
              (unused in this simplified version; a real model crops
              per-RoI features here)
        """
        # 1. Feature extraction (one CNN pass for the whole image)
        features = self.features(x)

        # 2. RoI pooling (simplified: pools the whole feature map)
        pooled = self.roi_pool(features)

        # 3. Flatten and pass through the FC layers
        pooled = pooled.view(pooled.size(0), -1)
        fc_features = self.fc(pooled)

        # 4. Classification and box regression
        cls_scores = self.cls_head(fc_features)
        bbox_preds = self.bbox_head(fc_features)

        return cls_scores, bbox_preds

print("What Fast R-CNN improved:")
print("=" * 50)
print("1. One CNN pass per image, drastically cutting computation")
print("2. End-to-end training instead of multiple stages")
print("3. RoI Pooling handles RoIs of different sizes")
print("4. Classification and regression share the feature extractor")
print("\nBut proposals still come from Selective Search, which becomes the bottleneck")
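Fast R-CNN trains the classification and regression heads jointly with a multi-task loss: cross-entropy for the class plus a smooth L1 (Huber-style) term on the box offsets. A NumPy sketch with invented toy values (the deltas and class probability below are made up for illustration):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear for large errors,
    so box regression is robust to outlier proposals."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

# Multi-task loss = classification loss + lambda * box regression loss
deltas = np.array([0.5, -2.0])          # predicted minus target box offsets (toy)
cls_loss = -np.log(0.9)                 # cross-entropy, true-class prob 0.9 (toy)
box_loss = smooth_l1(deltas).sum()      # 0.125 + 1.5 = 1.625
total = cls_loss + 1.0 * box_loss       # lambda balances the two terms
print(total)
```

Note how the 0.5 delta falls in the quadratic region (0.125) while the -2.0 delta is penalized only linearly (1.5), instead of the 2.0 an L2 loss would square to 4.0.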

5. Faster R-CNN: The Region Proposal Network

5.1 How the RPN Works

Faster R-CNN's main contribution is the Region Proposal Network (RPN), which replaces Selective Search with a neural network that learns to generate region proposals.

How the RPN works

  • Slide a small network over the feature map
  • At each position, predict k anchor boxes
  • For each anchor, predict two things: an object/not-object probability, plus box refinements
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN(nn.Module):
    """Region Proposal Network"""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()

        self.num_anchors = num_anchors

        # Intermediate 3x3 conv
        self.rpn_conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)

        # Classification layer: foreground probability per anchor
        self.rpn_cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # 2 for bg/fg

        # Regression layer: box refinements per anchor
        self.rpn_bbox = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # 4 for dx, dy, dw, dh

    def forward(self, features):
        """
        features: CNN feature map (batch, C, H, W)
        """
        # Intermediate features
        x = F.relu(self.rpn_conv(features))

        # Classification logits: (batch, num_anchors*2, H, W)
        cls_logits = self.rpn_cls(x)

        # Regression deltas: (batch, num_anchors*4, H, W)
        bbox_logits = self.rpn_bbox(x)

        return cls_logits, bbox_logits


import numpy as np

def generate_anchors(base_size=16, ratios=[0.5, 1.0, 2.0], scales=[8, 16, 32]):
    """Generate the base anchor boxes for one feature-map position"""

    # All aspect-ratio variants share the same area before scaling
    w, h = base_size, base_size
    area = w * h

    anchors = []
    for ratio in ratios:
        # Keep the area fixed while setting the aspect ratio (w/h = ratio)
        new_h = np.round(np.sqrt(area / ratio))
        new_w = np.round(new_h * ratio)

        for scale in scales:
            # Scale up from the base size (e.g. 16 * 8 = 128 px)
            anchor_w = new_w * scale
            anchor_h = new_h * scale

            # Center the anchor on the cell
            cx, cy = base_size / 2, base_size / 2

            # Corner coordinates
            x_min = cx - anchor_w / 2
            y_min = cy - anchor_h / 2
            x_max = cx + anchor_w / 2
            y_max = cy + anchor_h / 2

            anchors.append([x_min, y_min, x_max, y_max])

    return np.array(anchors)  # 9 anchors

# Example
anchors = generate_anchors()
print("Generated anchors (9):")
print(anchors)
print(f"\nNumber of anchors: {len(anchors)}")
print("\nAnchor design principles:")
print("- Multiple scales: cover objects of different sizes")
print("- Multiple aspect ratios: cover objects of different shapes")
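These base anchors are then replicated at every feature-map position: each cell gets the same set, shifted by the cell's offset in image coordinates. A standalone NumPy sketch of that tiling (the base anchors, grid size, and stride below are toy values):

```python
import numpy as np

def shift_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Tile base anchors over every feature-map cell.
    base_anchors: (A, 4) boxes centered on one cell;
    stride: how many image pixels one feature-map cell covers."""
    shift_x = np.arange(feat_w) * stride
    shift_y = np.arange(feat_h) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    # One (dx, dy, dx, dy) shift per cell, in [x_min, y_min, x_max, y_max] order
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # Broadcast (K, 1, 4) shifts against (1, A, 4) anchors -> (K, A, 4)
    all_anchors = base_anchors[None, :, :] + shifts[:, None, :]
    return all_anchors.reshape(-1, 4)

base = np.array([[-8, -8, 8, 8], [-16, -4, 16, 4]])  # 2 toy base anchors
anchors = shift_anchors(base, feat_h=3, feat_w=4)
print(anchors.shape)  # (24, 4): 3*4 cells x 2 anchors each
```

With the real 9 base anchors and, say, a 38x50 feature map, this yields 38 * 50 * 9 = 17,100 candidate anchors per image; scoring and pruning them is the RPN's job.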

5.2 The Full Faster R-CNN Pipeline

import torch
import torch.nn as nn
import torch.nn.functional as F

class FasterRCNN(nn.Module):
    """The full Faster R-CNN network"""

    def __init__(self, num_classes=21):
        super().__init__()

        # 1. Shared CNN backbone (VGG16-style)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 1/32
        )

        # 2. RPN (Region Proposal Network)
        self.rpn = RPN(in_channels=512, num_anchors=9)

        # 3. RoI head (a Fast R-CNN style head; FastRCNNHead is assumed to
        #    be defined along the lines of the FastRCNN class above)
        self.roi_head = FastRCNNHead(num_classes)

    def forward(self, x, image_info=None):
        """
        x: input images (batch_size, 3, H, W)
        """
        # 1. Extract the shared feature map
        features = self.backbone(x)

        # 2. RPN scores anchors and predicts refinements
        rpn_cls, rpn_bbox = self.rpn(features)

        # 3. Turn RPN outputs into region proposals
        proposals = self.generate_proposals(rpn_cls, rpn_bbox, image_info)

        # 4. RoI pooling plus classification/regression
        if len(proposals) > 0:
            cls_scores, bbox_preds = self.roi_head(features, proposals)
        else:
            cls_scores = torch.empty(0, self.roi_head.num_classes)
            bbox_preds = torch.empty(0, self.roi_head.num_classes * 4)

        return cls_scores, bbox_preds, proposals

    def generate_proposals(self, cls_logits, bbox_logits, image_info):
        """Turn RPN outputs into proposals"""
        # Heavily simplified: a real implementation decodes the deltas onto
        # the anchors, filters by score, and applies NMS

        # Assume batch_size = 1
        cls_logits = cls_logits[0]  # (9*2, H, W)
        bbox_logits = bbox_logits[0]  # (9*4, H, W)

        # Foreground probabilities: softmax over each anchor's bg/fg pair
        num_anchors = cls_logits.shape[0] // 2
        h, w = cls_logits.shape[1], cls_logits.shape[2]
        probs = F.softmax(cls_logits.view(num_anchors, 2, h, w), dim=1)

        # Simplified: would return the top-N decoded, NMS-filtered proposals
        return []  # placeholder

print("Core contributions of Faster R-CNN:")
print("=" * 50)
print("1. End-to-end training: RPN and Fast R-CNN trained jointly")
print("2. Region Proposal Network: a CNN replaces Selective Search")
print("3. Shared features: RPN and the detection head share the backbone")
print("4. Anchors: multi-scale, multi-aspect-ratio box priors")
print("\nBoth accuracy and speed improved substantially!")
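The "decode the deltas" step in generate_proposals inverts the (dx, dy, dw, dh) parameterization: offsets move the anchor center, log-space scales resize it. A standalone NumPy sketch with one invented anchor and delta:

```python
import numpy as np

def decode_deltas(anchors, deltas):
    """Apply predicted (dx, dy, dw, dh) to anchors.
    anchors and output boxes: [x_min, y_min, x_max, y_max]."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + w / 2
    cy = anchors[:, 1] + h / 2

    # Shift the center by a fraction of the anchor size,
    # scale the size by exp(delta)
    pred_cx = cx + deltas[:, 0] * w
    pred_cy = cy + deltas[:, 1] * h
    pred_w = w * np.exp(deltas[:, 2])
    pred_h = h * np.exp(deltas[:, 3])

    return np.stack([pred_cx - pred_w / 2, pred_cy - pred_h / 2,
                     pred_cx + pred_w / 2, pred_cy + pred_h / 2], axis=1)

anchors = np.array([[0.0, 0.0, 100.0, 100.0]])
deltas = np.array([[0.1, 0.0, np.log(2.0), 0.0]])  # shift right 10 px, double width
print(decode_deltas(anchors, deltas))  # [[-40. 0. 160. 100.]]
```

After decoding, the boxes are clipped to the image, the lowest-scoring ones are dropped, and NMS keeps roughly the top few hundred as proposals.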

6. Feature Pyramid Networks (FPN)

6.1 Multi-Scale Feature Fusion

FPN (Feature Pyramid Network), proposed by Lin et al. in 2017, addresses the problem of detecting objects across a wide range of scales.

Problem: small objects have too little resolution left in high-level feature maps to be detected reliably.
Solution: build a multi-scale feature pyramid.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Feature Pyramid Network"""

    def __init__(self, in_channels_list=[256, 512, 1024, 2048], out_channels=256):
        super().__init__()

        self.out_channels = out_channels

        # Lateral 1x1 convs to unify channel counts
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for in_channels in in_channels_list
        ])

        # Output 3x3 convs to smooth the merged maps
        self.fpn_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, features_list):
        """
        features_list: multi-scale backbone features [C2, C3, C4, C5]
        """
        # Lateral connections
        laterals = [lateral_conv(features)
                   for lateral_conv, features
                   in zip(self.lateral_convs, features_list)]

        # Top-down pathway: merge level by level
        for i in range(len(laterals) - 1, 0, -1):
            # Upsample and add
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode='nearest')

        # A 3x3 conv on each merged map
        fpn_features = [conv(lateral)
                       for conv, lateral
                       in zip(self.fpn_convs, laterals)]

        return fpn_features  # [P2, P3, P4, P5]

print("FPN design principles:")
print("=" * 50)
print("1. Bottom-up: the backbone's forward pass yields multi-scale features")
print("2. Top-down: upsample high-level features and merge them downward")
print("3. Lateral connections: 1x1 convs match channel counts before merging")
print("4. Fusion: high-level semantics combine with low-level detail")
print("\nFPN is now a standard component of modern detectors!")

7. Hands-On: Using a Pretrained Detector

7.1 Object Detection Models in torchvision

import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def use_pretrained_detector():
    """Load a pretrained Faster R-CNN detector"""

    # Load the pretrained COCO weights
    model = fasterrcnn_resnet50_fpn(weights='COCO_V1')
    model.eval()

    # COCO class names
    COCO_CLASSES = [
        '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
        'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
        'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
        'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack',
        'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
        'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
        'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife',
        'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
        'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
        'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
        'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
        'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
        'hair drier', 'toothbrush'
    ]

    print(f"Number of classes: {len(COCO_CLASSES)}")

    return model, COCO_CLASSES


def detect_objects(model, image_path, threshold=0.5):
    """Detect objects in an image"""

    from torchvision.transforms import functional as F

    # Load the image
    # img = Image.open(image_path)
    # img_tensor = F.to_tensor(img).unsqueeze(0)

    # Use a random tensor as a stand-in image
    img_tensor = torch.randn(1, 3, 600, 800)

    # Inference
    with torch.no_grad():
        predictions = model(img_tensor)[0]

    # Keep only high-confidence detections
    keep = predictions['scores'] > threshold
    boxes = predictions['boxes'][keep]
    labels = predictions['labels'][keep]
    scores = predictions['scores'][keep]

    print(f"Detected {len(boxes)} objects (threshold={threshold})")

    return boxes, labels, scores


print("Pretrained detectors in torchvision:")
print("=" * 50)
print("1. fasterrcnn_resnet50_fpn: Faster R-CNN + ResNet50 + FPN")
print("2. retinanet_resnet50_fpn: RetinaNet (one-stage)")
print("3. ssdlite320_mobilenet_v3_large: MobileNet + SSDLite")
print("\nRules of thumb:")
print("- General use: Faster R-CNN (highest accuracy)")
print("- Real-time: YOLO (fastest)")
print("- Mobile: SSD MobileNet (balanced)")

7.2 Drawing Detection Results

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

def draw_detections(image, boxes, labels, scores, class_names, threshold=0.5):
    """Draw detection results on an image"""

    fig, ax = plt.subplots(1, figsize=(12, 9))
    ax.imshow(image)

    # A fixed random color per class
    np.random.seed(42)
    colors = np.random.randint(0, 255, size=(len(class_names), 3))

    for box, label, score in zip(boxes, labels, scores):
        if score < threshold:
            continue

        x_min, y_min, x_max, y_max = box
        w, h = x_max - x_min, y_max - y_min

        # Draw the bounding box
        color = colors[label] / 255
        rect = patches.Rectangle(
            (x_min, y_min), w, h,
            linewidth=2, edgecolor=color, facecolor='none'
        )
        ax.add_patch(rect)

        # Add the label text
        label_text = f"{class_names[label]}: {score:.2f}"
        ax.text(x_min, y_min - 5, label_text,
               color='white', fontsize=10,
               bbox=dict(boxstyle='round', facecolor=color, alpha=0.7))

    ax.axis('off')
    plt.title(f'Detection Results ({len(boxes)} objects)')
    plt.tight_layout()
    plt.show()

# Example
def example_plot():
    """Example: draw some mocked-up detections"""

    # Mock detection results (labels must index into class_names below)
    np.random.seed(42)
    boxes = np.array([
        [100, 100, 300, 250],
        [200, 150, 400, 350],
        [50, 200, 150, 400],
    ])
    labels = np.array([1, 3, 2])  # person, car, bicycle
    scores = np.array([0.95, 0.88, 0.72])

    class_names = ['__background__', 'person', 'bicycle', 'car']

    # A random image as canvas
    image = np.random.randint(0, 255, (500, 600, 3), dtype=np.uint8)

    draw_detections(image, boxes, labels, scores, class_names)

example_plot()

8. Pitfalls to Avoid

Pitfall 1: Mixing up bounding-box formats

Symptom: boxes end up in the wrong place

Common formats

  • [x_min, y_min, x_max, y_max] - top-left to bottom-right corner
  • [x_center, y_center, width, height] - center point and size
  • [x_min, y_min, width, height] - top-left corner and size

The fix

# Conversion helpers
def xyxy_to_cxcywh(boxes):
    """[x_min, y_min, x_max, y_max] -> [x_center, y_center, width, height]"""
    x_min, y_min, x_max, y_max = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    w = x_max - x_min
    h = y_max - y_min
    return np.stack([cx, cy, w, h], axis=1)

def cxcywh_to_xyxy(boxes):
    """[x_center, y_center, width, height] -> [x_min, y_min, x_max, y_max]"""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x_min = cx - w / 2
    y_min = cy - h / 2
    x_max = cx + w / 2
    y_max = cy + h / 2
    return np.stack([x_min, y_min, x_max, y_max], axis=1)
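A quick standalone round-trip check of the two conversions (redefined compactly here so the snippet runs on its own); recent torchvision versions also provide torchvision.ops.box_convert for the same job:

```python
import numpy as np

def xyxy_to_cxcywh(b):
    """[x_min, y_min, x_max, y_max] -> [cx, cy, w, h]"""
    return np.stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2,
                     b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]], axis=1)

def cxcywh_to_xyxy(b):
    """[cx, cy, w, h] -> [x_min, y_min, x_max, y_max]"""
    return np.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                     b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], axis=1)

boxes = np.array([[50.0, 50.0, 150.0, 150.0], [10.0, 20.0, 30.0, 80.0]])
# Converting there and back must return the original boxes
assert np.allclose(cxcywh_to_xyxy(xyxy_to_cxcywh(boxes)), boxes)
print("round trip OK")
```

A round-trip assertion like this is a cheap sanity check to keep in a test suite: most "boxes in the wrong place" bugs are a missed or doubled conversion.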

Pitfall 2: Poorly chosen NMS parameters

Symptom: many duplicate boxes, or missed detections

NMS (Non-Maximum Suppression)

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-maximum suppression
    boxes: (N, 4) [x_min, y_min, x_max, y_max]
    scores: (N,)
    """
    # Sort by score, highest first
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)

        if order.size == 1:
            break

        # Vectorized IoU between the kept box and all remaining boxes
        # (the scalar iou() above does not broadcast over arrays)
        rest = boxes[order[1:]]
        xx1 = np.maximum(boxes[i, 0], rest[:, 0])
        yy1 = np.maximum(boxes[i, 1], rest[:, 1])
        xx2 = np.minimum(boxes[i, 2], rest[:, 2])
        yy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        ious = inter / (area_i + areas - inter)

        # Keep only the boxes with IoU below the threshold
        inds = np.where(ious <= iou_threshold)[0]
        order = order[inds + 1]

    return np.array(keep)

print("Tuning NMS:")
print("- IoU threshold too high (>0.7): duplicate boxes survive")
print("- IoU threshold too low (<0.3): nearby objects of the same class get suppressed")
print("- Score threshold: usually 0.05-0.1; too low adds false positives")
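For the suppression of genuinely overlapping objects noted above, Soft-NMS decays scores instead of discarding boxes outright. A NumPy sketch of the Gaussian variant (the sigma and thresholds are illustrative defaults, not tuned values):

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.05):
    """Soft-NMS (Gaussian variant): overlapping boxes get their scores
    decayed by exp(-iou^2 / sigma) rather than being removed, so two real
    objects that overlap can both survive."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()
    keep = []
    idxs = np.arange(len(scores))
    while idxs.size > 0:
        top = idxs[scores[idxs].argmax()]   # highest remaining score
        keep.append(top)
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        # Vectorized IoU between the kept box and the rest
        xx1 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        yy1 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        xx2 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        yy2 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_top = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
        areas = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        ious = inter / (area_top + areas - inter)
        scores[idxs] *= np.exp(-ious ** 2 / sigma)   # decay, don't suppress
        idxs = idxs[scores[idxs] > score_threshold]  # drop only near-zero scores
    return np.array(keep)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]])
scores = np.array([0.9, 0.8, 0.7])
print(soft_nms(boxes, scores))  # all three survive, with decayed scores
```

Unlike hard NMS, a heavily overlapping second box here keeps a reduced but nonzero score, which helps in crowded scenes.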

Pitfall 3: Mismatched batch handling between training and inference

Symptom: training runs fine, but inference runs out of memory or returns odd results

Cause: per-RoI operations like RoI Pooling make memory usage depend on both batch size and proposal count

The fix

# Train with whatever batch size fits;
# at inference, process images one at a time, keeping the batch dimension

# after model.eval()
with torch.no_grad():
    for images, targets in dataloader:
        # Inference
        predictions = model(images)

# Or accumulate several images and run them together
model.eval()
with torch.no_grad():
    batch_images = torch.stack(images_list)  # (B, 3, H, W)
    predictions = model(batch_images)

9. Chapter Summary

After working through this chapter, you should have a grasp of:

  1. Object detection basics: the differences between classification, detection, and segmentation
  2. IoU and AP: the core metrics for evaluating detectors
  3. R-CNN: the two-stage recipe of region proposals + CNN features + classification
  4. RoI Pooling: extracting fixed-size features from variable-sized regions
  5. Fast R-CNN: end-to-end training with a multi-task loss
  6. Faster R-CNN: the Region Proposal Network
  7. FPN: why multi-scale feature fusion matters
  8. Practical skills: running inference with pretrained detection models

In one sentence: Faster R-CNN made region proposals end-to-end trainable via the RPN, and combined with FPN's multi-scale features it reaches high accuracy; it remains an important object detection baseline today.


10. Exercises

  1. IoU: implement both IoU and GIoU
  2. RPN: analyze how different anchor settings affect detection performance
  3. NMS: think about how to improve NMS for heavily overlapping objects
  4. Lightweight models: explore how to run Faster R-CNN on mobile devices
  5. COCO vs VOC: compare the evaluation metrics of the two datasets

Coming next: Chapter 10, "One-Stage Object Detection: YOLO and SSD", covers the YOLO family and SSD, and how one-stage detectors achieve real-time detection.


If this chapter helped you, likes, bookmarks, and follows are appreciated. Questions are welcome in the comments.
