1. Skeleton-Based Action Recognition

1.1 Modalities of video data

Skeleton keypoint-based action recognition (OpenMMLab study notes, with PYSKL code demos)

1.2 Meaning, prerequisites, applicable scenarios, and advantages of skeleton-based action recognition

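Meaning: the action can be recognized from the skeleton sequence alone.

Prerequisites:

(1) The action classes must be defined appropriately. An example that is not suited to skeletons: if many classes are of the form "eating sth", they can only be told apart by recognizing sth, which has nothing to do with the skeleton.

(2) The video must contain skeleton data of reasonably good quality. An example that is not suited to skeletons: there is no person in the video, or only a small part of a person is visible.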

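Applicable scenarios and advantages:

(1) Training data is scarce, or there is a large bias between training and test data. Factors such as background color or environment then degrade RGB models. A skeleton-based model generalizes better in this setting because it recognizes actions from keypoint coordinates only.

(2) Lightweight. Action recognition can be done with relatively little computation.

Computation with skeletons < computation with RGB

The RGB (3D-CNN) methods referred to here can be found in MMAction2.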

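1.3 How to obtain skeleton sequences as input to a skeleton-based action recognition model

(1) Kinect Sensor (RGBD)

When building RGBD datasets, a Kinect sensor with an RGBD depth camera is used to estimate 3D skeletons. The resulting keypoint coordinates, however, are noisy and of poor quality.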

(2) ★ 2D human pose estimation

Estimate 2D human pose keypoints and use them to predict the action.

(3) 3D human pose estimation: Motion Capture

This relies on motion-capture equipment. Relatively few networks take this route in practice.

2. GCN-Based Approach: ST-GCN++

2.1 GCN: key points

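(1) GCN takes a sequence of skeleton keypoints as input, with shape T × V × C

T: sequence length

V: number of keypoints in one skeleton

C: coordinate dimension, 2 for 2D or 3 for 3D

(2) When several persons are present, GCN uses the average of all persons' features as the feature, as shown in the figure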

(3) A GCN network is a stack of GCN blocks, much as ResNet is a stack of Bottleneck blocks.
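To make items (1) and (2) concrete, here is a minimal NumPy sketch of the input layout and the multi-person averaging. The sizes and the `toy_person_feature` function are hypothetical stand-ins, not part of any library; a real GCN backbone would replace the toy function.

```python
import numpy as np

# Hypothetical sizes: M persons, T frames, V keypoints, C coordinate dims.
M, T, V, C = 2, 30, 17, 2
rng = np.random.default_rng(0)
skeletons = rng.random((M, T, V, C))          # one clip containing M persons


def toy_person_feature(x):
    """Stand-in for a GCN backbone: maps one (T, V, C) skeleton to a vector."""
    return x.mean(axis=(0, 1))                # shape (C,)


# Each person is processed independently, then the features are averaged.
per_person = np.stack([toy_person_feature(p) for p in skeletons])  # (M, C)
clip_feature = per_person.mean(axis=0)                              # (C,)
print(skeletons.shape, per_person.shape, clip_feature.shape)
```

Because every person goes through the backbone separately, the cost of this step grows with M.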

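2.2 ST-GCN architecture

As the network gets deeper, the number of feature channels (C) keeps growing, while the temporal dimension (T) is repeatedly downsampled.

The output of the last layer goes through global average pooling to produce the output feature.

Finally, a linear classifier produces the classification result.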
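The pooling and classification steps can be sketched in NumPy as follows. All sizes and the random weights are illustrative assumptions; a trained model would supply the real backbone output and classifier weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone output: C channels, T_out downsampled frames, V keypoints.
C, T_out, V, num_classes = 256, 25, 17, 60
feat = rng.standard_normal((C, T_out, V))

# Global average pooling over the temporal and keypoint axes -> (C,)
pooled = feat.mean(axis=(1, 2))

# Linear classifier: logits = W @ pooled + b
W = rng.standard_normal((num_classes, C)) * 0.01
b = np.zeros(num_classes)
logits = W @ pooled + b

pred = int(np.argmax(logits))                 # index of the predicted class
print(pooled.shape, logits.shape, pred)
```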

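A GCN block consists of one GCN layer and one TCN layer.

GCN layer: uses the coefficient matrix A to fuse features across keypoints within the same frame

TCN layer: uses a 1D convolution to model each keypoint over time

(1) The forward function of a GCN block:

def forward(self, x, A=None):
    # GCN layer, then TCN layer, plus a residual connection
    x = self.tcn(self.gcn(x, A)) + self.residual(x)
    return self.relu(x)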

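(2) TCN layer: a 1D convolution along the temporal dimension followed by a BN layer

It uses a large kernel of size 9, which substantially increases both the computation and the parameter count.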

import torch.nn as nn

class unit_tcn(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=9, stride=1):
        super().__init__()
        # "Same" padding along the temporal axis
        pad = (kernel_size - 1) // 2
        # A (kernel_size x 1) 2D conv acts as a 1D conv over T for each keypoint
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=(kernel_size, 1),
            padding=(pad, 0),
            stride=(stride, 1))
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # x: (N, C, T, V)
        return self.bn(self.conv(x))

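(3) GCN layer:

Uses the coefficient matrix A to fuse features across different keypoints.

Where A comes from: a predefined coefficient matrix × a data-driven sparse mask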

import torch
import torch.nn as nn

class unit_gcn(nn.Module):
    def __init__(self, in_channels, out_channels, s_kernel=3):
        super().__init__()
        self.s_kernel = s_kernel
        # One group of output channels per spatial subset
        self.conv = nn.Conv2d(
            in_channels,
            out_channels * s_kernel,
            kernel_size=1)

    def forward(self, x, A):
        # A has shape (s_kernel, V, V)
        assert A.size(0) == self.s_kernel
        x = self.conv(x)
        n, kc, t, v = x.size()
        x = x.view(n, self.s_kernel, kc // self.s_kernel, t, v)
        # Fuse features across keypoints with A, summing over the s_kernel subsets
        x = torch.einsum('nkctv,kvw->nctw', x, A)
        return x.contiguous()
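The einsum in the forward pass is compact, so here is a small NumPy sketch that spells out what it computes and checks it against an explicit loop. The sizes are arbitrary, and A is random here purely for illustration; in the notes above it would be the predefined matrix combined with the learned mask.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, c, t, v = 2, 3, 4, 5, 6      # batch, s_kernel, channels, frames, keypoints
x = rng.standard_normal((n, k, c, t, v))
A = rng.random((k, v, v))          # one V x V coefficient matrix per subset

# Same contraction as in unit_gcn.forward
out = np.einsum('nkctv,kvw->nctw', x, A)

# Written out: for each subset i, mix the keypoints with A[i], then sum over i.
ref = np.zeros((n, c, t, v))
for i in range(k):
    ref += x[:, i] @ A[i]          # (n, c, t, v) @ (v, w) -> (n, c, t, w)

print(out.shape, np.allclose(out, ref))
```

Each output keypoint w is therefore a weighted sum of all input keypoints v in the same frame, with weights A[i][v, w].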

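2.3 ST-GCN++

(1) Improvements to the TCN

The single large-kernel 1D convolution is dropped in favor of a multi-branch temporal convolution. The multi-branch design strengthens the network's temporal modeling while also saving computation and parameters.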

(2) Improvements to the GCN

(3) Improvements to data preprocessing and hyperparameters
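For the TCN change in item (1), a rough parameter count shows why a multi-branch layout can be cheaper than one size-9 kernel. The branch layout below is an assumption made for illustration (six branches, each starting with a 1x1 channel reduction, four of them followed by a size-3 temporal convolution); the notes above do not spell out the exact configuration. Biases and BN parameters are ignored.

```python
def single_branch_params(c_in, c_out, kernel=9):
    """Weights of one (kernel x 1) temporal convolution."""
    return c_in * c_out * kernel


def multi_branch_params(c_in, c_out, num_branches=6, num_conv_branches=4, kernel=3):
    """Weights of an assumed multi-branch layout (hypothetical configuration)."""
    mid = c_out // num_branches                        # channels per branch
    reduce = num_branches * c_in * mid                 # 1x1 conv at the head of each branch
    temporal = num_conv_branches * mid * mid * kernel  # small temporal convs
    return reduce + temporal


c = 252                                                # divisible by 6 for a clean split
single = single_branch_params(c, c)
multi = multi_branch_params(c, c)
print(single, multi, round(single / multi, 2))
```

The saving comes from running the temporal convolutions on narrow per-branch channels instead of the full width.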

2.4 Accuracy comparison of skeleton-based action recognition models

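2.5 Drawbacks of GCN

(1) Robustness: perturbations of the input can affect a GCN substantially, which makes it hard to handle missing keypoints or a distribution shift between the skeleton data used for training and testing (for example, skeletons produced by different pose extractors).

(2) Compatibility: GCN represents a skeleton sequence as a sequence of graphs, and this representation is hard to fuse with other modalities that are based on 3D-CNNs (RGB, Flow, etc.).

(3) Scalability: the computation required by GCN grows linearly with the number of persons in the video, which makes it hard to apply to tasks such as group activity recognition.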

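3. 2D-CNN-Based Approach: PoTion

Multiple heatmaps are compressed into a single image with a color coding algorithm, and the image is then processed by a 2D-CNN. Although color coding preserves the motion pattern to some extent, some information is lost during encoding.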
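As a rough illustration of color coding, the sketch below collapses the heatmaps of one joint over time into a two-channel image, weighting early frames toward channel 0 and late frames toward channel 1. This is a simplified two-channel linear scheme written from the description above, with normalization omitted; the function name and sizes are hypothetical.

```python
import numpy as np


def color_code(heatmaps):
    """Collapse (T, H, W) heatmaps of one joint into an (H, W, 2) image."""
    T = heatmaps.shape[0]
    t = np.linspace(0.0, 1.0, T)                  # relative time of each frame
    weights = np.stack([1.0 - t, t], axis=1)      # (T, 2) per-frame channel weights
    # Weighted sum over time: every frame adds its heatmap to both channels
    return np.einsum('thw,tc->hwc', heatmaps, weights)


T, H, W = 5, 4, 5
hm = np.zeros((T, H, W))
for i in range(T):
    hm[i, 0, i] = 1.0                             # a joint moving left to right along row 0

image = color_code(hm)
print(image.shape, image[0, 0], image[0, 2], image[0, 4])
```

The information loss is visible in the sum over time: when the joint passes through the same pixel at two different moments, their contributions are blended and can no longer be separated.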

4. 3D-CNN-Based Solution: PoseC3D

(1) Extract 2D skeletons

(2) Generate a 3D heatmap volume (a stack of heatmaps)

(3) Classify the resulting 3D heatmap volume with a 3D-CNN to obtain the action class

Two kinds of 3D-CNN are designed.
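Step (2) can be sketched as follows: each keypoint becomes a Gaussian blob scaled by its confidence, and the per-frame maps are stacked along time. This is a single-person sketch under assumed conventions (the K × T × H × W layout, the sigma value, and the function name are illustrative choices), not the PYSKL implementation.

```python
import numpy as np


def keypoints_to_volume(kps, scores, H, W, sigma=0.6):
    """kps: (T, K, 2) xy coordinates; scores: (T, K). Returns (K, T, H, W)."""
    T, K, _ = kps.shape
    ys, xs = np.mgrid[0:H, 0:W]                     # pixel row and column indices
    volume = np.zeros((K, T, H, W), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y = kps[t, k]
            # Gaussian centered on the keypoint, scaled by its confidence
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            volume[k, t] = g * scores[t, k]
    return volume


T, K, H, W = 3, 2, 8, 8
kps = np.array([[[1, 1], [5, 2]],
                [[2, 1], [5, 3]],
                [[3, 1], [5, 4]]], dtype=np.float32)
scores = np.full((T, K), 0.9, dtype=np.float32)
volume = keypoints_to_volume(kps, scores, H, W)
print(volume.shape)
```

The resulting volume has the same kind of layout as a video clip (channels, time, height, width), which is what lets a 3D-CNN consume it in step (3).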

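5. Comparison of the Models

5.1 Advantages of PoseC3D

(1) Robustness

In each frame, one keypoint is dropped with probability p. The results show that the 3D-CNN is almost unaffected in this setting.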
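A perturbation of this kind can be sketched as below. Representing a dropped keypoint by zeroing its coordinates is an assumption made here for illustration; the notes do not say how the drop was implemented, and the function name is hypothetical.

```python
import numpy as np


def drop_one_keypoint(kps, p, rng):
    """kps: (T, K, C). In each frame, with probability p, zero out one random keypoint."""
    out = kps.copy()
    T, K, _ = out.shape
    for t in range(T):
        if rng.random() < p:
            out[t, rng.integers(K)] = 0.0      # assumed encoding of a missing keypoint
    return out


rng = np.random.default_rng(0)
clean = np.ones((10, 17, 2))
noisy = drop_one_keypoint(clean, p=0.5, rng=rng)
frames_hit = int((noisy == 0).all(axis=-1).any(axis=1).sum())
print(frames_hit, "of", clean.shape[0], "frames lost a keypoint")
```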

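(2) Scalability

As the number of persons in the video grows, the 3D-CNN needs no extra computation or parameters, whereas GCN incurs one more share of computation for every additional person.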
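One way to see why the heatmap input does not grow with the number of persons: all persons can be drawn into the same set of heatmaps. The sketch below merges per-person volumes with an element-wise maximum, which is an assumption about the merging rule made for illustration; the sizes are arbitrary.

```python
import numpy as np


def merge_persons(per_person_volumes):
    """(M, K, T, H, W) -> (K, T, H, W): all persons share one set of heatmaps."""
    return per_person_volumes.max(axis=0)


rng = np.random.default_rng(0)
shapes = {m: merge_persons(rng.random((m, 17, 8, 16, 16))).shape for m in (1, 2, 5)}
print(shapes)
```

The network input has the same shape for 1, 2, or 5 persons, while a GCN input of shape M × T × V × C grows with M.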

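5.2 PoseC3D vs. 2D-CNN

PoseC3D needs less computation and fewer parameters while reaching higher accuracy. After pretraining on Kinetics-400, its advantage is even more pronounced.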

5.3 3D-CNN vs. GCN

6. PYSKL (with code demos)

6.1 Visualizing skeletons

(1) Import dependencies

import glob
from pyskl.smp import *
from pyskl.utils.visualize import Vis3DPose, Vis2DPose
from mmcv import load, dump

(2) Download annotations

download_file('http://download.openmmlab.com/mmaction/pyskl/demo/annotations/ntu60_samples_hrnet.pkl', 'ntu60_2d.pkl')
download_file('http://download.openmmlab.com/mmaction/pyskl/demo/annotations/ntu60_samples_3danno.pkl', 'ntu60_3d.pkl')

(3) Visualize 2D skeletons

annotations = load('ntu60_2d.pkl')
index = 0
anno = annotations[index]
vid = Vis2DPose(anno, thre=0.2, out_shape=(540, 960), layout='coco', fps=12, video=None)
vid.ipython_display()

(4) Visualize 2D skeletons overlaid on the RGB video

annotations = load('ntu60_2d.pkl')
index = 0
anno = annotations[index]
frame_dir = anno['frame_dir']
video_url = f"http://download.openmmlab.com/mmaction/pyskl/demo/nturgbd/{frame_dir}.avi"
download_file(video_url, frame_dir + '.avi')
vid = Vis2DPose(anno, thre=0.2, out_shape=(540, 960), layout='coco', fps=12, video=frame_dir + '.avi')
vid.ipython_display()

(5) Visualize 3D skeletons

from pyskl.datasets.pipelines import PreNormalize3D
annotations = load('ntu60_3d.pkl')
index = 0
anno = annotations[index]
anno = PreNormalize3D()(anno)  # * Need Pre-Normalization before Visualization
vid = Vis3DPose(anno, layout='nturgb+d', fps=12, angle=(30, 45), fig_size=(8, 8), dpi=80)
vid = vid.vis()
vid.ipython_display()

6.2 Visualizing skeleton + heatmap + action recognition

(1) Import dependencies

import os
import cv2
import os.path as osp
import decord
import numpy as np
import matplotlib.pyplot as plt
import urllib
import moviepy.editor as mpy
import random as rd
from pyskl.smp import *
from mmpose.apis import vis_pose_result
from mmpose.models import TopDown
from mmcv import load, dump

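(2) Prepare the preprocessed annotation files

  1. Location of the preprocessed annotation files: annotations

  2. Links to the preprocessed skeleton annotations:

  3. Format of the pickle files:

Each pickle file corresponds to one action recognition dataset. Its content is a dict with two fields: split and annotations.

# Paths to the preprocessed annotations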
gym_ann_file = './data/gym/gym_hrnet.pkl'
ntu60_ann_file = './data/nturgbd/ntu60_hrnet.pkl'
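To get a feel for this format, the sketch below builds a tiny in-memory stand-in and reads it back. The field names inside each annotation (frame_dir, label, total_frames, keypoint, keypoint_score) are the ones used by the code in this section; the split names, the shape of the split field, and all values are hypothetical.

```python
import io
import pickle

import numpy as np

T, K = 4, 17
toy = {
    'split': {'train': ['clip_000'], 'val': []},      # hypothetical split layout
    'annotations': [{
        'frame_dir': 'clip_000',                      # clip identifier
        'label': 3,                                   # action class index
        'total_frames': T,
        'keypoint': np.zeros((1, T, K, 2), dtype=np.float32),     # N x T x K x 2
        'keypoint_score': np.ones((1, T, K), dtype=np.float32),   # N x T x K
    }],
}

# Round-trip through pickle, using a buffer instead of a file on disk
buf = io.BytesIO()
pickle.dump(toy, buf)
buf.seek(0)
data = pickle.load(buf)

# Select the annotations that belong to one split
train_ids = set(data['split']['train'])
train_annos = [a for a in data['annotations'] if a['frame_dir'] in train_ids]
print(sorted(data.keys()), len(train_annos), train_annos[0]['keypoint'].shape)
```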

(3) Define the visualization helpers (action label overlay and skeleton drawing)

# Font and line settings
FONTFACE = cv2.FONT_HERSHEY_DUPLEX
FONTSCALE = 0.6
FONTCOLOR = (255, 255, 255)
BGBLUE = (0, 119, 182)
THICKNESS = 1
LINETYPE = 1
# Helper that draws the action label onto a frame
def add_label(frame, label, BGCOLOR=BGBLUE):
    threshold = 30
    def split_label(label):
        label = label.split()
        lines, cline = [], ''
        for word in label:
            if len(cline) + len(word) < threshold:
                cline = cline + ' ' + word
            else:
                lines.append(cline)
                cline = word
        if cline != '':
            lines += [cline]
        return lines
    if len(label) > 30:
        label = split_label(label)
    else:
        label = [label]
    label = ['Action: '] + label
    sizes = []
    for line in label:
        sizes.append(cv2.getTextSize(line, FONTFACE, FONTSCALE, THICKNESS)[0])
    box_width = max([x[0] for x in sizes]) + 10
    text_height = sizes[0][1]
    box_height = len(sizes) * (text_height + 6)
    cv2.rectangle(frame, (0, 0), (box_width, box_height), BGCOLOR, -1)
    for i, line in enumerate(label):
        location = (5, (text_height + 6) * i + text_height + 3)
        cv2.putText(frame, line, location, FONTFACE, FONTSCALE, FONTCOLOR, THICKNESS, LINETYPE)
    return frame
# Helper that draws the skeleton onto every frame of a video
def vis_skeleton(vid_path, anno, category_name=None, ratio=0.5):
    vid = decord.VideoReader(vid_path)
    frames = [x.asnumpy() for x in vid]
    h, w, _ = frames[0].shape
    new_shape = (int(w * ratio), int(h * ratio))
    frames = [cv2.resize(f, new_shape) for f in frames]
    assert len(frames) == anno['total_frames']
    # The shape is N x T x K x 3
    kps = np.concatenate([anno['keypoint'], anno['keypoint_score'][..., None]], axis=-1)
    kps[..., :2] *= ratio
    # Convert to T x N x K x 3
    kps = kps.transpose([1, 0, 2, 3])
    vis_frames = []
    # vis_pose_result needs a TopDown model instance, so build a minimal one
    model = TopDown(backbone=dict(type='ShuffleNetV1'))
    for f, kp in zip(frames, kps):
        bbox = np.zeros([0, 4], dtype=np.float32)
        result = [dict(keypoints=k) for k in kp]
        vis_frame = vis_pose_result(model, f, result)
        if category_name is not None:
            vis_frame = add_label(vis_frame, category_name)
        vis_frames.append(vis_frame)
    return vis_frames
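The array handling inside vis_skeleton is easy to misread, so here is the same sequence of steps on toy data. Only NumPy is needed; the sizes and values are made up.

```python
import numpy as np

N, T, K = 2, 5, 17                                   # persons, frames, keypoints
keypoint = np.random.default_rng(0).random((N, T, K, 2)) * 100
score = np.ones((N, T, K))
ratio = 0.5                                          # same resize ratio as the frames

# Append the confidence as a third value per keypoint: N x T x K x 3
kps = np.concatenate([keypoint, score[..., None]], axis=-1)
# Scale only the x, y coordinates to match the resized frames
kps[..., :2] *= ratio
# Put time first so that iterating yields one frame at a time: T x N x K x 3
kps = kps.transpose([1, 0, 2, 3])

print(kps.shape)
```

After the transpose, zip(frames, kps) pairs each frame with the keypoints of all persons in that frame.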

(4) Define the heatmap helpers

keypoint_pipeline = [
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1., allow_imgpad=True),
    dict(type='Resize', scale=(-1, 64)),
    dict(type='CenterCrop', crop_size=64),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False)
]
limb_pipeline = [
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1., allow_imgpad=True),
    dict(type='Resize', scale=(-1, 64)),
    dict(type='CenterCrop', crop_size=64),
    dict(type='GeneratePoseTarget', with_kp=False, with_limb=True)
]
from pyskl.datasets.pipelines import Compose
def get_pseudo_heatmap(anno, flag='keypoint'):
    assert flag in ['keypoint', 'limb']
    pipeline = Compose(keypoint_pipeline if flag == 'keypoint' else limb_pipeline)
    return pipeline(anno)['imgs']
def vis_heatmaps(heatmaps, channel=-1, ratio=8):
    # If channel is -1, draw all keypoints on the same map
    import matplotlib.cm as cm
    heatmaps = [x.transpose(1, 2, 0) for x in heatmaps]
    h, w, _ = heatmaps[0].shape
    newh, neww = int(h * ratio), int(w * ratio)
    if channel == -1:
        heatmaps = [np.max(x, axis=-1) for x in heatmaps]
    cmap = cm.viridis
    heatmaps = [(cmap(x)[..., :3] * 255).astype(np.uint8) for x in heatmaps]
    heatmaps = [cv2.resize(x, (neww, newh)) for x in heatmaps]
    return heatmaps

(5) Action recognition and visualization (GYM and NTU60 datasets)

First, the GYM dataset

# Load GYM annotations
lines = mrlines('./tools/data/label_map/gym.txt')    # load the action label txt file
gym_categories = [x.strip().split('; ')[-1] for x in lines]
gym_annos = load(gym_ann_file)['annotations']
# download sample videos of GYM
!wget https://download.openmmlab.com/mmaction/posec3d/gym_samples.tar
!tar -xf gym_samples.tar
gym_root = 'gym_samples/'    # 50 gymnastics videos
gym_vids = os.listdir(gym_root)
# index in 0 - 49.
# Pick the 3rd video in the folder
idx = 2
vid = gym_vids[idx]
frame_dir = vid.split('.')[0]
vid_path = osp.join(gym_root, vid)
anno = [x for x in gym_annos if x['frame_dir'] == frame_dir][0]
# Visualize the skeleton and overlay the action label taken from the annotation
vis_frames = vis_skeleton(vid_path, anno, gym_categories[anno['label']])
vid = mpy.ImageSequenceClip(vis_frames, fps=24)
vid.ipython_display()

# Visualize the keypoint heatmaps
keypoint_heatmap = get_pseudo_heatmap(anno)
keypoint_mapvis = vis_heatmaps(keypoint_heatmap)
keypoint_mapvis = [add_label(f, gym_categories[anno['label']]) for f in keypoint_mapvis]
vid = mpy.ImageSequenceClip(keypoint_mapvis, fps=24)
vid.ipython_display()

# Visualize the limb heatmaps
limb_heatmap = get_pseudo_heatmap(anno, 'limb')
limb_mapvis = vis_heatmaps(limb_heatmap)
limb_mapvis = [add_label(f, gym_categories[anno['label']]) for f in limb_mapvis]
vid = mpy.ImageSequenceClip(limb_mapvis, fps=24)
vid.ipython_display()

Next, the NTU60 dataset

# NTU60 class labels
ntu_categories = ['drink water', 'eat meal/snack', 'brushing teeth', 'brushing hair', 'drop', 'pickup', 
                  'throw', 'sitting down', 'standing up (from sitting position)', 'clapping', 'reading', 
                  'writing', 'tear up paper', 'wear jacket', 'take off jacket', 'wear a shoe', 
                  'take off a shoe', 'wear on glasses', 'take off glasses', 'put on a hat/cap', 
                  'take off a hat/cap', 'cheer up', 'hand waving', 'kicking something', 
                  'reach into pocket', 'hopping (one foot jumping)', 'jump up', 
                  'make a phone call/answer phone', 'playing with phone/tablet', 'typing on a keyboard', 
                  'pointing to something with finger', 'taking a selfie', 'check time (from watch)', 
                  'rub two hands together', 'nod head/bow', 'shake head', 'wipe face', 'salute', 
                  'put the palms together', 'cross hands in front (say stop)', 'sneeze/cough', 
                  'staggering', 'falling', 'touch head (headache)', 'touch chest (stomachache/heart pain)', 
                  'touch back (backache)', 'touch neck (neckache)', 'nausea or vomiting condition', 
                  'use a fan (with hand or paper)/feeling warm', 'punching/slapping other person', 
                  'kicking other person', 'pushing other person', 'pat on back of other person', 
                  'point finger at the other person', 'hugging other person', 
                  'giving something to other person', "touch other person's pocket", 'handshaking', 
                  'walking towards each other', 'walking apart from each other']
# Load ntu60 annotations
ntu_annos = load(ntu60_ann_file)['annotations']
# download sample videos of NTU-60
!wget https://download.openmmlab.com/mmaction/posec3d/ntu_samples.tar
!tar -xf ntu_samples.tar
# !rm ntu_samples.tar
ntu_root = 'ntu_samples/'       # 50 indoor action videos
ntu_vids = os.listdir(ntu_root)
# index in 0 - 49.
# Pick the 42nd video in the folder
idx = 41
vid = ntu_vids[idx]
frame_dir = vid.split('.')[0]
vid_path = osp.join(ntu_root, vid)
anno = [x for x in ntu_annos if x['frame_dir'] == frame_dir.split('_')[0]][0]
# Visualize the skeleton and overlay the action label taken from the annotation
vis_frames = vis_skeleton(vid_path, anno, ntu_categories[anno['label']])
vid = mpy.ImageSequenceClip(vis_frames, fps=24)
vid.ipython_display()

# Visualize the keypoint heatmaps
keypoint_heatmap = get_pseudo_heatmap(anno)
keypoint_mapvis = vis_heatmaps(keypoint_heatmap)
keypoint_mapvis = [add_label(f, ntu_categories[anno['label']]) for f in keypoint_mapvis]
vid = mpy.ImageSequenceClip(keypoint_mapvis, fps=24)
vid.ipython_display()

# Visualize the limb heatmaps
limb_heatmap = get_pseudo_heatmap(anno, 'limb')
limb_mapvis = vis_heatmaps(limb_heatmap)
limb_mapvis = [add_label(f, ntu_categories[anno['label']]) for f in limb_mapvis]
vid = mpy.ImageSequenceClip(limb_mapvis, fps=24)
vid.ipython_display()

7. References

  1. A more detailed explanation of PoseC3D: PoseC3D: a new paradigm for human pose-based action recognition
  2. PoseC3D project: posec3d
  3. PoseC3D paper: Revisiting Skeleton-based Action Recognition
  4. PYSKL project: PYSKL
  5. MMPose installation guide: MMPose installation
  6. Location of the preprocessed annotation files: annotations
  7. GYM annotation file: GYM_annotation
  8. NTU60 annotation file: NTU60_annotation
  9. gym.txt label file: gym.txt