みらいテックラボ

Notes by the site admin, a weekend programmer, on technologies of interest such as speech/image recognition and machine learning, including introductions and hands-on experiments.

Trying Out Pose Estimation with KAPAO!

Open-source code for pose-estimation algorithms such as OpenPose [1] and HRNet (High-Resolution Network) [2] has been publicly available for some time, but KAPAO (Keypoints and Poses as Objects) [3] is reported to be both fast and accurate, so this time I gave it a try.

I hit a few snags along the way, so I am leaving these notes.


1. Setup [3]
Basically you just follow the Setup section of the GitHub README, but, perhaps due to my PC environment, the following error occurred.

[PC environment]

(kapao) aska@moonlight:~/kapao$ python demos/image.py --bbox
/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning: 
NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Using device: cuda:0
Traceback (most recent call last):
  File "demos/image.py", line 69, in <module>
    model = attempt_load(args.weights, map_location=device)
  File "/home/aska/kapao/models/experimental.py", line 96, in attempt_load
    model.append(ckpt['ema' if ckpt.get('ema') else 'model'].float().fuse().eval())  # FP32 model
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 692, in float
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 692, in <lambda>
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

"The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70."ってことで, どうもPyTorchがPCのCUDAのバージョンとあっていないようだ.
そこで, インストールされているPyTorch(1.9.1)とTorchvision(0.10.1)をアンインストールし, PCのCUDAバージョンに対応したPyTorch(1.12.0+cu113)とTorchvision(0.13.0+cu113)をインストールした.
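
One way to get matching builds is pip with PyTorch's cu113 wheel index (https://download.pytorch.org/whl/cu113). After reinstalling, a quick check along these lines (a minimal sketch, not part of the KAPAO repo) confirms the new wheel actually covers the RTX 3060's sm_86:

# sanity check that the reinstalled PyTorch wheel supports this GPU
import torch

print(torch.__version__, torch.version.cuda)   # expect: 1.12.0+cu113 11.3
print(torch.cuda.get_device_capability(0))     # RTX 3060 -> (8, 6), i.e. sm_86
print(torch.cuda.get_arch_list())              # the sm_* targets the wheel was built for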
Next, this error appeared!!

(kapao) aska@moonlight:~/kapao$ python demos/image.py --bbox
Using device: cuda:0
image 1/1 /home/aska/kapao/res/crowdpose_100024.jpg: Traceback (most recent call last):
  File "demos/image.py", line 80, in <module>
    out = model(img, augment=True, kp_flip=data['kp_flip'], scales=data['scales'], flips=data['flips'])[0]
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aska/kapao/models/yolo.py", line 137, in forward
    return self.forward_augment(x, kp_flip, s=scales, f=flips)  # augmented inference, None
  File "/home/aska/kapao/models/yolo.py", line 148, in forward_augment
    yi, train_out_i = self.forward_once(xi)  # forward
  File "/home/aska/kapao/models/yolo.py", line 173, in forward_once
    x = m(x)  # run
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/upsampling.py", line 154, in forward
    recompute_scale_factor=self.recompute_scale_factor)
  File "/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Upsample' object has no attribute 'recompute_scale_factor'

A little searching turned up a fix [4] that looked promising.
Workaround:
Edit upsampling.py in the ~/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/nn/modules directory.

Before:

def forward(self, input: Tensor) -> Tensor:
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners,
                         recompute_scale_factor=self.recompute_scale_factor)

After:

def forward(self, input: Tensor) -> Tensor:
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)

With this change, it now runs, at least for the time being.
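
Incidentally, rather than patching site-packages (which a PyTorch reinstall would silently undo), an alternative workaround I have seen for this version mismatch is to set the missing attribute on the loaded model itself. A sketch, assuming it is placed right after the attempt_load call in the demo script:

# give the old checkpoint's Upsample modules the attribute newer torch expects
import torch.nn as nn

for m in model.modules():
    if isinstance(m, nn.Upsample):
        m.recompute_scale_factor = None  # None behaves the same as omitting the argument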


2. Running the Demos
Following the Inference Demos section on GitHub, I first verified operation on a still image.

(kapao) aska@moonlight:~/kapao$ python demos/image.py --bbox --pose --face --no-kp-dets
Using device: cuda:0
/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
image 1/1 /home/aska/kapao/res/crowdpose_100024.jpg: 

[Result]
File: crowdpose_100024_kapao_l_coco_bbox_pose_face.png

Next, I verified operation on a video as well.

(kapao) aska@moonlight:~/kapao$ python demos/video.py --face --gif
Using device: cuda:0
/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Running inference:  99%|████████████████████████████████████████████████████████████▋| 190/191 [00:03<00:00, 60.21it/s]
Saving GIF...
191it [00:07, 27.19it/s]

[Result]


Instead of the bundled demo video, I also tried the recently popular "Kitsune Dance" from YouTube.

[Source video: YouTube 【朗報】きつねダンス『ついに“アレ”がつきました!』]


3. Webcam Support
To run it on a live camera feed, I adapted the demo, using video.py as a base.

[Code]

import sys
from pathlib import Path
FILE = Path(__file__).absolute()
sys.path.append(FILE.parents[1].as_posix())  # add kapao/ to path

import argparse
from utils.torch_utils import select_device, time_sync
from utils.general import check_img_size
from utils.datasets import LoadWebcam
from models.experimental import attempt_load
import torch
import cv2
import yaml
import imageio
from val import run_nms, post_process_batch
import csv

def main(args):
    with open(args.data) as f:
        data = yaml.safe_load(f)  # load data dict

    # add inference settings to data dict
    data['imgsz'] = args.imgsz
    data['conf_thres'] = args.conf_thres
    data['iou_thres'] = args.iou_thres
    data['use_kp_dets'] = not args.no_kp_dets
    data['conf_thres_kp'] = args.conf_thres_kp
    data['iou_thres_kp'] = args.iou_thres_kp
    data['conf_thres_kp_person'] = args.conf_thres_kp_person
    data['overwrite_tol'] = args.overwrite_tol
    data['scales'] = args.scales
    data['flips'] = [None if f == -1 else f for f in args.flips]
    data['count_fused'] = False

    device = select_device(args.device, batch_size=1)
    print('Using device: {}'.format(device))

    model = attempt_load(args.weights, map_location=device)  # load FP32 model
    half = args.half & (device.type != 'cpu')
    if half:  # half precision only supported on CUDA
        model.half()
    stride = int(model.stride.max())  # model stride

    imgsz = check_img_size(args.imgsz, s=stride)  # check image size
    dataset = LoadWebcam(pipe='0', img_size=imgsz, stride=stride)

    if device.type != 'cpu':
        model(torch.zeros(1, 3, imgsz, imgsz).to(device).type_as(next(model.parameters())))  # run once

    cap = dataset.cap
    fps = cap.get(cv2.CAP_PROP_FPS)
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    print(f'CAM : {w}x{h}, {fps}')
    gif_frames = []
    if args.csv:
        # the loop below uses csv_writer; create it here (output filename is an arbitrary choice)
        csv_file = open('pose_keypoints.csv', 'w', newline='')
        csv_writer = csv.writer(csv_file)

    t0 = time_sync()
    for i, (path, img, im0, _) in enumerate(dataset):
        img = torch.from_numpy(img).to(device)
        img = img.half() if half else img.float()  # uint8 to fp16/32
        img = img / 255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape) == 3:
            img = img[None]  # expand for batch dim

        out = model(img, augment=True, kp_flip=data['kp_flip'], scales=data['scales'], flips=data['flips'])[0]
        person_dets, kp_dets = run_nms(data, out)
        bboxes, poses, _, _, _ = post_process_batch(data, img, [], [[im0.shape[:2]]], person_dets, kp_dets)

        im0_copy = im0.copy()

        # DRAW POSES
        csv_row = []
        for j, (bbox, pose) in enumerate(zip(bboxes, poses)):
            x1, y1, x2, y2 = bbox
            cv2.rectangle(im0_copy, (int(x1), int(y1)), (int(x2), int(y2)), args.color, thickness=1)
            if args.csv:
                for x, y, c in pose:
                    csv_row.extend([x, y, c])
            if args.face:
                for x, y, c in pose[data['kp_face']]:
                    if not args.kp_obj or c:
                        cv2.circle(im0_copy, (int(x), int(y)), args.kp_size, args.color, args.kp_thick)
            for seg in data['segments'].values():
                if not args.kp_obj or (pose[seg[0], -1] and pose[seg[1], -1]):
                    pt1 = (int(pose[seg[0], 0]), int(pose[seg[0], 1]))
                    pt2 = (int(pose[seg[1], 0]), int(pose[seg[1], 1]))
                    cv2.line(im0_copy, pt1, pt2, args.color, args.line_thick)
        im0 = cv2.addWeighted(im0, args.alpha, im0_copy, 1 - args.alpha, gamma=0)

        if i == 0:
            t = time_sync() - t0
        else:
            t = time_sync() - t1

        if not args.gif and args.fps_size:
            cv2.putText(im0, '{:.1f} FPS'.format(1 / t), (5 * args.fps_size, 25 * args.fps_size),
                        cv2.FONT_HERSHEY_SIMPLEX, args.fps_size, (255, 255, 255), thickness=2 * args.fps_size)

        if args.gif:
            gif_img = cv2.cvtColor(cv2.resize(im0, dsize=tuple(args.gif_size)), cv2.COLOR_RGB2BGR)
            if args.fps_size:
                cv2.putText(gif_img, '{:.1f} FPS'.format(1 / t), (5 * args.fps_size, 25 * args.fps_size),
                            cv2.FONT_HERSHEY_SIMPLEX, args.fps_size, (255, 255, 255), thickness=2 * args.fps_size)
            gif_frames.append(gif_img)
        else:
            cv2.imshow('', im0)
            cv2.waitKey(1)

        if args.csv:
            csv_writer.writerow(csv_row)

        t1 = time_sync()
        key = cv2.waitKey(1)
        if key == 27:
            break

    cv2.destroyAllWindows()
    cap.release()
    if args.csv:
        csv_file.close()
    if args.gif and gif_frames:
        # save the collected frames as an animated GIF, as demos/video.py does
        # (output filename is an arbitrary choice)
        print('Saving GIF...')
        imageio.mimsave('kapao_webcam.gif', gif_frames, fps=fps if fps > 0 else 30)

def options():
    parser = argparse.ArgumentParser()
    # video options
    parser.add_argument('--color', type=int, nargs='+', default=[255, 255, 255], help='pose color')
    parser.add_argument('--face', action='store_true', help='plot face keypoints')
    parser.add_argument('--display', action='store_true', help='display inference results')
    parser.add_argument('--fps-size', type=int, default=1)
    parser.add_argument('--gif', action='store_true', help='create gif')
    parser.add_argument('--gif-size', type=int, nargs='+', default=[480, 270])
    parser.add_argument('--kp-size', type=int, default=2, help='keypoint circle size')
    parser.add_argument('--kp-thick', type=int, default=2, help='keypoint circle thickness')
    parser.add_argument('--line-thick', type=int, default=3, help='line thickness')
    parser.add_argument('--alpha', type=float, default=0.4, help='pose alpha')
    parser.add_argument('--kp-obj', action='store_true', help='plot keypoint objects only')
    parser.add_argument('--csv', action='store_true', help='write results to csv file')

    # model options
    parser.add_argument('--data', type=str, default='data/coco-kp.yaml')
    parser.add_argument('--imgsz', type=int, default=1024)
    parser.add_argument('--weights', default='kapao_s_coco.pt')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or cpu')
    parser.add_argument('--half', action='store_true')
    parser.add_argument('--conf-thres', type=float, default=0.5, help='confidence threshold')
    parser.add_argument('--iou-thres', type=float, default=0.45, help='NMS IoU threshold')
    parser.add_argument('--no-kp-dets', action='store_true', help='do not use keypoint objects')
    parser.add_argument('--conf-thres-kp', type=float, default=0.5)
    parser.add_argument('--conf-thres-kp-person', type=float, default=0.2)
    parser.add_argument('--iou-thres-kp', type=float, default=0.45)
    parser.add_argument('--overwrite-tol', type=int, default=50)
    parser.add_argument('--scales', type=float, nargs='+', default=[1])
    parser.add_argument('--flips', type=int, nargs='+', default=[-1])

    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = options()
    main(args)
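
I saved this script as demo_poses.py in the repository root. It accepts the same options as the bundled demos, so it can be run like the sessions below, adding --face to draw facial keypoints, --gif to collect frames into a GIF, or --csv to log each frame's keypoint coordinates.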

However, while experimenting with it, the following error sometimes occurred.
The same error is mentioned in the comment section of another article [5], but no countermeasure is given there.

(kapao) aska@moonlight:~/kapao$ python demo_poses.py 
Using device: cuda:0
/home/aska/anaconda3/envs/kapao/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
CAM : 640x480, 30.0
webcam 0: webcam 1: webcam 2: webcam 3: webcam 4:

                                    :

Traceback (most recent call last):
  File "demo_poses.py", line 189, in <module>
    main(args)
  File "demo_poses.py", line 82, in main
    bboxes, poses, _, _, _ = post_process_batch(data, img, [], [[im0.shape[:2]]], person_dets, kp_dets)
  File "/home/aska/kapao/val.py", line 108, in post_process_batch
    kpd[:, :4] = scale_coords(imgs[si].shape[1:], kpd[:, :4], shape)
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.

So I set out to find the cause.
It turned out that the problem lies in the values passed to scale_coords at line 108 of val.py: the error occurs when kpd has size "[1, 40]", that is, when only a single keypoint object was detected.

webcam 69: torch.Size([3, 768, 1024]) torch.Size([5, 40])
webcam 70: torch.Size([3, 768, 1024]) torch.Size([4, 40])
webcam 71: torch.Size([3, 768, 1024]) torch.Size([1, 40])
Traceback (most recent call last):
  File "demo_poses.py", line 189, in <module>
    main(args)
  File "demo_poses.py", line 82, in main
    bboxes, poses, _, _, _ = post_process_batch(data, img, [], [[im0.shape[:2]]], person_dets, kp_dets)
  File "/home/aska/kapao/val.py", line 112, in post_process_batch
    kpd[:, :4] = scale_coords(imgs[si].shape[1:], kpd[:, :4], shape)
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
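
The message points at an in-place copy whose source and destination share memory. That error class is easy to reproduce in isolation (an illustration only, not the exact KAPAO code path):

# copying a slice into an overlapping slice of the same tensor raises the same RuntimeError
import torch

x = torch.arange(6.)
x[1:4] = x[0:3]  # RuntimeError: ... refer to a single memory location. Please clone() ...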

As a provisional countermeasure, I confirmed that modifying val.py as follows avoids the error; see the note after the patch for an alternative.

Before:

                if data['use_kp_dets'] and nkp:
                    mask = scores > data['conf_thres_kp_person']
                    poses_mask = poses[mask]

                    if len(poses_mask):
                        ### DEBUG
                        print(imgs[si].shape, kpd.shape)
                        
                        kpd[:, :4] = scale_coords(imgs[si].shape[1:], kpd[:, :4], shape)
                        kpd = kpd[:, :6].cpu()

After:

                if data['use_kp_dets'] and nkp:
                    mask = scores > data['conf_thres_kp_person']
                    poses_mask = poses[mask]

                    if len(poses_mask) and kpd.shape[0] > 1:
                        ### DEBUG
                        print(imgs[si].shape, kpd.shape)
                        
                        kpd[:, :4] = scale_coords(imgs[si].shape[1:], kpd[:, :4], shape)
                        kpd = kpd[:, :6].cpu()
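
Note that this workaround simply skips post-processing whenever only one keypoint object was detected. An alternative that keeps single detections, following the hint in the error message itself, would be to clone() the input slice so the in-place write no longer aliases its destination (a sketch I have not verified):

                        kpd[:, :4] = scale_coords(imgs[si].shape[1:], kpd[:, :4].clone(), shape)
                        kpd = kpd[:, :6].cpu()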

I have not yet compared processing speed and the like against other algorithms, but it feels quite good.
I plan to keep experimenting with it.

----
References:
[1] GitHub - CMU-Perceptual-Computing-Lab/openpose
[2] GitHub - HRNet/HRNet-Human-Pose-Estimation
[3] GitHub - wmcnally/kapao
[4] YOLOv5で物体検出して座標・幅・高さをCSV出力する (Python)
[5] Kapaoで、人物検出と姿勢推定を行う