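I recently had a chance to try out YOLOv5, YOLOX, and Detectron2, so I am starting with some notes on YOLOv5.
Earlier posts covered inference and training with YOLOv5; this one is about model conversion, namely exporting to the ONNX and TFLite formats.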
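Related posts:
- Trying out recent object detection (YOLOv5 part 1)
- Trying out recent object detection (YOLOv5 part 2)
- Trying out recent object detection (YOLOv5 part 3)
- Trying out recent object detection (YOLOX part 1)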
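When running object detection on an edge computer such as a Raspberry Pi or Jetson Nano, a high-performance model trained on a PC or in the cloud is hard to use as-is because of memory and processing-speed constraints.
To run object detection on an edge computer, you need to think about reducing memory use and speeding up processing, for example:
- Use a compact model with few parameters.
- Reduce the size of the input image fed to the model.
- Convert the weights to fixed-point or quantize them.
- Restrict the detection targets to what the application needs (e.g. people, cars, and motorbikes only).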
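As a rough illustration of what quantizing weights means, here is a minimal sketch of symmetric int8 quantization in plain Python. This is not the algorithm TFLite uses; the function names and the weight values are made up for the example.

```python
# Symmetric int8 quantization sketch (illustrative only, not TFLite's scheme).

def quantize_int8(weights):
    """Map float weights to integer codes in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.823, -0.417, 0.052, -1.27, 0.0]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale, "max error:", max_error)
```

Each weight is stored in 1 byte instead of 4 (float32), at the cost of a rounding error no larger than half the scale.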
This time I used yolov5's export.py to convert the model to ONNX and TFLite, and also reduced the model input size, to see how well it runs on a reTerminal (Raspberry Pi CM4).
1. Model conversion [1]
yolov5 ships with a model conversion tool, export.py, which makes it easy to convert to a variety of model formats.
Format | `export.py --include` | Model
--- | --- | ---
PyTorch | - | yolov5s.pt
TorchScript | `torchscript` | yolov5s.torchscript
ONNX | `onnx` | yolov5s.onnx
OpenVINO | `openvino` | yolov5s_openvino_model/
TensorRT | `engine` | yolov5s.engine
CoreML | `coreml` | yolov5s.mlmodel
TensorFlow SavedModel | `saved_model` | yolov5s_saved_model/
TensorFlow GraphDef | `pb` | yolov5s.pb
TensorFlow Lite | `tflite` | yolov5s.tflite
TensorFlow Edge TPU | `edgetpu` | yolov5s_edgetpu.tflite
TensorFlow.js | `tfjs` | yolov5s_web_model/
PaddlePaddle | `paddle` | yolov5s_paddle_model/

Requirements:
    $ pip install -r requirements.txt coremltools onnx onnx-simplifier onnxruntime openvino-dev tensorflow-cpu  # CPU
    $ pip install -r requirements.txt coremltools onnx onnx-simplifier onnxruntime-gpu openvino-dev tensorflow  # GPU

Usage:
    $ python export.py --weights yolov5s.pt --include torchscript onnx openvino engine coreml tflite ...

Inference:
    $ python detect.py --weights yolov5s.pt                 # PyTorch
                                 yolov5s.torchscript        # TorchScript
                                 yolov5s.onnx               # ONNX Runtime or OpenCV DNN with --dnn
                                 yolov5s.xml                # OpenVINO
                                 yolov5s.engine             # TensorRT
                                 yolov5s.mlmodel            # CoreML (macOS-only)
                                 yolov5s_saved_model        # TensorFlow SavedModel
                                 yolov5s.pb                 # TensorFlow GraphDef
                                 yolov5s.tflite             # TensorFlow Lite
                                 yolov5s_edgetpu.tflite     # TensorFlow Edge TPU
                                 yolov5s_paddle_model       # PaddlePaddle
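I converted the model to ONNX and TFLite on a PC, then copied the converted models to the reTerminal and ran inference there.
The source model was yolov5n.pt, the most compact of the pretrained models.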
Environment on the reTerminal:
- torch-1.12.1 / torchvision-0.13.1
- tensorflow-2.10.0
- onnx-1.12.0 / onnxruntime-1.12.1
- yolov5 (cloned from GitHub [1] and installed)
1.1 Model conversion (PC)
(1) ONNX format
    (yolov5) aska@moonlight:~/work/yolov5$ python export.py --weights yolov5n.pt --img 640 --include onnx
    export: data=data/coco128.yaml, weights=['yolov5n.pt'], imgsz=[640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=12, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
    YOLOv5 v6.2-180-g82bec4c Python-3.8.12 torch-1.10.2+cu113 CPU
    Fusing layers...
    YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients, 4.5 GFLOPs
    PyTorch: starting from yolov5n.pt with output shape (1, 25200, 85) (3.9 MB)
    ONNX: starting export with onnx 1.11.0...
    ONNX: export success ✅ 1.3s, saved as yolov5n.onnx (7.5 MB)
    Export complete (1.5s)
    Results saved to /home/aska/work/yolov5
    Detect:          python detect.py --weights yolov5n.onnx
    Validate:        python val.py --weights yolov5n.onnx
    PyTorch Hub:     model = torch.hub.load('ultralytics/yolov5', 'custom', 'yolov5n.onnx')
    Visualize:       https://netron.app
(2) TFLite format
    (yolov5) aska@moonlight:~/work/yolov5$ python export.py --weights yolov5n.pt --img 640 --include tflite
    export: data=data/coco128.yaml, weights=['yolov5n.pt'], imgsz=[640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=12, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['tflite']
    YOLOv5 v6.2-180-g82bec4c Python-3.8.12 torch-1.10.2+cu113 CPU
    Fusing layers...
    YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients, 4.5 GFLOPs
    PyTorch: starting from yolov5n.pt with output shape (1, 25200, 85) (3.9 MB)
    2022-10-01 16:20:56.179939: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    TensorFlow SavedModel: starting export with tensorflow 2.10.0...
                     from  n    params  module                                  arguments
    2022-10-01 16:20:57.167212: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
      0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]
      1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]
      2                -1  1      4800  models.common.C3                        [32, 32, 1]
      3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
      4                -1  1     29184  models.common.C3                        [64, 64, 2]
      5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
      6                -1  1    156928  models.common.C3                        [128, 128, 3]
      7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
      8                -1  1    296448  models.common.C3                        [256, 256, 1]
      9                -1  1    164608  models.common.SPPF                      [256, 256, 5]
     10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
     11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     12           [-1, 6]  1         0  models.common.Concat                    [1]
     13                -1  1     90880  models.common.C3                        [256, 128, 1, False]
     14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]
     15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     16           [-1, 4]  1         0  models.common.Concat                    [1]
     17                -1  1     22912  models.common.C3                        [128, 64, 1, False]
     18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]
     19          [-1, 14]  1         0  models.common.Concat                    [1]
     20                -1  1     74496  models.common.C3                        [128, 128, 1, False]
     21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
     22          [-1, 10]  1         0  models.common.Concat                    [1]
     23                -1  1    296448  models.common.C3                        [256, 256, 1, False]
     24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256], [640, 640]]
    Model: "model"
    __________________________________________________________________________________________________
     Layer (type)                   Output Shape         Param #     Connected to
    ==================================================================================================
     input_1 (InputLayer)           [(1, 640, 640, 3)]   0           []
     tf_conv (TFConv)               (1, 320, 320, 16)    1744        ['input_1[0][0]']
     tf_conv_1 (TFConv)             (1, 160, 160, 32)    4640        ['tf_conv[0][0]']
     tfc3 (TFC3)                    (1, 160, 160, 32)    4704        ['tf_conv_1[0][0]']
     tf_conv_7 (TFConv)             (1, 80, 80, 64)      18496       ['tfc3[0][0]']
     tfc3_1 (TFC3)                  (1, 80, 80, 64)      28928       ['tf_conv_7[0][0]']
     tf_conv_15 (TFConv)            (1, 40, 40, 128)     73856       ['tfc3_1[0][0]']
     tfc3_2 (TFC3)                  (1, 40, 40, 128)     156288      ['tf_conv_15[0][0]']
     tf_conv_25 (TFConv)            (1, 20, 20, 256)     295168      ['tfc3_2[0][0]']
     tfc3_3 (TFC3)                  (1, 20, 20, 256)     295680      ['tf_conv_25[0][0]']
     tfsppf (TFSPPF)                (1, 20, 20, 256)     164224      ['tfc3_3[0][0]']
     tf_conv_33 (TFConv)            (1, 20, 20, 128)     32896       ['tfsppf[0][0]']
     tf_upsample (TFUpsample)       (1, 40, 40, 128)     0           ['tf_conv_33[0][0]']
     tf_concat (TFConcat)           (1, 40, 40, 256)     0           ['tf_upsample[0][0]',
                                                                      'tfc3_2[0][0]']
     tfc3_4 (TFC3)                  (1, 40, 40, 128)     90496       ['tf_concat[0][0]']
     tf_conv_39 (TFConv)            (1, 40, 40, 64)      8256        ['tfc3_4[0][0]']
     tf_upsample_1 (TFUpsample)     (1, 80, 80, 64)      0           ['tf_conv_39[0][0]']
     tf_concat_1 (TFConcat)         (1, 80, 80, 128)     0           ['tf_upsample_1[0][0]',
                                                                      'tfc3_1[0][0]']
     tfc3_5 (TFC3)                  (1, 80, 80, 64)      22720       ['tf_concat_1[0][0]']
     tf_conv_45 (TFConv)            (1, 40, 40, 64)      36928       ['tfc3_5[0][0]']
     tf_concat_2 (TFConcat)         (1, 40, 40, 128)     0           ['tf_conv_45[0][0]',
                                                                      'tf_conv_39[0][0]']
     tfc3_6 (TFC3)                  (1, 40, 40, 128)     74112       ['tf_concat_2[0][0]']
     tf_conv_51 (TFConv)            (1, 20, 20, 128)     147584      ['tfc3_6[0][0]']
     tf_concat_3 (TFConcat)         (1, 20, 20, 256)     0           ['tf_conv_51[0][0]',
                                                                      'tf_conv_33[0][0]']
     tfc3_7 (TFC3)                  (1, 20, 20, 256)     295680      ['tf_concat_3[0][0]']
     tf_detect (TFDetect)           ((1, 25200, 85),     115005      ['tfc3_5[0][0]',
                                    )                                 'tfc3_6[0][0]',
                                                                      'tfc3_7[0][0]']
    ==================================================================================================
    Total params: 1,867,405
    Trainable params: 0
    Non-trainable params: 1,867,405
    __________________________________________________________________________________________________
    2022-10-01 16:20:59.099566: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
    2022-10-01 16:20:59.099656: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
    Assets written to: yolov5n_saved_model/assets
    TensorFlow SavedModel: export success ✅ 4.1s, saved as yolov5n_saved_model (7.4 MB)
    TensorFlow Lite: starting export with tensorflow 2.10.0...
    Found untraced functions such as tf_conv_2_layer_call_fn, tf_conv_2_layer_call_and_return_conditional_losses, tf_conv_3_layer_call_fn, tf_conv_3_layer_call_and_return_conditional_losses, tf_conv_4_layer_call_fn while saving (showing 5 of 268). These functions will not be directly callable after loading.
    Assets written to: /tmp/tmp3a887r7n/assets
    2022-10-01 16:21:25.634156: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:362] Ignored output_format.
    2022-10-01 16:21:25.634201: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:365] Ignored drop_control_dependency.
    2022-10-01 16:21:25.634801: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /tmp/tmp3a887r7n
    2022-10-01 16:21:25.665209: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
    2022-10-01 16:21:25.665259: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /tmp/tmp3a887r7n
    2022-10-01 16:21:25.783391: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
    2022-10-01 16:21:25.809995: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
    2022-10-01 16:21:26.039735: I tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /tmp/tmp3a887r7n
    2022-10-01 16:21:26.161768: I tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 526967 microseconds.
    2022-10-01 16:21:26.560965: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
    2022-10-01 16:21:26.892092: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:1989] Estimated count of arithmetic ops: 5.393 G ops, equivalently 2.697 G MACs
    Estimated count of arithmetic ops: 5.393 G ops, equivalently 2.697 G MACs
    TensorFlow Lite: export success ✅ 26.9s, saved as yolov5n-fp16.tflite (3.7 MB)
    Export complete (31.3s)
    Results saved to /home/aska/work/yolov5
    Detect:          python detect.py --weights yolov5n-fp16.tflite
    Validate:        python val.py --weights yolov5n-fp16.tflite
    PyTorch Hub:     model = torch.hub.load('ultralytics/yolov5', 'custom', 'yolov5n-fp16.tflite')
    Visualize:       https://netron.app
To convert to an int8 TFLite model, add the "--int8" option.
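The file sizes in the export logs can be sanity-checked from the parameter count. The sketch below is a back-of-the-envelope estimate that counts weights only, assuming 4, 2, and 1 bytes per parameter for fp32, fp16, and int8; it ignores graph metadata and anything else stored in the file, and the int8 figure is an estimate rather than a measured size.

```python
# Rough weight-only size estimate from the parameter count in the export log.
PARAMS = 1_867_405  # "YOLOv5n summary: ... 1867405 parameters"

# Assumed storage cost per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimated_size_mb(num_params, dtype):
    """Return the approximate size of the weights alone, in megabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimated_size_mb(PARAMS, dtype):.1f} MB")
```

The fp32 and fp16 estimates (about 7.5 MB and 3.7 MB) are in the same ballpark as the sizes reported for yolov5n.onnx and yolov5n-fp16.tflite in the logs above.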
1.2 Verification (reTerminal)
I ran five models: yolov5n.onnx, yolov5n-fp16.tflite, yolov5n-int8.tflite, yolov5n.pt, and yolov5s.pt.
    aska@mars:~/work/yolov5 $ python detect.py --weights yolov5n.onnx --img 640 --source data/images/bus.jpg
    /home/aska/.local/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
      warn(f"Failed to load image Python extension: {e}")
    detect: weights=['yolov5n.onnx'], source=data/images/bus.jpg, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
    YOLOv5 v6.2-180-g82bec4c Python-3.9.2 torch-1.12.1 CPU
    Loading yolov5n.onnx for ONNX Runtime inference...
    image 1/1 /home/aska/work/yolov5/data/images/bus.jpg: 640x640 3 persons, 1 bus, 532.9ms
    Speed: 10.2ms pre-process, 532.9ms inference, 6.5ms NMS per image at shape (1, 3, 640, 640)
    Results saved to runs/detect/exp
To run inference with another model, pass it via "--weights".
For reference, the mean processing time over five runs for each model was as follows.
Model | Mean processing time (ms) |
--- | --- |
yolov5n.onnx | 550.4 |
yolov5n-fp16.tflite | 881.2 |
yolov5n-int8.tflite | 638.9 |
yolov5n.pt | 540.6 |
yolov5s.pt | 954.5 |
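I expected the int8 TFLite model to be the fastest, but surprisingly the PyTorch model yolov5n.pt had the shortest processing time.
yolov5n.pt usually takes about 460 ms per image, but one run took about 850 ms, which pulled the mean up.
That said, the large spread in processing time is a concern.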
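To see how much a single slow run distorts the mean, here is a small sketch comparing the mean with the median. The values are the approximate figures quoted above (four runs near 460 ms, one near 850 ms), not the exact measurements.

```python
import statistics

# Approximate yolov5n.pt timings in ms, taken from the description in the text.
# These are rounded stand-ins, not the raw measured values.
timings_ms = [460, 460, 460, 460, 850]

mean_ms = statistics.mean(timings_ms)      # sensitive to the one slow run
median_ms = statistics.median(timings_ms)  # robust to a single outlier
stdev_ms = statistics.stdev(timings_ms)    # sample standard deviation

print(f"mean={mean_ms} ms, median={median_ms} ms, stdev={stdev_ms:.1f} ms")
```

With only five runs, reporting the median (or dropping a warm-up run) alongside the mean would give a steadier picture of typical latency.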
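2. Changing the model input size
As it stands, detecting objects in a single image takes more than 500 ms, which is still quite slow.
So I tried shrinking the input image fed to the model to cut the amount of computation.
The default model input size is 640x640, so I measured the speed at 320x320, which has half the side length and therefore one quarter as many pixels.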
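The effect of halving the side length can be worked out with a little arithmetic. The sketch below assumes three detection scales with strides 8, 16, and 32 and three anchors per scale, which is what the 80/40/20 feature maps and the anchor lists in the export log suggest; the helper names are made up for this example.

```python
# Arithmetic sketch: how input size affects pixels and candidate boxes.
STRIDES = (8, 16, 32)   # inferred from the 80x80, 40x40, 20x20 maps at 640 input
ANCHORS_PER_SCALE = 3   # three (width, height) anchor pairs per scale in Detect


def num_predictions(img_size):
    """Number of candidate boxes the detection head outputs for a square input."""
    return sum((img_size // s) ** 2 for s in STRIDES) * ANCHORS_PER_SCALE


def pixel_ratio(small, large):
    """Ratio of pixel counts between two square input sizes."""
    return (small * small) / (large * large)


print(num_predictions(640))   # matches the (1, 25200, 85) output shape in the log
print(num_predictions(320))
print(pixel_ratio(320, 640))
```

At 640 this reproduces the 25200 candidates shown in the export log; at 320 both the pixel count and the number of candidates drop to a quarter.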
First, to create models with a different input size, add the "--img 320" option when converting to ONNX or TFLite.
The PyTorch model does not appear to need any conversion.
At inference time, run with the "--img 320" option as well.
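The export and inference commands for a given input size can be assembled as below. This is only a convenience sketch: `export_cmd` and `detect_cmd` are hypothetical helper names, and they use only the options that appear in this post (--weights, --img, --include, --int8, --source).

```python
import shlex


def export_cmd(weights, img, fmt, int8=False):
    """Build an export.py command line for one target format."""
    cmd = ["python", "export.py", "--weights", weights,
           "--img", str(img), "--include", fmt]
    if int8:
        cmd.append("--int8")  # int8 TFLite export, as described above
    return cmd


def detect_cmd(weights, img, source):
    """Build a detect.py command line for one model and one image source."""
    return ["python", "detect.py", "--weights", weights,
            "--img", str(img), "--source", source]


# Print the commands instead of running them, so they can be reviewed first.
print(shlex.join(export_cmd("yolov5n.pt", 320, "onnx")))
print(shlex.join(export_cmd("yolov5n.pt", 320, "tflite", int8=True)))
print(shlex.join(detect_cmd("yolov5n.onnx", 320, "data/images/bus.jpg")))
```

The returned lists can be passed to `subprocess.run` from inside the yolov5 directory if you want to script the whole sweep.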
The mean processing time over five runs for each model was as follows.
Model | Mean processing time (ms) |
--- | --- |
yolov5n.onnx | 149.9 |
yolov5n-fp16.tflite | 220.3 |
yolov5n-int8.tflite | 155.2 |
yolov5n.pt | 158.5 |
yolov5s.pt | 299.5 |
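With a 320x320 model input, the ONNX model had the shortest processing time.
At this speed it can handle roughly 5-6 frames per second, so it should be usable when the target objects are not moving very fast.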
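The two timing tables can be compared directly. The sketch below computes the speedup from 640 to 320 and the frame rate implied by the 320 timings, treating each figure as per-image processing time in milliseconds; pre-processing and NMS overhead are not modeled separately, so real throughput may be somewhat lower.

```python
# Mean processing times (ms) copied from the two tables in this post.
timings_640 = {
    "yolov5n.onnx": 550.4,
    "yolov5n-fp16.tflite": 881.2,
    "yolov5n-int8.tflite": 638.9,
    "yolov5n.pt": 540.6,
    "yolov5s.pt": 954.5,
}
timings_320 = {
    "yolov5n.onnx": 149.9,
    "yolov5n-fp16.tflite": 220.3,
    "yolov5n-int8.tflite": 155.2,
    "yolov5n.pt": 158.5,
    "yolov5s.pt": 299.5,
}

# Speedup from the smaller input, and frames per second at 320x320.
speedup = {name: timings_640[name] / timings_320[name] for name in timings_640}
fps_320 = {name: 1000 / ms for name, ms in timings_320.items()}

for name in timings_640:
    print(f"{name}: {speedup[name]:.1f}x faster, {fps_320[name]:.1f} fps")
```

The speedups come out between roughly 3x and 4x, which is in line with the input having a quarter as many pixels.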
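Note that these results are for the Raspberry Pi CM4 (reTerminal); a Jetson Nano has a GPU and can also use NVIDIA's TensorRT, so the results would likely differ.
Also, if the detection targets are limited, building a custom model based on yolov5n.pt should further improve both processing speed and detection accuracy.
Time to think up some applications.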
---
References:
[1] GitHub - ultralytics/yolov5