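みらいテックラボ (Mirai Tech Lab)

Notes by the site owner, a weekend programmer, introducing technologies of interest such as speech and image recognition and machine learning, and recording what happened when actually trying them out.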

Building an Ubuntu Environment with WSL2 (Continued)

This summer (2022), I set up a Windows PC with the following specs.

CPU : AMD Ryzen 7 5800X
RAM : 16GB
OS : Windows 11 Pro
SSD + HDD : 500GB + 6TB
etc : liquid CPU cooler

In the previous post [1], I described building an Ubuntu 20.04 environment on WSL2 and installing CUDA and cuDNN.
mirai-tec.hatenablog.com

[Ubuntu environment]
OS : Ubuntu 20.04 on Windows
GPU : GTX 1060-6GB
NVIDIA Driver : 516.94
CUDA : 11.7
cuDNN : 8.5


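After that, I created virtual environments for PyTorch (v1.12) and TensorFlow (v2.10) with miniconda3 and tried them out. PyTorch recognized the GPU, but TensorFlow did not and was running on the CPU only.
In the end the cause was a cuDNN version mismatch, but I summarize here what I looked into along the way.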


1. PyTorch[2]
To see how PyTorch recognizes the GPU, I checked the following items.

  • Whether the GPU is usable from PyTorch : torch.cuda.is_available()
  • Number of GPU devices : torch.cuda.device_count()
  • Default GPU index : torch.cuda.current_device()
  • GPU name : torch.cuda.get_device_name()
  • CUDA Compute Capability : torch.cuda.get_device_capability()
$ python
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1+cu116'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name()
'NVIDIA GeForce GTX 1060 6GB'
>>> torch.cuda.get_device_capability()
(6, 1)
>>>

PyTorch appears to obtain the GPU/CUDA information correctly.
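
The same checks can be bundled into a small helper so they can be rerun after every environment change. This is only a minimal sketch: the function name collect_torch_gpu_info and the dictionary keys are my own choices for illustration, not part of PyTorch. The torch.cuda calls are the ones listed above.

```python
def collect_torch_gpu_info():
    """Return the GPU-related values checked above as a dict.

    The function name and the dict keys are illustrative only.
    Returns {"torch_installed": False} when PyTorch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    # The per-device queries need a usable CUDA device, so guard them.
    if info["cuda_available"]:
        index = torch.cuda.current_device()
        info["current_device"] = index
        info["device_name"] = torch.cuda.get_device_name(index)
        info["compute_capability"] = torch.cuda.get_device_capability(index)
    return info


if __name__ == "__main__":
    for key, value in collect_torch_gpu_info().items():
        print(f"{key}: {value}")
```

Returning a dict rather than printing directly makes it easy to compare the result before and after a driver or library change.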


2. TensorFlow
Next, let's check TensorFlow, where the problem occurs.

  • List of device information : tensorflow.python.client.device_lib.list_local_devices()

$ python
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2022-09-24 17:43:19.321941: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-24 17:43:19.662008: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-24 17:43:20.409488: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-09-24 17:43:20.409554: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-09-24 17:43:20.409571: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
>>> tf.__version__
'2.10.0'
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2022-09-24 17:44:10.216362: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-24 17:44:10.390243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 17:44:10.450432: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-09-24 17:44:10.450470: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3809127935636748211
xla_global_id: -1
]
>>>

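Indeed, only the CPU is recognized. The log shows that libcudnn.so.8 could not be loaded, after which registration of GPU devices was skipped.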
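
As a shorter alternative to device_lib, the public tf.config API can be used for the same check. The following is a sketch under my own naming (collect_tf_gpu_info and its dict keys are hypothetical). tf.config.list_physical_devices and tf.test.is_built_with_cuda are existing TensorFlow functions.

```python
def collect_tf_gpu_info():
    """Summarize whether TensorFlow sees a GPU.

    The function name and the dict keys are illustrative only.
    Returns {"tf_installed": False} when TensorFlow cannot be imported.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tf_installed": False}

    # An empty list here corresponds to the CPU-only situation shown above.
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "tf_installed": True,
        "tf_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpu_names": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    for key, value in collect_tf_gpu_info().items():
        print(f"{key}: {value}")
```

If built_with_cuda is True but gpu_names is empty, the TensorFlow build supports CUDA and the problem is more likely in the installed libraries, which matches the situation in this post.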
In fact, the following error already appears at the moment tensorflow is imported:

E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

This message, however, seems to say only that a cuBLAS factory was already registered when TensorFlow tried to register one.

Searching online, I found little discussion of this error beyond advice to downgrade to TensorFlow 2.9.2 [3].
For the time being, I downgraded TensorFlow to 2.9.2.

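While investigating, I also noticed that the CUDA/cuDNN versions expected by TensorFlow 2.10 (CUDA 11.2 / cuDNN 8.1) did not match the versions I had installed (CUDA 11.7 / cuDNN 8.5).
(I used to pay close attention to these versions, but this time I had completely forgotten.)
For the correspondence between each TensorFlow version and its CUDA/cuDNN versions, the tested build configurations in the TensorFlow installation documentation are a useful reference.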
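
The comparison done by eye above can also be written down as a small check. This is a sketch, not an official tool: TESTED_CONFIGS, major_minor and find_mismatches are hypothetical names, and the table contains only the TensorFlow 2.10 values quoted in this post. Only major.minor is compared, which is an assumption about how strict the match needs to be.

```python
# Versions quoted in this post for TensorFlow 2.10.
# Extend the table from the official tested build configurations as needed.
TESTED_CONFIGS = {
    "2.10": {"cuda": "11.2", "cudnn": "8.1"},
}


def major_minor(version):
    """Reduce a version string such as '8.1.1' to 'major.minor' ('8.1')."""
    return ".".join(version.split(".")[:2])


def find_mismatches(tf_version, installed):
    """Return {component: (expected, installed)} for every mismatch.

    `installed` maps "cuda" and "cudnn" to installed version strings.
    """
    expected = TESTED_CONFIGS[major_minor(tf_version)]
    mismatches = {}
    for component, want in expected.items():
        have = major_minor(installed[component])
        if have != want:
            mismatches[component] = (want, have)
    return mismatches


if __name__ == "__main__":
    # The environment described in this post.
    print(find_mismatches("2.10.0", {"cuda": "11.7", "cudnn": "8.5"}))
    # prints {'cuda': ('11.2', '11.7'), 'cudnn': ('8.1', '8.5')}
```

An installed TensorFlow also reports information about its own build through tf.sysconfig.get_build_info(), which can serve as a cross-check against a hand-maintained table like this one.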

So I decided to first downgrade cuDNN from 8.5 to 8.1.
If that alone had not been enough, the next step would have been to move CUDA to 11.2 as well.

From "Download cuDNN v8.1.1 (Feburary 26th, 2021), for CUDA 11.0,11.1 and 11.2" in NVIDIA's cuDNN Archive [4], I downloaded "cuDNN Runtime Library for Ubuntu20.04 x86_64 (Deb)" and installed it.
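
After installing, it can be useful to confirm which cuDNN the dynamic loader actually resolves, since the earlier log complained that libcudnn.so.8 could not be opened. The sketch below loads the library with ctypes and calls cudnnGetVersion(), a function in the cuDNN C API. The helper names are my own, and the decoding assumes the cuDNN 8.x encoding (major*1000 + minor*100 + patch).

```python
import ctypes


def decode_cudnn_version(raw):
    """Split a cuDNN 8.x version integer (e.g. 8101) into (major, minor, patch).

    Assumes the cuDNN 8.x encoding: major*1000 + minor*100 + patch.
    """
    major, rest = divmod(raw, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(soname="libcudnn.so.8"):
    """Return the version of the cuDNN library the loader finds, or None.

    None means the shared library could not be opened, which is the same
    condition TensorFlow reported in the log above.
    """
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


if __name__ == "__main__":
    print(loaded_cudnn_version())
```

On a machine where the library is missing from the loader path this prints None, so the script separates "not found" from "found but wrong version".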

$ python
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'2.9.2'
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2022-09-24 18:07:14.128639: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-24 18:07:14.251695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.272856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.273250: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.792393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.793182: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.793213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2022-09-24 18:07:14.793569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:07:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-09-24 18:07:14.793636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 4598 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:07:00.0, compute capability: 6.1
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17517294614262033770
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4821352448
locality {
  bus_id: 1
  links {
  }
}
incarnation: 2792029488055711110
physical_device_desc: "device: 0, name: NVIDIA GeForce GTX 1060 6GB, pci bus id: 0000:07:00.0, compute capability: 6.1"
xla_global_id: 416903419
]
>>>

The log still says "Your kernel may have been built without NUMA support.", but at least the GPU is now recognized.

Finally, I trained a model on MNIST with TensorFlow in Jupyter Notebook and checked the GPU state during training.

$ nvidia-smi
Sat Sep 24 20:41:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 516.94       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:07:00.0  On |                  N/A |
| 44%   43C    P2    57W / 120W |   5819MiB /  6144MiB |     58%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        80      G   /Xwayland                       N/A      |
|    0   N/A  N/A      1463      G   /chrome                         N/A      |
|    0   N/A  N/A      1512      G   /chrome                         N/A      |
|    0   N/A  N/A      1756      C   /python3.9                      N/A      |
+-----------------------------------------------------------------------------+

Training appears to be running on the GPU without problems.

When installing TensorFlow, pay attention to the CUDA/cuDNN versions!!

----
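[1] WSL2を用いたUbuntu環境を構築 (Building an Ubuntu Environment with WSL2) - みらいテックラボ
[2] PyTorchでGPU情報を確認(使用可能か、デバイス数など) (Checking GPU information in PyTorch: availability, device count, etc.) | note.nkmk.me
[3] TensorFlow 2.10 causes trouble! #47 · google-research/multinerf
[4] cuDNN Archive | NVIDIA Developer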