
みらいテックラボ

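Notes on technologies the author finds interesting, such as speech and image recognition and machine learning, along with records of hands-on experiments.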

TensorFlow: First Steps (7)

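This post is about a small TensorFlow error.

While trying kanji recognition [1] with TensorFlow r0.8, I occasionally hit a symptom where the recognition accuracy on the training data suddenly dropped to 0 partway through training, as shown below.

Training log: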

--- Start Learning ---
# Learning data num = 64964
step 0, training accuracy 0
step 100, training accuracy 0
step 200, training accuracy 0
step 300, training accuracy 0
step 400, training accuracy 0
step 500, training accuracy 0
step 600, training accuracy 0
step 700, training accuracy 0
step 800, training accuracy 0
step 900, training accuracy 0
step 1000, training accuracy 0.02
  :
step 18100, training accuracy 0.98
step 18200, training accuracy 1
step 18300, training accuracy 1
step 18400, training accuracy 0.96
step 18500, training accuracy 1
step 18600, training accuracy 1
step 18700, training accuracy 0.98
step 18800, training accuracy 1
step 18900, training accuracy 1
step 19000, training accuracy 0
step 19100, training accuracy 0
step 19200, training accuracy 0
step 19300, training accuracy 0
step 19400, training accuracy 0
step 19500, training accuracy 0

It did not happen on every run, so I could not pin down the cause and was stuck.
Then, at one point, I ran the program in a TensorFlow r0.6 environment and got the following error.

Error output:
aska@ubuntu:~/work/CNN$ python deepKanji.py
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 4
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 4

--- Start Learning ---
# Learning data num = 64964
step 0, training accuracy 0
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_1_grad/Relu_1/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_1)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_2_grad/Relu_2/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_2)]]
Traceback (most recent call last):
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 159, in main
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys, keep_prob: 0.5})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 368, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 444, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
Caused by op u'gradients/Relu_grad/Relu/CheckNumerics', defined at:
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 110, in main
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 186, in minimize
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 232, in compute_gradients
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 445, in gradients
in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 126, in _ReluGrad
t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 119, in _VerifyTensor
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()

...which was originally created as op u'Relu', defined at:
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 78, in main
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 547, in relu
return _op_def_lib.apply_op("Relu", features=features, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()

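Searching for this error message, this time I found someone who had run into the same kind of error.

According to that article [2],
if a value of 0 appears among the per-class probability-like values computed by the inference function, the part of the loss function that computes cross_entropy ends up evaluating 0*log(0), and the result becomes NaN.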
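The arithmetic behind this can be reproduced with plain Python floats. This is only a sketch: `ieee_log` is a hypothetical helper that mimics how a floating-point log behaves in NumPy or TensorFlow (returning -inf at 0, whereas Python's `math.log` raises an exception), and the label and probability values are made up.

```python
import math

def ieee_log(p):
    """Hypothetical helper: log that returns -inf at 0, like float math in NumPy/TensorFlow.

    Python's own math.log raises ValueError for 0 instead.
    """
    return float("-inf") if p == 0.0 else math.log(p)

# Made-up one-hot label and predicted probabilities; the first class is exactly 0.
y_true = [0.0, 1.0, 0.0]
y_pred = [0.0, 0.9, 0.1]

# Element-wise y_ * log(y_conv), then the negated sum, as in the original loss.
terms = [t * ieee_log(p) for t, p in zip(y_true, y_pred)]
cross_entropy = -sum(terms)

print(terms[0])                   # 0.0 * -inf evaluates to nan
print(math.isnan(cross_entropy))  # True: a single NaN term poisons the whole sum
```

Note that the zero probability here belongs to a class whose label is also 0, so the sample would contribute nothing to the loss mathematically; the NaN comes purely from the floating-point evaluation of 0 * -inf.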

So the fix is apparently to change the code as follows.
Before:
 cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
After:
 cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))
 Note: the argument of log is clipped to the range 1e-10 to 1.0.
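The effect of the clipping can be checked without TensorFlow. The following is a minimal sketch in plain Python for a single sample; `clip` and `clipped_cross_entropy` are hypothetical helpers written for this illustration (only the 1e-10 to 1.0 bounds come from the fix), and the input values are made up.

```python
import math

EPS = 1e-10  # lower bound used in the fix

def clip(value, lo, hi):
    """Clamp a scalar to [lo, hi], in the spirit of tf.clip_by_value."""
    return max(lo, min(value, hi))

def clipped_cross_entropy(y_true, y_pred):
    """-sum(y * log(clip(p))) for one sample; the log argument never reaches 0."""
    return -sum(t * math.log(clip(p, EPS, 1.0)) for t, p in zip(y_true, y_pred))

# Made-up predictions in which the first class has probability exactly 0.
loss_wrong_class_zero = clipped_cross_entropy([0.0, 1.0, 0.0], [0.0, 0.9, 0.1])
loss_true_class_zero = clipped_cross_entropy([1.0, 0.0, 0.0], [0.0, 0.9, 0.1])

print(loss_wrong_class_zero)  # finite, roughly 0.105
print(loss_true_class_zero)   # finite but large, roughly 23.03
```

With the lower bound in place, a zero probability on a non-target class contributes nothing, and a zero probability on the target class gives a large but finite loss (-log(1e-10)) instead of NaN.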

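After this change, the error no longer occurred on r0.6, and the symptom shown in the training log no longer appeared on r0.8 either.
Having r0.8 raise no error and merely pretend to keep training is a problem, so in this case I would rather it behaved like r0.6 and reported an error.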
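One way to get fail-fast behaviour regardless of the TensorFlow version is to check the loss value fetched at each logging step and stop as soon as it is no longer finite. The following is a minimal sketch in plain Python; `ensure_finite` and the sample loss values are hypothetical and are not taken from deepKanji.py.

```python
import math

def ensure_finite(value, step):
    """Raise immediately if a fetched scalar (e.g. the loss) is NaN or infinite."""
    if not math.isfinite(value):
        raise FloatingPointError("non-finite loss %r at step %d" % (value, step))
    return value

# Made-up loss values standing in for what would be fetched every 100 steps.
fake_losses = [2.31, 0.42, float("nan")]

failed_step = None
for i, loss in enumerate(fake_losses):
    step = i * 100
    try:
        ensure_finite(loss, step)
    except FloatingPointError as err:
        failed_step = step
        print(err)  # non-finite loss nan at step 200
        break
```

In the actual script, the value passed to such a check would presumably be obtained by fetching cross_entropy together with train_step in the same sess.run call.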

----
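[1] TensorFlowで文字認識にチャレンジ(3) (Trying character recognition with TensorFlow (3))
[2] TensorFlow AdamOptimizerが収束しないエラー? ReluGrad input is not finite. : Tensor had NaN values (TensorFlow: an error where AdamOptimizer does not converge? ReluGrad input is not finite. : Tensor had NaN values)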



