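This post is about a small TensorFlow error.
While trying kanji recognition [1] with TensorFlow r0.8, I occasionally ran into a symptom where the training accuracy suddenly dropped to 0 partway through training, as shown below.
Training log: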
--- Start Learning ---
# Learning data num = 64964
step 0, training accuracy 0
step 100, training accuracy 0
step 200, training accuracy 0
step 300, training accuracy 0
step 400, training accuracy 0
step 500, training accuracy 0
step 600, training accuracy 0
step 700, training accuracy 0
step 800, training accuracy 0
step 900, training accuracy 0
step 1000, training accuracy 0.02
:
step 18100, training accuracy 0.98
step 18200, training accuracy 1
step 18300, training accuracy 1
step 18400, training accuracy 0.96
step 18500, training accuracy 1
step 18600, training accuracy 1
step 18700, training accuracy 0.98
step 18800, training accuracy 1
step 18900, training accuracy 1
step 19000, training accuracy 0
step 19100, training accuracy 0
step 19200, training accuracy 0
step 19300, training accuracy 0
step 19400, training accuracy 0
step 19500, training accuracy 0
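It did not happen on every run, so I could not pin down the cause and was stuck.
Then, when I happened to run the program in a TensorFlow r0.6 environment, the following error occurred.
Error output: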
aska@ubuntu:~/work/CNN$ python deepKanji.py
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 4
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 4
--- Start Learning ---
# Learning data num = 64964
step 0, training accuracy 0
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_1_grad/Relu_1/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_1)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0x2614bc0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_2_grad/Relu_2/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_2)]]
Traceback (most recent call last):
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 159, in main
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys, keep_prob: 0.5})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 368, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 444, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
Caused by op u'gradients/Relu_grad/Relu/CheckNumerics', defined at:
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 110, in main
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 186, in minimize
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 232, in compute_gradients
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 445, in gradients
in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 126, in _ReluGrad
t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 119, in _VerifyTensor
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()
...which was originally created as op u'Relu', defined at:
File "deepKanji.py", line 183, in <module>
main(mode, model)
File "deepKanji.py", line 78, in main
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 547, in relu
return _op_def_lib.apply_op("Relu", features=features, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()
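When I searched for this error, this time I found someone who had run into the same kind of problem.
According to this article [2], when a 0 appears among the per-class probability-like values computed by the inference function, the part of the loss function that computes cross_entropy ends up evaluating 0*log(0), and the result becomes NaN.
The article says the code can be fixed as follows.
Before: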
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
After:
cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))
Note: the value passed to log is clipped to the range 1e-10 to 1.0.
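The effect of the clipping can be reproduced without TensorFlow. The following is a minimal sketch in plain Python, not the code from deepKanji.py: the helper names `safe_log` and `cross_entropy` and the sample values are made up for illustration, and `safe_log` is assumed to mimic floating-point behavior, where log(0) is -inf rather than an exception.

```python
import math

def safe_log(p):
    # Assumption: mimic IEEE-754 float behavior, where log(0) evaluates
    # to -inf instead of raising an exception as math.log(0) does.
    return math.log(p) if p > 0.0 else float("-inf")

def cross_entropy(labels, probs, clip_min=None):
    # Hypothetical scalar version of -sum(y_ * log(y_conv)).
    total = 0.0
    for y, p in zip(labels, probs):
        if clip_min is not None:
            # Same idea as clip_by_value(y_conv, clip_min, 1.0).
            p = min(max(p, clip_min), 1.0)
        total += y * safe_log(p)
    return -total

# A saturated output: the predicted probability is exactly 0
# for the classes whose label is also 0.
labels = [0.0, 1.0, 0.0]
probs = [0.0, 1.0, 0.0]

unclipped = cross_entropy(labels, probs)               # 0 * -inf gives NaN
clipped = cross_entropy(labels, probs, clip_min=1e-10)  # stays finite

print(math.isnan(unclipped), clipped)
```

Once a single NaN enters the sum, the whole loss is NaN, and every gradient computed from it is NaN as well, which matches the ReluGrad messages in the error output.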
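After this fix, not only did the error stop occurring on r0.6, but the symptom shown in the training log also stopped occurring on r0.8.
Having r0.8 pretend to keep training without raising any error is a problem, so in this case I prefer r0.6's behavior of failing with an error.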
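One way to get the fail-fast behavior regardless of the TensorFlow version is to check the loss value yourself in the training loop. The following is only a sketch under assumptions: `check_finite` is a hypothetical helper, it assumes the loss is fetched as a plain Python float at each step, and it uses `math.isfinite`, which requires Python 3.2 or later (the original script ran on Python 2.7).

```python
import math

def check_finite(step, loss):
    # Hypothetical guard: stop training as soon as the loss becomes
    # NaN or infinite instead of silently continuing.
    if not math.isfinite(loss):
        raise FloatingPointError(
            "loss is not finite at step %d: %r" % (step, loss))
    return loss

# Example with made-up loss values; the last one simulates the failure.
caught = None
try:
    for step, loss in enumerate([2.3, 0.7, float("nan")]):
        check_finite(step, loss)
except FloatingPointError as e:
    caught = str(e)

print(caught)
```

This does the same job as the CheckNumerics op that appears in the r0.6 traceback, but on the Python side and only for the loss value.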
----
[1] Trying character recognition with TensorFlow (3)
[2] TensorFlow: AdamOptimizer does not converge error? ReluGrad input is not finite. : Tensor had NaN values