
How do I set an adaptive learning rate for GradientDescentOptimizer?

inputbox 2020. 8. 29. 10:59



I am using TensorFlow to train a neural network. This is how I initialize the GradientDescentOptimizer:

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

mse        = tf.reduce_mean(tf.square(out - out_))
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)

The point here is that I don't know how to set an update rule for the learning rate, or a decay value for it.

How can I use an adaptive learning rate here?


First of all, tf.train.GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps. TensorFlow also provides out-of-the-box adaptive optimizers, including tf.train.AdagradOptimizer and tf.train.AdamOptimizer, and these can be used as drop-in replacements.
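
For instance, swapping the question's optimizer for one of these is a one-line change. A minimal sketch, assuming the mse tensor from the question; the learning-rate values are only illustrative starting points, not recommendations:

# Drop-in replacements for GradientDescentOptimizer (illustrative hyperparameters).
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(mse)
# ... or ...
train_step = tf.train.AdagradOptimizer(learning_rate=0.01).minimize(mse)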

However, if you want to control the learning rate with otherwise-vanilla gradient descent, you can take advantage of the fact that the learning_rate argument to the tf.train.GradientDescentOptimizer constructor can be a Tensor object. This lets you compute a different value for the learning rate in each step, for example:

learning_rate = tf.placeholder(tf.float32, shape=[])
# ...
train_step = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(mse)

sess = tf.Session()

# Feed different values for learning rate to each training step.
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.01})
sess.run(train_step, feed_dict={learning_rate: 0.01})

Alternatively, you could create a scalar tf.Variable that holds the learning rate, and assign it a new value each time you want to change the learning rate.
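
A minimal sketch of that variable-based approach, again assuming the mse tensor from the question (the concrete rate values are only placeholders):

# Keep the learning rate in a scalar variable and reassign it whenever needed.
lr = tf.Variable(0.1, trainable=False)
train_step = tf.train.GradientDescentOptimizer(lr).minimize(mse)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(train_step)             # runs with a learning rate of 0.1
sess.run(lr.assign(0.01))        # change the learning rate
sess.run(train_step)             # runs with a learning rate of 0.01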


TensorFlow provides an op that automatically applies exponential decay to a learning-rate tensor: tf.train.exponential_decay. For an example of it in use, see this line in the MNIST convolutional model example. Then use @mrry's suggestion above to supply this variable as the learning_rate parameter to your optimizer of choice.

The key excerpt to look at is:

# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0)

learning_rate = tf.train.exponential_decay(
  0.01,                # Base learning rate.
  batch * BATCH_SIZE,  # Current index into the dataset.
  train_size,          # Decay step.
  0.95,                # Decay rate.
  staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                     0.9).minimize(loss,
                                                   global_step=batch)

Note the global_step=batch parameter passed to minimize. It tells the optimizer to helpfully increment the 'batch' parameter for you every time it trains.
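
A self-contained toy version of the same pattern (the quadratic loss and all variable names below are made up purely for illustration, not taken from the MNIST example) shows the increment directly:

# Toy sketch: global_step ('batch' here) advances once per training step,
# and the exponentially decayed learning rate follows it.
import tensorflow as tf

x = tf.Variable(5.0)
loss = tf.square(x)                      # toy loss, not the MNIST model
batch = tf.Variable(0, trainable=False)  # plays the role of global_step

learning_rate = tf.train.exponential_decay(
    0.01, batch, decay_steps=100, decay_rate=0.95, staircase=True)
train_op = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(
    loss, global_step=batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)
    print(sess.run(batch))           # -> 3, incremented once per step
    print(sess.run(learning_rate))   # current (possibly decayed) learning rate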


The gradient descent algorithm uses the constant learning rate that you can provide during initialization. You can pass various learning rates in the way mrry showed.

But instead you can also use more advanced optimizers which have a faster convergence rate and adapt to the situation.

Here is a brief explanation based on my understanding (a sketch of the corresponding TensorFlow optimizers follows the list):

  • momentum helps SGD navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the direction of the previous step to the current step. This amplifies speed in the correct direction and softens oscillation in the wrong directions. That fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum: at the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once all the high gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss or oscillate around the minimum.
  • nesterov accelerated gradient overcomes this problem by starting to slow down early. In momentum we first compute the gradient and then make a jump in that direction, amplified by whatever momentum we had previously. NAG does the same thing but in the other order: first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
  • AdaGrad or adaptive gradient allows the learning rate to adapt based on the parameters. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate and, due to the peculiarities of the algorithm, the learning rate is monotonically decreasing. This causes the biggest problem: at some point in time the learning rate is so small that the system stops learning.
  • AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the sum of square roots. At each stage you add another square root to the sum, which causes the denominator to keep growing and the learning rate to keep shrinking. In AdaDelta, instead of summing all past square roots, a sliding window is used, which allows the sum to decrease. RMSprop is very similar to AdaDelta.
  • Adam or adaptive moment estimation is an algorithm similar to AdaDelta, but in addition to storing learning rates for each of the parameters it also stores momentum changes for each of them separately.

    A few visualizations: (animated comparisons of these optimizers' trajectories are shown in the original Stack Overflow answer)
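
In TF 1.x each of the optimizers above is available as a ready-made class. A sketch of how they are constructed (the hyperparameter values are illustrative defaults only, and loss is assumed to be your training loss tensor):

# Constructing the optimizers described above (illustrative hyperparameters).
sgd_momentum = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
nesterov     = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                          use_nesterov=True)
adagrad      = tf.train.AdagradOptimizer(learning_rate=0.01)
adadelta     = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95)
rmsprop      = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9)
adam         = tf.train.AdamOptimizer(learning_rate=0.001)

train_step = adam.minimize(loss)   # any of the above can be swapped in here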


From the TensorFlow official docs:

global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                       100000, 0.96, staircase=True)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

If you want to set specific learning rates for intervals of epochs like 0 < a < b < c < .... Then you can define your learning rate as a conditional tensor, conditional on the global step, and feed this as normal to the optimiser.

You could achieve this with a bunch of nested tf.cond statements, but it's easier to build the tensor recursively:

def make_learning_rate_tensor(reduction_steps, learning_rates, global_step):
    assert len(reduction_steps) + 1 == len(learning_rates)
    if len(reduction_steps) == 1:
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: learning_rates[1]
        )
    else:
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: make_learning_rate_tensor(
                reduction_steps[1:],
                learning_rates[1:],
                global_step,)
            )

Then to use it you need to know how many training steps there are in a single epoch, so that we can use the global step to switch at the right time, and finally define the epochs and learning rates you want. So if I want the learning rates [0.1, 0.01, 0.001, 0.0001] during the epoch intervals of [0, 19], [20, 59], [60, 99], [100, ∞) respectively, I would do:

global_step = tf.train.get_or_create_global_step()
learning_rates = [0.1, 0.01, 0.001, 0.0001]
steps_per_epoch = 225
epochs_to_switch_at = [20, 60, 100]
epochs_to_switch_at = [x * steps_per_epoch for x in epochs_to_switch_at]
learning_rate = make_learning_rate_tensor(epochs_to_switch_at, learning_rates, global_step)
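
Finally, the resulting learning_rate tensor is handed to the optimizer like any other learning rate (the loss tensor and the choice of GradientDescentOptimizer here are just for illustration):

# Use the piecewise-constant learning rate like any other learning-rate tensor;
# passing global_step makes it advance once per training step.
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)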

Reference URL: https://stackoverflow.com/questions/33919948/how-to-set-adaptive-learning-rate-for-gradientdescentoptimizer
