Elod P Csirmaz’s Blog: 2019

24 July 2019

Decreasing accuracy in neural nets during training - things to try

Recently I attempted to train a classifier network, only to find that after an initial period of decreasing loss and increasing accuracy, the accuracy quickly dropped to just 1/N, where N was the number of classes. In other words, the network became no better than a random guess, and it was stuck in this state.

I tried a number of things with varying success, researched the usual causes and remedies, and thought I'd share my experience, as the solution that ultimately helped was somewhat surprising. It was:

More stable loss function. I used TensorFlow and Keras: I had added a Keras softmax layer as the last layer in my model, and used Keras's categorical_crossentropy as the loss function. It turns out that this arrangement can be quite unstable numerically, causing unwanted behaviour during training. When I removed the softmax layer and switched to tf.nn.softmax_cross_entropy_with_logits_v2 from the TensorFlow backend, the problem largely went away. This function calculates the softmax and the cross-entropy in one go, with much better numerical stability.

To use this as the loss function in a Keras model, I followed this structure:

import tensorflow as tf
from tensorflow import keras

def my_loss(truth, prediction):
    # The labels are constants; don't backpropagate into them
    truth = tf.stop_gradient(truth)
    # Computes softmax and cross-entropy together from the raw logits
    loss = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=truth,
        logits=prediction
    )
    return tf.reduce_mean(loss)

model = keras.models.Model(...)
model.compile(loss=my_loss, ...)
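To see why combining the two operations helps, here is a small numpy sketch (purely illustrative, not the TensorFlow implementation): computing the softmax first and then taking its logarithm breaks down for large logits, while the combined log-sum-exp form stays finite.

```python
import numpy as np

logits = np.array([1000.0, 0.0, 0.0])  # one very large logit
truth = np.array([1.0, 0.0, 0.0])      # one-hot truth on class 0

# Naive two-step version: softmax first, then log.
# exp(1000) overflows, so the probabilities and the loss become NaN.
with np.errstate(over='ignore', invalid='ignore'):
    probs = np.exp(logits) / np.exp(logits).sum()
    naive_loss = -(truth * np.log(probs)).sum()

# Combined version: log-softmax via the log-sum-exp trick,
# computed directly from the logits, never forming huge exponentials.
m = logits.max()
log_softmax = logits - (m + np.log(np.exp(logits - m).sum()))
stable_loss = -(truth * log_softmax).sum()
```

Here `naive_loss` comes out as NaN while `stable_loss` is the correct value, 0: the prediction is perfectly confident and correct, so the cross-entropy vanishes.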

That this was necessary was surprising, because the network was rather simple and the classes were completely balanced. When I inspected the output of the network just before the softmax layer, I found that whenever the accuracy got stuck at 1/N, the signals for all classes were largely equal, and quite large.

Below are some of the other things I tried, some of which did seem to help. Some may help you, too, although YMMV.

Check for bugs in the code. Many suggested that this behaviour can occur if, due to a bug, a NaN value is fed into the loss function, for example as a result of taking the logarithm of 0.
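As a minimal illustration of this failure mode (a numpy sketch, not taken from the original code): a hand-rolled cross-entropy without clipping produces a non-finite loss as soon as the model assigns probability 0 to the true class, and clipping the probabilities is the usual fix.

```python
import numpy as np

truth = np.array([0.0, 1.0])
pred = np.array([1.0, 0.0])   # model is confidently wrong

# log(0) yields -inf, which propagates into the loss (and the gradients)
with np.errstate(divide='ignore', invalid='ignore'):
    loss = -(truth * np.log(pred)).sum()

# Clipping the predicted probabilities away from 0 keeps the loss finite
eps = 1e-7
safe_loss = -(truth * np.log(np.clip(pred, eps, 1.0))).sum()
```

Here `loss` is infinite while `safe_loss` is a large but finite penalty, which an optimizer can still work with.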

Try easy data. To ensure that the model is actually capable of converging, feed it some data that is really easy to train on, like using N words ("aaa", "bbb", "ccc", etc.) as the input with a one-to-one correspondence to classes as the target.
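Such a sanity-check dataset can be generated in a few lines (a hypothetical numpy sketch; the vocabulary and sizes are arbitrary). Any model that cannot reach near-100% accuracy on this has a bug somewhere.

```python
import numpy as np

# N "words" with a one-to-one mapping to N classes
N = 4
vocab = ["aaa", "bbb", "ccc", "ddd"]
word_ids = np.random.randint(0, N, size=1000)

words = [vocab[i] for i in word_ids]   # the trivially easy inputs
x = np.eye(N)[word_ids]                # one-hot input encoding
y = np.eye(N)[word_ids]                # target: the same class as the word
```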

Reduce the learning rate. There were suggestions online that this behaviour can also be due to exploding gradients or oscillation around a minimum, which can potentially be avoided by reducing the learning rate.
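The effect of the learning rate on oscillation and divergence is easy to reproduce on a toy problem (an illustrative sketch, unrelated to the original model): plain gradient descent on f(w) = w^2 overshoots and diverges once the step size is too large, and converges when it is reduced.

```python
def descend(lr, steps=50, w=1.0):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # Each step multiplies w by (1 - 2*lr): for lr > 1 this factor
    # is below -1, so w oscillates with growing amplitude.
    for _ in range(steps):
        w -= lr * 2 * w
    return w

diverged = descend(lr=1.1)    # |w| blows up
converged = descend(lr=0.1)   # w shrinks towards 0
```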

Change the optimizer. Related to the above suggestion, if the loss surface is really ill-behaved, some optimizers may simply not work well with it. Try different ones, and/or tune their parameters.

Change the activation function. Changing the activation functions throughout the model changes its characteristics completely. In other projects I found that ReLUs and Leaky ReLUs, although great for partially managing vanishing gradients, allow weights to grow very large; of course, clamping and regularizers are available to help with this, too. Also, ReLUs have a large dead region where the derivative is 0, and the network can get stuck there. Leaky ReLUs avoid this problem.
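The dead-region difference can be made concrete (an illustrative numpy sketch): in ReLU's negative region both the output and the derivative are exactly 0, so no gradient flows and a unit stuck there never recovers, while Leaky ReLU keeps a small slope alpha.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def relu_grad(x):
    # Derivative is 0 everywhere in the negative ("dead") region
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # A small but nonzero slope keeps gradients flowing
    return 1.0 if x > 0 else alpha
```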

Turn regularizers and batch norm layers on/off. Again, these change the characteristics of the model considerably, which may work better or worse with the given loss function and optimizer.

Reduce the width of the network. I found that a network was more susceptible to this issue the more units it had on its layers. It has been suggested that wide, shallow networks are difficult to train, which may be related.

I hope you'll find some of these suggestions useful. For reference, let me list some of the sources I found:

23 July 2019

How to interpret a neural network?

Machine learning solutions and neural networks are used in more and more areas of our lives. At the same time, since it is difficult to grasp how they function, they can be seen as black boxes, and there is reluctance to adopt them in areas where we need to remain accountable for business and/or legal reasons. This also affects the trust we all place in such systems.

Because of this, providing tools that help us understand how these models arrive at a certain decision is an area of active research. These tools include visualisation tools, tools that identify the most significant inputs, tools that allow experimenting with the input data to understand the relative importance the model attributes to different signals, and tools that extract "rules" from neural networks which can then be reviewed or even applied by human decision-makers.

I wrote a short paper and sample code to demonstrate a simple solution that makes it possible to extract linear rules from a neural network that employs Parametric Rectified Linear Units (PReLUs). It is based on introducing a new force, applied in parallel with backpropagation, that aims to reduce the PReLUs to identity functions, which causes the neural network to collapse into a smaller system of linear functions and inequalities. It works well on a number of toy examples, and also shows promise in helping with overfitting.
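The collapse can be illustrated with a tiny numpy sketch (hypothetical weights; not the code from the paper): when the PReLU slope parameter reaches 1, the activation becomes the identity, and a stack of (linear layer, PReLU) pairs equals a single composed affine map.

```python
import numpy as np

def prelu(x, a):
    # PReLU: x for x > 0, a * x otherwise; a = 1 gives the identity
    return np.where(x > 0, x, a * x)

# A toy two-layer network with hypothetical weights
W1, b1 = np.array([[2.0, 0.0], [0.0, 3.0]]), np.array([1.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

def net(x, a):
    h = prelu(W1 @ x + b1, a)
    return W2 @ h + b2

x = np.array([-1.0, 2.0])
# With a = 1, the network is exactly the composed linear map
linear = W2 @ (W1 @ x + b1) + b2
```

For this input, `net(x, a=1.0)` matches `linear` exactly, while other values of `a` give a genuinely nonlinear result.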

Read the paper »

More on GitHub »