Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)
12 Exploiting Inherent Relationships Between Parameters in Recurrent Neural Networks
12.1 Perspective
Optimisation of complex neural network parameters is a rather involved task. It
becomes particularly difficult for large-scale networks, such as modular networks, and
for networks with complex interconnections, such as feedback networks. Therefore, if
an inherent relationship between some of the free parameters of a neural network can
be found, which holds at every time instant for a dynamical network, it would help
to reduce the number of degrees of freedom in the optimisation task of learning in a
particular network.
We derive such relationships between the gain β in the nonlinear activation function
of a neuron Φ and the learning rate η of the underlying learning algorithm for both
the gradient descent and extended Kalman filter trained recurrent neural networks.
The analysis is then extended in the same spirit for modular neural networks.
Both the networks with parallel modules and networks with nested (serial) modules
are analysed. A detailed analysis is provided for the latter, since the former can be
considered a linear combination of modules that consist of feedforward or recurrent
neural networks.
For all these cases, the static and dynamic equivalence between an arbitrary neural
network described by β, η and W(k) and a referent network described by $\beta^{R} = 1$,
learning rate parameter; and
2. every learning rate parameter should vary from one iteration to the next.
These arguments are intuitively sound. However, if there is a dependence between
some of the parameters in the network, this approach would lead to suboptimal learning
and oscillations, since coupled parameters would be trained using different learning
rates and different speeds of learning, which would degrade the performance of the
network. To circumvent this problem, some heuristics on the values of the parameters
have been derived (Haykin 1994). To shed further light onto this problem and offer
feasible solutions, we therefore concentrate on finding relationships between coupled
parameters in recurrent neural networks. The derived relationships are also valid for
feedforward networks, since recurrent networks degenerate into feedforward networks
when the feedback is removed.
Let us consider again a common choice for the activation function,
$$\Phi(\gamma, \beta, x) = \frac{\gamma}{1 + e^{-\beta x}}. \tag{12.1}$$
This is a Φ : R → (0,γ) function. The parameter β is called the gain and the product
γβ the steepness (slope) of the activation function.
The reciprocal of gain is also
referred to as the temperature. The gain γ of a node in a neural network is a constant
that amplifies or attenuates the net input to the node. In Kruschke and Movellan
(1991), it has been shown that the use of gradient descent to adjust the gain of the
node increases learning speed.
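The roles of β and γ in (12.1) can be checked numerically. The following is a minimal sketch, not taken from the text; the function names `phi` and `phi_slope_at_origin` are hypothetical. It illustrates that the output stays in (0, γ) and that the slope at the origin scales with the product γβ (for this logistic form it equals γβ/4).

```python
import math

def phi(x, beta=1.0, gamma=1.0):
    """Logistic activation of (12.1): gamma / (1 + exp(-beta * x)).

    Note: math.exp overflows for very large -beta * x; moderate
    arguments are assumed in this sketch.
    """
    return gamma / (1.0 + math.exp(-beta * x))

def phi_slope_at_origin(beta=1.0, gamma=1.0, h=1e-6):
    """Central-difference estimate of dPhi/dx at x = 0."""
    return (phi(h, beta, gamma) - phi(-h, beta, gamma)) / (2.0 * h)

beta, gamma = 2.0, 3.0
midpoint = phi(0.0, beta, gamma)          # gamma / 2
slope = phi_slope_at_origin(beta, gamma)  # approximately gamma * beta / 4
```

Doubling either β or γ doubles the estimated slope, which is why the two parameters are coupled from the point of view of a gradient-based learning algorithm.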
Let us consider again the general gradient-descent-based weight adaptation algorithm,
given by
$$W(k) = W(k-1) - \eta \nabla_{W} \ldots$$
$$\ldots\,(k)w(k))\,e(k)\,x(k). \tag{12.4}$$
From (12.4), if β increases, so too will the step on the error performance surface for a
fixed η. It seems, therefore, advisable to keep β constant, say at unity, and to control
the features of the learning process by adjusting the learning rate η, thereby having one
degree of freedom less, when all of the parameters in the network are adjustable. Such
reduction may be very significant for nonlinear optimisation algorithms employed for
parameter adaptation in a particular recurrent neural network.
A fairly general gradient algorithm that continuously adjusts parameters η, β and
γ can be expressed by
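$$
\begin{aligned}
y(k) &= \Phi(X(k), W(k)),\\
e(k) &= s(k) - y(k),\\
W(k+1) &= W(k) - \frac{\eta(k)}{2}\,\frac{\partial e^{2}(k)}{\partial W(k)},\\
\eta(k+1) &= \eta(k) - \frac{\rho}{2}\,\frac{\partial e^{2}(k)}{\partial \eta(k)},\\
\beta(k+1) &= \beta(k) - \frac{\theta}{2}\,\frac{\partial e^{2}(k)}{\partial \beta(k)},\\
\gamma(k+1) &= \gamma(k) - \frac{\zeta}{2}\,\frac{\partial e^{2}(k)}{\partial \gamma(k)},
\end{aligned}
\tag{12.5}
$$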
where ρ is a small positive constant that controls the adaptive behaviour of the step
size sequence η(k), whereas small positive constants θ and ζ control the adaptation
of the gain of the activation function β and gain of the node γ, respectively. We will
concentrate only on adaptation of β and η.
2 For the logistic function $\sigma(\beta, x) = \frac{1}{1 + e^{-\beta x}} = \sigma(\beta x)$,
Figure 12.1 A simple nonlinear adaptive filter (inputs x(k), x(k−1), x(k−2), ..., x(k−N+1) from a chain of z^{−1} delays, weights w_1(k), w_2(k), w_3(k), ..., w_N(k), nonlinearity Φ, output y(k))
The selection of learning rate η is critical for the gradient descent algorithms (Mathews
and Xie 1993). An η that is small as compared to the reciprocal of the input signal
power will ensure small misadjustment in the steady state, but the algorithm will
converge slowly. A relatively large η, on the other hand, will provide faster convergence at
the cost of worse misadjustment and steady-state characteristics. Therefore, an ideal
choice would be an adjustable η which would be relatively large in the beginning of
adaptation and become gradually smaller when approaching the global minimum of
the error performance surface (optimal values of weights).
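One simple way to obtain a learning rate that is large initially and decays as adaptation proceeds is a fixed decay schedule. The sketch below is purely illustrative and is not the gradient-adaptive step size discussed in this chapter; the function name `annealed_learning_rate` and the parameters `eta0` and `tau` are hypothetical.

```python
def annealed_learning_rate(k, eta0=0.5, tau=50.0):
    """Hypothetical decaying step size: eta0 at k = 0, eta0 / 2 at k = tau."""
    return eta0 / (1.0 + k / tau)

# Step sizes early, midway and late in adaptation.
schedule = [annealed_learning_rate(k) for k in (0, 50, 500)]
```

Unlike the gradient-adaptive scheme of (12.5), such a schedule ignores the error signal, so it cannot increase η again if the underlying process changes.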
We illustrate the above ideas on the example of a simple nonlinear FIR filter, shown
in Figure 12.1, for which the output is given by
$$y(k) = \Phi(x^{T}(k)\,w(k)). \tag{12.6}$$
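The filter output (12.6) and a single gradient-descent weight update can be sketched as follows. This is a minimal illustration under stated assumptions: the cost is taken to be e²(k)/2, Φ is the logistic function with gain β, and the names `fir_output` and `gradient_step` as well as the numerical values are hypothetical.

```python
import math

def logistic(v, beta=1.0):
    """Logistic nonlinearity with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * v))

def fir_output(x, w, beta=1.0):
    """Filter output y(k) = Phi(x^T(k) w(k)) for one tap-input vector x."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return logistic(net, beta)

def gradient_step(x, w, s, eta, beta=1.0):
    """One gradient-descent step on e^2(k)/2 (assumed cost).

    Returns the updated weights and the error before the update.
    """
    y = fir_output(x, w, beta)
    e = s - y
    slope = beta * y * (1.0 - y)  # derivative of the logistic w.r.t. the net input
    new_w = [wi + eta * e * slope * xi for wi, xi in zip(w, x)]
    return new_w, e

x = [1.0, 0.5, -0.25]   # tap inputs x(k), x(k-1), x(k-2)
w = [0.1, -0.2, 0.3]    # current weights
s = 0.9                 # desired response s(k)
w_next, e_before = gradient_step(x, w, s, eta=0.5)
e_after = s - fir_output(x, w_next)
```

The factor `beta` multiplies the update in the same way as `eta` does, which is the coupling between the gain and the learning rate that this chapter exploits.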
We can continually adapt the step size using a gradient descent algorithm so as to
$$\ldots\,(k-1)\,x^{T}(k-1)\,x(k), \tag{12.7}$$
where Φ …
$$\ldots\,{}_{i}(k) + \eta_{i}(k)\,e(k)\,x_{i}(k), \quad i = 1, \ldots, N. \tag{12.9}$$
These expressions become much more complicated for large and recurrent networks.
As an alternative to the continual learning rate adaptation, we might consider
continual adaptation of the gain of the activation function Φ(βx). The gradient descent
algorithm that would update the adaptive gain can be expressed as
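$$
\begin{aligned}
e(k) &= s(k) - \Phi(w^{T}(k)\,x(k)),\\
w(k) &= w(k-1) + \eta(k-1)\,e(k-1)\,\Phi'(k-1)\,x(k-1),\\
\beta(k) &= \beta(k-1) - \frac{\theta}{2}\,\frac{\partial}{\partial \beta(k-1)}\,e^{2}(k)\\
&= \beta(k-1) - \frac{\theta}{2}\,\ldots
\end{aligned}
\tag{12.10}
$$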
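A much simplified sketch of gain adaptation is given below. It is not the recursion (12.10): the weights are assumed fixed, so only the direct dependence of e²(k) on β through the logistic function is differentiated, and the indirect dependence through the weight update is ignored. The function name `adapt_gain`, the toy data and the value of θ are hypothetical.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def adapt_gain(net_inputs, targets, beta0=1.0, theta=2.0):
    """Simplified gain adaptation: beta <- beta - (theta / 2) * d e^2 / d beta.

    net_inputs holds the values v = w^T x for fixed weights (an assumption
    of this sketch); the output is y = logistic(beta * v).
    """
    beta = beta0
    for v, s in zip(net_inputs, targets):
        y = logistic(beta * v)
        e = s - y
        # d e^2 / d beta = -2 * e * y * (1 - y) * v  (direct dependence only)
        beta += theta * e * y * (1.0 - y) * v
    return beta

# Toy teacher with gain 2; the student starts from gain 1.
true_beta = 2.0
net_inputs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0] * 400
targets = [logistic(true_beta * v) for v in net_inputs]
beta_hat = adapt_gain(net_inputs, targets)
```

In this toy setting the estimate approaches the teacher gain from below. When the weights are adapted at the same time, the extra dependence of e(k) on β through w(k) has to be taken into account as well.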
For the adaptation of β(k) there is a need to calculate the second derivative of the
activation function, which is rather computationally involved. Such an adaptive gain
algorithm was, for instance, analysed in Birkett and Goubran (1997). The proposed
function was
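$$
\sigma(x, a) =
\begin{cases}
x, & |x| \le a,\\
\operatorname{sgn}(x)\left[(1 - a)\tanh\!\left(\dfrac{|x| - a}{1 - a}\right) + a\right], & |x| > a,
\end{cases}
\tag{12.11}
$$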
Figure 12.2 An adaptive sigmoid (plot titled "The adaptive sigmoid": curves for a = 0, a = 0.5 and a = 1 against x from −5 to 5, with values between −1 and 1)
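The function (12.11) can be sketched in a few lines. The sketch assumes 0 ≤ a ≤ 1, which is suggested by the curves in Figure 12.2 but is an assumption here, and it treats a = 1 as the limiting case of a linear function clipped at ±1, since the expression in (12.11) divides by 1 − a. The function name `adaptive_sigmoid` is hypothetical.

```python
import math

def adaptive_sigmoid(x, a):
    """Adaptive sigmoid of (12.11): linear for |x| <= a, tanh-shaped beyond."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("this sketch assumes 0 <= a <= 1")
    if abs(x) <= a:
        return x
    if a == 1.0:
        # Limit of (12.11) as a -> 1: the output saturates at sgn(x).
        return math.copysign(1.0, x)
    magnitude = (1.0 - a) * math.tanh((abs(x) - a) / (1.0 - a)) + a
    return math.copysign(magnitude, x)

# The three curves of Figure 12.2, sampled at a few points.
samples = {a: [adaptive_sigmoid(x, a) for x in (-5.0, -0.3, 0.0, 0.3, 5.0)]
           for a in (0.0, 0.5, 1.0)}
```

For a = 0 the function reduces to tanh(x), and increasing a widens the linear region around the origin while the output remains bounded by ±1.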
errors of particular modules. Hence, the algorithms for training such networks are
extensions of standard algorithms designed for single modules. Serial (nested) modular
architectures are more complicated; one example is the pipelined recurrent
neural network (PRNN). This is an emerging architecture used in nonlinear time
series prediction (Haykin and Li 1995; Mandic and Chambers 1999f). It consists of a
number of nested small-scale recurrent neural networks as its modules, which means
that a learning algorithm for such a complex network has to perform a nonlinear
optimisation task on a number of parameters. We look at relationships between the
learning rate and the gain of the activation function for this architecture and for
various learning algorithms.
12.3 Overview
A relationship between the learning rate η in the learning algorithm and the gain β
tracking some dynamical process. We therefore differentiate between the equivalence
in the static and dynamic sense. We define the static and dynamic equivalence between
two networks below.
Definition 12.4.1. By static equivalence, we consider the equivalence of the outputs
of an arbitrary network and the referent network with fixed weights, for a given input
vector u(k), at a fixed time instant k.
Definition 12.4.2. By dynamic equivalence, we consider the equivalence of the outputs
between an arbitrary network and the referent network for a given input vector
u(k), with respect to the learning algorithm, while the networks are running.
The static equivalence is considered for already trained networks, whereas both
static and dynamic equivalence are considered for networks being adapted on the
run. We can think of the static equivalence as an analogue to the forward pass in
computation of the outputs of a neural network, whereas the dynamic equivalence
can be thought of in terms of backward pass, i.e. weight update process. We next
derive the conditions for either case.
12.4.1 Static Equivalence of Two Isomorphic RNNs
In order to establish the static equivalence between an arbitrary and referent RNN,
the outputs of their neurons must be the same, i.e.
$$y_{n}(k) = y_{n}^{R}(k) \;\Leftrightarrow\; \Phi(u_{n}^{T}(k)\,w_{n}(k)) = \Phi^{R}(u_{n}^{T}(k)\,w_{n}^{R}(k)).$$
To illustrate this, consider, for instance, the logistic nonlinearity, given by
$$\frac{1}{1 + e^{-\beta u_{n}^{T} w_{n}}} = \frac{1}{1 + e^{-u_{n}^{T} w_{n}^{R}}} \;\Leftrightarrow\; \beta w_{n} = w_{n}^{R}, \tag{12.15}$$
where the time index (k) is neglected, since all the vectors above are constant during
the calculation of the output values. As the equality (12.14) should be valid for every
neuron in the RNN, it is therefore valid for the complete weight matrix W of the
RNN.
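The static equivalence condition of (12.15) can be verified numerically for a single neuron. This is a minimal sketch with hypothetical values for β, the input vector and the weights; the function name `neuron_output` is also hypothetical.

```python
import math

def logistic(v, beta=1.0):
    """Logistic nonlinearity with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * v))

def neuron_output(u, w, beta=1.0):
    """Output of one neuron: logistic of the inner product u^T w with gain beta."""
    net = sum(ui * wi for ui, wi in zip(u, w))
    return logistic(net, beta)

beta = 2.5
u = [1.0, -0.4, 0.7]                # input vector to the neuron
w = [0.3, 0.8, -0.5]                # weights of the arbitrary network
w_ref = [beta * wi for wi in w]     # referent weights, w^R = beta * w

y = neuron_output(u, w, beta)           # arbitrary network, gain beta
y_ref = neuron_output(u, w_ref, 1.0)    # referent network, gain 1
```

The two outputs agree to within floating-point rounding, so for fixed weights the gain can be absorbed into the weights without changing the forward pass.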
The essence of the above analysis is given in the following lemma, which is independent
of the underlying learning algorithm for the RNN, which makes it valid for
two isomorphic
$$\ldots = \beta\,\frac{\partial \Phi(1, x)}{\partial x}. \tag{12.17}$$
In our case, it becomes
$$\Phi'(\beta, w, u) = \beta\,\Phi'(1, w^{R}, u). \tag{12.18}$$
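The scaling of the first derivative by the gain can also be checked numerically. In the sketch below, which uses hypothetical values of β and x, the derivative of the gain-β logistic function is estimated by a central difference and compared with β times the analytic derivative of the unit-gain (referent) logistic function evaluated at βx.

```python
import math

def logistic(x, beta=1.0):
    """Logistic nonlinearity with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * x))

def referent_derivative(x_ref):
    """Analytic derivative of the unit-gain logistic function at x_ref."""
    s = logistic(x_ref, 1.0)
    return s * (1.0 - s)

beta, x, h = 3.0, 0.4, 1e-5

# Left-hand side: numerical derivative of the gain-beta function at x.
lhs = (logistic(x + h, beta) - logistic(x - h, beta)) / (2.0 * h)

# Right-hand side: beta times the referent derivative at beta * x.
rhs = beta * referent_derivative(beta * x)
```

The two values agree up to the discretisation error of the central difference.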
Indeed, for a simple logistic function (12.12), we have
$$\Phi'(x) = \frac{\beta e^{-\beta x}}{(1 + e^{-\beta x})^{2}} = \beta\,\Phi'(x^{R}),$$
where $x^{R} = \beta x$ denotes the argument of the referent logistic function (with β