A Study of Generalization in Genetic Programming = Nghiên cứu khả năng khái quát hóa của lập trình di truyền (tóm tắt + toàn văn) - Pdf 22



MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY NGUYEN THI HIEN A STUDY OF GENERALIZATION
IN GENETIC PROGRAMMING Specialized in: Applied Mathematics and Computer Science
Code: 60 46 01 10
SUMMARY OF THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY


Reviewer 2: Assoc/Prof. Dr. Nguyen Duc Nghia Reviewer 3: Dr. Nguyen Van Vinh

The thesis was evaluated by the examination board of the academy by the
decision number 1767/QĐ-HV, 1
st
July 2014
of the Rector of Military Technical Academy, meeting at Military Technical
Academy on day month year

This thesis can be found at:
- Library of Military Technical Academy
- National Library of Vietnam
INTRODUCTION
Genetic programming (GP) paradigm was first proposed by Koza
(1992) can be viewed as the discovery of computer programs which
produce desired outputs for particular inputs. Despite advances
in GP research, when it is applied to learning tasks, the issue
of generalization of GP has not been taken the attention that it
deserves. This thesis focuses on the generalization aspect of GP
and proposes mechanisms to improve GP generalization ability.

sues and a metaphor of GP search are then discussed. Next, the
major issues of GP are discussed especially when the GP is consid-
ered as a ML system. It first overviews the approaches proposed
in the literature that help to improve the generalization ability of
GP. Then, the solution complexity (code bloat), in the particular
context of GP, and its relations to GP learning performance are
discussed. Chapter 2 provides a backgrounds on a number of con-
cepts and techniques subsequently used in the thesis for improv-
ing GP learning performance. They include progressive sampling,
layer learning, and early stopping in ML. Chapter 3 presents one
of the main contributions of the thesis. It proposes some criteria
used to determine when to stop the training of GP with an aim
to avoid over-fitting. Chapter 4 develops a learning framework for
GP that is based layer learning and progressive sampling. Chap-
ter 4.4. concludes the thesis summarizing the main results and
proposing suggestions for future works that extend the research
in this thesis.
2
Chapter 1
BACKGROUNDS
This chapter describes the representation and the specific al-
gorithm components used in the canonical version of GP. The
chapter ends with a comprehensive overview of the literature on
the approaches used to improve GP generalization ability.
1.1. Genetic Programming
The basic algorithm is as follows:
1) Initialise a population of solutions
2) Assign a fitness to each population member
3) While the Stopping criterion is not met
4) Produce new individuals using operators and the existing

other subtrees.
• Swap the selected subtrees between the two parents. The
resulting individuals are the children.
1.1.6. Mutation
An individual is selected for mutation using fitness proportional
selection. A single function or terminal is selected at random from
among the set of function and terminals making up the original
individual as the point of mutation. The mutation point, along
with the subtree stemming from mutation point, is then removed
4
from the tree, and replaced with the new, randomly generated
subtree.
1.1.7. Reproduction
The reproduction operator copies an individual from one popula-
tion into the next.
1.1.8. Stopping Criterion
A maximum number of generations usually defines the stopping
criterion in GP. However, when it is possible to achieve an ideal
individual, this can also stop GP evolutionary process. In this
thesis, some other criteria are proposed, such as a measure of
generalization loss, a lack of fitness improvement, or a run of gen-
erations with over-fitting.
1.1.9. Some Variants of GP
The GP community has proposed numerous different approaches
to program evolution: Linear GP, Graph Based GP, Grammar
Based GP.
1.2. An Example of Problem Domain
This thesis uses the 10 polynomials in Table 1.1 with white (Gaus-
sian) noise of standard deviation 0.01. It means each target func-
tion is F (x) = f (x) + ϵ. 4 more real-world data sets are from

1
− 1)(x
2
− 1))
5 F
5
(x) = x
4
1
− x
3
1
+
x
2
2
2
− x
2
Friedman1 F
6
(x) = 10 sin (πx
1
x
2
) + 20(x
3
− 0.5)
2
+ 10x

1
+x
2
2
)
cos[2π(x
1
+ x
2
)]
Multi F
9
(x) = 0.79 + 1.27x
1
x
2
+ 1.56x
1
x
4
+ 3.42x
2
x
5
+ 2.06x
3
x
4
x
5

the only parameter that limits the search space of GP, an alter-
native could be the maximum number of nodes of an individual
or both (depth and size).
1.3.2. Bloat
In the course of evolution, the average size of the individual in the
GP population often get increased largely. Typically the increase
in program size is not accompanied by any corresponding increase
in fitness. The origin of this phenomenon, which is know as bloat,
has effectively been a subject of research for over a decade.
6
1.3.3. Generalization and Complexity
Most of research in improving generalization ability of the GP
has concentrated on avoiding over-fitting on training data. One
of main cause of over-fitting has been identified as the "complex-
ity" of the hypothesis generated by the learning algorithm. When
GP trees grow up to fit or to specialize on "difficult" learning
cases, it will no longer have the ability to generalize further, this
is entirely consistent with the principle of Occam’s razor stating
that the simple solutions should always be preferred. The rela-
tionship between generalization and individual complexity in GP
has often been studied in the context of code bloat, which is ex-
traordinary enlargement of the complexity of solutions without
increasing their fitness.
1.4. Generalization in GP
1.4.1. Overview of Studies on GP Generalization Capability
Common to most of the research are the problems with obtaining
generalization in GP and the attempts to overcome these prob-
lems. These approaches can be categorized as follows:
• Using training and testing and in order to promote general-
ization in supervised learning problems.

are unbiased, the ML system parameters converge to one of the
8
local minima of the specified risk function (expected loss). When
the number of training samples is finite, the true risk function is
different from the empirical risk function to be minimized. Thus,
since the training samples are biased the parameters of a learning
machine might converge to a biased solution. This is known as
over-fitting or over training in ML.
2.1.2. Early Stopping
During the training phase of a learning machine, the generaliza-
tion error might decrease in an early period reaching a minimum
and then increase as the training goes on, while the training er-
ror monotonically decreases. Therefore, it is considered better to
stop training at an adequate time, the class of techniques that are
based on this are referred to as early stopping
2.1.3. Regularization Learning
Regularization is important for ML as it relates to over-fitting.
Regularization is based on the idea that one does not only want
to minimize the error between the model and the data but also
the complexity of the model. Methods of regularization balance
degree of fit with model complexity, which leads to simpler models
that still fit the data without over-fitting. This idea is closely
related to Occam’s razor. However, it is noticed that Occam’s
razor does not claim that the simplest solution is always the best
one. Sometimes, a more complex solution can explain the data
better than a simpler one. Occam’s razor recognizes the trade-off
between explanatory power and complexity and says we should
not increase the complexity of the solution unless it is necessary
to do so.
9

gorithms of polynomial time complexity.
2.3. Layered Learning
The layered learning (LL) paradigm is described formally by Stone
and Veloso (2000). Intended as a means for dealing with problems
for which finding a direct mapping from inputs to outputs is in-
tractable with existing learning algorithms, the essence of the ap-
proach is to decompose a problem into subtasks, each of which is
then associated with a layer in the problem-solving process. LL is
a ML paradigm defined as a set of principles for the construction
of a hierarchical, learned solution to a complex task.
10
Chapter 3
AN EMPIRICAL STUDY OF EARLY STOPPING
FOR GP
In this chapter we empirically investigate several early stopping
criteria for GP learning process. The first group of stopping cri-
teria are based on the idea of over-fitting detection as proposed in
[5] and subsequently developed in [1]. The second group of criteria
are those proposed in Prechelt’s work (1997) for neural networks.
3.1. Method
Before the training (evolutionary run) commences, the entire data
set is divided into three subsets - training (50%), validation (25%),
and test sets (25%) in random way.
3.1.1. Proposed Classes of Stopping Criteria
To detect the over-fitting during a run of GP, the amounts of over-
fitting is calculated as proposed in Vanneschi’s paper (2010). The
detail of the method is presented in Algorithm 1 below. In the
algorithm, bvp is the "best valid point" and it stands for the best
validation fitness (error on the validation set) reached until the
current generation of a GP run, excluding the generations (usu-

over-fit phenomenon without consideration to its value. This leads
us to The second class of stopping criteria: stop when the
over-fitting does not decrease in m successive generations
V F
m
: stop when m successive generations where over-fitting value
increased.
overf it(g)≥overf it(g − 1)≥ ≥overf it(g − m) > 0
In the early generation, the desire for early stopping should be
smaller than that for the late generations. Therefore, we also pro-
pose a variant of the second stopping criterion called the true
adaptive stopping criteria and its work as in Algorithm 2
(where d is the number of generations those over-fit values of in-
creasing, f g is the generation started the chain of over-fit value
not decreasing, lg is the last generation of the chain of over-fit
value not decreasing, random is a random value in [0, 1], g is cur-
rent generation, M axGen is maximum number of generation).
12
Algorithm 2: True Adap Stopping
pr oAp =
(d−f g)∗g
(lg−f g)∗generation
;
if proAp ≥ random and d < m then
Stop = T rue
Three last classes of early stopping criteria are adapted from on
those presented in Prechelt’s work for training a neural network.
Their approach considered the validation set to predict the trend
on test set, so that the validation error is used as an estimate of
the generalization error. The third class of stopping criteria:


) (3.2)
in which E
va
(g) is the validation fitness value of the best indi-
vidual of generation g. However, if the training error still gets
decreased rapidly, generalization losses have higher chance to be
"repaired"; thus assume that over-fitting does not begin until the
error decreases only slowly. The progress training is considered
training strip of length k to be a sequence k generations num-
bered g + 1 , , g + k where g is divisible by k. It measures how
much the average training error during the strip larger than the
13
minimum training error during the strip. It is formulated by 3.3
P
k
(t) = 1000.

g
t

=g−k+1
E
tr
(t

)
k min
g
t

va
(g) > E
va
(g − k)
UP
1
: stop after first end of strip generation g with
E
va
(g) > E
va
(g − k)
3.2. Experimental Settings
The parameter settings for the GP systems are shown in Table 3.1.
It is noted that GPV (GP with early stopping, OF stands for GP
with the first class stopping criterion, VF for second class, GL for
third, PQ for fourth, UP for fifth) is just GPM (standard GP)
with an exception that at each generation the stopping criterion
is checked and if it is satisfied the run is stopped at the current
generation, Tar (GP with Tarpeian Bloat Control) is GPM with
Tarpeian Control as proposed in Mahler’s work (2003) where the
Target ratio parameter gives the percentage of over average sized
programs that are targeted at every generation was taken 0.1.
Otherwise, the three systems have the same setting as in Table 3.1,
14
they even use the same initial random seed at the beginning of
each run. All the runs were conducted on a Compaq Presario
CQ3414L computer with Intel Core i3-550 Processor (4M Cache,
3.20 GHz) running on Ubuntu Linux operating system.
Population Size 500

is similar), the same is applied for the
15
fourth stopping criteria (excepted F
7
, F
8
and Conc where the So-
BIs are similar), the first stopping criteria, the second stopping
criteria and true adaptive, and finally the fifth stopping criteria
often obtain solutions of much smaller sizes than GPM. To investi-
gate the effectiveness of stopping time criteria, the third and fifth
class of stopping criteria are of little difference compared to stan-
dard GP. However, the second, True Adap, and the fourth class of
stopping criteria all demonstrate their capabilities in determining
right stopping time, which explains why the running time of these
criteria are much shorter than standard GP while their solutions’
GE were almost at least as go od as standard GP. Event though
Tarpeian implements an explicit bias towards shorter solutions,
as any regularisation technique does, their solution sizes were not
much better than the solutions found by early stopping GP. Even,
sometimes early stopping GP could find solutions that are signifi-
cantly smaller than Tarpeian GP (for instance soultions found by
PQ for target functions (F
4
, F
10
). Overall, the results show that
early stopping are very competitive compared to a regularized
technique as Tarpeian GP.
3.4. Conclusion

experiments on such a system were presented in [2]; they were
sufficient to suggest value in the approach, which we examine in
more detail in this chapter.
4.1. The Proposed Method
The learning/evolutionary process is divided into l layers. Learn-
ing starts as in standard GP, but only a subset of the training
examples are used. The next layer commences when the stopping
criteria of the current layer are met, using the final population of
17
the previous layer as its starting population. However the training
sample is incremented with new samples drawn under the same
distribution – usually uniform random –from the problem samples.
This process is repeated until predetermined stopping criteria are
met. The main extensions (briefly outlined in [2]) are derived from
progressive sampling.
4.1.1. The Sampling Schedule
The sampling schedule uses geometric sampling:
S
g
= {2
i
.n
0
: i ∈ (0, 1, . . . , ⌊log
2
N⌋)}

{N} (4.1)
In this, GPLL differs from PS, in that it always continues the
learning process until the data set is exhausted.

i
N
) then
update corresponding statistics for sample i;
foreach sample i do
calculate its Q
i
and output (S
i
, Q
i
);
18
We then estimate the SOSS from the points (S
i
, Q
i
) as Baohua
did.
4.1.3. The Number of Learning Layers
Having chosen a = 2, it follows that the number l of learning
layers is given by:
l = ⌊log
2
(
N
n
0
)⌋ + 1 (4.2)
4.1.4. The Stopping Criterion for Each Layer

each problem) were conducted on a Compaq Presario CQ3414L
computer with Intel Core i3-550 Processor (4M Cache, 3.20GHz)
running a Ubuntu Linux operating system. No other jobs were
running at the same time.
4.3. Results and Discussions
For each setting, we recorded the generalization error (GE - test
set error) of the best individual of the run, its size, the total run-
ning time, and the first hitting time (the generation where the
best individual was found). To summarise, GPLL
3
was signifi-
cantly worse than GP (except for F
1
); the performance degrada-
tions ranged from 5% (Poll) to 81% (F
7
). Thus for m = 3, the
stopping criterion forced GPLL to move onto the next layer pre-
maturely, limiting its exploration. GPLL
6
was closer to GP: at
least as good on 10 out of 14 problems, and for the remainder, the
disadvantage of GPLL
6
ranged from about 3% to 28%. GPLL
9
was at least as good as GP on all problems, and better on F
1
, F
2

6
and GPLL
9
, the stopping criterion was more tolerant, and conse-
quently more solutions were found in early layers (especially when
the KL distance between the training set and the full data set is
small). This suggests that the run time of GPLL could be further
improved if we could find stopping criteria that would allow GPLL
to terminate even if all layers had not been consumed. GPLL in-
corporates three important components beyond those in standard
GP: layered learning, incremental sampling, and early stopping.
Are all these components necessary for the performance of GPLL?
To test this, we conducted experiments on the 14 test problems,
omitting one or more of these components. Specifically, we ran
GP with the following settings:
• GPM: GP using the SOSS for training rather than the orig-
inal data set
• GPMSt
9
: GPM with the stopping criterion of GPLL
9
• GPSt
9
: GP with the original data set and the stopping cri-
terion of GPLL
9
In detail, GPLL
9
out-performed GPM, GPMSt, and GPSt
9

proving on practices and approaches of evolutionary learn-
ing, and in particular of GP. Several new methods aiming
at promoting generalization performance of the solutions (in
terms of (generalization error, training time, solution com-
plexity) obtained by GP were developed and their use was
22
supported by experimental evidence.
3) A class of learning problems were represented as a challenge
for both conventional learning methods and evolutionary
methods. The thesis developed various new methods which
were able to produce successful generalization performance
on the selected examples of learning problems. Most of the
results obtained using these methods were b etter than the
standard GP.
The research on this thesis has considered GP as a machine learn-
ing system and aim to improve the its performance. The main
contributions of the thesis is summarized as follows.
1. Contributions
In addition to a summarization of literature regarding the gener-
alization ability of GP, following original work and results have
been reported in the thesis:
• A method to improve the performance of GP based
on early stopping:Early stopping method was used with
some other machine learning techniques, but it has not been
applied to GP. The results of the experiments showed that
early stopping technique has significant advantages in re-
ducing the training time of and the complexity of solution
found by GP that does not alter the quality of the solutions.
As a consequence, three new stopping criteria for GP have
been proposed and three others have been adopted from the


Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

Music ♫

Copyright: Tài liệu đại học © DMCA.com Protection Status