May, Gary S. "Computational Intelligence in Microelectronics Manufacturing"
Computational Intelligence in Manufacturing Handbook
Edited by Jun Wang et al
Boca Raton: CRC Press LLC,2001
©2001 CRC Press LLC
13 Computational Intelligence in Microelectronics Manufacturing
13.1 Introduction
13.2 The Role of Computational Intelligence
13.3 Process Modeling
13.4 Optimization
13.5 Process Monitoring and Control
13.6 Process Diagnosis
13.7 Summary

13.1 Introduction
Computer-integrated manufacturing of integrated circuits (IC-CIM) is aimed at optimizing the
cost-effectiveness of integrated circuit manufacturing, just as computer-aided design (CAD) has
dramatically affected the economics of circuit design.
Under the overall heading of reducing manufacturing cost, several important subtasks have been
identified. These include increasing chip fabrication yield, reducing product cycle time, maintaining
consistent levels of product quality and performance, and improving the reliability of processing equipment.
Unlike the manufacture of discrete parts such as electrical appliances, where relatively little rework
is required and a yield greater than 95% on salable product is often realized, the manufacture of integrated
circuits faces unique obstacles. Semiconductor fabrication processes consist of hundreds of sequential
steps, and yield loss occurs at every step. As a result, IC manufacturing yields typically range from as
low as 20% up to 80%. The problem of low yield is particularly severe for new fabrication sequences. Effective IC-CIM
systems, however, can alleviate such problems. Table 13.1 summarizes the results of a 1986 Toshiba study
that analyzed the use of IC-CIM techniques in producing 256K dynamic RAM circuits [Hodges
et al., 1989]. This study showed that CIM techniques improved the manufacturing process on each of
the four productivity metrics investigated.
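The compounding effect of many sequential steps on yield can be seen with a short calculation. The sketch below assumes, purely for illustration, that every step passes the same fraction of product independently of the others; the step count and per-step yield are hypothetical values, not figures from the Toshiba study.

```python
def line_yield(step_yield: float, n_steps: int) -> float:
    """Overall yield when each of n_steps independently passes a fraction step_yield."""
    return step_yield ** n_steps

# Hypothetical numbers: 300 steps, each losing only 0.5% of product.
overall = line_yield(0.995, 300)
print(f"overall yield: {overall:.1%}")
```

Under these assumptions, even a 99.5% per-step yield compounds to an overall yield near the low end of the range quoted above.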
Because of the large number of steps involved, maintaining product quality in an IC manufacturing
facility requires strict control of literally hundreds or even thousands of process variables. The
interdependent issues of high yield, high quality, and low cycle time have been addressed in part by the ongoing
development of several critical capabilities in state-of-the-art IC-CIM systems: in situ process monitoring,
13.2.1 Neural Networks
Because of their inherent learning capability, adaptability, and robustness, artificial neural nets are used
to solve problems that have heretofore resisted solutions by other more traditional methods. Although
the name “neural network” stems from the fact that these systems crudely mimic the behavior of biological
neurons, the neural networks used in microelectronics manufacturing applications actually have little to
do with biology. However, they share some of the advantages that biological organisms have over standard
computational systems. Neural networks are capable of performing highly complex mappings on noisy
and/or nonlinear data, thereby inferring very subtle relationships between diverse sets of input and output
parameters. Moreover, these networks can also generalize well enough to learn overall trends in functional
relationships from limited training data.
There are several neural network architectures and training algorithms eligible for manufacturing
applications. However, the backpropagation (BP) algorithm is the most generally applicable and most
popular approach for microelectronics manufacturing. Feedforward neural networks trained by BP
consist of several layers of simple processing elements called “neurons” (Figure 13.2). These rudimentary
processors are interconnected so that information relevant to input–output mappings is stored in the
weights of the connections between them. Each neuron computes the weighted sum of its inputs and filters
it through a sigmoid transfer function. The layers of neurons in BP networks receive, process, and transmit
critical information about the relationships between the input parameters and corresponding responses.
In addition to the input and output layers, these networks incorporate one or more “hidden” layers of
neurons that do not interact with the outside world, but assist in performing nonlinear feature extraction
tasks on information provided by the input and output layers.
In the BP learning algorithm, the network begins with a random set of weights. Then an input vector
is presented and fed forward through the network, and the output is calculated by using this initial weight
matrix. Next, the calculated output is compared to the measured output data, and the squared difference
th neuron in layer $k$; $in_{i,k}$ = input to the $i$th neuron in the $k$th layer; and $out_{i,k}$
Computer-Integrated Manufacturing of VLSI, Proc. IEEE/CHMT Int. Elec. Manuf. Tech. Symp., 1-3. With permission.
$$in_{i,k} = \sum_j \left[ w_{i,j,k} \cdot out_{j,k-1} \right] \qquad (13.1)$$
where the summation is taken over all the neurons in the previous layer. The output of a given neuron
is a sigmoidal transfer function of the input, expressed as
$$out_{i,k} = \frac{1}{1 + e^{-in_{i,k}}} \qquad (13.2)$$
Error is calculated for each input–output pair as follows: Input neurons are assigned a value and
computation occurs by a forward pass through each layer of the network. Then the computed value at the
output is compared to its desired value, and the square of the difference between these two vectors
provides a measure of the error ($E$) using
$$E = \frac{1}{2} \sum_{j=1}^{q} \left( d_j - out_{j,n} \right)^2 \qquad (13.3)$$
where $n$ is the number of layers in the network, $q$ is the number of output neurons, $d_j$ is the desired
output of the $j$th neuron in the output layer, and $out_{j,n}$ is the calculated output of that same neuron.
(weights), inputs and outputs of neurons in different layers. (Source: Himmel, C. and May, G., 1993.
Advantages of Plasma Etch Modeling Using Neural Networks Over Statistical Techniques,
IEEE Trans. Semi. Manuf., 6(2):103-111.
With permission.)
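The forward pass and error measure described by Equations 13.1 to 13.3 can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited work: it assumes neurons without bias terms (none appear in the equations), a factor of one half in the squared error, and a hypothetical 2-2-1 network whose weights are arbitrary.

```python
import math

def sigmoid(x: float) -> float:
    """Sigmoidal transfer function of Equation 13.2."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, inputs):
    """Feed an input vector forward and return the outputs of every layer.

    weights[k][i][j] is the weight from neuron j of one layer to neuron i
    of the next layer (the input layer itself carries no weights).
    """
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1]
        # Equation 13.1 (weighted sum) followed by Equation 13.2 (sigmoid).
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev))) for row in layer])
    return outputs

def squared_error(desired, actual):
    """Error measure of Equation 13.3 (factor 1/2 assumed)."""
    return 0.5 * sum((d - o) ** 2 for d, o in zip(desired, actual))

# Hypothetical 2-2-1 network with arbitrary weights.
weights = [
    [[0.5, -0.4], [0.3, 0.8]],   # hidden layer: 2 neurons, 2 inputs each
    [[1.0, -1.0]],               # output layer: 1 neuron, 2 inputs
]
outs = forward(weights, [1.0, 0.0])
error = squared_error([1.0], outs[-1])
```

Keeping the outputs of every layer, not only the last, is a deliberate choice: the backward pass needs them to form the weight gradients.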
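$$\frac{\partial in_{i,k}}{\partial w_{i,j,k}} = out_{j,k-1}, \qquad \frac{\partial out_{i,k}}{\partial in_{i,k}} = out_{i,k} \left( 1 - out_{i,k} \right)$$
$$\delta_{i,k} = \frac{\partial E}{\partial in_{i,k}}, \qquad \phi_{i,k} = \frac{\partial E}{\partial out_{i,k}}$$
$$\frac{\partial E}{\partial w_{i,j,k}} = \frac{\partial E}{\partial in_{i,k}} \, \frac{\partial in_{i,k}}{\partial w_{i,j,k}} = \delta_{i,k} \cdot out_{j,k-1}$$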
where the expressions in Equations 13.3 and 13.4 have been substituted. Likewise, the quantity $\phi_{i,n}$ is
given by
$$\phi_{i,n} = -\left( d_i - out_{i,n} \right) \qquad (13.8)$$
Consequently, for the inner layers of the network
$$\phi_{i,k} = \frac{\partial E}{\partial out_{i,k}} = \sum_j \frac{\partial E}{\partial in_{j,k+1}} \, \frac{\partial in_{j,k+1}}{\partial out_{i,k}} \qquad (13.9)$$
where the summation is taken over all neurons in the ($k$ + 1)th layer. This expression can be simplified
using Equations 13.1 and 13.5 to yield
$$\phi_{i,k} = \sum_j \left[ \delta_{j,k+1} \cdot w_{j,i,k+1} \right] \qquad (13.10)$$
Then $\delta_{i,k}$ is determined from Equation 13.7 as
$$\delta_{i,k} = \phi_{i,k} \, out_{i,k} \left( 1 - out_{i,k} \right) = out_{i,k} \left( 1 - out_{i,k} \right) \sum_j \left[ \delta_{j,k+1} \cdot w_{j,i,k+1} \right] \qquad (13.11)$$
13.2.2 Genetic Algorithms
Neural networks are an extremely useful tool for defining the often complex relationships between
controllable process conditions and measurable responses in electronics manufacturing processes. However,
in addition to the need to predict the output behavior of a given process given a set of input
conditions, one would also like to be able to use such models “in reverse.” In other words, given a target
response or set of response characteristics, it is often desirable to derive an optimum set of process
conditions (or process “recipe”) to achieve these targets. Genetic algorithms (GAs) are a method to
optimize a given process and define this reverse mapping.
In the 1970s, John Holland introduced GAs as an optimization procedure [Holland, 1975]. Genetic
algorithms are guided stochastic search techniques based on the principles of genetics. They use three
operations found in natural evolution to guide their trek through the search space: selection, crossover,
and mutation. Using these operations, GAs search through large, irregularly shaped spaces quickly,
requiring only objective function values (detailing the quality of possible solutions) to guide the search.
Furthermore, GAs take a more global view of the search space than many methods currently encountered
in engineering optimization. Theoretical analyses suggest that GAs quickly locate high-performance
regions in extremely large and complex search spaces and possess some natural insensitivity to noise.
These qualities make GAs attractive for optimizing neural network based process models.
In computing terms, a genetic algorithm maps a problem onto a set of binary strings. Each string
represents a potential solution. Then the GA manipulates the most promising strings in searching for
improved solutions. A GA operates typically through a simple cycle of four stages: (i) creation of a
$U_{min}$, $U_{max}$] (where $l$ is the length of the binary string). In this way, both the range and
precision of the decision variables are controlled.
To construct a multiparameter coding, as many single parameter strings as required are simply
concatenated. Each coding has its own sub-length. Figure 13.4 shows an example of a two-parameter coding
with four bits in each parameter. The ranges of the first and second parameter are 2-5 and 0-15,
respectively.
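Decoding such a concatenated string can be sketched as follows. The linear mapping used here, in which the integer value of an l-bit substring is scaled by 2^l - 1 onto the interval between the lower and upper bounds, is an assumption made for illustration; the function names are hypothetical, and only the bit lengths and parameter ranges come from the example in the text.

```python
def decode(bits: str, u_min: float, u_max: float) -> float:
    """Map a binary substring linearly onto the interval [u_min, u_max]."""
    levels = 2 ** len(bits) - 1          # largest integer the substring can hold
    return u_min + (u_max - u_min) * int(bits, 2) / levels

def decode_chromosome(chromosome: str, specs):
    """Split a concatenated string into substrings and decode each one.

    specs is a list of (sub_length, u_min, u_max) tuples, one per parameter.
    """
    values, start = [], 0
    for length, u_min, u_max in specs:
        values.append(decode(chromosome[start:start + length], u_min, u_max))
        start += length
    return values

# Two parameters, four bits each, with the ranges quoted in the text.
specs = [(4, 2.0, 5.0), (4, 0.0, 15.0)]
print(decode_chromosome("00001111", specs))   # [2.0, 15.0]
print(decode_chromosome("11110101", specs))   # [5.0, 5.0]
```

With four bits the first parameter is resolved in steps of 0.2 over its range, which shows how the substring length sets the precision of each decision variable.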
The string manipulation process employs genetic operators to produce a new population of individuals
(“offspring”) by manipulating the genetic “code” possessed by members (“parents”) of the current
population. It consists of selection, crossover, and mutation operations. Selection is the process by which
strings with high fitness values (i.e., good solutions to the optimization problem under consideration)
receive larger numbers of copies in the new population. In one popular method of selection called elitist
roulette wheel selection, strings with fitness value $F_i$ are assigned a proportionate probability of survival
into the next generation. This probability distribution is determined according to
$$P_i = \frac{F_i}{\sum_j F_j} \qquad (13.12)$$
Thus, an individual string whose fitness is n times better than another’s will produce n times the number
of offspring in the subsequent generation. Once the strings have reproduced, they are stored in a “mating
pool” awaiting the actions of the crossover and mutation operators.
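Fitness-proportionate selection into a mating pool can be sketched as below. The way elitism is handled here (the single fittest string is always copied into the pool before the remaining slots are drawn) is an assumption, since the text does not spell out that detail; the population and fitness values are hypothetical.

```python
import random

def selection_probabilities(fitness):
    """Probability of survival proportional to fitness (Equation 13.12)."""
    total = sum(fitness)
    return [f / total for f in fitness]

def elitist_roulette(population, fitness, rng=random):
    """Build a mating pool of the same size as the population.

    The fittest string is copied unchanged (the elitist part, as assumed
    here); the remaining slots are filled by fitness-proportionate draws.
    """
    best = max(range(len(population)), key=lambda i: fitness[i])
    pool = [population[best]]
    pool += rng.choices(population, weights=fitness, k=len(population) - 1)
    return pool

population = ["0001", "0110", "1011", "1111"]
fitness = [1.0, 2.0, 3.0, 4.0]              # hypothetical fitness values
probs = selection_probabilities(fitness)    # [0.1, 0.2, 0.3, 0.4]
pool = elitist_roulette(population, fitness, random.Random(0))
```

A string with four times the fitness of another is, on average, drawn four times as often, which is the proportionality described above.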
The crossover operator takes two chromosomes and interchanges part of their genetic information to
produce two new chromosomes (see Figure 13.5). After the crossover point is randomly chosen, portions
of the parent strings (P1 and P2) are swapped to produce the new offspring (O1 and O2) based on a
specified crossover probability. Mutation is motivated by the possibility that the initially defined population
might not contain all of the information necessary to solve the problem. This operation is implemented
by randomly changing a fixed number of bits in every generation according to a specified
mutation probability (see Figure 13.6). Typical values for the probabilities of crossover and bit mutation
range from 0.6 to 0.95 and 0.001 to 0.01, respectively. Higher rates disrupt good string building blocks
more often, and for smaller populations, sampling errors tend to wash out the predictions. For this
reason, the greater the mutation and crossover rates and the smaller the population size, the less frequently
predicted solutions are confirmed.
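The two operators can be sketched as follows. This is an illustrative variant, not necessarily the one used in the cited studies: crossover uses a single cut point, and mutation flips each bit independently with the mutation probability rather than changing a fixed number of bits per generation. The probabilities are taken from the typical ranges quoted above.

```python
import random

def crossover(p1: str, p2: str, p_cross: float, rng=random):
    """Single-point crossover applied with probability p_cross."""
    if rng.random() < p_cross:
        point = rng.randint(1, len(p1) - 1)      # cut strictly inside the string
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1, p2                                 # parents pass through unchanged

def mutate(bits: str, p_mut: float, rng=random) -> str:
    """Flip each bit independently with probability p_mut."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < p_mut else b
        for b in bits
    )

rng = random.Random(1)
o1, o2 = crossover("00000000", "11111111", p_cross=0.9, rng=rng)
o1, o2 = mutate(o1, 0.01, rng), mutate(o2, 0.01, rng)
```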
13.2.3 Expert Systems
Computational intelligence has also been introduced into electronics manufacturing in the areas of
Source: Han, S. and May, G., 1997. Using Neural Network Process Models to Perform PECVD
Silicon Dioxide Recipe Synthesis via Genetic Algorithms, IEEE Trans. Semi. Manuf., 10(2):279-287. With permission.)
Neural networks have recently emerged as an effective tool for fault diagnosis. Diagnostic problem
solving using neural networks requires the association of input patterns representing quantitative and
qualitative process behavior to fault identification. Robustness to noisy sensor data and high-speed
parallel computation makes neural networks an attractive alternative for real-time diagnosis. However,
the pattern-recognition-based neural network approach suffers from some limitations. First, a complete
set of fault signatures is hard to obtain, and representational inadequacy of a limited number of data sets
can induce network overtraining, thus increasing the misclassification or “false alarm” rate. Also,
approaches such as this, in which diagnostic actions take place following a sequence of several processing
steps, are not appropriate, since evidence pertaining to potential equipment malfunctions accumulates
at irregular intervals throughout the process sequence. At the end of the process sequence, significant
misprocessing and yield loss may have already taken place, making this approach economically undesirable.
Hybrid schemes involving neural networks and traditional expert systems have been employed to
circumvent these inadequacies. Hybrid techniques offset the weaknesses of each individual method used
by itself. Traditional expert systems excel at reasoning from previously viewed data, whereas neural
networks extrapolate analyses and perform generalized classification for new scenarios. One approach
to defining a hybrid scheme involves combining neural networks with an inference system based on the
A), p(A)] which lies in [0, 1]. The parameter s(A) represents the support for A, which
measures the weight of evidence in support of A
) is the uncertainty of A, which is
the difference between the evidential plausibility and support. For example, an evidence interval of [0.3,
0.7] for proposition A indicates that the probability of A is between 0.3 and 0.7, with an uncertainty of 0.4.
In terms of diagnosis, proposition A represents a given fault hypothesis. An evidential interval for fault A
is determined from a basic probability mass distribution (BPMD). The BPM indicates the portion
of the total belief in evidence assigned exactly to a particular fault hypothesis set. Any residual belief in
the frame of discernment that cannot be attributed to any subset of Θ is assigned directly to Θ. The support
and plausibility of A follow from the BPM as
$$s(A) = \sum_{A_i \subseteq A} m\langle A_i \rangle$$
$$p(A) = 1 - \sum_{B_i \subseteq \bar{A}} m\langle B_i \rangle \qquad (13.15)$$
so that s(A) is the sum of support ascribed to A and all subsets thereof.
Dempster’s rules for evidence combination provide a deterministic and unambiguous method of
combining BPMDs from separate and distinct sources of evidence contributing varying degrees of belief
to several propositions under a common frame of discernment. The rule for combining the observed
BPMs of two arbitrary and independent knowledge sources $m_1$ and $m_2$ into a third $m_3$ is
$$m_3\langle Z \rangle = \frac{\sum_{X_i \cap Y_j = Z} m_1\langle X_i \rangle \, m_2\langle Y_j \rangle}{1 - k} \qquad (13.16)$$
$m_1$ and $m_2$ when each contain different evidence concerning
the diagnosis of a malfunction in a plasma etcher [Manos and Flamm, 1989]. Such evidence could result
from two different sensor readings. In particular, suppose that the sensors have observed that the flow
of one of the etch gases into the process chamber is too low. Let the frame of discernment Θ = {A, B, C, D},
where A, . . ., D symbolically represent the following mutually exclusive equipment faults:
A = mass flow controller miscalibration
B = gas line leak
C = throttle valve malfunction
D = incorrect sensor signal
These components are illustrated graphically in the etcher gas flow system shown in Figure 13.7.
Suppose that belief in this frame of discernment is distributed according to the BPMDs:
FIGURE 13.7 Partial schematic of RIE gas delivery system. (Source: Kim, B. and May, G., 1997. Real-Time Diagnosis
of Semiconductor Manufacturing Equipment Using Neural Networks, IEEE Trans. Comp. Pack. Manuf. Tech. C,
20(1):39-47. With permission.)
beliefs. Note that the intersection of any proposition with Θ is the original proposition. The BPM
attributed to the empty set, k, which originates from the presence of various propositions in $m_1$ and $m_2$
whose intersection is empty, is 0.11. By applying Equation 13.16, BPMs for the remaining propositions
result in:
The plausibilities for propositions in the combined BPM are calculated by applying Equation 13.15. The
individual evidential intervals implied by $m_3$ are A[0.225, 0.550], B[0.169, 0.472], C[0.079, 0.235],
D[0.135, 0.269]. Combining the evidence available from knowledge sources $m_1$ and $m_2$ thus leads to the
conclusion that the most likely cause of the insufficient gas flow malfunction is a miscalibration of the
mass flow controller (proposition A).
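The combination and the resulting evidential intervals can be checked with a short sketch. The masses for m2 below are read from the surviving row of the combination table; the masses for m1 do not survive in this text and are an assumption, chosen so that the combination reproduces the conflict mass k = 0.11 and the evidential intervals quoted above to within rounding. The function names are illustrative.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: return the combined BPM and the conflict mass k."""
    raw, k = {}, 0.0
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        z = x & y                       # intersection of the two propositions
        if z:
            raw[z] = raw.get(z, 0.0) + mx * my
        else:
            k += mx * my                # belief committed to the empty set
    return {z: v / (1.0 - k) for z, v in raw.items()}, k

def support(m, a):
    """Sum of the mass on a and on all of its subsets."""
    return sum(v for z, v in m.items() if z <= a)

def plausibility(m, a, frame):
    """One minus the support for the complement of a."""
    return 1.0 - support(m, frame - a)

S = frozenset
THETA = S("ABCD")
# m2 is read from the surviving table row; m1 is an ASSUMPTION chosen to
# reproduce the results quoted in the text.
m1 = {S("AC"): 0.4, S("BD"): 0.3, THETA: 0.3}
m2 = {S("AB"): 0.5, S("C"): 0.1, S("D"): 0.2, THETA: 0.2}

m3, k = combine(m1, m2)
for fault in "ABCD":
    a = S(fault)
    print(fault, round(support(m3, a), 3), round(plausibility(m3, a, THETA), 3))
```

Representing each proposition as a frozenset makes the intersection and subset tests in the rule one-line set operations. Small differences in the third decimal place relative to the quoted intervals come from rounding.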
13.3 Process Modeling
The ability of neural networks to learn input–output relationships from limited data is beneficial in
electronics manufacturing, where a plethora of nonlinear fabrication processes exist, and experimental
data are expensive to obtain. Several researchers have reported noteworthy successes in using neural
networks to model the behavior of a few key fabrication processes. In so doing, the basic strategy is
usually to perform a series of statistically designed characterization experiments, and then to train BP
neural nets to model the experimental data. The process characterization experiments typically consist
of a factorial exploration of the input parameter space, which may be subsequently augmented by a more
advanced experimental design. Each set of input conditions in the design corresponds to a particular set
of measured process responses. This input–output mapping is what the neural network learns.
A ∪ B 0.15    C 0.03    D 0.06    Θ 0.06
$m_2$:   A ∪ B 0.50    C 0.10    D 0.20    Θ 0.20
Source: Kim, B. and May, G., 1997. Real-Time Diagnosis of Semiconductor Manufacturing Equipment
Using Neural Networks, IEEE Trans. Comp. Pack. Manuf. Tech. C, 20(1):39-47. With permission.
$$m_3\langle A,\ A \cup C,\ A \cup B,\ B,\ B \cup D,\ C,\ D,\ \Theta \rangle = 0.225,\ 0.089,\ 0.169,\ 0.169,\ 0.067,\ 0.079,\ 0.135,\ 0.067$$
Plasma process modeling efforts have previously focused on statistical response surface methods (RSM)
[Box and Draper, 1987]. RSM models can predict etch behavior under a wide range of operating
conditions, but they are most efficient when the number of process variables is small (i.e., six or fewer).
The large number of experiments required to adequately characterize the many significant variables in
processes like plasma etching is costly and usually prohibitive, forcing experimenters to manipulate a
reduced set of variables. Because plasma etching is a highly nonlinear process, this simplification reduces
the accuracy of the RSM models.
Himmel and May compared RSM to BP neural networks for modeling the etching of polysilicon films
in a carbon tetrachloride (CCl4
were able to successfully model ECR plasma responses using neural nets trained with a polynomial error
function derived from the statistical properties of the error signal itself.
Other manufacturing processes have also benefited from the neural network approach. Specifically,
chemical vapor deposition (CVD) processes, which are also nonlinear, have been modeled effectively.
Nadi et al. [1991] combined BP neural nets and influence diagrams for both the modeling and recipe
synthesis of low pressure CVD (LPCVD) of polysilicon. Bose and Lord [1993] demonstrated that neural
networks provide appreciably better generalization than regression-based models of silicon CVD. Similarly,
Han et al. [1994] developed neural process models for the plasma-enhanced CVD (PECVD) of
silicon dioxide films used as interlayer dielectric material in multichip modules.
13.3.2 Modifications to Standard Backpropagation in Process Modeling
In each of the previous examples, standard implementations of the BP algorithm have been employed
to perform process modeling tasks. However, innovative modifications of standard BP have also been
developed for certain other applications. In one case, BP has been combined with simulated annealing
to enhance model accuracy. In addition, a second adjustment has been developed that incorporates
knowledge of process chemistry and physics into a semi-empirical or hybrid model, with advantages over
the purely empirical “black-box” approach previously described. These two variations of BP are described
below.
13.3.2.1 Neural Networks and Simulated Annealing in Plasma Etch Modeling
Kim and May [1996] used neural networks to model etch rate, etch anisotropy, etch uniformity, and etch
selectivity in a low-pressure form of plasma etching called reactive ion etching (RIE). The RIE process
consisted of the removal of silicon dioxide films by a trifluoromethane (CHF3) and oxygen plasma in a
Plasma Therm 700 series dual chamber RIE system operating at 13.56 MHz. The process was initially
characterized via a $2^4$ factorial experiment with three center-point replications augmented by a central
composite design. The factors varied included pressure, RF power, and the two gas flow rates.
Data from this experiment were used to train modified BP neural networks, which resulted in
improved prediction accuracy. The new technique modified the rule used to update network weights.
the learning rate. Equation 13.17 is called the generalized delta rule. Kim and May’s new K-step prediction
rule modified the generalized delta rule by using portions of previously stored weights in predicting the
next set of weights. The new update scheme is expressed as
$$w_{i,j,k}(n+1) = w_{i,j,k}(n) + \eta \, \Delta w_{i,j,k}(n) + \gamma_K \, w_{i,j,k}(n-K) \qquad (13.18)$$
The last term in this expression provides the network with long-term memory. The integer K determines
the number of sets of previous weights stored and the $\gamma_K$ factor allows the system to place varying degrees
of emphasis on weight sets from different training epochs. Typically, larger values of $\gamma_K$ are assigned to
more recent weight sets.
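A literal reading of Equation 13.18 for a single weight can be sketched as follows. The use of a fixed-length queue to hold the stored weights, and all numerical values, are assumptions made for illustration; only the form of the update comes from the equation.

```python
from collections import deque

def k_step_update(history, delta_w, eta, gamma_k):
    """Apply Equation 13.18 to one weight.

    history holds the K+1 most recent values of the weight, oldest first,
    so history[0] is w(n - K) and history[-1] is w(n).
    """
    w_next = history[-1] + eta * delta_w + gamma_k * history[0]
    history.append(w_next)        # deque(maxlen=K+1) discards the oldest entry
    return w_next

K = 2
history = deque([0.10, 0.12, 0.15], maxlen=K + 1)   # hypothetical w(n-2), w(n-1), w(n)
w_new = k_step_update(history, delta_w=-0.5, eta=0.1, gamma_k=0.05)
```

Here the new weight is 0.15 + 0.1 × (−0.5) + 0.05 × 0.10 = 0.105, so the stored weight from K epochs earlier contributes a small pull in addition to the usual gradient term.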
This memory-based weight update scheme was combined with a variation of simulated annealing. In
thermodynamics, annealing is the slow cooling procedure that enables nature to find the minimum
energy state. In neural network training, this is analogous to using the following function in place of the
[Equation: an exponential in −net (subscripted i,k) with parameters β, λ, and temperature T; the remaining terms are not recoverable.]
BP neural networks were trained using this procedure with data from the $2^4$ factorial array plus the
three center-point replications. The remaining axial trials from the central composite characterization
experiment were used as test data for the models. The annealed K-step training rule and the generalized
TiO2 deposition rate.
The first step in this hybrid modeling technique involves developing an analytical model. For TiO2
deposition via MOCVD, this was accomplished by applying the continuity equation to reactant concentration
as the reactant of interest is transported from the bulk gas and incorporated into the growing
film. Under these conditions and several key assumptions, the average deposition rate R for TiO2 is given
by
Equation (13.20)
where R is expressed in micrometers per hour, $T_{inlet}$ is the inlet gas temperature in kelvins, P is
the chamber pressure (mtorr), $P_e$ is the equilibrium vapor pressure of the precursor (mtorr), $P_0$ is the
total bubbler pressure (mtorr), ν is the carrier gas flow rate (in standard cm3/min), Q is the total flow
rate (in standard cm3/min), D is the diffusion coefficient of the reactant gas, δ is the boundary layer
thickness, and $K_D$ is the mass transfer coefficient given by $K_D$ = Ae
$$\Delta w_{i,j,k}(n) = -\frac{\partial E}{\partial w_{i,j,k}}$$
The gradient of the error with respect to the weights is calculated for one pair of input–output patterns
at a time. After each computation, a step is taken in the direction opposite to the error gradient, and the
procedure is iterated until convergence is achieved.
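The pattern-by-pattern procedure just described can be sketched in a few lines. The example is a minimal illustration, not the authors' implementation: it assumes a single hidden layer, sigmoid neurons without bias terms, and the squared-error measure of Section 13.2.1, and the network size, learning rate, and training pair are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pattern(w_hidden, w_out, x, d, eta):
    """One gradient-descent step on a single input-output pair.

    w_hidden[i][j]: weight from input j to hidden neuron i.
    w_out[i][j]:    weight from hidden neuron j to output neuron i.
    Returns the squared error measured before the weights are changed.
    """
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]

    # Output layer: delta = -(d - y) * y * (1 - y).
    delta_out = [-(di - yi) * yi * (1.0 - yi) for di, yi in zip(d, y)]
    # Hidden layer: delta = h * (1 - h) * sum_j delta_out[j] * w_out[j][i].
    delta_hid = [
        hi * (1.0 - hi) * sum(delta_out[j] * w_out[j][i] for j in range(len(w_out)))
        for i, hi in enumerate(h)
    ]

    # Step opposite to the gradient dE/dw = delta * (output of the source neuron).
    for i, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= eta * delta_out[i] * h[j]
    for i, row in enumerate(w_hidden):
        for j in range(len(row)):
            row[j] -= eta * delta_hid[i] * x[j]

    return 0.5 * sum((di - yi) ** 2 for di, yi in zip(d, y))

rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
w_out = [[rng.uniform(-1, 1) for _ in range(3)]]
pattern = ([1.0, 0.0], [0.9])          # hypothetical training pair

errors = [train_pattern(w_hidden, w_out, *pattern, eta=0.5) for _ in range(200)]
```

The hidden-layer deltas are computed before any weight is changed, so both layers are updated from the same gradient evaluation.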
In the hybrid approach, the network structure corresponding to the deposition of TiO2 by MOCVD
$$\frac{\partial E}{\partial w_{i,j,k}} = \frac{\partial E}{\partial R_p} \, \frac{\partial R_p}{\partial out_{i,k}} \, \frac{\partial out_{i,k}}{\partial w_{i,j,k}}$$
), and the third is the same as that of standard BP. The second partial derivative is
computed individually for each unknown parameter to be estimated. Referring to Equation 13.20, the
partial derivative of $R_p$ with respect to activation energy is
Equation (13.23)
The partial derivatives for the other two parameters are computed similarly, and after error minimization,
values of the three parameters for the TiO2 MOCVD process are known explicitly.
Because hybrid neural networks rely on network training to predict only portions of a physical model,
they require less training data. The hybrid network developed by Nami et al. [1997] was trained using
only 11 training experiments. A three-layer neural network with six inputs, eight hidden neurons, and
three outputs was the best network architecture for this case. After error minimization, the values of the
diffusion coefficient, pre-exponential constant, and activation energy were $2.5 \times 10^{-6}$ m2/s, 1.04 m/s, and
5622 cal/mol, respectively. Once trained, the hybrid neural network was subsequently used to predict the
deposition rate for five additional MOCVD runs, which constituted a test data set not part of the original
experiment. The RMS error of the deposition rate model predictions using the estimated parameters for
the five test vectors was only 0.086 µm/h. The hybrid neural network approach, therefore, represents a
general-purpose methodology for deriving semi-empirical neural process models that take into account
underlying process physics.
13.3.2.2.2 The Model Transfer Approach
Model transfer techniques attempt to modify physically based neural network process models to reflect
specific pieces of processing equipment. Marwah and Mahajan [1999] proposed model transfer
approaches for modeling a horizontal CVD reactor used in the epitaxial growth of silicon. The goal was
to develop an equipment model that incorporated process physics, but was economical to build. The
techniques investigated included (i) the difference method, in which a neural network was trained on the
difference between the existing physical model (or “source” model) and equipment data; (ii) the source
weights method, in which the final weights of the source model were used as initial weights of the modified
model; and (iii) the source input method, in which the source model output was used as an additional
input to the modified network.
The starting point for model transfer was the development of a physical neural network (PNM) model
trained on 98 data points generated from a process simulator utilizing first principles. Training data were
obtained by running the simulator for various combinations of input parameters (i.e., inlet silane
concentration, inlet velocity, susceptor temperature, and downstream position) using a statistically designed
experiment. The numerical data were then split into 73 training vectors and 25 test vectors, and the physical
neural network source model was trained using BP to predict silicon growth rate and tested against the
validation points for the desired accuracy. The average relative training and testing errors obtained were
1.55% and 1.65%, respectively. The source model was then modified by training a new neural network
with 25 extra experimentally derived data points obtained from a central composite experiment.
In the difference method, the modified neural network model was trained on the difference between
the source and equipment data (see Figure 13.11(a)). The inherent expectation was that if this difference
was a simpler function of the inputs as compared to the pure equipment data, then fewer equipment
data points would be required to build an accurate model. In the source weights method, the source
model was retrained using the equipment data as test data. The final weights of the source model were
then used as the initial weights of the modified model. The rationale for this approach was that training
the source network with the experimental data as test data captures the common features of the source
and final modified models. For the source input method, the source model is used as an additional input
to the modified network (Figure 13.11(b)). Since the source model should be close to the final modified
model, the source output should be some internal representation of the input data, which should be
technique was approximately 25% of that required to develop a complete equipment model from scratch.
Furthermore, the source model can be reused for developing additional models of other similar equipment.
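The idea behind the difference method can be illustrated with a deliberately simplified sketch. A least-squares straight line stands in for the difference network, and both the source model and the equipment measurements are hypothetical; the point is only that the correction is trained on the gap between equipment data and source-model predictions, then added back to the source prediction.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x (stand-in for the difference network)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def source_model(x):
    """Hypothetical source model, standing in for the trained PNM."""
    return 2.0 * x

# Hypothetical equipment measurements that deviate from the source model.
xs = [0.0, 1.0, 2.0, 3.0]
measured = [0.5, 2.7, 4.9, 7.1]

# Difference method: train only on (equipment data - source prediction).
residuals = [y - source_model(x) for x, y in zip(xs, measured)]
a, b = fit_line(xs, residuals)

def equipment_model(x):
    """Source prediction plus the learned difference."""
    return source_model(x) + a + b * x
```

In this toy case the difference is a much simpler function of the input than the measurements themselves, which is the situation in which the method is expected to need fewer equipment data points.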
13.3.2.3 Process Modeling Using Modular Neural Networks
Natale et al. [1999] applied modular neural networks to develop a model of atmospheric pressure CVD
(APCVD) of doped silicon dioxide films, a critical step in dynamic random access memory (DRAM)
chip fabrication at the Texas Instruments fabrication facility in Avezzano, Italy. Modular neural networks
consist of a group of subnetworks, or modules, competing to learn different aspects of a problem. As
shown in Figure 13.12(a), a “gating” network is applied to control the competition by assigning different
regions of the input data space to different local modules. The gating network has as many outputs as
FIGURE 13.11 Schematic of two model modifiers: (a) difference method; and (b) source input method. (Source:
Marwah, M. and Mahajan, R., 1999. Building Equipment Models Using Neural Network Models and Model Transfer
Techniques, IEEE Trans. Semi. Manuf., 12(3):377-380. With permission.)
[Figure labels: input, modules 1 to n, gating network, summed output. Modular neural network inputs: chamber pressure, injector temperatures, gas flows, butterfly valve position, muffle temperatures; outputs: P weight, B weight, thickness.]
13.4.1 Network Optimization
The problem of optimizing network structure and learning parameters has been addressed by Kim and
May [1994] for plasma etch modeling and Han and May [1996] in modeling plasma-enhanced CVD.
4 plasma.
To develop the optimal neural process model, these researchers designed a D-optimal experiment [Galil
and Kiefer, 1980] to investigate the effect of six factors: the number of hidden layers, the number of
neurons per hidden layer, training tolerance, initial weight range, learning rate, and momentum. This
experiment determined how the structural and learning factors affect network performance and provided
an optimal set of parameters for a given set of performance metrics. The network responses optimized
were learning capability, predictive capability, and training time. The experiment consisted of two stages.
In the first stage, statistical experimental design was employed to fully characterize the behavior of the
etch process [May et al., 1991]. Etch rate data from these trials were used to train neural process models.
Once trained, the models were used to predict the etch rate for 12 test wafers. Prediction error for these
wafers was also computed, and these two measures of network performance, along with training time,
were used as experimental responses to optimize the neural etch rate model as the structural and learning
parameters were varied in the second stage (which consisted of the D-optimal design).
13.4.1.1.1 Individual Network Parameter Optimization
Independent optimization of each performance characteristic was then performed with the objective of
minimizing training error, prediction error, and training time. A constrained multicriteria optimization
technique based on the Nelder–Mead simplex search algorithm was implemented to do so. The optimal
parameter set was first found for each criterion individually, irrespective of the optimal set for the other
two. The results of the independent optimization are summarized in Table 13.4.
Several interesting interactions and trade-offs between the various parameters emerged in this study.
One such trade-off can be visualized in two-dimensional contour plots such as those in Figures 13.13
and 13.14. Figure 13.13 plots training error against training tolerance and initial weight range with all
other parameters set at their optimal values. Learning capability improves with decreased tolerance and
wider weight distribution. Intuitively, the first result can be attributed to the increased precision required
by a tight tolerance. Figure 13.14 plots network prediction error vs. the same variables as in Figure 13.13.
As expected, optimum prediction is observed at high tolerance and narrow initial weight distribution.
The latter result implies that the interaction between neurons within the restricted weight space during
training is a primary stimulus for improving prediction. Thus, although learning degrades with a wider
= 1. Optimization was
performed on the overall cost function. The results of this collective optimization appear in Table 13.5.
The parameter values in this table yield the minimum cost according to Equation 13.24. This combination
resulted in a training error of 412 Å/min, a prediction error of 340 Å/min, and a training time of 292 s.
Although this represents only marginal performance, these values may be further tuned by adjusting the
cost function constants K
i
and the optimization constraints until suitable performance is achieved.
13.4.1.2 Network Optimization Using Genetic Algorithms
Although Kim and May had success with designed experiments and simplex search to optimize BP neural
network learning, the effectiveness of the simplex method depends on its initial search point. With an
improper starting point, performance degrades, and the algorithm is likely to be trapped in local optima.
Theoretical analyses suggest that genetic algorithms quickly locate high-performance regions in extremely
TABLE 13.4 Independently Optimized Network Inputs

Parameter               Training Error   Prediction Error   Training Time
Hidden Layers           1                1                  1
Neurons/Hidden Layer    6                9                  3
Training Tolerance      0.08             0.13               0.09
Initial Weight Range    ±2.00            ±1.04              ±1.00
Learning Rate           2.78             2.80               0.81
Momentum                0.35             0.35               0.95
Optimal Value           239 Å/min        162 Å/min          37.3 s

Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching, IEEE Trans.
Semi. Manuf., 7(1):12-21. With permission.
$$\text{Cost} = K_1 \sigma_t^2 + K_2 \sigma_p^2 + K_3 T^2 \qquad (13.24)$$
rate = 2.8, momentum = 0.35, number of hidden neurons = 6, number of hidden layers = 1). Optimum prediction
occurs at high tolerance and narrow initial weight distribution. (Source: Kim, B. and May, G., 1994. An Optimal
Neural Network Process Model for Plasma Etching, IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.)
[Figure 13.14: contour plot of prediction error (Å/min) versus training tolerance (0.08 to 0.13) and initial weight range (1.0 to 2.0), with learning rate = 2.8, momentum = 0.35, neuron number = 9, and layer number = 1.]
was implemented:
Equation (13.25)
where $\sigma_t$ is the RMS training error, $\sigma_p$ is the RMS prediction error, and $K_1$ and $K_2$ represent the relative
importance of each performance measure. The values chosen for these constants were $K_1$ = 1 and $K_2$ = 10.
The desired output was reflected by the following fitness function:
Equation (13.26)
Maximization of F continued until a final solution was selected after 100 generations. If the optimal
solution was not found, the solution with the best fitness value was selected.
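The performance index and fitness function can be sketched as below, reading Equation 13.25 as PI = K1·σt² + K2·σp² and Equation 13.26 as F = 1/(1 + PI). The constants K1 = 1 and K2 = 10 are those given in the text; the candidate networks and their RMS errors are hypothetical.

```python
def performance_index(sigma_t: float, sigma_p: float, k1: float = 1.0, k2: float = 10.0) -> float:
    """Weighted sum of squared RMS training and prediction errors (Equation 13.25)."""
    return k1 * sigma_t ** 2 + k2 * sigma_p ** 2

def fitness(pi: float) -> float:
    """Equation 13.26: maps PI in [0, inf) onto F in (0, 1]."""
    return 1.0 / (1.0 + pi)

# Hypothetical RMS errors (training, prediction) for two candidate networks.
candidates = {"net_a": (0.20, 0.10), "net_b": (0.05, 0.15)}
scores = {name: fitness(performance_index(*errs)) for name, errs in candidates.items()}
best = max(scores, key=scores.get)
```

With K2 ten times K1, the network with the smaller prediction error wins here even though its training error is four times larger, which reflects the emphasis the constants place on prediction.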
13.4.1.2.1 Optimization of Individual Responses
Individual response neural network models were trained to predict PECVD silicon dioxide permittivity,
refractive index, residual stress, nonuniformity, and impurity (H2O and SiOH) concentration. The
result of genetically optimizing these neural process models is shown in Table 13.6. Analogous results
for network optimization by the simplex method are given in Table 13.7. Examination of Tables 13.6 and
13.7 shows that the most significant differences between the two optimization algorithms occur in the
Momentum 0.35
Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching,
IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.
$$PI = K_1 \sigma_t^2 + K_2 \sigma_p^2 \qquad (13.25)$$
$$F = \frac{1}{1 + PI} \qquad (13.26)$$