Advances in Sound Localization Part 2 potx - Pdf 14

Direction-Selective Filters for Sound Localization
When the quality factor is 10, the parameter $a$ of the prototype filter is 1.105. The discriminating function of the filter is given by Eq. (30). The function has a value of 1 at $\psi = 0$. The beamwidth of the prototype filter is obtained by equating Eq. (30) to $1/\sqrt{2}$, solving for $\psi$, and multiplying by 2. The result is

$\mathrm{BW}_{3\,\mathrm{dB}} = 2\psi = 2\cos^{-1}\left[(1-\sqrt{2})\,a + \sqrt{2}\right]$ (45)
For the case $a = 1.105$, the beamwidth is 33.9°. This is in sharp contrast to the beamwidth of the maximum DI vector sensor, which is 104.9°.
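
The beamwidth figure quoted above can be checked numerically. The following is a minimal sketch, assuming Eq. (45) reads $\mathrm{BW}_{3\,\mathrm{dB}} = 2\cos^{-1}[(1-\sqrt{2})a + \sqrt{2}]$; the function name `beamwidth_3db_deg` and the clamping of the cosine are illustrative choices, not part of the original text.

```python
import math


def beamwidth_3db_deg(a: float) -> float:
    """3-dB beamwidth in degrees for filter parameter a > 1 (assumed form of Eq. 45)."""
    if a <= 1.0:
        raise ValueError("the filter parameter a must exceed 1")
    cos_psi = (1.0 - math.sqrt(2.0)) * a + math.sqrt(2.0)
    # For large a the response never falls by 3 dB; clamping then reports 360 degrees.
    cos_psi = max(-1.0, min(1.0, cos_psi))
    return 2.0 * math.degrees(math.acos(cos_psi))


print(round(beamwidth_3db_deg(1.105), 1))  # 33.9
```

For $a = 1.105$ the sketch reproduces the 33.9° stated in the text.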

$g(\psi \mid \mathbf{u}_L) = \cdots = \sum_{j=1}^{\nu} \dfrac{K_j}{a_j - \cos\psi}$ (47)
The function specified by Eq. (47) may be realized by a parallel interconnection of $\nu$ prototype filters (with $\gamma = 0$). Each component of the above expansion has the form of Eq. (30). Normalizing the discriminating function such that it has a value of 1 at $\psi = 0$ yields

$\sum_{i=1}^{\nu} \dfrac{K_i}{a_i - 1} = 1$ (48)

=
∑∑
(50)
where

$g_{ii} = \dfrac{1}{a_i^2 - 1}$ (51)

$g_{ij} = \dfrac{1}{a_j - a_i}\coth^{-1}\!\left(\dfrac{a_i a_j - 1}{a_j - a_i}\right), \quad i \neq j$

$\mathbf{K}' = \left[K_1 \;\; K_2 \;\; \cdots \;\; K_\nu\right]$ (54)

$\mathbf{U}' = \left[\dfrac{1}{a_1 - 1} \;\; \dfrac{1}{a_2 - 1} \;\; \cdots \;\; \dfrac{1}{a_\nu - 1}\right]$ (55)
and $\mathbf{G}$ is the matrix containing the elements $g_{ij}$. Utilizing the Method of Lagrange

$\mathrm{DI}_{\max} = -10\log_{10}\left(\mathbf{U}'\mathbf{G}^{-1}\mathbf{U}\right)^{-1}$ (58)
3.2 An example: a second-degree rational discriminating function
As an example of applying the contents of the previous section, consider the proper rational function of the second degree,

$g(\psi \mid \mathbf{u}_L) = \dfrac{d_0 + d_1\cos\psi}{c_0 + c_1\cos\psi + \cos^2\psi} = \dfrac{K_1}{a_1 - \cos\psi} + \dfrac{K_2}{a_2 - \cos\psi}$

$a_1 = 1.105$, and let $a_2 = 1.200$. The values of the matrices $\mathbf{G}$ and $\mathbf{U}$ are given by

$\mathbf{G} = \begin{bmatrix} 4.5244 & 3.1590 \\ 3.1590 & 2.2727 \end{bmatrix}$ (61)

$\mathbf{U} = \begin{bmatrix} 9.5238 \\ 5.0000 \end{bmatrix}$ (62)
If Eqs. (56) and (58) are used to compute $\mathbf{K}$ and $\mathrm{DI}_{\max}$

==−
(65)
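
The numbers in this example can be reproduced with a short script. The sketch below assumes that the elements of $\mathbf{G}$ are $g_{ii} = 1/(a_i^2 - 1)$ and $g_{ij} = \coth^{-1}[(a_i a_j - 1)/(a_j - a_i)]/(a_j - a_i)$, that $\mathbf{U}$ has elements $1/(a_i - 1)$, and that the optimal gain vector is the usual Lagrange-multiplier solution $\mathbf{K} = \mathbf{G}^{-1}\mathbf{U}/(\mathbf{U}'\mathbf{G}^{-1}\mathbf{U})$; Eq. (56) itself is not reproduced here, so that last form is an assumption. All function names are illustrative.

```python
import math


def acoth(x: float) -> float:
    """Inverse hyperbolic cotangent, valid for |x| > 1."""
    return 0.5 * math.log((x + 1.0) / (x - 1.0))


def g_entry(ai: float, aj: float) -> float:
    """Element g_ij of the matrix G for pole parameters ai, aj > 1."""
    if ai == aj:
        return 1.0 / (ai * ai - 1.0)
    return acoth((ai * aj - 1.0) / (aj - ai)) / (aj - ai)


def optimize_two_poles(a1: float, a2: float):
    """Return the gains (K1, K2) and the maximum directivity index in dB."""
    g11, g12, g22 = g_entry(a1, a1), g_entry(a1, a2), g_entry(a2, a2)
    u1, u2 = 1.0 / (a1 - 1.0), 1.0 / (a2 - 1.0)
    det = g11 * g22 - g12 * g12
    # x = G^{-1} U, written out explicitly for the 2x2 case
    x1 = (g22 * u1 - g12 * u2) / det
    x2 = (g11 * u2 - g12 * u1) / det
    d_max = u1 * x1 + u2 * x2  # U' G^{-1} U
    gains = (x1 / d_max, x2 / d_max)  # satisfies K1*u1 + K2*u2 = 1
    return gains, 10.0 * math.log10(d_max)


gains, di_max = optimize_two_poles(1.105, 1.200)
print(gains, round(di_max, 1))
```

With $a_1 = 1.105$ and $a_2 = 1.200$ the script returns a maximum directivity index of about 17.8 dB, the value quoted later in this section.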
Figure 2 illustrates the discriminating function specified by Eqs. (59) and (65). Also shown (as a dashed line) for comparison is the discriminating function of Fig. 1. The dashed-line plot represents a discriminating function that is a rational function of degree one, whereas the solid-line plot corresponds to a discriminating function that is a rational function of degree two. The latter function decays more quickly, having a 3-dB down beamwidth of 22.6° as compared to a 3-dB down beamwidth of 33.9° for the former function.

Fig. 2. Plots of the discriminating function of the examples presented in Sections 2.3 and 3.2.

In order to see what directivity index is achievable with a second-degree discriminating function, it is useful to consider the second-degree discriminating function of Eq. (59) with equal roots in the denominator, that is, $c_0 = a^2$, $c_1 = -2a$. It is shown in a technical report by the author (2010c) that the maximum directivity index for this discriminating function is equal to

max
1
4

31
4
a
da
−
=
−
(68)
Note that the directivity given by Eq. (66) is four times the directivity given by Eq. (38).
Analogous to Eqs. (42) and (43), the maximum directivity index can be expressed as

$\mathrm{DI}_{\max} = 6 + 10\log_{10}\sqrt{1 + 4Q^2}\ \mathrm{dB} \approx 9 + 10\log_{10} Q\ \mathrm{dB}$ (69)
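
As a quick numerical check, the sketch below evaluates Eq. (69), read here as $\mathrm{DI}_{\max} = 6 + 10\log_{10}\sqrt{1 + 4Q^2}$ dB with the approximation $9 + 10\log_{10} Q$ dB. The function name is hypothetical.

```python
import math


def di_max_two_equal_poles(q: float):
    """Return (exact, approximate) maximum DI in dB for quality factor q > 0."""
    exact = 6.0 + 10.0 * math.log10(math.sqrt(1.0 + 4.0 * q * q))
    approx = 9.0 + 10.0 * math.log10(q)
    return exact, approx


for q in (1.0, 10.0):
    exact, approx = di_max_two_equal_poles(q)
    print(q, round(exact, 1), round(approx, 1))
```

At $Q = 10$ both expressions give about 19 dB; at $Q = 1$ the exact expression gives about 9.5 dB, where the large-$Q$ approximation is less accurate.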
For $a_1 = 1.105$, $Q = 10$ and the maximum directivity index is 19 dB, which is a 6 dB improvement over that of the first-degree discriminating function of Eq. (30). In the example presented in this section, $a_1 = 1.105$, $a_2 = 1.200$, and $\mathrm{DI}_{\max} = 17.8$ dB. As $a_2$ moves closer to $a_1$

$H(z) = \dfrac{1 - \rho}{1 - \rho z^{-1}}$ (70)
where $\rho$ is real, positive and less than 1. Equation (70) corresponds to a causal, stable discrete-time system. The digital frequency $\omega$ is not to be confused with the analog frequency $\omega$ appearing in previous sections. The magnitude-squared response of this system is obtained from Eq. (70) as

$\left|H(e^{j\omega})\right|^2 = \dfrac{1 - 2\rho + \rho^2}{1 - 2\rho\cos\omega + \rho^2}$ (71)
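
The closed form of Eq. (71) can be compared with a direct evaluation on the unit circle. This is a minimal sketch that assumes the first-order system has the transfer function $H(z) = (1-\rho)/(1-\rho z^{-1})$, which is the form consistent with Eq. (71); the function names are illustrative.

```python
import cmath
import math


def mag_sq_direct(rho: float, omega: float) -> float:
    """|H(e^{j omega})|^2 evaluated from the assumed transfer function."""
    z_inv = cmath.exp(-1j * omega)
    h = (1.0 - rho) / (1.0 - rho * z_inv)
    return abs(h) ** 2


def mag_sq_closed_form(rho: float, omega: float) -> float:
    """Closed-form magnitude-squared response of Eq. (71)."""
    return (1.0 - 2.0 * rho + rho * rho) / (1.0 - 2.0 * rho * math.cos(omega) + rho * rho)


for omega in (0.0, 0.5, 1.5, math.pi):
    print(omega, mag_sq_direct(0.8, omega), mag_sq_closed_form(0.8, omega))
```

Both functions return 1 at $\omega = 0$ and agree to rounding error elsewhere.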
Letting e
σ

$g(\psi \mid \mathbf{u}_L) = \left|H(e^{j\psi})\right|^2$ (73)
To illustrate the process, consider the magnitude-squared response of a low pass
Butterworth filter of order 2, which has the magnitude-squared function

$\left|H(e^{j\omega})\right|^2 = \dfrac{1}{1 + \left[\dfrac{\tan(\omega/2)}{\tan(\omega_c/2)}\right]^4}$

$\left|H(e^{j\omega})\right|^2 = \dfrac{\alpha\,(1 + \cos\omega)^2}{\alpha\,(1 + \cos\omega)^2 + (1 - \cos\omega)^2}$ (76)
where

$\alpha = \tan^4\!\left(\dfrac{\omega_c}{2}\right) = \left(\dfrac{1 - \cos\omega_c}{1 + \cos\omega_c}\right)^2$

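$\left|H(e^{j\omega})\right|^2 = \dfrac{1 - \cos\theta}{2}\cdot\dfrac{1 + 2\cos\omega + \cos^2\omega}{1 - 2\cos\theta\cos\omega + \cos^2\omega}$ (78)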
where

$\cos\theta = \dfrac{2\cos\omega_c}{1 + \cos^2\omega_c}$ (79)
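By replacing $\omega$ by $\psi$ in Eq. (78), one obtains the discriminating function

$g(\psi \mid \mathbf{u}_L) = \dfrac{1 - \cos\theta}{2}\cdot\dfrac{1 + 2\cos\psi + \cos^2\psi}{1 - 2\cos\theta\cos\psi + \cos^2\psi}$ (80)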

discriminating function is equal to the magnitude-squared function of the Butterworth filter. The discriminating function of Fig. 3 can be said to be providing a “maximally-flat beam” of order 2 in the look direction $\mathbf{u}_L$. Equation (80) cannot be realized by a parallel interconnection of first-order prototype filters because the roots of the denominator of Eq. (80) are complex. Its realization requires the development of a second-order prototype filter, which is the focus of current research.
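
The equivalence between the Butterworth magnitude-squared function and the rational discriminating function can be verified numerically. The sketch below assumes the second-order Butterworth form $1/\{1 + [\tan(\omega/2)/\tan(\omega_c/2)]^4\}$ and the rational form $\frac{1-\cos\theta}{2}\cdot\frac{1 + 2\cos\psi + \cos^2\psi}{1 - 2\cos\theta\cos\psi + \cos^2\psi}$ with $\cos\theta$ from Eq. (79); function names are illustrative.

```python
import math


def butterworth_mag_sq(omega: float, omega_c: float) -> float:
    """Magnitude-squared response of a second-order digital Butterworth low-pass filter."""
    ratio = math.tan(omega / 2.0) / math.tan(omega_c / 2.0)
    return 1.0 / (1.0 + ratio ** 4)


def discriminating_function(psi: float, omega_c: float) -> float:
    """Rational form in cos(psi), with cos(theta) given by Eq. (79)."""
    x = math.cos(omega_c)
    cos_theta = 2.0 * x / (1.0 + x * x)
    c = math.cos(psi)
    return 0.5 * (1.0 - cos_theta) * (1.0 + 2.0 * c + c * c) / (1.0 - 2.0 * cos_theta * c + c * c)


omega_c = math.radians(20.0)  # hypothetical value of the single design parameter
for deg in (0, 10, 20, 40, 90, 180):
    psi = math.radians(deg)
    print(deg, butterworth_mag_sq(psi, omega_c), discriminating_function(psi, omega_c))
```

Both expressions equal 1 at $\psi = 0$ and 1/2 at $\psi = \omega_c$, which illustrates how the single parameter $\omega_c$ sets the width of the beam.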
4. Summary and future research
4.1 Summary
The objective of this paper is to improve the directivity index, beamwidth, and the flexibility
of spatial filters by introducing spatial filters having rational discriminating functions. A
first-order prototype filter has been presented which has a rational discriminating function
of degree one. By interconnecting prototype filters in parallel, a rational discriminating
function can be created which has real distinct simple poles. As brought out by Eq. (33), a negative aspect of the prototype filter is the appearance at the output of a spurious frequency whose value is equal to the input frequency divided by the parameter $a$ of the filter, where $a > 1$. Since the directivity of the filter is inversely proportional to $a - 1$, there exists a tension as $a$ approaches 1 between an arbitrarily increasing directivity $D$ and destructive interference between the real and spurious frequencies. The problem was alleviated by placing a temporal bandpass filter at the output of the prototype filter and

Fig. 3. Discriminating function of Eq. (80).

$Q = 1$ to 19.0 dB at $Q = 10$. The beamwidth varies from 63.2° at $Q = 1$ to 19.7° at $Q = 10$.
The directivity index and beamwidth of the two-equal-poles discriminating function at $Q = 1$ are essentially the same as those of the dyadic sensor. But as the quality factor increases,
the directivity index goes up while the beamwidth goes down. It is important to note that
the curves in Fig. 4 are theoretical curves. In any practical implementation, one may be
required to operate at the lower end of each curve. However, the performance will still be an
improvement over that of a dyadic sensor. The two-equal-poles case cannot be realized
exactly by first-order prototype filters, but the implementation presented in Section 3.2
comes arbitrarily close. Finally, in Section 3.3 it was shown that discriminating functions can
be derived from the magnitude-squared response of digital filters. This allows a great deal
of flexibility in the design of discriminating functions. For example, Section 3.3 used the
magnitude-squared response of a second-order Butterworth digital filter to generate a discriminating
function that provides a “maximally-flat beam” centered in the look direction. The
beamwidth is controlled directly by a single parameter.
4.2 Future research
Many rational discriminating functions, specifically those with complex-valued poles and
multiple-order poles, cannot be realized as parallel interconnections of first-order prototype
filters. Examples of such discriminating functions appear in Figs. 2 and 3. Research is
underway involving the development of a second-order temporal-spatial filter having the
prototypical beampattern

$g(\psi \mid \mathbf{u}_L) = \dfrac{d_0 + d_1\cos\psi}{1 + c_1\cos\psi + c_2\cos^2\psi}$ (82)

Fig. 4. DI and beamwidth as a function of Q.
With the second-order prototype in place, the discriminating function of Eq. (80), as an
example, can be realized by expressing it as a partial fraction expansion and connecting in
parallel two prototypal filters. For the first, $d_0 = (1 - \cos\theta)/2$ and $d_1 = c_1 = c_2 = 0$

specifically the full range of robustness properties typical for these filters (Fettweis, 1990). Of
special interest in the filter implementation process is the length of the aperture. The goal is
to achieve a particular directivity index and beamwidth with the smallest possible aperture
length. Another important area for future research is studying the effect of noise (both
ambient and system noise) on the filtering process. The fact that the prototypal filter tends to
act as an integrator should help soften the effect of uncorrelated input noise to the filter.
Finally, upcoming research will also include the array gain (Burdic, 1991) of the filter
prototype for the case of anisotropic noise (Buckingham, 1979a,b; Cox, 1973). This paper
considered the directivity index which is the array gain for the case of isotropic noise.
5. References
Bienvenu, G. & Kopp, L. (1980). Adaptivity to background noise spatial coherence for high resolution passive methods, Int. Conf. on Acoust., Speech and Signal Processing, pp. 307-310.
Bilbao, S. (2004). Wave and Scattering Methods for Numerical Simulation, John Wiley and Sons, ISBN 0-470-87017-6, West Sussex, England.
Bresler, Y. & Macovski, A. (1986). Exact maximum likelihood parameter estimation of superimposed exponential signals in noise, IEEE Trans. ASSP, Vol. ASSP-34, No. 5, pp. 1361-1375.
Buckingham, M. J. (1979a). Array gain of a broadside vertical line array in shallow water, J. Acoust. Soc. Am., Vol. 65, No. 1, pp. 148-161.
Buckingham, M. J. (1979b). On the response of steered vertical line arrays to anisotropic noise, Proc. R. Soc. Lond. A, Vol. 367, pp. 539-547.
Burdic, W. S. (1991). Underwater Acoustic System Analysis, Prentice-Hall, ISBN 0-13-947607-5, Englewood Cliffs, New Jersey, USA.

Processing, Vol. 3, pp. 7-24, Kluwer Academic Publishers, Boston.
Fettweis, A. & Nitsche, G. (1991b). Transformation approach to numerically integrating PDEs by means of WDF principles, Multidimensional Systems and Signal Processing, Vol. 2, pp. 127-159, Kluwer Academic Publishers, Boston.
Hawkes, M. & Nehorai, A. (1998). Acoustic vector-sensor beamforming and Capon direction estimation, IEEE Trans. Signal Processing, Vol. 46, No. 9, pp. 2291-2304.
Hawkes, M. & Nehorai, A. (2000). Acoustic vector-sensor processing in the presence of a reflecting boundary, IEEE Trans. Signal Processing, Vol. 48, No. 11, pp. 2981-2993.
Hines, P. C. & Hutt, D. L. (1999). SIREM: an instrument to evaluate superdirective and intensity receiver arrays, Oceans 1999, pp. 1376-1380.
Hines, P. C.; Rosenfeld, A. L.; Maranda, B. H. & Hutt, D. L. (2000). Evaluation of the endfire response of a superdirective line array in simulated ambient noise environments, Proc. Oceans 2000, pp. 1489-1494.
Johnson, C. (2009). Numerical Solution of Partial Differential Equations by the Finite-Element Method, Dover Publications, ISBN-13 978-0-486-46900-3, Mineola, New York, USA.
Krim, H. & Viberg, M. (1996). Two decades of array signal processing research, IEEE Signal Processing Magazine, Vol. 13, No. 4, pp. 67-94.

Antennas and Propagation, Vol. AP-34, No. 3, pp. 276-280.
Silvia, M. T. (2001). A theoretical and experimental investigation of acoustic dyadic sensors, SITTEL Technical Report No. TP-4, SITTEL Corporation, Ojai, CA.
Silvia, M. T.; Franklin, R. E. & Schmidlin, D. J. (2001). Signal processing considerations for a general class of directional acoustic sensors, Proc. of the Workshop of Directional Acoustic Sensors, Newport, RI.
Van Veen, B. D. & Buckley, K. M. (1988). Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, Vol. 5, No. 2, pp. 4-24.
Wong, K. T. & Zoltowski, M. D. (1999). Root-MUSIC-based azimuth-elevation angle-of-arrival estimation with uniformly spaced but arbitrarily oriented velocity hydrophones, IEEE Trans. Signal Processing, Vol. 47, No. 12, pp. 3250-3260.
Wong, K. T. & Zoltowski, M. D. (2000). Self-initiating MUSIC-based direction finding in underwater acoustic particle velocity-field beamspace, IEEE Journal of Oceanic Engineering, Vol. 25, No. 2, pp. 262-273.
Wong, K. T. & Chi, H. (2002). Beam patterns of an underwater acoustic vector hydrophone located away from any reflecting boundary, IEEE Journal of Oceanic Engineering, Vol. 27, No. 3, pp. 628-637.
Ziomek, L. J. (1995). Fundamentals of Acoustic Field Theory and Space-Time Signal Processing, CRC Press, ISBN 0-8493-9455-4, Boca Raton, Ann Arbor, London, Tokyo.
Zou, N. & Nehorai, A. (2009). Circular acoustic vector-sensor array for mode beamforming,

speech using a clean speech model without having to rely on user utterance texts, where a
GMM (Gaussian Mixture Model) is used to model clean speech features. This estimation is
performed in the cepstral domain employing an approach based upon maximum likelihood.
This is possible because the cepstral parameters are an effective representation for retaining
useful clean speech information. The results of our talker-localization experiments show the
effectiveness of our method.

Single-Channel Sound Source Localization
Based on Discrimination of Acoustic
Transfer Functions
3
[Figure 1 diagram labels: Observed speech from each position (single mic., each training position); Estimation of the frame sequence data of the acoustic transfer function using the clean speech model; GMMs for each position]
Fig. 1. Training process for the acoustic transfer function GMM
2. Estimation of the acoustic transfer function
2.1 System overview
Figure 1 shows the training process for the acoustic transfer function GMM. First, we record the reverberant speech data $O^{(\theta)}$ from each position $\theta$ in order to build the GMM of the acoustic transfer function for $\theta$. Next, the frame sequence of the acoustic transfer function $\hat{H}^{(\theta)}$ is estimated from the reverberant speech $O^{(\theta)}$ (any utterance) using the clean-speech acoustic model, where a GMM is used to model the clean speech feature:
$\hat{H}^{(\theta)} = \operatorname*{argmax}_{H} \Pr(O^{(\theta)} \mid H, \lambda_S)$. (1)
Here, $\lambda_S$ denotes the set of GMM parameters for clean speech, while the suffix $S$ represents

|λ
(θ)
H
), (2)
where $\lambda_H^{(\theta)}$ denotes the estimated acoustic transfer function GMM for direction $\theta$ (location).
[Figure diagram labels: Reverberant speech (single mic., user's test position); $\operatorname{argmax} \Pr(\hat{H} \mid \cdot)$]

$\cdots = \sum_{l=0}^{L-1} s(t - l)\,h(l)$ (3)
where $s(t)$ is a clean speech signal and $h(l)$ is an acoustic transfer function (room impulse response) from the sound source to the microphone. The length of the acoustic transfer function is $L$. The spectral analysis of the acoustic modeling is generally carried out using short-term windowing. If the length $L$ is shorter than that of the window, the observed complex spectrum is generally represented by
$O(\omega; n) = S(\omega; n) \cdot H(\omega; n)$. (4)
However, since the length of the acoustic transfer function is greater than that of the window, the observed spectrum is approximately represented by $O(\omega; n) \approx S(\omega; n) \cdot H(\omega; n)$. Here $O(\omega; n)$, $S(\omega; n)$, and $H(\omega; n)$ are the short-term linear complex spectra in analysis window $n$. Applying the logarithm transform to the power spectrum, we get
$\log |O(\omega; n)|^2 \approx \log |S(\omega; n)|^2 + \log |H(\omega; n)|^2$. (5)
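
The additive relation in the log power spectrum can be illustrated with a toy example. The sketch below uses a circular convolution within a single frame, an idealization chosen so that the product in Eq. (4) holds exactly; for real windowed speech with a long impulse response the relation is only approximate, as stated above. The signal values and function names are arbitrary illustrations.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_convolve(s, h):
    """Circular convolution of frame s with a shorter impulse response h."""
    n = len(s)
    return [sum(s[(t - l) % n] * h[l] for l in range(len(h))) for t in range(n)]


s = [0.9, -0.4, 0.3, 0.7, -0.2, 0.5, -0.6, 0.1]  # toy "clean speech" frame
h = [1.0, 0.5, 0.25]                             # toy impulse response
o = circular_convolve(s, h)

S, O = dft(s), dft(o)
H = dft(h + [0.0] * (len(s) - len(h)))

log_observed = [math.log(abs(v) ** 2) for v in O]
log_sum = [math.log(abs(a) ** 2) + math.log(abs(b) ** 2) for a, b in zip(S, H)]
max_gap = max(abs(x - y) for x, y in zip(log_observed, log_sum))
print(max_gap)
```

The printed gap is at the level of floating-point rounding, which shows that under this idealization the transfer function appears as an additive term in the log spectrum.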
In speech recognition, cepstral parameters are an effective representation when it comes to
retaining useful speech information. Therefore, we use the cepstrum for acoustic modeling
that is necessary to estimate the acoustic transfer function. The cepstrum of the observed
signal is given by the inverse Fourier transform of the log spectrum:
O
ce p

[Figure 3 plot labels: axes are cepstral coefficients (MFCC, n-th order); horizontal axis of one panel is the length of impulse response [msec]; legend entries in deg; panels labelled by msec and "No reverberation"]
Fig. 3. Difference between acoustic transfer functions obtained by subtraction of short-term-analysis-based speech features in the cepstrum domain
2.3 Difference of acoustic transfer functions
Figure 3 shows the mean values of the cepstrum, H


cepstral coefficients for 216 words are plotted in Figure 3. As shown in this figure (300 msec), a difference between the two acoustic transfer functions (30 and 90 degrees) appears in the cepstral domain. This difference is useful for sound source localization. On the other hand, in the case of the 0 msec impulse response, the influence of the microphone and loudspeaker characteristics is a significant problem. Therefore, it is difficult to discriminate between positions for the 0 msec impulse response.
Also, this figure shows that the variability of the acoustic transfer function in the cepstral domain appears to be large for the reverberant speech. When the length of the impulse response is shorter than the analysis window used for the spectral analysis of speech, the acoustic transfer function obtained by subtraction of short-term-analysis-based speech features in the cepstrum domain is constant over the whole utterance. However, as the length of the impulse response for the room reverberation becomes longer than the analysis window, the variability of the acoustic transfer function obtained by the short-term analysis becomes large, with the acoustic transfer function being approximately represented by Equation (7). To compensate for this variability, a GMM is employed to model the acoustic transfer function.
3. Maximum-likelihood-based parameter estimation
This section presents a new method for estimating the GMM (Gaussian Mixture Model) of the
acoustic transfer function. The estimation is implemented by maximizing the likelihood of
the training data from a user’s position. In (Sankar & Lee, 1996), a maximum-likelihood (ML)
estimation method to decrease the acoustic mismatch for a telephone channel was described,
and in (Kristiansson et al., 2001) channel distortion and noise are simultaneously estimated
using an expectation maximization (EM) method. In this paper, we introduce the utilization
of the GMM of the acoustic transfer function based on the ML estimation approach to deal
with a room impulse response.
The frame sequence of the acoustic transfer function in (6) is estimated in an ML manner by using the expectation maximization (EM) algorithm, which maximizes the likelihood of the observed speech:
$\hat{H}$ … $\lambda_S) \cdot \log \Pr(O, c \mid \hat{H}, \lambda_S)$ (10)
Here c represents the unobserved mixture component labels corresponding to the observation
sequence O.
The joint probability of observing sequences $O$ and $c$ can be calculated as
$\Pr(O, c \mid \hat{H}, \lambda_S) = \prod_{n^{(v)}} w_{c_{n^{(v)}}} \Pr(O_{n^{(v)}} \mid \hat{H}, \lambda_S)$ …
… $O_{n^{(v)}};\ \mu^{(S)}_{k_{n^{(v)}}} + \hat{H}_{n^{(v)}},\ \Sigma^{(S)}_{k_{n^{(v)}}})$ (12)
where $N(O; \mu, \Sigma)$ denotes the multivariate Gaussian distribution. It is straightforward to derive that (Juang, 1985)
$Q(\hat{H} \mid H) = \sum_k \sum_{n^{(v)}}$ … $;\ \mu^{(S)}_k + \hat{H}_{n^{(v)}},\ \Sigma^{(S)}_k)$ (13)
Here $\mu^{(S)}_k$ and $\Sigma^{(S)}_k$ are the $k$-th mean vector and the (diagonal) covariance matrix in the clean speech GMM, respectively. It is possible to train those parameters by using a clean speech database.
Next, we focus only on the term involving $H$.
$Q(\hat{H} \mid H) = \sum_k$ … $\sum_{n^{(v)}} \gamma_{k,n^{(v)}} \sum_{d=1}^{D}$ … $\frac{1}{2}\log (2\pi)^D \sigma^{(S)\,2}_{k,d}$ + $(O_{n^{(v)},d} - \mu^{(S)}_{k,d} -$ …
… $\sigma^{(S)\,2}_{k,d}$ are the $d$-th mean value and the $d$-th diagonal variance value of the $k$-th component in the clean speech GMM, respectively.
The maximization step (M-step) in the EM algorithm becomes “max $Q(\hat{H} \mid H)$”. The re-estimation formula can, therefore, be derived, knowing that $\partial Q(\hat{H} \mid H)/\partial \hat{H} = 0$, as
$\hat{H}_{n^{(v)},d} = \sum_k \gamma_{k,n^{(v)}} O$ …
… $H_n$ as follows:
$\mu^{(H)}_m = \dfrac{\sum_v \sum_{n^{(v)}} \gamma_{m,n^{(v)}} \hat{H}_{n^{(v)}}}{\gamma_m}$ (17)
$\Sigma^{(H)}_m = \sum_v \sum_n$ …
localization is handled in an ML framework:
$\hat{\theta} = \operatorname*{argmax}_{\theta} \Pr(\hat{H} \mid \lambda_H^{(\theta)})$, (19)
where $\lambda_H^{(\theta)}$ denotes the estimated GMM for direction $\theta$ (location), and the GMM having the maximum likelihood is found for each test data from among the estimated GMMs corresponding to each position.
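
The decision rule of Eq. (19) can be sketched as follows. This is a minimal illustration, not the authors' implementation: each position's GMM is assumed to be a list of (weight, mean, diagonal variance) triples, the frames of $\hat{H}$ are assumed independent, and all names and toy parameter values are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at the vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, gmm):
    """Sum over frames of the log mixture likelihood (log-sum-exp for stability)."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        top = max(comps)
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total


def localize(frames, gmms_by_position):
    """Return the position whose GMM gives the highest likelihood, as in Eq. (19)."""
    return max(gmms_by_position,
               key=lambda theta: gmm_log_likelihood(frames, gmms_by_position[theta]))


# Toy two-dimensional "transfer function" GMMs for two positions (degrees).
toy_gmms = {
    30: [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    90: [(1.0, [4.0, 4.0], [1.0, 1.0])],
}
test_frames = [[3.8, 4.1], [4.2, 3.9], [4.0, 4.3]]
print(localize(test_frames, toy_gmms))  # 90
```

Working in the log domain avoids numerical underflow when the product of per-frame likelihoods is taken over many frames.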
4. Experiments
4.1 Simulation experimental conditions
The new talker localization method was evaluated in both a simulated reverberant
environment and a real environment. In the simulated environment, the reverberant
speech was simulated by a linear convolution of clean speech and impulse response. The
impulse response was taken from the RWCP database in real acoustical environments
(Nakamura, 2001). The reverberation time was 300 msec, and the distance to the microphone
was about 2 meters. The size of the recording room was about 6.7 m × 4.2 m (width × depth).
Figure 4 and Fig. 5 show the experimental room environment and the impulse response (90
degrees), respectively.
The speech signal was sampled at 12 kHz and windowed with a 32-msec Hamming window

[Figure 5 plot: vertical axis is amplitude]
Fig. 5. Impulse response (90 degrees, reverberation time: 300 msec)
set of test data, we found the GMM having the maximum likelihood from among those GMMs corresponding to each position. These experiments were carried out for each speaker, and the localization accuracy was averaged over the four talkers.
4.2 Performance in a simulated reverberant environment
Figure 6 shows the localization accuracy in the three-position estimation task, where 50 words are used for the estimation of the acoustic transfer function. As can be seen from this figure, increasing the number of Gaussian mixture components for the acoustic transfer function improves the localization accuracy. This suggests that the GMM for the acoustic transfer function is effective for localization.
Figure 7 shows the results for different amounts of training data, where the number of Gaussian mixture components for the acoustic transfer function is 16. The performance of the training using ten words may be somewhat poor due to the lack of data for estimating the acoustic transfer function. Increasing the amount of training data (50 words) improves the performance.
[Figure 7 plot: vertical axis is localization accuracy [%]]
Fig. 7. Comparison of the different number of training data

In the proposed method, the frame sequence of the acoustic transfer function is separated from the observed speech using (16), and the GMM of the acoustic transfer function is trained by (17) and (18) using the separated sequence data. On the other hand, a simple way to carry out voice (talker) localization may be to use the GMM of the observed speech without the separation of the acoustic transfer function. The GMM of the observed speech can be derived in a similar way as in (17) and (18).
$\mu^{(O)}_m = \dfrac{\sum_v \sum_{n^{(v)}} \gamma_{m,n^{(v)}} O_{n^{(v)}}}{\gamma_m}$ (20)
$\Sigma^{(O)}_m = \sum_v \sum$ …
[Figure plot: localization accuracy [%] versus number of positions; legend: GMM of acoustic transfer function (Proposed); GMM of observed speech; Mean of observed speech; CSP (two microphones)]

[Figure plot: localization accuracy [%] versus signal-to-noise ratio [dB], including the clean condition]

Fig. 10. Comparison of performance using speaker-dependent/-independent speech model (speaker-independent, 256 Gaussian mixture components; speaker-dependent, 64 Gaussian mixture components)
GMM was trained using 160 sentences (40 sentences × 4 males) and it has 256 Gaussian mixture components. The acoustic transfer function for training locations was estimated by this clean speech model from 10 sentences for each male. The total number of training data for the acoustic transfer function GMM was 40 sentences (10 sentences × 4 males). For training the speaker-dependent model and testing, the speech data spoken by four males in the ATR Japanese speech database were used in the same way as described in section 4.1. The speech data for the test were provided by the same speakers used to train the speaker-dependent model, but different speakers were used to train the speaker-independent model. Both the speaker-dependent GMM and the speaker-independent GMM for the acoustic transfer function have 16 Gaussian mixture components. As shown in Figure 10, the localization accuracy of the speaker-independent speech model decreases by about 20% in comparison to the speaker-dependent speech model.
4.4 Performance using Speaker-dependent speech model in a real environment
The proposed method, which uses a speaker-dependent speech model, was also evaluated in a real environment. The distance to the microphone was 1.5 m and the height of the microphone was about 0.45 m. The size of the recording room was about 5.5 m × 3.6 m × 2.7 m (width × depth × height). Figure 11 depicts the room environment of the experiment. The experiment used speech data, spoken by two males, in the ASJ Japanese speech database. The clean speech GMM (speaker-dependent model) was trained using 40 sentences and has 64 Gaussian mixture components. The test data for one location consisted of 200, 100 and 66 segments, where one segment has a length of 1, 2 and 3 sec, respectively. The number of training data for the acoustic transfer function was 10 sentences. The speech data for training the clean speech model, training the acoustic transfer function, and testing were spoken by the same speakers, but with different text utterances. The experiments were carried out for each speaker and the localization accuracy of the two speakers was averaged.

[Figure 11 diagram labels: Loudspeaker; Microphone]
Fig. 11. Experiment room environment
[Figure 13 panel labels: position: 45 deg.; position: 90 deg.]
Fig. 13. Effect of speaker orientation
[Figure 14 diagram labels: Microphone; Orientation of speaker; Speaker's position]
Fig. 14. Speaker orientation

Figure 12 shows the comparison of the performance using different test segment lengths.
