DSpace at VNU: A parallel dimensionality reduction for time-series data and some of its applications - Pdf 47

Int. J. Intelligent Information and Database Systems, Vol. 5, No. 1, 2011

A parallel dimensionality reduction for time-series
data and some of its applications
Hoang Chi Thanh*
Department of Informatics,
Hanoi University of Science, VNUH,
334 Nguyen Trai Rd., Hanoi, Vietnam
E-mail:
*Corresponding author

Nguyen Quang Thanh
Da Nang Department of Information and Communication,
15 Quang Trung Str., Da Nang, Vietnam
E-mail:
Abstract: The subsequence matching in a large time-series database has been
an interesting problem. Many methods have been proposed that cope with this
problem in an adequate extent. One of the good ideas is reducing properly the
dimensionality of time-series data.
In this paper, we propose a new method to reduce the dimensionality of
high-dimensional time-series data. The method is simpler than existing ones
based on the discrete Fourier transform and the discrete cosine transform.
Furthermore, our dimensionality reduction may be executed in parallel. The
method is used to time-series data matching problem and it decreases
drastically the complexity of the corresponding algorithm. The method
preserves planar geometric blocks and it is also applied to minimum bounding
rectangles as well.
Keywords: time-series data; database; dimensionality reduction; matching
problem; minimum bounding rectangle; MBR.
Reference to this paper should be made as follows: Thanh, H-C. and
Thanh, N-Q. (2011) ‘A parallel dimensionality reduction for time-series data

Introduction

Time-series data are the sequences of real numbers representing values at specific points
in time. For example, the bid prices and the ask prices of stock items, exchange rates,
weather data and human speech signals… are typical illustrations of time-series data. The
data stored in a database are called data sequences. The aim of the subsequence matching
problem in a large time-series database is finding data sequences similar to the given
query sequence from the database. This problem has attracted a lot of interest by its
applications.
Many methods have been proposed that cope with this problem in an adequate extend
(Agrawal et al., 1993; Keogh et al., 2000; Keogh et al., 2001; Faloutsos et al., 2001;
Moon et al., 2002). One of good ideas to increase the matching speed is a proper
dimensionality reduction for high-dimensional time-series data. In 2007, Moon proposed
a data transformation based on the discrete Fourier transform and then Moon and Kim
presented a data transformation based on the discrete cosine transform.
In this paper we present another dimensionality reduction for high-dimensional
time-series data. The method splits a high-dimensional time-series data into parts as equal
in time scale as possible and then takes the average of each part. The reduction is simpler
than existing ones above presented and it may be performed in parallel. So this method
decreases the time for ‘narrowing’ data and speeds up the matching process in a large
time-series database. We also use this dimensionality reduction for a special type of
time-series data – minimum bounding rectangles (MBR).
This paper is organised as follows. In Section 2 we present a dimensionality reduction
function for high-dimensional time-series data and point out some its properties.
Section 3 presents application of the dimensionality reduction function to time-series data
matching and to MBR. When applying this reduction function to MBRs we show that it
becomes safe. Some conclusion remarks are given in the last section.

2

So each proper dimensionality reduction function on time-series data is a shrinking
mapping. The properness of a reduction function guarantees no false dismissals for range
queries.
Let T[1..n] be an n-dimensional time-series data and let m be a positive integer such
that 0 < m

(2.3)

⎪⎧ 2 / 2, if i = 1;
.
where c(i ) = ⎨
if 2 ≤ i ≤ m ≤ i ≤ m
⎪⎩1,

Back to an n-dimensional time-series data T[1..n]. To reduce the dimensionality of the
data we split it into m parts as equal in time scale as possible. This always may be done
because of the following arithmetic fact:
For two positive integers n and m with 0 < m < n, there exist two non-negative
integers q and d, such that n = d.(q + 1) + (m – d).q.
The proof of this fact is very simple.
Let choose q = n div m and d = n mod m. We get, n = m.q + d = d.q + d + m.q – d.q =
d.(q + 1) + (m – d).q.

42

H.C. Thanh and N.Q. Thanh

The above fact offers us a method to part an n-dimensional time-series data into the
following m parts: d first parts with the size of q + 1 and m – d remaining parts with the
size of q. Then we take the average of each part. So we are able to transform an
n-dimensional time-series data to an m-dimensional time-series data.
Let denote q = n div m and d = n mod m.
Definition 2.2: The m-dimensional time-series data TR[1..m] constructed as follows:
i.( q +1)

distance.
n

So, L1 ( X , Y ) =

∑

m

|X [k ] − Y [k ] | and L1 ( f ( X ), f (Y )) =

k =1

∑ |X

R [i ] − YR [i ] |, ,

where

i =1

XR[i] and YR[i] are the corresponding components of the m-dimensional time-series data
transformed by the formula (2.4).
To prove the properness of the function f we check the inequality (2.1) only on each
part split as in Definition 2.2. On the first part we have:
X [1] − Y [1] + X [2] − Y [2] + ... + X [q + 1] − Y [q + 1]
≥ X [1] − Y [1] + X [2] − Y [2] + ... + X [q + 1] − Y [q + 1]
= ( X [1] + X [2] + ... + X [q + 1]) − (Y [1] + Y [2] + ... + Y [q + 1])
≥
=

Theorem 2.2: Line segments are invariable under the dimensionality reduction function f,
i.e.,
∀X ( X ∈ A − B ⇒ X R ∈ AR − BR ) .

Proof: The equation of the line passing A and B in the n-dimensional space is:
x − A[n]
x1 − A[1]
x − A[2]
= 2
= ... = n
.
B[1] − A[1] B[2] − A[2]
B[n] − A[n]

As the point X belongs to the line A – B, we have:
X [1] − A[1] X [2] − A[2]
X [n] − A[n]
=
= ... =
= k,
B[1] − A[1] B[2] − A[2]
B[n] − A[n]

with 0 ≤ k ≤ 1.
It means,
⎧ X [1] − A[1] = k ( B[1] − A[1])
⎪ X [2] − A[2] = k ( B[2] − A[2])
⎪
⎨
⎪...........................................

B[1] + B[2] | +... + B[q + 1] A[1] + A[2] + ... + A[q + 1]
−
q +1
q +1
=

( X [1] + X [2] + ... + X [q + 1]) − ( A[1] + A[2] + ... + A[q + 1]) =
( B[1] + B[2] + ... + B[q + 1]) − ( A[1] + A[2] + ... + A[q + 1])

( X [1] − A[1]) + ( X [2] − A[2]) + ... + ( X [q + 1] − A[q + 1])
=
( B[1] − A[1]) + ( B[2] − A[2]) + ... + ( B[q + 1] − A[q + 1])
k ( B[1] − A[1]) + k ( B[2] − A[2]) + ... + k ( B[q + 1] − A[q + 1])
=
( B[1] − A[1]) + ( B[2] − A[2]) + ... + ( B[q + 1] − A[q + 1])
=

= k.

Analogously, we show that each fraction in (2.6) is equal to k. So they are all identical.
This proves the theorem.
Corollary 2.3: The dimensionality reduction function f as in (2.4) preserves polygons.
Note that spheres are not preserved by the dimensionality reduction function f.

3

Some applications

3.1 Application to matching problem
Finding all occurrences of a pattern in a database is the purpose of matching problem.

n

S [i ][ j ] by adding

∑ S[q + k − 1][ j]

to the

j =1

j =1

n

previous sum and then subtracting

∑ S[q − 1][ j],

and compare the obtained sum with

j =1

the sum of the pattern P. The step is called a preprocessing.
Figure 1

Calculating the value of S[q..q + k – 1]

If a candidate is found (p = t) we move forward to the matching step. In this step we have
to compare tightly the candidate with the pattern and print a notice if the comparison

t←

n

∑∑ S[i][ j]
i =1 j =1

6
7
8

for q ← 1 to l – k + 1 do
begin
k −1

t ←t+

n

∑∑
i =1 j =1

9

◊ Matching
then print ‘pattern occurs beginning at position’ q

end

j =1

j =1

∑ S[q + k − 1][ j] and ∑ S[q − 1][ j] in the instruction (8)

can be performed in parallel. So can be the comparison of P[1..k] = S[q..q + k – 1] in the
instruction (10), too. The complexity of the algorithm is O(l.n).
We apply the dimensionality reduction transformation f constructed as in the formula
(2.4) to the time-series matching problem. Firstly, we reduce the dimensionality of the
database S[1..l][1..n] and the pattern P[1..k][1..n] from n to m, with 0 < m
(2.4) is MBR-safe.
Proof: By the definition of an MBR we have n following double inequalities:
L[ j ] ≤ X [ j ] ≤ U [ j ], ∀j = 1, 2,..., n.

Adding q + 1 first double inequalities and dividing totals by q + 1 we get:
L[1] + L[2] + ... + L[q + 1]
≤
q +1
X [1] + X [2] + ... + X [q + 1]
≤
q +1
U [1] + U [2] + ... + U [q + 1]
.
q +1

This means, LR[1] ≤ XR[1] ≤ UR[1].
Analogously for remaining parts, we obtain:
LR [i ] ≤ X R [i ] ≤ U R [i ], ∀i = 1, 2, …, m.

So, X R ∈ [ LR , U R ] .
Corollary 2.3 and Theorem 3.2 show that the dimensionality reduction transformation
f preserves planar geometric blocks represented by line segments. They point out an
important role of the transformation f in computer graphics and image processing.

4

Conclusions

In this paper we present a dimensionality reduction transformation for multi-dimensional
time-series data and some its applications in matching problem and MBR. The

ACM SIGMOD, pp.419–429.
Keogh, E., Chakrabarti, K, Pazzani, M. and Mehrotra, S. (2000) ‘Dimensionality reduction for fast
similarity search in large time-series database’, Journal of Knowledge and Information
Systems, Vol. 3, No. 3, pp.263–286.
Keogh, E., Chakrabarti, K., Mehrotra, S. and Pazzani, M. (2001) ‘Locally adaptive dimensionality
reduction for indexing large time-series databases’, Proceedings of the International
Conference on Management of Data, ACM SIGMOD, pp.151–162.
Moon, Y.S. (2007) ‘An MBR-safe transformation for high-dimensional MBRs in similar sequence
matching’, Proceedings of the International Conference on Database systems for Advanced
Applications, Thailand.
Moon, Y.S. and Kim, J. (2007) ‘A theoretical study on MBR-safe transformations’, Proceedings of
the 12th International Conference on Knowledge-Based and Intelligent Information &
Engineering Systems, Italy.
Moon, Y.S., Whang, K.Y. and Han, W.S. (2002) ‘General match: a subsequence matching method
in time-series databases based on generalized windows’, Proceedings of the International
Conference on Management of Data, ACM SIGMOD, pp.382–393.
Thanh, H.C. (2007) ‘Transforming sequential processes of a net system into concurent ones’,
International Journal of Knowledge-based and Intelligent Engineering Systems, Vol. 11,
No. 6, pp.391–397.
Thanh, H.C. (2009) ‘Parallel dimensionality reduction transformation for time-series data’, in
Ngoc Thanh Nguyen, Huynh Phan Nguyen and Adam Grzech (Eds.): ACIIDS 2009, IEEE
Computer Society, pp.104–108.

Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

DSpace at VNU: A parallel dimensionality reduction for time-series data and some of its applications - Pdf 47

Tài liệu, ebook tham khảo khác

Học thêm