Automatic Text Extraction Using DWT and Neural Network
Po-Yueh Chen (陳伯岳), Chung-Wei Liang (梁忠瑋)

Department of Computer Science and Information Engineering,
Chaoyang University of Technology
(168 Gifeng E. Rd., Wufeng, Taichung County, Taiwan, R.O.C.)
Tel: (04) 23323000 ext. 4420
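Email:

Abstract (translated from the Chinese)

This paper proposes a method that uses the discrete wavelet
transform and a neural network to extract text regions from an
image. The original image is decomposed by the discrete wavelet
transform into four sub-bands. The high-frequency sub-bands of
real text regions differ from those of non-text regions, so this
difference can be used to compute three feature values that serve
as the input of the neural network. A neural network with a
back-propagation architecture is then trained on the candidate
text regions. The network output for text regions differs from
that for non-text regions, so a threshold can be used to decide
whether a region is text. Finally, the detected text regions are
dilated to obtain the correct text regions.

Keywords: text extraction, discrete wavelet transform, neural network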

Abstract

In this paper, we present a new text extraction
method based on the discrete wavelet transform and a
neural network. The method successfully extracts
features of candidate text regions using the discrete
wavelet transform. This is because the intensity

regions are detected by analyzing the robust edges or
homogeneous color/grayscale components that
belong to characters. For example, Cai et al. [1]
detect text edges in video sequences using a color
edge detector and then apply a low threshold to filter
out definite non-edge points. Real text edges are
detected using an edge-strength-smoothing operator
and an edge-clustering-power operator. Finally, they
employ a string-oriented coarse-to-fine detection
method to extract the real text regions. Datong Chen
et al. [2] detect vertical edges and horizontal edges in
an image and dilate these two kinds of edges using
different dilation operators. The logical AND
operator is performed on dilated vertical edges and
dilated horizontal edges to obtain candidate text
regions. Real text regions are then identified using
a support vector machine.
Text regions usually have special texture features
because they consist of components of characters.
These components also contrast with the background and
exhibit a periodic horizontal intensity variation due to
the horizontal alignment of characters. As a result,
text can be extracted according to these special
texture features of characters. Williams and Alder [3]
segmented and classified text in a newspaper by
generic texture analysis, applying small masks to
obtain local textural characteristics.
All the text extraction methods described above are
applied to uncompressed images. Today, most
digital videos and static images are stored in

image into four sub-bands. The transformed image
includes one average component sub-band and three
detail component sub-bands. Each detail component
sub-band contains different feature information about
the real text regions. Those features are fed to the
back-propagation (BP) algorithm to train a neural
network, which eventually extracts the text regions.
In a color image, the color components may
differ within a text region. However, color
information does not help in extracting text from
images. If the input image is a gray-level image, it
is processed directly, starting at the discrete
wavelet transform. If the input image is in color, the
RGB components are combined to give an intensity
image Y as follows:

Y = 0.299R + 0.587G + 0.114B    (1)

Image Y is then processed with the discrete wavelet
transform and the rest of the extraction algorithm.
If the input image is already stored in
DWT-compressed form, the DWT operation can
be omitted in the proposed algorithm.
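
As an illustration of Equation (1), the following minimal sketch converts RGB pixels to the intensity image Y. The function names and the list-of-rows image layout are assumptions made for this example; they are not taken from the paper.

```python
def rgb_to_intensity(r, g, b):
    """Intensity Y of one pixel, using the weights of Equation (1)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def image_to_intensity(rgb_image):
    """Convert an image given as rows of (R, G, B) tuples into rows of Y values.

    The list-of-rows layout is an assumption made for this sketch.
    """
    return [[rgb_to_intensity(r, g, b) for (r, g, b) in row] for row in rgb_image]


# Hypothetical 1x2 image: one black pixel and one pure red pixel.
y_image = image_to_intensity([[(0, 0, 0), (255, 0, 0)]])
```

Because the three weights sum to one, a gray pixel with R = G = B keeps its value (up to floating-point rounding).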
The flow chart of the proposed algorithm is shown
in Figure 1. We choose the Haar DWT because it is the
simplest among all wavelets [6]. The working
principle of the Haar DWT is discussed in detail in the next
sub-section.

shown in Figure 2. From these three detail components of
an image, we can obtain various edge features of the
original image.

A  B  C  D
E  F  G  H
I  J  K  L
M  N  O  P
(a)

(A + B)  (C + D)  (A - B)  (C - D)
(E + F)  (G + H)  (E - F)  (G - H)
(I + J)  (K + L)  (I - J)  (K - L)
(M + N)  (O + P)  (M - N)  (O - P)
(b)

(A + B) + (E + F)  (C + D) + (G + H)  (A - B) + (E - F)  (C - D) + (G - H)
(I + J) + (M + N)  (K + L) + (O + P)  (I - J) + (M - N)  (K - L) + (O - P)
Figure 3(c).

2-D Haar DWT decomposes a gray-level
image into one average component sub-band and
three detail component sub-bands. From these three
detail components, we can obtain important features
of candidate text regions.
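
The row and column operations illustrated in Figure 3 can be sketched as follows. This is a minimal, unnormalized one-level Haar decomposition that uses plain sums and differences as in the figure. The function names, the requirement of even image dimensions, and the quadrant labels (LL, HL, LH, HH by position) are assumptions of this sketch; sub-band naming conventions vary between authors.

```python
def haar_rows(image):
    """Row operation: pairwise sums in the left half, differences in the right half."""
    out = []
    for row in image:
        pairs = [(row[i], row[i + 1]) for i in range(0, len(row), 2)]
        out.append([a + b for a, b in pairs] + [a - b for a, b in pairs])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def haar_dwt2(image):
    """One-level 2-D Haar DWT (unnormalized) of an image with even width and height.

    Returns the four quadrants. The labels are assumed for this sketch:
    ll = top-left, hl = top-right, lh = bottom-left, hh = bottom-right.
    """
    rows_done = haar_rows(image)
    # The column operation is the row operation applied to the transposed image.
    cols_done = transpose(haar_rows(transpose(rows_done)))
    h, w = len(image) // 2, len(image[0]) // 2
    ll = [r[:w] for r in cols_done[:h]]
    hl = [r[w:] for r in cols_done[:h]]
    lh = [r[:w] for r in cols_done[h:]]
    hh = [r[w:] for r in cols_done[h:]]
    return ll, hl, lh, hh


# Hypothetical 4x4 test image with values 1..16 in place of A..P.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, hl, lh, hh = haar_dwt2(image)
# ll == [[14, 22], [46, 54]], e.g. ll[0][0] = (A + B) + (E + F) = 1 + 2 + 5 + 6
```

For this smooth test image the detail quadrants are constant or zero. Strong values in the detail quadrants would indicate edges, which is what the extraction method relies on.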
As a practical example, a gray-level original
image is shown in Figure 4. The corresponding DWT
sub-bands are shown in Figure 5. We can extract
features of candidate text regions from the detail
component sub-bands in Figure 5. In the next subsection,
a neural network is employed to learn the features of
candidate text regions obtained from those detail
component sub-bands. Finally, the well-trained neural
network is ready to extract the real text regions.

Figure 5. 2-D Haar discrete wavelet transform image

Figure 6. Proposed architecture of the neural network (layers: input nodes LH, HL, HH; hidden nodes; output node)

2.2 Neural Network
In this subsection, text extraction from static images
or video sequences is accomplished using the
back-propagation (BP) algorithm on a neural network.
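
To make the training step concrete, here is a minimal sketch of a network with three inputs, one hidden layer, and one sigmoid output, trained by back-propagation on a squared-error loss. The hidden-layer size, learning rate, weight initialization, and toy feature vectors are all assumptions chosen for illustration; the paper's actual settings and features are not reproduced here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """3 inputs -> n_hidden sigmoid units -> 1 sigmoid output (sizes are assumed)."""

    def __init__(self, n_in=3, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update; returns the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        d_out = (y - target) * y * (1.0 - y)  # error signal at the output
        # Hidden error signals use the output weights from before the update.
        d_hid = [d_out * w * hi * (1.0 - hi) for w, hi in zip(self.w2, h)]
        for j, hi in enumerate(h):
            self.w2[j] -= lr * d_out * hi
        self.b2 -= lr * d_out
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj
        return 0.5 * (y - target) ** 2


def train(net, data, epochs=500):
    """Return the summed loss of each epoch."""
    return [sum(net.train_step(x, t) for x, t in data) for _ in range(epochs)]


# Toy data (hypothetical): large detail-band features for text (target 1),
# small ones for non-text (target 0).
data = [([0.9, 0.8, 0.7], 1.0), ([0.8, 0.9, 0.6], 1.0),
        ([0.1, 0.2, 0.1], 0.0), ([0.2, 0.1, 0.05], 0.0)]
net = TinyNet()
losses = train(net, data)
```

After training, `net.forward(features)[1]` gives a value between zero and one that can be compared against a threshold.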

The sample images chosen for the experiments include
some pure text samples and some samples containing
non-text regions. Depending on the text
characteristics of an image, the intensity of the detail
component sub-bands is quite different from one
sub-band to another. We employ this intensity
difference to compute three features of candidate text
regions. Those features are used as the input of a
neural network trained with the
back-propagation algorithm.
After the neural network is well trained, new input
data produce an output value between zero and
one. The output values of real text regions differ
markedly from those of the non-text regions.
Therefore, we can apply an appropriate threshold to
remove the non-text regions. Finally, the remaining
real text regions are processed by dilation
operations; the result is shown in Figure 7.
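
The thresholding and dilation steps described above can be sketched as follows. The threshold value of 0.5 and the square structuring element are assumptions made for this example; the source only states that an appropriate threshold and some dilation operations are used.

```python
def threshold_mask(scores, threshold=0.5):
    """Mark a position as text (1) when its network output exceeds the threshold.

    The default threshold of 0.5 is an assumption, not the paper's value.
    """
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def dilate(mask, radius=1):
    """Binary dilation with a (2 * radius + 1)-wide square structuring element (assumed)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Set every in-bounds neighbor within the square window.
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


# Hypothetical 3x3 grid of network outputs with one confident text response.
scores = [[0.1, 0.2, 0.1],
          [0.2, 0.9, 0.1],
          [0.1, 0.1, 0.1]]
mask = threshold_mask(scores)
grown = dilate(mask)
```

Dilation grows each detected position into its neighborhood, which helps merge isolated detections into connected text regions.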

3. Experiment Results
Experiments are performed on static images and
video sequences. The frame size is 1024×768 in BMP
or MPEG format. We convert color frames to
gray level before applying the proposed method. In
Figure 8, the results of the proposed algorithm are
illustrated step by step. The original images shown in
Figure 8(a) are decomposed into one average
component sub-band and three detail component
sub-bands as shown in Figure 8(b). Those detail

[3] P. S. Williams, M. D. Alder, "Generic texture
analysis applied to newspaper segmentation,"
IEEE International Conference on Neural
Networks, Vol. 3, pp. 1664-1669, 3-6 June 1996.

[4] Yu Zhong, Hongjiang Zhang, A. K. Jain,
"Automatic caption localization in compressed
video," IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 22, No. 4,
pp. 385-392, April 2000.

[5] Byung Tae Chun, Younglae Bae, Tai-Yun Kim,
"Automatic Text Extraction in Digital
Videos using FFT and Neural Network," IEEE
International Conference on Fuzzy Systems
(FUZZ-IEEE '99), Vol. 2, pp. 1112-1115,
22-25 Aug. 1999.

[6] K. Gröchenig, W. R. Madych, "Multiresolution
Analysis, Haar Bases, and Self-Similar Tilings
of R^n," IEEE Transactions on Information
Theory, Vol. 38, No. 2, Mar. 1992.

[7] S. G. Mallat, “A theory for Multiresolution
Signal Decomposition: The Wavelet
