Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 388–392,
Jeju, Republic of Korea, 8-14 July 2012.
© 2012 Association for Computational Linguistics
Grammar Error Correction
Using Pseudo-Error Sentences and Domain Adaptation
Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, and Hitoshi Nishikawa
NTT Cyber Space Laboratories, NTT Corporation
1-1 Hikari-no-oka, Yokosuka, 239-0847, Japan
{imamura.kenji, saito.kuniko, sadamitsu.kugatsu, nishikawa.hitoshi}@lab.ntt.co.jp
Abstract
This paper presents grammar error correction for Japanese particles
using discriminative sequence conversion, which corrects erroneous
particles by substitution, insertion, and deletion. The error
correction task is hindered by the difficulty of collecting large
error corpora. We tackle this problem by using pseudo-error sentences
generated automatically. Furthermore, we apply domain adaptation,
treating the pseudo-error sentences as the source domain and the
real-error sentences as the target domain. Experiments show that
stable improvement is achieved by using domain adaptation.
1 Introduction
Case marks of a sentence are represented by postpo-

learner’s and the correct sentences. However, collecting a sufficient
number of pairs is expensive. To avoid this problem, we use an
additional corpus consisting of pseudo-error sentences, which are
automatically generated from correct sentences so that they mimic the
real-errors (Rozovskaya and Roth, 2010b). Furthermore, we apply a
domain adaptation technique that regards the pseudo-errors and the
real-errors as the source and the target domain, respectively, so that
the pseudo-errors better match the real-errors.
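As an illustration of how pseudo-error sentences might be generated from correct sentences, here is a minimal Python sketch. It assumes a confusion table that maps each correct particle to erroneous forms with probabilities; the table contents, the function name generate_pseudo_errors, and the way the magnification parameter scales the probabilities are all hypothetical and are not taken from the paper. The sketch covers only substitution and omission of particles that are present in the correct sentence.

```python
import random

EMPTY = ""  # stands for the empty word (an omitted particle)

# Hypothetical confusion table: correct particle -> [(erroneous form, probability)]
ERROR_TABLE = {
    "ga": [("o", 0.10)],
    "o": [("ni", 0.05), (EMPTY, 0.05)],
}


def generate_pseudo_errors(tokens, error_table, magnification=1.0, rng=None):
    """Return a copy of `tokens` in which particles are randomly corrupted."""
    if rng is None:
        rng = random.Random()
    corrupted = []
    for token in tokens:
        draw = rng.random()  # uniform in [0, 1)
        cumulative = 0.0
        output = token  # default: keep the correct word
        for wrong_form, probability in error_table.get(token, []):
            cumulative += probability * magnification
            if draw < cumulative:
                output = wrong_form
                break
        if output != EMPTY:  # an empty output models an omitted particle
            corrupted.append(output)
    return corrupted
```

Each corrupted sentence, paired with the correct sentence it was generated from, could then serve as one training pair. With a magnification of 0.0 no errors are generated, and larger values make errors proportionally more frequent.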
2 Error Correction by Discriminative
Sequence Conversion
We start by describing discriminative sequence conversion. Our error
correction method converts the learner’s word sequences into the
correct sequences. Our method is similar to phrase-based statistical
machine translation (PBSMT), but there are three differences: 1) it
adopts conditional random fields (CRFs), 2) it allows insertion and
deletion, and 3) binary and real features are combined. Unlike the classification
Incorrect Particle   Correct Particle   Note
φ                    no/POSS.           INS
φ                    o/ACC.             INS
ga/NOM.              o/ACC.             SUB
o/ACC.               ni/DAT.            SUB
o/ACC.               ga/NOM.            SUB
wa/TOP.              o/ACC.             SUB
no/POSS.             φ                  DEL
:                    :

Since an insertion can be regarded as replacing an empty word with an
actual word, and a deletion as replacing an actual word with an empty
one, we treat these operations as substitutions without distinction
while learning and applying the CRF models.
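The uniform treatment of the three operations can be sketched in Python as follows. This is only an illustration under stated assumptions: the empty word φ is represented by an empty string, and the helper names operation_type and substitute are hypothetical rather than part of the paper's system.

```python
EMPTY = ""  # represents the empty word (φ in the phrase table)


def operation_type(incorrect, correct):
    """Name the operation implied by an (incorrect, correct) particle pair."""
    if incorrect == EMPTY and correct != EMPTY:
        return "INS"
    if incorrect != EMPTY and correct == EMPTY:
        return "DEL"
    if incorrect != correct:
        return "SUB"
    return "COPY"


def substitute(tokens, position, incorrect, correct):
    """Replace `incorrect` at `position` with `correct`.

    Insertion and deletion need no special case: an empty `incorrect`
    matches zero tokens, and an empty `correct` produces zero tokens.
    """
    source = [] if incorrect == EMPTY else [incorrect]
    target = [] if correct == EMPTY else [correct]
    end = position + len(source)
    if tokens[position:end] != source:
        raise ValueError("the incorrect particle was not found at this position")
    return tokens[:position] + target + tokens[end:]
```

For example, substitute(["mail", "o", "todoi", "tara"], 1, "o", "ga") rewrites the accusative particle as the nominative one, and the same function inserts a particle when the incorrect side is EMPTY.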
[Figure: input words (mail/noun, o/ACC., todoi/verb, tara/PART) and the
corresponding phrase lattice beginning at <s>, with edges labeled copy,
INS, and SUB and an "Incorrect Particle" marker.]

are defined as the pairs of the output phrase and 1-,
2-, and 3-grams in the window.
The link features are important for the error correction task because
the system has to judge output correctness. Fortunately, the CRF, being
a discriminative model, can handle features that depend on each other;
we mix two types of features as follows and optimize their weights in
the CRF framework.
• N-gram features: N-grams of the output words, from 1 to 3, are used
as binary features. These are obtained from a training corpus (paired
sentences). Since the feature weights are optimized considering the
entire feature space, fine-tuning can be achieved. The accuracy becomes
almost perfect on the training corpus.
• Language model probability: This is a logarithmic value (real value)
of the n-gram probability of the output word sequence. One feature
weight is assigned. The n-gram language model can be constructed from a
large sentence set because it does not need the learner’s sentences.
Incorporating binary and real features yields a rough approximation of
generative models in semi-supervised CRFs (Suzuki and Isozaki, 2008).
It can appropriately correct new sentences while maintaining high
accuracy on the training corpus.
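To make the combination of binary and real-valued features concrete, the following Python sketch computes an unnormalized linear score for one candidate output sequence. It is a simplification under stated assumptions, not the paper's implementation: a real CRF would normalize such scores over all lattice paths, and the names ngrams, score, binary_weights, lm_weight, and lm_logprob are hypothetical.

```python
def ngrams(words, max_n=3):
    """List all 1- to max_n-grams of `words`, with a sentence-start symbol."""
    padded = ["<s>"] + list(words)
    found = []
    for n in range(1, max_n + 1):
        for start in range(len(padded) - n + 1):
            found.append(" ".join(padded[start:start + n]))
    return found


def score(words, binary_weights, lm_weight, lm_logprob):
    """Unnormalized score of an output sequence.

    binary_weights: one learned weight per n-gram (binary feature);
                    n-grams unseen in training contribute nothing.
    lm_weight:      the single weight of the language-model feature.
    lm_logprob:     callable returning the log probability of `words`
                    (assumed to come from a separately trained n-gram model).
    """
    # A binary feature fires at most once, so duplicates are collapsed.
    binary_part = sum(binary_weights.get(g, 0.0) for g in set(ngrams(words)))
    real_part = lm_weight * lm_logprob(words)
    return binary_part + real_part
```

The binary weights capture corrections seen in the paired training corpus, while the single language-model weight lets a model trained on a much larger sentence set influence sequences that the paired corpus never covered.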
3 Pseudo-error Sentences and Domain
Adaptation

mentation method for the domain adaptation, which
eliminates the need to change the learning algorithm. This method
regards the models for the source domain as the prior distribution and
learns the models for the target domain.
Feature Space   Common   Source   Target
Source Data     D_s      D_s      0
Target Data     D_t      0        D_t
Figure 2: Feature Augmentation
We briefly review feature augmentation. The feature space is segmented
into three parts: common, source, and target. The features extracted
from the source domain data are placed in both the common and the
source spaces, and those from the target domain data are placed in both
the common and the target spaces. Namely, the feature space is tripled
(Figure 2).
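The tripling of the feature space can be sketched in Python as follows. This is a minimal illustration that assumes features are stored as a sparse dictionary from feature name to value; the function name augment and the "common:", "source:", and "target:" prefixes are hypothetical.

```python
def augment(features, domain):
    """Copy each feature into the common space and its own domain space.

    features: dict mapping feature name -> value for one training instance.
    domain:   "source" (pseudo-errors) or "target" (real-errors).
    The other domain's space is left empty, which in a sparse
    representation means those features are implicitly zero.
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    augmented = {}
    for name, value in features.items():
        augmented["common:" + name] = value
        augmented[domain + ":" + name] = value
    return augmented
```

Because only the feature names change, the augmented instances can be passed to an unmodified learner, which is why the learning algorithm itself does not need to change.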
The parameter estimation is carried out in the
usual way on the above feature space. Consequently,
the weights of the common features are emphasized

[Figure 3 plot: precision rate (vertical axis, 0.3 to 1) against recall
rate (horizontal axis, 0 to 0.25) for TRG, SRC, ALL, and AUG.]
Figure 3: Recall/Precision Curve (Error Generation Magnification is 1.0)
Japanese Linux manuals, 527,151 sentences in total.
SRILM (Stolcke et al., 2011) was used to train a
trigram model.
Pseudo-error Corpus: The pseudo-errors were

[Figure 4 plot: relative improvement (vertical axis, -150 to +100)
against error generation probability magnification (horizontal axis,
0.0 to 2.0) for TRG, SRC, ALL, and AUG.]
Figure 4: Relative Improvement among Error Generation Probabilities
source domain and the real-errors as the target domain.
The SRC case, which uses only the pseudo-error sentences, did not match
the precision of TRG. The ALL case matched the precision of TRG at high
recall rates. AUG, the proposed method, achieved higher precision than
TRG at high recall rates. At a recall rate of 18%, the precision rate
of AUG was 55.4%, whereas that of TRG was 50.5%. Feature augmentation
thus effectively leverages the pseudo-errors for error correction.
Figure 4 shows the relative improvement of each

Computational Linguistics (NAACL-HLT 2010), pages
163–171, Los Angeles, California.
Na-Rae Han, Joel Tetreault, Soo-Hwa Lee, and Jin-Young Ha. 2010. Using an error-annotated learner corpus to develop an ESL/EFL error correction system. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.
Kenji Imamura, Tomoko Izumi, Kugatsu Sadamitsu, Kuniko Saito, Satoshi Kobashikawa, and Hirokazu Masataki. 2011. Morpheme conversion for connecting speech recognizer and language analyzers in unsegmented languages. In Proceedings of Interspeech 2011, pages 1405–1408, Florence, Italy.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pages 282–289, Williamstown, Massachusetts.
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 147–155, Chiang Mai, Thailand.
Alla Rozovskaya and Dan Roth. 2010a. Generating confusion sets for context-sensitive error correction. In Proceedings of the 2010 Conference on Empirical

