VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
THE BAYES ALGORITHM AND ITS APPLICATIONS
UNDERGRADUATE THESIS, FULL-TIME PROGRAM
Major: Information Technology
Supervisor: MSc. Nguyen Nam Hai
Co-supervisor: MSc. Do Hoang Kien
Writing a thesis is one of the hardest tasks I have had to complete so far. While working on this topic I ran into many difficulties and much that was unfamiliar. Without the help and sincere encouragement of many teachers, friends and my family, I could hardly have completed this thesis.
First of all, I would like to express my sincere thanks to Mr. Nguyen Nam Hai and Mr. Do Hoang Kien, who directly supervised me in completing this thesis. Thanks to them I gained access to valuable source material as well as valuable comments.
This thesis presents a statistical approach to predicting events based on Bayesian theory. The theory concerns computing the probability of an event from statistics on past events. After the calculation […]
After the theory, we examine a practical problem in information technology: automatic spam filtering. Solving it combines many measures, such as DNS blacklists, recipient and sender checks, Bayesian filters, IP address blocking and blacklists/whitelists. A Bayesian filter is the smart option: it stays close to the user, because the user is the one who trains it to recognize spam. This thesis focuses on the Bayesspam spam filter, an open-source package installed on an email system named S[…]
[…] an important role: it provides truthful, objective, accurate, complete and timely statistical information for assessing and forecasting the situation, for planning strategies and policies, for drawing up socio-economic development plans, and for meeting the statistical information needs of organizations and individuals. Among these roles, forecasting is one of the most meaningful: it involves an internal training process and runs automatically once trained. In other words, once we have knowledge drawn from statistical data or from user experience, combined with a learning (training) method based on statistical theory, we obtain a […] that can make decisions on its own with fairly high accuracy.
Statistical analysis is an indispensable step in scientific research, especially in experimental science. A scientific study, however costly and important, will never get the chance to appear in scientific journals if it is not analyzed with the proper method. Today, […] medical research.
To find out whether two treatments are equally effective, a researcher must collect data from two groups of patients (one group treated with method A and one group treated with method B). The frequentist school asks: "if the two treatments are equally effective, what is the probability of the observed data?", whereas the Bayesian school asks a different question: "given the observed data, what is the probability that treatment A is more effective than treatment B?". At first reading the two questions hardly seem to differ, but on reflection the difference is one of scientific philosophy, and its implications are very important. For a physician (or a scientist in general), Bayesian reasoning is very natural and fits practice well. In clinical medicine, the physician must use test results to judge whether or not a patient has cancer (just as, in scientific research, we must use data to reason about the plausibility of a hypothesis).
The Bayes Algorithm and Its Applications
According to Bayes' theorem, the probability of A given B depends on three factors:
> The probability of A on its own, regardless of B. It is denoted P(A) and read "the probability of A". This is called the marginal probability or prior probability; it is "prior" in the sense that it takes no account of any information about B.
> The probability of B on its own, regardless of A. It is denoted P(B).
> The probability of B given that A occurs. It is denoted P(B|A) and read "the probability of B given A". This quantity is called the likelihood of B given that A has occurred. Take care not to confuse the likelihood of A given B with the probability of A given B.
Given these three quantities, the probability of A given B is given by the formula:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
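As a minimal sketch of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), the snippet below applies it to the clinical-test setting mentioned earlier. The function name `posterior` and the figures (1% prevalence, 90% sensitivity, 5% false-positive rate) are hypothetical and chosen only for illustration; P(B) is obtained from the law of total probability.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) from P(A), P(B | A) and P(B | not A) via Bayes' theorem."""
    # Marginal probability of B, by the law of total probability.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical figures: A = "patient has the disease", B = "test is positive".
p_disease_given_positive = posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(p_disease_given_positive, 4))
```

Even with a fairly accurate test, the posterior stays well below the test's sensitivity because the prior P(A) is small.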
2.2 Risk minimization in Bayesian classification
Now consider the cork stopper problem: suppose a factory produces two kinds of stopper, $w_1$ = Super and $w_2$ = Average.
[…] the values of the probability density function. Note that both sides of the comparison in (1-2a) share the common factor $p(x)$, so we can rewrite it as:
if $p(x \mid w_1)\,P(w_1) > p(x \mid w_2)\,P(w_2)$ then $x \in w_1$ else $x \in w_2$. (1-4)
Or equivalently:
if $v(x) = \dfrac{p(x \mid w_1)}{p(x \mid w_2)} > \dfrac{P(w_2)}{P(w_1)}$ then $x \in w_1$ else $x \in w_2$. (1-4a)
In formula (1-4a), $v(x)$ is called the likelihood ratio.
Figure 1: Histogram of feature N for the two classes of cork stoppers. The threshold value N = 65 is marked with a vertical line.
Suppose each cork stopper has a single feature N, so that the feature vector is x = [N], and suppose a cork stopper has x = [65].
From the graph we obtain the likelihoods:
$p(x \mid w_1) = 20/24 = 0.833 \;\rightarrow\; P(w_1)\,p(x \mid w_1) = 0.333$ (1-5a)
$p(x \mid w_2) = 16/23 = 0.696 \;\rightarrow\; P(w_2)\,p(x \mid w_2) = 0.418$ (1-5b)
We therefore assign x = [65] to class $w_2$, even though the likelihood of $w_1$ is greater than that of $w_2$.
[…] more often. The classification error thus increases, and it may seem strange that the influence of the prior probabilities should be beneficial. The answer to this question relates to the topic of classification risk, which is presented next.
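The decision for x = [65] can be checked numerically. The sketch below is only an illustration: the priors 0.4 and 0.6 are not stated in this passage and are an assumption, inferred from the quoted products (0.333 / 0.833 and 0.418 / 0.696). The helper name `bayes_decide` is hypothetical.

```python
# Class-conditional likelihoods read off the histogram for x = [65].
likelihood = {"w1": 20 / 24, "w2": 16 / 23}
# ASSUMED priors, inferred from the products quoted in the text.
prior = {"w1": 0.4, "w2": 0.6}


def bayes_decide(likelihood, prior):
    """Return the class maximizing p(x | w) * P(w), together with all scores."""
    scores = {c: likelihood[c] * prior[c] for c in likelihood}
    return max(scores, key=scores.get), scores


decision, scores = bayes_decide(likelihood, prior)

# Equivalent test with the likelihood ratio v(x) against P(w2) / P(w1).
likelihood_ratio = likelihood["w1"] / likelihood["w2"]
threshold = prior["w2"] / prior["w1"]
ratio_decision = "w1" if likelihood_ratio > threshold else "w2"

print(decision, ratio_decision, scores)
```

Both forms of the rule select $w_2$: the likelihood ratio is above 1 but below the prior ratio of 1.5. The unrounded product for $w_2$ comes out near 0.417, slightly different from the 0.418 obtained from the rounded likelihood 0.696.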
Assume that a cork stopper of class $w_1$ costs 0.025£ and one of class $w_2$ costs 0.015£. Suppose that class $w_1$ stoppers are used for special bottles and class $w_2$ stoppers for ordinary bottles.
SB - the action of using a cork stopper for special bottles.
NB - the action of using a cork stopper for ordinary bottles.
$w_1$ = S (superior class); $w_2$ = A (average class)
Figure 3: Classification results for the cork stoppers with unequal prior probabilities (discriminant analysis classification matrix; rows: observed classifications, columns: predicted classifications).
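$$\lambda_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$ (1-7a)
In this case, since all the posterior probabilities add up to one, the risk of deciding $w_i$ that we need to minimize is:
$$\sum_{j \neq i} P(w_j \mid x) = 1 - P(w_i \mid x)$$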
This is equivalent to maximizing $P(w_i \mid x)$; the Bayes decision rule for minimum risk thus corresponds to the generalized rule:
decide $w_i$ if $P(w_i \mid x) > P(w_j \mid x)$ for all $j \neq i$. (1-7c)
In short, when correct classifications incur no loss and all wrong classifications incur the same loss, the Bayes decision rule for minimum risk is to select the class with the maximum posterior probability.
The decision function for class $w_i$ is:
$g_i(x) = P(w_i \mid x)$ (1-7d)
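The equivalence between minimum risk under a zero-one loss and maximum posterior probability can be sketched as follows. The posterior values and the function names are hypothetical; the loss matrix is indexed as `loss[i][j]`, the loss of deciding class i when the true class is j.

```python
def conditional_risks(posteriors, loss):
    """Risk of each decision i: sum over j of loss[i][j] * P(w_j | x)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def min_risk_decision(posteriors, loss):
    """Index of the decision with the smallest conditional risk."""
    risks = conditional_risks(posteriors, loss)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical posteriors P(w_j | x) for three classes (they sum to one).
posteriors = [0.2, 0.5, 0.3]
# Zero-one loss: 0 on the diagonal, 1 elsewhere.
zero_one = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

risks = conditional_risks(posteriors, zero_one)
best = min_risk_decision(posteriors, zero_one)
map_class = max(range(3), key=posteriors.__getitem__)
print(risks, best, map_class)
```

With the zero-one loss each risk equals one minus the corresponding posterior, so the minimum-risk decision and the maximum-posterior class coincide. Replacing `zero_one` with an asymmetric loss matrix can move the decision away from the most probable class.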
Now let us consider the different situations in which wrong decisions incur losses, […]
For the two-class case, the average risk can be computed as:
$$R = \lambda_{12}\,Pe_{12} + \lambda_{21}\,Pe_{21}$$
Let us use the training data set to estimate these errors: $Pe_{12} = 0.1$ and $Pe_{21} = 0.46$ (see Figure 6). The average risk per cork stopper is now:
$R = 0.015\,Pe_{12} + 0.01\,Pe_{21} = 0.0061$£.
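The arithmetic behind the average risk is a loss-weighted sum of the estimated error rates. A minimal sketch, using only the two losses (0.015 and 0.01, in pounds) and the two error rates (0.1 and 0.46) quoted above; the function name `average_risk` is hypothetical.

```python
def average_risk(losses, error_rates):
    """Average risk as the sum of loss * error rate over the kinds of error."""
    return sum(loss * rate for loss, rate in zip(losses, error_rates))


risk = average_risk(losses=[0.015, 0.01], error_rates=[0.1, 0.46])
print(f"{risk:.4f}")  # 0.0061
```

The larger error rate is paired with the smaller loss, which is how a classifier with a higher overall error can still have a lower average cost.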
With $\Omega$ denoting the set of classes, we have the general formula (1-9):
$$R = \sum_{w_i, w_j \in \Omega} \lambda_{ij}\,Pe_{ij}$$ (1-9)
The Bayes decision rule is not the only option in statistical classification. Note also that, in practice, one attempts to minimize the average risk by using estimates of the probability density functions computed from a training data set, as we did above for the cork stoppers. If we have grounds to believe […], we can instead compute the appropriate parameters from the training set. Alternatively, we can use empirical risk minimization (ERM), whose principle is to minimize the empirical risk instead of the true risk.
2.3 Normal Bayesian classification
So far we have not assumed any particular form for the sample distributions behind the likelihoods. The normal model, however, is a reasonable assumption. It is tied to the well-known central limit theorem, according to which the sum of a large number of […] variables […] parameters of the distribution (for example, the mean vector of a normal distribution). A notable way of estimating the parameter vector $\theta$ is to maximize the probability density $p(T \mid \theta)$, which can be viewed as a function of $\theta$ called the likelihood of $\theta$ for the training set $T$. Assuming that each of the $n$ samples $x_i$ in $T$ is drawn independently from an infinite population, the likelihood can be written as:
$$p(T \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
When using maximum likelihood estimation of the distribution parameters, it is often easier to maximize $\ln[p(T \mid \theta)]$, which is equivalent. For the Gaussian distribution, the sample estimates given by formulas (1-10a) and (1-10b) are exactly the maximum likelihood estimates, and they converge to the true values.
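A minimal sketch of maximum likelihood estimation for a one-dimensional Gaussian follows. Formulas (1-10a) and (1-10b) are not reproduced in this passage, so the estimators below are an assumption: the sample mean, and the variance with divisor n (the divisor n - 1 gives the unbiased estimate instead). The sample values and function names are hypothetical.

```python
import math


def gaussian_mle(samples):
    """Maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance divides by n, not n - 1.
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def log_likelihood(samples, mean, var):
    """ln p(T | theta) for independent samples, theta = (mean, var)."""
    n = len(samples)
    squared = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2.0 * math.pi * var) - squared / (2.0 * var)


samples = [62.0, 65.0, 58.0, 71.0, 66.0, 60.0]  # hypothetical feature values
mean_hat, var_hat = gaussian_mle(samples)
best = log_likelihood(samples, mean_hat, var_hat)

# Moving either parameter away from its estimate lowers the log-likelihood.
print(best > log_likelihood(samples, mean_hat + 0.5, var_hat))
print(best > log_likelihood(samples, mean_hat, var_hat * 1.2))
```

Working with the logarithm turns the product over samples into a sum, which is why it is the easier quantity to maximize.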
Figure 7: The bell-shaped surface of a two-dimensional normal distribution. An ellipse of equal probability density points is also shown.
(1-12)
(1-12a)
For two-class discrimination with normal distributions, equal prior probabilities and equal covariance, there is still a very simple formula for the classifier's probability of error:
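$$Pe = 1 - \Phi(\delta/2)$$ (1-13)
with
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$ (1-13a)
$$\delta^2 = (\mu_1 - \mu_2)^{T}\,\Sigma^{-1}\,(\mu_1 - \mu_2)$$ (1-13b)
where $\delta^2$ is the square of the Bhattacharyya distance, a Mahalanobis distance of the difference of the means, which expresses class separability.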
Figure 8 shows how Pe behaves as the squared Bhattacharyya distance increases. The function decreases exponentially and converges asymptotically to 0. It is therefore hard to reduce the classification error once it is already small.
[…] differ considerably, the difference between the quadratic and the linear solutions is significant only for patterns far from the prototypes, as in Figure 10.
Figure 10: Discrimination of two classes with the optimum quadratic classifier (solid line) and the sub-optimum linear classifier (dotted line).
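We illustrate this with the Norm2c2d data set. The theoretical error for this two-class, two-dimensional data set is:
$$\delta^2 = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 1.6 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = 8 \;\Rightarrow\; Pe \approx 7.9\%$$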
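The theoretical error can be reproduced with a short sketch under two stated assumptions. First, the 2×2 array is read as the inverse covariance matrix and the column (2, 3) as the difference of the class means. Second, the error is taken to be the standard result for two normal classes with equal priors and equal covariance, Pe = 1 − Φ(δ/2), where Φ is the standard normal distribution function and δ² is the Mahalanobis distance between the means. The function names are hypothetical.

```python
import math


def mahalanobis_sq(diff, inv_cov):
    """Quadratic form diff^T * inv_cov * diff, with plain lists as vector and matrix."""
    size = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(size) for j in range(size))


def normal_cdf(x):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def bayes_error(delta_sq):
    """Pe = 1 - Phi(delta / 2): equal priors, equal covariance, normal classes."""
    return 1.0 - normal_cdf(math.sqrt(delta_sq) / 2.0)


inv_cov = [[0.8, -0.8], [-0.8, 1.6]]  # ASSUMED to be the inverse covariance matrix
mean_diff = [2.0, 3.0]                # ASSUMED to be the difference of the class means

delta_sq = mahalanobis_sq(mean_diff, inv_cov)
print(delta_sq, bayes_error(delta_sq))

# Each doubling of the squared distance shrinks the error further.
for d2 in (1, 2, 4, 8, 16):
    print(d2, round(bayes_error(d2), 4))
```

Under these assumptions the squared distance is 8 and the error is a little under 8%. The loop also shows the error falling rapidly toward 0 as the squared distance grows.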
The training-set error estimate for this data set is 5%. By introducing deviations of ±0.1 into the values of the transformation matrix A of the data set, with