
VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF FOREIGN LANGUAGES
DEPARTMENT OF POSTGRADUATE STUDIES
NGUYEN THI VIET HA
A STUDY ON THE RELIABILITY OF THE FINAL ACHIEVEMENT
COMPUTER-BASED MCQS TEST 1 FOR THE 4TH SEMESTER NON-ENGLISH
MAJORS AT HANOI UNIVERSITY OF BUSINESS AND TECHNOLOGY
(Vietnamese title: Đánh giá độ tin cậy của bài thi trắc nghiệm thứ nhất trên máy
tính cuối kỳ 4 dành cho sinh viên năm thứ hai không chuyên
ngành tiếng Anh trường Đại học Kinh doanh và Công nghệ Hà Nội)
Minor Programme Thesis
Field: Methodology
Code: 601410
HANOI, 2008

English Department, HUBT for her willingness to offer test score data.
I wish to express my special thanks to the K11 students at Hanoi University of Business
and Technology who actively participated in the survey.
Finally, it is my great pleasure to acknowledge my gratitude to beloved members of my
family, especially my husband who constantly encouraged and helped me with my thesis.
ABSTRACT
The main aim of this minor thesis is to evaluate the reliability of the final Achievement
Computer-based MCQs Test 1 for the 4th semester non-English majors at Hanoi
University of Business and Technology.
To achieve this aim, a combination of qualitative and quantitative research methods
was adopted. The findings indicate that there is a certain degree of unreliability
in the final achievement computer-based MCQs test 1, and that two main factors cause
this unreliability: test item quality and test-takers’ performance.
Based on a thorough analysis of the collected data, the author makes some suggestions
for improving the quality of the final achievement test and MCQs test 1 for
non-English majors in the 4th semester at Hanoi University of Business and Technology.
Firstly, the test objectives, sections and skill weight should be adjusted to be more
compatible with the course objectives and the syllabus. Secondly, a testing committee
should be set up for the construction and development of a multiple-choice item bank
containing test items with good p-values and discrimination values.
LIST OF ABBREVIATIONS
1. CBT: Computer-based testing
2. HUBT: Hanoi University of Business and Technology
3. MC: Multiple choice
4. MCQs: Multiple-choice questions
5. ML Pre-: Market Leader Pre-intermediate

CANDIDATE’S STATEMENT
ACKNOWLEDGEMENT
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF TABLES AND CHARTS
TABLE OF CONTENTS
Chapter 1: INTRODUCTION
1.1. Rationale for the study
1.2. Aims and research questions
1.3. Theoretical and practical significance of the study
1.4. Scope of the study
1.5. Method of the study
1.6. Organization of the paper
Chapter 2: LITERATURE REVIEW
2.1. Language testing

2.4.2. Methods for test reliability estimate
2.4.3. Measures to improve test reliability
2.5. Summary
Chapter 3: The Context of the Study
3.1. The current English learning, teaching and testing situation at HUBT
3.2. The course objectives, syllabus and materials used for the second-year non-majors of English in Semester 4
3.2.1. The course objectives
3.2.2. Business English syllabus
3.2.3. The course book
3.2.4. Specification grid for the final achievement computer-based MCQs test in Semester 4
Chapter 4: Methodology
4.1. Participants
4.2. Data collection instruments
4.3. Data collection procedure

Chapter 6: CONCLUSION
6.1. Summary of the findings
6.2. Limitations of the study
6.3. Suggestions for further study
REFERENCES
APPENDICES
APPENDIX 1
Grammar, Reading, Vocabulary and Functional language checklist
APPENDIX 2
Survey questionnaire (for students at HUBT)
APPENDIX 3
Students’ test scores
APPENDIX 4
Item analysis of the final achievement computer-based MCQs test 1 (150 items, 349 examinees)
APPENDIX 5
Item indices of the final achievement computer-based MCQs test 1

improve the current testing situation in HUBT.
1.2. Aims and research questions
The main aim of the study is to investigate the reliability of the existing final
achievement MCQs test 1 (4th semester) for non-English majors at HUBT by analyzing
the test objectives, test content and skill weight format, students’ scores, test
items, and students’ perceptions of and comments on the test, and then to make
suggestions for the test’s improvement.
To achieve this aim, the following research questions are set for exploration:
1. Are the objectives, content and skill weight format of the final achievement
computer-based MCQs test 1 compatible with the course objectives, the
syllabus content and skill weight format?
2. To what extent is test 1 reliable?
3. What are the students’ attitudes towards the final achievement computer-based
MCQs test 1?
1.3. Scope of the study
The study is limited to the existing final achievement computer-based MCQs test 1
in the 4th semester for second-year non-English majors at HUBT.
1.4. Theoretical and practical significance of the study
Theoretically, the study proves that testing is crucial in order to measure and
evaluate the quality of learning and teaching. Also, test reliability is one of the most
important criteria for the evaluation of a test.
Practically, the study presents how reliable the final achievement MCQs test 1
administered at HUBT is and how to improve its quality.
1.5. Method of the study
Both qualitative and quantitative methods are used.

According to Henning (1987, p.1), “Testing, including all form of language test, is
one form of measurement”. In his opinion, tests such as listening or reading
comprehension tests are delivered in order to find out the extent to which these
skills are present in the learners. Similarly, Bachman (1990, p.20) stated: “A test is a
measurement instrument designed to elicit a specific sample of an individual’s
behavior”. He also considered obtaining the elicited sample of behavior as the
distinction of a test from other types of measurement.
Brown H.D (1995, p.384) presented the notion in a simpler way: “A test, in plain
words, is a method of measuring a person’s ability or knowledge in a given domain”.
He explained that a test first and foremost is a method which includes items and
techniques requiring the performance of testees. Via this performance, a person’s
ability or language competence is measured.
These viewpoints show that a language test is an effective tool for measuring and
assessing students’ language knowledge and skills and for providing valuable
information for better future teaching and learning.
2.1.2. The purposes of language tests
The purposes of language tests are perceived from different perspectives by
different scholars. Heaton (1990), for example, mentioned seven purposes, which can
be presented as follows:
• Finding out about progress
• Encouraging students
• Finding out about learning difficulties
• Finding out about achievement
• Placing students
• Selecting students
• Finding out about proficiency
In general, a language test is used to evaluate both teachers’ and students’
performance, to inform judgments about and adjustments to teaching materials and
methods, and to strengthen students’ motivation for further study.

A test is considered to be valid if it possesses content validity, face validity and
construct validity. The practicality of a test concerns its administration: a test is
practical when it saves time and money and is easy to administer, mark and interpret.
The discrimination of a test is the extent to which a test separates the students from
each other (Harrison, 1983). In other words, it is the capacity of the test to
discriminate among different students and to reflect differences in the performance of
individuals within the same group.
2.2. Achievement test
2.2.1. Definition
Achievement tests are of extensive use at different levels of education due to their
distinguished characteristics. Researchers define the notion of achievement tests in
various ways.
Henning (1987, p.6) held that:
Achievement tests are used to measure the extent of learning in a
prescribed content domain, often in accordance with explicitly stated
objectives of a learning program.
From this definition, it follows that an achievement test is a measurement tool
designed to examine the language competence of learners over a period of instruction
and to evaluate the instructional program. By the same token, Hughes (1989) stated that
achievement tests are intended to assess how successful individual students, groups of
students or the courses themselves have been in achieving objectives. Achievement
tests play an important role in education programs, especially in evaluating the
language knowledge and skills students have acquired during a given course.
2.2.2. Types of achievement test
Achievement tests can be subdivided into final achievement tests and progress
achievement tests according to the time of administration and the desired objectives
(Heaton, 1990).
Final achievement tests are usually given at the end of the school year or at the end
of the course to measure how far students have achieved the teaching goals. The
contents of these tests must be related to the teaching content and objectives concerned.

testing technique. An MC item is a test item where the test taker is required to choose
the only correct answer from a number of given options (McNamara, 2000; Weir,
1990).
In the view of Heaton (1988), MC items take many forms but their basic structure
includes two parts. The initial part is known as the stem. The primary purpose of the
stem is to present the problem clearly and concisely. The stem needs to give the testees
a very general idea of the problem and the answer required. It may take the form
of an incomplete statement, a complete statement or a question. The other part is the
choices from which the students select their answers, referred to as options,
responses or alternatives. An MC item may have three, four or five options, of
which one is the correct option or key while the others are distractors, whose task
is to distract the majority of poor students from the correct option. The optimum
number of options for each multiple-choice item in most public tests is five, and it is
desirable to use four options for grammar items and five for vocabulary and reading.
2.3.2. Benefits of MCQs test
MC items are undoubtedly one of the most widely used types of items in objective
tests (Heaton, 1988). The popularity of this testing technique results from its efficiency.
Researchers such as Weir (1990), Heaton (1988) and Hughes (1989) pointed out a
number of benefits, which are detailed below.
Firstly, the scoring of an MCQs test is perfectly reliable, rapid and economical. There
is only one correct answer in an MC item, so the scorers’ interference in the
test is minimized. The scorers are not permitted to impose their personal expertise,
experience, attitudes and judgment when giving marks to testees’ responses. The
testees, thus, always get a consistent result whoever the scorers are and whenever their
tests are marked. In addition, MCQs tests can be marked mechanically with
minimal human intervention. As a result, the marking is not only reliable and simple
but also more rapid and often more cost-effective than other forms of written test (Weir,
1990).
Secondly, an MCQs test can cover a much wider sample of knowledge than a

advantages. First, just as with paper-based MCQs tests, fixed-response items can be
scored automatically and the candidate can be given a score immediately. Second, the
computer can deliver tests that are tailored to the particular abilities of the candidate.
This type of test, also called a computer-adaptive test, can provide far more information
about the testees’ ability.
2.3.3. Limitations of MCQs tests
Although MCQs tests bring many benefits, especially to test administrators, there
are several problems associated with the use of MC items. These problems have been
identified by a number of researchers such as Weir (1990), Hughes (1989), Heaton
(1988), McCoubrie (2004) and McNamara (2000).
First of all, Hughes (1989) criticized the MCQ technique for testing only recognition
knowledge. To do a given task, a testee just needs to look at the stem and the four or
five options and then pick out the key. His or her performance is not much more than the
recognition of the right form of language; it gives no evidence that this person can
produce the language. This type of test therefore leaves a gap between at least some
candidates’ productive and receptive skills, and performance on an MCQs
test may give an inaccurate picture of these candidates’ ability (Hughes, 1989). Heaton
(1988) also pointed out that an MC item does not lend itself to the testing of language as
communication, and that the process involved in selecting one out of four or five
options does not bear much relation to the language used in most real-life situations.
In everyday situations we are normally required to both produce and receive language,
while MC items are aimed merely at testing receptive skills.
Another problem that arises when using MCQs tests is that “the multi choice item is one
of the most difficult and time consuming types of items to construct” (Heaton, 1988,
p.27). In order to write a good item, test designers have to strictly follow certain
principles. For example, they have to write many more items than they actually need for
a test. After that, they have to pre-test the items, analyze students’ performance on
them, evaluate the items and identify the usable ones, or even rewrite items to reach a
satisfactory final version. These procedures take a lot of test constructors’ time and need

• Multi choice items should be as brief and as clear as possible
• Multi choice items should be arranged in rough order of increasing difficulty and
there should be one or two simple items to “lead in” the testees.
2.4. Reliability of a test
2.4.1. Definition
In research, the term reliability means ‘repeatability’ or ‘consistency’. A test is
considered reliable if it gives the same result over and over again, assuming that
what is being measured does not change. Lynch (2003, p.83) stated that reliability refers
to “the consistency of our measurement”. In the same vein, Harrison (1983) explained that,
to be reliable, tests should not be elastic in their measurement. Whichever version of the
test a testee takes, on whatever occasion the test is administered, and whichever raters
score the test, it should still yield the same results.
2.4.2. Methods of test reliability estimate
Reliability may be estimated through a variety of methods, which are presented below:
* Test-retest method is a classic way to calculate the reliability coefficient of a test. The
test is given to a group of students and then given again to the same students shortly
afterward (the interval between the two administrations is no more than two weeks). The
test is assumed to be perfectly reliable if the students get the same scores on the first
and the second administration (Alderson et al., 1995).
* Parallel-form methods involve correlating the scores from two or more similar (parallel)
tests which are administered to the same sample of persons. A formula for this method
may be expressed as follows:
$R_{tt} = r_{A,B}$ (Henning, 1987)
$R_{tt}$: the reliability coefficient
$r_{A,B}$: the correlation of form A with form B of the test when administered to the same
people at the same time.
* Inter-rater method is applied when scores on the test are independent estimates by two
or more raters. It involves the correlation of the ratings of one rater with those of another.
The following formula is used in calculating reliability:

$R_{tt}$: the KR-20 reliability estimate
$n$: the number of items in the test
$s_t^2$: the variance of test scores
$\sum s_i^2$: the sum of the variances of all items (or $\sum pq$)
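As an illustration of how these quantities combine, the following is a minimal sketch that computes a KR-20 estimate from a matrix of dichotomously scored (0/1) responses. It assumes the standard form of the formula, $R_{tt} = \frac{n}{n-1} \cdot \frac{s_t^2 - \sum s_i^2}{s_t^2}$, and uses population variances (dividing by the number of examinees) so that each item variance equals $pq$. The function name `kr20` and the sample data are hypothetical and are not taken from the thesis data.

```python
from statistics import pvariance


def kr20(responses):
    """Estimate KR-20 from rows of 0/1 item scores (one row per examinee).

    Assumption: population variances are used, so each item variance is p*q.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # Variance of the examinees' total scores (s_t^2).
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances (sum of p*q over all items).
    sum_item_var = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_examinees
        sum_item_var += p * (1 - p)

    return n_items / (n_items - 1) * (total_var - sum_item_var) / total_var


# Hypothetical data: 5 examinees, 4 items.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
estimate = kr20(sample)
```

For this hypothetical sample the total-score variance is 2.0 and the item variances sum to 0.8, so the estimate is about 0.8. A real analysis would use the full response matrix rather than a toy sample.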
Kuder-Richardson Formula 21 (KR-21) is based on total test scores and assumes that all
items are of an equal level of difficulty. The KR-21 formula is as follows:
$$R_{tt} = \frac{n}{n-1}\left(1 - \frac{\bar{x} - \bar{x}^2/n}{s_t^2}\right)$$ (Henning, 1987)
$R_{tt}$: the KR-21 reliability estimate
$n$: the number of items in the test
$\bar{x}$: the mean of scores on the test
$s_t^2$: the variance of test scores
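Because KR-21 needs only the number of items, the mean and the variance of the total scores, it can be computed without item-level data. The sketch below is a minimal illustration of the formula $R_{tt} = \frac{n}{n-1}\left(1 - \frac{\bar{x} - \bar{x}^2/n}{s_t^2}\right)$; the function name `kr21`, the use of the population variance and the sample scores are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Estimate KR-21 from examinees' total scores on an n_items-item test.

    Assumption: the population variance of the total scores is used.
    """
    if n_items < 2:
        raise ValueError("KR-21 needs at least two items")
    x_bar = mean(total_scores)
    var_t = pvariance(total_scores)
    if var_t == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - (x_bar - x_bar ** 2 / n_items) / var_t)


# Hypothetical total scores of 5 examinees on a 4-item test.
scores = [4, 3, 2, 1, 0]
estimate_21 = kr21(scores, 4)
```

With these hypothetical scores the mean is 2 and the variance is 2, giving an estimate of about 0.67. Since KR-21 assumes items of equal difficulty, it generally does not exceed the KR-20 estimate computed from the same data.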
Alderson et al. (1995) stated that, for internal consistency reliability, the
perfect reliability index is +1.0. Similarly, Hughes (1989, pp. 31-32) noted that “ the

D: discriminability
Hc: the number of correct responses in the high group
Lc: the number of correct responses in the low group
The optimal size of each group is 28% of the total sample. For very large samples of
examinees, the size of the high and low groups is reduced to 20% for
computational convenience. The acceptable discrimination value by the sample separation
method is ≥ 0.67 (Henning, 1987).
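The sample separation procedure can be sketched in a few lines of code. The formula itself is not reproduced in this excerpt, so the sketch assumes D = Hc / (Hc + Lc), which is consistent with the 0.67 threshold quoted above (an item answered correctly twice as often in the high group as in the low group). The function name `discriminability`, the record layout and the sample data are hypothetical.

```python
def discriminability(records, fraction=0.28):
    """Sample-separation discriminability for a single item.

    records: list of (total_score, item_correct) pairs, item_correct being 0 or 1.
    fraction: share of examinees placed in each of the high and low groups.
    Assumption: D = Hc / (Hc + Lc); returns None if nobody in either group
    answered the item correctly.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    high = ranked[:group_size]
    low = ranked[-group_size:]

    hc = sum(correct for _, correct in high)  # correct responses, high group
    lc = sum(correct for _, correct in low)   # correct responses, low group
    if hc + lc == 0:
        return None
    return hc / (hc + lc)


# Hypothetical data: 10 examinees, total score and result on one item.
item_records = [
    (10, 1), (9, 1), (8, 1), (7, 1), (6, 0),
    (5, 1), (4, 0), (3, 1), (2, 0), (1, 0),
]
d_value = discriminability(item_records)
```

Here each group holds three examinees (28% of ten, rounded), Hc is 3 and Lc is 1, so D is 0.75, which would pass the 0.67 criterion. Under this assumed formula, an item answered equally well by both groups scores 0.5.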
2.4.3. Measures to improve test reliability
Reliability may be improved by eliminating its sources of error. Hughes (1989)
lists the following recommendations for improving test reliability:
• Take enough sample of behavior
• Do not allow candidates too much freedom
• Write unambiguous items
• Provide clear and explicit instructions
• Ensure that tests are well laid out and perfectly legible
• Ensure that candidates are familiar with the format and testing techniques
• Provide uniform and non-distracting conditions of administration
Furthermore, item difficulty and item discriminability indicate whether the reliability
of an MCQs test is low or high (Henning, 1987). Therefore, the most straightforward way
to improve test reliability is to design MCQ items with a good level of difficulty and
a good discrimination value.
2.5. Summary
This chapter presents the theoretical framework for the study. In Section 2.1, the
notion of a language test as a measuring device of people’s ability is reviewed.
Additionally, the purposes of language testing, types of language tests and criteria of a
good test are also discussed. Section 2.2 classifies achievement tests into two types and
mentions considerations in designing final achievement tests. The definition, benefits and
limitations of MCQs tests and the principles for constructing this type of test are dealt
with in Section 2.3. The final section, 2.4, is concerned with test reliability, methods for estimating

and 4th semester for the second-year students.
The computer-based MCQs test administered at HUBT is similar to a paper-based
one. The main difference is that the test is delivered on computers and students simply
click the mouse to choose their response among A, B, C and D. This kind of test is
different from computer-adaptive tests, which are tailored to the particular abilities of
the candidate. In

