CHAPTER 1: INTRODUCTION
1.1 Rationale for choosing this topic
English plays an especially important role in the development of science, technology
and international relations, which has increased the need for English language learning
and teaching in many parts of the world. English has become a compulsory subject in
national education in many countries. Among them, Vietnam regards the teaching and
learning of English as a major strategic tool for developing human resources and for
keeping up with other countries. Therefore, at every level of education, from primary
school to university and postgraduate study, learners study English either as a
compulsory subject or as a personal goal, in order to gain access to information
technology and to find a good job. English teaching and learning is thus essential for
job training.
Fully aware of the importance of the English language, the University of
Technology, Ho Chi Minh National University requires its students to study it as a
compulsory subject during the first three academic years. English has been taught at the
University of Technology since it was established, with the aim of equipping students
with an essential tool for engaging with the wider world. However, little attention has
been paid to evaluating how much students acquire when they learn a foreign language, how
well they use what they have been taught and what level of English they have reached.
Evaluation is limited to calculating the percentage of students who pass the English
tests, which therefore says nothing about the validity, reliability or discrimination of
the tests. The results of the English tests are thus not fully exploited. In addition,
during my time as a teacher of English at the University of Technology, I have heard
teachers and learners complain about the content and structure of the English achievement
test. As a result, the English section has decided to renew the item bank in order to
make it more valid and more reliable.
With this in mind, the author undertook this study, entitled
“Evaluating the Reliability and Validity of an English Achievement Test for Third-
year Non- major students at the University of Technology, Ho Chi Minh National
University and some suggestions for changes” with the intention to find out how valid
year students
in terms of its validity and reliability.
Then, quantitative methodology was used to collect and analyze data. After
collecting the data, the author employed statistical software to interpret them and to
present the findings.
1.5 Research questions
This study seeks to answer the following research questions:
1. Is the achievement test for third-year non-English major students at the
University of Technology, Ho Chi Minh National University reliable?
2. Is the achievement test for third-year non-English major students at the
University of Technology, Ho Chi Minh National University valid?
3. Is it necessary to make some changes to the test? If yes, what are the changes?
1.6 Design of study
The thesis is organized into four chapters:
Chapter 1 - Introduction presents basic information: the rationale, the aims, the
method, the research questions and the design of the study.
Chapter 2 - Literature Review reviews the theoretical background to test evaluation,
including language testing, the criteria of good tests, theoretical views on test
reliability and validity, and achievement tests.
Chapter 3 - The Study is the main part of the thesis. It presents the context of the
study, the detailed results obtained from the collected tests and the findings in
response to the research questions.
Chapter 4 - Conclusion offers conclusions and practical implications for improving the
test. In this part, the author also proposes some suggestions for further research on
the topic.
CHAPTER 2: LITERATURE REVIEW
This chapter provides an overview of the theoretical background of the study. It
includes four main sections. Section 2.1 discusses the importance of testing in education.
testing is “what teachers measure or judge learners’ competence all the time and, ideally,
learners measure and judge themselves”.
In short, testing is undeniably an integral part of teaching and cannot be
separated from the program or from the course goals. Testing has both positive and
negative impacts on teaching. It provides the teacher with information on how effective
his teaching has been, and the teacher can use tests to diagnose his own efforts as well
as those of his students.
Testing and Learning
Testing is a tool to “pinpoint strengths and weaknesses in the learned abilities of the
student” (Henning, 1987: 1). That is, through testing, learners can find out what level
they have reached and what difficulties they have encountered. As a result, they can
adjust their learning and explore more effective ways of learning. At the same time, the
teacher can rely on test results to understand learners’ abilities better and can then
improve his teaching methods or revise what has been taught. Thus, Read (1982: 2) said
that “a test can help both teachers and learners to clarify what the learners really need
to know”. Clearly, both the teacher and the learners can benefit from testing.
To sum up, tests can benefit students, teachers and even administrators by
confirming progress that has been made and showing how they can best redirect their
future efforts. In addition, good tests can sustain or enhance class morale and aid
learning.
2.2 Language Testing
Language testing is one form of testing and also one form of measurement. Its
importance in English learning is described as follows: “properly made English
tests can help create positive attitudes toward instruction by giving students a sense of
accomplishment and a feeling that the teacher’s evaluation of them matches what he has
taught them. Good English tests also help students learn the language by requiring them to
study hard, emphasizing course objectives, and showing them where they need to improve”
(Davies, 1996: 5).
McNamara (2000) presented three main roles of language testing, which apply
not only in education but in other fields as well. Firstly, language testing is considered as a
there is a problem that should be pointed out: rather than emphasizing the tension among
the different qualities, test developers need to recognize their complementarity.
Bachman and Palmer (1996) consider the criteria as qualities of test usefulness
rather than individual factors. Their idea of usefulness can be visually presented as in
Figure 2.1:
Usefulness = reliability + validity + impact + authenticity +
interactiveness + practicality
Figure 2.1 Usefulness
(Bachman and Palmer, 1996)
Henning (1987) added further test characteristics and summarized them in a table
called A Checklist for Test Evaluation. The checklist is used to rate the adequacy of a
test for any given purpose.
(Diagram: test usefulness and its component qualities: practicality, reliability,
validity, authenticity, interactiveness and impact)
Table 2.1 A checklist for test evaluation
Name of test ________________________________
Purpose Intended ____________________________
Test characteristic Rating (0 = highly inadequate, 10 = highly adequate)
1. Validity _______________________
2. Difficulty _______________________
alternate forms of the same test. Due to differences in the exact content being assessed on
the alternate forms, environmental variables such as fatigue or lighting, or student error in
responding, no two tests will consistently produce identical results. This is true regardless
of how similar the two tests are. For example, a test that includes a translation part would
probably produce different scores from one administration to another because it is
subjective, and it would thus be unreliable.
Henning (1987: 10) claimed that all tests are subject to inaccuracies. The scores
gained by the test-takers provide only approximate estimates of their true abilities.
While some measurement error is unavoidable, it is possible to quantify it and to reduce
it greatly. A test is said to be reliable if the scores obtained are generally similar
when it is administered to the same students, with the same ability, at a different
time. Since test reliability is related to test length, with longer tests tending to be
more reliable than shorter ones, knowing the importance of the decision to be based on
the examination results can guide the choice of the number of test items.
Test reliability is considered “a quality of test score” by Bachman (1990: 24). He
makes the further point that if a student receives a low score on a test one day and a
high score on the same test two days later, the test does not yield consistent results,
and the score cannot be considered a reliable indicator of the individual’s ability.
Reliability can also be viewed as an indicator of the absence of random error when
the test is administered. When random error is minimal, scores can be expected to be more
consistent from administration to administration.
Sources of Error
According to Bachman (1990: 165), there are four factors that affect language test
scores. The effects of these factors on a test score are illustrated in Figure 2.2.
Figure 2.2 Factors that affect language test scores
We can infer from the figure that a score in a language test is indicated by
communicative language ability. Also, the language test is affected by factors other than
Rt is expressed as a number ranging between 0 and 1.00, with r = 0 revealing no reliability
and r = 1.00 indicating perfect reliability. An acceptable reliability coefficient must not be
below 0.90; a value below this indicates inadequate reliability. For instance, r = 0.90 on a
test means that 90% of the variance in the test scores reflects true differences in ability,
while the remaining 10% is due to measurement error. If r = 0.60, only 60% of the score
variance is reliable and the other 40% may be caused by error.
Thus, the higher the reliability coefficient is, the lower the standard error is. The
lower the standard error is, the more reliable the test scores are.
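The inverse relation between the reliability coefficient and the standard error can be made concrete with a short sketch. It assumes the classical test theory relation SEM = SD × sqrt(1 − r), which is not stated in the text above and is supplied here only for illustration; the function name and the score standard deviation of 10 marks are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Return SEM = SD * sqrt(1 - r) (assumed classical test theory relation)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical test whose scores have a standard deviation of 10 marks.
for r in (0.60, 0.90, 1.00):
    sem = standard_error_of_measurement(10.0, r)
    print(f"r = {r:.2f} -> SEM = {sem:.2f} marks")
```

Under this assumption, raising r from 0.60 to 0.90 roughly halves the standard error, and a perfectly reliable test (r = 1.00) has none.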
Types of reliability estimates
According to Henning (1987), there are several types of reliability estimates, each
influenced by different sources of measurement error, which may arise from bias of item
selection, from bias due to time of testing or from examiner bias. These three major
sources of bias may be addressed by corresponding methods of reliability estimate:
a. Selection of specific items:
- Parallel Form Reliability
- Internal Consistency Reliability estimates (Split Half Reliability)
- Rational equivalence
b. Time of testing:
- Test-retest Method
c. Examiner bias
- Inter-rater Reliability
Parallel form reliability: indicates how consistent test scores are likely to be if a person
takes two or more forms of a test. A high parallel form reliability coefficient
indicates that the different forms of the test are very similar which means that it
makes virtually no difference which version of the test a person takes. On the other
hand, a low parallel form reliability coefficient suggests that the different forms are
probably not comparable; they may be measuring different things and therefore,
cannot be used interchangeably.
A formula for this method may be expressed as follows:
\[ R_{tt} = \frac{2 r_{A,B}}{1 + r_{A,B}} \]
(Henning, 1987)
(In which $R_{tt}$: reliability estimated by the split-half method; $r_{A,B}$: the
correlation of the scores from one half of the test with those from the other half).
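The split-half estimate can be sketched in a few lines of code. The sketch assumes the standard Spearman-Brown correction, R_tt = 2r / (1 + r), which matches the symbols defined above, and it assumes an odd-even split of the items, since the text does not say how the halves are formed. The function names and the response data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown_split_half(r_ab):
    """Correct a half-test correlation to full-test length."""
    return 2 * r_ab / (1 + r_ab)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores (0 or 1) per examinee."""
    # Assumption: items in odd positions form half A, even positions half B.
    half_a = [sum(row[0::2]) for row in item_scores]
    half_b = [sum(row[1::2]) for row in item_scores]
    return spearman_brown_split_half(pearson(half_a, half_b))


# Hypothetical responses of five examinees to a four-item test.
responses = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
print(round(split_half_reliability(responses), 3))
```

A different split of the items would generally give a different estimate, which is the limitation that rational equivalence is meant to address.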
Rational equivalence is another method, which provides a coefficient of internal
consistency without the need to compute reliability estimates for every possible
split-half combination. This method focuses on the degree to which the individual
items are correlated with each other.
\[ R_{tt} = \frac{n}{n - 1} \left( \frac{s_t^2 - \sum s_i^2}{s_t^2} \right) \tag{4} \]
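Rational equivalence is commonly computed with Kuder-Richardson formula 20 (KR-20). The sketch below assumes that this is the formula intended here; the symbol definitions are not preserved in the text, so the variable names are the sketch's own. It further assumes dichotomously scored items (0 or 1) and population variances, so that the variance of an item is p(1 − p). The data are hypothetical.

```python
from statistics import pvariance


def kr20(item_scores):
    """KR-20 for dichotomous items; one list of 0/1 scores per examinee."""
    n_examinees = len(item_scores)
    n_items = len(item_scores[0])
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")
    # For a 0/1 item, the population variance equals p * (1 - p).
    sum_item_variance = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_scores) / n_examinees
        sum_item_variance += p * (1 - p)
    return (n_items / (n_items - 1)) * (
        (total_variance - sum_item_variance) / total_variance
    )


# Hypothetical responses of five examinees to a four-item test.
responses = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
print(round(kr20(responses), 3))
```

Unlike the split-half estimate, this value does not depend on how the items are divided into halves.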
\[ R_{tt} = \frac{n r_{A,B}}{1 + (n - 1) r_{A,B}} \tag{6} \]
(In which $R_{tt}$: inter-rater reliability; $n$: the number of raters whose combined
estimates form the final mark for the examinees; $r_{A,B}$: the correlation between the
raters, or the average correlation among the raters if there are more than two).
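The inter-rater estimate can be illustrated as follows. The sketch assumes that formula (6) has the Spearman-Brown form R_tt = n r / (1 + (n − 1) r), consistent with the symbols defined above, and that the average correlation is the mean of the Pearson correlations over all pairs of raters. The function names and the marks are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of marks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(marks_by_rater):
    """Average correlation over every pair of raters."""
    pairs = list(combinations(marks_by_rater, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def inter_rater_reliability(n_raters, r_ab):
    """Reliability of the combined mark of n_raters raters."""
    return n_raters * r_ab / (1 + (n_raters - 1) * r_ab)


# Hypothetical marks given by three raters to the same five scripts.
marks = [
    [6, 7, 5, 8, 4],
    [5, 7, 6, 8, 5],
    [6, 8, 5, 7, 4],
]
r_avg = mean_pairwise_correlation(marks)
print(round(inter_rater_reliability(len(marks), r_avg), 3))
```

With a single rater the expression reduces to the correlation itself, and adding raters with the same average agreement raises the estimate.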
One way to improve the reliability of a test is to become aware of the test
characteristics that may affect reliability. Among these characteristics are test
difficulty, discriminability, item quality, etc.
Test difficulty is calculated by the following formula:
\[ p = \frac{Cr}{N} \tag{7} \]
(In which $p$: difficulty; $Cr$: sum of correct responses; $N$: number of examinees)
According to Heaton (1988: 175), the scale for the test difficulty is as follows:
p: 0.81-1: very easy (the percentage of correct responses is 81%-100%)
p: 0.61-0.8: easy (the percentage of correct responses is 61%-80%)
p: 0.41-0.6: acceptable (the percentage of correct responses is 41%-60%)
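As an illustration, the difficulty index and the bands listed above can be computed as follows. The sketch assumes that p is the number of correct responses divided by the number of examinees, as the symbol definitions suggest. Only the three bands given above are encoded, and values that fall between two printed bands (for example 0.805) are assigned to the upper band; both choices are assumptions of the sketch. The function names are hypothetical.

```python
def item_difficulty(correct_responses: int, examinees: int) -> float:
    """Proportion of examinees who answered the item correctly."""
    if examinees <= 0:
        raise ValueError("examinees must be positive")
    return correct_responses / examinees


def difficulty_band(p: float) -> str:
    """Map p to one of the bands listed above (boundary handling assumed)."""
    if p > 0.80:
        return "very easy"
    if p > 0.60:
        return "easy"
    if p > 0.40:
        return "acceptable"
    return "below the bands listed above"


# Hypothetical item answered correctly by 35 of 50 examinees.
p = item_difficulty(35, 50)
print(p, difficulty_band(p))
```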
(1988: 159) also provides a simple but complete definition of validity as “the validity of a
test is the extent to which it measures what it is supposed to measure”. Hughes (1989: 22)
claimed that “A test is said to be valid if it measures accurately what it is intended to
measure”.
According to the Standards for Educational and Psychological Testing (1985: 9),
“Validity is the most important consideration in test evaluation. The concept refers to
the appropriateness, meaningfulness, and usefulness of the specific inferences from the test
scores. Test validation is the process of accumulating evidence to support such inferences”.
Thus, to be valid, a test needs to assess learners’ ability in the specific area defined
by the aim of the test. For instance, a listening test with written multiple-choice
options may lack validity if the printed choices are so difficult to read that the exam
actually measures reading comprehension as much as it does listening comprehension.
Validity is classified into such subtypes as:
Content validity
This is a non-statistical type of validity that involves “the systematic examination
of the test content to determine whether it covers a representative sample of the
behavior domain to be measured” (Anastasi & Urbina, 1997: 114). A test has
content validity built into it by careful selection of which items to include. Items
are chosen so that they comply with the test specification which is drawn up
through a thorough examination of the subject domain. Content validity is very
important in evaluating the validity of a test because “the greater a test’s
content validity, the more likely it is to be an accurate measure of what it is supposed
to measure” (Hughes, 1989: 22).
Construct validity
A test has construct validity if it demonstrates an association between the test
scores and the prediction of a theoretical trait. Intelligence tests are one example of
measurement instruments that should have construct validity. Construct validity is
viewed from a purely statistical perspective in much of the recent American
literature (Bachman and Palmer, 1981a). It is seen principally as a matter of the
Reliability and validity are the two most vital characteristics that constitute a good
test. However, the relationship between reliability and validity is rather complex.
On the one hand, it is possible for a test to be reliable without being valid. It means
that a test can give the same result time after time but not measure what it was
intended to measure. For example, an MCQ test could be highly reliable in the sense
of testing individual vocabulary, but it would not be valid if it were taken to
indicate the students’ ability to use the words productively. Bachman (1990: 25)
says “While reliability is a quality of test scores themselves, validity is a quality of
test interpretation and use”.
On the other hand, if the test is not reliable, it cannot be valid at all. To be valid,
according to Hughes (1988: 42), “a test must provide consistently accurate measurements. It must
therefore be reliable. A reliable test, however, may not be valid at all”. For example, in a
writing test, candidates may be required to translate a text of 500 words into their native
language. This could well be a reliable test but it cannot be a valid test of writing.
Thus, there will always be some tension between reliability and validity. The tester
has to balance gains in one against losses in the other.
2.4 Achievement test
Achievement tests play an important role in school programs, especially in
evaluating the language knowledge and skills that students have acquired during the
course, and they are widely used at different school levels.
Achievement tests are also known as attainment or summative tests. According to
Henning (1987: 6), “achievement tests are used to measure the extent of learning in a
prescribed content domain, often in accordance with explicitly stated objectives of a
learning program”. These tests may be used for program evaluation as well as for
certification of learned competence. It follows that such tests normally come directly
after a program of instruction.
Davies (1999: 2) also shares the view that “achievement refers to the mastery of
what has been learnt, what has been taught or what is in the syllabus, textbook, materials,
etc. An achievement test therefore is an instrument designed to measure what a person has
objectives. Tests based on course objectives work against the perpetuation of poor teaching
practice, something which course-content-based tests, almost as if part of a conspiracy,
fail to do. It is the author’s belief that basing test content on course objectives is
much preferable: it provides more accurate information about individual and group
achievement, and it is likely to promote a more beneficial backwash effect on teaching.
Progress achievement tests, as the name suggests, are intended to measure the
progress that learners are making. Since “progress” means progress toward the course
objectives, these tests should also be related to objectives. They should form a clear
progression toward the final achievement tests based on course objectives. If the
syllabus and teaching methods are appropriate to the objectives, progress tests based on
short-term objectives will fit well with what has been taught. If not, there will be
pressure to create a better fit. If it is the syllabus that is at fault, it is the
tester’s responsibility to make clear that it is there that change is needed, not in the
tests.
In addition to more formal achievement tests, which need careful preparation,
teachers can feel free to set their own informal tests as a rough check on students’
progress and to keep students on their toes. Since such tests will not form part of
formal assessment procedures, their construction and scoring need not be directed purely
toward the intermediate objectives on which the more formal progress achievement tests
are based. However, they can reflect the particular “route” that an individual teacher
is taking toward the achievement of objectives.
Summary
In this chapter, the writer has presented a brief literature review that lays the
groundwork for the thesis. Owing to the limited time and length of this thesis, the
writer focuses only on evaluating the reliability and validity of a chosen
achievement test. Therefore, this chapter deals only with those points on which the
thesis is based.
CHAPTER 3: THE STUDY
This chapter is the main part of the study. It provides practical background for the
years. It is divided into two phases: Basic English (1) and ESP (2). Phase 1, which lasts
for the first three semesters with 99 forty-minute periods, is covered by the Lifelines
series, in which the students focus only on reading skills and grammar.
Phase 2, comprising the final three semesters with 93 forty-minute periods in total,
is wholly devoted to ESP. It should be noted that ESP in this context is simply the
combination of the English language and content for Information Technology. In Phase
2, the students work with Basic English for Computing, which consists of twenty-eight
units providing background knowledge and vocabulary for computing. This book covers the
four skills of listening, speaking, reading and writing, together with language focus.
The reading texts in the course book are meaningful and useful to the students because
they first revise the students’ knowledge and language items and then supply background
knowledge and a source of vocabulary relating to their major, Information Technology.
Table 3.1 illustrates how the syllabus is allocated to each semester.
Table 3.1 Syllabus content allocation
Semester   45-minute periods   Teaching content                      Course book
1          33                  Reading and grammar                   Lifelines Elementary
2          33                  Reading and grammar                   Lifelines Elementary
3          33                  Reading and grammar                   Lifelines Pre-Intermediate
4          39                  Reading and grammar and vocabulary    Basic English for Computing
5          27                  Reading and grammar and vocabulary    Basic English for Computing
6          27
subject-specific lexis.
* Simple, authentic texts and diagrams present up-to-date computing content in an
accessible way.
* Tasks encourage learners to combine their subject knowledge with their growing
knowledge of English.
* Glossary of current computing terms, abbreviations, and symbols.
* Teacher's Book provides full support for the non-specialist, with background
information on computing content, and answer key.
The book was designed to cover all four skills, followed by language focus.
However, because of the objectives of the ESP course taught at the University of
Technology, only reading skills and grammar are focused on, as mentioned above.
The detailed content of the book can be found in Appendix 2. The book appears good, with
authentic and meaningful texts. The final achievement tests are often based closely on the
content of the course book.
3.2 English testing at the University of Technology
3.2.1 Testing situation
English tests for students at the University of Technology are designed by the staff
of the English section. Each teacher is responsible for writing test items for each
semester, and all the materials are then fed into a common item bank that is managed by
software on a server. Before the examinations, the person in charge of preparing the
tests uses the software to mix the test items in the item bank and print out the tests.
All the tests are designed following the syllabus-content approach. In all, the students
are required to take six formal tests throughout their courses. Within the limited scope
of the study, the writer focuses on the final test of the third year, that is, of the
sixth semester, which is the last test that the students have to take.
The current English testing situation at the University of Technology has several
noteworthy points:
• Students are often familiarized with the test format long before the actual test, which
leads to test-oriented learning.
                                      Narrative text relating to computing, approx. 300-400 words   × 5, 4-option multiple choice     25
III  Reading and Vocabulary           Narrative text relating to computing, approx. 150-200 words   × 10, open cloze                  15
IV   Writing                          Incomplete sentences                                          × 5, sentence building            15
V    Writing                          Incomplete sentences                                          × 5, sentence transformation      15
VI   English-Vietnamese translation   Sentences in English                                          2 sentences                       10
VII  Vietnamese-English translation   Sentences in Vietnamese                                       2 sentences                       5
Total                                                                                                                                 100
(For the specific test, see Appendix 3)
As explained above, the students are expected to apply their reading skills,
grammar and vocabulary in preparation for the final examination, so the test is aimed at
assessing both knowledge and skills. In the first part of the test, the students have to
perform their tasks with the background knowledge, vocabulary and language items