Nguyen Thi Phuong Thu August 2005
Vietnam National University, Hanoi
College of Foreign Languages
Designing & evaluating an English reading
test for the non-majors of Civil Engineering
at Haiphong Public University
(Designing and evaluating an English for specific purposes test
for Civil Engineering students at
Haiphong Public University)
M.A. minor thesis
Field: methodology
Code: 50702
Course: k11
By: Nguyen Thi Phuong Thu
Supervisor: Tran Hoai Phuong, MEd.
Hanoi - August 2005
Acknowledgements
While pursuing my studies and conducting this research, I was honored to receive
guidance, assistance, and encouragement from many lecturers and supervisors. Among
them, I would like to express my sincere thanks to the leaders of the
College of Foreign Languages, who gave me permission and created favorable conditions
for study and research.
I would also like to thank my supervisor, Mrs. Tran Hoai Phuong, MEd, who was very
understanding and gave me great help as well as invaluable guidance and
encouragement from the very start to the end of my research.
It is also my pleasure to give my special thanks to the students of classes XD 501, XD
18. ve very easy
19. e easy
20. d difficult
21. vd very difficult
22. D Item discrimination
23. CU The number of correct answers in the upper half
24. CL The number of correct answers in the lower half
25. gd good discrimination
26. md bad discrimination
27. bi bad item
28. p Spearman rho correlation coefficient
29. SU Score on the upper half
30. SL Score on the lower half
Table of contents
Acknowledgement
List of abbreviations
Part I: Introduction
1.Rationale
2.Aims of the study
3.Scope of the study
4.Methods of the study
5.Design of the study
Part II: Development
Chapter one: Literature review
1.1.Language testing
1.2.Communicative language tests
1.3.Testing reading skills
2.4.Methods of data collection and data analysis
2.5.Limitations of the research
Summary
Chapter three: Discussion
3.1.The content area of the test
3.2.The relative weights of the different parts of the test
3.3.Constructing the test
3.4.Administering the test
3.5.Marking the test
3.6.Test scores interpreting and evaluation
3.6.1.The frequency distribution
3.6.2.The central tendency
3.6.2.1.The mode
3.6.2.2.The median
3.6.2.3.The mean
3.6.3.The dispersion
3.6.3.1.The low-high
3.6.3.2.The range
3.6.3.3.The standard deviation
3.7.Test item analysis and evaluation
3.7.1.Item difficulty
3.7.2.Item discrimination
3.8.Estimating reliability
Summary
Part III: Conclusion and recommendations
References
Appendices
opportunity to undertake the study entitled “Designing a reading test for the non-majors of
Civil Engineering at Haiphong Public University” with a view to evaluating the students’
reading ability after one term’s study last school year (2004-2005) as well as to gaining some
knowledge and experience of foreign language testing for herself after completing the study.
2.Aims of the study
The minor thesis is aimed at designing an achievement test of ESP reading to be
administered to a class of Civil Engineering English at HPU. The test was treated as
a final examination. The results of the test were then analysed, evaluated, and interpreted.
The test takers are non-English majors.
The specific aims of the research are:
- to assess the learners’ achievement in improving their reading skill in English for Civil
Engineering after a 120-period reading course;
- to measure their aptitude for the reading skill;
- to diagnose their strengths and weaknesses in reading the subject matter;
- to find out whether or not the test satisfies the qualities of a good test. On that basis,
the test will also measure the effectiveness of the teacher’s teaching. If the test is not a
good one, some suggestions will be made for a better test form.
3.Scope of the study
“Not all language tests are of the same kinds. They differ with respect to how they are
designed, and what they are for; in other words, in respect to test method and test purpose.”
(McNamara, 2000: 5). For example, in terms of method, there are paper-and-pencil language
tests, performance tests, etc. In terms of purpose, there are achievement tests, proficiency
tests, and so on. In fact, the same form of test may be used for different purposes, although in
other cases the purpose may affect the form.
Due to the limitation of time and ability, it is impossible for the author to design tests
of all these types or of all the four language skills (speaking, writing, listening and reading).
Therefore, this minor thesis is limited to designing and evaluating an achievement test of ESP
of language testing. Section 1.2 introduces communicative language tests.
Testing reading skills is discussed in section 1.3, which is followed by section 1.4
on the major characteristics of a good test. The final area is a brief review of
achievement tests, presented in section 1.5.
1.1.Language testing
An understanding of language testing is relevant both to those who are actually
involved in creating language tests, and also to those who are involved in using tests or the
information tests provide in practical research contexts. For this very reason, this section
wishes to take a close look at what a language test is.
Most researchers agree that language tests play many important roles in life. Firstly, the
moment one takes a test can be an important transitional moment in one’s life: for
example, a pupil wishing to enter a university has to pass the entrance tests, a job seeker has
to take a test so that the employer will know whether he or she is competent, and anybody who
needs to drive a motorbike or a car has to pass a driving test. Secondly, language
tests are important to many occupations. We teachers rarely teach without testing our
students’ performance in the subjects. Tests help us to place students appropriately; therefore,
language tests, if used properly, can be a valuable teaching device for any teacher,
and they contribute positively to the development of both teachers and learners. Last but
not least, any researcher who needs to measure the language proficiency of the subjects
cannot do so without using an existing test or designing his or her own.
According to Carroll (1968), a test in general tells us something about a testee’s
characteristics. From the results of a test, a teacher can judge whether
a student is good or bad at the subject tested. Carroll provides the following definition of a
test: “a psychological or educational test is a procedure designed to elicit certain behavior
from which one can make inferences about certain characteristics of an individual.” (Carroll,
1968: 46)
According to Hughes (1989: 9), tests can be classified as follows:
Proficiency tests
Achievement tests
• Class progress tests
1.2.Communicative language tests
There is one thing that is essential to the activities of designing a test and interpreting
the meaning of test scores. It is the view of language and language use embodied in the test.
The term ‘test construct’ refers to those aspects of knowledge or skill possessed by the
candidate which are being measured. To define the test construct, it is important to be clear about
what knowledge of language consists of and how that knowledge is used in actual performance
(i.e. language use). It is also essential to understand what view the test takes of language use,
because if the view the test takes is different, then the test will be different. As a result, the
reporting of scores will be different, and the test performance will be interpreted differently.
Therefore, the difference of format between tests is not just incidental; it implies a difference
between views of language and language use. Accordingly, communicative language tests are
different from other types of tests such as discrete point test or integrative and pragmatic tests
in the following aspects:
According to McNamara (2000: 17), a discrete point test focuses on students’ knowledge
of the grammatical system, of vocabulary and of aspects of pronunciation, and tends to test these
aspects of knowledge in isolation. Multiple choice questions are most suitable for this type
of test. The discrete point tradition of testing is seen as focusing too much on knowledge of
the formal linguistic system for its own sake rather than on the way the knowledge is used to
achieve communication.
Also according to McNamara, integrative tests represent a newer orientation, in which
knowledge of relevant systemic features of language (pronunciation, grammar, vocabulary)
is integrated with an understanding of context. Yet these tests are regarded as time-consuming
and difficult to score. An oral interview, for example, involves comprehension of
extended discourse (both spoken and written), so besides the disadvantages
mentioned above it also requires trained raters.
Because of those disadvantages, another type of test, the pragmatic test, replaced the older
ones. It focuses less on knowledge of language and more on the psycholinguistic processing
involved in language use. For this type, the cloze test was seen as the most suitable format and was once
believed to be easy to construct and relatively easy to score. However, it soon turned out to be
1.3.2. Short answer questions
In the test there are questions which require the candidates to write down specific
answers in spaces provided on the question paper.
1.3.3. Cloze
This type is also familiar to students. In the cloze procedure, words are deleted from
a text after a few sentences of introduction have been left intact. The deletion rate is set
mechanically, usually at every fifth to every eleventh word, because deleting too many or too
few words can cause problems with test validity. Candidates have to fill each gap by supplying
the word they think has been deleted.
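The mechanical deletion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build the test in this study: the function name `make_cloze`, the default rate of every seventh word, the two-sentence introduction and the naive sentence splitting are all assumptions made for the example.

```python
import re


def make_cloze(text, n=7, intro_sentences=2, blank="_____"):
    """Blank out every n-th word after an untouched introduction.

    Returns the gapped text and the list of deleted words (the answer key).
    All parameter defaults are illustrative assumptions.
    """
    # Naive sentence split: ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    intro = sentences[:intro_sentences]
    words = " ".join(sentences[intro_sentences:]).split()
    answers = []
    for i in range(n - 1, len(words), n):
        core = words[i].strip(".,;:!?")  # keep punctuation outside the gap
        if not core:
            continue
        answers.append(core)
        words[i] = words[i].replace(core, blank, 1)
    return " ".join(intro + words), answers


if __name__ == "__main__":
    passage = (
        "Concrete is a building material. It is used all over the world. "
        "It is made by mixing cement, sand, gravel and water in the "
        "correct proportions before it is poured into a mould."
    )
    gapped, key = make_cloze(passage, n=5)
    print(gapped)
    print(key)
```

The answer key returned alongside the gapped text would allow exact-word marking, although other scoring schemes are possible.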
1.3.4. Selective deletion gap filling
In this technique, items are selected for deletion on the basis of what is known about
language, about difficulty in text and about the way language works in a particular text.
1.3.5. C-Tests
In a C-test, every second word in a text is partially deleted. In an attempt to ensure
solutions, students are given the first half of the deleted word. The examinee completes the
word on the test paper and an exact word scoring procedure is adopted.
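As a rough illustration of the C-test format, the sketch below mutilates every second word and marks responses by exact match. The names `make_c_test` and `score_exact` are hypothetical, and two details are assumptions not stated in the text: for words with an odd number of letters the shorter first part is kept, and one-letter words are left intact.

```python
def make_c_test(text, gap_char="_"):
    """Delete the second half of every second word.

    Returns the mutilated text and the list of full words (the answer key).
    Assumptions: odd-length words keep the shorter first part, and
    one-letter words are left untouched.
    """
    words = text.split()
    answers = []
    for i, word in enumerate(words):
        core = word.rstrip(".,;:!?")  # trailing punctuation stays visible
        if i % 2 == 0 or len(core) < 2:
            continue  # keep the first, third, ... words intact
        keep = len(core) // 2
        answers.append(core)
        words[i] = core[:keep] + gap_char * (len(core) - keep) + word[len(core):]
    return " ".join(words), answers


def score_exact(responses, answers):
    """Exact word scoring: one point per response identical to the key."""
    return sum(r == a for r, a in zip(responses, answers))


if __name__ == "__main__":
    gapped, key = make_c_test("The concrete beam carries loads.")
    print(gapped)
    print(score_exact(["concrete", "carry"], key))
```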
1.3.6. Cloze elide
In a cloze elide test, words that do not belong to the original text are inserted into a
reading passage and candidates have to indicate where these insertions have been made.
1.3.7. Information transfer
This is a task where the information transmitted verbally is transferred to a non-verbal
form, e.g. by labeling a diagram, completing a chart or numbering a sequence of events. This
type of test is an objective method for testing the test takers’ understanding of the texts.
1.3.8. Jumbled sentences
This type of test is intended to test the student’s understanding of a sequence of stages
in a process or events in a narrative. A successful student is one who can put the jumbled
sentences of a story back into the correct order.
measurements: if a student takes a test at the beginning of the course and again at the end, any
improvement in his score should be the result of differences in his skills and not of inaccuracies
in the test. In the same way, it is important that the student’s score should be the same (or as
nearly the same as possible) whether he takes one version of the test or another and whether
one person marks the test or another. Reliability also means ‘the consistency with which a test
measures the same thing all the time’ (Harrison, 1987). This can be presented in the figure
below:
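Figure 1: Reliability (panel labels: ‘Scores on test tasks with characteristics A’ and ‘Scores on test tasks with characteristics A’’)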
There are therefore three aspects to reliability: the circumstances in which the test is taken,
the way in which it is marked, and the uniformity of the assessment it makes.
According to Hughes (1989), there are two components of test reliability: the performance of
candidates from occasion to occasion and the reliability of the scoring. To make
tests more reliable, Hughes (1989) gives a long list of clear instructions for what we should
do:
- take enough samples of behavior,
- do not allow candidates too much freedom in choosing what and how to answer,
- write unambiguous items,
- provide clear and explicit instructions,
- ensure that tests are well laid out and perfectly legible,
- make sure candidates are familiar with format and testing techniques,
- provide uniform and non-distracting conditions of administration,
- use items that permit scoring which is as objective as possible,
skills or the structures. Such a specification should be made at a very early stage in test
construction.
According to Weir (1990: 24), the more a test simulates the dimensions of observable
performance and accords with what is known about that performance, the more likely it is to
have content validity and construct validity. Thus, for Kelly (1978: 8) content validity seems
‘an almost completely overlapping concept’ with construct validity, and for Moller (1982: 68)
‘the distinction between construct and content validity in language testing is not always very
marked, particularly for tests of general language proficiency.’ Slightly differently from other
researchers, Anastasi (1982: 131) defined content validity as ‘essentially the systematic
examination of the test content to determine whether it covers a representative sample of the
behavior domain to be measured.’
Thus content validity has been defined differently, but most researchers
agree that it is highly important for the following two reasons. First, the greater a
test’s content validity is, the more likely it is to be an accurate measure of what it is supposed
to measure. A test in which major areas identified in the specification are under-represented or
not represented at all is unlikely to be accurate. Secondly, such a test is likely to have a harmful
backwash effect: areas which are not tested are likely to become areas ignored in teaching and
learning.
1.4.2.2. Face validity
A test is said to have face validity if it looks as if it measures what it is supposed to
measure. Face validity is hardly a scientific concept, yet it is very important. A test which does
not have face validity may not be accepted by candidates, teachers, education authorities or
employers.
1.4.2.3. Criterion-related validity
There are essentially two kinds of criterion-related validity: concurrent validity and
predictive validity. According to Viete (1992), concurrent validity is used to refer to the
relationship between the test results and the results of another assessment (using an
appropriate, reliable and validated assessment procedure) which was made at approximately
available for these activities’. (Bachman & Palmer, 1996: 35). This relationship can be
represented as in the figure below:
\[ \text{Practicality} = \frac{\text{Available resources}}{\text{Required resources}} \]
When practicality ≥ 1, the test development and use is practical.
When practicality < 1, the test development and use is not practical.
In a nutshell, when designing a test the tester should always bear in mind this quality,
practicality, to ensure that the test is as economical as possible, both in time (preparation,
sitting and marking) and in cost (materials and hidden costs of time spent). In other words, a
practical test is one that stays within the available resources, i.e., the required
resources must not be more than the available resources.
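The practicality ratio can be illustrated with a small calculation. The sketch below is only an example: the function names are hypothetical, and the resource figures (hours of staff time) are invented for illustration rather than taken from this study.

```python
def practicality(available, required):
    """Ratio of available to required resources (same unit for both)."""
    if required <= 0:
        raise ValueError("required resources must be positive")
    return available / required


def is_practical(available, required):
    """A test is practical when the ratio is at least 1."""
    return practicality(available, required) >= 1


if __name__ == "__main__":
    # Hypothetical figures: 40 staff hours available, 50 hours required.
    print(practicality(40, 50), is_practical(40, 50))
    # Hypothetical figures: 60 hours available, 60 hours required.
    print(practicality(60, 60), is_practical(60, 60))
```

In practice several kinds of resource (time, materials, personnel) would each need to be checked, since a surplus of one does not necessarily offset a shortage of another.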
1.4.4. Discrimination
Finally, a discussion of the basic concepts behind testing would be incomplete without
the treatment of the closely related idea of discrimination. According to Harrison (1994:14)
discrimination is ‘the extent to which a test separates the students from each other.’ However,
the extent of discrimination required varies with the kind of test. For instance, an achievement
test should result in a wide range of scores, because this makes it easier to decide where
to separate one group of students from another so that they can be awarded different grades. A
diagnostic test, however, may be intended to show that nearly all students have learnt the
material tested, and in this case they should all get fairly high scores.
1.5. Achievement tests
Different researchers have different points of view of an achievement test. According to
Harrison (1983: 65) ‘designing and setting an achievement test is a bigger and more formal
operation than the equivalent work for a diagnostic test, because the student's result is treated
as a qualification which has a particular value in relation to the results of other students. An
achievement test involves more detailed preparation and covers a wide range of material, of
Summary
In this chapter I have briefly dealt with the concept of a language test, how it is defined
and what is important in designing it. I also discussed the concept of
communicative language ability, including communicative competence.
In addition, the definition of an achievement test and the testing of reading skills were
presented because they play an important role in this research.
Chapter two: Methodology
This chapter will include a brief introduction of a quantitative study, the selection of
participants who took part in doing the test, and the materials from which the test items
were taken. The methods of data collection and data analysis are presented afterwards.
Finally come the limitations of the research.
2.1.A quantitative study
Like qualitative research, quantitative research comes in many approaches including
descriptive, correlational, exploratory, quasi-experimental, and true-experimental
techniques.
As a teacher of Civil Engineering English, I designed this reading test to understand better
how things are really operating in my own classroom and to describe the
performance of my learners in the reading skill. After a 120-period reading course, 50
students were chosen from three different classes (XD 501, XD 502, XD 503) to do a
reading test in the given time (60 minutes). The results collected from the test
papers were then described using descriptive statistics.
The correlational research technique was also used to find the reliability
coefficient later in the study.
2.2.The selection of Participants
The students at Haiphong Public University mainly come from different towns
and cities in the North of Vietnam. They are generally aged between 18 and 22, or older.
At the university, they study for eight terms in four years. There students are classified
to certain conclusions. Secondly, instead of designing different types of test, the author was
able to make only one type, namely an achievement test measuring the progress her students
had made in reading skills after taking the course of English for Civil
Engineering in their last term of the 2004-2005 school year. From the results, the author could also
measure the effectiveness of her teaching.
Summary
This chapter gives a brief account of a quantitative study, in which the author used
descriptive statistics and the correlational technique to analyse the data. The selection
of participants and materials has also been dealt with. The methods of data collection
and data analysis were briefly introduced, followed by the limitations
of the research.
Chapter three: Discussion
This chapter first discusses the content area of the test, how the test was
divided, and how it was constructed and marked. Afterwards, the overall test results and
each test item are analysed and interpreted. Finally, the author
evaluates the test based on the four criteria of a good test mentioned in the
previous chapter.
3.1. The content area of the test
The following topic checklist of the course book will help to point out the content area of the
reading test.
The Topic checklist of the course book
Topic Material Number of unit/ page
Architectural composition
Skeleton construction
Concrete, reinforced
concrete, prestressed
concrete
Unit 12-p.29
Unit 13-p.32
Unit 14-p.35
Unit 15-p.35
Unit 16-p.42
The hinge Unit 17-p.46
3.2.The relative weights of the different parts of the test
The test is composed of 5 parts, and the weighting of each part is illustrated in the following
table:
Test of reading
Part  Input                             Response / Item type          Scores  Weighting
1     Factual text, approx. 120 words   5 comprehension questions     10      20%
2     5 word columns                    Matching to make 5 sentences  10      20%
3     10 jumbled sentences              Rearranging                   10      20%
4     10 statements                     True / False                  10      20%
blanks with the given words.
FORMAT AND TIMING
Scanning: 1 passage of about 120 words.
5 short answer items, in the order in which the relevant information
appears in the text. Responses were controlled.
Time: 10 minutes.
Detailed reading
-5 columns of words. Responses were controlled.
Time: 10 minutes.
-10 jumbled sentences to be rearranged. Responses were controlled.
Time: 20 minutes.
-10 statements to be marked T or F. Responses were controlled.
Time: 10 minutes.
-1 passage of about 120 words.
5 gaps to be filled. Responses were expected.
Time: 10 minutes.
CRITICAL LEVELS OF PERFORMANCE
All test items were written such that any student completing the course successfully would be
able to respond correctly to all of them. Allowing for ‘performance errors’ on the part of
candidates, a critical level of 80 percent was set. The students reaching this level would be the
ones succeeding in terms of the course’s objectives.
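The 80 percent critical level can be checked with simple arithmetic. The sketch below is illustrative only: the names `PART_MAX`, `total_score` and `reaches_critical_level` are hypothetical, and the maximum of 10 points for Part 5 is an assumption based on the statement that the test has five parts with equal weighting.

```python
# Maximum score per part. Parts 1-4 follow the weighting table;
# the value for part 5 is an assumption (equal 20% weighting).
PART_MAX = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10}
CRITICAL_PERCENT = 80  # critical level of performance


def total_score(part_scores):
    """Sum a candidate's part scores after checking they are in range."""
    for part, score in part_scores.items():
        if part not in PART_MAX or not 0 <= score <= PART_MAX[part]:
            raise ValueError(f"invalid score {score!r} for part {part!r}")
    return sum(part_scores.get(part, 0) for part in PART_MAX)


def reaches_critical_level(part_scores):
    """True if the total is at least 80 percent of the maximum total."""
    maximum = sum(PART_MAX.values())
    # Integer comparison avoids floating-point rounding at the boundary.
    return total_score(part_scores) * 100 >= CRITICAL_PERCENT * maximum


if __name__ == "__main__":
    candidate = {1: 8, 2: 9, 3: 7, 4: 10, 5: 6}  # hypothetical scores
    print(total_score(candidate), reaches_critical_level(candidate))
```

Under these assumptions the maximum total is 50, so a candidate needs at least 40 points to reach the critical level.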
SCORING PROCEDURES
There was a detailed key and the scoring was completely objective.
SAMPLING
The texts were chosen from a variety of topics in the course book. Draft items were written
before the test was officially used.
ITEM WRITING AND MODERATION
All the items in the test were based on a consideration of what a competent non-major would
be able to obtain from the texts. Considerable time was set aside for moderation and rewriting
of items.