Vietnam National University
Hanoi University of Languages and International Studies
Post-Graduate Department
Hoàng Hồng Trang
A STUDY ON VALIDITY OF 45-MINUTE TESTS FOR
THE 11TH GRADE
NGHIÊN CỨU TÍNH GIÁ TRỊ CỦA BÀI KIỂM TRA 45
PHÚT TIẾNG ANH LỚP 11
M.A. COMBINED PROGRAMME THESIS
Major: Methodology
Major code: 60.14.10
HANOI - 2009
TABLE OF CONTENTS
DECLARATION
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
INTRODUCTION
1. Rationale for the study
2. Significance of the study
3. Aims of the study
4. Scope of the study
5. Research questions
6. Organization of the study
CHAPTER 1: LITERATURE REVIEW
1.1. LANGUAGE TESTING AS PART OF APPLIED LINGUISTICS
1.1.1. Language testing – a brief history and its characteristics
1.1.2. Purposes of language testing
1.1.3. Validity in language testing
1.1.3.1. Definition and types of validity
4.1. Phonetics section in 45-minute tests
4.1.1. Data concerning construct validity
4.1.2. Data concerning content validity
4.2. Grammar section in 45-minute tests
4.2.1. Data concerning construct validity
4.2.2. Data concerning content validity
4.3. Vocabulary section in 45-minute tests
4.3.1. Data concerning construct validity
4.3.2. Data concerning content validity
CONCLUSION
1. DISCUSSION OF FINDINGS AND RECOMMENDATIONS
1.1. On pronunciation testing
1.2. On grammar testing
1.3. On vocabulary testing
2. CONCLUSION
REFERENCES
APPENDICES
Copies of test papers collected
LIST OF ABBREVIATIONS
MCQ Multiple-choice question
GF Gap-filling
ER Error recognition
ST Sentence transformation
SB Sentence building
C.V. Construct Validity
V. Validity
Table 25: Content validity of vocabulary test items of Group 3 tests
Table 26: Content validity of vocabulary test items of Group 4 tests
INTRODUCTION
1. Rationale for the study
Language testing, a branch of applied linguistics, has developed robustly over the last
forty (nearly fifty) years in terms of professionalization, internationalization,
cooperation and collaboration (Stansfield, 2008, p. 319). In the course of this
development, validity, together with fairness, has become a matter of increasing concern,
and it is predicted that research into validity will form “the prominent paradigm for
language testing in the next 20 years” (Bachman, 2000, p. 25).
In discussions of validity, much has been said about the validation of standardised tests,
especially large-scale EFL tests such as TOEFL, IELTS and TOEIC (Stoynoff, 2009;
Bachman et al., 1995, cited in Stansfield, 2008), since decisions based on the scores of
these tests are usually considered of prime importance to test takers' careers and lives.
Teacher-produced tests, by contrast, receive much less attention.
Studies have shown that designing a good test is a “demanding” task for teachers
(Davidson and Lynch, 2002, p. 65, cited in Coniam, 2009, p. 227), both because in a
language test “language is both the instrument and the object of measurement” (Bachman,
1990), which makes the careful choice of linguistic elements difficult, and because
teachers lack time and resources (Popham, 1990, p. 200, cited in Coniam, 2009, p. 227).
Also, teachers are “unlikely to be skilled in test construction techniques” (Popham,
2001, p. 26, cited in Coniam, 2009, p. 227). This explains why the test item quality
of teacher-produced tests is often lower than that of standardised tests in terms of reliability
(Cunningham, 1998, p. 171, cited in Coniam, 2009, p. 227), and this leads to the low
On a narrower scale, the results of the quality assessment of school tests will assist in
improving test item quality, creating more reliable and valid tests.
3. Aims of the study
Within the small scope of an MA thesis, this study aims only at investigating two
aspects of the validity of a common type of English test used in schools in Vietnam. In
particular, this research investigates the content and construct validity of the language
components of English forty-five-minute tests used for the 11th grade in some high schools
in northern Vietnam.
4. Scope of the study
Due to time and financial constraints, the study could only focus on forty-five-minute
tests for the 11th grade, collected from ten high schools in five provinces in the
north of Vietnam. No other types of tests or other grades were investigated. The language
used in those tests is English, so all the findings and discussions are restricted to the
English language only. However, the suggestions may also be useful for the teaching of
other foreign languages. Furthermore, the scope of an MA thesis could only allow for an
investigation into two types of validity, that is, content and construct validity, and the
area chosen for investigation is the language components in the tests collected.
5. Research questions
In short, this research aims at answering the following questions:
1.1. How valid is the construct of the language components in 45-minute English tests for the 11th grade?
1.2. How valid is the content of the language components in 45-minute English tests for the 11th grade?
presence, they claim their tests to be reliable and valid. This stage corresponds to the first
approach to language testing: the essay-translation approach, in which “subjective
judgement of the teacher” is of utmost importance, rather than “skill or expertise” in
testing (Heaton, 1988). Popular components of a language test in this stage are essay
writing, translation, and grammatical analysis (Heaton, 1988).
The second period saw the dominance of structural linguistics, which explains why test
items in this stage were designed to test discrete language elements (such as sounds,
words, and structures) in isolation from context (Stansfield, 2008, p. 312). This came to
be known as discrete point testing, and is referred to as the structuralist approach to
language testing. In judging the quality of a language test, this approach placed its
emphasis on reliability and objectivity (Heaton, 1988, p. 16).
The third period, the integrative-sociolinguistic stage, saw language testing take on a
more scientific character than in the previous stages, as statistics started to be used in
the examination of tests. John Oller, an outstanding author of this period, proclaimed
that there was “a general factor” constituting language proficiency, which he called “a
grammar of expectancies” and which could be “directly tested through the cloze test”
(Oller, 1972; 1973; 1975; cited in Stansfield, 2008). Cloze tests and dictation, together
with oral interviews, translation and essay writing, were present in most integrative
tests, and this came to be called the integrative approach to language testing.
It can be understood from Stansfield’s (2008) review that a fourth stage should be added
to Spolsky’s summary of the history of language testing. This stage is characterised by
the communicative approach, in which how language is used in communication is the primary
concern (Heaton, 1988). Therefore, instead of testing the four skills separately as the
structuralist approach does, which does not reflect real-life language use, the
communicative approach advocates integrative assessment and the authenticity of language
tasks and materials. The context of language use is also a matter of great concern. In
addition, this stage witnessed a shift of concern from reliability to validity (Stansfield,
2008, p. 318), which, according to Stansfield, “brought US and European testing specialists much closer
are moving forward, thus encouraging them to continue making efforts in their language
study.
3. Finding out about learning difficulties: This particular job is often handled by
diagnostic tests, in which items are carefully designed so that students’ strengths and
weaknesses are clearly reflected.
4. Finding out about achievement: Achievement tests are somewhat like progress
tests but they cover a longer period of time and are often conducted at the end of the
semester, school year or language course to make educational decisions, for example,
promoting students to a higher level.
5. Placing students: Tests are sometimes also given to categorize students into
different groups based on their ability. Language tests are often divided into several levels
of language proficiency such as KET, PET, FCE, CAE, CPE (as in the Cambridge
rankings), or A-, B-, C- level in the Vietnamese language education system, and so on.
6. Selecting students: After the purpose of finding out about students’ ability,
strengths and weaknesses comes the task of selecting students for a job or a course.
Categorizing students is inevitably one part of identifying and selecting them.
7. Finding out about proficiency: This purpose of language tests relates closely to
two other purposes mentioned above, that is, placing and selecting. Actually, finding out
about students’ language proficiency is just one step towards making decisions concerning
students’ future education or future life (migration, for example). If language tests serving
other purposes tend to look back at what students have learnt, proficiency tests look
forward, anticipating what students will have to do or be able to do in the future.
Other purposes may include “program evaluation”, “providing research criteria”, or
“assessment of attitudes and sociopsychological differences” (Henning, 1987).
1.1.3. Validity in language testing
1.1.3.1. Definition and types of validity
Validity refers to “the appropriateness of a given test or any of its component parts
as a measure of what it is purported to measure” (Henning, 1987). Validity is “the most
other words, content relevance and content coverage (Bachman, 1990, p. 244).
With regard to content relevance, Messick (1980, p. 1017, cited in Bachman,
1990, p. 244) suggested that the investigation of content relevance requires “the
specification of the behavioral domain in question and the attendant specification of the
task or test domain”. This implies that content validity concerns not only the content of
the test but also the setting in which the test is given, or the measurement
procedure. Popham (1978, cited in Bachman, 1990, p. 245) specifies the elements in test
design: “what it is that the test measures”, “the attributes of the stimuli that will be
presented to the test taker”, and “the nature of the responses that the test taker is expected
to make”. Hambleton (1984) relates these three elements to content validity (cited in Bachman, 1990).
Concerning content coverage, test developers need to closely analyse the language
tested and the course objectives (Heaton, 1998) so that there is always an apparent
correspondence between the two. This is especially true of achievement tests. Matters are
less straightforward for proficiency tests, because test designers in this context have to
rely on their knowledge, experience and research results to decide which content
to choose.
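
The notion of content coverage can be illustrated with a minimal sketch. It assumes, hypothetically, that both the course objectives and the test items have been labelled with short tags; the tags, the sample data and the function name `content_coverage` are invented for illustration and are not taken from the study.

```python
def content_coverage(objectives, item_tags):
    """Return the share of course objectives addressed by at least one test item."""
    objectives = set(objectives)
    if not objectives:
        raise ValueError("at least one course objective is required")
    # An objective counts as covered when some item carries the same tag.
    tested = objectives & set(item_tags)
    return len(tested) / len(objectives)


# Hypothetical labels, invented for illustration only.
course_objectives = ["passive voice", "reported speech", "relative clauses", "conditionals"]
test_item_tags = ["passive voice", "passive voice", "conditionals", "articles"]

coverage = content_coverage(course_objectives, test_item_tags)
print(f"{coverage:.0%} of the objectives are covered")
```

A figure of this kind only indicates breadth of coverage. Whether each item is relevant to its objective still requires the qualitative judgement discussed above.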
Content validity is one component of qualitative validity as mentioned above, and it
plays a central role in developing language tests for specific purposes, for which content
relevance is a matter of primary concern. Usually a test is selective in content, and the
method of content selection should be considered carefully.
1.1.3.3. Construct validity
A construct, or psychological construct, is “an attribute, proficiency, ability or skill
that happens in the human brain and is defined by established theories” (Brown, 2000, p. 9).
While content validity mostly discusses the relationship between test content and
course objectives (in achievement tests) or test content and what examinees are supposed
to be able to do with language in non-test contexts (in proficiency tests), construct validity
is concerned with the relationship between “performance on tests” and “a theory of
abilities, or constructs” (Bachman, 1990, p. 255). And a test which shows considerable
language ability (which may include “language competence”, “strategic competence”, and
“psychophysiological competence” according to the communicative approach to language
testing (Weir, 1990)). Bachman (1990) mentioned five features by which language tests can
be categorized, each of which yields different test types. According to purpose or use,
there are selection, entrance, and readiness tests (related to admission decisions);
placement and diagnostic tests (regarding specific areas which need instruction); and
progress, achievement, attainment, or mastery tests (in terms of how well students achieve
the objectives of the study program, or how students should “proceed with the program”).
With regard to content, there are theory-based tests such as proficiency tests and
syllabus-based tests such as achievement tests. Regarding frame of reference, there are
norm-referenced and criterion-referenced tests; regarding scoring procedure, subjective
and objective tests; and regarding testing method, multiple-choice, completion, dictation,
cloze tests, and so on. Also on the basis of testing method, McNamara divides tests into
paper-and-pencil and performance tests.
Generally, according to Heaton (1998), most testing specialists divide tests into
achievement/attainment, proficiency, aptitude and diagnostic tests.
1.2.2. Class progress tests as a type of achievement test
According to Henning (1987), achievement tests “are used to measure the extent of
learning in a prescribed content domain, often in accordance with explicitly stated
objectives of a learning program”. While proficiency tests are knowledge-based,
achievement tests are syllabus-based; a test that is not based on a specific syllabus is
therefore no longer an achievement test. Syllabus content and objectives are the first and
foremost criteria on which achievement tests are based and assessed.
Class progress tests, often referred to as progress achievement tests, are one subtype of
achievement tests, the other being final achievement tests. They are also the most popular
test type, commonly designed by teachers in and for a specific situation (Heaton, 1998). In
order to design a class progress test, a teacher often has to draw on his/her knowledge of
students’ ability, objectives of the program he/she is teaching, content of the specific part
3. are as economical of time and effort as possible;
4. will have a beneficial backwash effect.
Regarding categorization, common testing techniques may be divided in terms of
the language areas or skills they are applied to, for example, techniques to test grammar,
vocabulary, reading, listening, writing, and speaking. Besides, we also have objective and
subjective testing techniques according to whether the test items will be graded objectively
or subjectively.
To serve the objectives of this study, this section will first discuss the differences
between objective and subjective testing. Then common types of objective and subjective
testing techniques will be presented.
To begin with, subjective and objective here refer to the scoring of tests, not the
construction of tests or performance on tests. Every stage in devising a test requires
teachers/test designers to make subjective judgements on selecting what to test and how to
test. As for students, they also have to carry out subjective judgements when doing the
tests. The only thing objective here is how teachers/markers grade the tests. If a test
is scored the same no matter who grades it, it is objective. Otherwise, it is
subjective.
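
The distinction can be expressed as a simple check on marker agreement. The sketch below is illustrative only: the marker names, the scores and the function name `is_objectively_scored` are hypothetical, and agreement on a single script is merely an indication of objective scoring, not proof of it.

```python
def is_objectively_scored(scores_by_marker):
    """Return True when every marker awarded the same score to the same script."""
    # A set keeps one copy of each distinct score; one distinct value means agreement.
    return len(set(scores_by_marker.values())) <= 1


# Hypothetical scores given by three markers to one student's script.
multiple_choice_section = {"marker_a": 8, "marker_b": 8, "marker_c": 8}
essay_section = {"marker_a": 6.5, "marker_b": 7.0, "marker_c": 5.5}

print(is_objectively_scored(multiple_choice_section))
print(is_objectively_scored(essay_section))
```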
Objective testing can be applied to any skill or element; however, it is used far
more effectively for some skills than for others. Grammar, phonology, reading, vocabulary,
or listening, for example, often lend themselves to objective testing. However, writing and
speaking can only be satisfactorily tested via subjective testing methods (Heaton, 1998).
This explains why we come across multiple-choice grammar, vocabulary,
reading and listening items far more frequently than writing ones.
However, objective testing is often criticised on the grounds that it does not allow real
communicative ability to be tested. Instead, students are tested on their ability to
manipulate language in ways that do not occur in everyday language use. In addition,
objective testing leaves room for wild guessing and chance. Even
though most students base their guesses on partial knowledge (Heaton, 1998, p. 27), it is
Gap-filling:
Gap-filling is “the test in which the candidate is given a short passage in which
some words or phrases have been deleted. The candidate’s task is to restore the missing
words” (Alderson, Clapham and Wall, 1995). Gap-filling is in fact a modified form of the
cloze test that avoids the cloze test’s weakness. Weir (1990) named it “selective
deletion gap-filling”. Gap-filling has been very useful in testing grammar, reading
comprehension, or vocabulary, since test writers are able to focus on the items that are
considered important by selecting them to be deleted. The difficulty in using this testing
technique is to ensure that students are led to write the expected words in the gaps. It
would be ideal if there were only one correct answer for each gap; however, this is
difficult to achieve. Therefore, in order to achieve marking reliability, it is essential
that the number of alternative answers be reduced to the minimum and that no possible
answer be left out of the answer key.
A banked gap-filling task can be the solution to this (Alderson, Clapham and Wall,
1995). In a banked gap-filling task, missing words and phrases are provided, together with
some distracting words, which means that there are more words/phrases than necessary.
The students’ task is simply to select the correct word for each gap.
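
The construction of a banked gap-filling task can be sketched as follows. This is a minimal illustration under stated assumptions: the passage, target words, distractors and the function name `make_banked_gap_fill` are all hypothetical, and only the first occurrence of each target word is deleted.

```python
import random
import re


def make_banked_gap_fill(passage, targets, distractors, seed=0):
    """Delete the chosen target words and return the gapped passage plus a word bank."""
    gapped = passage
    for number, word in enumerate(targets, start=1):
        # Match the target as a whole word so that "as" does not match inside "has".
        pattern = r"\b" + re.escape(word) + r"\b"
        gapped, count = re.subn(pattern, f"({number}) ______", gapped, count=1)
        if count == 0:
            raise ValueError(f"target word not found: {word!r}")
    # The bank holds more words than there are gaps: the answers plus the distractors.
    bank = list(targets) + list(distractors)
    random.Random(seed).shuffle(bank)
    return gapped, bank


# Hypothetical test material, invented for illustration only.
passage = "She has lived in Hanoi since 2005 and works as a teacher."
gapped_passage, word_bank = make_banked_gap_fill(
    passage, targets=["since", "as"], distractors=["for", "like"]
)
print(gapped_passage)
print(", ".join(word_bank))
```

Because the test writer chooses the deleted words, the answer key is simply the list of targets in order, which is what keeps marking reliable.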
According to Weir (1990), this technique “restricts to sampling a much more
limited range of enabling skills than do the short answer and multiple-choice formats”.
Sometimes the deleted word does not affect the sentence at all, that is, the sentence
is equally good with or without the deleted word. Such cases should be avoided because
they confuse students.
Sentence transformation items:
This type of item is very useful for testing ability to produce structures, so it can
test grammatical production. It is the objective item type which “comes closest to
measuring some of the skills tested in composition writing”, although transforming
sentences and producing sentences are not alike.
There are two common types of sentence transformation. In the first type, there is
2.1. TYPE OF RESEARCH: A QUALITATIVE RESEARCH
This research is conducted qualitatively in the sense that it does not aim at testing
hypotheses or generalizing, but is rather “exploratory” and “discovery-oriented” (Nunan,
1992), as qualitative research “is not set out to test hypothesis” (Larsen, 1999).
Burnes (1999) defines qualitative research as research conducted “to draw
conclusions from the data collected to make sense of how human behaviours, situations
and experiences construct realities”. When one carries out qualitative research, one wants
to find out what is going on “from the actor’s own frame of reference” (Nunan, 1992), that
is, from the point of view of those being investigated. Moreover, qualitative researchers
view each individual as a unique entity, so they see no point in generalization: no theory
fits all and is true of all. Because generalization is not the aim, the number of samples
in qualitative research is often restricted and given little weight. While quantitative
data are usually gathered using probability sampling, in which some form of random
selection gives each unit in the population some chance of being selected, qualitative
research mostly relies on non-probability sampling for data collection. Non-probability
sampling does not involve random selection and does not “depend on the rationale of
probability theory” (Trochim, 2006). In addition, each researcher is a unique individual
who brings his or her own viewpoints into the research, so each piece of research is
biased by its researchers’ individual perceptions (Trochim, 2006); thus, according to
qualitative researchers, establishing external validity or objectivity in any research is
pointless.
Additionally, while many researchers claim that there are no numbers
(quantification) in qualitative data, Trochim (2006) argues that “all qualitative data can be
coded quantitatively” or “anything that is qualitative can be assigned meaningful numerical
values”. Indeed, “qualitative” data are usually categorized in the analysis process, and the
act of categorizing is quantitative in itself, which many people fail to realize (Trochim,
2006). Trochim goes further, saying that “all quantitative data is based on
qualitative judgement”, and he believes that without qualitative judgement, quantitative
data are valueless.
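
Trochim's point that categorized qualitative data can be coded quantitatively can be shown with a short sketch. It assumes, hypothetically, that each test item has already been assigned a technique label by qualitative judgement. The labels reuse this thesis's abbreviations (MCQ, GF, ER, ST), but the sample data and the function name `code_items` are invented for illustration.

```python
from collections import Counter


def code_items(item_categories):
    """Turn a list of category labels into counts and proportions per category."""
    counts = Counter(item_categories)
    total = sum(counts.values())
    # Each category maps to (number of items, share of all items).
    return {category: (n, n / total) for category, n in counts.items()}


# Hypothetical qualitative judgements about six test items.
judgements = ["MCQ", "MCQ", "GF", "ST", "MCQ", "ER"]

for category, (n, share) in code_items(judgements).items():
    print(f"{category}: {n} items ({share:.0%})")
```

The numbers produced here are only as sound as the labelling step that precedes them, which is the sense in which quantitative data rest on qualitative judgement.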