Niken Larasati Wening



Language Corpora

  1. Introduction

Since the 1990s, a “language corpus” has usually meant a text collection which is:

• large: millions, or even hundreds of millions, of running words, usually sampled from hundreds or thousands of individual texts;

• computer-readable: accessible with software such as concordancers, which can find, list and sort linguistic patterns;

• designed for linguistic analysis: selected according to a sociolinguistic theory of language variation, to provide a sample of specific text-types or a broad and balanced sample of a language.

Much “corpus linguistics” is driven purely by curiosity. It aims to improve language description and theory, and the task for applied linguistics is to assess the relevance of this work to practical applications. Corpus data are essential for accurately describing language use, and have shown how lexis, grammar, and semantics interact. This in turn has applications in language teaching, translation, forensic linguistics, and broader cultural analysis. In limited cases, applications can be direct. For example, if advanced language learners have access to a corpus, they can study for themselves how a word or grammatical construction is typically used in authentic data. Hunston (2002, pp. 170–84) discusses data-driven discovery learning and gives further references.

However, applications are usually indirect. Corpora provide observable evidence about language use, which leads to new descriptions, which in turn are embodied in dictionaries, grammars, and teaching materials. Since the late 1980s, the influence of this work has been most evident in new monolingual English dictionaries (CIDE, 1995; COBUILD, 1995a; LDOCE, 1995; OALD, 1995) and grammars (e.g., COBUILD, 1990), aimed at advanced learners, and based on authentic examples of current usage from large corpora. Other corpus-based reference grammars (e.g., G. Francis, Hunston, & Manning, 1996, 1998; Biber et al., 1999) are invaluable resources for materials producers and teachers. Corpora are just sources of evidence, available to all linguists, theoretical or applied. A sociolinguist might use a corpus of audio-recorded conversations to study relations between social class and accent; a psycholinguist might use the same corpus to study slips of the tongue; and a lexicographer might be interested in the frequency of different phrases. The study might be purely descriptive: a grammarian might want to know which constructions are frequent in casual spoken language but rare in formal written language.

Corpora solve the problem of observing patterns of language use. It is these patterns which are the real object of study, and it is findings about recurrent lexico-grammatical units of meaning which have implications for both theoretical and applied linguistics. Large corpora have provided many new facts about words, phrases, grammar, and meaning, even for English, which many teachers and linguists assumed was fairly well understood. Valid applications of corpus studies depend on the design of corpora, the observational methods of analysis, and the interpretation of the findings.

Applied linguists must assess this progression from evidence to interpretation to applications, and this chapter therefore has sections on empirical linguistics (pre- and post-computers), corpus design and software, findings and descriptions, and implications and applications.

  2. Empirical Linguistics

Since corpus study gives priority to observing millions of running words, computer technology is essential. This makes linguistics analogous to the natural sciences, where it is observational and measuring instruments (such as microscopes, radio telescopes, and x-ray machines) which extended our grasp of reality far beyond “the tiny sphere attainable by unaided common sense” (Wilson, 1998, p. 49).

Observation is not restricted to any single method, but concordances are essential for studying lexical, grammatical, and semantic patterns. Printed concordance lines (see Appendix) are limited in being static, but a computer-accessible concordance is both an observational and an experimental tool, since ordering it alphabetically to left and right brings together repeated lexico-grammatical patterns. A single concordance line, on the horizontal axis, is a fragment of language use (parole). The vertical axis of a concordance shows repeated co-occurrences, which are evidence of units of meaning in the language system (langue).
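
To make the idea of a sortable concordance concrete, here is a minimal sketch in Python. It is not the software used in the studies cited: the function names (`tokenize`, `concordance`, `sort_left`, `sort_right`, `show`), the crude tokenizer, and the default span of four words are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, span=4):
    """Return one (left context, node, right context) triple per occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def sort_right(lines):
    """Order lines alphabetically by the words to the right of the node, nearest first."""
    return sorted(lines, key=lambda line: line[2])

def sort_left(lines):
    """Order lines alphabetically by the words to the left of the node, nearest first."""
    return sorted(lines, key=lambda line: line[0][::-1])

def show(lines, width=30):
    """Print the lines with the node word aligned in a central column."""
    for left, node, right in lines:
        print(f"{' '.join(left):>{width}}  {node}  {' '.join(right)}")

# Hypothetical three-sentence sample; a real corpus would be read from files.
sample = ("She made repeated attempts. He ignored repeated warnings. "
          "They made repeated attempts again.")
lines = concordance(tokenize(sample), "repeated", span=2)
show(sort_right(lines))
```

Each printed line is one fragment of use; reading down the sorted column is what brings recurrent neighbours of the node word together.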

Corpus methods therefore differ sharply from the view, widely held since the 1960s, that native speaker introspection gives special access to linguistic competence. Although linguists’ careful analyses of their own idiolects have revealed much about language and cognition, there are several problems with intuitive data and misunderstandings about the relation between observation and intuition in corpus work. Intuitive data can be circular: data and theory have the same source in the linguist who both proposes a hypothesis and invents examples to support or refute it. They can be unreliable or absent: many facts about frequency, grammar, and meaning are systematic and evident in corpora, but unrecorded in pre-corpus dictionaries. They are narrow: introspection about small sets of invented sentences cannot be the sole and privileged source of data.

There is no point in being purist about data, and it is always advisable to compare data from different sources, both independent corpora, and also introspection and experiments. Corpus study does not reject intuition, but gives it a different role.

3. Some Brief History

There was corpus study long before computers (W. Francis, 1992) and, from a historical perspective, Saussure’s radical uncertainty about the viability of studying parole, followed by Chomsky’s reliance on introspective data, were short breaks in a long tradition of observational language study. Disregard of quantified textual data was never, of course, accepted by everyone. Corder (1973, pp. 208–23) emphasizes the relevance of frequency studies to language teaching, and language corpora have always been indispensable in studying dead languages, unwritten languages and dialects, child language acquisition, and lexicography. So, within both philological and fieldwork traditions, corpus study goes back hundreds of years, within a broad tradition of rhetorical and textual analysis.

Early concordances were prepared of texts of cultural significance, such as the Bible (Cruden, 1737). Ayscough’s (1790) index of Shakespeare is designed “to point out the different meanings to which words are applied.” Nowadays we would say that he had a concept of “meaning as use.” By bringing together many instances of a word, a concordance provides evidence of its range of uses and therefore of its meanings, and this essential point is still the basis of corpus semantics today.

  4. Modern Corpora and Software

Modern computer-assisted corpus study is based on two principles.

  1. The observer must not influence what is observed. What is selected for observation depends on convenience, interests and hypotheses, but corpus data are part of natural language use, and not produced for purposes of linguistic analysis.
  2. Repeated events are significant. Quantitative work with large corpora reveals what is central and typical, normal and expected. It follows (Teubert, 1999) that corpus study is inherently sociolinguistic, since the data are authentic acts of communication; inherently diachronic, since the data are what has frequently occurred in the past; and inherently quantitative. This disposes of the frequent confusion that corpus study is concerned with “mere” performance, in Chomsky’s (1965, p. 3) pejorative sense of being characterized by “memory limitations, distractions, shifts of attention and interest, and errors.” The aim is not to study idiosyncratic details of performance which are, by chance, recorded in a corpus. On the contrary, a corpus reveals what frequently recurs, sometimes hundreds or thousands of times, and cannot possibly be due to chance.

5. New Findings and Descriptions

The main findings which have resulted from the “vastly expanded empirical base” (Kennedy, 1998, p. 204) which corpora provide concern the association patterns which inseparably relate item and context:

• lexico-grammatical units: what frequently (or never) co-occurs within a span of a few words;

• style and register: what frequently (or never) co-occurs in texts.
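
Co-occurrence “within a span of a few words” can be counted directly. The following is a minimal sketch, assuming a pre-tokenized text and a hypothetical helper named `collocates`; the span of four words on either side is an illustrative choice, and raw counts like these would normally be compared against overall word frequencies before drawing conclusions.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens to the left or right of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical toy input; a real study would use millions of running words.
tokens = "repeated attempts and repeated warnings and repeated attempts".split()
for word, freq in collocates(tokens, "repeated", span=1).most_common():
    print(word, freq)
```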

Findings about lexico-grammar question many traditional assumptions about the lexis–grammar boundary. The implications for language teaching are, at one level, rather evident. A well-known problem for even advanced language learners is that they may speak grammatically, yet not sound native-like, because their language use deviates from native speaker collocational norms. I once received an acknowledgment in an article by a non-native English speaking colleague, for my “repeated comments on drafts of this paper,” which seemed both to connote irritation at my comments and to imply that they were never heeded. (I suppose this was better than being credited with “persistent comments”!)

  6. Applications, Implications, and Open Questions

There are often striking differences between earlier accounts of English usage (pedagogical and theoretical) and corpus evidence, but the applications of corpus findings are disputed. Since I cannot assess the wide range of proposed, rapidly changing, and potential applications, I have tried to set out the principles of data design and methods which applied linguists can use in assessing descriptions and applications. Perhaps especially in language teaching, one also has to assess the vested interests involved: both resistance to change by those who are committed to ways of teaching, and also claims made by publishers with commercial interests in dictionaries and teaching materials. Apart from language teaching and lexicography, other areas where assessment is required are as follows:

  1. Translation studies. By the late 1990s, bilingual corpora and bilingual corpus-based dictionaries had developed rapidly. The main finding (Baker, 1995; Kenny, 2001) is that, compared with source texts, the language of target texts tends to be “simpler,” as measured by lower type-token ratios and lexical density, and by a higher proportion of more explicit and grammatically conventional constructions.
  2. Stylistics. Corpora are the only objective source of information about the relation between instance and norm, and provide a concrete interpretation of the concept of intertextuality. Burrows (1987) is a detailed literary case study, and Hockey (2001) discusses wider topics. The next category might be regarded as a specialized application of stylistics.
  3. Forensic linguistics. Corpus studies can establish linguistic norms which are not under conscious control. Although findings are usually probabilistic, and an entirely reliable “linguistic fingerprint” is currently unlikely, corpus data can help to identify authors of blackmail letters, and test the authenticity of police transcripts of spoken evidence.
  4. Cultural representation and keywords. Several studies investigate the linguistic representation of culturally important topics: see Gerbig (1997) on texts about the environment, and Stubbs (1996) and Piper (2000) on culturally important keywords and phrases.
  5. Psycholinguistics. On a broader interpretation of applications, psycholinguistic studies of fluency and comprehension can use findings about the balance of routine, convention, and creativity in language use (Wray, 2002). Corpus-based studies of child language acquisition have also questioned assumptions about word-categories and have far-reaching implications for linguistic description in general (Hallan, 2001).
  6. Theoretical linguistics. The implications here lie in revisions or rejection of the langue/parole opposition, the demonstration that the tagging and parsing of unrestricted text requires changing many assumptions about the part-of-speech system (Sinclair, 1991, pp. 81–98; Sampson, 1995), and about the lexis/grammar boundary (G. Francis, Hunston, & Manning, 1996, 1998).
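
The type-token ratio mentioned under translation studies is the number of distinct word forms (types) divided by the number of running words (tokens). A minimal sketch follows, assuming a crude regular-expression tokenizer and a hypothetical function name `type_token_ratio`. Because the ratio falls as texts get longer, it is only meaningful when comparing samples of similar length.

```python
import re

def type_token_ratio(text):
    """Return distinct word forms divided by running words, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical sample: 5 tokens, 4 types ("the" occurs twice).
print(type_token_ratio("The cat saw the dog"))
```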


Question and answer:

  1. How many principles is modern computer-assisted corpus study based on?

Answer: Modern computer-assisted corpus study is based on two principles:

• The observer must not influence what is observed.

• Repeated events are significant.

  2. What does “language corpus” mean?

Answer: Since the 1990s, a “language corpus” has usually meant a text collection which is:

• large: millions, or even hundreds of millions, of running words, usually sampled from hundreds or thousands of individual texts;

• computer-readable: accessible with software such as concordancers, which can find, list and sort linguistic patterns;

• designed for linguistic analysis: selected according to a sociolinguistic theory of language variation, to provide a sample of specific text-types or a broad and balanced sample of a language.
