Linguistic Corpora

Linguistic Corpora: A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics). Linguistic descriptions which are ‘corpus-restricted’ have been the subject of criticism, especially by generative grammarians, who point to the limitations of corpora (e.g. that they are samples of performance only, and that one still needs a means of projecting beyond the corpus to the language as a whole). In fieldwork on a new language, or in historical study, it may be very difficult to get beyond one's corpus (i.e. it is a ‘closed’ as opposed to an ‘extendable’ corpus), but in languages where linguists have regular access to native-speakers (and may be native-speakers themselves) their approach will invariably be ‘corpus-based’, rather than corpus-restricted. Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition and machine translation.

- David Crystal, A Dictionary of Linguistics and Phonetics, 2003