At the recent Linguistic Society of America annual meeting in Chicago, Sandra Chung of the University of California, Santa Cruz gave an invited plenary address on the topic “How much can understudied languages really tell us about how language works?” She argued, among other things, that data from understudied languages should play a crucial role in the development of linguistic theory, since only by including them can we get a full picture of the array of phenomena found in human languages that need to be taken into account. She illustrated her talk with examples from her work on Chamorro, an endangered Austronesian language spoken on Guam.
During the question time following Sandy’s talk, one person made a comment along the following lines (I paraphrase, since I was rather stunned to hear the opinion being openly expressed before a linguistics audience, and don’t recall the exact formulation):
Linguistic research needs to concentrate on working with corpora, and for the sort of languages you were talking about, like Chamorro, you will never be able to put together a corpus of sufficient size to do anything meaningful. We should give up on the small (and disappearing) languages and concentrate on ones where we are likely to be able to get a decent-sized corpus.
There was quite a corpus buzz at the meeting (John Goldsmith gave an invited plenary talk entitled “Towards a new empiricism for linguistics”, presenting his ideas about statistical corpus-based research), and I imagine many people had in mind ‘big language’ corpora in the 1–100 million word range (or perhaps even the two billion word corpus of English that the Oxford Dictionary folks have just compiled). At the Symposium on “Mobilizing Linguistic Resources Within Speaker Communities” (held after Sandy Chung’s talk) one of the presenters, Andrew Garrett, was explicitly asked by an audience member how big the text corpus was for Yurok, the indigenous Californian language that he has been working on for some years and which has been the focus of recent language revitalization and teaching efforts.
So, should we just pack up, stop wasting our time, and leave the small languages alone? How big does a corpus have to be in order to be useful?
A partial answer can be found in Friederike Lüpke’s 2005 paper ‘Small is beautiful: contributions of field-based corpora to different linguistic disciplines, illustrated by Jalonke’, published in Language Documentation and Description, Volume 3. Friederike shows how her Jalonke corpus of 7,000 intonation units (roughly 6,000 clauses) of transcribed and glossed text data can be explored quantitatively and qualitatively to uncover significant information on verb argument structure and alternations, genre-based variation, language contact phenomena, and language standardization tendencies. It is an impressive demonstration of the value of a richly annotated ‘small’ corpus.
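To give a concrete sense of the sort of quantitative exploration this involves, here is a minimal sketch in Python (not Friederike’s actual method or data format, and the verb lemmas and role labels below are invented for illustration) of how one might count verb argument-structure frames in a small glossed corpus, assuming each clause has been exported as a verb lemma plus its argument roles:

```python
from collections import Counter, defaultdict

# Hypothetical clause records: (verb lemma, tuple of argument role labels).
# In a real project these would be exported from the annotated corpus
# files (e.g. Toolbox or ELAN output), not hard-coded like this.
clauses = [
    ("fala", ("A", "P")),   # transitive use
    ("fala", ("A",)),       # intransitive use of the same verb
    ("fala", ("A", "P")),
    ("siga", ("S",)),
    ("siga", ("S", "OBL")),
]

# Count how often each verb occurs with each argument frame.
frames = defaultdict(Counter)
for verb, roles in clauses:
    frames[verb][roles] += 1

# Verbs attested in more than one frame are candidates for
# argument-structure alternations worth qualitative follow-up.
for verb, counts in frames.items():
    if len(counts) > 1:
        print(verb, dict(counts))
```

The point is simply that even a corpus of around 6,000 clauses supports frequency counts like these, provided the annotation is rich and consistent.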
Alternatively, there is Andrew Garrett’s response to the LSA Symposium question: the Yurok corpus of audio and text data is larger than the corpus for Luwian, an extinct Indo-European language that has played an important role in elucidating the Anatolian branch. It is also bigger than the corpora for Palaic and several other languages that are ‘well respected’ in historical linguistics research.
Size is just one measure of value, and a pretty poor one, it seems to me, when it comes to endangered language corpora in particular.
Andrew Taylor contacted me with the following information, which he has given me permission to reproduce here:
“Your recent contribution on corpus size reminded me of a paper given by Leonard Newell of SIL Philippines at a conference on lexicography in Manila in 1992, in which he discussed this issue. I no longer have the paper, alas, but if I remember correctly he then suggested aiming for a corpus of a million words. The paper, ‘Computer processing of texts for lexical analysis’, was published in the conference proceedings (Papers from the First Asia International Lexicography Conference. Manila: Linguistic Society of the Philippines Special Monograph No. 35).
Then, in his Handbook on Lexicography for Philippine and Other Languages (Linguistic Society of the Philippines, Special Monograph No. 36, 1995) the third chapter, ‘Developing a textual corpus’, deals with a range of issues involved in compiling a useful corpus. The last section is 3.8, ‘The size of the corpus for a modest project’. By this time, his suggestion was for a somewhat larger corpus.
He estimated that a keyboarder could, conservatively, collect, enter, and do a spelling edit on about one million words of text in a year, and went on to say ‘Based on the experience of the Romblomanon project, a corpus yielding about three million morphemes is considered both attainable and adequate to meet the needs of a modest lexicographic project on a lesser-known language’ (p. 43). However, he did acknowledge the limitations of human and financial resources which usually apply to projects on languages with small numbers of speakers. (I notice the change from words to morphemes in his paragraph, which would affect the count.)
I am not suggesting his view is correct, and he may well have changed it subsequently, but it is an interesting early attempt to quantify the problem.”