How long is a piece of string?

Last month I received the following email query from a colleague:

“I am currently submitting a grant application for a small grant at the HRELP to document …. One concern I have is how many hours it will realistically take to transcribe one hour of text. I have done fieldwork in the past, but this would be the first time that I will have trained a transcriber who would work (mostly) independently. (The linguists on the project would consult with them.) I would like to give some sort of concrete number of total hours transcribed and translated (in contrast to fully annotated).”

Since this is an issue I have been asked about several times, I present here an elaborated version of what I wrote back to my correspondent. (Here I use ‘source language’ to refer to the language of the recording, and ‘target language’ to refer to the language of a translation of the recording. I restrict my remarks to the transcription of spoken languages.)
I wrote back:
The answer to your questions is kind of like the answer to the question: ‘How long is a piece of string?’
There are so many variables:

  1. how many languages/varieties are represented in the recording (is it monolingual or multilingual) and what languages these are
  2. the transcriber’s familiarity with and fluency in the source language(s) (including, if they are a native speaker, whether they speak the same dialect as the interviewees)
  3. whether the transcriber can work alone or needs to work together with someone else (the interviewee or another speaker) to listen to the recording and have it repeated back (possibly at a slower rate) for transcription. Some transcribers do a ‘first pass’ rough transcription that is then checked with another person to arrive at a more refined transcription. The transcription time should be calculated as the sum of the times for these two processes
  4. the phonology of the source language – some languages have more segmental distinctions than others, and, depending on who is doing the transcribing, some distinctions may be more difficult to hear and transcribe than others. If a language has suprasegmental contrasts to be included in the transcription (eg. tonal contrasts) the nature of these will also affect the amount of transcription time. Tony Woodbury reports that:

    “The Eastern Chatino of Quiahije has 20 phonemically different tones, with complex sandhi phenomena that affect morpholexical tones. Transcription alone by trained fluent native speakers takes 1 hour for 5-10 minutes of clear monologic speech. I’m slower than that, and I typically transcribe in tandem the post-sandhi phonemic version and a lexical version, mainly as a check on myself (that commits me to sandhi testing except if the context is just right). So for transcription alone, I’d be more like 1 hour for 2-3 minutes. In reality though, I also gloss and determine inflectional categories as I go, because that also helps me narrow down the tone possibilities.

    I’m a bit slower with the Eastern Chatino of Zacatepec; it is hard because two of the most “populous” tone categories sound exactly alike in isolation and can only be distinguished with sandhi tests.”

    Transcription of other aspects such as melody of songs or chants, or gesture, will require special training and be correspondingly more time consuming.

  5. the transcriber’s familiarity with and fluency in the orthography for the transcription
  6. the transcriber’s familiarity with a number of aspects of the recording, including:
    • genre — talk in more everyday registers may be less time-consuming to transcribe than special and rarer genres, eg. chants
    • topic — talk about more familiar topics is easier to transcribe than less familiar ones, eg. topics that require specialist knowledge
    • mode (monologue vs. dialogue vs. multi-party) – conversation between two people is more difficult to transcribe than monologue, and increasing the number of conversational partners greatly increases the difficulty of transcription
    • setting — recordings made in noisy environments are more difficult to transcribe, especially if there is spoken language (eg. on a TV or radio) in the background
    • identity of participants — if the transcriber knows that the people recorded have particular speech traits then that can help to identify that person in conversation and to transcribe their speech

    The more familiar the transcriber is with these factors, the easier it will be to do the transcription.

  7. the attention spans and stamina of the linguist/transcribers — Pete Budd reports that he found that doing more than 60-90 minutes of transcribing at a stretch was tough for all parties
  8. whether the transcription is digital (typed as a computer file) or analogue (handwritten) and so needs conversion to digital. If digital:
    • whether there is a (continuous) power supply that allows transcribers to work for extended periods
    • what level of IT skills the transcriber has – several colleagues have reported that collaborators had low levels of basic computer skills (like being able to save files and then find them again), which adds to training and transcription time
    • what input method is used (eg. if there are accents or non-ASCII characters, are they entered via the keyboard or via ‘insert symbol’?)
  9. whether the transcription is time-aligned, whether software is to be used for this, how familiar the transcriber is with the software, and how easy it is to use for the given task (eg. ELAN is good for multi-party transcription but requires a lot of training; see this article for an interesting discussion of some relevant issues)
  10. whether the transcription needs checking and post-editing, and how much time needs to be allocated for that
  11. for translation, the level of fluency in the target language
  12. what kind of translation is intended – will it be literal, morpho-syntactic, idiomatic, UN-style, literary? (see Woodbury 2007)
  13. whether notes, exegeses and comments are to be included

The experience of several colleagues is that having video recordings available speeds up the transcription process by making it easier to identify speaker turns and providing some access to context and extra-linguistic cues. Anthony Jukes, a colleague who works in Indonesia on Toratan, found that video recordings made transcription a more bearable and interesting task for the documentation team, and that the transcribers would persistently place audio-only transcription at the bottom of their ‘to do’ list.
A rough rule of thumb seems to be that an experienced transcriber who is fluent in the source language and skilled with transcription software needs a ratio of at least 10:1 for monologue and 15:1 for conversation, ie. 6 minutes of monologue takes at least 1 hour to transcribe, and 6 minutes of conversation takes at least 1.5 hours. For rough transcription plus checking and refinement, a factor of 15:1 for monologue and 30:1 for conversation seems not uncommon. If we add translation, it is not uncommon for a ratio of 50:1 to apply, ie. at least 5 hours is required to transcribe and translate 6 minutes of recording.
Note that, for all of this, as the car ads say, “your mileage may vary”.
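
For anyone who needs to put a number into a grant budget, the rule-of-thumb ratios can be turned into a small calculation. What follows is only a rough sketch in Python: the names `RATIOS`, `work_hours` and `recording_minutes_within` are made up for illustration, the ratios are the lower bounds quoted in this post rather than measurements, and applying the 50:1 translation ratio equally to monologue and conversation is my assumption, since the rule of thumb does not distinguish the two at that stage.

```python
# Ratios of working time to recording time, taken from the rule of thumb
# in this post. They are lower bounds ("at least"), not measurements.
# Assumption: the 50:1 ratio for transcription plus translation is applied
# to both monologue and conversation.
RATIOS = {
    ("transcribe", "monologue"): 10,
    ("transcribe", "conversation"): 15,
    ("transcribe_and_check", "monologue"): 15,
    ("transcribe_and_check", "conversation"): 30,
    ("transcribe_and_translate", "monologue"): 50,
    ("transcribe_and_translate", "conversation"): 50,
}


def work_hours(recording_minutes, task, mode):
    """Minimum working hours needed for a recording of the given length."""
    return recording_minutes * RATIOS[(task, mode)] / 60


def recording_minutes_within(budget_hours, task, mode):
    """Minutes of recording that fit into a budget of working hours."""
    return budget_hours * 60 / RATIOS[(task, mode)]


if __name__ == "__main__":
    # The worked figures from the paragraph above.
    print(work_hours(6, "transcribe", "monologue"))                # 1.0
    print(work_hours(6, "transcribe", "conversation"))             # 1.5
    print(work_hours(6, "transcribe_and_translate", "monologue"))  # 5.0
    # A hypothetical budget of 100 paid hours for transcription and translation.
    print(recording_minutes_within(100, "transcribe_and_translate", "conversation"))  # 120.0
```

On these figures, 100 paid hours would cover about two hours of recording, transcribed and translated. Any of the variables listed above (tone, noisy recordings, unfamiliar software and so on) would push the ratios up, so it is probably wiser to treat the output as a floor than as a forecast.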


Reference
Woodbury, Anthony C. 2007. On thick translation in language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 4, 120-135. London: SOAS.


Footnote: Many thanks to Pete Budd, Anthony Jukes and Tony Woodbury for comments on a draft of this post.

12 thoughts on “How long is a piece of string?”

  1. Interesting post.
    An important issue that slows down transcription, in my experience at least, is the usage of time-aligning software, and in particular ELAN. Although time-aligning sounds like a good idea in general, the actual design of ELAN makes it highly non-user friendly and non-ergonomic, especially since using the software only through the keyboard is not so easy. Moreover, in many cases it is useful to add notes to specific time points (such as “here begins the second story”), and not to time intervals, but ELAN is not flexible enough for this. Of course, all of this can be solved, but the people at the MPI should start designing their software with user-friendliness in mind.

  2. Yeah my estimate is about the same as Peter’s – I say that one minute of recording takes about an hour for a good quality transcription, gloss and translation. Of course there are always lots of factors, but one in particular is that we’re dealing with endangered languages so I’m happy to be a bit less fussy and let some finer points go through to the keeper if it means we can actually process more recordings.
    As for ELAN, I’m not a tech-head but I still find it really easy to use. I’ve taught stacks of people to use it, including endangered language speakers and language workers themselves and they’ve all picked it up pretty well and run with it. Now, I’d like to think that’s attributable solely to my brilliant training skills (joking!) but no, I think ELAN is a great program.
    One of the highlights of my own remote language work has been some great sessions with an endangered language team – me and 3-4 language speakers/workers: One on ELAN playing each annotation through the speakers for us, one or two old ladies repeating and translating for us, one scribe on the white/blackboard for us to jointly agree on the translation and transcription and someone writing it down so that we can just do ‘data entry’ into ELAN at the end of the session. A great way to work and I found ELAN to be much more a help than a hindrance.

  3. Peter I see commodification in EL is here to stay!
    Ariel I think you’re spot on with your comments on ELAN and keyboard use. I guess you’ve tried Transcriber for audio transcription?
    I mention it because previously I’d dismissed it due to lack of support for special characters, but just recently the thought of transcribing two DVD commentaries delivered at break-neck speed drove me back to Transcriber!
    Using a kind of SAMPA notation and then exporting to Toolbox – either through ELAN or see Andrew Margett’s paper – for further annotation turned out to be an order of magnitude quicker than ELAN.
    (of course there are other considerations like number of speakers, video, and who is doing the transcription).
    More broadly, I guess a program can be “easy to use” without being “user-friendly”. ELAN is not too difficult to learn, as both Wamut and the article Peter linked to testify.
    But the main source of frustration (for me, at least) in what is otherwise a nice program is that transcription in ELAN is simply slow(er than it could be), no matter how well you know the program.
    I guess different people have different tolerance-levels for that sort of thing.

  4. Just to add to Peter’s excellent list of factors:
    • orthography and keyboard layout of the target language (aside from phonology and familiarity with it); my Yolŋu transcription sped up noticeably when I defined my own keyboard layout that let me type Yolŋu and English on a single keyboard without switching between two layouts.
    • the speech style and rate of the speaker. Whether there’s head-tail linkage (speeds things up, I reckon); whether there are pauses, e.g. with clear pauses you can use ELAN’s automatic segmentation with less manual editing, which also speeds things up.

  5. Excellent post. One way to speed up the process somewhat is to go through a presegmented conversation quickly while recording someone repeating what is said – but without halting each time to transcribe it. With these repeated utterances in hand it’s quite easy to do a rough transcription on your own. This avoids the situation in which consultants are waiting for you to get it right. Described in more detail here.
