Peter Austin’s blog post deals with the searchability of online endangered language archives. As one of the targets of his latest post, PARADISEC apparently does not provide him with the results he wants when he searches its catalog. But searching for ‘Educational material’ in a catalog makes a lot of assumptions about the way that catalog has been constructed, one of which must be that the term is provided by the catalog, or that the typical depositor would use the term in their freeform description of the item. Strangely, the answer he offers is not to provide the infrastructure on which such searches may succeed in future, but to advocate a folksonomy in which such searches will always be sure to fail.
The post is an advertisement for what is undoubtedly a very nice interface to a set of material held by ELAR, but we should also bear in mind the large amount of funding that ELAR/ELDP have had; after eight years we would hope for at least a nice-looking webpage. It is also interesting that ELAR holds only 70 collections when ELDP has funded 216 projects. What has happened to the rest of the material, or am I being too commodifying to ask?
The comments on the post raise OLAC, a great service that provides information for the broader community (including linguists, but especially speakers, who can access it via Google), harvesting information from archives around the world every 8 hours to update its language documentation index. OLAC provides a system for digital archives to maximise the searchability of their catalogs. There are 45 digital archives that take advantage of this free service. That represents almost all language archives in the world, but to date ELAR has unfortunately chosen not to be part of that community.
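For readers who haven’t seen the machinery behind this: OLAC harvests each participating archive over the OAI-PMH protocol, and the 8-hourly update is just a repeated harvest. Below is a minimal sketch of one such request in Python; the endpoint URL is hypothetical, but the ListRecords verb and the ‘olac’ metadata prefix are real parts of the protocol:

```python
# A minimal sketch of one OAI-PMH harvest request, the mechanism OLAC
# uses to poll participating archives every 8 hours. The base URL is
# hypothetical; the ListRecords verb and the 'olac' metadata prefix
# are real parts of the protocol.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://archive.example.org/oai"  # hypothetical data provider

def list_records(base_url, prefix="olac"):
    """Fetch one page of records and print identifier and datestamp."""
    url = base_url + "?verb=ListRecords&metadataPrefix=" + prefix
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for header in tree.iter(OAI + "header"):
        print(header.findtext(OAI + "identifier"),
              header.findtext(OAI + "datestamp"))

list_records(BASE_URL)
```

Any archive that answers such requests is automatically indexed; nothing about this requires a flashy interface on the archive’s side.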
What OLAC may lack in flashiness (although the new faceted search that Tom points to in his comment is pretty smart) it certainly makes up for in depth of coverage; see http://search.language-archives.org/index.html.
And if, as Peter says he is, you are only interested in searching for endangered languages, well, who has had the resources to create a list of those languages? Rather than one of the well-endowed projects providing this resource, the World Oral Literature Project (WOLP) has done a fine first job of it with minimal resources. They harvest suitable archives (those that comply with the relevant standards for the exchange of metadata) to get that information, and, yes, you guessed it, ELAR’s silo catalog is not there either.
Peter dismisses efforts to standardise terms as outdated (in the olden days, it seems, ‘key metadata notions were interoperability, standardisation, discovery, and access’) and advocates a relativist metadata mush with a ‘focus on expressivity and individuality in metadata descriptions’. Expressivity and individuality certainly have their place, but they don’t help when it comes to targeted location of information, especially at the scale of material to be searched on the web. The short set of genre keywords given in Peter’s post is a perfect example.
Looking for ‘songs’ will not find song; looking for ‘kastom’ will not find Custom description, Custom narrative or Custom story, let alone Folk Tale, Narrative, Myth narrative, Narration, Narrative from visual prompt and many more. Who knows what ‘Chronicle’ or ‘Semi-spontaneous interview’ will find. And it is nice that the terms can be in any language, but that reduces the predictability of a search finding anything even further. I can’t see why it is an advantage to have all of those terms that Peter lists rather than a standard set of terms plus a free-form field in which such stream-of-consciousness tags can also be listed.
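To make that last suggestion concrete, here is a minimal sketch of a catalogue that records both a controlled term and the depositor’s own wording, so that both remain searchable. The vocabulary, the synonym table and the function name are all invented for illustration:

```python
# Hypothetical sketch: normalise a depositor's free-form genre tag to a
# controlled term while keeping the original wording in a free-form field.
CONTROLLED = {"song", "narrative", "interview"}

SYNONYMS = {  # invented mapping from observed tags to controlled terms
    "songs": "song",
    "kastom": "narrative",
    "custom story": "narrative",
    "custom narrative": "narrative",
    "folk tale": "narrative",
    "myth narrative": "narrative",
    "semi-spontaneous interview": "interview",
}

def catalogue_genre(raw_tag):
    """Return (controlled_term, free_form); None flags a tag for review."""
    key = raw_tag.strip().lower()
    controlled = key if key in CONTROLLED else SYNONYMS.get(key)
    return controlled, raw_tag

print(catalogue_genre("Kastom"))     # ('narrative', 'Kastom')
print(catalogue_genre("Chronicle"))  # (None, 'Chronicle'): needs review
```

A search over the controlled field then behaves predictably, while nothing of the depositor’s expressivity is lost in the free-form field.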
A product of allowing users to enter their own terms, rather than providing them with a set list plus a free-form field for their own version, is that a collection will not have any standard terms for locating information. Thus, for example, the Arandic songs project in ELAR is tagged with ‘Language: Arandic’, while ‘Arandic’ appears in none of the standard language lists. Searching for the more usual term ‘Arrernte’ does not locate the ELAR items in the first ten pages of a Google search (I gave up looking any deeper than that).
By participating in international standards, the items in the ELAR collection could be found by pages like this: http://www.language-archives.org/language/are
Here, the standard three-letter code at the end of the URL links to a page listing all available information held in participating archives, and this page is updated every 8 hours, effectively providing a dynamic documentation index. Of course there are still problems with the three-letter codes, but they are improving over time, and this and other issues could be addressed by cooperation rather than competition within the small community doing this work.
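As a trivial illustration of the predictability this buys, the OLAC page for any language can be derived mechanically from its ISO 639-3 code (the function name here is my own):

```python
# Derive the OLAC language index URL from an ISO 639-3 code,
# following the URL pattern of the example cited above.
def olac_language_url(iso639_3_code):
    return "http://www.language-archives.org/language/" + iso639_3_code

print(olac_language_url("are"))
# http://www.language-archives.org/language/are
```

No such derivation is possible for a freeform tag like ‘Arandic’.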
ELDP/ELAR is a multi-million dollar enterprise that has been running for eight years and has achieved great things. It could lead the field with open-source tools for linguists to use, and perhaps an open-source version of its catalog for other archives to adopt. Archives like PARADISEC have no funding beyond occasional grants and are staffed by committed people concerned to make legacy linguistic material safe. We are content to know that we have digitised field recordings and curate over 3,000 hours of recorded material that would otherwise have been lost, and that the catalog makes it locatable. We are in the Open Language Archives Community and in WorldCat (1.5 billion metadata items), and we take advantage of this existing infrastructure by having our catalog maximally exposed to targeted search tools.
Another factor that hasn’t been mentioned much in these discussions is “areal density”. An archive has relatively high areal density if, for some area, the archive is likely to hold significant documentation for any given indigenous language in that area; the archive thickly covers the ground in its area of specialization. At one end of the areal density spectrum are archives like the ANLC for Alaska, the University of Washington Special Collections Division for the northwest US and British Columbia, or the archives at Berkeley for California and the adjoining US west; intermediate but still relatively dense are the American Philosophical Society for North America, AILLA for Latin America, and PARADISEC for the Pacific; and not at all dense are new global archives like ELAR and the DOBES Archive. The reason areal density is important is that users, whether academic researchers or community members, are likely to know to check the archives that are areally dense for their areas of interest; for example, probably everybody doing sustained work with an Alaskan language knows to consult ANLC. But because the low-density global archives are hit-or-miss (for any given language, they almost certainly don’t have anything), it’s especially important that their collections be readily discoverable. So I agree with Nick that it’s too bad that they don’t currently contribute to OLAC. (It’s also too bad that two of the most important language archives for North America, the American Philosophical Society and the National Anthropological Archives, don’t contribute to OLAC, but this is probably just because they’re run by archivists who are not in the linguistic community and for whom entities like OLAC are not on the radar.)
Correction to Peter’s post: the fact is that OLAC can’t simply ‘find’ AILLA or the holdings of any archive; the archive must expose its holdings in OLAC metadata format. As far as I know AILLA simply hasn’t yet had the resources to write that script, though they have been a nominal OLAC participant for some time. But I am struck by how many small archives HAVE expended the resources to participate in this community metadata collection effort. And in fact DOBES should be there, under IMDI. So Peter, it’s over to you. Will ELAR join the OLAC community?
I think we all hope so.
Helen
I apologize for writing that the DOBES Archive contents are not discoverable through OLAC, and I thank Helen for pointing out my error. Admittedly, “IMDI” is one of several non-transparent archive names seen in the OLAC list.
The scripting for OLAC metadata is a little time-consuming, but not very much so, especially if one ignores what’s optional.
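For anyone curious what that scripting involves, here is a rough sketch of the core of it: turning one catalogue entry into a Dublin Core record with OLAC’s language extension. The sample entry and its field names are invented; the namespaces are the real Dublin Core and OLAC ones:

```python
# A rough sketch of turning one catalogue entry into an OLAC record
# (Dublin Core plus OLAC's language extension). The entry is invented.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("dc", DC)
ET.register_namespace("olac", OLAC)
ET.register_namespace("xsi", XSI)

entry = {"title": "Arrernte song recordings, tapes 1-3",  # hypothetical
         "language_code": "are"}

record = ET.Element("{%s}olac" % OLAC)
ET.SubElement(record, "{%s}title" % DC).text = entry["title"]
subject = ET.SubElement(record, "{%s}subject" % DC)
subject.set("{%s}type" % XSI, "olac:language")  # marks the OLAC vocabulary
subject.set("{%s}code" % OLAC, entry["language_code"])

print(ET.tostring(record, encoding="unicode"))
```

Most of the remaining work is serving records like this from an OAI-PMH endpoint (or as an OLAC static repository file).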
And a brief PS: PARADISEC is one of those archives which, despite underfunding, has always been in the forefront of community efforts. As a proponent of language codes, a supporter of OLAC, and one of the earliest archives to expose as many of its holdings as IPR allows, it is a model of the kind of cooperation that allows the discipline to take full advantage of an archive’s holdings. ANLC is another such, and so are the Berkeley archives. The amount of participation (and far-sightedness) seems to me remarkable in light of the under-funding.
Thanks, Andrew; you make a good point. Indeed, the scripting isn’t that onerous. But I believe that AILLA was a one-person operation for years. (I should let Heidi speak for herself, of course.) My guess is that the ELAR programmers could do it in a matter of days.
So, do I have this right? ELAR doesn’t code its data by language code? How do you handle ambiguous names under these circumstances?
At the height of their popularity a few years back, user-generated tagging systems, tag clouds or folksonomies, as seen in services such as del.icio.us or flickr, really worked because of the scale of the data subjected to this approach and the size of the active community contributing to it. There was no way anyone was going to sort through millions of photos or links and categorise them, and it was a hassle trying to fit your own specific content into the rigid confines of a controlled vocabulary. Instead, content bubbled to the surface because many, many people happened to tag a few things the same way as many other people, or they tagged different things with the right (i.e. popular) collection of tags. Sure, some stuff would fall through the cracks, but what did it matter: there are only so many hours in a day.
Although this approach is not so popular these days, I still think it can work very well when there are active communities of the right size. It is not a universally applicable solution, however. The recently resurrected delicious appears to be a shadow of its former self, poisoned with occasional bad links because the critical mass needed to get a link into the “popular” listing is nowhere near as high as it used to be. Who knows, though; with time it may draw back a large crowd and start working well again. Flickr, on the other hand, continues despite many competitors entering the market, simply because it has a committed and active user base.
I think a few of the problems with this bold approach by ELAR relate to the size of the community providing the tags. If existing tagging categories were shown to depositors before they hand over materials, or if users were allowed to tag content, then there might be enough convergence for useful folksonomies to emerge; but as it stands, what we get is a fragmentation of the collection rather than interesting sub-groupings across collections. The real question is this: given the specific constraints on the size and dynamics of the community of depositors and users accessing the content, how will the tagging system be helped to evolve as the collection grows? As Nick points out, there are currently 146 projects set to appear some time in the archive, and that number grows every year.
Nick has already raised the other main problem: a user-defined tagging system is not a good way to create an authoritative and exhaustive listing of content of a given type or category. Lack of standardisation also makes it much harder to export to a standards-based system like OLAC.
I agree with Tom. There are enduring good reasons for using controlled vocabulary in subject cataloguing. Professional librarians (and archivists) are trained in knowledge organization and the digital archives we’re discussing could draw more on their expertise. Actually, have any of the archives we’re discussing employed someone with library training?
I don’t see why we need to choose between controlled vocabularies and folksonomies. Why not have both? Folksonomies on their own fragment the data into personal silos; but they could be useful as an adjunct to a controlled vocabulary. Tom’s point about scale is an important one too. Folksonomies could become useful in very large bodies of data. It’s unlikely they’ll work very well on the scale we have in linguistics when used alone.
One further comment… It seems to me almost as if ELAR has gone out of its way to reject really useful parts of the controlled vocabulary already available. Perhaps I’m missing something, but I can’t see that ISO 639-3 codes are being used at all. I agree that there are problems with these codes, but they are better than no codes, and what is wrong can be fixed. At this moment, for example, with Claire Bowern’s help, we’re working at LINGUIST List to try to fix the Australian codes. We hope to have this done by the end of the northern summer. The same can be done elsewhere.
To answer David’s question: we at Berkeley have worked closely with Lisa Conathan, an archivist at the Beinecke Library (Yale University) who has both professional library training and experience and a linguistics Ph.D. Obviously, too, the APS and NAA have trained archivists.
Thanks, Andrew.
Over the last couple of years at AIATSIS, the ASEDA collection (or at least the part of it that has been retained) has been moved to the Australian Indigenous Languages Electronic Collection (AILEC). ASEDA was managed within the Research section (with ignorable advice at times from the Library), but now AILEC is within the Library, and so the resources are listed in Mura and managed by librarians, which in the long term I suppose is for the best. The downside is that the Library doesn’t currently support features ASEDA had, such as OLAC exposure, or the acceptance of a variety of file types (not just TXT, RTF and PDF) and file packages (including linked media). And ISO 639-3 coding is still not quite in reach.
Yes Anthony, agreed: user tagging (with spam control) clearly adds value to a catalogue, and we can imagine how in the longer term it will guide subject cataloguers in managing the controlled vocabulary.
However, what to make of this one example of the (lack of) uptake of folksonomy capability? A few years ago the AIATSIS Library catalogue moved to a new platform and added user tagging. Training was provided on “new Mura”, and users were encouraged to add tags. After a few years, look at the accumulated tags, as a list or cloud (you may need to navigate via the Mura entry page): just 7 tags have been used more than once, and just 5 tags have been added in the last year. And look at them: ‘visited this community’, ‘article published on this topic’, ‘attended’ and ‘personal library’.
I think in the digital age it is actually quite difficult to distinguish a digital archive from the curated display of its contents. I know that when I enter the dimly lit, box-filled, musty basement of a physical archive, I’m dealing with something quite different to the display in the foyer showing off a selection of valuable records, with little plaques giving short, simple interpretations of those records. The same is not so true of digital archives: the foyer and the basement overlap, and it can be hard to distinguish one from the other. I think distinguishing the proper domains of use of tagging and of controlled vocabularies would be useful here, and more generally the distinction is important when looking at search engines for archives too.
To me, a tagging system can be a useful experiment in the dissemination of materials held in an archive, and I say power to those who want to have a crack at it, though I have some reservations myself. And I agree with Anthony: we can have both. But of course, tagging is not really part of the business end of an archive, which is to store records and do it well. Likewise, choosing not to use standardised language codes in the dissemination of materials is a decision one can make (and arguably in some cases it’s a good one), but not recording them elsewhere, in a standardised form and in expanded form where necessary, is just not good archiving. And if you’re interested in interoperability, standards-based vocabularies are a must!
From what I can see poking around on the website, it seems that language codes are recorded only when they are provided by the depositor. But I don’t actually know what ELAR does for its actual archive – basically I assume we are seeing the curated display of their archive. What is kept on a tape backup, or separate server to their webserver, or wherever they keep their originals is presumably a little different. Hopefully they do record ISO 639-3 codes. Doing so for 70 collections is not impossible!
I think it’s important to note that many of the issues of best practice in archiving don’t apply to a curated display; they apply to the archive from which the display is derived. By all means add a tagging system, simplify the interface and provide downloadable derivatives, but keep them in their proper place, and keep the originals well looked after. To turn it the other way around: in Peter’s review, I feel the distinction was not made between searching a catalogue of archival objects and a search engine that constitutes the front end of a curated display. Really, one is rummaging in the basement and the other is strolling through the foyer. I happen to enjoy both activities, but they are quite different, and it doesn’t seem fair to me to treat them as the same. There is nothing wrong with a basement in good order!
I want to clarify and contextualise one aspect of my post that Nick remarks on, namely the search for ‘Educational material’.
First, a bit of context: the research for my post started when a colleague, an ELDP grantee, mentioned that they were planning to develop some materials for educational use in an endangered language community, and asked me if I knew of any samples of such material that they could look at to get ideas about the kinds of things that might be prepared. So I thought, “let’s go and have a look in the endangered languages archives”. My first stop was AILLA (following the order presented in my post), and sure enough the drop-down menu for Genre includes the search term “Educational Material”. As I noted, AILLA includes 56 deposits with this metadata tag. I then tried the same search term in the other online archives, and later in OLAC, with the results described.
I was in Paris last week for a three-day Experts Group meeting jointly organised by the Culture, Education and Communications sectors of UNESCO on the topics of endangered languages, education and policy development. Over the three days there were repeated mentions of the importance of language documentation for policy and materials development for education. Let’s hope that more of Nick’s “typical depositors” (as well as untypical ones) in future do archive the community-oriented educational and language support materials that they are creating (and that they sometimes ask for and get grant money to produce), and that they clearly describe them in their metadata. We can be sure that there is, and will be increasingly, an audience out there searching for just such materials.