The long road to language resources—CLARIN

CLARIN, the ‘Common Language Resources and Technology Infrastructure’, is a European initiative to support the creation, curation and exploration of language material for research purposes and for as broad an audience as possible. The stated aim is that you should not need to be a technical expert to use the corpora, lexica and annotations that CLARIN targets.

It is set up as a European Research Infrastructure Consortium (ERIC). This is a huge project, with a budget of some €104 million. CLARIN-D is the German section of CLARIN and it recently had its 2-year showcase, which I was able to attend (see current activities at http://clarin-d.net/de/aktuelles/). Given that these are only the first two years of a long-term project, it has clearly achieved a great deal already, and certainly more than can be glimpsed in a short blog post.

This is part of a ‘roadmap’ process that actually leads somewhere, unlike the Australian version I reported on earlier, which appears to have cost hundreds of thousands of dollars only to be abandoned before it was even published.

In its place arose yet another committee structure, the Australian Research Committee (not to be confused with the Australian Research Council), which is now setting a new Australian research agenda and includes not a single Humanities and Social Sciences (HASS) researcher in its membership (see its webpage). This ARCommittee released a set of guidelines on June 21st which may, for the next period, be important for funding applications to the Australian government.

But I digress. Back to CLARIN-D and the nine centres in Germany working on a timeline ending in 2020 (yes, a funding programme that covers 12 years!).

The sorts of questions that CLARIN should be able to answer are:

      • give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)
      • give me all negative articles about Islam or about soccer in the Slovenski Narod daily newspaper (1868-1943)
      • find Norwegian TV news interviews that involve speakers with a German accent
      • summarize all articles in European newspapers of April 2012 about machine translation – in Nynorsk
      • Show me the pronoun systems of the languages of Alaska

source: http://clarin.b.uib.no/files/2012/08/krauwer-clarino.pdf, page 4

Most tools shown at the workshop centre on text processing in well-known languages, but some central technologies under development could underlie tools for language documentation work. For example, ISOcat is a data registry for concepts used in linguistics that could serve as a point of reference for part-of-speech tags, specifying usage more clearly than present practices generally do. However, it is rather cumbersome, and is designed for developers to implement rather than for individual researchers to use. It could be the point of reference for newly developed tools that display encoding concepts from ISOcat, with provision for new ones to be added. A big problem that will no doubt emerge is a proliferation of ‘standard’ terms, each slightly different to the next and each embedded within its own community and history of practice.
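
To make the idea of a shared point of reference concrete, here is a minimal sketch of two projects mapping their own part-of-speech labels onto common concept identifiers. Everything in it is an assumption for illustration: the identifiers, the two tagsets and the function names are invented, and none of it reflects ISOcat's actual entries or interface.

```python
# Hypothetical shared registry. These identifiers and definitions are
# invented for illustration and are NOT real ISOcat entries.
REGISTRY = {
    "DC-noun": "noun",
    "DC-verb": "verb",
    "DC-pronoun": "pronoun",
}

# Two hypothetical projects that use different local labels
# for the same underlying concepts.
PROJECT_A = {"N": "DC-noun", "V": "DC-verb", "PRO": "DC-pronoun"}
PROJECT_B = {"noun": "DC-noun", "vb": "DC-verb", "pn": "DC-pronoun"}


def to_concept(tag, tagset):
    """Return the shared concept id for a local tag, or None if unmapped."""
    concept = tagset.get(tag)
    return concept if concept in REGISTRY else None


def same_concept(tag_a, tag_b):
    """True if a Project A tag and a Project B tag point at the same concept."""
    a = to_concept(tag_a, PROJECT_A)
    b = to_concept(tag_b, PROJECT_B)
    return a is not None and a == b
```

With such a mapping, `same_concept("PRO", "pn")` is true even though the two projects never agreed on a label, while an unmapped tag such as "ADJ" simply returns no concept. The proliferation problem shows up as soon as two registries of this kind exist side by side.
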
So far, CLARIN has provided storage space and personal workspace (rather like RDSI and NECTAR in Australia). Several existing projects have become part of CLARIN. WebLicht, for example, is a chain of tools that do part-of-speech tagging, parsing, lemmatisation and so on for mainstream languages, run as a set of interlinked services hosted at different physical locations around the CLARIN-D projects. TextGrid is another tool that has, since its start in 2006, established the infrastructure for a text-based virtual research environment.
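
The notion of a tool chain can be sketched in a few lines: each step takes a document, adds one layer of annotation and hands it on. This is only a toy under stated assumptions. The steps run locally rather than as distributed web services, the lexicon is invented, and none of the function names correspond to WebLicht's real interface.

```python
from functools import reduce

# Toy lexicon, invented for illustration: word -> (part of speech, lemma).
LEXICON = {"dogs": ("NOUN", "dog"), "bark": ("VERB", "bark")}


def tokenize(doc):
    """Add a 'tokens' layer by lowercasing and splitting on whitespace."""
    doc = dict(doc)
    doc["tokens"] = doc["text"].lower().split()
    return doc


def tag(doc):
    """Add a 'pos' layer; unknown words get the placeholder tag 'X'."""
    doc = dict(doc)
    doc["pos"] = [LEXICON.get(t, ("X", t))[0] for t in doc["tokens"]]
    return doc


def lemmatize(doc):
    """Add a 'lemmas' layer; unknown words are kept unchanged."""
    doc = dict(doc)
    doc["lemmas"] = [LEXICON.get(t, ("X", t))[1] for t in doc["tokens"]]
    return doc


def run_chain(text, steps):
    """Pass the document through each step in order."""
    return reduce(lambda doc, step: step(doc), steps, {"text": text})


result = run_chain("Dogs bark", [tokenize, tag, lemmatize])
```

Because each step only reads the layers before it and adds its own, steps can in principle live on different machines, which is the arrangement the distributed services described above rely on.
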
The projects that look likely to be of most use to language documentation are the media annotation services, such as Avatech for automatic recognition of video content, SpeechFinder and WebMAUS (also mentioned earlier here).

 
