At the Linguistic Society of America Summer Institute in Berkeley last week (17-19th July) the National Science Foundation sponsored Cyberling 2009, a workshop exploring how computational infrastructure (called “cyberinfrastructure” in the US, and e-Science or e-Humanities in the UK) can support linguistic research in a variety of fields. There was a panel discussion about data sharing that looked at the proposal:
“A cyberinfrastructure for linguistic data would allow unprecedented access [to] the empirical base of our field, but only if we collectively build that empirical base by contributing data. This panel addresses the benefits of data sharing and the obstacles to the widespread adoption of sharing practices, from the perspective of a variety of subfields”
But the bulk of the workshop was given over to closed discussion sessions by seven working groups looking at annotation standards, other standards, new multi-purpose software (so-called “killer apps”), data reliability and provenance, models from other fields, funding sources, and collaboration structure. The group discussions and resulting final day presentations are available on the Cyberling Wiki.
I was co-chair of Working Group 4 that was charged with discussing “protecting data reliability and provenance”, i.e. how to keep track of the creation of data and analysis and its passage through the electronic infrastructure as researchers access and use each other’s materials. As the Cyberling Wiki says, this is crucial
“for data creators (who need credit for the work they have done and the academic contribution of collecting, curating and annotating data) and the data users (who need to know where the data has come from so they can form an opinion of how much credence to give it and how to give proper credit to the originator of the data)”.
We also looked at how to establish a culture of data sharing and what mechanisms might be put in place to encourage people to share data. Clearly, for endangered language research where data are unique and fragile, these are very important issues.
After two and a half days of intense discussions our group came up with a set of proposals relating to data reliability and provenance that can be summarised as follows: