Webinar 2: Controlled vocabularies (Bentley and Weiland)

The following question(s) were asked in the Collection Management Systems Webinar and will be answered here.

Andrew Bentley: In reviewing this it strikes me that the new unified data model relies more heavily on controlled vocabularies - particularly with the ***type fields in most tables. How does this model intend to address this perceived conflict and potential for ballooning terms found in these particular fields?

Claus Weiland: Within DiSSCo, we use comprehensive typing for the Digital Specimen and other digital objects, including ImageObject, VideoObject, etc. All are detailed with regard to interoperability with other (cross-domain) object types (they are “self-contained”). How do you detail digitalEntityType in your model? We will have a Type Registry; maybe a “digitalEntityTypeID” linking to this registry could be added?

Response:

Specifically to Andy’s questions, “How does this model intend to address this perceived conflict and potential for ballooning terms found in these particular fields?”

  1. What is the perceived conflict? You don’t say what it is.

  2. The model doesn’t address what happens in the community to enable the model to answer questions; it provides the structure that lets the community do so in a unified (consistent) way, to the extent they want to put energy into making that happen.

The Unified Model highlights the benefits of using controlled vocabularies even more than Darwin Core. I think the reason the vocabularies stand out more in the Unified Model than ever before is that all of the entities have a type (e.g., eventType, entityType, [any]AssertionType), whereas Darwin Core really only had four: dc:type, dwc:basisOfRecord, dwc:measurementType, and dwc:relationshipOfResource.

The fundamental problem with dwc:basisOfRecord is that a “record” in Darwin Core is about a bunch of different things, since it is a flattened view. It is about a dwc:Event, potentially about a specimen, parts of a specimen, related specimens, photos, sequences, places, people’s roles, etc. In the Unified Model we can respect that all of these subjects are jumping-off points to look at connected data from various perspectives. For example, we might be interested in such differing things as “everything that happened within 1 km of this place” and “which collections have a cranium of the species associated with this genetic sequence”. They are all valid questions, but they are hindered by the limitations of looking at the biodiversity world through the lens of a flattened dwc:Occurrence, because the Occurrence record was designed to answer the question, “where and when was this species found and what was the basis of the evidence for that”.

To be able to ask more questions readily, we benefit from two big things. One is to support the perspective the question is coming from. This is where the structure of the conceptual model comes in, with concrete classes such as MaterialEntity and DigitalEntity. The other is to support the extraction of the specific data of interest without having to sort through a lot of data that is not of interest. This is where the shared vocabularies come in. They let us filter into the results what we are specifically interested in getting back, and let us filter out what we know we don’t want when we ask the question. The utility of those vocabularies depends on how well developed they are, where by developed I mean they are sufficiently complete, sufficiently understood, and sufficiently used that they really do help people get what they are after. This much is no different from Darwin Core. What IS different is the impact the effort to develop vocabularies can make, because more questions are enabled by it in the new model. More effort, more capability.
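As a rough illustration of the two ideas above (a class to set the perspective, and a type vocabulary to filter in or out), here is a minimal Python sketch. The `Entity` record, the `select` helper, the sample records, and the type values such as "cranium" and "still image" are all hypothetical; they are not taken from the Unified Model or from any published vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    entity_class: str  # perspective, e.g. "MaterialEntity" or "DigitalEntity"
    entity_type: str   # value drawn from a (hypothetical) shared vocabulary


# Hypothetical sample data, for illustration only.
RECORDS = [
    Entity("e1", "MaterialEntity", "cranium"),
    Entity("e2", "MaterialEntity", "tissue"),
    Entity("e3", "DigitalEntity", "still image"),
    Entity("e4", "DigitalEntity", "genetic sequence"),
]


def select(records, entity_class, include=None, exclude=None):
    """Keep records of one class, filtering types in and/or out."""
    include = set(include) if include is not None else None
    exclude = set(exclude or ())
    return [
        r for r in records
        if r.entity_class == entity_class
        and (include is None or r.entity_type in include)
        and r.entity_type not in exclude
    ]


# Filter in: only crania among the material entities.
crania = select(RECORDS, "MaterialEntity", include={"cranium"})
# Filter out: every digital entity except still images.
non_images = select(RECORDS, "DigitalEntity", exclude={"still image"})
```

The filter is only as useful as the vocabulary behind `entity_type`: if publishers used free text there, the `include` and `exclude` sets would have to anticipate every spelling.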

So if there is a conflict it seems to me that it would be between the data publisher with few resources and the data user with specific needs. Ideal for this data publisher is to be able to say, “Here is what I have”. Ideal for the data user is to have someone interpret faithfully everything the data publishers have shared and provide access to it in support of as many questions as possible. A solution for the first problem is simple data publishing models with simple publishing tools. A solution to the second problem is a data aggregator with a rigorous model, well-developed vocabularies, and a processing pipeline that can interpret to get the most out of what data publishers share.
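To make the aggregator side of this concrete, here is a minimal sketch of one interpretation step: mapping whatever a publisher wrote onto a controlled term, and setting aside what cannot be mapped. The `SYNONYMS` table, the `normalize` and `interpret` functions, and the example values are assumptions made up for illustration; this is not how any particular aggregator's pipeline is implemented.

```python
# Hypothetical synonym table: publisher wording -> controlled vocabulary term.
SYNONYMS = {
    "skull": "cranium",
    "cranium": "cranium",
    "tissue sample": "tissue",
    "tissue": "tissue",
}


def normalize(verbatim: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(verbatim.lower().split())


def interpret(values):
    """Split verbatim values into interpreted terms and unmatched leftovers."""
    interpreted = {}
    unmatched = []
    for value in values:
        term = SYNONYMS.get(normalize(value))
        if term is None:
            unmatched.append(value)  # kept for later vocabulary work
        else:
            interpreted[value] = term
    return interpreted, unmatched


interpreted, unmatched = interpret(["Skull", "  tissue   sample ", "whole organism"])
```

In this sketch the publisher still only says "here is what I have"; the verbatim value is preserved as the key, and the unmatched list is where new vocabulary terms or synonyms would come from.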

Claus mentioned type registries and identifiers by which to designate the values from those registries. The Biodiversity Data Quality Task Group 4 (https://github.com/tdwg/bdq/blob/master/tg4/README.md) is keen to advance mechanisms for community development and maintenance of vocabularies of values, and GBIF has piloted a vocabulary registry (https://registry.gbif-uat.org/vocabulary/search) for this purpose. Arctos, in their presentation, invited people to explore their extensive shared vocabularies (“code tables”, https://arctos.database.museum/info/ctDocumentation.cfm) as a contribution to broader community vocabulary management. All of these are in recognition of the importance of vocabularies. If we construct these carefully, and allow for hierarchies, we should be able to make a robust system for exploring biodiversity data that can be useful immediately, and even more useful as vocabularies are developed and refined. When a vocabulary reaches a certain level of stability, and if it makes sense, it could be put through the standards process as was done, for example, with the invasive species vocabularies for dwc:establishmentMeans (https://dwc.tdwg.org/em/, https://registry.gbif.org/vocabulary/EstablishmentMeans), dwc:pathway (https://dwc.tdwg.org/pw/, https://registry.gbif.org/vocabulary/Pathway), and dwc:degreeOfEstablishment (https://dwc.tdwg.org/doe/, https://registry.gbif.org/vocabulary/DegreeOfEstablishment).
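One way to picture why allowing for hierarchies matters: a query for a broad term can automatically pick up every narrower term beneath it, so the system keeps working as the vocabulary is refined. The sketch below is a minimal, assumed representation (a child-to-parent mapping); the `PARENT` table, the terms in it, and the `narrower_terms` function are hypothetical and do not reflect the structure of any of the registries mentioned above.

```python
# Hypothetical vocabulary: each term points to its broader (parent) term.
PARENT = {
    "skeletal part": None,
    "cranium": "skeletal part",
    "mandible": "skeletal part",
    "tissue": None,
    "muscle tissue": "tissue",
}


def narrower_terms(term, parent=PARENT):
    """Return the term together with every narrower term beneath it."""
    if term not in parent:
        raise KeyError(f"unknown vocabulary term: {term}")
    found = {term}
    changed = True
    while changed:  # repeat until no new descendants turn up
        changed = False
        for child, broader in parent.items():
            if broader in found and child not in found:
                found.add(child)
                changed = True
    return found


skeletal = narrower_terms("skeletal part")
```

Adding a new narrower term later (say, a child of "cranium") only requires one new entry in the mapping; existing queries for "skeletal part" would include it without change.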

I guess the perceived conflict I was thinking of is that our community traditionally does not do controlled vocabularies well. Thus a model that relies more heavily on these controlled vocabularies will be more problematic, not less, despite its benefits. This is not to say that the community does not need prodding in the direction of better adhering to controlled vocabularies, but how to do that is a sticky question. It will require a massive effort on the part of the data providers (again), who already have more than enough data problems to solve :)

This WILL be challenging, but it is a challenge we need to accept! This is exactly where we are in the TDWG Material Sample Task Group, which is looking at a controlled vocabulary for the soon-to-be-proposed term dwc:materialSampleType. I believe we have realized that a single “type” term may not be enough to allow for drilling down to topics of interest. Anyone who is ready to take on the controlled vocabulary challenge can start with us in this task group and offer use cases for the vocabularies we are considering. If you are interested, our next set of meetings is on August 17, one at 10 AM Mountain Time and the next at 4 PM Mountain Time. Email me for the meeting link: jegelewicz66@gmail.com