Greetings @cecsve @mgrosjean @larussell @markus and all,
SPNHC-TDWG 2024 starts in just 4 weeks! I’m quite excited about your symposium: SYM13 Controlled Vocabularies: The secret sauce that unlocks the power of data consistency and accessibility!
I would have contributed a talk, *Secret Sauce to Visible Actionable Delicious Sauce*, but alas, I'm already over-committed to other talks / session organization. So I'm writing here to share my enthusiasm for your topic, happily anticipating the community conversation that will come from this effort you are making.
First, at our recent TaxonWorks Together 2024 meeting, please note this talk: Looking Inside One’s (TaxonWorks) Data. Some of you here will recognize this topic as one of great interest to me for quite a few years (even before the CitSci Hackathon at iDigBio in 2014!), including: @trobertson @ekrimmel @dshorthouse @jbest @dimus @tmcelrath @ehaston @seltmann @tkarim @matt @vijaybarve @LaurenceLivermore @Steen @JuttaBuschbom @libby Rob Guralnick, Cat Chapman, and others who have listened to me describe what I’m looking for. Recently, the developers here in the Species File Group wrote some code to envision the beginnings of what I’m dreaming of – Thank You SFG!
Some brief(ish) summary points.
- In a local CMS, many folks struggle to understand ("grok") the contents of a given field. In SQL speak: `SELECT field, COUNT(*) FROM table GROUP BY field`. They might export the contents of a field to a spreadsheet, then sort, etc., to attempt to understand the contents before they even begin to try to fix any issues discovered. (^ HINT for RECODE)
As a result, they lack effective agency in this conversation: it's hard to even understand the contents / issues inside their own data, let alone compare their data to others'.
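To make the "look inside one field" idea above concrete, here's a minimal sketch of that distinct-values-with-counts query, using an in-memory SQLite table with a made-up `preparations` field (the table and values are hypothetical, not from any real CMS):

```python
import sqlite3

# Hypothetical specimen table with one free-text field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (preparations TEXT)")
conn.executemany(
    "INSERT INTO specimens VALUES (?)",
    [("pinned",), ("Pinned",), ("pinned ",), ("ethanol",), ("pinned",)],
)

# Distinct strings and their counts -- the first step toward spotting
# variants ("pinned" vs "Pinned" vs "pinned ") that a controlled
# vocabulary would collapse into one term.
rows = conn.execute(
    "SELECT preparations, COUNT(*) AS n "
    "FROM specimens GROUP BY preparations ORDER BY n DESC"
).fetchall()
for value, n in rows:
    print(repr(value), n)
```

Sorting by count, OpenRefine-style, surfaces the dominant term first and the stray variants below it.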
- At least some of the work to improve "local" ability to contribute better data in this respect means giving these folks tools that help them understand / explore / visualize and then act on their own data.
In TaxonWorks, in just the past few months, we’ve added two new Tasks (Project Vocabulary and Field Synchronize) to improve one’s ability to look inside a given bucket in the database and see what terms / strings exist and the count for those strings. The results are output in OpenRefine style (sort by name / count) AND in a clickable word cloud – so one can “see” the situation (see image next). You can see patterns and act on them.
And then you can pass the "results" to Field Synchronize, which – visually – gives you the power to edit a given field, or pass data from one field to another. We've also added the option to do this editing with regex, if desired. Instead of doing regex at the command line, you can now do it visually. That is, you can see what your regex would do before you apply it. And, using ChatGPT (or similar), it's much easier to learn / use regex (we've given folks more agency).
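The "see what your regex would do before you apply it" idea can be sketched in a few lines – this is not the Field Synchronize code, just an illustrative preview-then-apply pattern with hypothetical values:

```python
import re

# Hypothetical distinct values pulled from one field.
values = ["pinned ", " Pinned", "pinned", "point mounted"]

# Candidate cleanup: strip leading/trailing whitespace.
pattern = re.compile(r"^\s+|\s+$")

# Build a before/after preview instead of editing in place.
preview = [(v, pattern.sub("", v)) for v in values]
for before, after in preview:
    marker = "->" if before != after else "(unchanged)"
    print(repr(before), marker, repr(after))

# Only after inspecting the preview would you commit the change.
cleaned = [after for _, after in preview]
```

The point is the workflow: the substitution runs against every distinct value first, the human inspects the diff, and only then does the edit touch the database.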
-
As @trobertson and @tuco and I discussed long ago, these data can also serve as a metric (at the aggregator level) for whether data are getting "better" with respect to controlled vocabs, or whether there is new terminology that needs adding, or terms needed in different languages for discoverability. As y'all know quite well, the need for all this work is clearly revealed at the aggregator level. You see what's happening inside each concept across different sectors of our greater community.
-
At the aggregator level, there's a clear opportunity to help each community answer the question:
What do other folks in my community (e.g., botany, paleo, entomology, etc.) put into this field?
And I mean without downloading these data – rather, via filters at the aggregator level that create subsets of data to visualize, which folks can then download if desired and use to catalyze controlled vocab conversations in their own communities.
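As one hedged sketch of what "filters at the aggregator level" could look like: GBIF's occurrence search API supports facet counts, so you can ask for the distinct values of a Darwin Core concept within a filtered subset without downloading records. The taxon key and the response below are illustrative assumptions (a hand-made sample in the API's shape, not live data):

```python
import json
from urllib.parse import urlencode

# Build a facet query: distinct basisOfRecord values for one taxon subset.
# taxonKey 212 is assumed here to stand for a community of interest.
base = "https://api.gbif.org/v1/occurrence/search"
params = {"taxonKey": 212, "limit": 0, "facet": "basisOfRecord"}
url = f"{base}?{urlencode(params)}"

# Hand-made sample response in the shape GBIF returns for facets.
sample_response = json.loads("""
{"facets": [{"field": "BASIS_OF_RECORD",
             "counts": [{"name": "PRESERVED_SPECIMEN", "count": 120},
                        {"name": "HUMAN_OBSERVATION", "count": 45}]}]}
""")

# Distinct values and counts for the subset -- the aggregator-level
# analogue of the local GROUP BY query.
counts = {c["name"]: c["count"]
          for c in sample_response["facets"][0]["counts"]}
print(url)
print(counts)
```

Swap the filter for a discipline-level one and you get "what do other folks in my community put into this field?" as a single API call.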
-
Then, of course, we (desperately) need groups like the PDWG (Paleo Data Working Group) that help bridge these controlled vocab gaps (awareness, data skills, tools, standards knowledge, etc) across respective communities.
-
And then, we discover the standards development needs for various vocabularies … through all these levels and processes including ontology development.
If you’ve read this far, and you want to peek inside various Darwin Core fields (from ALA, VertNet, GBIF, and iDigBio) to grasp the scope of the need for this vocab work, have a look-see inside:
- the thread *Curious about uses of the distinct values directory – are you grateful it exists?* and the related GitHub repo.
In happy anticipation of your session and to all of us looking to do our part to improve these data!
Deb
PS: to ALA, GBIF, iDigBio, and VertNet – it would be great to have new dumps of your data in the GitHub repo, for the 27+ fields in Darwin Core that hope for a controlled vocab. We could do some cool stuff comparing what’s in the 2017 files with what’s in our respective databases now being published to aggregators.