Greetings @cecsve @mgrosjean @larussell @markus and all,
SPNHC-TDWG 2024 starts in just 4 weeks! I’m quite excited about your symposium: SYM13 Controlled Vocabularies: The secret sauce that unlocks the power of data consistency and accessibility!
I would have contributed a talk, *Secret Sauce to Visible Actionable Delicious Sauce*, but alas, I'm already over-committed to other talks / session organization. So I'm writing here to share my enthusiasm for your topic, happily anticipating the community conversation that will come from this effort you are making.
First, at our recent TaxonWorks Together 2024 meeting, please note this talk: Looking Inside One’s (TaxonWorks) Data. Some of you here will recognize this topic as one of great interest to me for quite a few years (even before the CitSci Hackathon at iDigBio in 2014!), including: @trobertson @ekrimmel @dshorthouse @jbest @dimus @tmcelrath @ehaston @seltmann @tkarim @matt @vijaybarve @LaurenceLivermore @Steen @JuttaBuschbom @libby Rob Guralnick, Cat Chapman, and others who have listened to me describe what I’m looking for. Recently, the developers here in the Species File Group wrote some code to envision the beginnings of what I’m dreaming of – Thank You SFG!
Some brief(ish) summary points.
- In a local CMS, many folks struggle to understand ("grok") the contents of a given field. In SQL speak: `SELECT field, COUNT(*) FROM table GROUP BY field`. They might export the contents of a field to a spreadsheet, then sort, etc., to attempt to understand the contents before they even begin to try to fix any issues discovered. (^ HINT for RECODE)
As a result, they lack effective agency in this conversation: it's hard to even understand the contents / issues inside their own data, let alone compare their data to others'.
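To make the "look inside one field" idea above concrete, here's a minimal sketch of that distinct-values-with-counts query, using an in-memory SQLite table with a made-up `preparations` field (the table and values are hypothetical, not from any real CMS):

```python
import sqlite3

# Hypothetical specimen table with one free-text field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (preparations TEXT)")
conn.executemany(
    "INSERT INTO specimens VALUES (?)",
    [("pinned",), ("Pinned",), ("pinned ",), ("ethanol",), ("pinned",)],
)

# Distinct strings and their counts -- the first step toward spotting
# variants ("pinned" vs "Pinned" vs "pinned ") that a controlled
# vocabulary would collapse into one term.
rows = conn.execute(
    "SELECT preparations, COUNT(*) AS n "
    "FROM specimens GROUP BY preparations ORDER BY n DESC"
).fetchall()
for value, n in rows:
    print(repr(value), n)
```

Sorting by count, OpenRefine-style, surfaces the dominant term first and the stray variants below it.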
- At least some of the work to improve "local" ability to contribute better data in this respect means giving these folks tools that help them understand / explore / visualize and then act on their own data.
In TaxonWorks, in just the past few months, we’ve added two new Tasks (Project Vocabulary and Field Synchronize) to improve one’s ability to look inside a given bucket in the database and see what terms / strings exist and the count for those strings. The results are output in OpenRefine style (sort by name / count) AND in a clickable word cloud – so one can “see” the situation (see image next). You can see patterns and act on them.
And then you can pass the "results" to Field Synchronize, which – visually – gives you the power to edit a given field, or pass data from one field to another. We've also added the option to do this editing with regex, if desired. Instead of doing regex at the command line, you can now do it visually. That is, you can see what your regex would do before you apply it. And, using ChatGPT (or similar), it's much easier to learn / use regex (we've given folks more agency).
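The "see what your regex would do before you apply it" idea can be sketched in a few lines – this is not the Field Synchronize code, just an illustrative preview-then-apply pattern with hypothetical values:

```python
import re

# Hypothetical distinct values pulled from one field.
values = ["pinned ", " Pinned", "pinned", "point mounted"]

# Candidate cleanup: strip leading/trailing whitespace.
pattern = re.compile(r"^\s+|\s+$")

# Build a before/after preview instead of editing in place.
preview = [(v, pattern.sub("", v)) for v in values]
for before, after in preview:
    marker = "->" if before != after else "(unchanged)"
    print(repr(before), marker, repr(after))

# Only after inspecting the preview would you commit the change.
cleaned = [after for _, after in preview]
```

The point is the workflow: the substitution runs against every distinct value first, the human inspects the diff, and only then does the edit touch the database.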
-
As @trobertson and @tuco and I discussed long ago, these data can also serve as a metric (at the aggregator level) for whether data are getting "better" with respect to controlled vocabs, or whether there is new terminology that needs adding, or terms needed in different languages for discoverability. As y'all know quite well, the need for all this work is clearly revealed at the aggregator level. You see what's happening inside each concept across different sectors of our greater community.
-
At the aggregator level, there's a clear opportunity to help each community answer the question:
What do other folks in my community (e.g., botany, paleo, entomology, etc.) put into this field?
And I mean without downloading these data – rather, via filters at the aggregator level that create subsets of data to visualize, which folks can then download if desired and use to catalyze controlled vocab conversations in their own communities.
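As one hedged sketch of what "filters at the aggregator level" could look like: GBIF's occurrence search API supports facet counts, so you can ask for the distinct values of a Darwin Core concept within a filtered subset without downloading records. The taxon key and the response below are illustrative assumptions (a hand-made sample in the API's shape, not live data):

```python
import json
from urllib.parse import urlencode

# Build a facet query: distinct basisOfRecord values for one taxon subset.
# taxonKey 212 is assumed here to stand for a community of interest.
base = "https://api.gbif.org/v1/occurrence/search"
params = {"taxonKey": 212, "limit": 0, "facet": "basisOfRecord"}
url = f"{base}?{urlencode(params)}"

# Hand-made sample response in the shape GBIF returns for facets.
sample_response = json.loads("""
{"facets": [{"field": "BASIS_OF_RECORD",
             "counts": [{"name": "PRESERVED_SPECIMEN", "count": 120},
                        {"name": "HUMAN_OBSERVATION", "count": 45}]}]}
""")

# Distinct values and counts for the subset -- the aggregator-level
# analogue of the local GROUP BY query.
counts = {c["name"]: c["count"]
          for c in sample_response["facets"][0]["counts"]}
print(url)
print(counts)
```

Swap the filter for a discipline-level one and you get "what do other folks in my community put into this field?" as a single API call.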
-
Then, of course, we (desperately) need groups like the PDWG (Paleo Data Working Group) that help bridge these controlled vocab gaps (awareness, data skills, tools, standards knowledge, etc) across respective communities.
-
And then, we discover the standards development needs for various vocabularies … through all these levels and processes including ontology development.
If you’ve read this far, and you want to peek inside various Darwin Core fields (from ALA, VertNet, GBIF, and iDigBio) to grasp the scope of the need for this vocab work, have a look-see inside:
- the thread *Curious about uses of the distinct values directory – are you grateful it exists?* and the related GitHub repo.
In happy anticipation of your session and to all of us looking to do our part to improve these data!
Deb
PS: to ALA, GBIF, iDigBio, and VertNet – it would be great to have new dumps of your data in the GitHub repo, for the 27+ fields in Darwin Core that hope for a controlled vocab. We could do some cool stuff comparing what’s in the 2017 files with what’s in our respective databases now being published to aggregators.