Curious about uses of the distinct values directory - are you grateful it exists?

Debbie · March 7, 2023, 9:40pm

RE: Curious about use of the distinct values directory
See Distinct Values – Why This Data Directory

Please consider this an informal survey. Some six! years ago already, I pinged folks from VertNet, GBIF, ALA, and iDigBio to create outputs of distinct values (with counts) for the 27 (or so) dwc terms that suggest use of a controlled vocabulary. Why (see the above link for more) – to help empower all of us to improve our data at the levels that we can, from local to global, with metrics!

Have you made use of these data? Or have you been inspired by the concept to make changes to how you “use these data” to make the data better in your own CMS or aggregation or in your own communities (e.g. paleo)? Please do tell / share! Lots we could be doing with these data.

Hash tags: #data-use
#data-for-good

@trobertson @tuco @seltmann @MattBlissett @jhpoelen @ekrimmel @tkarim @qgroom @DaveMartin @mtrekels @LaurenceLivermore @sgrant @nickyn @dshorthouse @jmacklin @grungle @matt @mswoodburn @sharif.islam @cubey @ehaston @DavidFichtmueller @jegelewicz @rdmpage @vijaybarve

jhpoelen · March 7, 2023, 10:11pm

@Debbie I can help create a list of dwc terms used across collections from 2018 onwards, with their usage trends. You can do it too! Please let me know which two or three terms you are most interested in, and I can put some examples together.

For instance, if you are interested in lifeStage term values as tracked in Mar 2023 across iDigBio / GBIF registered collections -

preston cat --no-cache --remote https://linker.bio hash://sha256/3923883fbcc7ab50d134f5f14c76710d3c73912887af64a619709cfab8e78f9c\
 | grep hasVersion\
 | preston dwc-stream --no-cache --remote https://linker.bio\
 | grep lifeStage\
 | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/lifeStage"]'\
 | grep -v null\
 | uniq

produces:

NO APLICA
NO DISPONIBLE
[...]
over-winter
[...]
young of year
[...]
embryo
juvenile
adult
[...]

Note that this will take a while, depending on your bandwidth . . .
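One caveat about the last stage, shown as a small sketch: uniq collapses only adjacent repeats, so a streamed list can still contain the same value more than once, while sort -u returns each value exactly once (at the cost of buffering the whole stream). The sample_values helper and its values are hypothetical stand-ins for the lifeStage stream, not real output.

```shell
# Hypothetical sample standing in for the streamed lifeStage values.
sample_values() {
  printf '%s\n' adult juvenile adult adult embryo juvenile
}

# uniq alone: only the adjacent "adult adult" pair is collapsed,
# so "adult" and "juvenile" each still appear twice.
sample_values | uniq

# sort -u: every value appears exactly once, in sorted order.
sample_values | sort -u
```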

And, the values can be traced to exact copies of the original data, which were retrieved from a known location at a documented time.

So, you can make your top 10 distinct lifeStage values of 2018, 2020, 2021, 2022, 2023 using a similar method: substitute uniq with sort | uniq -c | sort -nr and select a specific “anchor” or “version” of the iDigBio/GBIF data universe (e.g., any value from preston history --remote https://linker.bio).
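As a minimal sketch of that counting step: the top_values function name and the sample values are made up for illustration, and only the sort | uniq -c | sort -nr chain comes from the description above.

```shell
# top_values [N]: read one value per line on stdin and print the N most
# frequent values with their counts (default N = 10).
# The function name is hypothetical, chosen only for this sketch.
top_values() {
  sort | uniq -c | sort -nr | head -n "${1:-10}"
}

# Small made-up sample in place of the real stream.
printf '%s\n' adult juvenile adult embryo adult juvenile | top_values 2

# With the real stream, top_values would take the place of the final uniq:
#   ... | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/lifeStage"]' \
#       | grep -v null \
#       | top_values 10
```

Unlike plain uniq, sort has to read the whole stream before it prints anything, so expect no output until the download has finished.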

@Debbie - How would I go about reproducing the values contained in the dwc-qa/data folder (tdwg/dwc-qa on GitHub)? Are these values interpreted values, or values as provided by their contributing institutions?

Debbie · March 7, 2023, 11:44pm

Greetings @jhpoelen. Thanks for your detailed reply!

Some further thoughts and clarifications.

First, inside those datasets, some groups shared only their indexed values. Others (e.g. iDigBio) shared both indexed and raw.
- In this case, I really wanted both, because I wanted to show people
  1. the different goop sometimes found in various fields, AND
  2. what “indexing” means and what the results are, so that I can show them how their records might OR might not be found with a search of a given field as a result of what they put in that field

To reproduce these data please note:
- Some orgs already put up a second dataset (look carefully in the repo)
- We’d need to ask ALA, VertNet, GBIF, iDigBio to provide us with 2023 data dumps (but not replace the earlier datasets)
- It’s possible that GBIF (ask @trobertson) keeps stats for these now – for GBIF. Tim understood my ideas well and I think he may have implemented some of them (like keeping metrics on distinct values found for a given field, over time).
  - For this particular case, the idea is that if the community is empowered with tools like I imagine, the number of distinct values would decrease (hopefully).

What I was envisioning includes ideas like those described here:
- Sahdev S, Paul D, Collins M, Fortes J (2017) Automated Generation of Lists of Unique Values from iDigBio Data Fields to Facilitate Data Quality Improvements. Proceedings of TDWG 1: e20306.

I imagine tools we don’t yet have that are for all of us, not just those of us who can manipulate large CSV files. With these data, easily accessible, visualized, and clickable, I posit that various individuals and communities can better understand their own data issues and move forward to address them together. (Happy to chat more about how I think this can work). To some extent, the paleo data happy hour group, led by @tkarim @ekrimmel Holly Little and Lindsay Walker, have been using these data like I imagine, to help paleo collection managers work together as a collective for improving their individual and community data issues.

For me the goal is a tool (or tools), much like I showed you, that incorporates features and functions found in
- https://voyant-tools.org/ AND
- Carrotsearch and Carrotsquared

We’re moving toward at least pieces of some of these ideas in TaxonWorks. Of course we can hope to see some of this implemented by aggregators. Beyond rows and columns, Data Visualization is not just for researchers. Those managing / creating / curating data need these tools too.

tuco · March 7, 2023, 11:56pm

Yes. I have used them in many contexts since the time they were generated. The foremost has been to help GBIF to produce controlled vocabularies and vocabulary lookups. A second one was to build a comprehensive countryCode-from-country lookup (33k+ distinct values in the country field) to incorporate into BELS to improve location matching performance. A third one was to enhance the VertNet prepublication lookups so that data publishers who chose to use “migrators” to prepare data for publication would have more standardized values than otherwise. Finally, and most recently, we used the combination of vocabularies for lifeStage, sex, and preparations in a North American Ornithological Conference Workshop to try to reach a community consensus on the concepts for these three vocabularies for birds and the mapping of values found in these fields to the chosen concepts. This latter work is in the process of being reconciled with the vocabularies in the GBIF vocabulary server.

datafixer · March 8, 2023, 12:10am

@tuco, it’s excellent that the distinct values lists have been used to help produce lookups in the GBIF Registry and in VertNet prepublication “forms”. To what extent are those lookups actually used by data publishers to select controlled-vocabulary entries rather than entries from CMS or other?

I’m hoping for an “efficacy” measure, something like “invalid entries reduced by X% since lookups introduced”.
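To make the kind of measure I mean concrete, here is a rough sketch. It assumes a hypothetical vocabulary file (one accepted term per line) and distinct-value counts as tab-separated "count, value" lines; the invalid_share function and both formats are my assumptions for illustration, not something any aggregator is known to provide.

```shell
# invalid_share VOCAB_FILE: read "count<TAB>value" lines on stdin and print
# the percentage of records whose value is not listed in VOCAB_FILE.
# Function name and file formats are hypothetical.
invalid_share() {
  awk -F'\t' '
    NR == FNR { vocab[$1] = 1; next }   # first input: the vocabulary file
    { total += $1; if (!($2 in vocab)) bad += $1 }
    END { if (total > 0) printf "%.1f\n", 100 * bad / total }
  ' "$1" -
}

# Made-up example: 3 records say "adult", 1 record says "NO APLICA".
vocab_file=$(mktemp)
printf '%s\n' adult juvenile > "$vocab_file"
printf '3\tadult\n1\tNO APLICA\n' | invalid_share "$vocab_file"
```

Running the same calculation on a snapshot from before and after the lookups were introduced would give the two percentages to compare.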

Debbie · March 8, 2023, 12:14am

Greetings John,
Thanks, wow! Lots of great uses. Between your ornithological use case, BELS, and GBIF vocabs for standard “lookups” and the work of the paleo group, we can see how these data can be useful in different (excellent) ways. Hoping this inspires tool makers and tool users to help us all produce higher quality data. I wonder if VertNet (or GBIF?) could generate metrics for some fields you worked on to see if the resulting data for these fields are improved?

Debbie · March 8, 2023, 12:17am

Hi ya’ @datafixer, some work was done by the TDWG DwC Controlled Vocab group and the DwC Darwin Core Hour group to produce a resource gathering known vocabularies from different disciplines. The idea, as you suggest, was to try to help community awareness of these and perhaps coalesce around using them / enhancing them…

jhpoelen · March 8, 2023, 12:34am

@Debbie Thanks for this engaging discussion. I like your idea to make point-and-click tools. And these tools are built on transformations of data publications. And, with the example I shared earlier, the method of data transformation and the specific data publications are well-defined. So, you wouldn’t have to ask anyone to get access to the raw, unfiltered data. And, you can access this data in a streaming manner, just like when you stream terabytes of data when binge watching all seasons of “Derry Girls”, “Mandalorian”, and “Game of Thrones.” Also, you can shove it on an external hard disk, like I did, for offline analysis.

This may sound like some weird geeky method using strange command-line toys, and they may very well be. And, I think they provide a solid, well-defined foundation for the application you envision. And, they are only a data carpentries workshop away, and available to anyone willing to learn, no VIP access needed.

system · April 7, 2023, 10:35am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
