The strange case(s) of the missing identity

As of 2024-08-03, there were 3,023,883 occurrence records shared with GBIF with no entries at all in the following fields:

[the original post lists the 22 identity-related fields here]

These are occurrence records but we are not told what is occurring, which seems contrary to the purpose of occurrence records.

To explore this further I checked the institutionCode field in the 3M+ record set and found the top 25 contributors:

| No. of records | institutionCode | Institution |
|---:|---|---|
| 448685 | NHMUK | Natural History Museum, London |
| 430864 | (blank) | (a wide variety of projects) |
| 245303 | BRIT | Botanical Research Institute of Texas |
| 176762 | USNM | Smithsonian Institution, NMNH |
| 156863 | MNHN | Muséum national d’Histoire naturelle |
| 78059 | CU | Clemson University Arthropod Collection |
| 76482 | RSA | California Botanic Garden |
| 71963 | FMNH | Field Museum of Natural History |
| 62698 | MANCH | Manchester Museum, U. of Manchester |
| 61338 | MSU | Michigan State University Museum |
| 59921 | UM | University of Manitoba (Wallis-Roughley Mus. of Ent.) |
| 46303 | UT | Natural History Museum of Utah |
| 45922 | UCSC | University of California, Santa Cruz |
| 45261 | KU | University of Kansas Biodiversity Institute |
| 41269 | ASU | Arizona State University Biocollections |
| 40288 | UCD | University of California, Davis |
| 40015 | INHS | Illinois Natural History Survey |
| 36805 | NHMD | Natural History Museum of Denmark |
| 35650 | BDRS | (Waterwatch project through Atlas of Living Australia) |
| 31106 | UTBC | University of Texas Biodiversity Collections |
| 26245 | MISSA | Mississippi Entomological Museum |
| 25871 | LA | University of California, Los Angeles Herbarium |
| 25108 | MDBA | Murray-Darling Basin Authority (Australia) |
| 21995 | SDNHM | San Diego Natural History Museum |
| 21932 | CLF | Herbiers Universitaires de Clermont-Ferrand |

I’m not sure what to make of this list, but it does seem curious that 15 of the 24 identifiable institutions with unidentified occurrences are in the USA. It’s also interesting that the unidentified records from NHMUK, the global winner, are spread across several collections:

| No. of records | institutionCode | collectionCode |
|---:|---|---|
| 28664 | NHMUK | BMNH(E) |
| 27414 | NHMUK | BOT |
| 207053 | NHMUK | PAL |
| 185554 | NHMUK | ZOO |

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


Data source: verbatim.txt in https://doi.org/10.15468/dl.ufrs46, which collects 8,600,994 occurrence records flagged with “Taxon match none”.

I was curious about the records from UCSC (University of California, Santa Cruz) because they are a relatively small collection, so the issue you’re describing affects around 50% of their records. UCSC is also my alma mater and the reason I work in this field today. In many cases UCSC is an exemplar of things I think many of us want: they are based at a university and actively used for both teaching and research, they are accessioning new specimens in targeted ways, they are the focal point for a regional natural history community, they provide training for undergraduate and graduate students, etc.

The vast majority of the records with missing identities are coming from the UCSC insect collection, which is live-managed in the Ecdysis Portal. This is an active Symbiota portal with peer support for insect collections management and data mobilization. Again, an exemplar of things I think many of us want: data mobilization is early and often, cyberinfrastructure is shared, collections without existing opinions can follow the data norms of collections that have put time into deciding stuff, etc. These UCSC specimens could be redacted individually to prevent them from being included in the IPT data that makes its way to GBIF, but this would be an opt-in decision, not a default.

These specimens do not appear to be low data; rather, they appear to be from a research project and just have not been fully processed since being collected in 2019, e.g. they’re waiting for someone to identify them. We all know that might take a while because [pick your favorite issue concerning the lack of resources for taxonomic expertise]. As @datafixer points out, a lack of identity seems contrary to the purpose of occurrence records. But what would you rather see done differently here?

  1. UCSC could choose to redact individual records without an identification.
  2. Symbiota and/or the Ecdysis Portal could choose to redact records without an identification as a default.
  3. GBIF could refuse to accept records lacking sufficient data.
  4. Data users could ignore these records.

To me, #4 seems both the most practical (for data publishers and publishing systems) and the most useful (for data users). Some minor confusion followed by 60 seconds of active curiosity, aka Googling, led me to find out about a cool natural history resurvey project that UCSC is working on! Any data user will have criteria for ignoring records, and GBIF’s “Taxon match none” flag, or basic operations in any data processing/analysis tool (R, Python, OpenRefine, Excel), makes it very easy to start filtering out cases of missing identity.
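That kind of filtering really is a one-liner or two in pandas. A minimal sketch: the toy DataFrame below is a hypothetical stand-in for a real download (a GBIF Darwin Core Archive’s occurrence table is tab-delimited and would normally be loaded with something like `pd.read_csv("occurrence.txt", sep="\t", dtype=str, quoting=3)`).

```python
import pandas as pd

# Hypothetical stand-in for a GBIF occurrence table; in practice you would
# read the tab-delimited occurrence.txt from a Darwin Core Archive download.
df = pd.DataFrame({
    "gbifID": ["1", "2", "3"],
    "scientificName": ["Puma concolor", "", None],
})

# Keep only records with a non-blank scientificName.
has_name = df["scientificName"].notna() & (df["scientificName"].str.strip() != "")
usable = df[has_name]
```

The same check extends to any other Darwin Core field a user cares about (locality, eventDate, recordedBy), which is why leaving the assessment to the data user is cheap in practice.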



I don’t know which INHS collection @datafixer? Maybe @matt or @tmcelrath can help shed some light if it’s our INHS ENT data.

@ekrimmel, I would prefer to see option 1, 2 or 3, and I am having trouble understanding why you want to leave it to users to decide what is usable data and what isn’t.

I should also say that I focused on identity in this post, but a large proportion of these records (from the major contributing publishers) are also deficient in other field categories. Just one example from the UCSC dataset: https://www.gbif.org/occurrence/2884100333; https://ecdysis.org/collections/individual/index.php?occid=2385109. This UCSC record has neither an identity nor a location. What is it doing in GBIF? What purpose is served by sharing it?

@Debbie, yes, from the insect collection, and as with UCSC (see above), many of the records are hard or impossible to use. In this case, for example
https://www.gbif.org/occurrence/3801587943
the “verbatimLabel” field has locality, coordinates, date and collector, but none of this information is in the Darwin Core fields established for these data items.

It’s hard to avoid the impression that certain collections are putting incomplete or unusable records into biodiversity data’s shared, aggregated pool as placeholders to be completed or made usable at some unspecified future date, as if GBIF was just another CMS, but a public one.



as if GBIF was just another CMS, but a public one

I think this is a balance point in the discussion. From my perspective, I do not see anything practically wrong with it. In an ideal world, I would love to have “complete,” trustworthy data for every record on GBIF. But we don’t live in an ideal world and it is very hard to picture that changing. Constraints on time and resources mean that we can have more data of lesser overall completeness/trustworthiness, or less data of better completeness/trustworthiness. The former puts data assessment in the hands of the data user, and I am in favor of this division of responsibility. But I also see plenty of arguments for the latter: plenty of data users have a limited idea of how to assess their data, “more” is a rationale in and of itself, etc.

In any case, all of these records with a case of missing identity do seem pretty useless, but they also seem pretty harmless. Even the relatively small amount of time it would take to do #1, #2, or #3 above could be better spent managing physical collections, improving the data records themselves, building better software, sustaining data publisher communities, etc., etc. So why bother worrying about these records? I say tongue-in-cheek as someone who has (1) totally enjoyed this thought exercise of worrying about them, and (2) totally wasted time responding here when it could have been spent more productively elsewhere :joy:.


@ekrimmel, if I can take just a little more of your valuable time, please visit
https://www.gbif.org/data-quality-requirements-occurrences
where GBIF says that occurrence datasets offer “evidence of the occurrence of a species (or other taxon) at a particular place on a specified date”, and then lists data requirements including scientificName.

There is nothing practically wrong with sharing unusable records with GBIF that don’t offer the required evidence, and there is nothing practically wrong with leaving it to end users to decide what’s usable and what isn’t.

Fortunately the vast majority of GBIF occurrence records now come from citizen science platforms and include the required evidence. Perhaps the “harmless”, unusable records from collections should be seen as minor pollution of an otherwise valuable resource.


Hm. My two cents and I’m looping in @dshorthouse @trobertson @cboelling too. Why?

I understand your puzzlement over sharing these data contained in a string inside verbatimLabel. Digitization is hard and very time-consuming.

Skeletal records, aka minimal data capture
One workflow opted for by many digitization projects strives to capture only a few fields or, in this case, the verbatim text from the label. Future efforts will atomize these data in the way you are hoping for. The real practical trick is figuring out how and when to complete these. Meanwhile, we have some hope because of:

Indexing
Indexing at iDigBio and GBIF will make these data somewhat more discoverable, even in their stringy state. Thank goodness!

LLMs, e.g. ChatGPT
And I created several examples (see that GitHub ticket above; scroll to near the end of the ticket) that show how you can share these text data with ChatGPT and get back data parsed to Darwin Core (for example). Yes, LLMs are not perfect. We need a lot of (human) work and tools to get these data into useful formats.

I’m guessing (@trobertson?) that clustering algorithms and LLMs will help us in the future at GBIF, to discover the data hidden in the verbatimLabel strings (but no longer invisible)?

As to records that lack even verbatimLabel text (as you point out) I can’t speak for them.

FAIR Data
I think getting these data, even if sparse, out the door does increase their potential for discovery. Data are always caveat emptor. I “plus one” point #4 from @ekrimmel. In regards to this point, I think I hear you suggesting you would expect ALL data in GBIF to be ready-to-use and free from imperfections. I think if we wait for that, we lose opportunity. Those records with “low data” from UCSC may benefit from visibility – by their very emptiness. That is, someone writes to say – can I help? Of course, folks could “redact” and that’s fine. I’d think then you do lose the opportunity to broaden awareness of a potential dataset, even if it’s currently incomplete.

We have much work to do.

It occurs to me that the new MIDS standard coming out, and its use at GBIF, will help reveal these empty fields (in addition to using the search filters as Erica points out).

Making gaps visible is one topic I like very much. I’m not for hiding them or making them go away. I’d rather they show themselves for all to see — and then we can do something about it — together. You making this list is eye-opening for many, I’m sure. What power does it give us? Well, this conversation, for one thing.


@Debbie, I hope you’re suggesting that ML could be used at the data publisher end for organising biodiversity data items in the DwC categories we’ve all worked so hard to establish, and not somewhere downstream.

Please also note that I object to the “straw man” argument being made here that I want an ideal world. I did not say that biodiversity data must be perfect or that perfect data is the only alternative to the unusable data currently being shared by certain collections.

Hi Bob, for me, the ML will be very useful for the data publisher! As to “downstream” I’m not sure what folks like Tim envision at the level of GBIF. It could propose possible data clusters to users as they explore these data.

Ah, if I made too sweeping a statement and misinterpreted your take, my bad. I wanted to suggest that I expect these data to have the sorts of gaps you are finding. I expect exactly what I think you do not?

You wrote: “It’s hard to avoid the impression that certain collections are putting incomplete or unusable records into biodiversity data’s shared, aggregated pool as placeholders to be completed or made usable at some unspecified future date, as if GBIF was just another CMS, but a public one.”

Hope that’s clearer.

Deb

All of our time is valuable, that’s my whole point! Everyone is making choices about how to prioritize the limited time they have, and, e.g., it does not bother me that whoever manages the collections data for UCSC has chosen not to prioritize adding values for dwc:scientificName. Fair if it bothers you. To each their own!

please visit https://www.gbif.org/data-quality-requirements-occurrences where GBIF says that occurrence datasets offer “evidence of the occurrence of a species (or other taxon) at a particular place on a specified date”, and then lists data requirements including scientificName.

These “missing identity” records offer evidence of an occurrence simply by being a digital representation of a physical specimen. Yay! Requiring a value for dwc:scientificName is a good thing to message to data providers (as GBIF does here) but actually enforcing this seems arbitrary. We could bring most of these 3 million records into compliance in a few minutes by adding a kingdom-level value to dwc:scientificName based on the name of the dataset. But doing that doesn’t really add usefulness to these records.

On the other end of the spectrum, I am 100% pro citizen science records, but in some circumstances they offer an example of false precision that could actually be harmful (the same could be said for collections-based records). “Required evidence” is a box tick, not an actual assessment of said evidence. It’s very easy for me to “identify” an organism I know nothing about on iNaturalist with their computer vision functionality, and it is impossible for iNat to be certain my identification is accurate, despite systems (e.g. research-grade observations) that help. None of this seems like a problem to me; again, I feel good putting the responsibility of data assessment on the data users!

Perhaps the “harmless”, unusable records from collections should be seen as minor pollution of an otherwise valuable resource.

^ This I agree with! :slight_smile:


@Debbie, the reason I hope that ML is used by publishers to sort verbatim label data into (verbatim) DwC categories is that I see that as the responsibility of the data publishers. If publishers leave that to “downstream” to do, aren’t the publishers saying “Here’s some raw data, somebody else can work out what it means”?

Getting back to my post, one question I was asking was “Why are data publishers offering occurrence records that don’t meet even the most basic requirement of such a record, namely stating what was occurring?”

Biological records for the last couple of centuries have consisted of a what, a where and a when, and usually a by-whom. Sure, there can be missing or imprecise data items in a record, but at some point the record compiler needs to decide whether or not the record will be useful to other biologists. To suggest that this isn’t the responsibility of the compiler is to suggest (as I said above) that a biological records scheme can also be used as a CMS, and that the fact that a specimen exists is enough for it to be included, even if the what, where, when and by whom are all missing.

Another (so far unanswered) question is why North American institutions figure so prominently in the missing-identity issue.

Quite possibly because North American institutions figure so prominently in contributing data to GBIF? The United States alone contributes ~39% of occurrence records in GBIF and is responsible for ~36% of these “missing identity” records on your top 25.


@ekrimmel, I like that answer because it suggests that the proportion of missing-identity records from all publishers might be more or less the same. But I know from my audits of US institutions that missing-identity records are actually fairly rare, and some US institutions have none at all in their GBIF sharings. These are “compensated” by (for example) the Botanical Research Institute of Texas. One of its collections
https://www.gbif.org/dataset/bb0366a9-2a7e-44f8-9c9a-e49c23dae2f8/metrics
has more than 50% no-ID records and ca. 99% without coordinates.

Do you think the problem here could be that once a US institution shares its data with (or holds its data in) a local aggregator, e.g. SERNEC

https://sernecportal.org/portal/collections/individual/index.php?occid=8772011
https://www.gbif.org/occurrence/4101769764

then it loses control of its records, which could then be shared with GBIF even if the publisher doesn’t want to?

I agree, this situation seems to be happening at the individual institution level (rather than more systemically/regionally), and the drivers are therefore also likely unique to their individual contexts.

These records are from our “in-progress” datasets, usually where we upload images first and then transcribe from those. I agree that they aren’t useful in their current form, but we are working on providing images via the image extension of DwC. Right now they wouldn’t be useful in a large-format dataset, but eventually they might be. It’s harder for us to “filter out” “non-useful” records than to just let them out there where they won’t be used.


Generally, as I see it, data consumers would want to make sure to only use data that is fit for purpose. As such, I would say, responsible use of data requires checking if input data is fit for purpose (whatever those criteria are).

Having guarantees from upstream data sources that the data is compliant with certain criteria will make life easier for data consumers in that they can (in theory) rely on certain structure or content to be ingested (although I would still recommend to validate the data that is to be used).

But I would not want the above to suggest that data publishers have no responsibility in what they publish - it’s always best to correct problems at the source. If data records reflect incomplete knowledge or preliminary status of the data, it would be helpful if this fact is indicated explicitly in some way.

What I find confusing is that, according to [1], GBIF categorizes the attribute scientificName as Required and explains (on the same website, below) that

The items listed below constitute the minimum formal requirements for publishing an occurrence dataset. GBIF.org will not accept a dataset without these terms and will not index the records.

(Actually, the list of terms sorted by status category is above this excerpt, not below it on the page.)

As it seems, this policy is not enforced or has changed. Whatever the case, the documentation and the data ingestion process should be aligned to be consistent.

Imputing a kingdom as taxon in the incomplete records may seem arbitrary, but from my point of view it isn’t. This is because the data records that collections publish are usually about a specimen, collating information about that specimen. In transferring the data to GBIF, a conceptual change may be prone to occur, since now the data record is about an occurrence, which is a totally different thing (although, of course, related to the specimen). An occurrence, as I interpret the GBIF data model, without a specified taxon is nonsensical (and as a data consumer I am left to wonder why there isn’t a taxon). A specimen without taxonomic categorization is fine and simply reflects the state of knowledge on the specimen (as does the inclusion of a verbatimLabel attribute and its respective value).

Imputing a higher-ranked taxon (a kingdom, in the extreme) that can be assigned with confidence will conceptually remedy this issue – although, certainly, an occurrence where the taxon is identified only at the level of a kingdom is not useful for most applications of the data (the same is true for location (“earth”) or time (“sometime between 1500 and 2024”)). But based on the GBIF specifications, these would be valid records, which is what this post is primarily concerned with.

[1] Data quality requirements: Occurrence datasets


@tmcelrath. Hi, Tommy, thanks for joining this discussion. When you write “transcribe”, do you also mean from labels ID’ing the specimen? The problem with the INHS records is that there’s no ID, not so much from my POV that the locality, coordinates, date and collector are only in the verbatimLabel field.

Well, yes, that would be true for any preparation of data for export. But isn’t there a simple filter in TaxonWorks to identify and select the records with some sort of ID?

Thank you Bob for this curious post and the lively discussions!

I wanted to quickly point out that “taxon match none” in GBIF is not the same as “scientificName empty”. Example: Occurrence Detail 4534233256

In the documentation, “taxon match none” is defined as

Matching to the taxonomic backbone cannot be done cause there was no match at all or several matches with too little information to keep them apart(homonyms).

I don’t think there is any flag in GBIF that could help data consumers easily filter out records with empty fields or assess the completeness of a record. Empty fields could be one of the conditions of certain flags (e.g. Taxon match higherrank, ___ date invalid), but records with these flags do not necessarily have empty values in the field(s) assessed.


@ymgan, thank you for making that point. I of course filtered (with command-line tools) the 8,600,994 “Taxon match none” records (see DOI) to get the 3,023,883 records with nothing at all in the 22 fields listed at the top of the post.

Looking through the scientificName entries in the 8M records is pretty disappointing, as the field includes thousands of punctuation-only entries (“?”), numbers (“0.99”), UUIDs, filler (“0-Unknown”), date/times (“2019-12-16 16:55:56 UTC”), comments (“2nd no bloody idea”), descriptions (“5 kleine, rote StĂŒckchen”), corrupted characters (â€œĂ‚Ă„Ă±Ă­Ă  Ă” ĂŒĂšĂ·ĂłĂ°ĂšĂ­ĂȘà”; ïżœïżœïżœïżœïżœ 'ïżœŚ·ïżœïżœÌœïżœïżœïżœïżœ'f. Benigaku), informal names (“Actinomma sp. c”) and unclassifiable entries like “仄排é›Čć±±èŽŠç‚șè”·é»žç™»çŽ‰ć±±äž»ćł°ïŒŒæ‰€ç¶“é€”ćŸ‘ćłć˜‰çŸ©çžŁèˆ‡ć—æŠ•çžŁäș€ç•Œă€‚(æ„ŠćŻŒéˆžïŒŒ2024-03-01)”, which translates as “Take Paiyun Villa as the starting point to climb the main peak of Yushan, passing through the junction of Chiayi County and Nantou County. (Yang Fujun, 2024-03-01)”.
