basisOfRecord for Plazi datasets

ekrimmel · July 29, 2020, 3:12pm

I am curious why the occurrence records getting published on GBIF by Plazi have “preserved specimen” as their basisOfRecord. Why not “literature”? I look for specimen records based on basisOfRecord = “preserved specimen” and it seems misleading that I have thousands of Plazi taxonomic treatments in my results. Although those treatments refer to preserved specimens, the treatment is not a preserved specimen in and of itself. For example, this Plazi treatment occurrence record looks like it is actually referring to a subset from within these FMNH specimens…

dnoesgaard · July 30, 2020, 2:28pm

Perhaps @agosti would like to comment?

ekrimmel · July 30, 2020, 11:17pm

I notice that something similar is happening with datasets published by Biodiversity Data Journal. In the data I am using, both the BDJ and the Plazi datasets are checklists, which I also am surprised to find in my occurrence results. It doesn’t necessarily NOT make sense to have checklist records in occurrence results, but it would be nice to be able to decide whether or not I as a user want to include them.

dnoesgaard · July 31, 2020, 6:33am

Erica,

All occurrences, regardless of source dataset class, are returned in occurrence searches. I’m not aware of any way of filtering out occurrences from checklists, at least not in a simple search.

MattBlissett · July 31, 2020, 7:17am

The Basis of Record “Literature” was removed from Darwin Core some years ago, although GBIF still supports it. It was originally intended to share details of specimens known only from literature – early natural history books, digitized collector’s notes, field logbooks etc.

Otherwise, preserved specimens should be equally valid whatever the source. Is there a reason (e.g. duplication or poor quality data) you would like to exclude occurrence records from treatments? They do introduce duplicates, but they also document specimens not yet digitized by the museum or herbarium holding the specimen, or not yet shared by that institution to GBIF. There’s a new, experimental feature from earlier this week to detect record clusters, which has found many duplicates like this, but it is not yet available to use in searches.

There isn’t a way using the GBIF website to exclude things from a search. There isn’t any way to search on the type of dataset (checklist, occurrence, sampling event) an occurrence comes from.

It is possible to exclude records when requesting a data download through the occurrence download API. Using not predicates it would be possible to exclude some publishing organizations that only publish checklists, and then the remaining datasets individually.
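As a rough illustration of the not-predicate approach described above, here is a minimal Python sketch that only builds the JSON body for a download request and does not send it. The function name, the placeholder keys, and the choice of SIMPLE_CSV are assumptions for illustration; the predicate key names (PUBLISHING_ORG, DATASET_KEY, BASIS_OF_RECORD) and the body layout should be checked against the current occurrence download API documentation before use.

```python
import json


def build_download_request(creator, email, excluded_org_keys, excluded_dataset_keys):
    """Build a download request body that keeps preserved specimens but
    excludes the given publishing organizations and datasets.

    Hypothetical helper: names and defaults are illustrative only.
    """
    predicates = [
        {"type": "equals", "key": "BASIS_OF_RECORD", "value": "PRESERVED_SPECIMEN"},
    ]
    if excluded_org_keys:
        # A "not" predicate wraps a single inner predicate.
        predicates.append({
            "type": "not",
            "predicate": {
                "type": "in",
                "key": "PUBLISHING_ORG",
                "values": list(excluded_org_keys),
            },
        })
    if excluded_dataset_keys:
        predicates.append({
            "type": "not",
            "predicate": {
                "type": "in",
                "key": "DATASET_KEY",
                "values": list(excluded_dataset_keys),
            },
        })
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "sendNotification": True,
        "format": "SIMPLE_CSV",
        "predicate": {"type": "and", "predicates": predicates},
    }


if __name__ == "__main__":
    # Placeholder values, not real GBIF keys.
    body = build_download_request(
        "your_gbif_username",
        "you@example.org",
        ["PUBLISHER-UUID-PLACEHOLDER"],
        ["DATASET-UUID-PLACEHOLDER"],
    )
    print(json.dumps(body, indent=2))
```

The resulting body would then be submitted, with GBIF account credentials, to the download request endpoint. Because the exclusion is part of the download itself, it is recorded with the download rather than applied afterwards.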

dshorthouse · July 31, 2020, 11:56am

Reasons for making basisOfRecord something other than PRESERVED_SPECIMEN for Plazi:

These records are machine-generated, generally static & frequently littered with errors
The occurrence data record aligns more with the treatment – a published version of record – than it does with the physical specimen (eg if the specimen has since been destroyed, lost, or ownership transferred this has no bearing on this record whereas it should if the basisOfRecord is PRESERVED_SPECIMEN)
When feedback is sent via gbif.org, whose responsibility is it to respond and make repairs? Plazi, the author of the treatment, or the identified, owning institution?

dshorthouse · July 31, 2020, 12:14pm

Here’s an example record from Plazi: https://www.gbif.org/occurrence/1303591499 from the paper, Smith-Pardo, Allan H. (2016): Two new species of Neocorynura from Guatemala (Hymenoptera: Halictidae: Augochlorini) with a key to the species known from the country. Zootaxa 4161 (2): 193-206, DOI: http://doi.org/10.11646/zootaxa.4161.2.3.

The digitized record from Plazi erroneously places the specimen in Grenada & even has a decimal latitude, 85.9805. The title of the paper is, “Two new species…from Guatemala…”.

Here’s the treatment from the paper:

This was reported 28 days ago https://github.com/gbif/portal-feedback/issues/2866.

Here is the corresponding specimen record from SEMC https://www.gbif.org/occurrence/657674006. This and the Plazi version do not appear to participate in a cluster presumably because the Plazi record is so poor that there is little to go on.

waddink · July 31, 2020, 1:55pm

Hi all,
Interesting discussion. I think the decision in TDWG to remove Literature as basis of record was good - given that it was intended to share details of specimens known only from literature rather than literature in general which I think does not really make sense and is confusing. But, given the very valid reasons stated by David Shorthouse, I think Treatment as Basis of Record would make a lot of sense. Not that I would like to extend BoR with a dozen new terms, but this one seems a valid case for a new controlled term (At least one more is needed for earth sample specimens, but this is not currently in the domain of GBIF). There is also MachineObservation but I think that should not be used either in this case.

Kind regards,
Wouter

ekrimmel · July 31, 2020, 10:44pm

I agree with @dshorthouse above, and in particular I want to highlight the importance of differentiating between an occurrence record directly linked to a physical specimen (e.g. one provided by natural history collections) vs. an occurrence record linked to a specimen via an intermediary (e.g. one provided by Plazi where a treatment is the intermediary). Those treatment occurrence records are undoubtedly valuable, as per @MattBlissett here, but they are functionally different and I think @waddink makes a good suggestion for improving the way we communicate this difference.

tuco · August 1, 2020, 12:21am

I would just like to add a point of information. “Literature” was never among the list of controlled vocabulary terms for basisOfRecord. TDWG never took any action with respect to the non-standard term. Also, the examples for HumanObservation currently say, “Evidence of an Occurrence taken from field notes or literature. A record of an Occurrence without physical evidence nor evidence captured with a machine.” The definition doesn’t say so, but the intention of the term basisOfRecord is to capture the nature of the evidence for the occurrence. Ultimately, the evidence expressed in the record from the literature (in some cases such as the ones that generated this discussion) comes from preserved specimens, so in my opinion Plazi is following the standard perfectly within the standard’s recommendation limitations.
My intention is not to diminish the needs expressed, nor to interject any obstacle for change, but rather to comment on the usage of the standard in its current state.

waddink · August 1, 2020, 7:40am

Hi John,
Thanks for correcting me. You must be right that TDWG never took any action with respect to the non-standard term “Literature”, as the term in DwC only has a recommended best practice (to use the standard label of one of the Darwin Core classes). I got confused by the message from Matt. However, it is new to me that the intention of the term basisOfRecord is to capture the nature of the evidence for the occurrence. The definition currently says: “The specific nature of the data record”. I have always interpreted this as the nature of the record itself, rather than the nature of the evidence for the record. If the latter is the intention, then I think the definition should change. But I think the most obvious use case for the term is rather to select records of a certain type/domain/nature, as they will come with a set of data/metadata that is specific to that type and so determine whether a record is of potential use for a specific research question or not.

Kind regards,
Wouter

dshorthouse · August 3, 2020, 10:37am

Interesting @tuco that the spirit of basisOfRecord was intended to represent the genesis of an item as opposed to its manifestation. If that were logically true, then the basisOfRecord of all the concepts such as PreservedSpecimen, HumanObservation, MachineObservation, etc. could equally be the singular WildPopulation. Likewise then, we could make the argument that the basisOfRecord for a material sample IS PreservedSpecimen. So, why do we have MaterialSample?

But, what we have as definition as @waddink points out is “The specific nature of the data record.” In the IPT, the definition is qualified as “a subtype of the dcterms:type.” This speaks more to the manifestation of the item than its origin. Perhaps it’s the prefix basis within the term that has been the source of confusion over the years.

markus · August 3, 2020, 10:37am

A difficult topic. As much as I think the issues raised by @dshorthouse are important, I think basis of record should still be used as the evidence, the primary reason why the occurrence exists, as @tuco has explained. There will always be data provenance, not only in digital ways, but also via literature. But the BoR should not change just because the specimen record is based on a literature citation, a collection catalogue or an aggregated database. If the occurrence is documented by a specimen, its BoR should be PRESERVED_SPECIMEN. Otherwise I fear we overload the term.

Clustering would be good, as ultimately would a single currently accepted identifier for a specimen. But again, these problems should not determine how we select the BoR of a record.

Dealing with data issues from Plazi is a very different thing. If you do not trust Plazi records altogether, there needs to be a negated publishingOrganisationKey filter to remove them.

We could also propagate the Dataset.type and Dataset.subtype properties to all occurrences and make them searchable. All Plazi datasets are of type=CHECKLIST and their subtype should be TREATMENT_ARTICLE, see https://github.com/gbif/registry/issues/107

ekrimmel · August 3, 2020, 3:32pm

I think this would be helpful. It’s essentially the solution I came up with for my current situation, and although it was not difficult for me to go fetch type from the GBIF dataset API and bring that into my data (to then exclude everything where type = checklist), that step is removed from the download event, which ultimately affects the usefulness of GBIF data use/citation tracking. It seems like being able to search occurrences based on dataset type and subtype would be valuable in particular because there are so many Plazi, BDJ, and probably other literature-based, checklist occurrence datasets.

Making dataset.type and dataset.subtype searchable would also better expose this situation to users so that they can make good choices in regards to what they do with GBIF data. Clustering, adoption of single unique identifiers for specimens, etc. will be helpful in the future but we also need near-term solutions.
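The workaround described above (look up each dataset's type, then drop occurrences from checklists) could be sketched roughly as follows in Python. This is only an assumption-laden illustration, not the exact steps used: the function names are hypothetical, records are assumed to be dictionaries with a `datasetKey` field as in a GBIF download, and the lookup relies on the `type` field returned by the dataset API at `https://api.gbif.org/v1/dataset/{key}`.

```python
import json
from urllib.request import urlopen

DATASET_API = "https://api.gbif.org/v1/dataset/"


def fetch_dataset_type(dataset_key):
    """Return the 'type' field (e.g. 'CHECKLIST') of a GBIF dataset."""
    with urlopen(DATASET_API + dataset_key) as response:
        return json.load(response).get("type")


def drop_checklist_occurrences(records, lookup=fetch_dataset_type):
    """Keep only records whose source dataset is not of type CHECKLIST.

    `lookup` maps a dataset key to its type; it is a parameter so the
    filter can be tried offline with a stub instead of the live API.
    """
    cache = {}  # one lookup per dataset, not one per record
    kept = []
    for record in records:
        key = record["datasetKey"]
        if key not in cache:
            cache[key] = lookup(key)
        if cache[key] != "CHECKLIST":
            kept.append(record)
    return kept
```

Because this filtering happens after the download, the excluded records are still counted in the download itself, which is the citation-tracking concern raised above.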

waddink · August 12, 2020, 2:33pm

I agree this would be helpful, however it is a solution that only works within GBIF. I think it would still be useful to continue discussion on TDWG standard level. I noticed that RecordBasis in ABCD3 is described as “An indication of what the unit record describes”. In case we would treat DwC BoR as nature of the evidence for the occurrence, RecordBasis would not be compatible with BoR anymore and we would create confusion. So I still think “PRESERVED_SPECIMEN” is not appropriate for treatments. NB In case Plazi would use ABCD RecordBasis instead of DwC BoR that would have the additional benefit of being able to use the element SourceReference which provides for the case that the record is based on a publication.

DagEndresen · October 21, 2020, 3:53pm

Maybe what we do need is an Evidence class in Darwin Core and an “evidenceType” = “Literature” / “Treatment” or similar? [or simply rdf:type for each class/thing]

The denormalized simple Darwin Core record is lumping together many different things, and describing the type for all these things with the basisOfRecord will in my opinion heavily overload this term. dwc:basisOfRecord = Literature makes no sense to me.

If the dwc:Occurrence is the species occurrence in nature (in situ), then maybe also a specimen (dwc:PreservedSpecimen , dwc:MaterialSample, etc) is not an appropriate value for basisOfRecord – if basisOfRecord here means the type for the “species occurrence”?

agosti · October 21, 2020, 9:41pm

Dear All
Sorry that I missed this discussion and that we (Plazi) did not contribute. We only found out today during our symposium SYM09 at TDWG because @ekrimmel brought it up in the chat during my lecture.
We agree that we should discuss what a material citation is: clearly not an occurrence but a citation of one. And in fact it might cite many occurrences or specimens.

What is the best way to proceed on this clearly very vital discussion, especially now that GBIF is starting to cluster at large scale, which makes a digital specimen more likely?
We consider the material citation important for GBIF too, because it is in fact the interface from an occurrence to all the data about the specimen in the literature:
A specimen is cited in a material citation, which is part of a taxonomic treatment (which provides an expert opinion of the taxonomic name), which is part of the scholarly publication. A treatment might cite other treatments, and with this more data might become available.
In other words, we are very much interested not only in the discussion but also in finding a solution that fits our (yours and ours) needs.

Once again, sorry for missing out on this discussion.

donat

P.S. Another aspect brought up by @dshorthouse, erroneous data, is something we are already working on. This is a problem, though, that needs effort at various levels, from the authors, the publishers and our algorithms to decisions about what and how GBIF wants to import the data. We have in place a feedback mechanism to solve problems quickly, but, as Dave pointed out, emails might not be the best because they can also fall between the cracks. What @trobertson showed today at GB27 will solve this to a large extent, more so if we can implement what @mgrosjean is discussing with our @mguidoti

agosti · October 26, 2020, 9:31am

The problem with the decimal degree is actually not a Plazi problem; the source of the error is the original publication.

Thanks for this, since we are looking for such cases to illustrate the various sources that lead to “erroneous” GBIF occurrences mediated by Plazi.
