Duplicates while publishing a big herbarium

Dmitry_Slastunov · May 17, 2023, 12:45am

Hello!

I read that “GBIF doesn’t deduplicate the occurrences between the datasets” in this topic.

But I would be glad if the community could clarify the answer for my particular case.

We have a large herbarium (LE, ~6 million specimens), but we haven’t yet published the specimens with metadata to GBIF as one big dataset. My colleagues have already published several small datasets containing LE specimens.

My question is: when I publish a big dataset representing all LE specimens with metadata and images, it will also contain the same specimens that are already present in the small datasets published earlier (all DwC fields, including occurrenceID, will be the same).

Is this a problem?

If it is, the “big” dataset is the main priority, because the specimens belong first of all to the LE collection and only secondarily to individual projects.

Off-topic: is it possible to include links to small thumbnail images of specimens in a GBIF dataset, but organize the metadata so that clicking the thumbnail on the GBIF occurrence page opens an external link to our collection site? We have a good large-image viewer of our own that shows images at several resolutions. Currently we publish data in such a way that clicking the thumbnail on the GBIF page opens it fullscreen (which of course produces a very blurry image), and only the link below the thumbnail opens our site with the viewer, which is counterintuitive.

I don’t want to include links to the big images in GBIF, though that would solve the blurry-image problem on the GBIF site, because it would greatly increase the load on our server. As far as I can see, GBIF doesn’t create its own thumbnails for large images, so when you open a GBIF gallery it loads the original files. If I give such a gallery links to big images, our server’s internet connection will be overloaded: our “medium size” images are ~10 MB each, and if several people open a GBIF gallery with our images simultaneously, our site will suffer and may go down.
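For readers unfamiliar with the setup described above, here is a minimal sketch of how one row of a multimedia extension file might look when the image link points at a thumbnail and the page link points at an external viewer. It assumes the GBIF Simple Multimedia extension terms `identifier` (the media file URL) and `references` (a web page showing the item); the URLs, the occurrenceID, and the helper names are hypothetical, and the sketch does not change which link GBIF opens when the thumbnail is clicked.

```python
import csv
import io

# Column subset of the Simple Multimedia extension used in this sketch.
# "coreid" links the row to the occurrence record in the core file.
MULTIMEDIA_COLUMNS = ["coreid", "type", "format", "identifier", "references"]


def multimedia_row(occurrence_id, thumbnail_url, viewer_url):
    """Build one extension row: thumbnail as the media file, viewer as the landing page."""
    return {
        "coreid": occurrence_id,
        "type": "StillImage",
        "format": "image/jpeg",
        "identifier": thumbnail_url,   # small file, cheap for the server to deliver
        "references": viewer_url,      # external page with the high-resolution viewer
    }


def write_multimedia(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MULTIMEDIA_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical identifiers and URLs, for illustration only.
row = multimedia_row(
    "urn:example:LE:0001",
    "https://example.org/thumbs/LE0001.jpg",
    "https://example.org/viewer/LE0001",
)
multimedia_csv = write_multimedia([row])
print(multimedia_csv)
```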

Dmitry_Slastunov · May 28, 2023, 4:04pm

I’ll post the reply that I received from Dmitry Schigel (translated from Russian into English via Google Translate):

Me: Can one herbarium specimen be included in several datasets at once?

Dmitry Schigel: The short answer is no, it is not recommended. The longer answer: such a tactic would lead to duplication, with no benefit for using and citing this data outside the original context of its creation and/or publication. Datasets, unlike articles, are generally not organized around individual or group projects, but can correspond to segments of collections, for example by taxon (a dataset for ferns, a dataset for agars), identified cabinets with Komarov samples digit are a separate date and resp.

Me: LE is a large herbarium, and staff members are interested in making datasets for specific projects. Datasets with LE samples have probably already been published, and there are probably links to them in published articles.

Dmitry Schigel: actually published - if we are talking about paper floras and additional materials in identified articles, then we are talking about the fundamental, theoretical availability of this information (open data), in the absence of practical use and detectability (FAIR data). There’s a PDF prison meeting here; in a ratze - if the dataset or its version is not available in the international standard form, we are not talking about FAIR - the choice, at the same time, is for the data holder - both for the researcher and for the discovery through which the data are linked.

Me: If inclusion of a single specimen in multiple datasets is unacceptable, then what happens to the specimens in the small datasets after the release of a large, complete LE herbarium dataset?

Dmitry Schigel: There will be duplication, which is necessarily resolvable, but so far these are pilot developments. The best thing to do is to treat identifiers very carefully, such as herbariums, singulars, database identifiers.

Me: Excluding the existence of small datasets from the large LE dataset is, of course, not an option, since, first of all, all these food sources are the LE herbarium.

Dmitry Schigel: It often happens that small datasets take on digital life long before the dataset where the data comes from. Often in this case, the digitization is more detailed, richer than the carpet process for the collection. Clustering at the link above gives hope to “project” duplicates and create the most complete record based on unsynchronized digitizations. You can wait a long time until it comes to a large dataset. I can’t give clear advice, since fragmentation of data provides more fractional authorship, which is often important for curatorial work with digital resources - there are many options and arguments, if you are interested, you can call.
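Since the advice above comes down to handling identifiers carefully, a minimal sketch of checking which occurrenceID values a large dataset would share with an already published small one may be useful. It assumes both datasets are available as CSV text with an `occurrenceID` column; the sample records and function names are hypothetical and are not taken from any LE dataset.

```python
import csv
import io


def occurrence_ids(csv_text, id_column="occurrenceID"):
    """Collect the non-empty identifiers found in one CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = set()
    for row in reader:
        value = (row.get(id_column) or "").strip()
        if value:
            ids.add(value)
    return ids


def shared_ids(big_csv_text, small_csv_text):
    """Return identifiers present in both exports, sorted for stable output."""
    return sorted(occurrence_ids(big_csv_text) & occurrence_ids(small_csv_text))


# Hypothetical sample data, for illustration only.
big_csv = """occurrenceID,catalogNumber,scientificName
urn:example:LE:0001,LE 0001,Betula pendula
urn:example:LE:0002,LE 0002,Pinus sylvestris
urn:example:LE:0003,LE 0003,Dryopteris filix-mas
"""

small_csv = """occurrenceID,catalogNumber,scientificName
urn:example:LE:0003,LE 0003,Dryopteris filix-mas
urn:example:LE:0999,LE 0999,Athyrium filix-femina
"""

shared = shared_ids(big_csv, small_csv)
print(shared)
```

A list like this shows in advance which records would appear twice, which helps when deciding how to handle the small datasets.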

