Deep Dive: Date-related issues and flags

In the next technical support hour for GBIF nodes, on May 7th at 4:00 pm CEST, we will introduce the first ‘Deep Dive’ session on handling issues and flags. This particular session will focus on date issues.

Several date flags and issues can be associated with records when they are processed in the GBIF pipelines. In this session, we will cover how date and time are parsed, which fields are used for interpreting the occurrence date, and how users can search on date ranges. We will also cover the relevant issues for each of these parts and how publishers can potentially fix them.
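As a preview of the date-range search part, here is a minimal sketch (assuming Python with the `requests` library) of filtering occurrences by their interpreted event date; the `eventDate` parameter of the public occurrence search API accepts a single ISO 8601 date or a comma-separated range:

```python
import requests

# Occurrences whose interpreted eventDate falls within 2020;
# 'eventDate' accepts a single date or a "from,to" range.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"eventDate": "2020-01-01,2020-12-31", "limit": 5},
)
resp.raise_for_status()
data = resp.json()
print(data["count"])  # total number of matching records
for rec in data["results"]:
    print(rec.get("eventDate"), rec.get("scientificName"))
```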

We will be happy to answer any questions, whether or not they relate to the topic. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.

‘Deep Dive’ sessions will go into detail on specific flags and issues that publishers encounter when sharing their data with GBIF and that GBIF users may use to filter their downloads. With over 60 flags and issues currently implemented, each session will focus on one overall topic and cover the issues relevant to it.

@cecsve thanks for sharing.

The video of the presentation has now been added to the Vimeo showcase: Deep Dive: Date-related issues and flags on Vimeo. Here are the questions and answers from the session:

Q: Why does GBIF not interpret event time?

A: GBIF does not use event time in interpretation because it would be too complex to handle reliably. For example, if you go on holiday far away, take some pictures and upload them through iNaturalist when you come back, the time zones can get messy: the recorded time depends on your camera/phone settings and on whether you corrected the observation time. If users need that level of detail, whoever does the analysis will need to decide for themselves what to do with the verbatim data.

Q: Why can you not filter on records with no date? It would be nice in general for the API to have negation searches: no date, no country, no value in any field.

A: Negated searches will be possible in the new GBIF.org portal and are already a feature in the hosted portals UI (example).

Q: Do you know of any tools or algorithms that could be used to interpret verbatim dates before sending them to GBIF? I’ve been dealing with some datasets from historical collections, and the verbatim dates are complex. There are some French revolutionary calendar dates, and there is a mix of numeric and textual dates, e.g. the month written out in letters and the day in numbers.

A: GBIF’s date and time parser is a Java library found in the Parsers project on GBIF’s GitHub. It is available for use, so if your workflow supports Java, you could incorporate it into your checks. It makes an effort to parse textual dates. It was developed and evaluated against the data available in GBIF at the time, which was sufficient to make it worthwhile; we last reviewed it a few years ago.

Another option could be to use ChatGPT or the French ‘Le chat’ by Mistral to parse verbatim dates. Some nodes have had good experience with that option.

There is also a date parsing tool provided by the Canadensys Tools API that may be helpful to use.

Lastly, OpenRefine can also be used to standardize verbatim event dates by applying its clustering mechanism.

It should be possible to add Java libraries to OpenRefine. Following a method similar to the one described in GitHub - RBGKew/String-Transformers (a collection of Java string transformers suitable for use with OpenRefine) might allow using the GBIF date parser library directly, i.e. GitHub - gbif/parsers and https://repository.gbif.org/repository/releases/org/gbif/gbif-parsers/0.67/gbif-parsers-0.67.jar, for example with a Jython expression like:


```python
from org.gbif.common.parsers.date import DateParsers
tp = DateParsers.defaultTemporalParser()
return tp.parse(value).getPayload().toString()
```

Note, however, that the GBIF parsers library has a few more dependencies, and this will not work unless they are also added.

Q: Is it possible to see when a dataset has been updated, perhaps in the API? I would like to perform my own backbone checks and would appreciate knowing if the source dataset (Catalogue of Life (COL)) has been updated.

A: The GBIF backbone consists of more sources than COL, but COL contributes the majority of the names. However, the COL checklist is currently more up to date than the GBIF backbone, since COL is released monthly and GBIF has not updated the backbone in over a year. Eventually, the xRelease will become the GBIF backbone and will be updated monthly. The latest public version of the xRelease can be accessed here: ChecklistBank, and the ChecklistBank API documentation is available at https://api.checklistbank.org/.

The ‘crawl’ API, e.g. https://api.gbif.org/v1/dataset/7ddf754f-d193-4cc9-b351-99906754a03b/process, shows when a dataset was last downloaded from the publisher. If the finishReason is NORMAL, the archive was refreshed at the source (GBIF checks the date of the last archive generation at the source); it was not necessarily updated, since changes to the content are not checked.
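For illustration, a minimal sketch (Python with `requests`; field names such as finishReason are assumed to match the /process response above) that lists the most recent crawl attempts for the example dataset:

```python
import requests

# Dataset key from the example URL above
KEY = "7ddf754f-d193-4cc9-b351-99906754a03b"
resp = requests.get(
    f"https://api.gbif.org/v1/dataset/{KEY}/process",
    params={"limit": 3},
)
resp.raise_for_status()
for attempt in resp.json()["results"]:
    # finishReason NORMAL means the archive was re-fetched from the source
    print(attempt.get("finishedCrawling"), attempt.get("finishReason"))
```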

As described in Registry API :: Technical Documentation, the modified field in the dataset API is updated when the dataset metadata changes, which is very often, but not always, the case when a dataset is ingested/crawled.
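Reading that field is a one-liner against the dataset endpoint (same assumptions as above: Python with `requests`):

```python
import requests

KEY = "7ddf754f-d193-4cc9-b351-99906754a03b"
dataset = requests.get(f"https://api.gbif.org/v1/dataset/{KEY}").json()
# 'modified' changes when the dataset metadata changes (often, but not
# always, at each ingestion/crawl)
print(dataset["modified"])
```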

Q: Will it be possible to filter datasets by publication date? We sometimes need that information for activity reporting to know how many datasets were published in a given year.

A: Currently, the IPT records the first publication date of a dataset automatically, so this information is retrievable. However, subsequent updates do not necessarily mean changes, or changes worth reporting on. The general suggestion has been logged as an issue; please feel free to add to it if you have more ideas: Reporting on node activity · Issue #3 · gbif/CommunityMetrics · GitHub.
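In the meantime, one possible workaround is to page through an organization’s published datasets in the registry and bucket them by year. This is only a sketch: the organization key below is a hypothetical placeholder, and treating the registry’s `created` timestamp as the first publication date is an assumption.

```python
import requests
from collections import Counter

# Hypothetical placeholder; replace with your publisher's GBIF key
ORG_KEY = "00000000-0000-0000-0000-000000000000"

years = Counter()
offset, limit = 0, 100
while True:
    page = requests.get(
        f"https://api.gbif.org/v1/organization/{ORG_KEY}/publishedDataset",
        params={"offset": offset, "limit": limit},
    ).json()
    for ds in page["results"]:
        # 'created' is when the dataset was first registered with GBIF
        years[ds["created"][:4]] += 1
    if page.get("endOfRecords", True):
        break
    offset += limit

print(dict(years))  # e.g. {'2021': 4, '2022': 7}
```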

Q: The very long list of filters in the GBIF UI makes it more challenging to find the filters you are looking for. Will there be a new version of the UI where this is changed?

A: Yes, there will be a new version of the UI, and we expect to introduce it in the next support hour for nodes. It will be very similar to the hosted portals UI and have a much simpler filter setup.

Q: Are GBIF networks maintained, and is it only for national networks?

A: Yes, the network option is maintained by GBIF and is used to group datasets from different publishers. It has been used for herbarium datasets, freshwater datasets and some national networks. Networks can be set up for groups other than national ones and are meant for active collaborations around a subject, taxonomic or geographic scope that cannot otherwise be serviced through the standard GBIF mechanisms. For example, networks should not duplicate the national contribution of a country that is already adequately grouped on the country page, nor should they group datasets simply for the interest of a single user. Networks need active collaboration and curation to be considered.

Please note that the network page is not dynamic and requires editing by the GBIF Secretariat, so the content should be static. You can obtain an overview of certain metrics based on the datasets associated with the network. If you are interested in setting up a network, contact helpdesk@gbif.org, and we will explore the available options.
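As a sketch of the metrics part (Python with `requests`; the constituents endpoint is assumed from the public registry API, and the network key below is a hypothetical placeholder), listing the datasets associated with a network:

```python
import requests

# Hypothetical placeholder; real keys are visible in the GBIF Registry
NETWORK_KEY = "00000000-0000-0000-0000-000000000000"

resp = requests.get(
    f"https://api.gbif.org/v1/network/{NETWORK_KEY}/constituents",
    params={"limit": 20},
)
resp.raise_for_status()
for ds in resp.json()["results"]:
    print(ds["key"], ds["title"])
```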

Q: I often have to help dataset creators publish a dataset where they have multiple institutional affiliations. Could more affiliations be added for the same publisher role?

A: Yes, there is an open GitHub issue https://github.com/gbif/ipt/issues/1557, and our developer is looking into adding the option in an IPT update.

Q: Are there any examples of national networks where, for example, all herbaria of a country form a GBIF network?

A: There is a national network of German herbaria, for example. All networks can be found in the GBIF Registry.

Q: In the new GBIF.org, will the same search filters be available?

A: Yes, the filters will still be available, but the standard selection will show fewer fields. Users can add more filters if they want to, and all filters that are available now will also be available in the new portal.

Q: Will datasets published via the Metabarcoding Data Toolkit (MDT) be flagged as eDNA datasets?

A: GBIF adds a machine tag to datasets published via an MDT installation, stating: “This dataset was processed using the GBIF Metabarcoding Data Toolkit.” This makes it possible to search for these datasets, as in this example: Search.
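As a sketch of finding such datasets programmatically (Python with `requests`; the registry’s dataset listing supports machine-tag filter parameters, but the namespace below is an assumption, not a confirmed value):

```python
import requests

# NOTE: the namespace is a hypothetical example; check the machine tags of
# an MDT-published dataset in the registry for the real values.
resp = requests.get(
    "https://api.gbif.org/v1/dataset",
    params={
        "machineTagNamespace": "mdt.gbif.org",  # assumed namespace
        "limit": 10,
    },
)
resp.raise_for_status()
for ds in resp.json()["results"]:
    print(ds["key"], ds["title"])
```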