April technical support hour for GBIF nodes

In the April issue of the technical support hour for nodes, we will not have a practical topic due to the Easter holidays in Denmark. Node staff are invited to sign up for the session on April 5th at 4pm CEST (UTC+2) and join if they want to ask the helpdesk team technical questions. Questions can also be posed ahead of the support hour by commenting on this thread or by contacting the helpdesk at helpdesk@gbif.org.

The drop-in support hour takes place on the first Wednesday of every month at 4pm CEST (UTC+2), from January 4th until summer 2023, as part of a trial period.

The support hours have so far covered the following topics:

  • During the January support hour, we introduced the Registry and how a dataset can be debugged by checking the different interpretation steps in the crawl and ingestion history tabs. You can find the questions and answers on the discourse thread for the topic, or watch the recording of the practical session here.
  • In the February support hour, the helpdesk team gave a brief overview of the technical components of GBIF, and how they all contribute to the content on GBIF.org. The questions and answers from the support hour can be found in this discourse thread and you can view the recording of the practical session here.
  • In the March support hour, the Data Products team demonstrated how to install, update and back up the IPT, as well as how to update extensions. Questions and answers from the support hour were captured in this discourse forum post, and you can review the video from the practical session here.

Last minute question! I can’t seem to figure out how to update/convert a dataset in the IPT from “resource metadata” to “sampling event data” and/or “occurrence data”. Is this possible?

In addition to using Google Drive for the Source Data URL in the IPT, I have also tried using the URL of a raw text file (e.g. https://raw.githubusercontent.com/biodiversity-aq/antarctic_subantarctic_asteroidea_isotopes/main/data/processed/measurementOrFact.txt) on GitHub, because I had to use a script to transform the data. So far, I think it works really well! Hope that helps!

Questions and Answers from the session.

Question 1: Is it possible to update/convert a dataset in the IPT from “resource metadata” to “sampling event data” and/or “occurrence data”?

If the data resource added to the metadata-only dataset is not yet registered on GBIF, then you can change the dataset type in the basic metadata. However, if the resource is registered, then it is not possible to change the dataset type. Instead, you can publish a new dataset with the dataset type you wish to change to, register the old dataset as a duplicate of the new dataset (in the Registry), and add a description in the metadata explaining the change in both the old and new dataset. Then the old dataset can be deleted – remember to keep track of the UUID of the deleted dataset if it contains citations. The citations will not be transferred to the new dataset by marking it as a duplicate.

However, it is possible to change dataset type for the three core types (Occurrence, Sampling-event, Taxon): FAQ :: GBIF IPT User Manual

It would be beneficial for the publisher to be able to change a metadata-only dataset type even though the added data resource is already registered on GBIF. We have created a GitHub issue to suggest the option for IPT development: Make it possible to change a registered metadata resource to core resources · Issue #2034 · gbif/ipt · GitHub.

Question 2: How does GBIF interpret the data provided by publishers?

Note: This question specifically concerns the following GitHub issue: Event core w. occurrence extension - inheritance of occurrence search parameters? · Issue #878 · gbif/pipelines. Please read the comments in the issue for further information.

Question 3: Which fields will be populated/replaced by GBIF-interpreted values when users download the data? And does GBIF, for example, replace the lifeStage field with a controlled value from the GBIF vocabularies for its filter in occurrence search?

In general, all filters in the occurrence search are based on the interpreted value unless stated otherwise. For example, verbatim scientificName allows users to search for the scientificName provided by the publisher, but this is the only searchable verbatim value we currently have, and it does not currently allow fuzzy searching (in contrast to searches on the interpreted scientificName). Only exact matches to the verbatim value provided will show up when searching for the name here.
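As an illustration, the interpreted and verbatim name filters can be contrasted by building the corresponding occurrence-search API calls. This is a sketch assuming the public GBIF API at api.gbif.org; only the URLs are constructed here, and no request is made:

```python
from urllib.parse import urlencode

# Base URL of the public GBIF occurrence search API.
GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/search"

def search_url(**filters):
    """Build a GBIF occurrence-search URL from filter parameters."""
    return f"{GBIF_OCCURRENCE_API}?{urlencode(filters)}"

# Interpreted search: the name is matched against the GBIF backbone,
# so small spelling variants of the same taxon are still found.
interpreted = search_url(scientificName="Puma concolor")

# Verbatim search: only records whose publisher-supplied name is an
# exact string match are returned.
verbatim = search_url(verbatimScientificName="Puma concolor (Linnaeus, 1771)")
```

The same parameter names work directly in the occurrence search on GBIF.org.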

GBIF interprets values for fields where we have a controlled vocabulary. At present, few of these controlled vocabularies are in the vocabulary server, while most are still in the code base of the pipeline interpretation. You can find some of the vocabularies here, but they are not all used, and we currently have not documented which are in use and which are not.

The lifeStage vocabulary is in use and it is the controlled concepts of the lifeStage vocabulary you can filter for in occurrence search.

However, when users download the data, they will get the verbatim values in the verbatim file – if users download the full DwC-A and not only the simple format. So verbatim values of datasets will always be accessible, but not necessarily searchable.
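A minimal sketch of getting at both sets of values, assuming the usual layout of a full GBIF DwC-A download (interpreted values in occurrence.txt, publisher-supplied values in verbatim.txt, both tab-delimited):

```python
import csv
import io
import zipfile

def read_dwca_table(archive, table="verbatim.txt"):
    """Read one tab-delimited table from a DwC-A download as a list of dicts."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(table) as fh:
            reader = csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8"),
                                    delimiter="\t")
            return list(reader)
```

Calling `read_dwca_table(path, "occurrence.txt")` would return the interpreted values instead.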

Question 4: Which extensions and their fields are interpreted and indexed by GBIF?

Only the media type (image, sound, etc.) is interpreted and indexed by GBIF. The other extensions are stored as non-searchable verbatim values.
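On occurrence records returned by the API, the interpreted media appear in a `media` array with a `type` such as StillImage or Sound. A small sketch pulling out image links (the sample record below is invented for illustration):

```python
def image_urls(occurrence):
    """Collect links to still images from an occurrence record's
    interpreted `media` array; other media types are skipped."""
    return [
        m["identifier"]
        for m in occurrence.get("media", [])
        if m.get("type") == "StillImage" and "identifier" in m
    ]

# Invented record, shaped like a GBIF occurrence API result:
record = {
    "key": 123,
    "media": [
        {"type": "StillImage", "identifier": "https://example.org/img1.jpg"},
        {"type": "Sound", "identifier": "https://example.org/call.mp3"},
    ],
}
print(image_urls(record))  # → ['https://example.org/img1.jpg']
```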

Question 5: What are GBIF’s data quality requirements for extensions, if any exist?

GBIF does not have any requirements other than those for a few of the fields of the cores/dataset classes and for the mappings in the DNA-derived data extension: Publishing DNA-derived data through biodiversity data platforms (which are similar to the core mapping requirements). Beyond requirements, we have recommendations, such as guidelines, blog posts and general recommendations on DwC fields from TDWG.

Question 6: When GBIF replaces the scientificName provided by the publisher and pushes the provided name to the verbatimScientificName field, does the higher taxonomy also get replaced with what is in the backbone and how is this visible in the portal and in downloads?

Yes, GBIF replaces the higher taxonomy with the taxonomy of the scientificName in the backbone. In those cases, the remarks field in the occurrence record would show ‘altered’ or ‘inferred’. There might also be flags such as ‘taxon match fuzzy’ (for example, if the verbatim scientificName is slightly misspelled) in the ‘remarks’ column. Flags and issues are included in the download, but other remarks such as ‘inferred’ and ‘altered’ are not.
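The effect of this matching can be inspected through the species match API (api.gbif.org/v1/species/match), whose JSON response carries a matchType field. A sketch that interprets that field (the sample response below is invented):

```python
def describe_match(match):
    """Summarise a GBIF species-match response: was the name matched
    exactly, fuzzily, only to a higher rank, or not at all?"""
    match_type = match.get("matchType", "NONE")
    descriptions = {
        "EXACT": "exact backbone match",
        "FUZZY": "fuzzy match - the verbatim name is likely misspelled",
        "HIGHERRANK": "matched only to a higher rank",
    }
    return descriptions.get(match_type, "no backbone match")

# Invented response for a slightly misspelled name:
sample = {"usageKey": 2435099, "scientificName": "Puma concolor", "matchType": "FUZZY"}
print(describe_match(sample))  # → fuzzy match - the verbatim name is likely misspelled
```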

Question 7: How does a publisher republish a dataset that was originally published as one big dataset but should be split into multiple smaller datasets? And is it possible to keep the citations and DOI?

The dataset can be split into smaller datasets, with the original dataset (which will eventually be deleted) then registered as ‘duplicate of’ the new version in the Registry. It is good practice to explain the change of splitting the larger dataset in the metadata description of both the large and the smaller datasets, before the actual change is made. Unfortunately, the DOI and citations of the original larger dataset will not be transferred to the smaller datasets. Publishers should themselves save the old UUID and keep track of the citations before the change.

Question 8: A publisher wants to publish an already published dataset from another IPT. How should they carry out the switch?

There are step-by-step instructions on how to migrate existing registered DiGIR, BioCASe, TAPIR, or DwC-A resources (datasets) to an IPT in the IPT manual: Manage Resources Menu :: GBIF IPT User Manual.

Question 9: Will a dataset be automatically deleted from GBIF.org when it is deleted in the IPT?

IPT managers can choose from two different options when deleting a resource (dataset) from the IPT:

  • Delete from the IPT and GBIF.org
  • Delete from the IPT only (Orphan)

For more information, please refer to the IPT manual: https://ipt.gbif.org/manual/en/ipt/latest/manage-resources#delete-a-resource.

Question 10: Does GBIF provide services for hosting images and do images shared on GBIF have to be publicly available?

GBIF does not host any data published to GBIF.org, and the images shared to GBIF will have to be publicly available. This blogpost gives a good introduction to sharing media files on GBIF.org: Sharing images, sounds and videos on GBIF - GBIF Data Blog.

Question 11: What is the best way to submit issues to helpdesk at the GBIF Secretariat? Through the portal feedback system, directly through GitHub or by email?

All three options are equally fine. In some cases, we create GitHub issues based on email correspondence, since this allows us to track, migrate and share an issue more easily, both internally and externally. GitHub issues are often created for backbone-related problems, as these would require fixes in checklists published to GBIF or the Catalogue of Life.

Question 12: In the new version of the IPT, there is an option to choose URL as a resource upload. What is an example of a URL resource?

You can, for example, use Google Drive or a raw text file for the Source Data URL in the IPT (see Ming’s example in the comments below). For a Google Sheet, you can go to ‘File’ → ‘Share’ → ‘Publish to web’ → ‘Publish as .tsv file’, and you then get a link for the .tsv file that you can use as the URL. You can choose the option to ‘automatically republish when changes are made’, but it only publishes a snapshot of the file, so any new version would have to be remapped in the IPT as well.

In the IPT settings, you can set the frequency of publishing for individual resources, for example daily or weekly.
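Before pointing the IPT at such a link, it can be worth checking that it really serves delimited text. A hypothetical helper (the required-columns check is this sketch's own convention, not an IPT requirement):

```python
import csv
import io

def check_tsv(text, required_columns=()):
    """Parse text as tab-delimited and verify that the header contains
    the expected columns; returns the parsed rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# Example using a snippet like a Google Sheet published as a .tsv file:
snippet = "occurrenceID\tscientificName\n1\tPuma concolor\n"
rows = check_tsv(snippet, required_columns=["occurrenceID"])
```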

Question 13: Can publishers bulk edit EML data on the IPT/cloud IPTs?

The question and comments are captured in this GitHub issue: Feature or API to update EML data for multiple datasets in cloud IPTs · Issue #1977 · gbif/ipt · GitHub. Currently it is not possible to bulk edit metadata on the cloud IPTs due to limited manager access. It should be possible to bulk edit EML data on publisher hosted IPTs or if resources are published through an API. Please see this GitHub issue: Publishing datasets using Python script fails or is very unstable after IPT was updated to v2.7.2 · Issue #1973 · gbif/ipt · GitHub.

Question 14: Is it possible to track when a record is published? For example, there is a field in the dataset for when it was recorded, but not for when it was published.

GBIF does not track individual records, but it is possible to track changes for datasets through the ingestion history in the Registry (example). Datasets are generally crawled for updates every week by GBIF.

Question 15: How can the language of the IPT be changed?

Please follow the FAQ described in the IPT manual: FAQ :: GBIF IPT User Manual. The languages currently supported are Portuguese (pt), Japanese (ja), French (fr), Spanish (es), Traditional Chinese (zh), and Russian (ru).