How to detect and solve formatting issues in your DwC-A (GBIF technical support hour for nodes)

On May 1st, 2024, at 4pm CEST (UTC+2), we will convene for the next GBIF technical support hour for nodes. The topic is how to detect and solve formatting issues in your DwC-A.

Publishers and nodes occasionally encounter errors that are not transparent when they try to validate their DwC-As with the data validator tool on GBIF.org. The errors often relate to issues with delimiters or special characters, e.g. doubled quotes (“”) around the values in one or a few cells, which affects the interpretation of delimiters. If you have any examples of weird formatting issues, please let us know and we will make sure to include them in the presentation!
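If you would like to check a file yourself before running the validator, here is a minimal Python sketch that flags the two problems mentioned above: rows whose column count differs from the header, and values wrapped in doubled quotes. The file name and the tab delimiter are assumptions; adjust them to match your own archive.

```python
import csv

path = "occurrence.txt"  # hypothetical file name; adjust to your archive
delimiter = "\t"         # adjust to your archive's delimiter

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=delimiter)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        # A row with a different number of columns than the header often
        # points at a delimiter or quoting problem on that line.
        if len(row) != len(header):
            print(f"line {line_no}: {len(row)} columns, expected {len(header)}")
        # Doubled quotes around a value can throw off delimiter interpretation.
        for col, value in zip(header, row):
            if value.startswith('""') or value.endswith('""'):
                print(f"line {line_no}, {col}: suspicious quoting {value!r}")
```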

We will demonstrate the data validator tool and show how to troubleshoot problems using a few tools to check your files.

We will be happy to answer any questions, whether related to the topic or not. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.


The video is available here: Detecting and solving formatting issues in your DwC-A

Here is a transcript of the questions asked during the session.

Can you recommend a local server?

We cannot recommend any specific local server, but other nodes have used HeidiSQL and found it useful.

Can you recommend a source for SQL statements/recipes when using e.g. Clickhouse to examine issues in your DwC-A?

Some nodes have found the W3Schools SQL tutorial useful. We usually use a Google search or Stack Overflow when we are uncertain how best to write a SQL query. Posing a question on Discourse is also a good option to survey what others in the community use. Please also be aware that GBIF has an agreement with DataCamp through which GBIF users from Latin America and the Caribbean, Africa, Asia, Northern Eurasia and the Pacific can get free access to online training courses.
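As an illustration of the kind of recipe the question asks about, here is a minimal sketch using Python's built-in sqlite3 module as a lightweight stand-in for ClickHouse. The file name and the occurrenceID/decimalLatitude columns are assumptions; adapt them to your own archive.

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")

# Load a tab-delimited data file into an in-memory table, skipping rows whose
# column count does not match the header.
with open("occurrence.txt", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    con.execute(f"CREATE TABLE occ ({cols})")
    placeholders = ", ".join("?" for _ in header)
    con.executemany(f"INSERT INTO occ VALUES ({placeholders})",
                    (row for row in reader if len(row) == len(header)))

# Recipe 1: duplicated occurrenceIDs.
for occ_id, n in con.execute(
        'SELECT "occurrenceID", COUNT(*) FROM occ '
        'GROUP BY "occurrenceID" HAVING COUNT(*) > 1'):
    print("duplicate occurrenceID:", occ_id, n)

# Recipe 2: decimalLatitude values containing non-numeric characters.
for (lat,) in con.execute(
        """SELECT DISTINCT "decimalLatitude" FROM occ
           WHERE "decimalLatitude" GLOB '*[^0-9.+-]*'"""):
    print("suspicious decimalLatitude:", lat)
```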

For the July node support hour, we may investigate different resources for recipes, since that session's topic is SQL downloads.

Sometimes when you are ready to publish your data via the IPT, the archive passes through the data validator tool without any issues, but publication fails due to outdated extensions?

To save you the hassle of having to remap the data, it is always a good idea to check whether the cores and extensions need an update in the IPT administrator menu (IPT admins only) before the next dataset is published. Users with registration rights on your IPT may depend on the cores and extensions being up to date in order to publish, but they cannot update the required components themselves.

Watch the node support hour on how to install, back up and update your IPT here.

The plan is to include notifications for IPT admins when the extensions or cores are out of date.

Can you access the ingestion logs for datasets published in the test environment/UAT?

Yes, the logs can be accessed at https://registry.gbif-uat.org/.

When you get a log result that states there is an error in a specific line, which file in the archive does it relate to – is it interpretable from the logs?

Tentative reply (awaiting Nik's response): It is not always possible to see in which specific file the error occurs. In this specific case it was not possible, although the occurrence.txt file was the only file that contained that many columns and rows.
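One workaround is to compare the column count mentioned in the log message against the column counts of the data files in the unzipped archive, as in this minimal Python sketch. The directory name and the reported column count are hypothetical.

```python
import csv
from pathlib import Path

archive_dir = Path("my-dwca")  # hypothetical path to the unzipped DwC-A
reported_columns = 52          # column count taken from the log message

# Print the header width of each data file; the one matching the log is the
# likely source of the error.
for data_file in sorted(archive_dir.glob("*.txt")):
    with open(data_file, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter="\t"))
    marker = "  <-- matches the log" if len(header) == reported_columns else ""
    print(f"{data_file.name}: {len(header)} columns{marker}")
```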

Sometimes the data validator seems to process the archive differently depending on the file extension used (e.g. .csv, .txt)? A node has experienced that the data validator sometimes cannot process certain file types, although this is not consistent.

This issue has not been reported to helpdesk before, but the node mentions that sometimes the simple solution is just to change the file type and try again. It sounds like the issue is related to a column separator being used within the values of the dataset, so switching to another delimiter fixes the issue. If changing delimiters does not work, OpenRefine is also an excellent tool for figuring out issues with your data, and GBIF provides training on how to use it.
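To check a file yourself before trying again, a minimal Python sketch like the following (file name assumed) can guess the delimiter with csv.Sniffer and flag values that contain that same delimiter:

```python
import csv

path = "occurrence.csv"  # hypothetical file name

with open(path, newline="", encoding="utf-8") as f:
    # Guess the dialect (delimiter, quoting) from a sample of the file.
    sample = f.read(64 * 1024)
    dialect = csv.Sniffer().sniff(sample)
    print("detected delimiter:", repr(dialect.delimiter))

    # Flag values that contain the delimiter itself; these are the cells that
    # can confuse a parser if the quoting is not handled consistently.
    f.seek(0)
    for line_no, row in enumerate(csv.reader(f, dialect), start=1):
        for value in row:
            if dialect.delimiter in value:
                print(f"line {line_no}: delimiter inside a value: {value!r}")
```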

An older version of the data validator used to show the names of the species that were wrongly spelled, but now the tool only shows IDs; this change affects the training data a node uses. Are there any plans for updating the data validator tool, and if so, could this feature be revived?

There are currently no plans for updating the data validator tool but we have started a list of ideas for potential development: Data validator suggested changes compilation · Issue #867 · gbif/pipelines · GitHub.

The species names can still be found if the pills for the different flags are expanded, like in this validation report (under ‘validation issues’ for each file in the archive).

It seems like the validator tool does not show all the issues in your dataset and that the errors are somehow nested?

When you fix errors in your dataset based on the feedback from the data validator, you should run the dataset through the validator again. The system stops processing at the first error it encounters, so it will not report any subsequent issues that may be present in the archive.

We have an open GitHub issue about the possibility of getting a full report of the errors – please comment on it if you also think this could be useful.
