Getting ALA data to the usable stage

datafixer · October 13, 2024, 6:26am

Working with biodiversity data often involves a fair amount of filtering and cleaning (these are different things). We don’t often have to reformat a Darwin Core dataset, though, thanks to GBIF’s Integrated Publishing Toolkit. Occurrence datasets that have been through the IPT typically come out as simple tab-separated tables in UTF-8 encoding, with a single header line containing Darwin Core field names.

This isn’t the case for datasets shared with GBIF by the Atlas of Living Australia. A Darwin Core archive from the ALA with an occurrence core has an occurrence.txt file with 182 fields but no header line, and with every data item surrounded by quotes. The accompanying meta.xml file has 182 field names, but many of the field name references are defective:

8 with "http://rs.ala.org.au/terms/1.0/…" don’t resolve
4 with "ABCD - Base Ontology Terms…" should instead go to "ABCD - Base Ontology Terms…"
3 have DwC terms (eventAttributes, locationAttributes, occurrenceAttributes) which TDWG says are deprecated and should not be used
2 with "http://hiscom.chah.org.au/hispid/terms/…" don’t resolve
3 with "GGBN Data Standard…" don’t resolve
1 has no URL or explainer: "taxonRankID"
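
One way to get an overview of where the field names point is to tally the term references in meta.xml by namespace. The following is only a sketch: "termspaces" is a made-up name, and it assumes each field element carries its reference in a term="…" attribute, as is usual in Darwin Core archive descriptors. It is not how the counts above were obtained.

```shell
# Hypothetical helper (not from the original post): count the term
# references in a meta.xml file, grouped by namespace.
# Assumes every field stores its reference in a term="..." attribute.
termspaces() {
    grep -o 'term="[^"]*"' "${1:-meta.xml}" |
        # keep the attribute value, then drop everything after the last slash
        sed 's/^term="//; s/"$//; s|[^/]*$||' |
        # tally identical namespaces; a term with no slash at all
        # (a bare name, no URL) is reported as "(no namespace)"
        awk '{ n[$0]++ }
             END { for (k in n) print n[k] "\t" (k == "" ? "(no namespace)" : k) }' |
        sort -rn
}
```

Running "termspaces" in the directory containing meta.xml prints one line per namespace, most frequent first. Whether a namespace actually resolves still has to be checked by hand or with a separate request.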

I can’t do anything about the meta.xml, but to make ALA occurrence data usable as a single table I can use the BASH shell function shown below. I extract occurrence.txt and meta.xml from the ALA archive, navigate to the directory containing those files and enter “alaprep”. The function generates a new TSV “ocala.txt” with the 182 field names in a header line and no quotes around data items.

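# Build a header line from the field names in meta.xml, put it in front of
# occurrence.txt, then strip the quotes around every data item.
alaprep() {
    cat <(sed 's|taxonRankID|/taxonRankID|' meta.xml |
            awk -v FS="/|\"/>" '/field index="0"/ {f=1} f {printf("%s\t",$(NF-1))} /field index="181"/ {exit}' |
            sed 's/\t$/\n/') occurrence.txt |
        sed 's/^"//;s/"\t"/\t/g;s/"$//' > ocala.txt
}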
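
After generating the new table it can be worth confirming that every line, header included, has the same number of fields. Here is a minimal sketch of such a check; "alacheck" is a hypothetical name and not part of the function above, and it assumes the file is tab-separated with no tabs or line breaks inside data items.

```shell
# Hypothetical helper (not from the original post): tally how many
# tab-separated fields each line of a file has.
# Defaults to ocala.txt if no file name is given.
alacheck() {
    awk -F'\t' '{ n[NF]++ }
                END { for (k in n) printf "%d fields: %d lines\n", k, n[k] }' "${1:-ocala.txt}"
}
```

If the table is well formed, "alacheck" should print a single line reporting 182 fields. More than one line of output suggests that some records were split or merged, for example by a tab or line break embedded in a data item.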

Please email me directly if you have questions about the function.

–

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com
