Reading GBIF downloads (SIMPLE_CSV) with R

Recently I have been thinking that we don’t advise users very well on how to get their data into R. GBIF downloads are sometimes difficult to read into R because embedded quote characters cause parsing problems. For example, read.table() typically does a poor job of reading in GBIF SIMPLE_CSV.

I ran some experiments with various csv readers in R:

  • read.table()
  • readr::read_tsv()
  • vroom::vroom()
  • data.table::fread()
  • rgbif::occ_download_import() (uses data.table::fread() internally)

Ideally we would want a csv reader that does the job with minimal changes to the default settings. I will be using this download from a previous post.

There are 2,639,079 rows in this download and 50 columns.
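If you want to reproduce this, something like the following should fetch and unzip the download (a sketch; it assumes rgbif is installed and that this download key is still retrievable):

library(rgbif)

# fetch the zipped SIMPLE_CSV download and extract it into a folder named after the key
d <- occ_download_get("0080847-200221144449610", path = ".", overwrite = TRUE)
unzip(as.character(d), exdir = "0080847-200221144449610")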

read.table()

With the best settings I could find, read.table() read only 337,716 of the 2,639,079 rows. Changing the encoding to “UTF-8” also did not improve things.

path_to_download = ""
gbif_download_key = "0080847-200221144449610"
file_path = paste0(path_to_download,gbif_download_key,"/",gbif_download_key,".csv")

read.table(file_path,sep="\t",fill=TRUE,header=TRUE,quote="") %>%
dim()
# [1] 337716     50

read.table(file_path,sep="\t",fill=TRUE,header=TRUE, quote="",encoding="UTF-8") %>%
dim()
# [1] 337716     50

readr::read_tsv()

readr reads the entire file only when using quote="".

library(readr)

read_tsv(file_path) %>%
  dim()
# [1] 1681879      50

read_tsv(file_path, quote = "") %>%
  dim()
# [1] 2639079      50

vroom::vroom()

vroom will also work with quote="".

vroom::vroom(file_path) %>%
  dim()
# [1] 1785588      50

vroom::vroom(file_path, quote = "") %>%
  dim()
# [1] 2639079      50

data.table::fread() and rgbif::occ_download_import()

data.table::fread() works with default settings.

data.table::fread(file_path) %>%
  dim()
# [1] 2639079      50

rgbif::occ_download_import(key = gbif_download_key, path = path_to_download) %>%
  dim()
# [1] 2639079      50

I am not very well versed in R, but this is a very good layout for reading in the .csv data.

I faced a similar challenge with GBIF .csv data imports in Python, and I can attest that changing the encoding to “UTF-8” smoothed things out for the read method.
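For anyone wanting the R equivalent, readr exposes the encoding through its locale argument; a quick sketch, assuming the same file_path as in the post above:

library(readr)

# declare the file encoding explicitly instead of letting readr guess
read_tsv(file_path, quote = "", locale = locale(encoding = "UTF-8"))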


Just a quick note to say that the reason read.table fails at that row is probably an obscure Unicode character in that record. Look at the locality value here: https://api.gbif.org/v1/occurrence/1675531571/fragment


Do you happen to know a setting in read.table that would fix it?

@jwaller, is there any reason to prefer read.table over the other options?

I don’t think it’s possible using those functions. Even readLines trips over that character. You’d have to remove these characters beforehand.

It’s probably best to recommend against using base R functions for importing GBIF data. The other packages are much faster as well.

The only reason I can think of is if you are stuck in an environment that does not let you install new packages like data.table, readr, vroom …

I agree with MatDillen that in general you are better off reading occurrence files with more specialized csv parsers in R. However, you could call system2 and then use something like awk to strip out the Unicode characters you might have trouble with. Have a look at this thread for a code example. To find these kinds of characters, you could use something like grep, I suppose.
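Something along these lines might work (an untested sketch: it assumes a Unix-like system with awk on the PATH, GNU grep with -P support, and "cleaned.csv" is just a placeholder output name):

# flag lines containing control characters (excluding tab and newline)
system2("grep", args = c("-nP", shQuote("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]"), shQuote(file_path)))

# rewrite the file with those characters stripped, keeping tabs intact
system2("awk",
        args = c(shQuote("{ gsub(/[^[:print:]\t]/, \"\"); print }"), shQuote(file_path)),
        stdout = "cleaned.csv")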

Regarding encoding, I believe most data on GBIF is actually UTF-8 encoded. However, keep in mind that there are files with mixed encoding on GBIF, which means part of a file might be latin1 encoded, or even something completely different, even if the encoding is declared as UTF-8 (or something else). The encoding you can expect from a DarwinCore archive is declared in the meta.xml file of the archive, for example <core encoding="UTF-8">, along with the delimiter, quote character, etc. But you can’t always rely on this information either, especially if you are pooling data from different large datasets.
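For a DWCA download you could check those declarations programmatically; a minimal sketch, assuming the xml2 package and that meta.xml sits in the unzipped archive directory (SIMPLE_CSV downloads do not include one):

library(xml2)

# read the archive descriptor and pull the declared properties of the core file
meta <- read_xml("meta.xml")
core <- xml_find_first(meta, ".//d1:core", xml_ns(meta))  # d1 = default namespace prefix assigned by xml2
xml_attr(core, "encoding")            # e.g. "UTF-8"
xml_attr(core, "fieldsTerminatedBy")  # e.g. "\\t"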

