Recently I have been thinking that we don’t advise users very well on how to get their data into R. GBIF downloads are sometimes difficult to read into R because quote characters in the data cause problems. For example, read.table() does not typically do a good job of reading in GBIF SIMPLE_CSV.
I ran some experiments with various CSV readers in R:
read.table()
readr::read_tsv()
vroom::vroom()
data.table::fread()
rgbif::occ_download_import() (uses data.table::fread() internally)
Ideally we would want a csv reader that does the job with minimal changes to the default settings. I will be using this download from a previous post.
There are 2,639,079 rows and 50 columns in this download.
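One way to run this kind of comparison is to wrap each reader in a small function and tabulate the dimensions it returns. The sketch below is only an illustration: `compare_readers()` is a hypothetical helper, not part of any package, and the tiny file it reads is made up so the example runs without the real download. Only base R readers are used here; readr, vroom, and data.table readers could be added to the list in the same way.

```r
# Hypothetical helper: apply each reader to the same file and collect dim().
compare_readers <- function(path, readers) {
  results <- lapply(readers, function(read_fun) {
    # A reader that fails outright is recorded as NA instead of stopping the run.
    tryCatch(dim(read_fun(path)), error = function(e) c(NA_integer_, NA_integer_))
  })
  data.frame(
    reader = names(readers),
    rows = vapply(results, `[`, numeric(1), 1),
    cols = vapply(results, `[`, numeric(1), 2),
    row.names = NULL
  )
}

# Made-up tab-separated file standing in for a GBIF download.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id\tname", "1\ta", "2\tb", "3\tc"), tmp)

readers <- list(
  read_table_noquote = function(p) read.table(p, sep = "\t", header = TRUE, quote = ""),
  read_delim_noquote = function(p) read.delim(p, quote = "")
  # e.g. fread = function(p) data.table::fread(p)
)

res <- compare_readers(tmp, readers)
print(res)
```

Keeping the reader settings inside the list makes it easy to see exactly which arguments were changed from the defaults.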
read.table()
The best settings I could find for read.table() read in only 337,716 of the 2,639,079 rows. Changing the encoding to “UTF-8” did not improve things either.
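library(magrittr)  # provides the %>% pipe used below

path_to_download = ""
gbif_download_key = "0080847-200221144449610"
# Path of the form <path_to_download><key>/<key>.csv
file_path = paste0(path_to_download, gbif_download_key, "/", gbif_download_key, ".csv")

# Tab-separated, quoting disabled, short rows padded with fill=TRUE
read.table(file_path, sep="\t", fill=TRUE, header=TRUE, quote="") %>%
  dim()
# [1] 337716 50

# Same call with an explicit encoding
read.table(file_path, sep="\t", fill=TRUE, header=TRUE, quote="", encoding="UTF-8") %>%
  dim()
# [1] 337716 50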
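To see what quote handling does to a tab-separated file, here is a small self-contained sketch in base R. The data is made up for illustration and is not taken from the GBIF download; it simply places unbalanced double quotes inside two fields. With default quoting the row count may come out wrong or the call may fail, so the example does not assume a particular result for that case.

```r
# Made-up tab-separated file with unbalanced quote characters inside fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "gbifID\tlocality\tcountryCode",
  "1\tnormal place\tDK",
  "2\t5\" from the road\tSE",
  "3\tanother place\tNO",
  "4\tlast \"place\tFI"
), tmp)

# Default quoting: a quote character can swallow the text that follows it,
# so rows may be merged or the call may error. Errors are recorded as NA.
rows_default <- tryCatch(
  nrow(read.table(tmp, sep = "\t", header = TRUE, fill = TRUE)),
  error = function(e) NA_integer_
)

# quote = "" treats quote characters as ordinary text.
literal <- read.table(tmp, sep = "\t", header = TRUE, quote = "")
rows_literal <- nrow(literal)

print(c(default = rows_default, no_quoting = rows_literal))
```

With quoting disabled, all four data rows and three columns are read, and the quote characters stay in the locality values as plain text.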
readr::read_tsv()
readr reads the entire file only when using quote="".
library(readr)
read_tsv(file_path) %>%
dim()
# [1] 1681879 50
read_tsv(file_path,quote="") %>%
dim()
# [1] 2639079 50
vroom::vroom()
vroom will also work with quote="".
vroom::vroom(file_path) %>%
dim()
# [1] 1785588 50
vroom::vroom(file_path,quote="") %>%
dim()
# [1] 2639079 50
data.table::fread() and rgbif::occ_download_import()
data.table::fread() works with default settings.
data.table::fread(file_path) %>%
dim()
# [1] 2639079 50
rgbif::occ_download_import(key=gbif_download_key,path=path_to_download) %>%
dim()
# [1] 2639079 50
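When the expected number of rows is not known in advance, a rough check is to count the newline characters in the file and compare that with the number of rows a reader returns. The sketch below is one way to do this in base R without loading the whole file as text. `count_newlines()` is a hypothetical helper, and the check assumes that every record occupies exactly one line and that the file ends with a newline.

```r
# Hypothetical helper: count newline bytes by reading the file in raw chunks.
count_newlines <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break
    n <- n + sum(chunk == as.raw(10))  # 10 is the newline byte (0x0A)
  }
  n
}

# Made-up file: one header line plus three records.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id\tname", "1\ta", "2\tb", "3\tc"), tmp)

expected_rows <- count_newlines(tmp) - 1  # subtract the header line
parsed_rows <- nrow(read.table(tmp, sep = "\t", header = TRUE, quote = ""))
print(c(expected = expected_rows, parsed = parsed_rows))
```

If the two numbers differ, the reader has probably merged or dropped rows, and quote handling is a good first setting to check.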