Skipping over species occ that DOE?

I am running into an issue where my function for downloading occurrences fails when there is a species with no data, or there is a spelling error in my AOI. Maybe someone knows a way to avoid this.

I am new to R but have a fair amount of experience. What I wanted to do was pass in two columns with the common name and scientific name for the species I am trying to pull occ data for, but there are too many issues with spelling errors, name errors, etc. I am also open to other workflows. I tried to download by taxa within my polygon, but I'm not sure how to handle that large dataset now that I have it.

This is what I have. I apologize if this is not standard practice for sharing code, I am learning! :slight_smile:

```r
library(rgbif)
library(sf)
library(dplyr)
library(purrr)
library(tibble)

# AOI is an sf polygon of the area of interest (Texas), loaded elsewhere

# make a vector of species names to use in the 'occ_data' function
# NOTE: the function fails when a species is not found in the bounding box
# or when a species name is incorrect (spelling/capitalization)
species <- c("Grus americana", "Egretta rufescens", "Plegadis chihi","Mycteria americana", "Pelecanus erythrorhynchos", "Pelecanus occidentalis", "Peucaea botterii", "Platalea ajaja", "Vireo atricapilla Woodhouse", "Setophaga chrysoparia", "Sternula antillarum", "Tympanuchus cupido subsp. attwateri", "Falco femoralis", "Dryobates borealis", "Empidonax traillii", "Falco peregrinus", "Charadrius melodus", "Buteo nitidus", "Glaucidium brasilianum","Haliaeetus leucocephalus","Camptostoma imberbe","Pachyramphus aglaiae","Elanoides forficatus","Coccyzus americanus","Elanus leucurus","Peucaea aestivalis","Buteogallus anthracinus","Strix occidentalis","Buteo albicaudatus")

# also make a vector of common names
common_name <- c("Whooping Crane", "Reddish Egret", "White-faced Ibis","Wood Stork", "American White Pelican", "Brown Pelican", "Botteri's Sparrow", "Roseate Spoonbill", "Black-capped Vireo", "Golden-cheeked Warbler", "Least Tern", "Attwater's Greater Prairie Chicken", "Aplomado Falcon", "Red-cockaded Woodpecker", "Traill'S Flycatcher", "Peregrine Falcon", "Piping Plover", "Gray Hawk", "Ferruginous Pygmy Owl","American Bald Eagle", "Northern Beardless Tyrannulet","Rose-throated Becard","American Swallow-Tailed Kite","Yellow-Billed Cuckoo","White-Tailed Kite","Bachman's Sparrow","Common Black Hawk","Spotted Owl","White-tailed Hawk")

emptylist <- vector("list", length = length(species))
commonemplist <- vector("list", length = length(common_name))


# function for pulling data
crawl <- function(year){
  for (i in 1:length(species)) { # this function can pull data for multiple species
    occ <- occ_data( # parameters outlined by the package to pull species-specific data
      scientificName = species[[i]],
      hasCoordinate = TRUE, # spatial coordinates are an important feature for the observations
      geometry = st_bbox(AOI), # identifying the AOI to get observations within Texas
      year = year
    ) %>%
    .$data # keep only the occurrence table from the result

  # add species name column as ID to use later
  occ$ID <- common_name[[i]]

  # clean by removing duplicate occurrences
  emptylist[[i]] <-
    occ %>% distinct(decimalLatitude, decimalLongitude, .keep_all = TRUE) %>%
    dplyr::select(Species = ID,
                  decimalLatitude,
                  decimalLongitude,
                  year,
                  month,
                  basisOfRecord) # grabbing geographic coordinates, year, month, and the type of record. For this data set, all are "Human Observations"
  }
  bind_rows(emptylist) # return one table with all species for this year
}


years <- c(2013:2023) # assigning the years to pull data from
whoop <- map_dfr(years, crawl) # using our function and inputting years to pull species data

# Giving each observation a unique ID
whoopunique <- rowid_to_column(whoop) %>%
  st_as_sf(coords = c(x ="decimalLongitude", y ="decimalLatitude"), crs = 4269)
```



thank you!
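For the narrower question of skipping a species that errors or has no records, one option is to wrap the per-species request in `tryCatch()` and return `NULL`, which `bind_rows()` silently drops. The following is only a minimal sketch: `safe_fetch()` and `toy_fetch()` are hypothetical helpers, the coordinates are made up, and `toy_fetch()` stands in for the real `occ_data(...)$data` call so the example runs offline. It assumes the real call yields `NULL` when nothing is found.

```r
library(dplyr)
library(purrr)

# Fetch records for one species. Returns NULL instead of stopping when the
# request errors or comes back with no rows.
# `fetch` is a hypothetical stand-in for something like
#   function(name) rgbif::occ_data(scientificName = name, ...)$data
safe_fetch <- function(name, label, fetch) {
  occ <- tryCatch(
    fetch(name),
    error = function(e) {
      message("Skipping ", name, ": ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(occ) || nrow(occ) == 0) return(NULL)
  occ$Species <- label
  occ
}

# Toy fetcher (made-up data): one species has a record, one has none,
# and any other name raises an error.
toy_fetch <- function(name) {
  switch(name,
    "Grus americana" = tibble(decimalLatitude = 28.2, decimalLongitude = -96.8),
    "Strix occidentalis" = NULL,
    stop("name not recognised")
  )
}

species <- c("Grus americana", "Strix occidentalis", "Grus amricana")
common_name <- c("Whooping Crane", "Spotted Owl", "Whooping Crane (misspelled)")

# NULL results are dropped by bind_rows(), so failed species are skipped
result <- map2(species, common_name,
               function(s, n) safe_fetch(s, n, fetch = toy_fetch)) %>%
  bind_rows()
```

Logging the skipped names with `message()` keeps a record of which species need a second look, rather than dropping them silently.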

Hi @jrhollis

Why not use the occ_download() function and get all the species in one download?

When using occ_download(), you can filter by year and specify that you want data with coordinates only. You can also exclude records with specific flags and issues (see examples here: Common things to look out for when post-processing GBIF downloads - GBIF Data Blog).
Plus it will give you a citable DOI so you can credit the people whose data contributed to your work.
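As a rough illustration of such a request, the sketch below combines taxon keys, a coordinate filter, a year range, and a polygon using rgbif's predicate helpers. The wrapper `request_download()` and its argument names are hypothetical, and the year range is simply the one from the question; nothing is sent to GBIF until the function is called with real credentials.

```r
library(rgbif)

# Hypothetical wrapper: builds and submits ONE download for all taxon keys.
# aoi_wkt is a WKT polygon string for the area of interest; with sf it could
# be derived with something like
#   sf::st_as_text(sf::st_as_sfc(sf::st_bbox(AOI)))
request_download <- function(taxon_keys, aoi_wkt, user, pwd, email) {
  occ_download(
    pred_in("taxonKey", taxon_keys),     # all species in one request
    pred("hasCoordinate", TRUE),         # georeferenced records only
    pred("hasGeospatialIssue", FALSE),   # drop records flagged with coordinate issues
    pred_gte("year", 2013),              # year range taken from the question
    pred_lte("year", 2023),
    pred_within(aoi_wkt),                # restrict to the polygon
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )
}

# Usage (requires a GBIF account, so it is left commented out):
# dl <- request_download(gbif_taxon_keys, aoi_wkt, user, pwd, email)
# occ_download_wait(dl)
# occ <- occ_download_get(dl) %>% occ_download_import()
```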

I would advise using the R examples in this blog post, Downloading occurrences from a long list of species in R and Python - GBIF Data Blog, as a starting point. The key is querying by taxon keys instead of species names.
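Applied to the species vector from the question, the name-matching step could look roughly like the sketch below. `match_species()` is a hypothetical helper, not part of rgbif; it relies on `name_backbone_checklist()` returning `usageKey` and `matchType` columns, as in the blog example. Returning the non-exact matches separately is one way to catch spelling problems before downloading.

```r
library(rgbif)
library(dplyr)

# Hypothetical helper: match scientific names against the GBIF backbone.
# Calling it queries the GBIF API, so it needs a network connection.
match_species <- function(names) {
  matches <- name_backbone_checklist(names)
  list(
    # keys for every name that matched something in the backbone
    keys = matches %>% filter(matchType != "NONE") %>% pull(usageKey),
    # anything that was not an exact match, to be reviewed by hand
    review = matches %>% filter(matchType != "EXACT")
  )
}

# Usage:
# species <- c("Grus americana", "Egretta rufescens", "Plegadis chihi")
# res <- match_species(species)
# res$keys     # taxon keys to pass to pred_in("taxonKey", ...)
# res$review   # FUZZY, HIGHERRANK and NONE rows worth checking
```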

You can also get the GBIF taxon keys by vernacular names (=common names) by using the name_lookup() function in rgbif (you need to specify that you want keys for names in the GBIF backbone taxonomy by using datasetKey = 'd7dddbf4-2cf0-4f39-9b2a-bb099caae36c'). Note that using vernacular names isn’t very reliable.

I hope this can help you get started. Let us know if you have any questions.


Thank you for your reply. I am trying to go through this workflow instead but I continue to get errors when requesting Taxon keys. I used the example code:

```r
# The 60,000 tree names file I downloaded from BGCI
file_url <- "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"
# match the names
gbif_taxon_keys <-
  readr::read_csv(file_url) %>%
  head(1000) %>% # use only first 1000 names for testing
  pull(,"Taxon name") %>% # use fewer names if you want to just test
  name_backbone_checklist() %>% # match to backbone
  filter(!matchType == "NONE") %>% # get matched names
  pull(usageKey) # get the gbif taxonkeys
# gbif_taxon_keys should be a long vector like this c(2977832,2977901,2977966,2977835,2977863)
# !!very important here to use pred_in!!
occ_download(
  pred_in("taxonKey", gbif_taxon_keys),
  format = "SIMPLE_CSV",
  user=user,pwd=pwd,email=email
)
```

and pull(usageKey) continues to give errors. I see this is used frequently, but I do not know where usageKey comes from, and it is not a variable defined upstream.

I appreciate your help. As you can see, I learned to pull rgbif in a much different way!

Thanks for letting us know @jrhollis. I understand that the pull function is from the dplyr library (Extract a single column — pull • dplyr). There could be an issue with the library installation or the function itself. Did you get any error message while installing the library?

You could also use a $ instead. The idea is to select only one column from the data frame.
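To show that the two forms are interchangeable, here is a small offline sketch. The table is made up for illustration (the two keys are borrowed from the comment in the example code) and only mimics the shape of a match result.

```r
library(dplyr)

# Made-up match table with the two columns used in the example code
matches <- tibble(
  usageKey = c(2977832L, 2977901L),
  matchType = c("EXACT", "FUZZY")
)

keys_pull <- matches %>% pull(usageKey)  # dplyr way: works inside a pipe
keys_dollar <- matches$usageKey          # base R way: same column, same values

identical(keys_pull, keys_dollar)
```

Either way, `usageKey` has to be a column of the table arriving at that step, not a variable defined earlier in the script.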

No errors in library installation. Also checked to make sure the package was up to date. Is that specific line trying to pull a column from the input dataset or the target? If it is the input, that example csv did not have this.

Thank you for your help. Do you have any more resources (e.g. videos, tutorials) on using these functions? I would love to do some more development, as I am trying to bring this to my team so that we can use GBIF as our data source of choice. Any more resources would be so helpful (and even a class would be cool too). This community forum has already been such an improvement.

TL;DR: remove the comma in pull(, "Taxon name") and your code works.

@jrhollis The reason pull(usageKey) isn’t working is that your filter call directly upstream returns an empty tibble. This is because, upstream of that, the output of pull(, "Taxon name") leads to a tibble where matchType is NONE for all rows. TBH, I’m not quite sure what is happening, but if you remove the comma in that pull call, it returns a tibble where matchType has three values (EXACT, FUZZY, HIGHERRANK).

Here is an example that gives you an EXACT match, a HIGHERRANK match, and a NONE match. It might be useful to continue testing.

```r
tibble("Taxon name" = c("Abarema abbottii", "Abarema non-species", "barema")) %>%
  pull("Taxon name") %>%
  name_backbone_checklist() %>%
  glimpse()
```

After sleeping on it, I realized what is happening. That extra comma makes it so you’re passing a named character vector to name_backbone_checklist(). The values come from the Citation column, and the names from the Taxon name column. It then returns a strange tibble that doesn’t have a column called usageKey, which is what is causing the error, not because it’s empty. I’m not sure why name_backbone_checklist() changes the output tibble structure when you pass it a named vector.
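The difference between an empty table and a missing column can be checked offline. In this small sketch both tables are made up: `pull()` on an empty tibble that has the column returns a zero-length vector without complaint, while `pull()` on a tibble lacking the column raises an error.

```r
library(dplyr)

# Empty table that still HAS the column: pull() works, result has length 0
empty_matches <- tibble(usageKey = integer(), matchType = character())
keys <- empty_matches %>% pull(usageKey)

# Table WITHOUT the column: pull() raises an error, caught here for display
no_key_column <- tibble(matchType = "NONE")
outcome <- tryCatch(
  {
    no_key_column %>% pull(usageKey)
    "no error"
  },
  error = function(e) "pull() failed: column not found"
)
outcome
```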

pull behaves this way because %>% fills the .data argument with the output from the pipe (if you pass the tibble inside pull() instead of piping it, it returns the Taxon name vector correctly). Thus, the comma is defining the second (var) and third (name) arguments, not the first and second. If you look at ?pull, the default for var is the last column when none is named. So effectively, you’re telling it to pull the last column and use Taxon name as the names for the output vector.
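The argument shift can be reproduced on a toy table. The two-column tibble below is made up, with Citation as the last column to mirror the layout described above; it is a sketch of the behaviour, not the original file.

```r
library(dplyr)

# Made-up stand-in for the tree names file: Citation is the last column
trees <- tibble(
  `Taxon name` = c("Abarema abbottii", "barema"),
  Citation = c("citation A", "citation B")
)

# Without the comma: "Taxon name" fills `var`, giving a plain character vector
plain <- trees %>% pull("Taxon name")

# With the comma: `var` is left at its default (last column) and
# "Taxon name" fills `name`, giving a NAMED vector of Citation values
named <- trees %>% pull(, "Taxon name")

names(named)   # the taxon names
unname(named)  # the Citation values
```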

This may be TMI, but I learned a thing or two figuring it out, so thanks for asking the question!

