Can I use the GBIF API to get the total number of species?

I’ve played with the GBIF API for a few days & I don’t see a way to get counts for SPECIES descendants of taxa (the “numDescendants” value includes higher taxa), without crawling the entire db. As a workaround, I downloaded all records and am building my own SQLite database. But it’d be way better if I could query the GBIF API for summaries, something like:

/species/summary/{highertaxonKey}?rank=species # Would return number of species under highertaxonKey

… or:

/species/summary/{highertaxonKey}?rank=class # Would return number of classes under highertaxonKey

Is there a way to accomplish this currently? I read through the docs a couple of times & haven’t found it.

Cheers,
Raphael

Hi Raphael,

Does this accomplish what you’re looking for?

https://api.gbif.org/v1/species/search?rank=SPECIES&highertaxon_key=1&limit=0

(example of number of SPECIES for Animalia)

{
"offset": 0,
"limit": 0,
"endOfRecords": false,
"count": 2759630,
"results": [],
"facets": []
}

Or arthropod families:

https://api.gbif.org/v1/species/search?rank=FAMILY&highertaxon_key=1&highertaxon_key=54&limit=0

{
"offset": 0,
"limit": 0,
"endOfRecords": false,
"count": 21167,
"results": [],
"facets": []
}
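For anyone scripting this, here is a minimal Python sketch of the same `limit=0` trick, using only the standard library. The function names are made up for illustration; the parameter names are copied from the URLs above, and `fetch_count` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.gbif.org/v1/species/search"


def build_count_url(rank, higher_taxon_keys):
    """Build a species/search URL that asks only for the total (limit=0)."""
    params = [("rank", rank)]
    # Repeating the parameter mirrors the arthropod-families URL above.
    params += [("highertaxon_key", key) for key in higher_taxon_keys]
    params.append(("limit", 0))
    return f"{API_ROOT}?{urlencode(params)}"


def extract_count(payload):
    """Return the 'count' field from a decoded search response."""
    return int(payload["count"])


def fetch_count(url, timeout=30):
    """Run the request and return the count (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return extract_count(json.load(response))
```

With `limit=0` the `results` list stays empty, so the only field worth reading is `count`; `fetch_count(build_count_url("SPECIES", [1]))` would repeat the Animalia query.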

Edit: if someone knows why my post here was flagged by the community and hidden, please let me know!

Ah, sneaky! It makes sense that count is decoupled from results, although it’s a bit unintuitive at first glance. Thanks for the reply, Daniel.

Next: the results don’t seem to account for duplicates. For example, this query:

https://api.gbif.org/v1/species/search?rank=SPECIES&q=Aspidiotus%20piri&limit=0

… returns a count of 21 when I would expect a count of 1. We can reduce the count by sticking to one parent kingdom record:

https://api.gbif.org/v1/species/search?rank=SPECIES&q=Aspidiotus%20piri&limit=0&highertaxon_key=1

… but that still returns 9 results, two of them with matching nubKey values. So, unless there’s a way to tell the API to resolve duplicates, it looks like I’m still stuck downloading all the data, cleaning it to remove duplicates, & building out a db on my end to quickly get accurate counts, right? I imagine other folks run into this issue. Any other tips before I start down this rabbit-hole?

Cheers,
Raphael

Assuming you want the count of species rather than the count of published species names, try adding the status=ACCEPTED parameter.

https://api.gbif.org/v1/species/search?rank=SPECIES&highertaxon_key=1&limit=0&status=ACCEPTED

{
"offset": 0,
"limit": 0,
"endOfRecords": false,
"count": 1709526,
"results": [],
"facets": []
}

For your Aspidiotus piri search, adding status=ACCEPTED results in a count of 0, because all of the listed combinations are considered synonyms of Diaspidiotus pyri. Searching for the accepted name instead returns a single result:

https://api.gbif.org/v1/species/search?rank=SPECIES&highertaxon_key=1&limit=0&status=ACCEPTED&q=Diaspidiotus%20pyri

{
"offset": 0,
"limit": 0,
"endOfRecords": false,
"count": 1,
"results": [],
"facets": []
}

You might want to try the Species Match API instead of the Species Search API. The match API returns a single main match for a name. More information in this blog post: (Almost) everything you want to know about the GBIF Species API - GBIF Data Blog
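A rough Python sketch of how a match lookup could be wired up. The `/v1/species/match` endpoint and its `name` parameter are part of the GBIF API; the response fields read here (`matchType`, `usageKey`, `acceptedUsageKey`) are assumptions based on typical match responses, so check them against a real response before relying on them. The helper names are made up.

```python
from urllib.parse import urlencode

MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name, kingdom=None):
    """Build a species/match URL for a scientific name."""
    params = {"name": name}
    if kingdom:
        # Optional hint that can help disambiguate homonyms.
        params["kingdom"] = kingdom
    return f"{MATCH_ENDPOINT}?{urlencode(params)}"


def accepted_key(match):
    """Pick the key of the accepted taxon from a decoded match response."""
    if match.get("matchType") == "NONE":
        return None  # the backbone had no usable match
    # Assumption: synonyms carry 'acceptedUsageKey', accepted names only 'usageKey'.
    return match.get("acceptedUsageKey", match.get("usageKey"))
```

Mapping every input name to its accepted key and then counting the distinct keys is one way to collapse synonyms such as the Aspidiotus piri combinations into a single species.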

Perfect—thank you, Donald—that’s exactly what I was after. I didn’t know there was only one ACCEPTED instance of published species names.

Thanks, Marie! I’ll give the match API a shot—it sounds more performant, too.

Glad it was of use. For what it’s worth, I think there’s another wrinkle here in the number returned. (As someone who’s overseen content for several biodiversity websites, I’ve come to believe that “number of species” always requires massive amounts of explanation and caveats).

I’m pretty sure the count does not match the number of accepted species-rank taxa that are known to GBIF - COL has a larger number of these than the number returned here by the GBIF API. I think this represents the species-rank taxa for which GBIF also has occurrence data.

Edit: this post was marked as spam by the community & was hidden, but I don’t know why. Please let me know what’s wrong with it before hiding it again. Thanks!

Thanks, Donald. I’m coming from an arts & engineering background, so I appreciate your experience with biodiversity data. The GBIF numbers look higher to me:

Biota
COL: 1,896,632 (Taxon | COL)
GBIF: 2,182,233 (Sum of GBIF kingdom species below)

Animalia
COL: 1,339,437 (/taxon/N)
GBIF: 1,709,526 (https://api.gbif.org/v1/species/search?rank=SPECIES&highertaxon_key=1&limit=0&status=ACCEPTED)

Archaea
COL: 377 (/taxon/R)
GBIF: 1,092 (?rank=SPECIES&highertaxon_key=2&limit=0&status=ACCEPTED)

Bacteria
COL: 9,980 (/taxon/B)
GBIF: 26,089 (?rank=SPECIES&highertaxon_key=3&limit=0&status=ACCEPTED)

Chromista
COL: 21,294 (/taxon/C)
GBIF: 100,288 (?rank=SPECIES&highertaxon_key=4&limit=0&status=ACCEPTED)

Fungi
COL: 146,155 (/taxon/F)
GBIF: 227,509 (?rank=SPECIES&highertaxon_key=5&limit=0&status=ACCEPTED)

Plantae
COL: 370,236 (/taxon/P)
GBIF: 475,893 (?rank=SPECIES&highertaxon_key=6&limit=0&status=ACCEPTED)

Protozoa
COL: 2,565 (/taxon/Z)
GBIF: 5,271 (?rank=SPECIES&highertaxon_key=7&limit=0&status=ACCEPTED)

Viruses
COL: 6,588 (/taxon/V)
GBIF: 6,654 (?rank=SPECIES&highertaxon_key=8&limit=0&status=ACCEPTED)

I don’t see an API for COL, but the data is available to download (Metadata | COL) & I’ve got an SQLite database going now to cache counts on higher taxa nodes—so I could use either GBIF or COL. Which would you go with?
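The SQLite cache mentioned above could look roughly like this in Python. This is only a sketch: the table layout, column names and helper names are invented for illustration, and the numbers in the usage note are the Animalia counts quoted in this thread.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small cache of per-taxon counts."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS taxon_counts (
               source        TEXT NOT NULL,    -- e.g. 'GBIF' or 'COL'
               taxon         TEXT NOT NULL,    -- higher taxon name or key
               taxon_rank    TEXT NOT NULL,    -- rank being counted
               species_count INTEGER NOT NULL,
               PRIMARY KEY (source, taxon, taxon_rank)
           )"""
    )
    return conn


def store_count(conn, source, taxon, taxon_rank, species_count):
    """Insert or overwrite the cached count for one (source, taxon, rank)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO taxon_counts VALUES (?, ?, ?, ?)",
            (source, taxon, taxon_rank, species_count),
        )


def load_count(conn, source, taxon, taxon_rank):
    """Return the cached count, or None if nothing is stored yet."""
    row = conn.execute(
        "SELECT species_count FROM taxon_counts "
        "WHERE source = ? AND taxon = ? AND taxon_rank = ?",
        (source, taxon, taxon_rank),
    ).fetchone()
    return row[0] if row else None
```

For example, `store_count(conn, "GBIF", "Animalia", "SPECIES", 1709526)` and `store_count(conn, "COL", "Animalia", "SPECIES", 1339437)` keep both sources side by side, so the choice between them can be deferred to query time.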

For context: I’m working on an educational product that visualizes species numbers & incorporates data from the Red List. I’m trying to work in empty gaps to visualize species we don’t know about—so putting the caveats front & center. That’s the most interesting part of the story.

Cheers,
Raphael

For posterity, here’s how you can use the GBIF API to query the COL dataset, if you want to limit results to what’s in the COL:

/v1/species/search?datasetKey=7ddf754f-d193-4cc9-b351-99906754a03b&rank=SPECIES&limit=0&higherTaxonKey=170809364&status=ACCEPTED

… ^ that will give you the count for Archaea, according to the COL.

The COL checklist is summarised in a nerdy way here in Swagger:

https://api.catalogueoflife.org/

3LR is the dataset key to use to get the main COL list.

COL and GBIF are in the midst of unifying how the GBIF and COL lists operate, so that soon there should be one list with the ability to narrow to just the parts that have been formally reviewed (equivalent to COL today) or to include the tentative auto-constructed parts (equivalent to GBIF today). Names and species should then share the same identifiers, and the same API should apply, across both.

In the meantime, the decision probably comes down to your purpose. If you wish to access all occurrence data or organise data for all parts of the tree of life, GBIF is currently more comprehensive. If you need more confidence that the names are ones that have been reviewed and are actually correctly interpreted, COL fits the bill - but currently with lower coverage.

So helpful! Thanks again, Donald.

Slightly off-topic: my reply to you was hidden because someone in the community flagged it as spam:

“Your post was flagged as spam: the community feels it is an advertisement, something that is overly promotional in nature instead of being useful or relevant to the topic as expected.”

I’m new to this forum & I’m not sure what I wrote that was inappropriate. If anyone knows what it was, let me know so I can avoid it in the future, thanks!

Cheers,
Raphael

It’s an automated behaviour, @raphaelmatto, meant to reduce spam and cruft in an open discussion system. The specific reason was that “This new user tried to create multiple posts with links to the same domain.” All good now.

As you probably remember, @dhobern, we prepared this when we swapped the count of peer-reviewed uses into the home page:

With a new backbone, I’m sure we’re due to update the numbers we provide there, too…

cc: @dnoesgaard

This may be a bit late, but bearing in mind all the considerations mentioned above, if you use the facet parameters with limit=0:

occurrence/search?taxonKey=1
   &facet=speciesKey
   &speciesKey.facetLimit=50000
   &speciesKey.facetOffset=0
   &limit=0

you get a table like this
[[1]]
name count
2480449 513
2481172 7
2481433 7
7788295 7
2498352 6
7595433 6
2351211 5
2481205 5
2498326 5

without the need to download all the data. Here “name” is the species key and “count” is the number of occurrence records for that species.
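A small Python sketch of the facet approach, for reference. The query parameters are the ones shown above; the response shape assumed here (a `facets` list whose entries hold a `counts` list of `name`/`count` pairs) is my reading of the occurrence search output, and the helper names are made up.

```python
from urllib.parse import urlencode

OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_facet_url(taxon_key, facet_limit=50000, facet_offset=0):
    """Build an occurrence/search URL that returns only speciesKey facets."""
    params = {
        "taxonKey": taxon_key,
        "facet": "speciesKey",
        "speciesKey.facetLimit": facet_limit,
        "speciesKey.facetOffset": facet_offset,
        "limit": 0,  # no occurrence records, just the facet counts
    }
    return f"{OCCURRENCE_SEARCH}?{urlencode(params)}"


def species_counts(payload):
    """Map speciesKey -> occurrence count from a decoded facet response."""
    counts = {}
    for facet in payload.get("facets", []):
        for entry in facet.get("counts", []):
            counts[int(entry["name"])] = entry["count"]
    return counts
```

The number of entries in the returned dict is the number of distinct species with occurrence data on that page of facets. If a page comes back with exactly `facet_limit` entries, requesting the next page with a larger `facet_offset` is probably needed to see the rest.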