Google Dataset Search


#1

Google has released Dataset Search, which looks cool. GBIF data comes up, e.g. https://toolbox.google.com/datasetsearch/search?query=Coleoptera BUT there are no GBIF links or logos. Instead the links are to DataCite and the logos (if any) are for the sources of the data. I think this is potentially a problem for GBIF, as it is essentially invisible in the search results even though it is actually the reason the data occurs in those results. Maybe someone should talk to Google and DataCite about fixing this?


#2

There’s a Twitter thread here that discusses this a bit more: https://twitter.com/rdmpage/status/1037391171423744001?s=20


#3

Indeed, GBIF-mediated datasets do not appear as they should for the moment.
I also noticed that lots of GBIF downloads appear as ‘datasets’ with the GBIF logo (see search results).


It should not be too difficult to fix, and it’s crucial to GBIF’s visibility to get that right!
Hope the Secretariat will start investigating this soon.


#4

@andre I think there are a couple of issues here.

Firstly, who best to give credit to for GBIF (-insert politically correct prefix here) datasets? GBIF themselves seem happy with the original provider’s logo being used:

BTW—we’re really happy with the logo-and-star-credit for data providers, and with display of GBIF download DOIs… https://twitter.com/GBIF/status/1037618845224247297

If the goal is to increase visibility of the original provider, that makes sense. However, it is at the cost of GBIF’s visibility (and not helped by the link to the dataset being to a DataCite metadata page, not the GBIF page for the dataset).

The second issue is, as you point out, the huge number of GBIF downloads that appear with the GBIF logo. These are, mostly, a waste of space in that they aren’t datasets as such. I’ve never been happy with GBIF’s decision to assign a DOI to every download, especially as those downloads are not guaranteed to be persistent (which undermines the very idea of a DOI in the first place). But I guess it’s an attempt to make it easier to cite GBIF data. But a consequence is that meta search engines like Google’s get swamped with “datasets” that aren’t datasets.

So I think as things stand GBIF (a) doesn’t get any visibility for the data it mobilises and (b) gets visibility for essentially spamming the search results :disappointed:

None of this may be what GBIF imagined would happen, but it’s something that deserves some attention, especially given the potential visibility of Google’s latest toy.


#5

Point a) is basically correct as it stands now—but we do feel that it’s important for the data publishers to receive credit. What’s happening now is that DataCite is standing in for GBIF as the provider, in the absence of Schema.org-compatible metadata. Providing that is and has been on the docket, but with the release of Google Dataset Search, it takes on greater urgency. We weren’t expecting the tool, either.
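
As a rough illustration of what Schema.org-compatible metadata for a dataset page could look like, here is a minimal Python sketch that builds a schema.org `Dataset` description as JSON-LD. It is not GBIF's implementation: the function names, the choice of `creator` for the data publisher and `includedInDataCatalog` for GBIF, and all example values (title, URL, DOI) are assumptions made for illustration.

```python
import json


def dataset_jsonld(name, description, url, doi, publisher_name):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper: the mapping of fields is an assumption,
    not a description of what GBIF actually emits.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Landing page a search result could link to
        "url": url,
        "identifier": "https://doi.org/" + doi,
        # Original data publisher, so it keeps its credit
        "creator": {"@type": "Organization", "name": publisher_name},
        # The aggregator that serves the dataset
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": "GBIF",
            "url": "https://www.gbif.org",
        },
    }


def as_script_tag(doc):
    """Wrap the JSON-LD in the script element used to embed it in an HTML page."""
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


# Placeholder values only; none of these refer to a real dataset.
example = dataset_jsonld(
    name="Example beetle occurrences",
    description="Placeholder description of an occurrence dataset.",
    url="https://example.org/dataset/placeholder",
    doi="10.5555/placeholder",
    publisher_name="Example Natural History Museum",
)
print(as_script_tag(example))
```

The point of the sketch is only that a dataset page can name both the original publisher and the aggregator in the same record, so credit for one need not come at the cost of visibility for the other.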

Point b)…well, legal and regulatory frameworks recognize a download as a new dataset, whether it’s derivative or not. We do assign DOIs, and we couldn’t be doing what we’re doing with literature tracking without them. And they provide a detailed, transparent and reproducible link back to the complete list of sources while reinforcing the provenance. ‘Spamming the search results’ seems a bit extreme.

If it is, it probably won’t be for long, once Google starts tweaking the results, providing facets and filters, AND adds more data from other domains (environmental data was an easy get for them to start). In the meantime, presumably because it provides full-text search of dataset titles, it offers an interesting way to search what people are downloading by common names.