Data will save the natural world: dogma, marketing or both?

datafixer · July 14, 2024, 1:30am

Everyone working in biodiversity informatics is familiar with the notion that the data we manage is important for conserving and managing biodiversity.

This notion has been expressed thousands of times in reports, journal articles and funding applications. I first saw it when GBIF was established:

The sustainable use and management of biodiversity will require that information about it be available when and where that information is needed by decision-makers and scientists alike.

Edwards, Lane & Nielsen (2000)

and it remains a theme in the current research literature:

In the marine realm, knowledge about biodiversity is still scarce, incomplete and concerns all taxa (Mora et al. 2011, Wiens 2023). This lack of knowledge, added to the current context of biodiversity loss which impacts all ecosystems (Diaz et al. 2019) makes biodiversity assessments crucial for exploring biodiversity and understanding its erosion. Accurate analyses are needed to determine relevant conservation strategies as well as planning and monitoring this marine biodiversity (Barnosky et al. 2011).

Haderlé et al. (2024)

But like many widely accepted notions, this one has become an article of faith, with little questioning of its validity in the real world:

Are masses of biodiversity data really needed to formulate policy and make decisions?
Do data-based policies for biodiversity conservation and management get implemented adequately, or at all?
Does more and better data lead to more and better policies and decisions?

The strength of the core notion relies on what might be called common-sense generalisations. Two of these are that you can’t make a decision if you’re ignorant of the facts, and that the more you know, the better your understanding of what needs to be done. Unfortunately, while common sense and generalisations might be good arguments in a debate, they can be misleading. The history of science is rich in examples of counter-sense learning (the Sun doesn’t go around the Earth, the Earth goes around the Sun), and contemporary politicians still build their campaigns on ideas that are known not to work (choose your own examples).

An unfortunate outcome of the belief that biodiversity data will help save the natural world is a tolerance of the sharing of bad data, in the hope or expectation that the data will be fixed sometime in the future:

Urgent needs for data to study the effects of rapid changes in climate and land use on biodiversity, can make it necessary to make data available first and worry about curation later. However, low quality data may negatively impact research results. Curation often requires (semi-)manual checks which are time-consuming, and the required expertise is scarce.

Addink and Guensch

Another unfortunate outcome is that a biodiversity informatics practitioner might begin to doubt the core notion. “Why” they might ask, “am I doing this, if it has no significant impact on extinctions and habitat loss?”

I’ve posted these remarks in GBIF’s community forum in hopes they will stimulate discussion on why biodiversity informatics exists and what its practitioners hope to accomplish. In my own view, biodiversity informatics is a branch of library and information science. Who and what it serves is debatable, but like any LIS program it should be done well, and that focus is lost if practitioners believe that all data is good data in an effort to conserve the natural world.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

sformel · July 17, 2024, 3:56pm

@datafixer Thank you, as always, for your thoughtful questions and careful explanation. Here is how I soothe my soul when I find myself wondering, “Why am I doing this, if it has no significant impact on extinctions and habitat loss?”

The hypothesis that biodiversity data can have significant impact on extinctions and habitat loss has enough supporting evidence that I’m not ready to reject it. It is also enough of an interesting challenge to implement the experiments (e.g. GBIF, OBIS) and probe for signal in the results, that I find it stimulating. It’s a bit selfish, but I enjoy the process as much as I hope to enjoy the result.
The path to high quality, perfectly interoperable and reusable data smacks of Zeno’s paradox. We’ll probably never get there, but we learn a lot along the way, and like the classic joke, we can “get close enough”. I imagine the same could be said for any LIS program. That being said, if the message is “all data is good data” then I think we are fooling ourselves. We need to be candid with ourselves about the limits of what we can currently achieve, while striving to halve the distance to perfection. When I get frustrated, this framework helps me identify purpose in our work.

The title you chose asks if this is dogma, marketing or both? I would say both. And I’m ok with that. Whether it’s dogma or marketing that is driving the hypothesis formation, I’m able to convince myself that it remains worthwhile to aggregate data, produce biodiversity informatics, and pursue higher quality data.

datafixer · July 18, 2024, 3:59am

@sformel, many thanks for contributing to this discussion.

It’s interesting that you see the core notion as an hypothesis, because that opens the question “How do you test for a significant result?” In other words, what variables would you have to control to examine statistically the connection between the sharing of biodiversity data and positive, real-world conservation benefits, and how would you avoid the trap of seeing a false-cause connection (post hoc ergo propter hoc; the benefits were coming for other reasons)?

In my experience in Australia, the variables-needing-to-be-controlled make up a long list and are subject to stochastic change, e.g. because of elections. What’s been clear in many conservation wins is that the science behind the win didn’t play a large part in its success.

In fact, in many cases the science has been abused. Suppose a campaign is started to legally reserve a block of forest to be logged. To focus public attention on the block, a biodiversity survey is done that generates occurrence data for a rare or otherwise glamorous species in the block. The campaign propaganda features these occurrences: “Stop the logging! Save the {X}!”

Whether or not the campaign succeeds, several ecological questions are likely to remain unanswered:

Is the rarity of {X} real, or is it an artifact of inadequate sampling?
Will logging actually disadvantage {X}?
What are the current and near-term threats (other than logging) to the continued existence of {X} in the block of forest?
What management is required to maintain {X} in unlogged forest, and who will do it?

You can argue that the propaganda use of occurrence records is not relevant to the science in this case, and that the important thing is that conservation planners and policy-makers now have additional data to inform their decisions. My counter-argument is that those decisions are likely to be made for reasons having nothing to do with the occurrence data, which gets us back to how you could test your hypothesis.

Debbie · July 18, 2024, 11:07pm

Hm. A specific use case example and some references might help here. I would like to add an additional culture change aspect regarding the need to test / document / validate what we purport (that our data prove fit for uses declared).

Sometimes it seems to me that we focus on the downstream nature of this question, as in, can these data be used to support policy development, and how effective are they in this application?

I’d like to suggest we also take time to consider our embedded research practices (e.g. some of the upstream practices) as they may or may not give us the data we need to answer or address given challenges. If they do not, are we working to amend our current standards-of-practice to change future outcomes?

Consider: a digitization project is funded. In the grant, it is hypothesized that several research questions can be effectively addressed once these digital data become available. Say two years into the grant, one tries to use these data to address these questions. This analytical test shows these data are insufficient to the task (e.g. missing information, or not enough of a certain type of data) to answer one or more of the proposed questions. What happens next? (I’m not sure).

Then, one thinks, well, in a research paper, the discussion section often brings up what might need to happen next, right? It’s the space where authors say, “Oh, we learned something, and we learned that we need to investigate more, or change our lab method, or ask a different question, or change our DNase or primer,” etc. For the purposes of bio/geo collections data:

are collecting methods / protocols / standards changing in response to known gaps or holes when attempts to apply these data have shown where more data are needed?
are these collection digitization collaborations resulting in upstream changes in current practice?

Pedagogically speaking, we “teach the way we are taught”, that is, we perpetuate practice. We only change it if it is found wanting – usually with respect to our own needs (or maybe our collaborator’s needs … see Kading and Kingston 2020, Alba 2021). And that’s the next point. What might be useful for us might have gaps when it comes to the needs of others. IF we want our data to be useful for other purposes, IF we find our data cannot answer the questions we thought it would, are we changing what we do to improve this going forward? Do we need different people and practices in our loops to fill the expertise / data holes? Perhaps this is another aspect of the so-called “extended specimen”.

When I, and others, have taken the time to have the above conversation, the expected happens (i.e. Human Behavior 101). Some folks are very open to the discussion of what they might be able to change or consider for their existing collecting (specimen and data) practices. Others adamantly state they can’t do anything more. It takes time and lots of community engagement to nurture this conversation. We do have examples to show that it can be very fruitful.

We do need a “growth mindset” which at least means folks are willing to talk about this. As to what’s actually happening, I only have the examples happening in my own sphere to bring to the table. So it’s great to raise it here for discussion, Bob. I do think (see) some folks are changing / adapting some of their existing collecting practices. I’m not sure how strategic any of this change is. It is this sort of discussion I (and others) have been known to bring up at some workshops (e.g. Field to Database at iDigBio, and more recently the UIUC-Vouchering-Workshop-2023; see also blog post).

So will the data as they are, contribute to what we want to do? I think yes. Could they be better, sure. Are we working both upstream AND downstream to make it so? I think so. I’m not sure the efforts to address the upstream parts are strategic. There’s an upcoming grant I’m aware of at the moment where we’re looking at modeling some of these types of practices / behaviors – with undergraduates.

Some Related References

Alba C, Levy R, Hufft R (2021) Combining botanical collections and ecological data to better describe plant community diversity. PLoS ONE 16(1): e0244982.

ICER Integrating Collections and Ecological Research Working Group. iDigBio. Integrating Collections and Ecological Research - iDigBio

Kading RC, Kingston T (2020) Common ground: The foundation of interdisciplinary research on bat disease emergence. PLoS Biol 18(11): e3000947.

McElrath T, Paul DL. Blog Post: Towards Long-lived Specimens, Data, and Increased Impact about the UIUC-Vouchering-Workshop-2023

Paul D (formerly at iDigBio, now INHS), Seltmann K (TTD-TCN, AMNH), Michonneau F (FLMNH - iDigBio), Masaki D (USGS - BISON), Soltis P (FLMNH - iDigBio PI), Ellis S (iDigBio), Love K (iDigBio) 2015. Field to Database: Biodiversity Informatics and Data Management Skills for Specimen Based Research Workshop. iDigBio host 2015 March 9 - 12. Field to Database - iDigBio

Other folks who might offer some tidbits
@libby @seltmann @tmcelrath @francois @dshorthouse @qgroom @vijaybarve @tkarim @shellyleegaynor

datafixer · July 19, 2024, 2:03am

@Debbie, great to have your thoughts and your perspective on “upstream”, and I agree that it’s worthwhile to see biodiversity data as a flow.

One flow image is that citizen science, research projects and collections all generate data. The various streams of data are stored and managed in several biodiversity data warehouses. A range of visitors come to the warehouses looking for different sorts of data for different purposes (including conservation), meaning the warehouse outflow streams are very varied in their nature and volume.

“Upstream” in this picture comprises both the data generators and the data warehousers. Because the data generators will usually not know in advance what content the warehouse visitors want, their options for doing better by the visitors are pretty much limited to deciding how and how well to share their data with warehouses. The “how” has become a lot easier with the development of Darwin Core and its offshoots. The “how well” needs more attention at data generator level, as you point out.

“Upstream” at the data warehouses also needs work. I think the era of warehouses seeing themselves as neutral data brokers only marginally concerned with data quality and usability is coming to a close. Sure, GBIF has been working (for some time) on better understanding the content needs of its warehouse visitors by examining how GBIF data is used in research publications, but of course that’s only one segment of the actual and potential visitor population. The next step for warehouses is to work on the quality and usability needs of visitors, by curating and branding the data they make available. Warehouses need to do their own “upstream” work.

I disagree with you (a bit) that use cases and specific examples are helpful in understanding what’s going on in 2024. In a complex situation it’s always easy to find examples of some idea working well, and other examples where the idea is failing. It’s tempting, and foolish, to generalise from examples.

The point of my original post is that I think there’s a foundational myth in biodiversity informatics that isn’t questioned enough. Ignoring the myth, there’s an information management job to be done by biodiversity informatics practitioners. Some of those practitioners work “upstream” with data generators and some work “midstream” with data warehousers. Both lots need to work on quality and usability.

eellwood · July 19, 2024, 3:06pm

This topic has kept me up many a night… why futz with data while the world burns around us? Perhaps a bit more grim of a question than that originally posed, but I think it’s not a very far reach. In short, and in agreement with some points mentioned already, data saving the world is likely both dogma and marketing and that might be ok. In the logging example, if data (however sparse, cherry-picked, or ecologically inadequate) does help to save a species, or even populations, or even individuals, then in my opinion it’s valuable.

On the flip side though, and please don’t blacklist me for this, I mostly think that we have more than enough data to support a vast amount of conservation work already. To speak in extremes, we know with full certainty that swapping natural habitat for concrete is bad for just about everything, yet we continue to move rapidly down this path. More data, or even all the data, will provide more evidence for this, but I generally think we have the data we need to support conservation.

Where things get a little more grey, perhaps, is when we acknowledge that some amount of, e.g., logging must take place somewhere and to some degree, so finding that sweet spot of destroying just enough habitat but not too much would require specific data that may or may not be in hand. And that’s when I admit my monkey wrench gang tendencies that have me thinking that humans have a large enough footprint already and can’t we just find a way to use the resources we’ve already extracted without destroying more and we don’t need more data to show this… but I digress.

Debbie · July 19, 2024, 5:39pm

Hi Bob,

Ouch

You wrote:

Certainly not wishing to be characterized as foolish. It is my nature to start … by grounding myself with what I understand or perceive about a topic from my own experiences. Of course, one cannot use an “n of 1” to assert the complexity inherent in your question. That was not my purpose. Rather, it was to broaden the scope when you ask about (in some sense) the fitness of these data, and to raise another issue that sometimes seems under-discussed to me, that is, the upstream current practices. I believe there are opportunities there to improve our (current and future) data, and what it can be used for. I disagree that use cases are not helpful. They are a starting point, a window into the level/s at which the person joining is thinking, and good for then expanding upon. (I can certainly provide more examples.)

Yes, I’m agreeing with you … 100%. That “… a foundational myth” as you put it, “isn’t questioned enough.” I’m saying that another possible (seemingly to me anyway) “foundational myth” exists … that specimen and data collection as it’s done aren’t questioned enough either.

The data as they are, are certainly being used (see relevant papers published using GBIF datasets) effectively.

As to @libby’s points (go Libby!), the key is understanding that we have choices to make and hard work to do to think of strategies that effect change. To Libby’s example of the Monkey Wrench Gang, I’d add the story Howard Zinn shared about the strategies (actions) that students working with him and others came up with to end the practice of segregation / materials access in our US libraries. It’s a compelling story that highlights that it’s our behaviors we can change that make a difference. The questions then begin … What do we change? (Upstream and Down)

Debbie · July 19, 2024, 8:10pm

Hi Stephen,

Yes, much like me you formed a use case … framed as a hypothesis. Similarly, in my example, it was exactly this, a “hypothesis” formed in a grant, stating that these (particular) biodiversity data would make a significant impact on the ability to answer some challenging research questions … many of which naturally center around subjects of extinction prediction, habitat loss, etc. Parallel to you, I see our attempts as quite valuable in helping us to both ask questions … and then see … what we learn from our attempts (at digitization, in this case) at whether or not we could “save (at least a part of) the natural world”. And as for marketing, it’s also important. I suspect we have much to learn about how to do our marketing in more strategic ways …

datafixer · July 19, 2024, 8:13pm

@eellwood: Hi, Libby, nice to see you here. I couldn’t agree more that we already have enough data to “save the natural world”. I also think that targeting at-risk areas is the way to get the most important data. In fact, I published a paper promoting this idea 20 years ago (Spare a thought for the losers : Robert Mesibov : Free Download, Borrow, and Streaming : Internet Archive), ran a blog devoted to it and did a lot of fieldwork myself on the principle (some of it published here: v.125:no.4 (2008:Aug.) - The Victorian Naturalist - Biodiversity Heritage Library).

Some of that kind of work does happen, as with Re:wild in Madagascar (Madagascar • Re:wild | rewild.org), but much of the biodiversity data stream isn’t generated with a conservation focus. Further, there have been thousands (?) of university and other research projects funded on the intellectually safe principle that we should be “monitoring” biodiversity in the face of climate change, biological invasions etc, i.e. watching biodiversity’s slow decline without actually doing anything about it.

But, again, this isn’t really the core of my post, which is that regardless of what practitioners think might be done with biodiversity data, it will be done better if the data are better.

datafixer · July 19, 2024, 8:15pm

@Debbie, Strange you took “foolish” that way, my apologies. I used that word in the context of Stephen’s view that “data saves” is an hypothesis. It’s foolish to support (or counter) that hypothesis with examples.

Look: “Drinking bleach can cure COVID”. Example: “My cousin Jack drank bleach, and it cured his COVID”.

Look: “Biodiversity data is important in building effective conservation policies”. Example: “A bird sanctuary was created on this property after visiting birdwatchers built a long, public checklist of bird residents and visitors”.

“I’m saying that another possible (seemingly to me anyway) “foundational myth” exists … that specimen and data collection as it’s done aren’t questioned enough either.”

I don’t doubt that’s true, and educating data generators is a good way to go, and the same for “not my problem” data warehousers.

sformel · July 24, 2024, 2:26am

I wish I was coming to TDWG in person this year, because I always struggle a bit to express myself through my keyboard (read: very slow writer). But I’m going to give it a shot. I also want to flip the question back to @datafixer. Practically speaking, what do you do when you have a crisis of confidence over the necessity, and/or efficacy of biodiversity informatics?

Anyway, you’re right to call out my hypothesis as loosey goosey. It is, and I should have been more careful with my words. The variables-needing-to-be-controlled do make up a long list and are subject to stochastic change. Just a solid multivariate mess. This doesn’t stress me out too much, maybe because I cut my teeth on microbial community dynamics. But, again, you’re right that this doesn’t create permission for a questionable hypothesis.

As I ponder the examples of feedback, possible indirect effects, or false effects, another question has come to mind. Could a problem be that we haven’t had enough data for a long enough time to see an effect? If the data is supposed to elicit a response from humans to change our destructive behaviors, maybe it takes a few generations for that feedback to build up and create consistent effort and measurable evidence. That doesn’t get to the practical heart of the matter, since we need something more tough-actin’ than that.

I do think we should keep collecting and curating data (with the caveat brought up by @Debbie wrt education and upstream feedback) because:

We need to measure what is happening if we intend to use science to support the decision making.
Collecting and curating data is a very practical way to gain experience with the data and try to identify where it is lacking.
As @datafixer pointed out, we’re still working out how to properly sample biodiversity on earth, both in a theoretical and practical sense. We’ve only had the technology to really start scaling it up (e.g. remote-sensing, omics, AI, acoustics) for a few decades. So maybe it needs a little more time in the oven.

Ok, time to wrap it up. Some last thoughts:

I agree we need higher quality data in biodiversity science.
I also agree with @eellwood that we probably have enough data to understand the house is on fire.

datafixer · July 24, 2024, 8:26am

@sformel, many thanks for your thoughts on this difficult and largely ignored subject.

“Practically speaking, what do you do when you have a crisis of confidence over the necessity, and/or efficacy of biodiversity informatics?”

I think it depends on who’s having the crisis. If you’re a biodiversity informatician you work to ensure that the data are as good as you can “make” them, because that means the data are suited to all sorts of future uses. That’s what LIS people do, it’s in their genes. If you’re a biologist who hopes that accumulating biodiversity data will somehow save the natural world, I feel for you but I suggest you hope for something else.

As a biologist I work with millipedes. They’re way down on the popularity list for the Person In The Street and way down on the popularity (frequency) list for records in GBIF. In the face of habitat loss and numerous local extinctions I do what I can to document millipede biodiversity and to make sure that the documentation is freely available (open access) and that occurrence records are high-quality. It’s what an Australian zoologist recommended 100+ years ago: leave an adequate record of the marvelously interesting forms of animal life we succeeded in exterminating. (https://www.biodiversitylibrary.org/page/10113251)

system · August 23, 2024, 6:26pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
