Do you really need UUIDs in your datasets? - a quick rant

UUIDs (Universally Unique Identifiers) are in fashion at the moment, particularly UUID version 4 (see Note 1, below). They’re the long strings of 32 letters and numbers that look like this:

6449b31b-956c-4dad-a4ee-093d92dacc9c

UUIDs aren’t new. They first appeared in the 1980s, were adopted by Microsoft as “GUIDs”, were standardised in 2005 and have gradually crept into the databases used in biodiversity informatics.

They’re also gross overkill for the job they’re designed to do. In one dataset I audited, the parentEventID entries were UUIDs and the eventID entries had been constructed by prefixing a new UUID with the relevant parentEventID, so that each eventID was two full UUIDs long.

This is just silly. There are far more sensible ways to give a data item a unique identifier.

The big four questions

The key aspects of user-built unique identifiers can be summarised in four questions:

  • Should the unique identifier be informative to humans, or just an arbitrary string of numbers and/or letters?
  • Should the unique identifier be locally or globally unique?
  • How many data items need to have unique identifiers?
  • Should the unique identifier be persistent, or could it change sometime in the future?

There’s been a lot of discussion (and disappointment) regarding identifier persistence, and I won’t muddy that discussion here.

Informative or not?

In a (fictitious) project called “Palm Plantation Insect Survey 2023”, insects were sampled with numbered pitfall traps and flight intercept traps at 10 sites on eight days in June 2023. Here’s one of the eventID entries, and one of the occurrenceID entries:

PPIS23-S09-FIT4-20230611

PPIS23-S10-PT18-20230611-237

With the background information I’ve provided (the metadata), I’ll bet you can understand those two identifiers, because they’re informative. The strings in the identifiers mean something, and it’s not hard to infer that the occurrence, for example, was number 237 from pitfall trap 18 at site 10, emptied on 2023-06-11 in the PPIS23 project.
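To show how much an informative identifier carries, here is a minimal Python sketch that splits the occurrenceID back into its parts. The pattern and the field names (`project`, `site`, `trap_type` and so on) are my own assumptions based on the two examples above, not a published scheme for this fictitious project.

```python
import re
from datetime import date

# Hypothetical layout inferred from the examples:
# project - S<site> - <trap type><trap number> - YYYYMMDD - serial
OCCURRENCE_ID = re.compile(
    r"(?P<project>[A-Z]+\d{2})-S(?P<site>\d{2})"
    r"-(?P<trap_type>PT|FIT)(?P<trap>\d+)"
    r"-(?P<date>\d{8})-(?P<serial>\d+)"
)

def parse_occurrence_id(identifier):
    """Split an informative occurrenceID into its component fields."""
    match = OCCURRENCE_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    d = match["date"]
    return {
        "project": match["project"],
        "site": int(match["site"]),
        "trap_type": match["trap_type"],  # PT = pitfall, FIT = flight intercept
        "trap_number": int(match["trap"]),
        "emptied": date(int(d[:4]), int(d[4:6]), int(d[6:])),
        "serial": int(match["serial"]),
    }

record = parse_occurrence_id("PPIS23-S10-PT18-20230611-237")
```

A non-informative identifier like a GBIF occurrence number can't be unpacked this way; everything about the record has to be looked up.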

Identifiers can also be non-informative. How about “4590289036”? It’s a GBIF occurrence number, also seen in the URL https://www.gbif.org/occurrence/4590289036. The number 4590289036 uniquely identifies a duck sighting earlier this year in France, but the identifier itself tells you nothing about that sighting.

Locally or globally unique?

Suppose you assign the identifier “AH2614” to a data item in a local database, not expecting that the identifier would ever be needed or used outside your database. Then your records are shared with a larger database, and the identifier “AH2614” finds itself in a field including several other “AH2614” entries that refer to other, completely different data items from other databases.

So a locally unique identifier — still unique and perfectly usable in your local database — can fail as a “globally” unique identifier. But what does “globally” mean? In the context of the entire universe of data, a UUID like 6449b31b-956c-4dad-a4ee-093d92dacc9c is truly globally unique (see Note 1, below), but that’s an enormous context. A more realistic context might be “the universe of biodiversity informatics”, which doesn’t need a UUID’s capacity for uniqueness.

In other words, the choice isn’t local/global context, it’s local/larger/larger still/even larger/…/universal context. Note also that the larger-database manager who loads “AH2614” into a field possibly containing other “AH2614” entries is likely to name that field something like originalID, and assign to your data item a new identifier which is unique within the larger database, but not universally unique. No problem there.

How many data items?

If you have 1 million data items then you need at least 1 million unique identifiers. Allowing for growth, you might choose an identifier scheme that can handle 10 million different data items, and the simplest way to do that is with numbers: identifiers 1 through 10000000, perhaps formatted with 8 characters: 00000001 through 10000000.

But there are many other ways to generate 10 million unique identifiers, some complicated and some simple. A particularly simple method is to use serial numbers, but to format the number in base 36 (see Note 2, below), which uses the 10 digits 0-9 and the 26 Latin-alphabet letters A-Z. A 5-character base 36 identifier like “55YFG” is the same as the decimal number 8,675,980.
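Here is a small Python sketch of base 36 serial numbers. The function names are mine, and the code assumes the usual convention of digits 0-9 followed by uppercase A-Z.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 then A-Z, 36 symbols

def to_base36(number):
    """Format a non-negative integer as a base 36 string."""
    if number < 0:
        raise ValueError("serial numbers must be non-negative")
    if number == 0:
        return "0"
    chars = []
    while number:
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

def from_base36(text):
    """Convert a base 36 string back to an integer (built into Python)."""
    return int(text, 36)

print(to_base36(8675980))    # 55YFG
print(from_base36("55YFG"))  # 8675980
```

Five base 36 characters cover 36^5 = 60,466,176 different values, so a 10-million-item scheme fits with room to spare.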

UPDATED. A 5-character formatting is used by Catalogue of Life for its unique identifiers. “55YFG” is the unique identifier for the foram name Textularia rugosa d’Orbigny, 1852, which you can find on CoL by entering http://www.catalogueoflife.org/data/taxon/55YFG in a browser. The 5-character CoL identifier is actually based on the 29 characters 23456789BCDFGHJKLMNPQRSTVWXYZ and has special rules for assignment to names. See this interesting discussion on GitHub.

Real-world examples

A rich source of unique-identifier types is Wikidata. Look through the “Identifiers” section of the Wikidata item for Rattus rattus, which itself has the identifier “Q106133”. The black rat has 68 listed unique identifiers! Some are plain text (“Rattus_rattus”), most are numbers and three identifiers are or contain a version 4 UUID.

Those three UUIDs are different, of course. They’re each globally unique in the universal context but locally unique in their source databases, and they’ve been linked to each other by the Wikidata volunteers. For those purposes the UUIDs could just as usefully be smaller and simpler.

ORCIDs (Open Researcher and Contributor IDs) are often used in the Darwin Core recordedByID and identifiedByID fields. An ORCID is 16 characters in four hyphen-separated groups. The first 15 are digits in the range 0-9, and the last is a check character that is either a digit or “X”. My own ORCID, for example, is 0000-0003-3466-5038, although as a database entry that should be the URL https://orcid.org/0000-0003-3466-5038.

ORCIDs are a subset of an international scheme for identifying persons (ISNI), and the identifiers allowed for ORCIDs within that context are 0000-0001-5000-0007 through 0000-0003-5000-0001, and between 0009-0000-0000-0000 and 0009-0010-0000-0000. The last digit in an ORCID is a special checksum digit calculated by an algorithm. ORCIDs are assigned randomly to individuals from the pool of unused possibilities. On a complexity gradient from “uncomplicated” (like “55YFG”) to “way too complicated”, ORCIDs are “a little bit complicated”.
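The checksum can be sketched in a few lines of Python. This follows the ISO 7064 MOD 11-2 calculation as I understand it from ORCID's documentation, where a result of 10 is written as “X”. The function names are my own, and the sketch doesn't check whether an ORCID falls in the allowed ranges.

```python
def orcid_check_character(base_digits):
    """Compute the final character from the first 15 digits (MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def has_valid_checksum(orcid):
    """True if the last character matches the checksum of the first 15."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()

print(has_valid_checksum("0000-0003-3466-5038"))  # True
print(has_valid_checksum("0000-0003-3466-5039"))  # False
```

A check character catches most single-character typing errors, which a plain serial number or a UUID can't do.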

ORCIDs as 16-character strings aren’t globally unique in the universe of human-compiled data, because a car-parts dealer in Lagos, Nigeria might assign the part number “0000-0003-3466-5038” to a brake lining he sells. ORCIDs are only unique within the ISNI context, which is partly specified by using a full URL for an ORCID.

Are there UUID alternatives?

Yes, and they’ve been slowly increasing in number as developers and database managers grow weary of UUID overkill, and especially the randomness of version 4 UUIDs, which is an obstacle to efficient sorting and searching. Before I mention two of these alternatives, ask yourself:

Do I really and truly need a unique identifier that’s universally unique, or could I use one that has fewer characters, is sortable and is unique enough for uses in the biodiversity informatics context?

OK, you’ve decided you want an identifier that will remain unique in space and time for at least the next few thousand years. In that case, an alternative is the ULID (Universally Unique Lexicographically Sortable Identifier), described here. Although a ULID is just as long in binary code as a UUID, it uses a special base 32 encoding to reduce the number of characters needed to 26, with no hyphens. The first 10 characters encode the time at which the ULID was generated, to the nearest millisecond. The next 16 characters are random or pseudo-random. The time-stamp encoding is lexicographically sortable, meaning that ULIDs sort chronologically. Here’s a ULID example: 01BX5ZZKBKACTAV9WEVGEMMVRZ.
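The ULID layout is simple enough to sketch in Python. This is an illustration under my own assumptions, not a replacement for a maintained ULID library: it uses the Crockford base 32 alphabet from the ULID specification, takes its randomness from Python's `secrets` module, and the function names are hypothetical.

```python
import secrets
import time
from datetime import datetime, timedelta, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # base 32 without I, L, O, U
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def encode(value, length):
    """Encode an integer as a fixed-length Crockford base 32 string."""
    chars = []
    for _ in range(length):
        value, index = divmod(value, 32)
        chars.append(CROCKFORD[index])
    return "".join(reversed(chars))

def make_ulid(timestamp_ms=None):
    """10 characters of millisecond timestamp, then 16 random characters."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    return encode(timestamp_ms, 10) + encode(secrets.randbits(80), 16)

def ulid_time(ulid):
    """Recover the creation time from the first 10 characters."""
    ms = 0
    for ch in ulid[:10]:
        ms = ms * 32 + CROCKFORD.index(ch)
    return UNIX_EPOCH + timedelta(milliseconds=ms)

print(ulid_time("01BX5ZZKBKACTAV9WEVGEMMVRZ"))  # 2017-10-24 01:29:36.371000+00:00
```

Because the timestamp comes first and every ULID has the same length, a plain text sort puts ULIDs minted in different milliseconds in chronological order.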

Alternative 2 is Nano ID (see here and here). It’s a 21-character string by default but can also be generated with a string length of your choosing, and with an alphabet of your choosing. Unlike ULIDs, Nano IDs don’t encode a timestamp: every character is random, so they don’t sort chronologically. Nano ID example: X2JaSYP7_Q2leGI9b-MyA.
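The idea behind a Nano ID-style string can be sketched with Python's standard library. This is not the official Nano ID package, just random characters drawn from an alphabet. The function name and the 64-symbol default alphabet are my assumptions, chosen to match the look of the example above.

```python
import secrets
import string

URL_SAFE = string.ascii_letters + string.digits + "_-"  # 64 symbols

def random_id(size=21, alphabet=URL_SAFE):
    """Return `size` characters drawn at random from `alphabet`."""
    return "".join(secrets.choice(alphabet) for _ in range(size))

default_id = random_id()
# A longer identifier from a smaller, case-insensitive alphabet
lowercase_id = random_id(24, string.digits + string.ascii_lowercase)
```

Shrinking the alphabet or the length shrinks the pool of possible identifiers, so both choices should be made with the expected number of data items in mind.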

Ask yourself again…

UUIDs, ULIDs and Nano IDs all require code to generate. You’re unlikely to build them yourself. Are you sure you need universally unique identifiers for your data items, or could you “roll your own” much simpler identifiers that would be unique enough for all practical purposes?

And which database fields will they appear in? You needn’t follow GBIF’s practice: GBIF currently shares about 100,000 individual datasets, but their datasetKey is a version 4 UUID, which can identify 5,316,911,983,139,663,491,615,228,241,121,378,304 unique datasets.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


Note 1 A version 4 UUID like 6449b31b-956c-4dad-a4ee-093d92dacc9c might look like a simple string of letters and numbers, but it isn’t read that way by a computer program that recognises UUIDs. Instead it’s read as a sequence of 128 bits (0’s or 1’s). The 49th through 52nd bits store the number “4” (0100 in binary) in version 4 UUIDs (it’s the first character in the third hyphenated block), and the 65th and 66th bits, at the start of the fourth hyphenated block, store a variant number. The other 122 bits are randomly or pseudo-randomly generated.
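You can look at those bits directly with Python's standard `uuid` module. The variable names below are mine; the slices just pick out the version bits and the two variant bits that open the fourth hyphenated block.

```python
import uuid

u = uuid.UUID("6449b31b-956c-4dad-a4ee-093d92dacc9c")

bits = f"{u.int:0128b}"         # the UUID as a string of 128 0's and 1's
version = int(bits[48:52], 2)   # 49th through 52nd bits
variant_bits = bits[64:66]      # first two bits of the fourth block
random_bit_count = 128 - 4 - 2  # everything that isn't version or variant

print(version)                # 4
print(variant_bits)           # 10
print(2 ** random_bit_count)  # 5316911983139663491615228241121378304
```

The last number is the count of possible version 4 UUIDs, about 5.3 × 10^36.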


Note 2 If you’ve forgotten “base” numbering, think of it this way: the base 10 decimal system we use has 10 possible digits (0-9) at each position. The right-most position is ones, the next position left is tens, the next is hundreds, and so on. The number 3197 is made up of 3 thousands, 1 hundred, 9 tens and 7 ones (3x1000 + 1x100 + 9x10 + 7x1).

There are other ways to use characters to store numbers in positional notation. UUIDs and many other digital schemes use hexadecimal, or base 16 notation. The right-most position holds the numbers from 0 to 15 and counts the “ones”, the next position left holds the numbers from 0 to 15 and counts the “sixteens”, the next left holds the numbers 0 to 15 and counts the “256es” (16x16).

How can you count numbers from 0 to 15 using the digits 0 to 9? You can’t, so letters are used instead:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > 0 1 2 3 4 5 6 7 8 9 A B C D E F

For larger numbers, the effect is to reduce the number of characters needed to express the number. 3197 in base 16 is C7D (12x256 + 7x16 + 13x1).

Base 36 also uses positional notation more efficiently than decimal representation. Each position holds the numbers from 0 to 35, with the 26 letters for the numbers from 10 to 35 running from A to Z. 3197 in base 36 is 2GT (2x1296 + 16x36 + 29x1).


You raise some good points. One thing about database control numbers is that some software can drop the least significant digits – so if you’re matching plot data with tree data you can make a mistake if you’re not careful.

Another thing we have to watch out for is taxonomic homonyms.

Hi @datafixer, always enjoy your rants :wink:

Personally I think UUIDs (and variants on that theme) are really only useful in one of two situations:

  • you want a unique, semantically opaque, but disposable identifier (e.g., for “session variables” to anonymously identify a user while they interact with a site)
  • you have a distributed system where multiple users are independently creating objects that you want to aggregate (hence you want to ensure everyone is using unique identifiers without having to create a centralised tool that ensures this)

Otherwise they are best avoided.

I note that identifiers that people want others to use and reuse (e.g., DOIs, ORCIDs, RORs) avoid UUIDs and aim instead for relatively simple, human readable identifiers.

@rdmpage, a centralised tool was proposed for Natural Science Identifiers that would link somehow to museum accession numbers. I note, in the discussion that followed, the comment:

An IGSN can be of any length, it is not limited to nine characters (see Syntax Guidelines · IGSN). However, since human operators are involved in many IGSN applications, the advice given by the IGSN Implementation Organization is to keep identifiers short to make it easy to fit IGSN identifiers onto labels or into tables, etc. Nobody wants to copy a 32-digit UUID by hand, and there are other potential sources of transcription errors.

One of the strangest uses of UUIDs I’ve come across is in RU-BIRDS, a bird observation dataset shared with GBIF: https://www.gbif.org/dataset/ba19fc1d-670c-426b-b99d-49f003792ac4. The occurrenceID field uses version 1 UUIDs, but appends “#” and a serial number.

For example, there’s a consecutive series of 331 occurrenceID entries from “3234c922-c6bc-11ec-80c3-96000110f1ee#100” to “3234c922-c6bc-11ec-80c3-96000110f1ee#430”.

If you know how to decode version 1 UUIDs, you learn that this one was minted at 2022-04-28 06:26:54.0114210 UTC (the timestamp counts 100-nanosecond intervals) on a computer with the randomised MAC address 96:00:01:10:f1:ee.
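For anyone curious, here is a short Python sketch of that decoding using the standard `uuid` module. The function name is mine, and stripping everything from “#” onwards is an assumption specific to this dataset's occurrenceID format.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 timestamps count 100-nanosecond intervals from this date.
GREGORIAN_START = datetime(1582, 10, 15, tzinfo=timezone.utc)

def decode_v1(identifier):
    """Return (minting time, MAC-style node address) for a version 1 UUID."""
    u = uuid.UUID(identifier.split("#")[0])  # drop any "#serial" suffix
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # Python datetimes resolve microseconds, so the last 100 ns digit is dropped.
    minted = GREGORIAN_START + timedelta(microseconds=u.time // 10)
    mac = ":".join(f"{byte:02x}" for byte in u.node.to_bytes(6, "big"))
    return minted, mac

minted, mac = decode_v1("3234c922-c6bc-11ec-80c3-96000110f1ee#100")
print(minted)  # 2022-04-28 06:26:54.011421+00:00
print(mac)     # 96:00:01:10:f1:ee
```

All 331 entries in the series share the same UUID, so only the appended serial number distinguishes them.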

I think the best feature of UUID is that there is no need to do a look up when it is generated, it is going to be unique, it is also just a number, and would not need to be changed because a dataset is renamed, a person moved from one university to another etc.

My favorite UUID is actually v5, because it is generated from the content of an entity, and therefore cannot point to anything but that content. For example I use UUID v5 to point to name-strings. This approach saved me trouble when I moved data from one application to another: my IDs were guaranteed to be the same for the same string. The obvious question about it is why not use the name-string itself as the ID. The link provides the rationale for why I chose UUID v5.
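A minimal Python sketch of the idea, using the standard `uuid` module. The namespace below is built from the placeholder domain “example.org” and is purely hypothetical; a real project would fix its own namespace once and never change it.

```python
import uuid

# Hypothetical namespace derived from a placeholder domain name.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")

def name_string_id(name_string):
    """Derive a version 5 UUID from the exact text of a name-string."""
    return uuid.uuid5(NAMESPACE, name_string)

first = name_string_id("Rattus rattus (Linnaeus, 1758)")
again = name_string_id("Rattus rattus (Linnaeus, 1758)")
other = name_string_id("Rattus rattus")
```

Here `first` and `again` are identical because the input text is identical, while `other` differs. The flip side is that any change to the string, even a stray space, produces a different identifier.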


@dimus, your GNA blog post makes a good case for not using scientific names as identifiers, but here I think you are talking about a false dichotomy. The choice in good-practice databasing is not between scientific names as identifiers vs UUIDs as identifiers, it is between UUIDs as identifiers and other computer-generated identifiers for which “there is no need to do a look up when it is generated, it is going to be unique, it is also just a number, and would not need to be changed because a dataset is renamed, a person moved from one university to another etc.”

@dimus, have you considered a non-reversible but content-based hash? A simple one is the 32-bit CRC32:

Corchoropsis tomentosa var. psilocarpa (Harms & Loes.) C.Y.Wu & Y.Tang
714fc454
Corchoropsis tomentosa var. psilocanpa (Harms & Loes.) C.Y.Wu & Y.Tang
7c6a72b8
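In Python a CRC32 of this kind could be computed as below. This is a sketch with my own function name, and it assumes the name-string is encoded as UTF-8, so the values it produces may not match the ones above if a different encoding or CRC variant was used.

```python
import zlib

def crc32_hex(text):
    """Return the CRC32 of a string as 8 lowercase hexadecimal characters."""
    return f"{zlib.crc32(text.encode('utf-8')):08x}"

good = crc32_hex(
    "Corchoropsis tomentosa var. psilocarpa (Harms & Loes.) C.Y.Wu & Y.Tang"
)
typo = crc32_hex(
    "Corchoropsis tomentosa var. psilocanpa (Harms & Loes.) C.Y.Wu & Y.Tang"
)
```

A one-letter difference in the name always changes the CRC32, which is what makes it useful for spotting altered strings.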

My main concern with 32-bit numbers would be the probability of collisions:

"
The surprising thing about collisions is that they become probable much
sooner than you might expect. This is related to the concept of the Birthday
Paradox: in a group of just 23 people, there’s a greater than 50% chance
two people share a birthday.

Collisions work similarly. A 32-bit number can have about 4.3 billion
unique values. Here’s a rough estimate of how collision probability increases:

Around 77,000 numbers: 50% chance of at least one collision.
Around 199,000 numbers: 99% chance of at least one collision.
"
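These estimates can be checked with the usual birthday-problem approximation, p ≈ 1 - exp(-n(n-1)/2N) for n items drawn from N possible values. A small Python sketch, with a function name of my own choosing and assuming the identifiers are spread uniformly at random:

```python
import math

def collision_probability(items, bits):
    """Approximate chance of at least one collision among random IDs."""
    space = 2 ** bits
    # -expm1(-x) is 1 - exp(-x), computed accurately for very small x
    return -math.expm1(-items * (items - 1) / (2 * space))

p_32bit = collision_probability(77_163, 32)
p_names_32bit = collision_probability(30_000_000, 32)
p_names_128bit = collision_probability(30_000_000, 128)
```

With 32 bits, about 77,000 items gives an even chance of a collision, and a set of 30 million items is practically certain to contain some. With 128 bits the probability for the same 30 million items is vanishingly small.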

What I also like about UUIDs is that they are a standard, and databases like Postgres are optimized to work with UUIDs as identifiers.

I tried a 32-bit hash on 30 million name-strings and sorted the results by number of collisions:

15:10:43 ❯ sort ids8.txt |uniq -c |sort |tail
      3 fab4ec24
      3 fabe4374
      3 faf81f84
      3 fb0ddf42
      3 fb535631
      3 fb81063b
      3 fb99f251
      3 ff2aad33
      3 ffe4d67b
      4 ada8099e

Got altogether 195533 collisions

@dimus, thanks for the test. Your 30M-item set obviously needs more than 32-bit identifiers.

In a different kind of check, I tried Nano IDs with 24 characters (not bits) and the alphabet 0-9a-z. At 30M items generated per second, you would need approx 700 years to reach 1% probability of a collision.
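That figure can be reproduced roughly with the birthday approximation turned around: the number of random identifiers needed to reach collision probability p in a space of N values is about sqrt(2N × ln(1/(1-p))). The Python below is a sketch under that approximation, with variable and function names of my own.

```python
import math

def items_for_probability(probability, space):
    """Approximate number of random IDs needed to reach a collision probability."""
    return math.sqrt(2 * space * -math.log1p(-probability))

space = 36 ** 24             # 24 characters from the 36-symbol alphabet 0-9a-z
rate_per_second = 30_000_000
seconds_per_year = 365.25 * 24 * 3600

items = items_for_probability(0.01, space)
years = items / rate_per_second / seconds_per_year
```

Under these assumptions `years` comes out at roughly 700, in line with the estimate above.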


I suspect people tried to use 64 bits at first for non-colliding IDs. Even though 64 bits (about 1.8 × 10^19 values) is probably enough for any realistic database if uniqueness is enforced, the Birthday Paradox pushed them to 128 bits.

Although a UUID doesn’t give you a full 128 bits of identifier space, because each UUID also carries information about its version.