The importance of a good algorithm to deterministically get the primary key of your ...

The importance of a good algorithm to deterministically get the primary key of your records can never be underestimated.

Database admins usually put a strong emphasis on having non-deterministic primary keys for your records - usually in the form of an auto-increment or UUID field.

After spending many, many years designing storage systems (using at least a dozen or so of different frameworks, from Hibernate, to SQLAlchemy, to MyBatis, to Exposed, to Perl DBI, to plain old JDBC and PHP mysql_* functions, and everything in between), I feel like I start to disagree more and more with that approach.

Implementing the entities engine in Platypush using auto-increment ints all over the place has been a huge pain in the butt, as I had to basically do the job of the framework in scoping sessions, caching IDs, wiring existing IDs, figure out if something needed to be queried, inserted or updated, etc., as well as all the possible corner cases. I eventually had to implement my own enqueuer of database requests too in order to avoid inter-process race conditions when doing the insert+select dance to retrieve the primary key.

I used SQLAlchemy, but when it comes to solving the upsert-ignore + select the ID problem in a database-agnostic way you're basically on your own regardless of the framework. All of them suck, literally. You'd better not use a framework at all, because a framework built around concepts like transactions, sessions and lazy query/init is incapable of solving the insert+select in an efficient way, and it will only come in your way.

As I've been recently tasked in Booking of implementing a large scale storage that needs to ingest billions of events per day, I've still decided to give the auto-increment int or auto-generated UUID approach a go.

And I found out that there's no database-agnostic way of doing it even in a modern Kotlin framework like Exposed. You wanna get the ID of the record you just inserted in a reliable way? Sure, just commit your transaction and run another select, and start another transaction just to wire that ID to another record who needs it as a foreign key.

Needless to say, as a data performance geek the idea of committing a transaction to force a session flush, and running another select just to get the ID of the stuff I just inserted, horrifies me.

Multiply it by all the foreign key ID-based relations that a record usually has, and multiply it by the number of records that you have to ingest, and I'm wondering why there aren't more engineers who are as horrified as me by the idea of using auto-generated keys in a DAO or ORM framework. It's a performance nightmare, and it's really the main single reason why so many large-scale data systems are slow nowadays.

So I've decided to throw that out of the window and using non-cryptographic deterministic hashes based on a record's "logical key" - which in my case it just so happens to be a simple URL.

I'm now implementing my primary keys as FNV-1 hashes of my record URLs. As FNV-1 is flexible enough when it comes to hash size, I can generate 16 bytes hash, and that's quite handy because a 16 bytes hash can easily fit into a UUID field, which most of the dbs know how to handle as a primary key.

So the application itself deterministically generates its primary keys as objects are created, and we can use those IDs as foreign keys in other records straight away without going through the insert+select pain.

We just need to run a INSERT IGNORE (if the record with that key is already there do nothing, otherwise insert it), and that's it.

There are some potential downsides I see though:

1. Excess of database writes - we basically have to run an INSERT IGNORE for every record. However, those INSERTs will only target one single record by primary key, and in most of the cases they should go down the IGNORE path.

2. Key partitioning/sharding - but that's the same problem that you've also got with completely randomly generated UUIDs.

3. Chance of collisions: as the key size shrinks, and the volume of ingested data increases, the probability of collisions goes up. This is something I'd like to prove mathematically though. Is the chance of collisions for a 16 bytes hash any higher than a 16 bytes randomly generated UUID? I'm honestly not sure if there's already some research in the topic.

Have any of you data geeks encountered the same issues when designing software that operates with an RDBMS? Which solutions do you prefer for these problems?

Fabio Manganiello on Nostr: The importance of a good algorithm to deterministically get the primary key of your ...