aottr

Designing a Fair Booking System Is Harder Than It Looks

, 896 words, 5 minutes to read
TL;DR: Why naive SQL and job queues both failed for fair, fast hotel booking—and how atomic Postgres selection with row locking fixed it until scale pushes you to rethink again.

Imagine you’re organizing an event and open the registration.

100 users want to get a room at the same time. You have 30 rooms…What could go wrong? … a lot ^^

The Problem

As an urgent request from a local convention, I was asked to write a hotel booking system with a few constraints:

At first glance, this might sound simple.

The (very) Naive Approach

The most straightforward implementation looks like this:

  1. Query for a room that is not occupied yet (user_id = NULL)
  2. Assign it to the user
  3. Done

I hope it does not need to be explained why this is not a great idea. The query to get a room will with a high confidence return the same room id for parallel requests.

The race condition appears when both users get the same room as response and try to claim it.

The (still) Naive Approach

An improvement of this implementation would be some kind of randomization logic:

  1. Query available rooms
  2. Pick one randomly
  3. Assign it to the user

Why does this still cause an issue with concurrency?

Two users can still pick the same room, especially with decreasing number of available quota, what the possibility per room significantly increases.

The race condition still applies and the user gets a failing request.

The “Obvious” Approach: Use a Queue

The thought process was simple:

Let’s just serialize the booking process.

Since I was writing my backend with NestJS, I considered writing a queue system with e.g. BullMQ:

Problem solved? Not really…

Te queue did fix consistency, but it introduced a long latency. Sizing a queue is difficult enough, but maintaining a level of fairness made it feel impossible.

The direct problem here: There are different room tiers, cheaper rooms, more expensive rooms, two beds, three beds, King size, Queen size etc…

While one user A might click on a room and wait for his position in the queue to pass, another user B might take another room tier and get assigned instantly. While being in the queue, the room tier of user A gets sold out before he got a spot, leaving him with no room and a redirect to the room selection. User A might have been fine with the same room tier as user B, but since the feedback, that the original room tier was sold out took multiple seconds, this possibility to switch might be gone now too.

Queues are a wonderful tool to de-couple workloads while being able to maintain a feedback loop about the progress. But queue sizing is not easy and it is not reliable to get immediate response.

The “real” Problem

This is where the interesting part began, and it already sounded over-engineered for the requester…

This booking system did not just need optimization in one specific direction. It needed a balance of three competing goals:

Imagine one of these Radar charts, one goal pulling on two others and the solution is to find a middle-ground.

Rethinking the Approach

Instead of pushing the problem into a queue, I moved the logic closer to the database. I use postgres btw.

The idea: Do selection and assignment atomically inside a transaction

  1. Start a transaction
  2. Select a random available room
  3. Lock it
  4. Assign the User
  5. Commit

Inside the transaction, I select a room with the following statement:

WITH available_rooms AS (
  SELECT r.id FROM "Room" r
  LEFT JOIN "RoomBooking" rb ON r.id = rb."roomId"
  WHERE "roomCategoryId" = ${roomCategoryId}
    AND rb."roomId" IS NULL
  ORDER BY id ASC
)

SELECT id FROM available_rooms
LIMIT 1 
OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM available_rooms))
FOR UPDATE SKIP LOCKED;

This was followed with an update statement on Room and the creation of a RoomBooking for the user inside the same transaction.

Verdict

Benefits

Tradeoffs

But for this project with around 1000–1500 users and a few hundreds of rooms, the balance was far better than a queue.

What I’d improve next

If we decide to keep this project up for the growing convention, I’d consider:

Final Thoughts

The biggest lesson here wasn’t about SQL transactions or queues. Locking-mechanisms are basics in concurrency.

It was this:

Fixing race conditions is easy.

Designing a system that feels fair and snappy is not.

And sometimes, the most “sophisticated” solution is not the best one…