Counting the Countless
A Case Study on Handling Probabilistic Increments in Large-Scale Systems 🎲
Let’s consider a hypothetical scenario.
You work at Docker as a software engineer on the registry team. Every time somebody in the world does a docker pull of a given image, a system increments a counter for this image. The rationale? Displaying the value in Docker Hub for users:
Docker Hub calls an API (/v2/repositories/airspace/android/) that returns the exact pull count, 4,361. Yet, Docker Hub displays a rounded value of 4.4k. Indeed, the value shown is exact for images pulled fewer than 1,000 times. Above that threshold, the value is rounded.
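For illustration, here is what such display-side rounding could look like in Go (formatPulls is a hypothetical helper; Docker Hub's actual formatting logic isn't public):

package main

import "fmt"

// formatPulls mimics the displayed value: exact below 1,000 pulls,
// rounded to one decimal place above (hypothetical logic, for illustration).
func formatPulls(count int64) string {
    switch {
    case count < 1_000:
        return fmt.Sprintf("%d", count)
    case count < 1_000_000:
        return fmt.Sprintf("%.1fk", float64(count)/1_000)
    default:
        return fmt.Sprintf("%.1fM", float64(count)/1_000_000)
    }
}

func main() {
    fmt.Println(formatPulls(4_361)) // prints 4.4k
}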
The current implementation is the following:
Let’s say the DB is a relational database. So, during a docker pull:
- The registry emits an event (Kafka, AWS Lambda, or whatever) which is then caught by service W (W for Write)
- Service W opens a new transaction, increments the pull count for the specific image, and commits (see the sketch below)
When a user navigates to the page of a specific image on Docker Hub:
- Service R (R for Read) retrieves data from the DB, which includes the pull count, and returns the response
- Docker Hub displays an approximation of this value (for example, as we have seen: 4.4k instead of 4,361)
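As a rough sketch of the write path described above (the pulls table, its columns, and the database/sql wiring are assumptions on my part), the increment logic of service W might look like this:

package registry

import "database/sql"

// increment opens a transaction, increments the pull count of the given
// image, and commits: one DB round trip per docker pull.
func increment(db *sql.DB, image string) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op if the transaction was committed

    if _, err := tx.Exec(
        `UPDATE pulls SET count = count + 1 WHERE image = $1`, image,
    ); err != nil {
        return err
    }
    return tx.Commit()
}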
At some point, the team realizes that performing a new transaction for every pull is an issue. It works well for the vast majority of images, which are pulled a limited number of times. Yet, for popular images such as Ubuntu, which is pulled hundreds of times per second (more than 7 billion times in total), the implementation is not optimal and puts a lot of pressure on the database.
As we said initially, you work at Docker as a software engineer, and you are in charge of proposing a new design. The question is: what do you do?
Take a moment to think about one or multiple possible solutions before scrolling down.
Probabilistic Increment
Let’s think about what is required for a second. We have seen that Docker Hub doesn’t display the exact pull count; it displays an approximation. Indeed, for a user, it probably doesn’t matter much whether an image has been pulled 1,002,156 times or about 1 million times. Hence, if service R returns an approximation of the result (and if this approximation is accurate enough), that would be perfectly fine for Docker Hub.
Therefore, how about a probabilistic solution?
Before delving into the solution, let’s first define what a probabilistic algorithm means. An algorithm is probabilistic if it incorporates an element of randomness in its operation to produce an approximation with a certain level of confidence. By definition, such an algorithm isn’t deterministic: it may produce different results when run multiple times on the same input.
Coming back to our problem: producing an approximation could be a good fit, as long as this approximation is good enough.
We could think about the following solution. Instead of incrementing the counter during every single pull (hence, opening and committing a transaction every single time), we could:
- Define a value N.
- Generate a random integer from 1 to N. If the value is 1, we increment the counter by N. Otherwise (if the value is between 2 and N), we do nothing.
Assuming an increment function that contains the actual logic (open transaction, increment, commit), here is the implementation:
if randomInt(1, N) == 1 { // true with probability 1/N
    increment(N) // a single transaction that adds N to the counter
}
Overall, it’s a single line of code that wraps the existing increment logic. On average, only one call in N triggers the increment, and each increment adds N, so the expected value of the counter equals the true number of pulls. Therefore, we also reduce the calls to the DB by a factor of N.
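To build some intuition, here is a small, self-contained simulation (the values of N and the number of pulls are arbitrary) showing that the counter stays close to the real number of pulls while the writes drop by a factor of N:

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    const n = 100           // N: increment granularity
    const pulls = 1_000_000 // simulated docker pulls

    counter, writes := 0, 0
    for i := 0; i < pulls; i++ {
        // rand.Intn(n) is uniform over [0, n), so this branch is
        // taken with probability 1/n.
        if rand.Intn(n) == 0 {
            counter += n // this would be the single DB transaction
            writes++
        }
    }
    fmt.Printf("pulls=%d counter=%d writes=%d\n", pulls, counter, writes)
}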
Wrap Up
Applying such an approach should be handled with care. Perhaps we can decide to keep the exact increment for unpopular images, so their value stays exact, and apply the probabilistic solution only to popular images (more than X existing pulls).
Furthermore, choosing N, the frequency of the actual increments, should also be done carefully. The larger N is, the more we reduce the calls to the DB, but the larger the variance will be. Said differently, the larger N is, the less accurate the pull count will be. Perhaps we could also scale N according to the number of pulls. For example, start with N = 10 from 1,000 pulls, then increase N to 100 from 100k pulls, etc.
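As a minimal sketch of that idea (the chooseN helper and its thresholds are hypothetical), such scaling could look like this:

// chooseN returns the increment granularity N based on the
// current pull count (hypothetical thresholds).
func chooseN(pullCount int64) int {
    switch {
    case pullCount < 1_000:
        return 1 // exact increments for unpopular images
    case pullCount < 100_000:
        return 10
    default:
        return 100
    }
}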
Of course, some competing solutions exist; the probabilistic increment isn’t the only available option. For example, we could add a caching layer in which we would increment a counter and periodically retrieve this value to update the primary store. Yet, such an approach would require way more engineering work than the probabilistic solution discussed.
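For completeness, here is a minimal in-process sketch of that buffer-and-flush idea (the flush period and the flushToDB callback are assumptions; a production version would more likely use a shared cache such as Redis):

package counter

import (
    "sync"
    "time"
)

var (
    mu      sync.Mutex
    pending = map[string]int64{} // image name -> buffered pulls
)

// recordPull buffers one pull in memory instead of opening a transaction.
func recordPull(image string) {
    mu.Lock()
    pending[image]++
    mu.Unlock()
}

// flushLoop periodically drains the buffer into the primary store.
func flushLoop(flushToDB func(image string, delta int64)) {
    for range time.Tick(10 * time.Second) {
        mu.Lock()
        batch := pending
        pending = map[string]int64{}
        mu.Unlock()
        for image, delta := range batch {
            flushToDB(image, delta) // e.g., count = count + delta
        }
    }
}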
In summary, let’s bear in mind that probabilistic solutions exist. For example, HyperLogLog approximates the number of distinct elements in a set. At scale, probabilistic solutions can be an efficient way to deal with large traffic and alleviate the load on the backing systems. 🎲
Thanks for reading. I’m really interested to know whether you have already used a similar solution. Let me know in the comments.