Published 2025-06-02
Part of an ongoing series. (See Part 3 here)
Okay, you’ve built your service, you’ve set in-flight requests to 90% of what you know it can handle (for a given amount of CPU, memory, and network). It’s a great service.
And now it’s time to integrate with another service!
So your service accepts client requests, does useful work, and then replies to the client. And now you want to add a network-call to that procedure.
… Why?
Let’s set aside that line of questioning for a moment and assume you MUST integrate with a non-local service/datastore/cache for $DAYJOB reasons.
Our service is popular, so we’ve deployed it across multiple servers. $DAYJOB SREs have put a load balancer (like HAProxy) in front of our services to spread requests equally across them all. Because our service handles its own work and doesn’t need things like TLS termination (almost always a bad idea), the SREs could do this without our involvement.
Now, $DAYJOB is worried the servers are doing repetitive work, and that they can’t track user patterns to show value to their investors.
Again, let’s set aside the ickiness of the whole concept of tracking users – it’s gross and, depending on how you do it, against the law in some jurisdictions – but let’s say persisting repeat requests is useful, and maybe users are asking to save some kind of operation history. Sure, yeah… the users are asking for it…
If we’re just looking to handle repeat requests, we could cache responses in-memory. This is a time-tested solution and often the first step in making algorithms faster if you find yourself doing repeat work.
As a service, we can do things like store the request parameters as the cache key and set the response as the value. If the keys are the same, we just return the value right out of cache – no additional computation necessary!
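As a rough sketch (the `compute_response` helper below is made up for illustration), the in-memory version can be as simple as a dictionary keyed by the request parameters:

```python
# A minimal in-memory cache: request parameters in, cached response out.
cache: dict[tuple, str] = {}

def compute_response(params: dict) -> str:
    # Stand-in for the service's real (expensive) work.
    return f"processed:{sorted(params.items())}"

def handle_request(params: dict) -> str:
    key = tuple(sorted(params.items()))   # request parameters become the cache key
    if key in cache:
        return cache[key]                 # repeat request: no recomputation needed
    response = compute_response(params)
    cache[key] = response
    return response
```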
But what if the service restarts? We’ll lose all that cached data!
There are many ways to store in-memory objects as a file, but if we’re considering flexibility in our implementation we probably want to stick with a more portable solution than just splatting a cache object on disk.
There’s the option of building a custom key=value format, or converting to JSON, but one of the fastest options is using sqlite. As developers, we get binary integration (it’s a library you include with your application), portability with the application, and the flexibility of SQL and its associated data structures.
With sqlite, we can persist our cache data to a table (maybe with a table design like key, value). Because we’re talking about SQL, we can have multiple tables to store any kind of data we want.
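A minimal sketch of that key/value table using Python’s bundled sqlite3 module (table and column names are just illustrative):

```python
import sqlite3

conn = sqlite3.connect("cache.db")  # a single file on disk next to the service
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")

def cache_get(key: str) -> str | None:
    row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

def cache_put(key: str, value: str) -> None:
    with conn:  # commits on success
        # INSERT OR REPLACE keeps the latest value for a repeated key.
        conn.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value))
```

Now the cache survives a restart, because it lives in a file rather than in process memory.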
Sqlite gives us fast and flexible data storage, but it is only usable in a single instance. There are some options to stream transactions across multiple instances, but sqlite has a single-writer paradigm.
If we want our service instances to do work “only once” across multiple instances, we can’t use a shared file – we need an external service.
As the joke goes, “now you have two problems.”
Sqlite is often compared to postgresql since they’re both free-and-open-source (unlike redis, which others might suggest), so let’s have our service work with postgres.
In order to write data to postgres, postgres has to accept the request, (ideally) ensure the operation doesn’t conflict with anything else it’s doing, do the thing, and then respond to the client.
Doesn’t this sound familiar?
In other words, now that we’ve decided we’re going to have multiple instances talking to a database, we are dealing with two services.
Note: We won’t go into transaction isolation and other SQL-specific parts here.
With two services talking to each other, our model looks like this (with some liberties taken around the query request):
(Step 1) Client -- "foo" --> <Service1> --> "Get/Insert: foo" --> [Postgres]
(Step 2) Client -- "1" <-- <Service1> <-- "1" <-- [Postgres]
(Step 3) Client -- "foo" --> <Service2> --> "Get/Insert: foo" --> [Postgres]
(Step 4) Client -- "1" <-- <Service2> <-- "1" <-- [Postgres]
(Step 5) Client -- "bar" --> <Service2> --> "Get/Insert: bar" --> [Postgres]
(Step 6) Client -- "2" <-- <Service2> <-- "2" <-- [Postgres]
Instead of having our service write and serialize to disk, we now “write and serialize” to Postgres, which means that if we try to write the same value more than once, Postgres just returns the original result. Work is saved at the cost of running two services, and our service instances can now operate in parallel.
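A hedged sketch of that “get or insert” behavior, assuming a table like requests(id SERIAL PRIMARY KEY, key TEXT UNIQUE) and the psycopg2 driver (both are my choices here, not a requirement):

```python
import psycopg2  # any Postgres client library works the same way

conn = psycopg2.connect("dbname=example")  # illustrative connection string

def get_or_insert(key: str) -> int:
    """Return the id for `key`, inserting it only if we haven't seen it before."""
    with conn, conn.cursor() as cur:
        # ON CONFLICT DO NOTHING turns a repeat write into a no-op.
        cur.execute(
            "INSERT INTO requests (key) VALUES (%s) ON CONFLICT (key) DO NOTHING RETURNING id",
            (key,),
        )
        row = cur.fetchone()
        if row is None:
            # The key already existed, so hand back the original result instead.
            cur.execute("SELECT id FROM requests WHERE key = %s", (key,))
            row = cur.fetchone()
        return row[0]
```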
We’re web scale! 🤮
Postgres will give us feedback when it is under load: queries can take longer to complete, connections to the database are limited in number, and we can run bad queries. For a single service or client talking to the database, this isn’t a significant problem.
Let’s say we have hundreds of service copies talking to our single Postgres server. This replicates our original client-service issue but now our service is also a client. We must manage our use of postgres while still providing our clients the mechanisms they expect, which means we get to talk about retries, the “thundering herd” problem, and “race conditions.”
As an intermediate service, our processor needs to determine what to do when Postgres has a failure during a transaction.
In this example, we leverage transactions to ensure our writes to the database only happen under the conditions we provide (namely, that the value doesn’t already exist). If instance-1 and instance-2 of our service try to write “foo” to our database simultaneously, one instance will succeed and the other will fail. For the succeeding instance, the answer is easy: send the response to the client.
But what about the instance that failed?
Let’s say instance-2 failed – there are at least two options: we can retry at the instance-level, or we can pass the error back to the client.
Let’s say we want the service to handle retries – after all, the client already submitted the request. We accepted their request, and so we feel responsible for making it happen.
(This seems to be a common feeling among many.)
If we follow this path, client response time is dependent on how quickly we can perform the “get or insert” function described above.
In the case of a race condition, we’d see the client response time for a request that “lost” the race take two or three times longer than the one(s) which succeeded. As a more concrete example, if the successful operation takes ~50 milliseconds, a failed operation could take up to 150ms.
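A minimal sketch of that service-side retry loop – ConflictError, with_retries, and the attempt limit are placeholder names for illustration, not anything Postgres-specific:

```python
import random
import time

class ConflictError(Exception):
    """Placeholder for whatever the database client raises on a lost race."""

MAX_ATTEMPTS = 3  # illustrative bound on service-side retries

def with_retries(operation, *args):
    # Each attempt costs roughly one round trip to the database, so a request
    # that loses the race pays a multiple of the ~50ms happy path before the
    # client hears anything back.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return operation(*args)
        except ConflictError:
            if attempt == MAX_ATTEMPTS:
                raise  # give up and surface the error to the caller
            time.sleep(random.uniform(0, 0.05 * attempt))  # jittered backoff
```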
Compare that to the other option, where either a successful or unsuccessful response is returned to the client within ~50 milliseconds: the successful request gets a return value in its reply, while the “losers” of the race get an error code within the same time frame.
So, upon encountering the error, we pass back an error code to the client such as HTTP 409 - Conflict or 412 - Precondition Failed. One could even “misuse” the status code and send a 429 - Too Many Requests to encourage the client to retry their operation.
I say “misuse” the 429 code because it’s really not the case that the server couldn’t handle more requests, but the client doesn’t know that.
The client can either abandon its request or try it again. If it tries again with the same content, it gets a successful return from the server (as the information is already present, rather than being newly written to the storage system). This is a rather simple implementation on both the server side and the client side; however, it does “make the client do more work.”
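From the client’s side, that “more work” can be as small as one conditional retry. A sketch using the requests library (the endpoint below is made up for illustration):

```python
import requests  # assumed HTTP client; the endpoint below is hypothetical

def submit(value: str) -> str:
    resp = requests.post("https://example.invalid/items", json={"key": value})
    if resp.status_code in (409, 412, 429):
        # We lost a race (or the server is pushing back). One retry is enough:
        # the data now exists, so the repeat request is a cheap read.
        resp = requests.post("https://example.invalid/items", json={"key": value})
    resp.raise_for_status()
    return resp.text
```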
Every time I bring up “let the client retry,” people seem to get worried, as if clients doing anything beyond asking for things exactly once is bad. I refer objectors to the MIT vs New Jersey school reference article and point out that if the client knows what’s happening, then they can make the choice for you instead of requiring the developer to handle every single retry/reproduce case.
Whichever solution we chose above, we’ve defined a backpressure mechanism for our system – the mechanism controls the inflow of information and prevents overload of any one component. Now to test it.
Let’s say we become extraordinarily popular: our service has hundreds of instances, and multiple hundreds of active clients. This is great news! Popularity means we are useful and/or generating revenue – good things for us as operators.
But then Something Happens and our database has a momentary outage. It comes right back online and can resume servicing requests, but our clients never stopped submitting data – we’re about to receive hundreds of concurrent, possibly duplicative, requests and Postgres might not be able to handle it.
If our service is handling retries, then for each request to the service, we will send up to N requests to Postgres to fulfill the request.
If our clients are handling retries, then for each request to the service, we will send exactly one request to Postgres, and return an error to the client if it fails.
In the first case, our service amplifies every client request to the database, whereas the second does not – each transaction is a fixed multiple (1x or 2x) of each client request. If a service amplifies request volumes, we’re facing a situation where a denial of service (DoS) can be achieved from within our service, with “normal” request traffic.
If we have a flood of requests coming from outside our service arrangement (i.e., from our clients directly), then we can either scale up our service and postgres to handle more traffic, or we can decline to scale up and allow the excessive-use responses to push back on clients, letting them handle retries. (And, if any one client is being abusive, we can talk about rejecting that one client instead of disrupting the whole service.)
As shown above, when systems degrade or fail, the way the system handles loss of function directly affects both the system itself and the clients using the system. By making design decisions that limit the load on each component, we build our system for resilience in the face of degradations and failures.