p99 Latency Tuning in Redis-Based Applications

By Parminder Singh

Lately, I have been working on a high-throughput API where most reads are served from Redis. Redis is already fast, but my goal was to understand how low I could push p99 latency, and more importantly, how stable I could make the tail. This post is about end-to-end p99 tuning for a Redis-backed API. Some optimizations target Redis directly. Others target the client, request shape, and network behavior. While I use Redis here, these optimizations targeting the network stack, request shape, and client-side caching apply to almost any distributed system.

p99

p99 (99th percentile) means that 99% of requests complete faster than this number. The remaining 1% are the slowest requests, also called the tail. At 10,000 requests per second, that 1% translates to 100 slow requests every second.

Tracking p99 helps understand consistency. It shows how often users hit delays and how unpredictable the system feels. Average latency hides this behavior and can look healthy even when a subset of users is experiencing slow responses.
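To make the definition concrete, here is a minimal percentile computation over raw latency samples (an illustrative sketch; real deployments typically use histograms such as HdrHistogram rather than sorting raw samples):

```javascript
// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// 100 samples: 99 fast requests and one slow outlier.
const samples = [...Array(99).fill(1), 107];
console.log(percentile(samples, 50));  // 1
console.log(percentile(samples, 99));  // 1
console.log(percentile(samples, 100)); // 107
```

Note how the average here is 2.06 ms while the worst request took 107 ms: the mean looks healthy even though one user in a hundred had a bad experience.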

Baseline

The API server is written in Node.js. Redis and the API run in separate containers on the same host.

Please note that all latency numbers below are end-to-end API latency and not Redis command execution times. Redis GET itself is sub-millisecond. The tail comes from scheduling, networking, client behavior, and request shape.

Docker setup

version: '3.8'

services:
  redis-p99:
    image: redis:8-alpine
    container_name: redis-p99
    ports:
      - "9899:6379"
    command: ["redis-server", "--loglevel", "warning"]

  api-server:
    build: .
    container_name: api-p99
    environment:
      # Container-to-container traffic uses the container port (6379),
      # not the host-mapped port (9899)
      - REDIS_URL=redis://redis-p99:6379
      - NODE_ENV=production
    depends_on:
      - redis-p99
    ports:
      - "3000:3000"

API handler

const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis(process.env.REDIS_URL);

app.get('/api/data', async (req, res) => {
  const userId = req.query.userId;
  const user = await redis.get(`user:${userId}`);
  const userPosts = await redis.get(`user:${userId}:posts`);
  res.json({ user, userPosts });
});

app.listen(3000);

Benchmark script

const autocannon = require('autocannon');

function runTest(name) {
  const instance = autocannon({
    url: 'http://localhost:3000/api/data',
    connections: 100,
    duration: 30,
    pipelining: 1,
    title: name
  }, (err, result) => {
    if (err) {
      console.error(err);
      return;
    }
    console.log(`>>Results for: ${name}`);
    console.log(`Avg Latency: ${result.latency.average} ms`);
    console.log(`p99 Latency: ${result.latency.p99} ms`);
    console.log(`Total Requests: ${result.requests.total}`);
    console.log(`Requests/Sec: ${result.requests.average}`);
  });

  autocannon.track(instance, { renderProgressBar: true });
}

Baseline results

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 1 ms │ 1 ms │ 3 ms  │ 3 ms │ 1.33 ms │ 2.28 ms │ 107 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘

That 107 ms is not just Redis and includes end-to-end scheduling noise, GC pauses, and network jitter. The goal from here is to minimize max latency and stabilize the tail, which directly improves consistency in user experience.

Network path optimization (if co-location is possible)

Because Redis and the API run on the same host, the first thing I tested was switching from TCP to Unix Domain Sockets (UDS).

UDS avoids TCP/IP overhead and reduces jitter introduced by parts of the network stack. It does not bypass the kernel, but it removes several sources of variance.

Code change

const redis = new Redis({
  path: process.env.REDIS_SOCKET_PATH
});

Redis config

services:
  redis-p99:
    image: redis:8-alpine
    user: "999:1000"
    ...
    # unixsocketperm 770 lets the API container (uid 1000, gid 1000) reach the
    # socket created by the Redis user (999:1000) through the shared group
    command: ["redis-server", "--unixsocket", "/dev/shm/redis.sock", "--unixsocketperm", "770"]
    ...
    volumes:
      - /dev/shm:/dev/shm
      
  api-server:
    user: "1000:1000"
    ...
    environment:
      - REDIS_SOCKET_PATH=/dev/shm/redis.sock
      - NODE_ENV=production
    ...
    volumes:
      - /dev/shm:/dev/shm  
  

Using /dev/shm for the socket file is just a convenience; the main win comes from UDS itself.

The key changes:

  • Shared volume for the socket file
  • Compatible user permissions between containers

Results

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼───────┤
│ Latency │ 1 ms │ 1 ms │ 3 ms  │ 3 ms │ 1.33 ms │ 2.08 ms │ 99 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴───────┘

The standard deviation dropped from 2.28 ms to 2.08 ms (more predictability), and the absolute ceiling (Max) fell below 100 ms.

Colocation is not always acceptable, since it couples failure domains. When Redis has to sit across a network, reducing and controlling network variance is key.

Some strategies:

  • Avoid cross-AZ traffic
  • Reduce round trips aggressively
  • Use stable, long-lived connections
  • Bound retries and timeouts
  • Pack data to reduce chattiness
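As a sketch of the last few points, assuming ioredis as the client, these bounds map onto client options (`connectTimeout`, `commandTimeout`, `maxRetriesPerRequest`, `retryStrategy`); the specific values below are illustrative, not recommendations:

```javascript
// Capped linear backoff; returning null tells ioredis to stop reconnecting.
function boundedRetry(times) {
  if (times > 5) return null;          // give up after 5 reconnect attempts
  return Math.min(times * 50, 500);    // ms until the next attempt
}

const clientOptions = {
  connectTimeout: 500,      // ms to establish a connection before failing
  commandTimeout: 100,      // cap on how long any single command may wait
  maxRetriesPerRequest: 1,  // avoid unbounded per-command retries
  retryStrategy: boundedRetry,
};

// const redis = new Redis(clientOptions);
```

Bounding retries and timeouts this way trades a few extra errors for a tail that cannot grow without limit.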

Pipelining

Pipelining brings two improvements:

  1. In the earlier code, I had two sequential await calls. Each await redis.get() yields back to the Node.js event loop. If the loop is busy with other tasks, the next Redis call is delayed. With pipelining, we only await once.

Old code:

const user = await redis.get(`user:${userId}`);
const posts = await redis.get(`user:${userId}:posts`);

New code:

const results = await redis.pipeline()
  .get(`user:${userId}`)
  .get(`user:${userId}:posts`)
  .exec();
// ioredis returns an [error, value] pair per command
const [user, posts] = results.map(([err, value]) => value);

  2. Pipelining collapses multiple commands into a single RTT (round trip time): only one network interaction is required, and the GETs are batched.

Pipelining doesn't just save RTTs, it also reduces interrupt frequency. Processing one large buffer is more CPU-efficient than processing multiple smaller ones. This lowers the load on the Node.js event loop, preventing the loop from being blocked while the next request is waiting.
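Related to this, ioredis also offers an `enableAutoPipelining` option that coalesces commands issued in the same event-loop tick into one pipeline automatically. A sketch (`getUserData` is a hypothetical helper, not from the original handler):

```javascript
// With autopipelining enabled, commands issued in the same event-loop tick
// are sent as a single batch without calling .pipeline() by hand:
//
//   const redis = new Redis({ enableAutoPipelining: true });

// Concurrent GETs; with autopipelining they share one round trip.
async function getUserData(redis, userId) {
  const [user, posts] = await Promise.all([
    redis.get(`user:${userId}`),
    redis.get(`user:${userId}:posts`),
  ]);
  return { user, posts };
}
```

This keeps call sites looking like ordinary awaits while preserving the batching benefit.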

RESP3 and Client-Side Caching

The fastest network call is the one you never make. Server-Assisted Client-Side Caching (via RESP3) allows the client to keep a local copy of the data while Redis tracks which keys the client has "subscribed" to. When a key changes, Redis pushes an invalidation message to the client. This allows for near-zero latency reads without the classic stale cache problem.

If the connection between Redis and the API drops, invalidation messages may be lost. This can be mitigated by giving local entries a max age/TTL. Memory use also has to be bounded (LRU or TTL-based eviction).

To use RESP3, configure the client:

const redis = new Redis({
  path: process.env.REDIS_SOCKET_PATH,
  protocol: 3
});

Also, RESP3 returns native data types (like Maps), which means the client process spends fewer CPU cycles parsing data.

Enabling tracking

await redis.client('TRACKING', 'on');

Handling invalidations

redis.on('push', (message) => {
  if (message.type === 'invalidate') {
    message.data.forEach(key => localCache.delete(key));
  }
});
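One way to implement the max-age mitigation mentioned earlier is a small local cache that treats expired entries as misses, so data cannot stay stale forever even if an invalidation message is lost. A minimal sketch (a plain Map with TTL checks, not a full LRU):

```javascript
// Local cache with a max age; expired entries count as misses.
class LocalCache {
  constructor(maxAgeMs) {
    this.maxAgeMs = maxAgeMs;
    this.entries = new Map(); // key -> { value, storedAt }
  }
  set(key, value) {
    this.entries.set(key, { value, storedAt: Date.now() });
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() - entry.storedAt > this.maxAgeMs) {
      this.entries.delete(key); // too old: treat as a miss
      return undefined;
    }
    return entry.value;
  }
  has(key) {
    return this.get(key) !== undefined;
  }
  delete(key) {
    this.entries.delete(key);
  }
}

const localCache = new LocalCache(5000); // entries live at most 5 s
```

The 5-second max age here is arbitrary; in practice it should reflect how much staleness the data can tolerate.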

Updated request flow

const userKey = `user:${userId}`;
const postsKey = `user:${userId}:posts`;

if (localCache.has(userKey) && localCache.has(postsKey)) {
  return res.json({
    user: localCache.get(userKey),
    posts: localCache.get(postsKey),
    source: 'local-memory'
  });
}

const results = await redis.pipeline()
  .get(userKey)
  .get(postsKey)
  .exec();

// Each pipeline result is an [error, value] pair
localCache.set(userKey, results[0][1]);
localCache.set(postsKey, results[1][1]);

Results

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼───────┤
│ Latency │ 1 ms │ 1 ms │ 3 ms  │ 3 ms │ 1.27 ms │ 1.51 ms │ 41 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴───────┘

The system is now significantly more deterministic. By removing the network from the critical path for repeated reads, the Max latency dropped by over 60% (from 107 ms to 41 ms), and the standard deviation reached a much healthier 1.51 ms.

At a high level

  • Reduce round trips and batch network calls
  • Reduce interrupt frequency and event loop blocking
  • Use client-side caching
  • Control network variance