# I/O Schedulers: Why the Kernel Reorders Your Writes
A deep dive into I/O schedulers, Direct I/O, io_uring, and AWS EBS optimization: block layer internals for predictable storage latency.
The kernel is reordering your writes. You asked for A then B. The disk received B then A.
I/O schedulers optimize for throughput by batching and reordering requests. For trading audit logs, this means your write queues behind background activity: 100µs+ added to what should be a 10µs operation.
This post covers the Linux block layer, why defaults hurt latency, and how to get predictable storage performance.
## The Problem {#the-problem}
Default storage settings prioritize throughput:
| Default Behavior | Why It Exists | Latency Impact |
|---|---|---|
| I/O schedulers | Batching/reordering for HDD seeks | 1-10ms queueing |
| Page cache buffering | Write coalescing | Unpredictable flush timing |
| Request merging | Fewer I/O operations | Delay while accumulating |
| EBS burst behavior | Cost optimization | Variable IOPS |
For CPU-related storage interactions, see CPU Deep Dive. For memory interactions (page cache), see Memory Deep Dive.
## Background: Block Layer Internals {#background}
### The Block I/O Path
When you call write(), the path is (block/blk-core.c):
```text
write() syscall
    ↓
VFS layer (file operations)
    ↓
Page cache (unless O_DIRECT)
    ↓
Filesystem (ext4, xfs)
    ↓
Block layer (I/O scheduling)
    ↓
Device driver (nvme, sd)
    ↓
Hardware
```
**The block layer's job:** Convert file-level operations to block-level operations, queue them efficiently, and submit to hardware.
### I/O Schedulers
I/O schedulers ([block/mq-deadline.c](https://github.com/torvalds/linux/blob/master/block/mq-deadline.c), etc.) reorder requests to improve throughput. Historical context:
| Scheduler | Era | Design Goal |
|-----------|-----|-------------|
| CFQ | HDD era | Fair bandwidth between processes |
| Deadline | HDD era | Bounded latency with reordering |
| BFQ | Modern | Proportional bandwidth |
| mq-deadline | Multi-queue | Deadline for NVMe |
| none | Modern | Pass-through (no scheduling) |
**Why reordering helped HDDs:** Seeks take 5-10ms. Reordering requests to minimize head movement saves time.
**Why reordering hurts NVMe:** NVMe has no seek penalty. Random I/O is as fast as sequential. Scheduling overhead is pure latency addition.
### Multi-Queue Block Layer
Modern kernels use the multi-queue block layer (blk-mq, [block/blk-mq.c](https://github.com/torvalds/linux/blob/master/block/blk-mq.c)):
```text
Per-CPU software queues
↓
Hardware dispatch queues
↓
NVMe submission queues
```
Each CPU has its own queue, reducing lock contention. But schedulers still operate between software and hardware queues.
## Fix 1: I/O Scheduler Selection {#scheduler}
### The Problem
Even on NVMe, some distributions default to `mq-deadline`:
```bash
cat /sys/block/nvme0n1/queue/scheduler
# [mq-deadline] kyber bfq none
```
Every scheduler adds overhead; even minimal scheduling adds microseconds.
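To audit every block device at once, a small helper can read the sysfs files directly (a sketch assuming the standard Linux `/sys/block` layout; the active scheduler is the bracketed entry):

```python
import glob

def active_scheduler(line: str) -> str:
    """Parse a /sys/block/*/queue/scheduler line; the active choice is bracketed."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return line.strip()  # devices with a single option list it bare

def audit_schedulers() -> dict:
    """Map each block device to its currently active I/O scheduler."""
    result = {}
    for path in glob.glob("/sys/block/*/queue/scheduler"):
        dev = path.split("/")[3]
        with open(path) as f:
            result[dev] = active_scheduler(f.read())
    return result

if __name__ == "__main__":
    for dev, sched in sorted(audit_schedulers().items()):
        print(f"{dev}: {sched}")
```

Run it after applying the udev rule to confirm every NVMe device reports `none`.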
### The Fix
```bash
# Check current
cat /sys/block/nvme0n1/queue/scheduler
# Set to none (bypass scheduling)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
```
**Persistent via udev:**
```bash
# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
```
### Why It Works (Kernel Internals)
The `none` scheduler ([block/blk-mq-sched.c](https://github.com/torvalds/linux/blob/master/block/blk-mq-sched.c)) does minimal work:
```c
// With none scheduler:
blk_mq_request_bypass_insert() // Direct to hardware queue
```
No reordering, no batching, minimal overhead.
### Expected Improvement
**Eliminates 1-10ms scheduler queueing** on NVMe. For HDDs, `mq-deadline` may still be better.
## Fix 2: Direct I/O {#direct-io}
### The Problem
Standard writes go through the page cache:
```text
write() → page cache → (later) disk
```
"Later" is unpredictable. Background writeback (historically the `pdflush` threads, now per-device flusher threads) decides when to flush based on:
- dirty_ratio thresholds
- dirty_expire_centisecs age
- Memory pressure
**For audit logs:** You call `write()`, return to trading. 500ms later, background writeback stalls your trading thread while flushing.
### The Kernel Mechanism
O_DIRECT ([fs/direct-io.c](https://github.com/torvalds/linux/blob/master/fs/direct-io.c)) bypasses the page cache:
```text
write() with O_DIRECT → disk immediately
```
**Requirements:**
- Buffer must be aligned (typically 512 bytes or 4KB)
- Size must be multiple of block size
- No write coalescing benefit
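The alignment bookkeeping is mechanical and worth keeping out of the hot path. A sketch (the helper names are mine, assuming a 4KB logical block size; query the device's actual block size in production):

```python
import ctypes

BLOCK = 4096  # assumed logical block size

def round_up(n: int, align: int = BLOCK) -> int:
    """Round a byte count up to the next multiple of the block size."""
    return (n + align - 1) & ~(align - 1)

def aligned_buffer(size: int, align: int = BLOCK) -> ctypes.Array:
    """Allocate a block-aligned, block-size-multiple buffer for O_DIRECT."""
    size = round_up(size, align)
    raw = ctypes.create_string_buffer(size + align)  # over-allocate by one block
    offset = (-ctypes.addressof(raw)) % align        # shift to the next boundary
    return (ctypes.c_char * size).from_buffer(raw, offset)

buf = aligned_buffer(100)  # 100 bytes of payload still occupies one full block
assert ctypes.addressof(buf) % BLOCK == 0
assert len(buf) == BLOCK
```

The over-allocate-and-shift trick works anywhere; on CPython you can also get page-aligned memory directly from `mmap`, as the example below does.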
### The Fix
**C:**
```c
#include <fcntl.h>
#include <unistd.h>
int fd = open("/data/audit.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
// Buffer must be aligned
void* buf;
posix_memalign(&buf, 4096, 4096); // 4KB aligned
// Write directly to disk
ssize_t written = write(fd, buf, 4096);
```text
**Python:**
```python
import os
import mmap
# Open with O_DIRECT
fd = os.open('/data/audit.log', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
# Aligned buffer via mmap
buf = mmap.mmap(-1, 4096)
buf[:len(data)] = data
os.write(fd, buf[:4096])
```sql
### Trade-offs
- **No write coalescing:** Multiple small writes = multiple I/O operations
- **Alignment requirements:** Adds complexity
- **No read-ahead:** Sequential reads won't benefit from prefetching
### Expected Improvement
**Predictable I/O latency** (10-50µs on NVMe) instead of variable background flush timing.
## Fix 3: io_uring {#io-uring}
### The Problem
Traditional syscalls (`read`, `write`) involve context switches:
```text
User space → Kernel → User space
```
Each transition costs 0.5-2µs. For high-frequency I/O, this adds up.
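You can ballpark the per-syscall cost yourself. A rough microbenchmark (numbers vary with CPU, kernel version, and mitigations, and Python's interpreter overhead makes this an upper bound):

```python
import os
import time

def syscall_cost_us(iters: int = 100_000) -> float:
    """Estimate the per-call overhead of a minimal write() syscall."""
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for _ in range(iters):
            os.write(fd, b"x")  # one syscall per iteration, no real I/O
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / iters * 1e6  # microseconds per call

if __name__ == "__main__":
    print(f"~{syscall_cost_us():.2f}µs per write() to /dev/null")
```

A C loop around `write()` gives a tighter number; the point is that the cost is per transition, which is exactly what io_uring amortizes away.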
### The Kernel Mechanism
io_uring ([io_uring/](https://github.com/torvalds/linux/tree/master/io_uring)) uses shared memory rings:
```text
Submission queue (user writes here)
↓
Kernel processes asynchronously
↓
Completion queue (user reads here)
```
**With `IORING_SETUP_SQPOLL`, no syscall is needed for submission:** a kernel thread polls the submission ring. In the default mode, submission still takes one `io_uring_enter()` syscall, but a single call can submit many operations.
### The Fix
**Using liburing (C):**
```c
#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(32, &ring, 0);

// Prepare write
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, len, offset);
sqe->user_data = 42; // tag to correlate with its completion

// Submit (batch multiple)
io_uring_submit(&ring);

// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);

// Process result
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);
```
**Python (illustrative; the exact API differs between bindings):**
```python
# Pseudocode: the available liburing Python bindings expose a similar
# flow, but method names and signatures vary by package
import io_uring

ring = io_uring.Ring(32)
ring.prep_write(fd, buffer, len(buffer), offset)
ring.submit()
cqe = ring.wait()
```
### Expected Improvement
**Saves 0.5-2µs per I/O operation** from eliminated syscall overhead. At 100K IOPS, this is 50-200ms/second saved.
**Citation:** io_uring performance documented by [Jens Axboe](https://kernel.dk/io_uring.pdf).
## Fix 4: Dirty Page Tuning {#dirty-pages}
### The Problem
Default dirty page thresholds allow large amounts of buffered data:
```bash
sysctl vm.dirty_ratio
# vm.dirty_ratio = 20 (20% of RAM can be dirty)
sysctl vm.dirty_background_ratio
# vm.dirty_background_ratio = 10
```
With 64GB RAM, 20% is 12.8GB of dirty data before writers are throttled into forced writeback. When writeback finally happens, it's a storm.
### The Fix
```bash
# Smaller dirty buffers = more frequent, smaller flushes
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2
# Faster writeback age
sudo sysctl -w vm.dirty_expire_centisecs=100 # 1 second
sudo sysctl -w vm.dirty_writeback_centisecs=100
```
**Make persistent:**
```bash
# /etc/sysctl.d/60-latency.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100
```
### Why It Works
Smaller buffers mean:
- Background writeback starts earlier
- Each flush is smaller
- No sudden I/O storms blocking allocations
### Expected Improvement
**Reduces I/O stall variance** from write storms.
## Fix 5: AWS EBS Optimization {#ebs}
### The Problem
EBS volume performance varies by type:
| Type | Baseline IOPS | Max IOPS | Latency |
|------|---------------|----------|---------|
| gp3 | 3,000 | 16,000 | 1-4ms |
| io2 | Provisioned | 64,000 | <1ms |
| io2 Block Express | Provisioned | 256,000 | <1ms |
| Instance store (NVMe) | ~400,000 | ~400,000 | <100µs |
**Burst behavior:** gp2 volumes spend burst credits to exceed their baseline; when credits deplete, IOPS falls back and latency spikes. gp3 has no burst credits, but anything above its 3,000 IOPS baseline must be explicitly provisioned (and paid for).
### The Fix
**For latency-critical workloads:**
```hcl
# Terraform: io2 for predictable IOPS
resource "aws_ebs_volume" "trading" {
availability_zone = "us-east-1a"
size = 100
type = "io2"
iops = 16000 # Provisioned, no burst
throughput = 500
}
```text
**For lowest latency (ephemeral data):**
```hcl
# Instance types with NVMe instance store
resource "aws_instance" "trading" {
  instance_type = "i3.xlarge" # Includes NVMe SSD

  # WARNING: Instance store is ephemeral!
  # Use for cache, not persistent data
}
```
### EBS-Optimized Instances
```hcl
resource "aws_instance" "trading" {
  instance_type = "c6in.xlarge"
  ebs_optimized = true # Dedicated EBS bandwidth
}
```
**EBS-optimized** ensures storage traffic doesn't compete with network traffic.
### Verification
```bash
# Monitor IOPS and latency
iostat -x 1

# Check for burst credit depletion on gp2 volumes (CloudWatch)
# The BurstBalance metric shows remaining credits
```
### Expected Improvement
**io2 vs gp2/gp3:** Eliminates burst- and baseline-related latency variance. **Instance store:** roughly 10x lower latency than EBS.
**Citation:** [AWS EBS documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html).
## Design Philosophy {#design-philosophy}
### The Golden Rule
**Hot path never touches disk.**
Storage I/O is 10-100µs minimum (NVMe), 1-10ms typical (EBS). No amount of tuning makes it competitive with memory (100ns).
**Architecture implications:**
| Operation | Where It Belongs |
|-----------|-----------------|
| Market data processing | Memory only |
| Order decision | Memory only |
| Audit logging | Async queue, separate thread |
| State persistence | Write-ahead log, batched |
| Recovery | Startup, not hot path |
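The async audit-logging row from the table can be sketched in a few lines (illustrative only; a production version would use a bounded lock-free ring, a pinned writer thread, and O_DIRECT batched writes):

```python
import queue
import threading

class AsyncAuditLog:
    """Hot path enqueues records; a background thread owns all disk I/O."""

    def __init__(self, path: str):
        self._q: queue.Queue = queue.Queue()
        self._file = open(path, "a", buffering=1)  # line-buffered text file
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record: str) -> None:
        """Hot path: enqueue only; no disk I/O on the calling thread."""
        self._q.put_nowait(record)

    def _drain(self) -> None:
        """Writer thread: blocks on the queue, flushes records to disk."""
        while True:
            record = self._q.get()
            if record is None:  # shutdown sentinel
                break
            self._file.write(record + "\n")

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()
        self._file.close()
```

The trading thread's cost is a queue push; flush timing, fsync policy, and write batching all live on the writer thread where their latency cannot touch the hot path.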
### When Defaults Are Right
Storage optimizations matter for:
- **Audit logs:** Compliance requires writes
- **State persistence:** Crash recovery
- **Market data replay:** Historical analysis
They don't matter for:
- **Hot path:** If you're reading/writing disk here, redesign
### The Tradeoff
| Change | We Give Up | We Get |
|--------|-----------|--------|
| none scheduler | Fairness between processes | Immediate dispatch |
| O_DIRECT | Write coalescing | Predictable timing |
| io_uring | Simpler code | Lower syscall overhead |
| Lower dirty_ratio | Large batch efficiency | No write storms |
| io2 EBS | Cost savings | Predictable IOPS |
---
## Check Your Storage Configuration
Before tuning, verify what your storage stack is actually doing:
```bash
# Check current I/O scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Check dirty page thresholds
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs
# Check IOPS and latency in real-time
iostat -x 1
```