# I/O Schedulers: Why the Kernel Reorders Your Writes
A deep dive into I/O schedulers, Direct I/O, io_uring, and AWS EBS optimization: block layer internals for predictable storage latency.
The kernel is reordering your writes. You asked for A then B. The disk received B then A.
I/O schedulers optimize for throughput by batching and reordering requests. For trading audit logs, this means your write queues behind background activity: 100µs+ added to what should be a 10µs operation.
This post covers the Linux block layer, why defaults hurt latency, and how to get predictable storage performance.
## The Problem {#the-problem}
Default storage settings prioritize throughput:
| Default Behavior | Why It Exists | Latency Impact |
|---|---|---|
| I/O schedulers | Batching/reordering for HDD seeks | 1-10ms queueing |
| Page cache buffering | Write coalescing | Unpredictable flush timing |
| Request merging | Fewer I/O operations | Delay while accumulating |
| EBS burst behavior | Cost optimization | Variable IOPS |
For CPU-related storage interactions, see CPU Deep Dive. For memory interactions (page cache), see Memory Deep Dive.
## Background: Block Layer Internals {#background}
### The Block I/O Path
When you call write(), the path is (block/blk-core.c):
```text
write() syscall
    ↓
VFS layer (file operations)
    ↓
Page cache (unless O_DIRECT)
    ↓
Filesystem (ext4, xfs)
    ↓
Block layer (I/O scheduling)
    ↓
Device driver (nvme, sd)
    ↓
Hardware
```
**The block layer's job:** Convert file-level operations to block-level operations, queue them efficiently, and submit to hardware.
### I/O Schedulers
I/O schedulers ([block/mq-deadline.c](https://github.com/torvalds/linux/blob/master/block/mq-deadline.c), etc.) reorder requests to improve throughput. Historical context:
| Scheduler | Era | Design Goal |
|-----------|-----|-------------|
| CFQ | HDD era | Fair bandwidth between processes |
| Deadline | HDD era | Bounded latency with reordering |
| BFQ | Modern | Proportional bandwidth |
| mq-deadline | Multi-queue | Deadline for NVMe |
| none | Modern | Pass-through (no scheduling) |
**Why reordering helped HDDs:** Seeks take 5-10ms. Reordering requests to minimize head movement saves time.
**Why reordering hurts NVMe:** NVMe has no seek penalty. Random I/O is as fast as sequential. Scheduling overhead is pure latency addition.
### Multi-Queue Block Layer
Modern kernels use the multi-queue block layer (blk-mq, [block/blk-mq.c](https://github.com/torvalds/linux/blob/master/block/blk-mq.c)):
```text
Per-CPU software queues
↓
Hardware dispatch queues
↓
NVMe submission queues
```
Each CPU has its own queue, reducing lock contention. But schedulers still operate between software and hardware queues.
## Fix 1: I/O Scheduler Selection {#scheduler}
### The Problem
Even on NVMe, some distributions default to `mq-deadline`:
```bash
cat /sys/block/nvme0n1/queue/scheduler
# [mq-deadline] kyber bfq none
```
Every scheduler adds overhead; even minimal scheduling adds microseconds.
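To audit every block device at once, a small helper can read the sysfs files directly (a sketch assuming the standard Linux `/sys/block` layout; the active scheduler is the bracketed entry):

```python
import glob

def active_scheduler(line: str) -> str:
    """Parse a /sys/block/*/queue/scheduler line; the active choice is bracketed."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return line.strip()  # devices with a single option list it bare

def audit_schedulers() -> dict:
    """Map each block device to its currently active I/O scheduler."""
    result = {}
    for path in glob.glob("/sys/block/*/queue/scheduler"):
        dev = path.split("/")[3]
        with open(path) as f:
            result[dev] = active_scheduler(f.read())
    return result

if __name__ == "__main__":
    for dev, sched in sorted(audit_schedulers().items()):
        print(f"{dev}: {sched}")
```

Run it after applying the udev rule to confirm every NVMe device reports `none`.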
### The Fix
```bash
# Check current
cat /sys/block/nvme0n1/queue/scheduler
# Set to none (bypass scheduling)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
```
**Persistent via udev:**
```bash
# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
```
### Why It Works (Kernel Internals)
The `none` scheduler ([block/blk-mq-sched.c](https://github.com/torvalds/linux/blob/master/block/blk-mq-sched.c)) does minimal work:
```c
// With none scheduler:
blk_mq_request_bypass_insert() // Direct to hardware queue
```
No reordering, no batching, minimal overhead.
### Expected Improvement
**Eliminates 1-10ms scheduler queueing** on NVMe. For HDDs, `mq-deadline` may still be better.
## Fix 2: Direct I/O {#direct-io}
### The Problem
Standard writes go through the page cache:
```text
write() → page cache → (later) disk
```
"Later" is unpredictable. Background writeback (historically the `pdflush` threads, now per-device flusher threads) decides when to flush based on:
- dirty_ratio thresholds
- dirty_expire_centisecs age
- Memory pressure
**For audit logs:** You call `write()`, return to trading. 500ms later, background writeback stalls your trading thread while flushing.
### The Kernel Mechanism
O_DIRECT ([fs/direct-io.c](https://github.com/torvalds/linux/blob/master/fs/direct-io.c)) bypasses the page cache:
```text
write() with O_DIRECT → disk immediately
```
**Requirements:**
- Buffer must be aligned (typically 512 bytes or 4KB)
- Size must be multiple of block size
- No write coalescing benefit
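The alignment bookkeeping is mechanical and worth keeping out of the hot path. A sketch (the helper names are mine, assuming a 4KB logical block size; query the device's actual block size in production):

```python
import ctypes

BLOCK = 4096  # assumed logical block size

def round_up(n: int, align: int = BLOCK) -> int:
    """Round a byte count up to the next multiple of the block size."""
    return (n + align - 1) & ~(align - 1)

def aligned_buffer(size: int, align: int = BLOCK) -> ctypes.Array:
    """Allocate a block-aligned, block-size-multiple buffer for O_DIRECT."""
    size = round_up(size, align)
    raw = ctypes.create_string_buffer(size + align)  # over-allocate by one block
    offset = (-ctypes.addressof(raw)) % align        # shift to the next boundary
    return (ctypes.c_char * size).from_buffer(raw, offset)

buf = aligned_buffer(100)  # 100 bytes of payload still occupies one full block
assert ctypes.addressof(buf) % BLOCK == 0
assert len(buf) == BLOCK
```

The over-allocate-and-shift trick works anywhere; on CPython you can also get page-aligned memory directly from `mmap`, as the example below does.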
### The Fix
**C:**
```c
#include <fcntl.h>
#include <unistd.h>
int fd = open("/data/audit.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
// Buffer must be aligned
void* buf;
posix_memalign(&buf, 4096, 4096); // 4KB aligned
// Write directly to disk
ssize_t written = write(fd, buf, 4096);
```text
**Python:**
```python
import os
import mmap
# Open with O_DIRECT
fd = os.open('/data/audit.log', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
# Aligned buffer via mmap
buf = mmap.mmap(-1, 4096)
buf[:len(data)] = data
os.write(fd, buf[:4096])
```sql
### Trade-offs
- **No write coalescing:** Multiple small writes = multiple I/O operations
- **Alignment requirements:** Adds complexity
- **No read-ahead:** Sequential reads won't benefit from prefetching
### Expected Improvement
**Predictable I/O latency** (10-50µs on NVMe) instead of variable background flush timing.
## Fix 3: io_uring {#io-uring}
### The Problem
Traditional syscalls (`read`, `write`) involve context switches:
```text
User space → Kernel → User space
```
Each transition costs 0.5-2µs. For high-frequency I/O, this adds up.
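You can ballpark the per-syscall cost yourself. A rough microbenchmark (numbers vary with CPU, kernel version, and mitigations, and Python's interpreter overhead makes this an upper bound):

```python
import os
import time

def syscall_cost_us(iters: int = 100_000) -> float:
    """Estimate the per-call overhead of a minimal write() syscall."""
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for _ in range(iters):
            os.write(fd, b"x")  # one syscall per iteration, no real I/O
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / iters * 1e6  # microseconds per call

if __name__ == "__main__":
    print(f"~{syscall_cost_us():.2f}µs per write() to /dev/null")
```

A C loop around `write()` gives a tighter number; the point is that the cost is per transition, which is exactly what io_uring amortizes away.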
### The Kernel Mechanism
io_uring ([io_uring/](https://github.com/torvalds/linux/tree/master/io_uring)) uses shared memory rings:
```text
Submission queue (user writes here)
↓
Kernel processes asynchronously
↓
Completion queue (user reads here)
```
**With `IORING_SETUP_SQPOLL`, no syscall is needed for submission:** a kernel thread polls the submission ring. In the default mode, submission still takes one `io_uring_enter()` syscall, but a single call can submit many operations.
### The Fix
**Using liburing (C):**
```c
#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(32, &ring, 0);

// Prepare write
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, len, offset);
sqe->user_data = 42; // tag to correlate with its completion

// Submit (batch multiple)
io_uring_submit(&ring);

// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);

// Process result
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);
```
**Python (illustrative; the exact API differs between bindings):**
```python
# Pseudocode: the available liburing Python bindings expose a similar
# flow, but method names and signatures vary by package
import io_uring

ring = io_uring.Ring(32)
ring.prep_write(fd, buffer, len(buffer), offset)
ring.submit()
cqe = ring.wait()
```
### Expected Improvement
**Saves 0.5-2µs per I/O operation** from eliminated syscall overhead. At 100K IOPS, this is 50-200ms/second saved.
**Citation:** io_uring performance documented by [Jens Axboe](https://kernel.dk/io_uring.pdf).
## Fix 4: Dirty Page Tuning {#dirty-pages}
### The Problem
Default dirty page thresholds allow large amounts of buffered data:
```bash
sysctl vm.dirty_ratio
# vm.dirty_ratio = 20 (20% of RAM can be dirty)
sysctl vm.dirty_background_ratio
# vm.dirty_background_ratio = 10
```
With 64GB RAM, 20% is 12.8GB of dirty data before writers are throttled into forced writeback. When writeback finally happens, it's a storm.
### The Fix
```bash
# Smaller dirty buffers = more frequent, smaller flushes
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2
# Faster writeback age
sudo sysctl -w vm.dirty_expire_centisecs=100 # 1 second
sudo sysctl -w vm.dirty_writeback_centisecs=100
```
**Make persistent:**
```bash
# /etc/sysctl.d/60-latency.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100
```
### Why It Works
Smaller buffers mean:
- Background writeback starts earlier
- Each flush is smaller
- No sudden I/O storms blocking allocations
### Expected Improvement
**Reduces I/O stall variance** from write storms.
## Fix 5: AWS EBS Optimization {#ebs}
### The Problem
EBS volume performance varies by type:
| Type | Baseline IOPS | Max IOPS | Latency |
|------|---------------|----------|---------|
| gp3 | 3,000 | 16,000 | 1-4ms |
| io2 | Provisioned | 64,000 | <1ms |
| io2 Block Express | Provisioned | 256,000 | <1ms |
| Instance store (NVMe) | ~400,000 | ~400,000 | <100µs |
**Burst behavior:** gp2 volumes spend burst credits to exceed their baseline; when credits deplete, IOPS falls back and latency spikes. gp3 has no burst credits, but anything above its 3,000 IOPS baseline must be explicitly provisioned (and paid for).
### The Fix
**For latency-critical workloads:**
```hcl
# Terraform: io2 for predictable IOPS
resource "aws_ebs_volume" "trading" {
availability_zone = "us-east-1a"
size = 100
type = "io2"
iops = 16000 # Provisioned, no burst
throughput = 500
}
```text
**For lowest latency (ephemeral data):**
```hcl
# Instance types with NVMe instance store
resource "aws_instance" "trading" {
  instance_type = "i3.xlarge" # Includes NVMe SSD

  # WARNING: Instance store is ephemeral!
  # Use for cache, not persistent data
}
```
### EBS-Optimized Instances
```hcl
resource "aws_instance" "trading" {
  instance_type = "c6in.xlarge"
  ebs_optimized = true # Dedicated EBS bandwidth
}
```
**EBS-optimized** ensures storage traffic doesn't compete with network traffic.
### Verification
```bash
# Monitor IOPS and latency
iostat -x 1

# Check for burst credit depletion on gp2 volumes (CloudWatch)
# The BurstBalance metric shows remaining credits
```
### Expected Improvement
**io2 vs gp2/gp3:** Eliminates burst- and baseline-related latency variance. **Instance store:** roughly 10x lower latency than EBS.
**Citation:** [AWS EBS documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html).
## Design Philosophy {#design-philosophy}
### The Golden Rule
**Hot path never touches disk.**
Storage I/O is 10-100µs minimum (NVMe), 1-10ms typical (EBS). No amount of tuning makes it competitive with memory (100ns).
**Architecture implications:**
| Operation | Where It Belongs |
|-----------|-----------------|
| Market data processing | Memory only |
| Order decision | Memory only |
| Audit logging | Async queue, separate thread |
| State persistence | Write-ahead log, batched |
| Recovery | Startup, not hot path |
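The async audit-logging row from the table can be sketched in a few lines (illustrative only; a production version would use a bounded lock-free ring, a pinned writer thread, and O_DIRECT batched writes):

```python
import queue
import threading

class AsyncAuditLog:
    """Hot path enqueues records; a background thread owns all disk I/O."""

    def __init__(self, path: str):
        self._q: queue.Queue = queue.Queue()
        self._file = open(path, "a", buffering=1)  # line-buffered text file
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record: str) -> None:
        """Hot path: enqueue only; no disk I/O on the calling thread."""
        self._q.put_nowait(record)

    def _drain(self) -> None:
        """Writer thread: blocks on the queue, flushes records to disk."""
        while True:
            record = self._q.get()
            if record is None:  # shutdown sentinel
                break
            self._file.write(record + "\n")

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()
        self._file.close()
```

The trading thread's cost is a queue push; flush timing, fsync policy, and write batching all live on the writer thread where their latency cannot touch the hot path.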
### When Defaults Are Right
Storage optimizations matter for:
- **Audit logs:** Compliance requires writes
- **State persistence:** Crash recovery
- **Market data replay:** Historical analysis
They don't matter for:
- **Hot path:** If you're reading/writing disk here, redesign
### The Tradeoff
| Change | We Give Up | We Get |
|--------|-----------|--------|
| none scheduler | Fairness between processes | Immediate dispatch |
| O_DIRECT | Write coalescing | Predictable timing |
| io_uring | Simpler code | Lower syscall overhead |
| Lower dirty_ratio | Large batch efficiency | No write storms |
| io2 EBS | Cost savings | Predictable IOPS |
---
## Check Your Storage Configuration
Before tuning, verify what your storage stack is actually doing:
```bash
# Check current I/O scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Check dirty page thresholds
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs
# Check IOPS and latency in real-time
iostat -x 1
```