Incident Response Runbook
This runbook covers common failure scenarios and the recommended response procedures for MuroDB operators.
First Response: Gather Diagnostics
Before taking any corrective action, collect the current state:
SHOW DATABASE STATS;
Note the values of commit_in_doubt_count, failed_checkpoints, wal_file_size_bytes, freelist_sanitize_count, freelist_out_of_range_total, and freelist_duplicates_total.
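As a rough aid for reading these counters, the sketch below maps one snapshot to the scenarios in this runbook. The `triage` helper and the idea of copying the values into a plain dict by hand are assumptions made for illustration; the code does not talk to MuroDB.

```python
# Hypothetical helper: `stats` is a plain dict the operator fills in from the
# SHOW DATABASE STATS output. Nothing here connects to MuroDB.
def triage(stats):
    """Return the runbook scenarios suggested by a single stats snapshot."""
    findings = []
    if stats.get("commit_in_doubt_count", 0) > 0:
        findings.append("CommitInDoubt Detected")
    if stats.get("failed_checkpoints", 0) > 0:
        findings.append("Checkpoint Failures / WAL Growth")
    if stats.get("freelist_sanitize_count", 0) > 0:
        # One occurrence after crash recovery is normal; only recurrence
        # across sessions points at corruption.
        findings.append("Freelist Corruption Suspected (only if recurring)")
    return findings


if __name__ == "__main__":
    snapshot = {"commit_in_doubt_count": 1, "failed_checkpoints": 0}
    for scenario in triage(snapshot):
        print("check scenario:", scenario)
```

A single snapshot cannot show whether the WAL is growing or whether freelist sanitization recurs, so treat the result as a pointer to the relevant section, not a diagnosis.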
If the database cannot be opened, inspect the WAL without modifying it:
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
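If the inspection needs to be scripted, a minimal sketch follows. It only reuses the flags shown in the command above. That the report arrives on stdout as a single JSON document is an assumption, and no field names are assumed because the report schema is not described here.

```python
import json
import subprocess


def build_inspect_command(db_path, wal_path):
    """Argument list mirroring the murodb-wal-inspect invocation above."""
    return [
        "murodb-wal-inspect", db_path,
        "--wal", wal_path,
        "--recovery-mode", "permissive",
        "--format", "json",
    ]


def inspect_wal(db_path, wal_path):
    """Run the inspector and parse its stdout as JSON.

    Assumption: the report is printed to stdout as one JSON document.
    The exit code is not interpreted because its meaning is not documented
    in this runbook; json.loads raises if stdout is not valid JSON.
    """
    result = subprocess.run(
        build_inspect_command(db_path, wal_path),
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the database path contains spaces.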
Scenario: CommitInDoubt Detected
Symptom: commit_in_doubt_count > 0 in SHOW DATABASE STATS, or the application receives a CommitInDoubt error.
What happened: The transaction was durably written to the WAL, but the subsequent write of dirty pages or metadata to the data file failed (e.g., disk full, I/O error). The session is poisoned and will reject further operations.
Response:
- Close the current session / database handle immediately.
- Investigate the root cause: check disk space (df -h), kernel logs (dmesg), and storage health.
- Resolve the underlying issue (free disk space, replace failing disk).
- Reopen the database. WAL recovery will automatically replay the committed transaction.
- Run SHOW DATABASE STATS to confirm commit_in_doubt_count is 0 after recovery.
Do NOT delete or rename the WAL file — it contains the committed data that needs to be replayed.
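Since a full disk is one of the causes named above, it can help to confirm that space has actually been freed before reopening. The sketch below is one way to do that check. The helper names and the `headroom` factor are illustrative assumptions, not MuroDB requirements.

```python
import os
import shutil


def free_bytes(path):
    """Free bytes on the filesystem that holds `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    return shutil.disk_usage(directory).free


def safe_to_reopen(db_path, wal_path, headroom=2.0):
    """True if free space is at least `headroom` times the WAL size.

    The factor is an illustrative assumption; pick a margin that suits
    your storage. A missing WAL file is treated as size 0.
    """
    wal_size = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
    return free_bytes(db_path) >= headroom * wal_size


if __name__ == "__main__":
    print("safe to reopen:", safe_to_reopen("mydb.db", "mydb.wal"))
```

The check only reads file sizes and filesystem statistics, so it does not touch the WAL contents.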
Scenario: Checkpoint Failures / WAL Growth
Symptom: failed_checkpoints > 0, and wal_file_size_bytes is growing over time.
What happened: After committing, MuroDB truncates the WAL via a checkpoint. If truncation fails, the WAL keeps growing. The database remains correct (WAL replay is idempotent), but recovery time increases.
Response:
- Check disk I/O health and available space.
- If the WAL is very large but the database is otherwise healthy, restart the process. Recovery on startup will replay and then truncate the WAL.
- Monitor wal_file_size_bytes after restart to confirm the WAL was truncated.
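The "growing over time" symptom and the post-restart check can be made concrete with a few recorded readings of wal_file_size_bytes. The sketch below assumes the operator collects those readings at intervals. The function names and the threshold parameter are hypothetical.

```python
def wal_is_growing(samples, min_increase_bytes=0):
    """True if chronological wal_file_size_bytes readings never shrink
    and the last reading exceeds the first by more than the threshold."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] - samples[0] > min_increase_bytes


def truncated_after_restart(size_before, size_after):
    """True if the WAL is smaller after the restart than before it."""
    return size_after < size_before


if __name__ == "__main__":
    readings = [4096, 65536, 1048576]  # illustrative values, not real data
    print("growing:", wal_is_growing(readings))
```

A reading that drops at any point means a checkpoint succeeded in between, so the sequence is not reported as growing.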
Scenario: Freelist Corruption Suspected
Symptom: freelist_sanitize_count is consistently non-zero across multiple sessions (not just once after a crash recovery).
What happened: The freelist (which tracks reusable pages) contained invalid entries. MuroDB self-heals by removing invalid entries during allocation, but persistent occurrences may indicate deeper corruption.
Response:
- A single occurrence after crash recovery is normal — no action needed.
- If it recurs across sessions:
  - Back up the database file and WAL immediately.
  - Open with --recovery-mode permissive and check the recovery report for skipped transactions.
  - If data integrity is confirmed, the self-healing is working correctly. Continue monitoring.
  - If data loss is suspected, restore from backup and replay from the last known good state.
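The backup step can be scripted along the following lines. This is a sketch under two assumptions: no process has the database open while the files are copied, and the backup directory naming is a made-up convention.

```python
import os
import shutil
import time


def backup_database(db_path, wal_path, backup_root):
    """Copy the data file and WAL into a new timestamped directory.

    Assumption: the database is closed during the copy, so the two files
    are consistent with each other. The directory name is illustrative.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = os.path.join(backup_root, f"murodb-backup-{stamp}")
    os.makedirs(dest)  # raises if it already exists, so nothing is overwritten
    copied = []
    for src in (db_path, wal_path):
        if os.path.exists(src):  # skip a file that is not present
            copied.append(shutil.copy2(src, dest))
    return copied


if __name__ == "__main__":
    for path in backup_database("mydb.db", "mydb.wal", "."):
        print("backed up:", path)
```

`shutil.copy2` preserves file timestamps where the platform allows it, which keeps the copies useful for later forensic comparison.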
Scenario: Database Fails to Open (WAL Corruption)
Symptom: Opening the database fails with a recovery error in strict mode.
Response:
- Inspect first; do not delete any files:
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
- Review the report. If only incomplete (uncommitted) transactions are malformed, they can be safely skipped.
- Open with permissive mode to recover valid data:
murodb mydb.db --recovery-mode permissive
- The original WAL is automatically quarantined to *.wal.quarantine.* for forensic analysis.
- Verify recovered data integrity by querying critical tables.
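To locate the quarantined files after a permissive open, a glob on the pattern above is usually enough. The sketch assumes the quarantined WAL is written to the same directory as the database, which this runbook does not state explicitly.

```python
import glob
import os


def find_quarantined_wals(db_dir):
    """List files matching *.wal.quarantine.* in db_dir, oldest first.

    Assumption: quarantined WALs live next to the database file.
    """
    pattern = os.path.join(db_dir, "*.wal.quarantine.*")
    return sorted(glob.glob(pattern), key=os.path.getmtime)


if __name__ == "__main__":
    for path in find_quarantined_wals("."):
        print("quarantined WAL:", path)
```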
Scenario: Process Crash / Kill During Operation
Symptom: The MuroDB process was killed (SIGKILL, OOM, power loss) mid-operation.
Response:
- Simply reopen the database. WAL recovery handles this automatically.
- Check SHOW DATABASE STATS after recovery:
  - commit_in_doubt_count should be 0.
  - freelist_sanitize_count may be > 0 once; this is normal.
- If strict recovery fails, follow the “Database Fails to Open” procedure above.
When to Restart vs. Quarantine WAL
| Situation | Action |
|---|---|
| Session poisoned (CommitInDoubt) | Restart — recovery replays committed data |
| WAL growing (checkpoint failures) | Restart — recovery truncates WAL |
| Strict recovery fails | Inspect WAL, then open with --recovery-mode permissive |
| Repeated freelist sanitization | Back up, then investigate with permissive mode |
| Corrupted WAL with data loss | Restore from backup |
Escalation Criteria
Escalate to the development team if:
- commit_in_doubt_count > 0 persists after reopening the database (recovery failed to replay).
- WAL inspection shows committed transactions that were not recovered.
- Permissive mode skips transactions that should have been valid.
- Freelist sanitization count grows across sessions without any preceding crash.
- The database file size is inconsistent with expected data volume (possible page leak).
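Two of these criteria can be checked mechanically from notes kept per session. The sketch below is one possible reading of them: the `SessionRecord` type is hypothetical, and treating "grows across sessions" as "non-zero in at least two sessions with no crash beforehand" is an interpretation, not a documented rule. The remaining criteria need human review of the WAL inspection report.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Operator notes for one open of the database (hypothetical type)."""
    commit_in_doubt_count: int
    freelist_sanitize_count: int
    preceded_by_crash: bool


def escalation_reasons(sessions):
    """Return escalation reasons for chronological records, oldest first."""
    reasons = []
    # Criterion: the counter is still non-zero in the session after a reopen.
    for prev, curr in zip(sessions, sessions[1:]):
        if prev.commit_in_doubt_count > 0 and curr.commit_in_doubt_count > 0:
            reasons.append("commit_in_doubt_count > 0 persists after reopening")
            break
    # Criterion (interpreted): sanitization seen in two or more sessions
    # that were not preceded by a crash.
    unexplained = [
        s for s in sessions
        if s.freelist_sanitize_count > 0 and not s.preceded_by_crash
    ]
    if len(unexplained) >= 2:
        reasons.append("freelist sanitization recurs without a preceding crash")
    return reasons
```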
When escalating, include:
- Full SHOW DATABASE STATS output.
- WAL inspection JSON output (murodb-wal-inspect --format json).
- Kernel logs around the time of failure (dmesg, journalctl).
- The quarantined WAL file(s), if any.
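Collecting these items into one archive makes the hand-off easier. The following is a minimal sketch that assumes the stats output, inspection JSON, and kernel logs have already been saved to files. The function name and file names are illustrative.

```python
import os
import tarfile


def bundle_escalation_artifacts(paths, archive_path):
    """Pack the files that exist into a gzip tarball; return stored names.

    Files are stored under their base names, so two inputs with the same
    base name would collide; rename them first if that can happen.
    """
    stored = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if os.path.isfile(path):  # silently skip anything missing
                name = os.path.basename(path)
                tar.add(path, arcname=name)
                stored.append(name)
    return stored


if __name__ == "__main__":
    # Illustrative file names; substitute whatever was actually saved.
    files = ["stats.txt", "wal-inspect.json", "dmesg.log"]
    print(bundle_escalation_artifacts(files, "murodb-escalation.tar.gz"))
```

Compare the returned list against the list above before sending, since missing files are skipped without raising an error.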