MuroDB
MuroDB is an embedded SQL database written in Rust.
This documentation is designed for people who already know SQL and want to start building quickly.
What You Can Do Quickly
- Create a local database file and start querying right away.
- Keep data durable with WAL-based crash recovery.
- Use encrypted-at-rest mode by default, or explicitly opt into plaintext mode.
- Add full-text search with bigram tokenization and relevance scoring.
If you want hands-on steps first, start with First Session (10 Minutes) below.
Core Capabilities
- At-rest mode: aes256-gcm-siv (default) or explicit off (plaintext)
- Storage engine: B+tree, WAL, page cache, crash recovery
- Transactions: ACID semantics
- Concurrency model: multiple readers / single writer
- Full-text search: MATCH(...) AGAINST(...), NATURAL/BOOLEAN modes, fts_snippet()
Architecture Map
| Component | Description |
|---|---|
| crypto/ | Page encryption, key derivation, and term-ID blinding |
| storage/ | Pager, slotted pages, cache, and freelist management |
| btree/ | Index/table structures, search, and mutation operations |
| wal/ | Write-ahead log, reader/writer, recovery flow |
| tx/ | Transaction lifecycle, dirty-page buffering, commit/rollback |
| schema/ | Catalog metadata for tables and indexes |
| sql/ | Lexer, parser, planner, executor, and session control |
| fts/ | Tokenizer, postings, BM25 scoring, snippets |
| concurrency/ | In-process and cross-process locking |
Non-goals
- Network server protocol
- Full access-pattern obfuscation
- Stored procedures / triggers
Installation
Requirements
- Rust toolchain (stable)
Install from source
git clone https://github.com/tokuhirom/murodb.git
cd murodb
cargo install --path .
This installs the murodb binary to ~/.cargo/bin/.
Build only
cargo build --release
The binary will be at target/release/murodb.
First Session (10 Minutes)
This page is a fast path for SQL users who want to get a feel for MuroDB before reading the full details.
1. Install
cargo install --path .
2. Create a database and schema
murodb demo.db --create -e "CREATE TABLE notes (
id BIGINT PRIMARY KEY,
title VARCHAR NOT NULL,
body TEXT
)"
Encrypted mode prompts for a password by default.
If you need plaintext mode, opt in explicitly:
murodb demo-plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"
3. Insert and query rows
murodb demo.db -e "INSERT INTO notes (id, title, body) VALUES
(1, 'welcome', 'hello from murodb'),
(2, 'next', 'test full text search')"
murodb demo.db -e "SELECT id, title FROM notes ORDER BY id"
4. Add full-text search
murodb demo.db -e "CREATE FULLTEXT INDEX notes_body_fts ON notes(body)
WITH PARSER ngram
OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000)"
murodb demo.db -e "SELECT id,
MATCH(body) AGAINST('full text' IN NATURAL LANGUAGE MODE) AS score
FROM notes
WHERE MATCH(body) AGAINST('full text' IN NATURAL LANGUAGE MODE) > 0
ORDER BY score DESC"
5. Open interactive REPL
murodb demo.db
Start without -e to enter REPL mode.
Tips:
- Press Ctrl-C while typing to clear the current input buffer.
- Press Ctrl-C during statement execution to cancel the running statement and keep the REPL open.
Next
- Quick Start for concise command examples.
- SQL Reference for statement details.
- Recovery for durability and incident handling.
Quick Start
This page is a concise command reference.
For a guided first run, see First Session (10 Minutes).
Create a database
murodb mydb.db --create -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"
You will be prompted for an encryption password.
If you need plaintext mode, opt in explicitly:
murodb mydb_plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"
Insert rows
murodb mydb.db -e "INSERT INTO t (id, name) VALUES (1, 'hello'), (2, 'world')"
Query rows
murodb mydb.db -e "SELECT id, name FROM t ORDER BY id"
Show tables
murodb mydb.db -e "SHOW TABLES"
Run with JSON output
murodb mydb.db --format json -e "SELECT id, name FROM t ORDER BY id"
Interactive REPL
murodb mydb.db
Start without -e to enter the interactive REPL mode.
CLI Options
Basic usage
murodb <database-file> [options]
Options
| Option | Description |
|---|---|
| -e <SQL> | Execute SQL and exit |
| --create | Create a new database |
| --encryption <aes256-gcm-siv\|off> | Encryption suite (off = plaintext, explicit opt-in) |
| --password <PW> | Password for aes256-gcm-siv mode (prompts if omitted) |
| --recovery-mode <strict\|permissive> | WAL recovery policy for open |
| --format <text\|json> | Output format for query results |
| --busy-timeout-ms <N> | Lock wait timeout in milliseconds (0 = wait indefinitely) |
| --statement-timeout-ms <N> | Per-statement execution timeout in milliseconds (0 = no timeout) |
Examples
# Create a new database
murodb mydb.db --create -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"
# Create plaintext DB (explicit)
murodb mydb_plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"
# Insert data
murodb mydb.db -e "INSERT INTO t (id, name) VALUES (1, 'hello')"
# Query
murodb mydb.db -e "SELECT * FROM t"
# Interactive REPL
murodb mydb.db
# Open with permissive recovery mode
murodb mydb.db --recovery-mode permissive
# JSON output for machine processing
murodb mydb.db --format json -e "SELECT * FROM t"
Query routing behavior
The CLI parses each statement and routes execution automatically:
- Read-only statements (SELECT, UNION, EXPLAIN SELECT, SHOW ..., DESCRIBE) use the read path. EXPLAIN UPDATE / EXPLAIN DELETE are supported, but route according to the inner statement category.
- Write and transaction-control statements (INSERT, UPDATE, DELETE, DDL, BEGIN/COMMIT/ROLLBACK) use the write path.
- While an explicit transaction is active (BEGIN ... COMMIT/ROLLBACK), all statements (including SELECT) run with execute semantics.
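The routing rules above can be sketched as a toy classifier. This is illustrative only: the `route` helper below is a hypothetical name and uses simplified keyword matching, while the real CLI fully parses each statement.

```python
def route(sql: str, in_explicit_tx: bool) -> str:
    """Toy router: classify a statement as read-path or write-path."""
    if in_explicit_tx:
        # Inside BEGIN ... COMMIT/ROLLBACK everything runs with execute semantics.
        return "write"
    words = sql.strip().split(None, 2)
    head = words[0].upper() if words else ""
    if head == "EXPLAIN":
        # EXPLAIN routes by the inner statement's category.
        head = words[1].upper() if len(words) > 1 else ""
    read_heads = {"SELECT", "UNION", "SHOW", "DESCRIBE", "DESC"}
    return "read" if head in read_heads else "write"

print(route("SELECT * FROM t", False))             # read
print(route("EXPLAIN UPDATE t SET a = 1", False))  # write
print(route("SELECT * FROM t", True))              # write
```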
Ctrl-C behavior
- In REPL input mode (while typing a statement), Ctrl-C clears the current input buffer.
- While a statement is executing, Ctrl-C requests cancellation of the in-flight statement.
- If cancellation succeeds, the statement returns an error and the REPL stays alive.
- In one-shot mode (-e), Ctrl-C also requests cancellation; the command then exits after printing the statement error.
JSON output
When using --format json, results are emitted as a single JSON object per statement.
Result envelope
- type - One of rows, rows_affected, ok, or error
- columns - Column names in result order (only for rows)
- rows - Array of row arrays in column order (only for rows)
- row_count - Number of rows (only for rows)
- rows_affected - Number of rows affected (only for rows_affected)
- message - Error message string (only for error)
Example:
{"type":"rows","columns":["id","name"],"rows":[[1,"alice"]],"row_count":1}
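A consumer can dispatch on the `type` field of each envelope. A minimal Python sketch (the `handle` helper is hypothetical), fed the example envelope above:

```python
import json

def handle(envelope_line: str):
    """Dispatch on the `type` field of one result envelope."""
    msg = json.loads(envelope_line)
    if msg["type"] == "rows":
        # Zip column names with each row array.
        return [dict(zip(msg["columns"], row)) for row in msg["rows"]]
    if msg["type"] == "rows_affected":
        return msg["rows_affected"]
    if msg["type"] == "error":
        raise RuntimeError(msg["message"])
    return None  # "ok"

out = handle('{"type":"rows","columns":["id","name"],"rows":[[1,"alice"]],"row_count":1}')
print(out)  # [{'id': 1, 'name': 'alice'}]
```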
Types
INTEGER
Numbers are emitted as JSON numbers.
FLOAT
Finite values are emitted as JSON numbers. Non-finite values are emitted as JSON strings.
DATE
YYYY-MM-DD
DATETIME
ISO 8601 YYYY-MM-DDTHH:MM:SS
TIMESTAMP
ISO 8601 YYYY-MM-DDTHH:MM:SS
VARCHAR
JSON strings with standard escaping.
VARBINARY
Base64 string (standard alphabet with = padding), for example q80=.
NULL
null
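The VARBINARY Base64 form above (standard alphabet with `=` padding) follows RFC 4648, so any standard Base64 decoder can consume it. The documented example value `q80=` is the encoding of the two bytes `0xAB 0xCD`:

```python
import base64

# VARBINARY bytes 0xAB 0xCD round-trip through the documented Base64 form.
raw = bytes([0xAB, 0xCD])
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # q80=
assert base64.b64decode(encoded) == raw
```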
WAL inspection
WAL inspection is handled by a dedicated command so the query CLI stays simple:
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive
See WAL Inspection for exit codes and JSON schema.
Rekey Command
Use the dedicated command for password rotation:
murodb-rekey mydb.db
The command prompts for current password, new password, and confirmation via TTY.
Security Notes
- Prefer the interactive password prompt over --password to reduce secret exposure in process lists/history.
- Password rotation uses murodb-rekey (interactive prompt), not SQL text.
WAL Inspection
WAL inspection is a separate command so the main murodb CLI stays focused on queries.
Basic usage
murodb-wal-inspect <database-file> --wal <WAL-PATH> [options]
Options
| Option | Description |
|---|---|
| --wal <PATH> | WAL file path or quarantine file path |
| --password <PW> | Password (prompts if omitted) |
| --recovery-mode <strict\|permissive> | Recovery policy used during inspection |
| --format <text\|json> | Output format for inspection results |
Examples
# Text output
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive
# JSON output (for automation)
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
# Inspect a quarantine WAL file
murodb-wal-inspect mydb.db --wal mydb.wal.quarantine.20240101_120000
Exit codes
| Exit Code | Meaning |
|---|---|
| 0 | No malformed transactions detected |
| 10 | Malformed transactions detected (inspection succeeded) |
| 20 | Fatal error (decrypt/IO/strict failure, etc.) |
JSON output
When using --format json, the output includes stable fields:
- schema_version - Schema version for the JSON format
- mode - Recovery mode used
- wal_path - Path to the WAL file
- generated_at - Timestamp of the inspection
- status - ok, warning, or fatal
- exit_code - Exit code
- skipped[].code - Machine-readable classification of skipped transactions
- fatal_error / fatal_error_code - Present on fatal failures
SQL Reference
Data Types
| Type | Storage | Range |
|---|---|---|
| TINYINT | 1 byte | -128 to 127 |
| SMALLINT | 2 bytes | -32,768 to 32,767 |
| INT | 4 bytes | -2,147,483,648 to 2,147,483,647 |
| BIGINT | 8 bytes | -2^63 to 2^63-1 |
| BOOLEAN | 1 byte | Alias for TINYINT |
| DATE | 4 bytes | YYYY-MM-DD |
| DATETIME | 8 bytes | YYYY-MM-DD HH:MM:SS |
| TIMESTAMP | 8 bytes | YYYY-MM-DD HH:MM:SS (timezone-aware input, normalized to UTC) |
| VARCHAR(n) | variable | max n bytes (optional) |
| TEXT | variable | unbounded text |
| JSONB | variable | Canonical JSON text (validated on write) |
| VARBINARY(n) | variable | max n bytes (optional) |
| FLOAT | 4 bytes | Single-precision IEEE 754 |
| DOUBLE | 8 bytes | Double-precision IEEE 754 |
| DECIMAL(p,s) | 16 bytes | Fixed-point exact numeric (precision 1-28, scale 0-p). Alias: NUMERIC(p,s). Default: DECIMAL(10,0) |
| UUID | 16 bytes | 128-bit UUID (RFC 9562), stored as fixed-length binary |
| NULL | 0 bytes | null value |
Temporal semantics:
- DATE stores the calendar date only.
- DATETIME stores the date-time as provided (no timezone conversion).
- TIMESTAMP accepts timezone offsets in string input (for example +09:00, Z) and stores the UTC-normalized value.
- Invalid calendar/time values are rejected.
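The TIMESTAMP offset normalization can be sketched with Python's `datetime`. How MuroDB treats offset-less TIMESTAMP input is not specified above, so the naive-input branch below (defaulting to UTC) is an assumption:

```python
from datetime import datetime, timezone

def normalize_timestamp(s: str) -> str:
    """Parse a TIMESTAMP string with an optional offset and emit the UTC value."""
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: treat naive input as UTC
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

print(normalize_timestamp("2026-02-22T22:00:00+09:00"))  # 2026-02-22T13:00:00
print(normalize_timestamp("2026-02-22T13:00:00Z"))       # 2026-02-22T13:00:00
```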
DDL (Data Definition Language)
CREATE TABLE
CREATE TABLE t (
id BIGINT PRIMARY KEY,
body VARCHAR,
blob VARBINARY,
uniq VARCHAR UNIQUE
);
-- With additional features
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR NOT NULL,
email VARCHAR UNIQUE,
age INT DEFAULT 0,
active BOOLEAN DEFAULT 1,
CONSTRAINT chk_age CHECK (age >= 0)
);
-- IF NOT EXISTS
CREATE TABLE IF NOT EXISTS t (id BIGINT PRIMARY KEY);
-- Composite PRIMARY KEY
CREATE TABLE orders (
user_id INT,
order_id INT,
amount INT,
PRIMARY KEY (user_id, order_id)
);
-- Composite UNIQUE constraint
CREATE TABLE t (
id BIGINT PRIMARY KEY,
a INT,
b INT,
UNIQUE (a, b)
);
-- FOREIGN KEY (default: RESTRICT)
CREATE TABLE children (
id BIGINT PRIMARY KEY,
parent_id BIGINT,
FOREIGN KEY (parent_id) REFERENCES parents(id)
);
-- FOREIGN KEY with actions
CREATE TABLE child_actions (
id BIGINT PRIMARY KEY,
parent_id BIGINT,
FOREIGN KEY (parent_id) REFERENCES parents(id)
ON DELETE CASCADE
ON UPDATE SET NULL
);
CREATE INDEX
CREATE UNIQUE INDEX idx_email ON users(email);
-- IF NOT EXISTS
CREATE INDEX IF NOT EXISTS idx_name ON users(name);
-- Composite index (multiple columns)
CREATE INDEX idx_ab ON t(a, b);
CREATE UNIQUE INDEX idx_ab ON t(a, b);
CREATE FULLTEXT INDEX
CREATE FULLTEXT INDEX t_body_fts ON t(body)
WITH PARSER ngram
OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000);
FULLTEXT is usable with any primary-key type. Internally, MuroDB maintains a separate FTS doc_id.
stop_filter supports on/off (quoted or unquoted), 1/0, and true/false.
stop_df_ratio_ppm range is 0..=1000000.
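A minimal sketch of what `PARSER ngram` with `n=2` and `normalize='nfkc'` does to input text: normalize, then slide a two-character window. The lowercasing step and the handling of whitespace are assumptions; MuroDB's exact token rules are not specified here.

```python
import unicodedata

def bigrams(text: str, n: int = 2):
    """Character n-grams after NFKC normalization (n=2, normalize='nfkc').
    Lowercasing is an assumption, not documented behavior."""
    norm = unicodedata.normalize("NFKC", text).lower()
    return [norm[i:i + n] for i in range(len(norm) - n + 1)]

print(bigrams("hello"))  # ['he', 'el', 'll', 'lo']
```

Bigram indexing is why short substring queries match without word boundaries, which suits languages written without spaces.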
DROP TABLE / DROP INDEX
DROP TABLE t;
DROP TABLE IF EXISTS t;
DROP INDEX idx_email;
DROP INDEX IF EXISTS idx_email;
ALTER TABLE
-- Add a new column (O(1), no row rewrite)
ALTER TABLE t ADD COLUMN email VARCHAR;
ALTER TABLE t ADD age INT DEFAULT 0;
-- Drop a column (full table rewrite)
ALTER TABLE t DROP COLUMN age;
-- Modify column type or constraints (full rewrite if type changes)
ALTER TABLE t MODIFY COLUMN name VARCHAR(255) NOT NULL;
ALTER TABLE t MODIFY name TEXT;
-- Rename and optionally change a column (CHANGE COLUMN)
ALTER TABLE t CHANGE COLUMN name username VARCHAR;
-- Add / drop FOREIGN KEY
ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent(id);
ALTER TABLE child DROP FOREIGN KEY (parent_id);
Performance notes:
- ADD COLUMN is O(1): it only updates the catalog. Existing rows return the default value (or NULL) for the new column without rewriting data.
- DROP COLUMN, MODIFY COLUMN (with type change), and CHANGE COLUMN (with type change) perform a full table rewrite.
- MODIFY COLUMN / CHANGE COLUMN without a type change is catalog-only (O(1)).
Behavior details:
- ADD COLUMN ... NOT NULL without DEFAULT fails if the table already has rows.
- ADD COLUMN ... UNIQUE creates an automatic unique index (auto_unique_<table>_<column>).
- ADD COLUMN ... UNIQUE with a non-NULL default fails for multi-row existing tables, because all rows would backfill to the same value.
- MODIFY COLUMN / CHANGE COLUMN that adds NOT NULL validates existing rows and fails if NULL values are present.
- MODIFY COLUMN / CHANGE COLUMN with a type change rewrites all rows and coerces values; conversion failures abort the statement.
- CHANGE COLUMN updates index metadata to the new column name when indexes reference the old name.
- MODIFY COLUMN / CHANGE COLUMN reconcile single-column UNIQUE: adding UNIQUE may create an index; removing UNIQUE drops the corresponding auto unique index.
- ADD FOREIGN KEY validates existing rows; if orphan rows exist, it fails.
- FK actions support RESTRICT, CASCADE, and SET NULL for both ON DELETE and ON UPDATE.
Limitations:
- Cannot add a PRIMARY KEY column via ALTER TABLE.
- Cannot drop a PRIMARY KEY column.
- Cannot drop a column that has an index on it (drop the index first).
- Cannot drop a table that is referenced by a foreign key.
- DROP FOREIGN KEY is specified by child column list: DROP FOREIGN KEY (col1, col2).
RENAME TABLE
RENAME TABLE old_name TO new_name;
Renames a table. All indexes are automatically updated. No row data is rewritten.
Schema Inspection
SHOW TABLES;
SHOW CREATE TABLE t;
DESCRIBE t;
DESC t;
Operational Inspection
SHOW CHECKPOINT STATS;
SHOW DATABASE STATS;
Both commands return two columns: stat and value.
SHOW DATABASE STATS includes cache observability fields:
- pager_cache_hits
- pager_cache_misses
- pager_cache_hit_rate_pct
It also exposes checkpoint policy/runtime fields:
- deferred_checkpoints
- checkpoint_pending_ops
- checkpoint_policy_tx_threshold
- checkpoint_policy_wal_bytes_threshold
- checkpoint_policy_interval_ms
WAL observability:
- wal_file_size_bytes
Runtime Configuration
Runtime options are documented in Runtime Configuration.
DML (Data Manipulation Language)
ANALYZE TABLE
Refreshes persisted planner statistics.
ANALYZE TABLE t;
Current persisted stats include:
- table row count
- index distinct-key count
INSERT
INSERT INTO t (id, name) VALUES (1, 'Alice');
-- Multi-row insert
INSERT INTO t (id, name) VALUES (1, 'Alice'), (2, 'Bob');
INSERT … ON DUPLICATE KEY UPDATE
If a row with the same PRIMARY KEY already exists, updates the existing row instead of inserting a new one.
INSERT INTO t (id, name) VALUES (1, 'Alice')
ON DUPLICATE KEY UPDATE name = 'Alice Updated';
-- Expressions can reference existing column values
INSERT INTO counters (id, cnt) VALUES (1, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
Affected rows (MySQL-compatible):
- New row inserted: 1
- Existing row updated: 2
Limitations:
- The VALUES() function is not supported. Use column references to access the existing row's values.
REPLACE INTO
Inserts a new row. If a row with the same PRIMARY KEY or UNIQUE index value already exists, deletes the old row first, then inserts the new one.
REPLACE INTO t (id, name) VALUES (1, 'Alice');
-- Multi-row replace
REPLACE INTO t (id, name) VALUES (1, 'Alice'), (2, 'Bob');
Unlike INSERT ... ON DUPLICATE KEY UPDATE, REPLACE deletes and re-inserts the entire row. This means:
- All columns are replaced with the new values (columns not specified get defaults/NULL).
- Conflicts on any UNIQUE index (not just PRIMARY KEY) also trigger deletion of the conflicting row.
SELECT
SELECT * FROM t WHERE id = 42 ORDER BY id DESC LIMIT 10;
-- With OFFSET
SELECT * FROM t LIMIT 10 OFFSET 20;
UPDATE
UPDATE t SET name = 'Alicia' WHERE id = 1;
DELETE
DELETE FROM t WHERE id = 1;
Index Hints (FORCE INDEX / USE INDEX / IGNORE INDEX)
MySQL-compatible index hints allow controlling which indexes the query planner considers.
-- Force the planner to use only the specified index (skips PK seek)
SELECT * FROM t FORCE INDEX (idx_age) WHERE age = 20;
-- Suggest that the planner use the specified index (PK seek still allowed)
SELECT * FROM t USE INDEX (idx_age) WHERE age = 20;
-- Exclude the specified index from consideration
SELECT * FROM t IGNORE INDEX (idx_age) WHERE age = 20;
-- Multiple index names
SELECT * FROM t FORCE INDEX (idx_age, idx_name) WHERE age = 20;
-- Also works with UPDATE and DELETE
UPDATE t FORCE INDEX (idx_age) SET name = 'updated' WHERE age = 20;
DELETE FROM t IGNORE INDEX (idx_age) WHERE age = 20;
Behavior:
- FORCE INDEX: Only the specified indexes are candidates. Primary key seek is skipped. If the specified index cannot be used for the query, the planner falls back to a full table scan (matching MySQL behavior).
- USE INDEX: The specified indexes are preferred, but full table scan is also a candidate. Primary key seek is still allowed.
- IGNORE INDEX: The specified indexes are excluded from consideration. All other indexes and primary key seek remain available.
WHERE Clause
Comparison operators
WHERE id = 1
WHERE id != 1
WHERE id < 10
WHERE id > 5
WHERE id <= 10
WHERE id >= 5
Logical operators
WHERE id > 1 AND name = 'Alice'
WHERE id = 1 OR id = 2
WHERE NOT (id = 1)
LIKE / NOT LIKE
WHERE name LIKE 'Ali%'
WHERE name LIKE '_ob'
WHERE name NOT LIKE '%test%'
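LIKE semantics can be modeled by translating the pattern to a regular expression: `%` matches any run of characters and `_` exactly one. This sketch ignores escape sequences and assumes case-sensitive matching, both of which MuroDB may handle differently.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a LIKE pattern: % -> .*, _ -> . (escape handling omitted)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # literal characters are escaped
    return "^" + "".join(parts) + "$"

print(bool(re.match(like_to_regex("Ali%"), "Alice")))      # True
print(bool(re.match(like_to_regex("_ob"), "Bob")))         # True
print(bool(re.match(like_to_regex("%test%"), "retests")))  # True
```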
IN
WHERE id IN (1, 2, 3)
BETWEEN
WHERE id BETWEEN 1 AND 10
IS NULL / IS NOT NULL
WHERE name IS NULL
WHERE name IS NOT NULL
ORDER BY / LIMIT
SELECT * FROM t ORDER BY id ASC;
SELECT * FROM t ORDER BY name DESC, id ASC;
SELECT * FROM t LIMIT 10;
SELECT * FROM t LIMIT 10 OFFSET 5;
Literals
Hex Literal (Binary)
Binary data can be specified using the X'...' syntax (SQL standard / MySQL compatible):
-- Insert binary data
INSERT INTO t (id, data) VALUES (1, X'DEADBEEF');
-- Empty binary literal
INSERT INTO t (id, data) VALUES (2, X'');
-- Case-insensitive (both X and x are accepted)
INSERT INTO t (id, data) VALUES (3, x'cafebabe');
-- Use in WHERE clause
SELECT * FROM t WHERE data = X'DEADBEEF';
The hex string must contain an even number of hex digits (0-9, A-F, a-f).
Odd-length hex strings and invalid characters produce a parse error.
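The validation rules above can be sketched as a small parser. The `parse_hex_literal` helper below is hypothetical, not MuroDB's lexer:

```python
import re

def parse_hex_literal(lit: str) -> bytes:
    """Validate and decode an X'...' literal: case-insensitive prefix,
    hex digits only, even digit count."""
    m = re.fullmatch(r"[Xx]'([0-9A-Fa-f]*)'", lit)
    if m is None or len(m.group(1)) % 2 != 0:
        raise ValueError(f"parse error: invalid hex literal {lit!r}")
    return bytes.fromhex(m.group(1))

print(parse_hex_literal("X'DEADBEEF'"))  # b'\xde\xad\xbe\xef'
print(parse_hex_literal("x''"))          # b''
```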
Expressions
Arithmetic operators
SELECT id, price * quantity AS total FROM orders;
SELECT id, (a + b) / 2 AS average FROM t;
-- Supported: +, -, *, /, %
Built-in Functions
String Functions
LENGTH(s)
Returns the byte length of a string.
SELECT LENGTH('hello'); -- 5
SELECT LENGTH('héllo'); -- 6 (é is 2 bytes in UTF-8)
CHAR_LENGTH(s) / CHARACTER_LENGTH(s)
Returns the character count of a string.
SELECT CHAR_LENGTH('hello'); -- 5
SELECT CHAR_LENGTH('héllo'); -- 5
CONCAT(s1, s2, …)
Concatenates two or more strings. Returns NULL if any argument is NULL.
SELECT CONCAT('hello', ' ', 'world'); -- 'hello world'
SELECT CONCAT('a', NULL); -- NULL
SUBSTRING(s, pos [, len]) / SUBSTR(s, pos [, len])
Returns a substring starting at position pos (1-based). Optional len limits the length.
SELECT SUBSTRING('hello world', 7); -- 'world'
SELECT SUBSTRING('hello world', 1, 5); -- 'hello'
SELECT SUBSTRING('hello', -3); -- 'llo'
UPPER(s) / LOWER(s)
Converts a string to upper or lower case.
SELECT UPPER('hello'); -- 'HELLO'
SELECT LOWER('HELLO'); -- 'hello'
TRIM(s) / LTRIM(s) / RTRIM(s)
Removes whitespace from both ends, the left end, or the right end.
SELECT TRIM(' hello '); -- 'hello'
SELECT LTRIM(' hello '); -- 'hello '
SELECT RTRIM(' hello '); -- ' hello'
REPLACE(s, from, to)
Replaces all occurrences of from with to in s.
SELECT REPLACE('hello world', 'world', 'rust'); -- 'hello rust'
REVERSE(s)
Reverses a string.
SELECT REVERSE('hello'); -- 'olleh'
REPEAT(s, n)
Repeats a string n times.
SELECT REPEAT('ab', 3); -- 'ababab'
LEFT(s, n) / RIGHT(s, n)
Returns the leftmost or rightmost n characters.
SELECT LEFT('hello', 3); -- 'hel'
SELECT RIGHT('hello', 3); -- 'llo'
LPAD(s, len, pad) / RPAD(s, len, pad)
Pads a string to length len using pad characters on the left or right.
SELECT LPAD('hi', 5, '*'); -- '***hi'
SELECT RPAD('hi', 5, '*'); -- 'hi***'
INSTR(s, sub)
Returns the position (1-based) of the first occurrence of sub in s. Returns 0 if not found.
SELECT INSTR('hello world', 'world'); -- 7
SELECT INSTR('hello', 'xyz'); -- 0
LOCATE(sub, s [, pos])
Returns the position (1-based) of sub in s, starting the search at pos.
SELECT LOCATE('hello', 'hello hello'); -- 1
SELECT LOCATE('hello', 'hello hello', 2); -- 7
REGEXP
REGEXP / REGEXP_LIKE(s, pattern)
Tests whether a string matches a regular expression. Can be used as an operator or function.
-- Operator syntax
SELECT * FROM t WHERE name REGEXP '[0-9]+';
-- Function syntax
SELECT REGEXP_LIKE(name, '^hello') FROM t;
Numeric Functions
All numeric functions support INTEGER, FLOAT, DOUBLE, and DECIMAL types.
ABS(n)
Returns the absolute value.
SELECT ABS(-42); -- 42
SELECT ABS(-3.14); -- 3.14 (DECIMAL)
CEIL(n) / CEILING(n) / FLOOR(n)
Returns the ceiling or floor. (Identity for integer types.)
SELECT CEIL(3.14); -- 4
SELECT FLOOR(3.14); -- 3
ROUND(n [, decimals])
Rounds a number to decimals decimal places (default 0). Works with DECIMAL for exact rounding.
SELECT ROUND(3.1459, 2); -- 3.15 (DECIMAL)
SELECT ROUND(42); -- 42
MOD(a, b)
Returns the modulo (same as % operator).
SELECT MOD(10, 3); -- 1
POWER(base, exp) / POW(base, exp)
Returns base raised to the power of exp.
SELECT POWER(2, 10); -- 1024
UUID Functions
UUID_V4()
Generates a random UUID (version 4, RFC 9562).
SELECT UUID_V4(); -- e.g. '550e8400-e29b-41d4-a716-446655440000'
UUID_V7()
Generates a time-ordered UUID (version 7, RFC 9562). UUIDs generated later sort after earlier ones, making them suitable for primary keys with time-ordered insertion.
CREATE TABLE events (id UUID PRIMARY KEY, data VARCHAR);
INSERT INTO events VALUES (UUID_V7(), 'event data');
UUID values are displayed as lowercase hyphenated hex strings (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx). String literals in UUID format (with or without hyphens) are automatically parsed when inserted into UUID columns.
-- Both forms are accepted:
INSERT INTO t VALUES ('550e8400-e29b-41d4-a716-446655440000', 'with hyphens');
INSERT INTO t VALUES ('550e8400e29b41d4a716446655440000', 'without hyphens');
-- Cast between UUID and VARCHAR/VARBINARY:
SELECT CAST(id AS VARCHAR) FROM t;
SELECT CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID);
SELECT CAST(id AS VARBINARY) FROM t; -- 16-byte binary
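Why UUID_V7() suits time-ordered primary keys: the RFC 9562 v7 layout puts a 48-bit Unix-millisecond timestamp in the most significant bits, so later values sort after earlier ones both numerically and as lowercase hex strings. A minimal sketch of the layout (not MuroDB's generator):

```python
import os
import time
import uuid

def uuid_v7() -> uuid.UUID:
    """Minimal RFC 9562 UUIDv7 sketch: 48-bit Unix-ms timestamp,
    then version/variant bits, then random bits."""
    ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF       # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & (2**62 - 1)  # 62 random bits
    value = (ms & (2**48 - 1)) << 80  # timestamp occupies the top 48 bits
    value |= 0x7 << 76                # version 7
    value |= rand_a << 64
    value |= 0b10 << 62               # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)

a = uuid_v7()
time.sleep(0.002)
b = uuid_v7()
assert a.version == 7
assert str(a) < str(b)  # later generation sorts after earlier
```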
Date/Time Functions
NOW() / CURRENT_TIMESTAMP[()]
Returns the current UTC datetime as a DATETIME value.
SELECT NOW();
SELECT CURRENT_TIMESTAMP();
SELECT CURRENT_TIMESTAMP; -- parentheses are optional
DATE_FORMAT(dt, format)
Formats a date/datetime/timestamp string/value using MySQL-style format specifiers.
SELECT DATE_FORMAT('2026-02-22 13:04:05', '%Y/%m/%d %H:%i:%s');
-- '2026/02/22 13:04:05'
Common specifiers:
- %Y year (4 digits), %y year (2 digits)
- %m month (01-12), %c month (1-12), %M month name, %b month abbreviation
- %d day (01-31), %e day (1-31)
- %H hour (00-23), %h / %I hour (01-12), %i minute, %s second
- %W weekday name, %a weekday abbreviation
- %T HH:MM:SS, %r 12-hour time with AM/PM, %% literal percent
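For readers more familiar with strftime, a portable subset of these specifiers can be translated mechanically. The mapping table below is an approximation, not MuroDB's implementation (specifiers such as %c and %e lack portable strftime equivalents and are omitted):

```python
from datetime import datetime

# Subset mapping from MySQL-style specifiers to strftime equivalents.
MYSQL_TO_STRFTIME = {
    "%Y": "%Y", "%y": "%y", "%m": "%m", "%M": "%B", "%b": "%b",
    "%d": "%d", "%H": "%H", "%h": "%I", "%I": "%I",
    "%i": "%M", "%s": "%S", "%W": "%A", "%a": "%a",
    "%T": "%H:%M:%S", "%%": "%%",
}

def date_format(dt: datetime, fmt: str) -> str:
    out, i = [], 0
    while i < len(fmt):
        token = fmt[i:i + 2]
        if token in MYSQL_TO_STRFTIME:
            out.append(dt.strftime(MYSQL_TO_STRFTIME[token]))
            i += 2
        else:
            out.append(fmt[i])  # literal character passes through
            i += 1
    return "".join(out)

dt = datetime(2026, 2, 22, 13, 4, 5)
print(date_format(dt, "%Y/%m/%d %H:%i:%s"))  # 2026/02/22 13:04:05
```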
NULL Handling & Conditional
COALESCE(a, b, …)
Returns the first non-NULL argument.
SELECT COALESCE(NULL, NULL, 'fallback'); -- 'fallback'
IFNULL(a, b)
Returns a if not NULL, otherwise b.
SELECT IFNULL(NULL, 'default'); -- 'default'
SELECT IFNULL('value', 'default'); -- 'value'
NULLIF(a, b)
Returns NULL if a = b, otherwise returns a.
SELECT NULLIF(0, 0); -- NULL
SELECT NULLIF(5, 0); -- 5
IF(cond, then, else)
Returns then if cond is truthy, otherwise else.
SELECT IF(1, 'yes', 'no'); -- 'yes'
SELECT IF(0, 'yes', 'no'); -- 'no'
CASE WHEN
-- Searched CASE
SELECT CASE
WHEN val < 10 THEN 'low'
WHEN val < 20 THEN 'mid'
ELSE 'high'
END FROM t;
-- Simple CASE
SELECT CASE status
WHEN 1 THEN 'active'
WHEN 2 THEN 'inactive'
ELSE 'unknown'
END FROM t;
CAST
Converts a value to a different data type.
SELECT CAST('42' AS INT); -- 42
SELECT CAST(42 AS VARCHAR); -- '42'
SELECT CAST(val AS BIGINT) FROM t;
Supported target types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL(p,s), DATE, DATETIME, TIMESTAMP, VARCHAR, TEXT, JSONB, VARBINARY.
JSON Functions
JSONB accepts valid JSON only. Values are canonicalized when stored.
CREATE TABLE docs (id BIGINT PRIMARY KEY, doc JSONB);
INSERT INTO docs VALUES (1, '{"a":{"b":[1,2,3]}}');
JSON_EXTRACT(json, path)
Evaluates path using jsonpath_lib JSONPath semantics and returns matched JSON (canonical text). If multiple values match, returns a JSON array.
SELECT JSON_EXTRACT('{"a":{"b":1}}', '$.a.b'); -- "1"
JSON_SET(json, path, value)
Sets value at path and returns updated JSON. Supported update-path syntax is root-based dot/index form ($.key, $.arr[0], chained combinations).
SELECT JSON_SET('{"a":1}', '$.b', 2); -- {"a":1,"b":2}
JSON_REMOVE(json, path)
Removes value at path and returns updated JSON. Missing path is a no-op.
SELECT JSON_REMOVE('{"a":1,"b":2}', '$.a'); -- {"b":2}
JSON_TYPE(json)
Returns one of: NULL, BOOLEAN, INTEGER, DOUBLE, STRING, ARRAY, OBJECT.
SELECT JSON_TYPE('[1,2,3]'); -- ARRAY
JSON_CONTAINS(json, value_or_path)
- If the second argument starts with $, it is evaluated as JSONPath (via jsonpath_lib); returns 1 when any match exists.
- Otherwise it is treated as a JSON candidate value and checked for containment.
SELECT JSON_CONTAINS('{"a":{"b":1}}', '$.a.b'); -- 1
SELECT JSON_CONTAINS('{"a":1,"b":2}', '{"b":2}'); -- 1
JSON function behavior:
- If any argument is NULL, returns NULL.
- Invalid JSON input returns an error.
- Invalid/unsupported update-path syntax in JSON_SET / JSON_REMOVE returns an error.
Aggregation & GROUP BY
Aggregate Functions
SELECT COUNT(*) FROM t; -- count all rows
SELECT COUNT(col) FROM t; -- count non-NULL values
SELECT COUNT(DISTINCT col) FROM t; -- count distinct non-NULL values
SELECT SUM(amount) FROM orders; -- sum (skips NULLs)
SELECT AVG(amount) FROM orders; -- average (integer for integer inputs, float otherwise)
SELECT MIN(amount) FROM orders; -- minimum (skips NULLs)
SELECT MAX(amount) FROM orders; -- maximum (skips NULLs)
NULL semantics (SQL standard):
- COUNT(*) counts all rows, including NULLs
- COUNT(col) counts non-NULL values only
- SUM, AVG, MIN, MAX skip NULLs; they return NULL if all values are NULL
- On empty tables: COUNT returns 0, the others return NULL
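These NULL rules can be reproduced over a plain list of values, with None standing in for NULL:

```python
def aggregate(rows):
    """Reproduce the aggregate NULL semantics above over a list of values."""
    non_null = [v for v in rows if v is not None]
    return {
        "count_star": len(rows),                     # COUNT(*): NULLs included
        "count_col": len(non_null),                  # COUNT(col): non-NULL only
        "sum": sum(non_null) if non_null else None,  # NULL if all values are NULL
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

print(aggregate([10, None, 30]))
# {'count_star': 3, 'count_col': 2, 'sum': 40, 'min': 10, 'max': 30}
print(aggregate([]))
# {'count_star': 0, 'count_col': 0, 'sum': None, 'min': None, 'max': None}
```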
GROUP BY
SELECT category, COUNT(*) AS cnt FROM orders GROUP BY category;
SELECT category, status, SUM(amount) FROM orders GROUP BY category, status;
NULLs in GROUP BY columns form their own group.
HAVING
Filters groups after aggregation (use WHERE to filter rows before aggregation).
SELECT category, COUNT(*) AS cnt
FROM orders
GROUP BY category
HAVING COUNT(*) > 2;
SELECT DISTINCT
SELECT DISTINCT category FROM orders;
SELECT DISTINCT category, status FROM orders;
Subqueries
Uncorrelated subqueries are supported in WHERE clauses and SELECT lists.
IN / NOT IN (SELECT …)
-- Find users who have placed orders
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders);
-- Find users who have NOT placed orders
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM orders);
EXISTS / NOT EXISTS
-- Check if any orders exist (uncorrelated)
SELECT * FROM users WHERE EXISTS (SELECT id FROM orders);
-- Check if no orders exist
SELECT * FROM users WHERE NOT EXISTS (SELECT id FROM orders WHERE amount > 1000);
Scalar Subqueries
A scalar subquery returns exactly one column and at most one row. If it returns zero rows, the result is NULL. If it returns more than one row, an error is raised.
-- Scalar subquery in SELECT list
SELECT id, (SELECT MAX(amount) FROM orders) AS max_order FROM users;
-- Scalar subquery in WHERE
SELECT * FROM users WHERE id = (SELECT MIN(user_id) FROM orders);
Nested Subqueries
Subqueries can be nested:
SELECT * FROM t1 WHERE id IN (
SELECT id FROM t2 WHERE EXISTS (SELECT id FROM t3)
);
Limitations:
- Only uncorrelated subqueries (no outer row references).
- Subqueries are pre-materialized once per query (not per row).
JOIN
-- INNER JOIN
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.t1_id;
-- LEFT JOIN
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.t1_id;
-- RIGHT JOIN
SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.t1_id;
-- CROSS JOIN
SELECT * FROM t1 CROSS JOIN t2;
-- Table aliases
SELECT a.id, b.name FROM t1 AS a JOIN t2 AS b ON a.id = b.t1_id;
UNION / UNION ALL
Combines results from multiple SELECT statements.
-- UNION (removes duplicates)
SELECT id, name FROM t1 UNION SELECT id, name FROM t2;
-- UNION ALL (keeps duplicates)
SELECT id, name FROM t1 UNION ALL SELECT id, name FROM t2;
-- With ORDER BY and LIMIT (applies to the whole result)
SELECT id FROM t1 UNION SELECT id FROM t2 ORDER BY id LIMIT 10;
All SELECT statements in a UNION must return the same number of columns.
EXPLAIN
Shows the optimizer’s chosen access path and cardinality/cost estimates for a statement.
EXPLAIN SELECT * FROM t WHERE id = 1;
EXPLAIN UPDATE t SET name = 'Alicia' WHERE id = 1;
EXPLAIN DELETE FROM t WHERE id = 1;
Output Columns
| Column | Description |
|---|---|
| id | Always 1 (single plan row output) |
| select_type | SIMPLE, UPDATE, or DELETE |
| table | Base table name |
| type | Access type: const (PK lookup), ref (index lookup), range (index range seek), ALL (full scan), fulltext (FTS) |
| key | Index used (NULL for full scan) |
| rows | Estimated candidate rows for the chosen access path |
| cost | Heuristic cost of the chosen plan |
| Extra | Additional diagnostics (Using where, Using index, Using fulltext, JOIN loop notes, etc.) |
Access Type Meanings
- const: primary-key equality lookup (WHERE pk = ...).
- ref: secondary index equality lookup.
- range: index range scan (single/composite range shape).
- ALL: full table scan.
- fulltext: FULLTEXT index path.
How rows Is Estimated
- If table/index stats are present (ANALYZE TABLE), EXPLAIN uses:
  - table row stats,
  - index distinct-key stats,
  - numeric min/max bounds,
  - numeric histograms (single-column numeric B-tree indexes).
- If stats are missing, EXPLAIN falls back to conservative heuristics (or table row scan fallback where applicable).
How cost Is Estimated
- cost is a deterministic heuristic score used for plan comparison.
- Lower is better.
- It includes access-path cost and, for JOIN planning diagnostics, nested-loop alternative comparison cost.
- Compare cost values primarily within the same query shape.
JOIN Diagnostics in Extra
For EXPLAIN SELECT ... JOIN ..., Extra can include join-loop notes such as:
Join loops: j1=right_outer (L=20,R=3,cL=620,cR=603)
- j1 = first JOIN step.
- left_outer / right_outer = chosen outer loop side.
- L / R = estimated left/right input rows at that step.
- cL / cR = compared heuristic costs for each outer-loop alternative.
Practical Workflow
-- 1) inspect plan
EXPLAIN SELECT * FROM t WHERE a >= 100 AND a <= 110;
-- 2) refresh stats after major data changes
ANALYZE TABLE t;
-- 3) inspect again (rows/cost should better reflect current data)
EXPLAIN SELECT * FROM t WHERE a >= 100 AND a <= 110;
Current Scope and Limits
Limitations:
- Supported targets are SELECT, UPDATE, and DELETE.
- Output is currently a single-row summary (not a full operator tree).
- JOIN/subquery internals are summarized in Extra rather than emitted as multiple plan rows.
Rekey (Password Rotation)
Password rotation is not available as SQL syntax. Use the Rust API or the dedicated CLI command instead:
- Rust API: Database::rekey_with_password("new_password")
- CLI: murodb-rekey <db-file>
Transactions
BEGIN;
INSERT INTO t (id, name) VALUES (1, 'Alice');
INSERT INTO t (id, name) VALUES (2, 'Bob');
COMMIT;
-- Or rollback
BEGIN;
INSERT INTO t (id, name) VALUES (3, 'Charlie');
ROLLBACK;
-- Savepoints (MySQL-compatible)
BEGIN;
INSERT INTO t (id, name) VALUES (10, 'step1');
SAVEPOINT sp1;
INSERT INTO t (id, name) VALUES (11, 'step2');
ROLLBACK TO SAVEPOINT sp1; -- keeps step1, reverts step2
RELEASE SAVEPOINT sp1;
COMMIT;
Savepoint notes:
- `SAVEPOINT <name>`, `ROLLBACK TO [SAVEPOINT] <name>`, and `RELEASE SAVEPOINT <name>` are valid only inside an active transaction.
- `ROLLBACK TO` keeps the transaction active and discards savepoints created after the target.
- Reusing the same savepoint name overwrites the previous one (MySQL behavior).
- `COMMIT` and a full `ROLLBACK` clear all savepoints.
Rust API notes:
- `Database::query()` accepts read-only SQL only.
- `Database::query()` takes `&mut self` because read execution may refresh pager/catalog state from disk before running.
- For concurrent reads in one process, use multiple read-only handles (for example `Database::open_reader()`).
- Inside an explicit transaction (`BEGIN` … `COMMIT`/`ROLLBACK`), run statements through `Database::execute()`, including `SELECT`.
- `Database::set_busy_timeout_ms(ms)` sets the lock-wait timeout (`0` = wait indefinitely). `DatabaseReader::set_busy_timeout_ms(ms)` does the same for read-only handles.
- `Database::cancel_handle()` / `DatabaseReader::cancel_handle()` returns a `QueryCancelHandle`. `QueryCancelHandle::cancel()` returns `true` when a statement is currently in flight, otherwise `false`.
- Cancellation errors are reported as `MuroError::Cancelled`.
- `Database::set_statement_timeout_ms(ms)` and `DatabaseReader::set_statement_timeout_ms(ms)` set a per-statement execution timeout (`0` = no timeout).
- Timeout errors are reported as `MuroError::StatementTimeout { timeout_ms }`.
- Cancellation safety for explicit transactions: cancellation checks in write paths run before row-application phases, so a cancelled statement does not commit partial row changes.
Hidden _rowid
Tables without an explicit PRIMARY KEY automatically get a hidden `_rowid` column with auto-generated values.
Runtime Configuration
MuroDB supports session-scoped runtime options for operational tuning. This page explains what each option does, when to use it, and how to observe it.
Scope and Behavior
- Scope: session-only
- Persistence: not persisted in the database file
- Update timing: immediate for subsequent operations in the same session
- Transaction rule: runtime `SET` is rejected inside explicit transactions (`BEGIN ... COMMIT/ROLLBACK`)
You can set runtime options with SQL:
SET checkpoint_tx_threshold = 8;
SET checkpoint_wal_bytes_threshold = 1048576;
SET checkpoint_interval_ms = 1000;
Or with Rust API:
#![allow(unused)]
fn main() {
use murodb::{Database, sql::session::RuntimeConfig};
let mut db = Database::open_plaintext("mydb.db".as_ref())?;
db.set_runtime_config(RuntimeConfig {
checkpoint_tx_threshold: 8,
checkpoint_wal_bytes_threshold: 1_048_576,
checkpoint_interval_ms: 1_000,
})?;
let active = db.runtime_config()?;
}
Available Options
checkpoint_tx_threshold
- SQL name: `checkpoint_tx_threshold`
- Env default source: `MURODB_CHECKPOINT_TX_THRESHOLD`
- Default value: `1`
- Type/range: `u64` (`0` or greater)
Meaning:
- Trigger a checkpoint after this many post-commit/post-rollback operations.
- `1` means checkpoint on every commit/rollback.
- `0` disables the tx-count trigger.
Use when:
- You want to trade lower commit overhead for larger WAL windows.
checkpoint_wal_bytes_threshold
- SQL name: `checkpoint_wal_bytes_threshold`
- Env default source: `MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD`
- Default value: `0` (disabled)
- Type/range: `u64` (`0` or greater)
Meaning:
- Trigger a checkpoint when the WAL file size reaches this threshold in bytes.
- `0` disables the size trigger.
Use when:
- You want an upper guardrail on WAL growth.
checkpoint_interval_ms
- SQL name: `checkpoint_interval_ms`
- Env default source: `MURODB_CHECKPOINT_INTERVAL_MS`
- Default value: `0` (disabled)
- Type/range: `u64` (`0` or greater)
Meaning:
- Trigger a checkpoint when the elapsed time since the last successful checkpoint reaches this threshold.
- `0` disables the time trigger.
Use when:
- Workload has bursts and you want a time-based checkpoint cadence.
Trigger Combination Rules
Checkpoint runs when any enabled trigger fires:
- tx-count trigger (`checkpoint_tx_threshold`)
- WAL-size trigger (`checkpoint_wal_bytes_threshold`)
- interval trigger (`checkpoint_interval_ms`)
Validation and Errors
- Runtime values must be non-negative integers.
- Unknown option names return a deterministic parse error.
- Using runtime `SET` inside an explicit transaction returns an execution error.
Observability
Use:
SHOW DATABASE STATS;
Relevant fields:
- `checkpoint_policy_tx_threshold`
- `checkpoint_policy_wal_bytes_threshold`
- `checkpoint_policy_interval_ms`
- `deferred_checkpoints`
- `checkpoint_pending_ops`
- `failed_checkpoints`
- `wal_file_size_bytes`
Benchmarks
This page tracks MuroDB performance over time for a fixed embedded-DB style workload mix.
Each entry is tied to a concrete version (git commit) so changes can be compared historically.
Benchmark Scope
Current benchmark runner: src/bin/murodb_bench.rs
Workloads:
- `point_select_pk`: point lookup by primary key (`SELECT ... WHERE id = ?`)
- `point_update_pk`: point update by primary key (`UPDATE ... WHERE id = ?`)
- `insert_autocommit`: single-row insert per transaction (auto-commit)
- `range_scan_limit_100`: range read (`WHERE id >= ? ORDER BY id LIMIT 100`)
- `mixed_80r_15u_5i`: mixed OLTP-like workload (80% read / 15% update / 5% insert)
- `fts_select_natural`: fulltext natural-language search (`MATCH(body) AGAINST(... IN NATURAL LANGUAGE MODE)`)
- `fts_update_point`: point update on an FTS-indexed `TEXT` column
- `fts_mixed_70q_30u`: FTS-focused mixed workload (70% search / 30% update)
Additional microbenchmark:
- `murodb_snippet_bench`: compares the legacy vs current `fts_snippet()` implementation on representative text sizes and reports approximate offset-map memory bytes.
Default dataset/config:
- initial rows: `20,000`
- fts initial rows: `256`
- select ops: `20,000`
- update ops: `5,000`
- insert ops: `5,000`
- scan ops: `2,000`
- mixed ops: `10,000`
- fts select ops: `5,000`
- fts update ops: `2,000`
- fts mixed ops: `5,000`
- warmup ops: `200`
- batch size (initial load): `500`
Run command:
cargo run --release --bin murodb_bench
Migration note: as of 2026-02-22, benchmark scope includes additional FTS workloads. Older entries without FTS metrics are not directly comparable on total runtime.
Versioned Results
| Date (UTC) | Commit | Environment | Notes |
|---|---|---|---|
| 2026-02-22 | a78694537f59 | local dev machine | first baseline |
| 2026-02-22 | 829ad18145c2 | local dev machine | after secondary-index root persistence fix |
| 2026-02-22 | 5c422b8b | local dev machine | added snippet microbenchmark and UTF-8 offset-map optimization |
2026-02-22 / a78694537f59
Raw output summary:
| Workload | Ops | Total sec | Ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|
| point_select_pk | 20,000 | 0.144532 | 138,377.80 | 0.0082 | 0.0096 | 0.0108 |
| point_update_pk | 5,000 | 27.098314 | 184.51 | 5.2210 | 6.9286 | 8.9318 |
| insert_autocommit | 5,000 | 8.785356 | 569.13 | 1.5480 | 2.4816 | 5.7406 |
| range_scan_limit_100 | 2,000 | 20.240664 | 98.81 | 9.6326 | 13.5310 | 13.9811 |
| mixed_80r_15u_5i | 10,000 | 10.417702 | 959.90 | 0.0112 | 6.2421 | 6.7669 |
Row counts:
- start: `20,000`
- after insert phase: `25,000`
- final: `25,519`
2026-02-22 / 829ad18145c2
Raw output summary:
| Workload | Ops | Total sec | Ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|
| point_select_pk | 20,000 | 0.123495 | 161,949.51 | 0.0072 | 0.0076 | 0.0087 |
| point_update_pk | 5,000 | 8.146347 | 613.77 | 1.4856 | 2.0578 | 5.4681 |
| insert_autocommit | 5,000 | 8.675668 | 576.32 | 1.5077 | 2.7074 | 5.8033 |
| range_scan_limit_100 | 2,000 | 18.577489 | 107.66 | 9.2207 | 12.1729 | 12.5408 |
| mixed_80r_15u_5i | 10,000 | 3.626337 | 2,757.60 | 0.0098 | 1.7256 | 2.4382 |
| fts_select_natural | 5,000 | 2.797700 | 1,787.18 | 0.5595 | 0.6087 | 0.6366 |
| fts_update_point | 2,000 | 11.462999 | 174.47 | 5.4321 | 7.0625 | 9.7803 |
| fts_mixed_70q_30u | 5,000 | 10.865190 | 460.19 | 0.6296 | 6.3375 | 7.4033 |
Row counts:
- start: `20,000`
- after insert phase: `25,000`
- final: `25,519`
2026-02-22 / 5c422b8b (Snippet Microbenchmark)
Command:
cargo run --release --bin murodb_snippet_bench
Raw output summary:
| Case | Iters | Legacy p50 (us) | New p50 (us) | p50 Speedup | Approx offset-map bytes |
|---|---|---|---|---|---|
| snippet_short_tail_hit | 2,000 | 14.10 | 15.09 | 0.93x | 4,936 |
| snippet_medium_tail_hit | 2,000 | 126.01 | 127.93 | 0.99x | 48,136 |
| snippet_long_tail_hit | 500 | 1245.52 | 1228.43 | 1.01x | 480,136 |
Notes:
- Short/medium cases are near parity.
- Long tail-hit case shows small but measurable p50 reduction.
- Offset-map memory is linear in the normalized char count: `(chars + 1) * sizeof(usize)`.
2026-02-22 / Stop-ngram behavior example
Reference scenario (same as SQL integration tests):
| Setting | Query | Expected behavior |
|---|---|---|
| `stop_filter=off` | `MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE)` | broader recall (`東京*` docs can match) |
| `stop_filter=on, stop_df_ratio_ppm=500000` | same | higher precision (mostly the exact-intent doc remains) |
Adding New Entries
When updating this page for a new version:
- Run `cargo run --release --bin murodb_bench`.
- Record `git rev-parse --short=12 HEAD`.
- Append one row to the “Versioned Results” table.
- Add a new subsection with the raw metrics table for that commit.
Keep benchmark parameters constant unless intentionally changing the benchmark definition. If benchmark definitions change, include a short migration note in the new entry.
Full-Text Search
MuroDB provides MySQL-compatible full-text search with bigram tokenization.
Creating a fulltext index
CREATE FULLTEXT INDEX t_body_fts ON t(body)
WITH PARSER ngram
OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000);
The `WITH PARSER` / `OPTIONS` syntax is available so parser variants can be expanded in future releases.
FTS uses an internal `doc_id` mapping, so it works with non-`BIGINT` primary keys too.
If a table has no explicit primary key, MuroDB’s hidden `_rowid` is used.
Supported options:
- `n`: ngram size (`2` only for now)
- `normalize`: normalization mode (`'nfkc'` only for now)
- `stop_filter`: `on`/`off` (or `1`/`0`, `'true'`/`'false'`)
- `stop_df_ratio_ppm`: document-frequency threshold in ppm (`0..=1000000`)
`stop_filter` applies to NATURAL LANGUAGE MODE only. BOOLEAN MODE behavior is unchanged.
Query semantics
NATURAL LANGUAGE MODE
- BM25-based relevance ranking.
SELECT id, MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE) AS score
FROM t
WHERE MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE) > 0
ORDER BY score DESC
LIMIT 20;
With stop-ngram filtering enabled:
CREATE FULLTEXT INDEX t_body_fts ON t(body)
WITH PARSER ngram
OPTIONS (n=2, normalize='nfkc', stop_filter=on, stop_df_ratio_ppm=500000);
This skips very frequent low-information ngrams during scoring.
BOOLEAN MODE
Supports +term (required), -term (excluded), and "phrase" (exact phrase).
SELECT id
FROM t
WHERE MATCH(body) AGAINST('"東京タワー" +夜景 -混雑' IN BOOLEAN MODE) > 0;
| Operator | Meaning | Example |
|---|---|---|
| `+term` | Term must be present | `+東京` |
| `-term` | Term must not be present | `-混雑` |
| `"phrase"` | Exact phrase match | `"東京タワー"` |
| `term` | Optional (score contribution) | `夜景` |
Snippet helper
Use `fts_snippet()` for highlighted excerpts.
SELECT id,
fts_snippet(body, '"東京タワー"', '<mark>', '</mark>', 30) AS snippet
FROM t
WHERE MATCH(body) AGAINST('"東京タワー"' IN BOOLEAN MODE) > 0
LIMIT 10;
Recall/precision tradeoff example
Dataset:
- doc1: `東京タワー`
- doc2: `東京駅`
- doc3: `東京大学`
- doc4: `東京ドーム`
Query:
MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE)
Observed behavior:
- `stop_filter=off`: broad recall (multiple `東京*` docs can match)
- `stop_filter=on, stop_df_ratio_ppm=500000`: higher precision (mostly the `東京タワー` doc remains)
Internal design
See FTS Internals for implementation details.
Recovery
MuroDB uses WAL-based crash recovery with two recovery modes.
Recovery modes
strict (default)
Fails on any WAL protocol violation.
murodb mydb.db
Detects and rejects:
- Records before BEGIN
- Commit LSN mismatches
- Duplicate terminal records
- PagePut integrity mismatches
permissive
Skips invalid transactions and recovers only valid committed transactions. Useful for salvaging data from corrupted databases.
murodb mydb.db --recovery-mode permissive
When transactions are skipped, the original WAL is quarantined to *.wal.quarantine.*.
WAL Inspection
Analyze WAL consistency without modifying the database.
# Text output
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive
# JSON output (for automation)
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
Quarantine files can also be inspected directly:
murodb-wal-inspect mydb.db --wal mydb.wal.quarantine.20240101_120000
Exit codes
| Code | Meaning |
|---|---|
| `0` | No malformed transactions detected |
| `10` | Malformed transactions detected (inspection succeeded) |
| `20` | Fatal error (decrypt/IO/strict failure, etc.) |
API
#![allow(unused)]
fn main() {
use murodb::{Database, RecoveryMode};
// strict (default)
let db = Database::open("mydb.db", &master_key)?;
// permissive
let db = Database::open_with_recovery_mode(
"mydb.db", &master_key, RecoveryMode::Permissive
)?;
// permissive with report
let (db, report) = Database::open_with_recovery_mode_and_report(
"mydb.db", &master_key, RecoveryMode::Permissive
)?;
for skip in &report.skipped {
eprintln!("Skipped tx {}: {:?}", skip.txid, skip.reason);
}
}
JSON Schema Versioning Policy
- `schema_version` increments only on breaking changes (key removal, type changes)
- New keys are added without a version bump (consumers should ignore unknown keys)
- `RecoverySkipCode` string values are frozen (regression-tested)
- `InspectFatalKind` string values are frozen (regression-tested)
Backup & Restore
Overview
MuroDB provides a `Database::backup()` API for creating consistent, point-in-time backups of a running database. The backup file is a complete MuroDB database that can be opened directly with the same key or password.
Creating a Backup
Rust API
#![allow(unused)]
fn main() {
use murodb::Database;
use murodb::crypto::aead::MasterKey;
let mut db = Database::open(path, &master_key)?;
// Normal operations ...
db.execute("INSERT INTO users VALUES (1, 'alice')")?;
// Create a consistent backup
db.backup("/path/to/backup.db")?;
}
Password-based databases work the same way:
#![allow(unused)]
fn main() {
let mut db = Database::open_with_password(path, "my-password")?;
db.backup("/path/to/backup.db")?;
}
Plaintext databases too:
#![allow(unused)]
fn main() {
let mut db = Database::open_plaintext(path)?;
db.backup("/path/to/backup.db")?;
}
What Happens During Backup
- An exclusive lock is acquired (writers are blocked, but backup is typically fast).
- The WAL is checkpointed so all committed data is flushed to the main file.
- The database file is copied byte-by-byte to the destination.
- The destination file is fsynced for durability.
- The lock is released and normal operations resume.
Restoring from a Backup
The backup file is a standard MuroDB database file. To restore:
- Stop the application (or close the `Database` handle).
- Replace the original database file with the backup file.
- Delete the WAL file (`<dbname>.wal`) if present; the backup has no pending WAL.
- Reopen the database normally.
#![allow(unused)]
fn main() {
// Open the backup directly
let mut db = Database::open("/path/to/backup.db", &master_key)?;
}
Safety
- Same-file protection: Attempting to backup to the source file itself (including via symlinks or hardlinks) returns an error without modifying the source.
- Encryption preserved: Encrypted databases are copied as-is. No decryption or re-encryption occurs. The backup uses the same key, salt, and encryption suite.
- Crash safety: If the process crashes during backup, the source database is unaffected. The destination file may be incomplete and should be discarded.
Limitations
| Item | Detail |
|---|---|
| Disk space | Requires free space equal to the full database size. |
| Writer blocking | Writers are blocked for the duration of the copy (proportional to DB size). |
| WAL not included | The WAL is checkpointed before copy; the backup file has no WAL dependency. |
| Incremental backup | Not supported. Each backup is a full copy. |
Checkpoint Policy Tuning
MuroDB can defer WAL truncate (checkpoint) using policy thresholds. This helps reduce commit-time overhead on write-heavy workloads.
Configuration Knobs
Checkpoint policy can be configured in two ways:
- Environment variables (process default, applied at session construction)
- SQL runtime options (`SET ...`, session scope, non-persistent)
For full option-by-option semantics, see Runtime Configuration.
Environment variable names:
- `MURODB_CHECKPOINT_TX_THRESHOLD`
- `MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD`
- `MURODB_CHECKPOINT_INTERVAL_MS`
SQL runtime option names:
- `checkpoint_tx_threshold`
- `checkpoint_wal_bytes_threshold`
- `checkpoint_interval_ms`
Semantics:
- Checkpoint runs when any enabled trigger fires.
- `MURODB_CHECKPOINT_TX_THRESHOLD=1` (default): checkpoint every commit/rollback.
- `MURODB_CHECKPOINT_TX_THRESHOLD=0`: disable the tx-count trigger.
- `*_WAL_BYTES_THRESHOLD=0` / `*_INTERVAL_MS=0`: disabled.
- Runtime `SET` values follow the same semantics, but are session-only and not persisted.
Runtime example:
SET checkpoint_tx_threshold = 8;
SET checkpoint_wal_bytes_threshold = 1048576;
SET checkpoint_interval_ms = 1000;
Recommended Starting Profiles
Low-latency / conservative
MURODB_CHECKPOINT_TX_THRESHOLD=8
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=1048576
MURODB_CHECKPOINT_INTERVAL_MS=1000
Use this when you want modest commit-speed gains with tight WAL growth control.
Write-throughput focused
MURODB_CHECKPOINT_TX_THRESHOLD=64
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=8388608
MURODB_CHECKPOINT_INTERVAL_MS=5000
Use this for update/insert-heavy workloads where throughput is prioritized over immediate WAL truncation.
Time-driven only
MURODB_CHECKPOINT_TX_THRESHOLD=0
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=0
MURODB_CHECKPOINT_INTERVAL_MS=1000
Useful when transaction size/shape is bursty and you want predictable checkpoint cadence.
Tuning Procedure
- Start from the conservative profile.
- Run workload benchmark and record throughput/latency.
- Increase `MURODB_CHECKPOINT_TX_THRESHOLD` stepwise (for example: `8 -> 16 -> 32 -> 64`).
- Keep safety bounds with either `MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD` or `MURODB_CHECKPOINT_INTERVAL_MS`.
- Stop increasing when the throughput gain flattens or WAL/recovery cost becomes unacceptable.
What to Monitor
Use:
SHOW DATABASE STATS;
Track at least:
- `failed_checkpoints`
- `deferred_checkpoints`
- `checkpoint_pending_ops`
- `checkpoint_policy_tx_threshold`
- `checkpoint_policy_wal_bytes_threshold`
- `checkpoint_policy_interval_ms`
And from filesystem:
ls -lh mydb.wal
Guardrails
- `failed_checkpoints > 0` means truncate is failing; investigate disk I/O.
- `checkpoint_pending_ops` growing continuously alongside WAL size means the policy is too loose for the current workload.
- Larger deferred windows can increase restart recovery time because WAL replay work increases.
- The durability boundary is unchanged: commit durability still depends on the WAL `sync`.
Alerting & Monitoring
MuroDB exposes internal health metrics via the SHOW DATABASE STATS SQL command. This page describes how to use these metrics for operational alerting.
Querying Stats
SHOW DATABASE STATS;
Returns a key-value table with columns `stat` and `value`.
Related command:
SHOW CHECKPOINT STATS;
This returns the checkpoint-only subset for backward compatibility.
Key Metrics
commit_in_doubt_count
Alert threshold: > 0
Indicates the number of times a commit succeeded in the WAL (durable) but failed to write pages or metadata to the data file. The session is poisoned after this event.
Action: Close the session and reopen the database. WAL recovery will replay the committed transaction. Investigate the root cause (disk full, I/O errors, hardware failure).
IF commit_in_doubt_count > 0 THEN ALERT
severity: CRITICAL
message: "CommitInDoubt detected. Session is poisoned. Reopen database to recover."
failed_checkpoints
Alert threshold: > 0
Indicates checkpoint truncation failures. The WAL file is not being truncated after successful commits, which causes WAL growth.
Action: Monitor WAL file size. If it grows unboundedly, investigate disk I/O. The database remains correct (WAL replay is idempotent), but performance may degrade due to longer recovery times.
IF failed_checkpoints > 0 THEN ALERT
severity: WARNING
message: "Checkpoint failures detected. WAL may be growing. Monitor WAL file size on disk."
deferred_checkpoints / checkpoint_pending_ops
`deferred_checkpoints` is the number of transactions where a checkpoint was intentionally skipped by policy.
`checkpoint_pending_ops` is the current backlog since the last successful checkpoint.
A high `deferred_checkpoints` is expected with batch policies. Alert only when `checkpoint_pending_ops` keeps growing together with WAL size.
checkpoint_policy_*
The active checkpoint policy is surfaced as:
- `checkpoint_policy_tx_threshold`
- `checkpoint_policy_wal_bytes_threshold`
- `checkpoint_policy_interval_ms`
Policy is configured via environment variables:
- `MURODB_CHECKPOINT_TX_THRESHOLD` (default `1`; `0` disables the tx-count trigger)
- `MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD` (default `0`, disabled)
- `MURODB_CHECKPOINT_INTERVAL_MS` (default `0`, disabled)
Policy can also be changed at runtime per session:
- `SET checkpoint_tx_threshold = <u64>`
- `SET checkpoint_wal_bytes_threshold = <u64>`
- `SET checkpoint_interval_ms = <u64>`
See Checkpoint Policy Tuning for recommended starting values and rollout procedure. Option details are documented in Runtime Configuration.
freelist_sanitize_count
Alert threshold: > 0 (informational)
Number of times the freelist was sanitized during page allocation to remove invalid entries. This is a self-healing mechanism, not necessarily an error.
Action: If the count is consistently non-zero across sessions, investigate potential freelist corruption. A one-time occurrence after crash recovery is normal.
freelist_out_of_range_total / freelist_duplicates_total
Alert threshold: > 0 (informational)
Breakdown counters for freelist sanitization events:
- `freelist_out_of_range_total`: removed page IDs outside the valid page range
- `freelist_duplicates_total`: removed duplicate page IDs
Action: non-zero values are expected only when sanitization occurred. If values continue increasing across clean restarts, investigate possible on-disk corruption.
WAL Size Monitoring
`SHOW DATABASE STATS` exposes the WAL size as `wal_file_size_bytes`.
Use this SQL-native metric first and correlate it with `failed_checkpoints`.
Persistent growth together with `failed_checkpoints > 0` indicates checkpoint truncate failures.
Monitoring Patterns
Polling Loop
For long-running applications, periodically query stats:
#![allow(unused)]
fn main() {
use murodb::Database;
fn check_health(db: &mut Database) {
if let Ok(result) = db.execute("SHOW DATABASE STATS") {
// Parse result and check thresholds
// Alert on commit_in_doubt_count > 0
// Alert on failed_checkpoints > 0
// Track wal_file_size_bytes
}
}
}
Post-Recovery Check
After opening a database that required WAL recovery, verify the recovery result:
#![allow(unused)]
fn main() {
use murodb::{Database, RecoveryMode};
let (db, report) = Database::open_with_recovery_mode_and_report(
"mydb.db", &master_key, RecoveryMode::Strict
)?;
if !report.committed_txids.is_empty() {
log::info!("Recovered {} committed transactions", report.committed_txids.len());
}
if !report.skipped.is_empty() {
log::warn!("Skipped {} malformed transactions", report.skipped.len());
}
}
Summary
| Metric | Threshold | Severity | Meaning |
|---|---|---|---|
| `commit_in_doubt_count` | > 0 | Critical | Session poisoned; reopen required |
| `failed_checkpoints` | > 0 | Warning | WAL growing; checkpoint failing |
| `wal_file_size_bytes` | Increasing trend | Warning | WAL growth; correlate with checkpoint failures |
| `freelist_sanitize_count` | > 0 | Info | Freelist self-healed |
| `freelist_out_of_range_total` | > 0 | Info | Invalid freelist entries removed (range) |
| `freelist_duplicates_total` | > 0 | Info | Invalid freelist entries removed (duplicates) |
Incident Response Runbook
This runbook covers common failure scenarios and the recommended response procedures for MuroDB operators.
First Response: Gather Diagnostics
Before taking any corrective action, collect the current state:
SHOW DATABASE STATS;
Note the values of `commit_in_doubt_count`, `failed_checkpoints`, `wal_file_size_bytes`, `freelist_sanitize_count`, `freelist_out_of_range_total`, and `freelist_duplicates_total`.
If the database cannot be opened, inspect the WAL without modifying it:
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
Scenario: CommitInDoubt Detected
Symptom: `commit_in_doubt_count > 0` in `SHOW DATABASE STATS`, or the application receives a `CommitInDoubt` error.
What happened: The transaction was durably written to the WAL, but the subsequent write of dirty pages or metadata to the data file failed (e.g., disk full, I/O error). The session is poisoned and will reject further operations.
Response:
- Close the current session / database handle immediately.
- Investigate the root cause: check disk space (`df -h`), kernel logs (`dmesg`), and storage health.
- Resolve the underlying issue (free disk space, replace the failing disk).
- Reopen the database. WAL recovery will automatically replay the committed transaction.
- Run `SHOW DATABASE STATS` to confirm `commit_in_doubt_count` is `0` after recovery.
Do NOT delete or rename the WAL file — it contains the committed data that needs to be replayed.
Scenario: Checkpoint Failures / WAL Growth
Symptom: `failed_checkpoints > 0`, and `wal_file_size_bytes` is growing over time.
What happened: After committing, MuroDB truncates the WAL via a checkpoint. If truncation fails, the WAL keeps growing. The database remains correct (WAL replay is idempotent), but recovery time increases.
Response:
- Check disk I/O health and available space.
- If the WAL is very large but the database is otherwise healthy, restart the process. Recovery on startup will replay and then truncate the WAL.
- Monitor `wal_file_size_bytes` after restart to confirm the WAL was truncated.
Scenario: Freelist Corruption Suspected
Symptom: `freelist_sanitize_count` is consistently non-zero across multiple sessions (not just once after a crash recovery).
What happened: The freelist (which tracks reusable pages) contained invalid entries. MuroDB self-heals by removing invalid entries during allocation, but persistent occurrences may indicate deeper corruption.
Response:
- A single occurrence after crash recovery is normal — no action needed.
- If it recurs across sessions:
  - Back up the database file and WAL immediately.
  - Open with `--recovery-mode permissive` and check the recovery report for skipped transactions.
  - If data integrity is confirmed, the self-healing is working correctly. Continue monitoring.
  - If data loss is suspected, restore from backup and replay from the last known good state.
Scenario: Database Fails to Open (WAL Corruption)
Symptom: Opening the database fails with a recovery error in strict mode.
Response:
- Inspect first; do not delete any files: `murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json`
- Review the report. If only incomplete (uncommitted) transactions are malformed, they can be safely skipped.
- Open with permissive mode to recover valid data: `murodb mydb.db --recovery-mode permissive`
- The original WAL is automatically quarantined to `*.wal.quarantine.*` for forensic analysis.
- Verify recovered data integrity by querying critical tables.
Scenario: Process Crash / Kill During Operation
Symptom: The MuroDB process was killed (SIGKILL, OOM, power loss) mid-operation.
Response:
- Simply reopen the database. WAL recovery handles this automatically.
- Check `SHOW DATABASE STATS` after recovery:
  - `commit_in_doubt_count` should be `0`.
  - `freelist_sanitize_count` may be `> 0` once; this is normal.
- If strict recovery fails, follow the “Database Fails to Open” procedure above.
When to Restart vs. Quarantine WAL
| Situation | Action |
|---|---|
| Session poisoned (CommitInDoubt) | Restart — recovery replays committed data |
| WAL growing (checkpoint failures) | Restart — recovery truncates WAL |
| Strict recovery fails | Inspect WAL, then open with --recovery-mode permissive |
| Repeated freelist sanitization | Back up, then investigate with permissive mode |
| Corrupted WAL with data loss | Restore from backup |
Escalation Criteria
Escalate to the development team if:
- `commit_in_doubt_count > 0` persists after reopening the database (recovery failed to replay).
- WAL inspection shows committed transactions that were not recovered.
- Permissive mode skips transactions that should have been valid.
- Freelist sanitization count grows across sessions without any preceding crash.
- The database file size is inconsistent with expected data volume (possible page leak).
When escalating, include:
- Full `SHOW DATABASE STATS` output.
- WAL inspection JSON output (`murodb-wal-inspect --format json`).
- Kernel logs around the time of failure (`dmesg`, `journalctl`).
- The quarantined WAL file(s), if any.
Limits Reference
This page documents the known limits of MuroDB. These limits arise from the fixed page size, serialization formats, and design decisions of the storage engine.
Page & Row Limits
| Limit | Value | Notes |
|---|---|---|
| Page size | 4,096 bytes | Fixed; all data pages, B-tree nodes, and catalog entries use this size |
| Page header | 14 bytes | page_id (8) + cell_count (2) + free_start (2) + free_end (2) |
| Max inline row size | ~4,073 bytes | Rows within this limit are stored inline in a single page |
| Max row size (with overflow) | ~4 GB | Limited by u32 total_value_len; values exceeding inline limit use overflow pages |
| Max cell payload | ~4,073 bytes | 4,096 − 14 (header) − 5 (node header cell) − 4 (cell pointer + length prefix) |
| Overflow chunk size | 4,077 bytes | Per overflow page: 4,096 − 19 bytes header |
Rows with values that exceed the inline page capacity automatically use overflow pages. The value is stored in a chain of overflow pages, with the leaf cell containing only the key and a pointer to the first overflow page. Keys must still fit inline (max ~4,071 bytes).
Column Limits
| Limit | Value | Notes |
|---|---|---|
| Max column count | 65,535 (u16) | Serialization limit; practical limit is lower due to page capacity |
| Column name max length | 65,535 bytes (u16) | Limited in practice by catalog page capacity |
| Table name max length | 65,535 bytes (u16) | Limited in practice by catalog page capacity |
Data Type Ranges
| Type | Min | Max | Storage |
|---|---|---|---|
| TINYINT | −128 | 127 | 1 byte |
| SMALLINT | −32,768 | 32,767 | 2 bytes |
| INT | −2,147,483,648 | 2,147,483,647 | 4 bytes |
| BIGINT | −2^63 | 2^63 − 1 | 8 bytes |
| FLOAT | ±1.2×10^−38 | ±3.4×10^38 | 4 bytes (finite values only; NaN/Infinity rejected) |
| DOUBLE | ±2.2×10^−308 | ±1.7×10^308 | 8 bytes (finite values only; NaN/Infinity rejected) |
String & Binary Limits
| Limit | Value | Notes |
|---|---|---|
| VARCHAR(n) max n | 4,294,967,295 (u32) | Values exceeding ~4,073 bytes use overflow pages |
| VARBINARY(n) max n | 4,294,967,295 (u32) | Values exceeding ~4,073 bytes use overflow pages |
| TEXT max size | ~4 GB | Limited by u32 value length; large values use overflow pages |
| VARCHAR(n) length check | Character-based | VARCHAR(100) allows up to 100 characters (MySQL-compatible) |
Note: MuroDB checks `VARCHAR(n)` against character count, consistent with MySQL. Multi-byte UTF-8 characters (e.g., Japanese characters at 3 bytes each, emoji at 4 bytes) each count as one character. `VARBINARY(n)` still uses byte-based length checking.
Internal Limits
| Limit | Value | Notes |
|---|---|---|
| B-tree max depth | 64 | Exceeded depth indicates corruption |
| WAL max frame length | 5,120 bytes | |
| FTS inline segment limit | 3,000 bytes | Larger posting lists use overflow pages |
| FTS max payload | 65,536 bytes | |
| LRU cache default size | 256 pages | Configurable |
NULL Behavior
- Primary key columns cannot be NULL
- `NULL = NULL` evaluates to UNKNOWN (not TRUE); use `IS NULL` instead
- Aggregate functions (SUM, AVG, MIN, MAX) skip NULL values
- `COUNT(*)` counts all rows; `COUNT(column)` counts non-NULL values only
- Multiple NULL values are allowed in UNIQUE indexes (SQL standard)
- NULL values in ORDER BY are grouped together
Security Considerations
This page summarizes the current security model, known risks, and operational guidance.
Threat Model (Current)
- MuroDB is an embedded database library/CLI, not a network server.
- Encrypted mode (`aes256-gcm-siv`) targets at-rest confidentiality and tamper detection for DB/WAL pages.
- Plaintext mode (`--encryption off`) is explicit opt-in and provides no cryptographic protection.
Known Risks
| Risk | Impact | Status |
|---|---|---|
| Malformed page/cell metadata can currently trigger panic paths instead of clean corruption errors | Process abort (availability) when opening/querying corrupted files, especially relevant in plaintext mode | Tracked: #182 |
| Plaintext mode has no confidentiality/integrity guarantees | Data can be read/modified offline without cryptographic checks | By design |
| No built-in user authentication/authorization layer | Access control depends on host process + filesystem permissions | By design |
Operational Guidance
- Prefer encrypted mode for production data.
- Avoid passing secrets via CLI args (`--password`) when possible; use the interactive prompt.
- Treat database files as trusted inputs only until #182 is addressed.
- Apply OS-level controls: file permissions, disk encryption, process isolation, and secrets management.
Architecture
MuroDB is an embedded SQL database with optional at-rest encryption. The runtime stack is organized in layers:
sql/ (lexer → parser → planner → executor)
↓
schema/ (catalog: table/index definitions)
↓
tx/ (transaction: dirty page buffer, commit/rollback)
↓
btree/ (B-tree: insert/split, delete, search, scan)
↓
wal/ (WAL: encrypted records, crash recovery)
↓
storage/ (pager: encrypted page I/O, LRU cache, freelist)
↓
crypto/ (AES-256-GCM-SIV, Argon2 KDF, HMAC-SHA256)
Additional modules:
- `fts/` - Full-text search (bigram tokenizer, postings B-tree, BM25, BOOLEAN/NATURAL mode)
- `concurrency/` - `parking_lot::RwLock` (thread) + `fs4` file lock (process)
How To Read This Section
If your goal is “reconstruct internals after a long break”, read in this order:
- Reading Guide
- Files, WAL, and Locking
- Storage
- Catalog Format
- B-tree
- Query Planning & Execution
- Cryptography
- WAL & Crash Resilience
Module Map
| Module | Files | Role |
|---|---|---|
| `storage/` | page.rs, pager.rs, freelist.rs | 4096B encrypted page I/O |
| `crypto/` | aead.rs, kdf.rs, hmac_util.rs | Encryption primitives |
| `btree/` | node.rs, ops.rs, cursor.rs, key_encoding.rs | B-tree operations |
| `wal/` | record.rs, writer.rs, reader.rs, recovery.rs | WAL + crash recovery |
| `tx/` | transaction.rs, lock_manager.rs | Transactions |
| `schema/` | catalog.rs, column.rs, index.rs | System catalog |
| `sql/` | lexer.rs, parser.rs, ast.rs, planner.rs, executor.rs, eval.rs | SQL processing |
| `fts/` | tokenizer.rs, postings.rs, index.rs, query.rs, scoring.rs, snippet.rs | Full-text search |
| `concurrency/` | mod.rs | Concurrency control |
Concurrency Model
- Thread-level: `parking_lot::RwLock` - multiple readers, single writer
- Process-level: `fs4` file lock - prevents concurrent access from multiple processes
- API routing:
  - `Database::query` acquires a shared lock for read-only statements.
  - `Database::execute` acquires an exclusive lock for general SQL execution.
  - CLI routes read-only statements to `query` unless an explicit transaction is active.
- Handle model:
  - `Database::query` takes `&mut self` because read execution may refresh pager/catalog state from disk before running.
  - For concurrent reads in one process, open additional read-only handles (`Database::open_reader`) and run `query` on each handle.
For on-disk file contracts (main DB file / .wal / .lock), see Files, WAL, and Locking.
For catalog key/value binary layouts, see Catalog Format.
Reading Guide
This page defines how to use the internals docs as a long-term memory aid.
Current Focus Areas
If you return after weeks/months, start from these questions:
- How is data physically laid out on disk?
- How is a B+tree node stored inside a page?
- How does a query become a concrete access path?
- What is the exact `.wal` format and recovery state machine?
- What does `.lock` lock, and at what granularity?
- Which cryptographic primitives are used, and why these choices?
Target Documentation Shape
The internals section now follows this order:
- Architecture: module map and end-to-end data flow.
- Files, WAL, and Locking: main file / `.wal` / `.lock` contract.
- Storage: file header, generic page layout, and freelist format.
- Catalog Format: system catalog key/value encoding and compatibility.
- B-tree: in-page node format and mutation/scan algorithms.
- Query Planning & Execution: plan selection and execution mapping.
- Cryptography: encryption/KDF details and rationale.
- WAL & Crash Resilience: transaction protocol and recovery validation.
- Durability Matrix: crash-at-each-step outcomes.
Suggested Rebuild Path (for implementers)
If you are implementing an embedded RDBMS with an LLM agent, use this order:
- Implement fixed-size pages and a pager (`storage/page.rs`, `storage/pager/mod.rs`).
- Implement B+tree on top of slotted pages (`btree/node.rs`, `btree/ops/mod.rs`).
- Add SQL parser/planner/executor (`sql/parser/*`, `sql/planner.rs`, `sql/executor/*`).
- Add WAL append + replay (`wal/writer.rs`, `wal/reader.rs`, `wal/recovery.rs`).
- Add lock manager for thread/process safety (`concurrency/mod.rs`).
- Add encryption suite and key derivation (`crypto/*`).
Each chapter in this internals section maps directly to those steps.
Files, WAL, and Locking
This chapter defines the on-disk files and lock behavior.
At a Glance
For database path <db_path>, MuroDB uses three files:
<db_path> (main state) + <db_path>.wal (durability log) + <db_path>.lock (OS advisory lock target)
High-level flow:
- Append commit-intent records to `.wal` and `wal.sync()`. Commit is durable at this point.
- Main DB file is checkpointed/flushed later.
- `.lock` coordinates readers/writers across processes during API calls.
The sections below expand each file in this order.
File Set
If you open database path <db_path>, MuroDB uses:
- `<db_path>`: main database file (header + pages)
- `<db_path>.wal`: write-ahead log
- `<db_path>.lock`: lock file for cross-process coordination
Example:
- if `<db_path> = mydata`, files are `mydata`, `mydata.wal`, `mydata.lock`
- if `<db_path> = mydb.db`, files are `mydb.db`, `mydb.db.wal`, `mydb.db.lock`
Main DB File Layout
The main file starts with a 76-byte plaintext header (src/storage/pager/mod.rs):
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | Magic "MURODB01" |
| 8 | 4 | Format version (u32, current 4) |
| 12 | 16 | KDF salt |
| 28 | 8 | Catalog root page id |
| 36 | 8 | Page count |
| 44 | 8 | Epoch |
| 52 | 8 | Freelist root page id |
| 60 | 8 | Next transaction id |
| 68 | 4 | Encryption suite id |
| 72 | 4 | CRC32 over bytes 0..72 |
Immediately after the header, pages are stored sequentially. For page-level internals, see Storage.
.wal File Role
.wal stores commit-intent records before data-file flush.
Binary framing details are in WAL & Crash Resilience.
Durability boundary:
- commit is considered durable after `wal.sync()` succeeds.
- data-file flush may happen after that; failures become `CommitInDoubt`.
.lock File Semantics
<db_path>.lock is created by LockManager::new (src/concurrency/mod.rs).
- It is not a structured metadata file.
- Its payload is not interpreted by MuroDB.
- It exists as a stable file descriptor target for advisory file locks (`fs4`).
Lock Granularity
Locking has two layers:
- In-process: `parking_lot::RwLock<()>`
- Cross-process: `fs4` shared/exclusive lock on `.lock`
API behavior:
- `Database::query(...)` acquires a shared read lock.
- `Database::execute(...)` acquires an exclusive write lock.
- `Database::query(...)` is a `&mut self` API because read execution may refresh pager/catalog metadata from disk before running.
- For multiple concurrent readers within one process, use separate read-only handles (for example `Database::open_reader()`).
Important granularity note:
- Locks are acquired per API call, not globally for session lifetime.
- During explicit transactions (`BEGIN ... COMMIT`), each statement still enters through `execute(...)` and takes the write lock for that call.
Visibility Refresh
When no explicit transaction is active, session execution calls pager.refresh_from_disk_if_changed() and reloads catalog metadata when header fields changed.
This is how a process observes committed changes from other processes.
Why this split (main file + .wal + .lock)?
- main DB file: stable state and efficient reads.
- `.wal`: sequential append for crash-safe commit protocol.
- `.lock`: avoids embedding lock bytes into data format and delegates arbitration to OS advisory locks.
Why .lock is separated from the main DB file:
- Keep data format clean: lock state is operational state, not database state.
- Avoid extra data-file churn: lock acquire/release does not force DB header/page writes.
- Better crash behavior: lock lifetime is tied to OS file-lock semantics; stale lock bytes do not need cleanup from the DB payload.
- Portability and tooling: advisory locking APIs (`flock`/`fcntl`-style via `fs4`) naturally target a lock file descriptor.
- Lower format coupling: lock strategy can evolve without changing the on-disk table/page format.
Storage
Data File Structure (At a Glance)
The main .db file is:
[plaintext file header (76B)] [page 0 on disk] [page 1 on disk] [page 2 on disk] ...
- File header is always plaintext and fixed-size.
- Each page is a logical 4096-byte page (`PAGE_SIZE`) stored at `offset = 76 + page_id * page_size_on_disk`
- `page_size_on_disk` is:
  - plaintext mode: `4096`
  - encrypted mode: `12 (nonce) + 4096 (ciphertext) + 16 (tag) = 4124`
This chapter first explains the file header, then the page layout, then special page formats (freelist).
Data File Header
Main DB file header (src/storage/pager/mod.rs) is 76 bytes:
0..8 Magic "MURODB01"
8..12 Format version (u32 LE)
12..28 Salt (16B, Argon2 input)
28..36 Catalog root page ID (u64 LE)
36..44 Page count (u64 LE)
44..52 Epoch (u64 LE)
52..60 Freelist page ID (u64 LE, 0 = none)
60..68 Next TxId (u64 LE)
68..72 Encryption suite ID (u32 LE)
72..76 CRC32 over bytes 0..72
- `freelist_page_id` persists the freelist root across restarts.
- CRC32 protects header integrity before any page decryption.
- This header exists once per file; everything after this is page data.
- `catalog_root` points to the system catalog B-tree root (format in Catalog Format).
See Files, WAL, and Locking for .db / .wal / .lock lifecycle.
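Putting the 76-byte header and the per-mode on-disk page sizes together, the byte position of any page follows a simple rule. A minimal sketch (constant and function names are illustrative, not taken from the MuroDB source):

```rust
const HEADER_SIZE: u64 = 76;
const PAGE_SIZE: u64 = 4096;

// On-disk size of one page: plaintext pages are stored as-is; encrypted
// pages add a 12-byte nonce and a 16-byte auth tag around the ciphertext.
fn page_size_on_disk(encrypted: bool) -> u64 {
    if encrypted { 12 + PAGE_SIZE + 16 } else { PAGE_SIZE }
}

// Pages are laid out sequentially right after the plaintext header.
fn page_offset(page_id: u64, encrypted: bool) -> u64 {
    HEADER_SIZE + page_id * page_size_on_disk(encrypted)
}

fn main() {
    assert_eq!(page_offset(0, false), 76);
    assert_eq!(page_offset(2, true), 76 + 2 * 4124);
}
```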
Generic Page Layout
- Page size: 4096 bytes (`PAGE_SIZE`)
- Page header: 14 bytes (`src/storage/page.rs`)
- Cell pointer: 2 bytes per cell
- Cell payload: `[len:u16][payload bytes]`
- Cache: LRU page cache (default 256 pages)
Slotted-page structure:
[header(14)] [cell pointer array] [free space] [cell bodies (from tail)]
This slotted layout is generic; B+tree node format is layered on top of it.
See B-tree for node/header cell conventions.
Encryption
Encrypted mode stores each page as:
nonce(12) || ciphertext || tag(16)
- Algorithm: AES-256-GCM-SIV (nonce-misuse resistant AEAD)
- KDF: Argon2 derives the master key from the user’s passphrase + random salt
- AAD binding: `(page_id, epoch)` prevents page-swap/misbinding attacks
See Cryptography for rationale and full details.
Freelist
Freed pages are tracked in a freelist for reuse.
Freelist In-Memory Model
Implementation (`src/storage/freelist.rs`) uses `Vec<PageId>`:
- `allocate()` pops from tail (LIFO reuse)
- `free(page_id)` pushes if not already present
- duplicate `free` is treated as double-free and rejected
- `undo_last_free()` exists for speculative commit-time calculations
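The Vec-backed behavior above can be sketched in a few lines (struct and internals are illustrative, not the MuroDB source):

```rust
// Minimal sketch of a Vec-backed freelist with LIFO reuse and
// double-free rejection, mirroring the behavior described above.
struct Freelist {
    free: Vec<u64>,
}

impl Freelist {
    fn allocate(&mut self) -> Option<u64> {
        self.free.pop() // LIFO: most recently freed page is reused first
    }

    fn free(&mut self, page_id: u64) -> Result<(), &'static str> {
        if self.free.contains(&page_id) {
            return Err("double free"); // duplicate free is rejected
        }
        self.free.push(page_id);
        Ok(())
    }

    fn undo_last_free(&mut self) -> Option<u64> {
        self.free.pop() // supports speculative commit-time calculations
    }
}

fn main() {
    let mut fl = Freelist { free: vec![] };
    fl.free(5).unwrap();
    fl.free(9).unwrap();
    assert!(fl.free(9).is_err());       // double-free rejected
    assert_eq!(fl.allocate(), Some(9)); // LIFO reuse
    assert_eq!(fl.undo_last_free(), Some(5));
}
```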
Freelist On-Disk Format
Freelist is stored in normal data pages, linked as a chain.
Per freelist page data area (after the generic 14-byte page header):
[magic "FLMP":4][next_page_id:u64][count:u64][entries:u64...]
Facts:
- `ENTRIES_PER_FREELIST_PAGE = 507` for 4096-byte pages with a 14-byte page header
- `next_page_id = 0` marks chain end
- DB header field `freelist_page_id` points to chain head
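The 507 figure follows directly from the layout above; the arithmetic can be sketched as:

```rust
// Deriving ENTRIES_PER_FREELIST_PAGE from the on-disk layout described
// above (constant names are illustrative).
const PAGE_SIZE: usize = 4096;
const PAGE_HEADER: usize = 14;
const FLMP_HEADER: usize = 4 + 8 + 8; // magic "FLMP" + next_page_id + count

fn entries_per_freelist_page() -> usize {
    // Remaining data area divided by 8 bytes per u64 entry (rounded down).
    (PAGE_SIZE - PAGE_HEADER - FLMP_HEADER) / 8
}

fn main() {
    assert_eq!(entries_per_freelist_page(), 507);
}
```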
Commit-Time Freelist Handling
During Transaction::commit (src/tx/transaction.rs):
- Build a speculative freelist snapshot (without permanent mutation).
- Determine how many freelist pages are needed.
- Reuse existing freelist head page when possible, allocate more page IDs if needed.
- Serialize freelist pages and emit them as WAL `PagePut`.
- Emit `MetaUpdate` with the new `freelist_page_id`.
- After `wal.sync()` succeeds, apply freed pages to the in-memory freelist.
This ordering avoids freelist state leaks when commit fails before WAL durability.
Open-Time Freelist Loading and Sanitize
At open/refresh (Pager::reload_freelist_from_disk):
- Read freelist chain from `freelist_page_id`.
- For multi-page chain, detect cycles and out-of-range next pointers.
- Deserialize entries.
- Run `sanitize(page_count)` to remove:
  - out-of-range entries (`pid >= page_count`)
  - duplicate entries
Sanitization results are exposed as diagnostics and warning counters.
Catalog Format
This page documents how schema metadata is persisted in the system catalog.
Where Catalog Metadata Lives
- The system catalog itself is a B-tree (`src/schema/catalog.rs`).
- The root page id of that B-tree is stored in the main DB header field `catalog_root`.
- Catalog keys are UTF-8 bytes; values are binary-serialized structs.
Key namespace:
- `table:<table_name>` -> serialized `TableDef`
- `index:<index_name>` -> serialized `IndexDef`
TableDef Value Format
TableDef::serialize / deserialize in src/schema/catalog.rs.
Layout (length-prefixed, little-endian integers):
- `name_len: u16`
- `name: [u8; name_len]` (UTF-8)
- `column_count: u16`
- Repeated `column_count` times:
  - `col_blob_len: u16`
  - `col_blob: [u8; col_blob_len]` (`ColumnDef` payload)
- `pk_tag: u8`
- Primary-key names by `pk_tag`:
  - `0`: no PK names
  - `1`: single-column PK: `pk_len: u16` + `pk_name: [u8; pk_len]`
  - `2`: composite PK: `pk_count: u16`, then repeated `pk_count` times: `pk_len: u16` + `pk_name`
- `data_btree_root: u64`
- `next_rowid: i64` (optional tail; defaults to `0` if absent)
- `row_format_version: u8` (optional tail; defaults to `0` if absent)
- `stats_row_count: u64` (optional tail; defaults to `0` if absent)
Unknown pk_tag causes decode failure.
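The length-prefixed, little-endian style used throughout the catalog can be sketched with just the leading name field (helper names are ours, not the `TableDef` codec itself):

```rust
// Illustrative length-prefixed encoding in the catalog's style:
// a u16 little-endian length followed by UTF-8 bytes.
fn encode_name(name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(name.len() as u16).to_le_bytes()); // name_len: u16
    out.extend_from_slice(name.as_bytes());                    // name bytes
    out
}

// Returns the decoded name and the remaining bytes, or None on a
// truncated/corrupt payload (mirroring the catalog's fail-to-None policy).
fn decode_name(buf: &[u8]) -> Option<(&str, &[u8])> {
    let len = u16::from_le_bytes([*buf.first()?, *buf.get(1)?]) as usize;
    let bytes = buf.get(2..2 + len)?; // truncated payload -> None
    let rest = &buf[2 + len..];
    std::str::from_utf8(bytes).ok().map(|s| (s, rest))
}

fn main() {
    let buf = encode_name("users");
    let (name, rest) = decode_name(&buf).unwrap();
    assert_eq!(name, "users");
    assert!(rest.is_empty());
    assert!(decode_name(&buf[..3]).is_none()); // truncated -> decode failure
}
```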
ColumnDef Embedded Blob Format
ColumnDef::serialize / deserialize in src/schema/column.rs.
Layout:
- `name_len: u16`
- `name: [u8; name_len]` (UTF-8)
- `type_byte: u8`
- `flags: u8`
- Optional `size: u32` only when type is `VARCHAR`/`VARBINARY`
- `default_tag: u8` + optional default payload
- `check_len: u16` + optional check expression bytes
type_byte mapping:
- `1` BIGINT
- `2` VARCHAR
- `3` VARBINARY
- `4` TINYINT
- `5` SMALLINT
- `6` INT
- `7` TEXT
- `8` FLOAT
- `9` DOUBLE
- `10` DATE
- `11` DATETIME
- `12` TIMESTAMP
Flag bits:
- `0x01` primary key
- `0x02` unique
- `0x04` nullable
- `0x08` hidden
- `0x10` auto_increment
default_tag mapping:
- `0` no default
- `1` NULL
- `2` integer (i64)
- `3` string (`u16` length + bytes)
- `4` float (f64)
Unknown type_byte or default_tag causes decode failure.
IndexDef Value Format
IndexDef::serialize / deserialize in src/schema/index.rs.
Layout:
- `name_len: u16` + `name`
- `table_len: u16` + `table_name`
- `first_col_len: u16` + `first_column_name` (legacy position)
- `index_type: u8` (`1` BTree, `2` Fulltext)
- `is_unique: u8` (0/1)
- `btree_root: u64`
- `extra_col_count: u16`
- Repeated extra columns: `col_len: u16` + `col_name`
- `stats_distinct_keys: u64` (optional tail; default `0`)
- Numeric-bounds extension (optional):
  - `stats_num_bounds_known: u8`
  - `stats_num_min: i64`
  - `stats_num_max: i64`
- FULLTEXT stop-filter extension (optional):
  - `fts_stop_filter: u8`
  - `fts_stop_df_ratio_ppm: u32`
- Histogram extension (optional):
  - `hist_bin_count: u16`
  - repeated `hist_bin_count` times: `u32`
Unknown index_type causes decode failure.
Compatibility Policy in Code
Current decode strategy is append-only/tolerant:
- New fields are generally appended at the tail.
- Older records are accepted by defaulting missing tail fields.
- Some old layouts are explicitly recognized (for example `IndexDef` stats tails).
- Truncated/corrupt payloads fail decode (`None`) or ignore incomplete optional tails (histogram extension in `IndexDef`).
Executable Spec (Tests)
Primary roundtrip tests:
- `src/schema/catalog.rs` (`test_table_def_roundtrip`)
- `src/schema/column.rs` (`test_column_roundtrip_all_types`)
- `src/schema/index.rs` (`test_composite_index_roundtrip`)
Backward-compat/malformed behavior:
- `src/schema/index.rs` (`test_deserialize_old_layout_keeps_fts_settings`)
- `src/schema/index.rs` (`test_deserialize_truncated_index_returns_none`)
Foreign Keys
This page describes how foreign-key metadata and enforcement are implemented in murodb.
Source Layout
- Metadata types and serialization: `src/schema/catalog.rs`
- DDL validation (`CREATE TABLE`): `src/sql/executor/ddl.rs`
- ALTER integration (`ADD`/`DROP FOREIGN KEY`, column guards): `src/sql/executor/alter.rs`
- Runtime enforcement core: `src/sql/executor/foreign_key.rs`
- DML integration:
  - `INSERT`/`REPLACE`/`ON DUPLICATE KEY UPDATE`: `src/sql/executor/insert.rs`
  - `UPDATE`/`DELETE`: `src/sql/executor/mutation.rs`
- Introspection output: `src/sql/executor/show.rs`
Metadata Model
Each table stores outgoing FKs in TableDef.foreign_keys:
- child columns: `columns`
- parent table: `ref_table`
- parent columns: `ref_columns`
- actions: `on_delete`, `on_update` (`RESTRICT`, `CASCADE`, `SET NULL`)
Metadata is persisted in TableDef::serialize / TableDef::deserialize as an optional tail.
The current format uses FK_LAYOUT_V2_TAG and stores both actions explicitly.
DDL and ALTER Rules
CREATE TABLE and ALTER TABLE ... ADD FOREIGN KEY validate:
- referenced table exists
- child and parent column counts match
- referenced parent columns exist
- child/parent types are compatible
- existing rows in child table satisfy the new constraint
ALTER TABLE also protects FK dependencies:
- cannot drop/modify/change a column used by local FK definitions
- cannot drop parent-side columns referenced by incoming FKs (including self-reference)
- `DROP FOREIGN KEY (cols)` is rejected when ambiguous
Runtime Enforcement
Child-side check (outgoing FK)
enforce_child_foreign_keys(...) validates that each non-NULL child key has a matching parent row.
Any NULL in FK columns skips the check (SQL nullable FK behavior).
Parent-side check (incoming FK)
- `enforce_parent_restrict_on_delete(...)`
- `enforce_parent_restrict_on_update(...)`
These locate incoming references by scanning catalog tables and selecting FKs where ref_table matches the parent table.
For RESTRICT, they fail the statement.
For CASCADE / SET NULL, they mutate child rows through helper paths in foreign_key.rs.
Ordering and Atomicity Safeguards
The implementation uses ordering rules to avoid partial side effects on failed statements:
- delete path validates the full delete set first, then applies pending cascade/set-null actions
- update path validates incoming `RESTRICT` before applying pending child updates
- mutation paths run local uniqueness/outgoing-FK checks before triggering parent-side cascades
- `REPLACE` pre-validates conflict-delete effects needed for self-referential FK safety
Recursion and Cycles
Cascade logic tracks visited (table, pk) entries to break recursive loops in cyclic FK graphs.
This is used for both delete and update cascades.
Introspection
- `SHOW CREATE TABLE` renders FK clauses including `ON DELETE` and `ON UPDATE`.
- `DESCRIBE` emits an FK row with action details in `Extra`: `ON DELETE <action> ON UPDATE <action>`
Current Cost Characteristics
FK checks currently rely on scans in several places:
- parent existence checks scan parent rows
- incoming-reference discovery scans table metadata (`list_tables`)
- child-reference matching scans child table rows
This is correct but not index-accelerated yet; heavy FK workloads may pay O(table size) costs.
B-tree
Overview
MuroDB uses one B+tree implementation (src/btree/*) for:
- table primary data (clustered by primary key)
- secondary B-tree indexes
- some internal metadata trees
Core handle is BTree { root_page_id } in src/btree/ops/mod.rs.
Is B+tree Stored Inside Pages?
Yes. A B+tree node is encoded directly inside one slotted page (src/storage/page.rs + src/btree/node.rs).
Page structure:
- Slotted-page header/pointer area (generic storage layer)
- Cell payloads (B+tree node header + entries)
So, “page” is storage primitive, and B+tree node format is layered on top of it.
Node Layout on Page
src/btree/node.rs defines:
- Cell `0` is always a node header.
- Leaf header payload: `[node_type=1]`
- Internal header payload: `[node_type=2][right_child: u64]`
Leaf entry cell:
[key_len: u16][key bytes][value bytes]
Internal entry cell:
[left_child: u64][key_len: u16][key bytes]
Internal nodes store N separator keys and N+1 child pointers:
- `left_child` in each entry (N pointers)
- `right_child` in header (last pointer)
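The leaf and internal entry cells above can be sketched as simple encoders (helper names are ours; the `u16` length is assumed little-endian, matching the catalog's integer convention):

```rust
// Illustrative encoders for the entry-cell layouts described above.
// Leaf entry: [key_len: u16][key bytes][value bytes]
fn encode_leaf_entry(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut cell = Vec::new();
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes()); // key_len: u16
    cell.extend_from_slice(key);
    cell.extend_from_slice(value); // value runs to the end of the cell
    cell
}

// Internal entry: [left_child: u64][key_len: u16][key bytes]
fn encode_internal_entry(left_child: u64, key: &[u8]) -> Vec<u8> {
    let mut cell = Vec::new();
    cell.extend_from_slice(&left_child.to_le_bytes());         // left_child: u64
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes()); // key_len: u16
    cell.extend_from_slice(key);
    cell
}

fn main() {
    let leaf = encode_leaf_entry(b"k1", b"row-bytes");
    assert_eq!(&leaf[..2], &2u16.to_le_bytes());
    assert_eq!(&leaf[2..4], b"k1");

    let internal = encode_internal_entry(42, b"sep");
    assert_eq!(internal.len(), 8 + 2 + 3);
}
```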
Key/Value Semantics by Tree Type
Primary data tree
- key: encoded primary key bytes
- value: serialized row bytes
Unique secondary index
- key: encoded indexed column(s)
- value: primary key bytes
Non-unique secondary index
- key: `index_key || primary_key` (appended PK disambiguates duplicates)
- value: primary key bytes
That encoding is implemented in src/sql/executor/indexing.rs.
Key Encoding Rules
Order-preserving encoding is in src/btree/key_encoding.rs.
- signed integers: sign-bit flip + big-endian
- float/double: order-preserving bit transform
- composite keys: null marker + per-column encoding
- variable-length components: byte-stuffed terminator scheme
Because encoded bytes preserve logical order, tree comparison is plain lexicographic byte compare.
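The sign-flip + big-endian trick for signed integers is small enough to sketch directly (the encoder below is illustrative, not `key_encoding.rs` itself): flipping the sign bit maps the signed ordering onto the unsigned ordering, and big-endian bytes make byte-wise comparison match numeric comparison.

```rust
// Order-preserving encoding for i64: flip the sign bit, then emit
// big-endian bytes so lexicographic byte order equals numeric order.
fn encode_i64(v: i64) -> [u8; 8] {
    ((v as u64) ^ (1u64 << 63)).to_be_bytes()
}

fn main() {
    let vals = [-5i64, -1, 0, 1, 100];
    let encoded: Vec<[u8; 8]> = vals.iter().map(|&v| encode_i64(v)).collect();
    // Byte-wise comparison of encodings agrees with numeric order.
    for w in encoded.windows(2) {
        assert!(w[0] < w[1]);
    }
}
```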
Search and Scan Behavior
Point lookup
BTree::search walks internal nodes with separator comparison (find_child) until leaf, then linear-searches leaf cells.
Full scan
There are no leaf sibling links. `BTree::scan` performs recursive in-order traversal from the root:
- visit each internal left subtree in key order
- then rightmost subtree
Range scan (>= start_key)
BTree::scan_from prunes early subtrees then falls back to in-order traversal for remaining branches.
Insert Path
BTree::insert behavior:
- Descend to target leaf.
- Rebuild leaf page with new/updated cell in sorted position.
- If overflow, split node and return median separator upward.
- Parent inserts new separator; parent may split recursively.
- If root splits, allocate new internal root.
What Happens If It Does Not Fit in One Page?
Two different cases:
- Tree growth case (many entries):
- normal behavior is page split (leaf/internal), with separator propagation up to root.
- this is fully supported.
- Single-cell-too-large case (one key/value entry itself is huge):
- value-only overflow pages store large values that exceed page capacity.
- keys remain inline (max ~4,071 bytes); values spill to overflow page chains.
Overflow Pages
When a leaf cell (key + value) exceeds ~4,073 bytes, the value is stored in an overflow page chain.
Overflow cell format
Normal leaf cell: [key_len: u16][key][value]
Overflow leaf cell: [key_len|0x8000: u16][key][total_value_len: u32][first_overflow_page: u64]
The high bit of key_len (0x8000) signals an overflow cell. All value data is stored in the overflow chain; the leaf cell contains only the key and metadata pointer.
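Decoding that flag is a two-line bit operation; a sketch (helper names are illustrative):

```rust
// The 0x8000 high bit of the key_len field marks an overflow cell,
// as described above; the low 15 bits carry the real key length.
const OVERFLOW_FLAG: u16 = 0x8000;

fn is_overflow_cell(key_len_field: u16) -> bool {
    key_len_field & OVERFLOW_FLAG != 0
}

fn real_key_len(key_len_field: u16) -> u16 {
    key_len_field & !OVERFLOW_FLAG
}

fn main() {
    let inline = 120u16;
    let overflow = 120u16 | OVERFLOW_FLAG;
    assert!(!is_overflow_cell(inline));
    assert!(is_overflow_cell(overflow));
    assert_eq!(real_key_len(overflow), 120);
}
```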
Overflow page layout
Overflow pages use a simple linked-list format (not slotted pages):
[page_id: u64] bytes 0..8 (standard page header)
[0xFF marker: u8] byte 8 (distinguishes from B-tree nodes)
[next_page: u64] bytes 9..17 (next page in chain, u64::MAX = end)
[chunk_len: u16] bytes 17..19 (length of data in this page)
[chunk data] bytes 19.. (up to 4,077 bytes per page)
Operations
- Insert: if `needs_overflow(key, value)`, write the value to an overflow chain and store the chain head pointer in the leaf cell.
- Search/Scan: on an overflow cell, call `read_overflow_chain` to reconstruct the full value.
- Delete: free all overflow pages before removing the leaf cell.
- Update: free old overflow chain (if any), then create new cell (inline or overflow).
- Split/Merge: work with raw cell bytes to preserve overflow pointers without touching overflow data.
- collect_all_pages: includes overflow page IDs for each overflow cell.
Implementation
- `src/storage/overflow.rs`: `write_overflow_chain`, `read_overflow_chain`, `free_overflow_chain`, `collect_overflow_pages`
- `src/btree/node.rs`: `needs_overflow`, `encode_overflow_leaf_cell`, `is_overflow_cell`, `decode_overflow_metadata`
- `src/btree/ops/mod.rs`: overflow-aware insert/search/scan/delete/split/merge
Delete/Rebalance Path
BTree::delete removes target entry and handles underflow:
- if leaf underfull, attempt merge/rebalance with sibling
- if root internal ends with zero entries, collapse root to its only child
Current rebalance path focuses on practical leaf merges; behavior is intentionally conservative.
Practical Mental Model
If you are rebuilding this design:
- Implement slotted page first.
- Reserve cell `0` as node metadata.
- Keep key bytes order-preserving so comparison is simple.
- Start with split-only insert and basic delete, then add rebalance.
- Keep scan correctness independent of leaf links (recursive in-order works immediately).
Query Planning & Execution
This chapter explains how SQL predicates are converted into access paths and then executed.
Pipeline
sql/parser produces AST (Statement / Select), then:
- `plan_select(...)` in `src/sql/planner.rs` chooses a `Plan`.
- Executor modules (`src/sql/executor/select_query.rs`, `src/sql/executor/mutation.rs`) dispatch by `Plan`.
- B+tree/index scans are performed via `BTree::search`, `scan`, `scan_from`.
Plan Types
Plan currently has these variants:
- `PkSeek`: full primary-key equality (single or composite).
- `IndexSeek`: equality lookup on a B-tree secondary index.
- `IndexRangeSeek`: bounded/ranged lookup on index prefix + next column range.
- `FtsScan`: full-text path using `MATCH ... AGAINST`.
- `FullScan`: fallback table scan.
Candidate Extraction from WHERE
Planner heuristics extract:
- equalities (`col = expr`)
- numeric ranges (`<`, `<=`, `>`, `>=`, `BETWEEN`)
- full-text predicates (`MATCH(...) AGAINST(...)`)
Selection order:
- If an FTS predicate is present, choose `FtsScan`.
- If all PK columns are equality-constrained, choose `PkSeek`.
- Otherwise evaluate index candidates and pick the minimum cost.
- If none matches, use `FullScan`.
Cost Model (Deterministic Heuristic)
plan_cost_hint_with_stats uses a stable heuristic (smaller is better):
- `PkSeek`: `100 + est_rows`
- `IndexSeek`: `1500 - 300*key_parts + 3*est_rows`
- `IndexRangeSeek`: `1400 - 250*prefix_parts - 250*bound_terms + 3*est_rows`
- `FtsScan`: `2000 + 2*est_rows`
- `FullScan`: `3000 + 5*est_rows`
Tie-break uses a stable string key, so identical inputs keep deterministic plans.
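The formulas above transcribe directly into code; the enum and function shapes below are illustrative, not the actual planner types:

```rust
// Direct transcription of the heuristic cost formulas (smaller wins).
enum Plan {
    PkSeek { est_rows: i64 },
    IndexSeek { key_parts: i64, est_rows: i64 },
    IndexRangeSeek { prefix_parts: i64, bound_terms: i64, est_rows: i64 },
    FtsScan { est_rows: i64 },
    FullScan { est_rows: i64 },
}

fn cost(plan: &Plan) -> i64 {
    match plan {
        Plan::PkSeek { est_rows } => 100 + est_rows,
        Plan::IndexSeek { key_parts, est_rows } => 1500 - 300 * key_parts + 3 * est_rows,
        Plan::IndexRangeSeek { prefix_parts, bound_terms, est_rows } => {
            1400 - 250 * prefix_parts - 250 * bound_terms + 3 * est_rows
        }
        Plan::FtsScan { est_rows } => 2000 + 2 * est_rows,
        Plan::FullScan { est_rows } => 3000 + 5 * est_rows,
    }
}

fn main() {
    // A PK seek on 1 row beats a full scan of 1,000 rows.
    assert!(cost(&Plan::PkSeek { est_rows: 1 }) < cost(&Plan::FullScan { est_rows: 1000 }));
}
```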
Row Estimation Inputs
Estimator uses:
- table row count (`TableDef.stats_row_count`)
- index distinct count and optional numeric histogram (`IndexDef` stats)
- fallback defaults when stats are missing
ANALYZE TABLE persists these stats and improves plan quality.
Plan-to-Executor Mapping
Main dispatch happens in src/sql/executor/select_query.rs:
- `PkSeek`: encode PK bytes and do one data B-tree lookup.
- `IndexSeek`: encode index key, fetch matching PKs from the index B-tree, then fetch rows from the data B-tree.
- `IndexRangeSeek`: range-scan index keys, then fetch rows by PK.
- `FtsScan`: evaluate FTS postings and scoring, then materialize matching rows.
- `FullScan`: iterate the data B-tree and filter with WHERE.
For UPDATE / DELETE, planner is reused, then matching PKs are collected before mutation to avoid in-place scan mutation hazards.
JOIN Strategy
Join execution is currently nested loop (src/sql/executor/select_join.rs).
For INNER / CROSS, loop order is chosen from estimated cardinality:
- smaller side tends to be the outer loop (`choose_nested_loop_order`).
EXPLAIN includes join-loop notes in Extra.
EXPLAIN Mapping
src/sql/executor/select_meta.rs maps plan to EXPLAIN fields:
- access `type`: `const`, `ref`, `range`, `fulltext`, `ALL`
- `key`: `PRIMARY` or chosen index name
- `rows`: estimated rows
- `cost`: heuristic planner cost
- `Extra`: e.g. `Using where`, `Using index`, `Using fulltext`
This is a planner/debug aid, not a precise runtime profiler.
DDL Execution: ALTER TABLE
ALTER TABLE is executed in src/sql/executor/alter.rs and is mostly planner-independent.
Operation Dispatch
exec_alter_table(...) dispatches by AST operation:
- `ADD COLUMN`
- `DROP COLUMN`
- `MODIFY COLUMN`
- `CHANGE COLUMN` (rename + optional type/constraint change)
Fast Path vs Rewrite Path
Implementation uses two paths:
- metadata-only (catalog update only): no row rewrite
- full rewrite (scan + rebuild data B-tree): required when row bytes must change
Rules in current code:
- `ADD COLUMN` is metadata-only.
- `DROP COLUMN` always rewrites all rows.
- `MODIFY`/`CHANGE` rewrites only when the column type changes.
- `MODIFY`/`CHANGE` without a type change is metadata-only.
Safety Checks and Validation
Before applying metadata/rewrite:
- adding `PRIMARY KEY` via `ADD COLUMN` is rejected
- dropping a PK column is rejected
- dropping a column referenced by any index is rejected
- adding `NOT NULL` checks existing rows for `NULL` and fails if found
- `ADD COLUMN ... NOT NULL` without `DEFAULT` fails on non-empty tables
Rewrite Algorithm (when triggered)
Rewrite path is:
- Scan old data B-tree and decode each row.
- Transform row shape/value (drop column or type coercion).
- Collect all old data-tree page IDs and free them.
- Create a new data B-tree root and reinsert transformed rows.
- Update `TableDef.data_btree_root` and persist catalog metadata.
Rewritten rows are stored in row format v1.
Unique Index Reconciliation
After MODIFY / CHANGE, reconcile_unique_index(...) adjusts single-column unique index state:
- add `UNIQUE`: validate duplicates first, then create an automatic unique index
- remove `UNIQUE`: drop the corresponding unique index and free its pages
For CHANGE COLUMN, index metadata that references the old column name is renamed to the new name.
Cryptography
MuroDB encrypts all data at rest. This chapter explains the full encryption pipeline — from a user’s password to bytes on disk — and the rationale behind each design choice.
Overview
User's passphrase
│
↓ Argon2id (memory-hard KDF) + salt
256-bit MasterKey (zeroized on drop)
│
├──→ Page encryption (AES-256-GCM-SIV, AAD = page_id || epoch)
└──→ WAL encryption (AES-256-GCM-SIV, AAD = lsn || 0)
FTS term blinding (HMAC-SHA256, DB-scoped term key)
The encryption system has four layers:
- Key derivation (KDF) — Derive a cryptographic key from the user’s passphrase
- Page encryption — Encrypt and authenticate each data page
- WAL encryption — Encrypt write-ahead log records
- FTS term blinding — Hide full-text search tokens on disk (separate from page/WAL encryption; see below)
Encryption Suites
Defined in src/crypto/suite.rs:
| Suite ID | Name | Purpose |
|---|---|---|
| 1 | aes256-gcm-siv | Default. Production use |
| 0 | plaintext | Testing/development. Explicit opt-in |
The suite ID is stored in plaintext in the DB header (bytes 68..72), allowing MuroDB to determine the encryption mode before the user provides a passphrase.
Key Derivation (KDF)
src/crypto/kdf.rs
Flow
passphrase (arbitrary-length bytes) + salt (16-byte random)
│
↓ Argon2id
256-bit MasterKey
- Argon2id: A memory-hard KDF. If a database file is stolen, the attacker must spend significant memory and CPU per password guess, making offline brute-force expensive.
- Salt: Generated randomly at database creation, stored in plaintext in the DB header (bytes 12..28). The salt must be readable before the passphrase is provided.
- MasterKey: Implements `ZeroizeOnDrop`; key material is securely erased from memory when dropped.
Why Argon2id?
Argon2id won the Password Hashing Competition and is the variant recommended by RFC 9106 and OWASP for password hashing. Its memory-hard design provides strong resistance against GPU/ASIC-accelerated brute-force attacks, outperforming bcrypt and PBKDF2 in this regard.
Page Encryption
src/crypto/aead.rs
On-Disk Format
Each 4096-byte page is individually encrypted and stored as:
nonce (12B) || ciphertext (4096B) || auth tag (16B)
──────────────────────────────────────────────────
total: 4124B
- Nonce: 12 bytes, randomly generated for each encryption operation
- Authentication tag: 16 bytes, detects tampering of both ciphertext and AAD
Why AES-256-GCM-SIV?
MuroDB uses AES-256-GCM-SIV rather than standard AES-GCM.
With standard AES-GCM, reusing a nonce completely breaks authentication. AES-GCM-SIV, on the other hand, degrades gracefully under nonce reuse — it only leaks whether two plaintexts are identical, without compromising authentication. For a storage engine where nonce management bugs can have catastrophic consequences, this nonce-misuse resistance provides a significantly safer failure envelope.
AAD (Additional Authenticated Data)
fn build_aad(page_id: u64, epoch: u64) -> [u8; 16] {
    let mut aad = [0u8; 16];
    aad[0..8].copy_from_slice(&page_id.to_le_bytes());  // 8 bytes: page identity
    aad[8..16].copy_from_slice(&epoch.to_le_bytes());   // 8 bytes: key-rotation generation
    aad
}
AAD is data that is not encrypted but is included in the authentication tag computation. If the AAD provided during decryption does not match what was used during encryption, authentication fails and decryption is rejected.
By including page_id and epoch in the AAD, MuroDB detects the following attacks:
| Attack | Detected by |
|---|---|
| Page swapping (inserting page 5’s data at page 10’s location) | page_id mismatch |
| Downgrade (injecting a page from an older backup) | epoch mismatch |
Even if the ciphertext itself is untouched, decryption is rejected when the context (which page, which generation) does not match.
WAL Encryption
src/wal/writer.rs
WAL (Write-Ahead Log) frames are encrypted using the same PageCipher.
Frame Format
[frame_len: u32] [encrypted payload]
The payload before encryption is:
record_bytes || CRC32(record_bytes)
AAD Differences from Page Encryption
For WAL frames, the AEAD parameters differ:
- page_id parameter → LSN (Log Sequence Number)
- epoch parameter → always 0
Since LSN increases monotonically within a WAL session, this prevents swapping ciphertext between different WAL frames.
Note: After a checkpoint, the WAL is truncated and LSN resets to 0. This means a new WAL session may reuse the same LSN values as a previous session. This is not a vulnerability because each encryption uses a fresh random nonce, and AES-GCM-SIV’s nonce-misuse resistance provides an additional safety margin.
FTS Term Blinding
src/crypto/hmac_util.rs
Full-text search (FTS) stores bigram tokens in a B-tree index. Storing tokens as plaintext would directly expose search terms on disk. MuroDB blinds tokens with HMAC-SHA256:
term_id = HMAC-SHA256(term_key, "tokyo") → 32-byte hash
- Deterministic: The same token always produces the same term_id, enabling search
- One-way: Recovering the original token from a term_id is computationally infeasible
Only hash values are stored on disk — no plaintext search terms ever reach the storage layer.
Important caveats:
- In encrypted mode, term_key is derived from MasterKey + DB salt, so term IDs are database-scoped and secret-dependent.
- In plaintext mode, term_key is derived from a public label + DB salt. This still hides raw tokens from casual inspection, but does not provide strong secrecy against offline dictionary guessing.
- Legacy databases may still contain historical FTS key metadata. Open-time migration/backfill logic keeps them readable.
Key Rotation (Epoch)
When the user changes their passphrase, MuroDB re-encrypts all pages with the new key. The epoch — a u64 counter stored in the DB header — coordinates this process.
Rotation Flow
1. Derive a new MasterKey from the new passphrase
2. Increment epoch (e.g., 2 → 3)
3. Re-encrypt every page with the new key and new epoch
4. Update salt and epoch in the DB header
Why Full Re-encryption Instead of KEK?
Large-scale databases often use a KEK (Key Encryption Key) architecture: a master key encrypts per-page Data Encryption Keys (DEKs), and rotation only re-wraps the DEKs without touching the data. This avoids the cost of re-encrypting all data.
MuroDB is an embedded database. Typical data sizes range from a few MB to a few hundred MB. Full re-encryption completes in seconds, so the complexity of KEK — an extra DEK management layer, more complex crash recovery, increased attack surface — is not justified.
Full re-encryption also provides a security advantage that KEK lacks: all data is actually protected by the new key. If the old key is compromised, no data encrypted under it remains on disk.
Epoch-Based Attack Detection
Because epoch is included in the AAD, an attacker who extracts a page from an epoch=2 backup and inserts it into an epoch=3 database will trigger an authentication tag verification failure. The page is rejected without revealing any data.
In-Memory Key Protection
- MasterKey implements ZeroizeOnDrop: memory is zeroed on drop
- Key material is process-local (no external KMS/HSM dependency)
Non-Goals
The following are explicitly out of scope for MuroDB’s encryption design:
- Traffic encryption: MuroDB is an embedded database with no network protocol
- Access-pattern hiding: No ORAM or similar oblivious access mechanisms
- HSM/KMS integration: Key material is process-local
WAL & Crash Resilience
MuroDB uses a write-ahead log (.wal) for crash recovery.
All commits are WAL-first: durable intent is recorded before data-file flush.
.wal Binary Layout
WAL constants are in src/wal/mod.rs:
- magic: "MUROWAL1" (8 bytes)
- version: u32 (current: 1)
- header size: 12 bytes
File layout:
- Header: [magic:8][version:4]
- Repeating frames: [frame_len: u32][encrypted_payload: frame_len bytes]
frame_len is bounded by MAX_WAL_FRAME_LEN (PAGE_SIZE + 1024).
Encrypted payload format before encryption (src/wal/writer.rs):
- record_bytes = WalRecord::serialize(...)
- payload = record_bytes || crc32(record_bytes)
Encryption uses PageCipher; frame nonce context is (lsn, 0).
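The pre-encryption framing can be sketched as follows. The CRC-32 variant (IEEE polynomial, bitwise) and the little-endian CRC placement here are illustrative assumptions; the actual implementation may use a crate and a different byte order.

```rust
// Bitwise CRC-32 (IEEE, reflected) — for illustration only.
fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

// payload = record_bytes || crc32(record_bytes), verified on read.
fn frame_payload(record_bytes: &[u8]) -> Vec<u8> {
    let mut payload = record_bytes.to_vec();
    payload.extend_from_slice(&crc32_ieee(record_bytes).to_le_bytes());
    payload
}

fn check_payload(payload: &[u8]) -> Option<&[u8]> {
    let (record, crc) = payload.split_at(payload.len().checked_sub(4)?);
    let stored = u32::from_le_bytes(crc.try_into().ok()?);
    (crc32_ieee(record) == stored).then_some(record)
}

fn main() {
    let payload = frame_payload(b"record");
    assert_eq!(check_payload(&payload), Some(&b"record"[..]));
    // A corrupted byte in the record makes CRC validation fail.
    let mut bad = payload.clone();
    bad[0] ^= 0xFF;
    assert_eq!(check_payload(&bad), None);
}
```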
WAL Record Types
WalRecord (src/wal/record.rs) variants:
| Record | Payload |
|---|---|
| Begin | txid |
| PagePut | txid, page_id, full page image bytes |
| MetaUpdate | txid, catalog_root, page_count, freelist_page_id, epoch |
| Commit | txid, lsn |
| Abort | txid |
Record tags on wire:
1=Begin, 2=PagePut, 3=Commit, 4=Abort, 5=MetaUpdate
Write Path
Read-Only Query Path (Database::query)
Database::query(sql):
- Acquire shared lock
- Parse/validate read-only statement
- Execute directly on pager/catalog (no implicit WAL transaction)
query is a &mut self API because the session may refresh pager/catalog state from disk before read execution.
For concurrent readers in one process, open additional read-only handles (Database::open_reader) and query from each handle.
If an explicit transaction is active, read statements are executed in the transaction context (execute_in_tx) so uncommitted writes remain visible to that session.
Auto-Commit Mode (no explicit BEGIN)
Session::execute_auto_commit(stmt):
- Create implicit transaction + dirty-page buffer.
- Execute statement against transactional page store.
- tx.commit(...) writes:
  - Begin
  - all dirty PagePut records
  - freelist PagePut pages (if needed)
  - MetaUpdate
  - Commit
- wal.sync() (fsync) establishes the durability boundary.
- Flush pages + metadata to the main DB file.
Explicit Transaction (BEGIN … COMMIT)
Explicit transaction (BEGIN ... COMMIT) follows the same commit primitive.
ROLLBACK discards dirty state without WAL append (rollback_no_wal in session path).
Commit Point
Durability commit point is WAL fsync:
- before wal.sync(): commit may be lost on crash
- after wal.sync(): commit must be recoverable even if DB flush fails
If post-sync DB flush fails, transaction returns CommitInDoubt, session is poisoned, and next open recovers from WAL.
Recovery (Database::open)
Database::open(path, master_key)
1. If WAL file exists, run recovery::recover()
→ Scan WAL and validate per-tx state machine
(Begin -> PagePut/MetaUpdate* -> Commit/Abort)
→ Collect latest page images from committed transactions
→ Replay to data file
2. Truncate WAL file (empty it)
→ fsync WAL file
→ best-effort fsync parent directory
3. Build Session with Pager + Catalog + WalWriter
Validation is implemented in src/wal/recovery.rs with explicit skip/error codes.
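The per-transaction check can be sketched as a small state machine. This is an illustration only: the real validator in src/wal/recovery.rs also checks Commit LSN consistency and PagePut page-ID matches, which are omitted here.

```rust
// Per-transaction recovery validation sketch:
//   Begin -> (PagePut | MetaUpdate)* -> Commit | Abort
// Commit additionally requires that a MetaUpdate was seen.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TxState {
    NotStarted,
    Active { has_meta: bool },
    Committed,
    Aborted,
}

#[derive(Clone, Copy)]
enum Rec { Begin, PagePut, MetaUpdate, Commit, Abort }

fn step(state: TxState, rec: Rec) -> Result<TxState, &'static str> {
    use Rec::*;
    use TxState::*;
    match (state, rec) {
        (NotStarted, Begin) => Ok(Active { has_meta: false }),
        (Active { has_meta }, PagePut) => Ok(Active { has_meta }),
        (Active { .. }, MetaUpdate) => Ok(Active { has_meta: true }),
        (Active { has_meta: true }, Commit) => Ok(Committed),
        (Active { has_meta: false }, Commit) => Err("Commit without MetaUpdate"),
        (Active { .. }, Abort) => Ok(Aborted),
        (Active { .. }, Begin) => Err("duplicate Begin"),
        (NotStarted, _) => Err("record before Begin"),
        (Committed, _) | (Aborted, _) => Err("record after terminal state"),
    }
}

fn main() {
    let mut st = TxState::NotStarted;
    for rec in [Rec::Begin, Rec::PagePut, Rec::MetaUpdate, Rec::Commit] {
        st = step(st, rec).expect("valid sequence");
    }
    assert_eq!(st, TxState::Committed);
    // Mirrors test_recovery_rejects_pageput_before_begin.
    assert!(step(TxState::NotStarted, Rec::PagePut).is_err());
}
```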
Recovery Modes
- strict (default): Fails on any WAL protocol violation
- permissive: Skips invalid transactions, recovers only valid committed ones
See Recovery for user-facing documentation.
In permissive mode, if invalid transactions were skipped, WAL can be quarantined (*.quarantine.<ts>.<pid>) before reopening a clean WAL stream.
Inspect-WAL JSON Contract
murodb-wal-inspect --format json returns machine-readable diagnostics with a stable schema contract:
- schema_version: 1 for the current contract
- status: ok / warning / fatal
- exit_code: mirrors CLI exit code semantics (0, 10, 20)
- skipped[].code: stable machine-readable skip classification
- On fatal failures, fatal_error and fatal_error_code are included
Secondary Index Consistency
All index updates happen within the same transaction as the data update:
INSERT
- Insert row into data B-tree
- Insert entry into each secondary index (column_value → PK)
- Check UNIQUE constraint before insertion
DELETE
- Scan for rows to delete (collect PK + all column values)
- Delete entries from each secondary index
- Delete row from data B-tree
UPDATE
- Scan for rows to update (collect PK + old column values)
- Compute new values
- Check UNIQUE constraints (for changed values)
- Update secondary indexes (delete old entry + insert new entry)
- Write new row data to data B-tree
Remaining Constraints
fsync granularity
Pager::write_page_to_disk() does not call sync_all() individually. Only flush_meta() calls sync_all(). WAL sync() guarantees data durability, so this is safe in normal operation.
allocate_page counter
Pager::allocate_page() increments in-memory page_count, which is not persisted until flush_meta() after WAL commit.
WAL file size
After successful commits and explicit ROLLBACK, the Session auto-checkpoints the WAL according to policy. Checkpoint is best-effort and does not affect commit success.
Default policy is per-transaction (MURODB_CHECKPOINT_TX_THRESHOLD=1), and can be tuned with:
- MURODB_CHECKPOINT_TX_THRESHOLD
- MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD
- MURODB_CHECKPOINT_INTERVAL_MS
The same knobs are available as session-scoped SQL runtime options:
- SET checkpoint_tx_threshold = <u64>
- SET checkpoint_wal_bytes_threshold = <u64>
- SET checkpoint_interval_ms = <u64>
When checkpoint truncate fails, MuroDB emits a warning with wal_path and wal_size_bytes so operators can detect and triage WAL growth.
TLA+ Correspondence
See Formal Verification for the TLA+ model and its mapping to implementation.
| TLA+ Intent | Implementation | Regression Test |
|---|---|---|
| Only valid state transitions are recovered | State transition validation in recovery.rs | test_recovery_rejects_pageput_before_begin |
| Commit/Abort is terminal | Reject duplicate terminal / post-terminal records | test_recovery_rejects_duplicate_terminal_record_for_tx |
| Commit has consistent terminal info | Validate Commit.lsn == actual LSN | test_recovery_rejects_commit_lsn_mismatch |
| Commit requires metadata | Reject Commit without MetaUpdate | test_recovery_rejects_commit_without_meta_update |
| PagePut matches target page | Validate PagePut.page_id vs page header | test_recovery_rejects_pageput_page_id_mismatch |
| Tail corruption tolerated, mid-log rejected | Reader tolerates tail only | test_tail_truncation_tolerated, test_mid_log_corruption_is_error |
| Oversized frames handled safely | Frame length limit in Reader/Writer | test_oversized_tail_frame_tolerated |
| Freelist recovered from committed MetaUpdate | freelist_page_id in WAL MetaUpdate | test_freelist_wal_recovery |
Durability Matrix
This page describes the exact guarantees MuroDB provides at each stage of the commit pipeline and the post-crash outcome when a failure occurs at each step.
Commit Pipeline
The commit flow proceeds through these ordered steps:
1. WAL: Begin record
2. WAL: PagePut records (one per dirty page)
3. WAL: PagePut for freelist page
4. WAL: MetaUpdate record (catalog_root, page_count, freelist_page_id)
5. WAL: Commit record
6. WAL: fsync ← COMMIT POINT (durability boundary)
7. Data file: write dirty pages
8. Data file: flush_meta (fsync)
9. WAL: checkpoint_truncate (fsync + directory fsync)
The commit point is step 6 (WAL fsync). Once wal.sync() returns successfully, the transaction is durable. Steps 7-9 are performance optimizations that apply the committed data to the main database file; if they fail, WAL recovery replays the committed transaction on next open.
Fsync Points
| Step | Fsync Target | What It Protects |
|---|---|---|
| WAL sync (step 6) | WAL file | All WAL records for the transaction reach stable storage. This is the commit point. |
| flush_meta (step 8) | Data file | Page data and metadata (catalog_root, page_count, freelist_page_id) are persisted to the main DB file. |
| checkpoint_truncate (step 9) | WAL file + directory | WAL is truncated to header-only. Directory fsync hardens the metadata change. |
Crash-at-Each-Step Outcome Matrix
| Crash Point | WAL State | Committed? | Post-Recovery Outcome |
|---|---|---|---|
| After Begin (step 1) | Begin only | No | Transaction discarded. Prior committed data intact. |
| After PagePut (step 2) | Begin + PagePut(s) | No | Transaction discarded. No pages applied. |
| After freelist PagePut (step 3) | Begin + PagePut(s) + freelist | No | Transaction discarded. Freelist unchanged. |
| After MetaUpdate (step 4) | Begin + PagePut(s) + MetaUpdate | No | Transaction discarded. Metadata unchanged. |
| After Commit record (step 5) | Complete WAL sequence | No | WAL not fsynced; records may not have reached disk. OS may or may not have flushed buffers. If records survived: recovery replays. If not: transaction lost (no durability guarantee without fsync). |
| After WAL sync (step 6) | Complete + fsynced | Yes | Recovery replays committed pages and metadata to data file. |
| After page writes (step 7) | Complete + fsynced | Yes | Some or all pages written. Recovery replays any missing pages idempotently. |
| After flush_meta (step 8) | Complete + fsynced | Yes | Data file fully consistent. WAL replay is idempotent (re-applying same pages is safe). |
| After checkpoint_truncate (step 9) | Truncated | Yes | WAL is empty. Data file is self-consistent. Normal operation resumes. |
Post-WAL-Sync Failures (CommitInDoubt)
When steps 7 or 8 fail after the WAL has been synced, the commit is durable in the WAL but the in-process session cannot confirm it succeeded on the data file. MuroDB handles this as follows:
- Transaction::commit() returns Err(CommitInDoubt).
- The session is poisoned: all subsequent operations return SessionPoisoned.
- On reopen, WAL recovery replays the committed transaction, converging to the correct state.
This design ensures that a durable commit is never lost, even if the process crashes or encounters I/O errors after the commit point.
Checkpoint Truncate Failure
If checkpoint_truncate() fails (step 9), the WAL retains committed records. On next open, recovery replays them idempotently. The data file already has the correct state (from steps 7-8), so replay simply overwrites pages with identical content. WAL growth is the only concern; monitor WAL file size on disk and failed_checkpoints via SHOW DATABASE STATS.
Idempotent Recovery
WAL recovery is designed to be idempotent:
- Running recover() multiple times on the same WAL produces identical database state.
- page_count only increases, never decreases, during recovery.
- Page data is overwritten with the WAL image regardless of current content.
- Metadata (catalog_root, freelist_page_id) is set to the last committed values.
This property is critical for crash-during-recovery scenarios: if the process crashes during recovery, the next recovery attempt produces the same result.
Torn WAL Tail
A crash during WAL writes can leave a partially-written frame at the end of the WAL file. MuroDB’s WAL reader handles this gracefully:
- Truncated frame: A frame header claiming more bytes than remain in the file is treated as end-of-log.
- Garbage bytes: Bytes that fail decryption or CRC validation at the tail are ignored.
- Zero-filled tail: Zero bytes (from filesystem pre-allocation) produce a zero frame length, treated as end-of-log.
- Mid-log corruption: Corruption followed by valid frames is a hard error (prevents silent data loss).
The WAL reader uses a two-layer tail detection heuristic:
- Structural check: Is the next frame structurally plausible (non-zero length, fits in file)?
- Content probe: Even if structurally plausible, can any following frame be successfully decrypted and CRC-validated?
If both checks indicate no valid data follows the corrupt frame, it is treated as tail garbage and ignored.
FTS Internals
Tokenization
- Normalization: NFKC unicode normalization
- Tokenizer: Bigram (n=2) - each text is split into overlapping 2-character sequences
- Example: “東京タワー” → [“東京”, “京タ”, “タワ”, “ワー”]
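The bigram split is a simple sliding window over characters. A minimal sketch (NFKC normalization is applied first in the real pipeline; it needs a Unicode library, so it is not shown):

```rust
// Bigram (n=2) tokenizer sketch: overlapping 2-character windows.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(bigrams("東京タワー"), ["東京", "京タ", "タワ", "ワー"]);
    assert!(bigrams("a").is_empty()); // single-char input yields no bigrams
}
```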
Term ID Blinding
Term IDs are computed using HMAC-SHA256:
- No plaintext tokens are stored on disk
- Term ID = HMAC-SHA256(term_key, normalized_token), where term_key is derived from the master key and DB salt
- This provides privacy: the disk contents do not reveal what terms are indexed
Postings Storage
Postings lists are stored in B-tree with compression:
- Delta encoding: Document IDs are stored as deltas from the previous ID
- Varint compression: Deltas are encoded as variable-length integers
- Postings are stored in the same B-tree infrastructure as regular data
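Delta + varint encoding for a sorted postings list can be sketched in a few lines (illustrative function names; the varint form assumed here is the common LEB128-style 7-bit encoding):

```rust
// Encode sorted doc IDs as varint-encoded deltas.
fn encode_postings(doc_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &id in doc_ids {
        let mut delta = id - prev; // requires ascending doc_ids
        prev = id;
        loop {
            let byte = (delta & 0x7F) as u8;
            delta >>= 7;
            if delta == 0 { out.push(byte); break; }
            out.push(byte | 0x80); // continuation bit
        }
    }
    out
}

fn decode_postings(bytes: &[u8]) -> Vec<u64> {
    let mut out = Vec::new();
    let (mut prev, mut acc, mut shift) = (0u64, 0u64, 0u32);
    for &b in bytes {
        acc |= ((b & 0x7F) as u64) << shift;
        if b & 0x80 == 0 {
            prev += acc;          // undo delta encoding
            out.push(prev);
            acc = 0; shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}

fn main() {
    let ids = [3u64, 7, 1000, 1003];
    let encoded = encode_postings(&ids);
    assert_eq!(decode_postings(&encoded), ids);
    assert!(encoded.len() < ids.len() * 8); // smaller than fixed 8-byte IDs
}
```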
Scoring
- Algorithm: BM25 (Okapi BM25)
- Used in NATURAL LANGUAGE MODE for relevance ranking
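For intuition, a generic Okapi BM25 per-term score looks like this. The k1/b values and the exact formula variant MuroDB uses are not specified here; this sketch uses the common defaults.

```rust
// Okapi BM25 per-term contribution (generic form, not MuroDB's exact variant).
// tf: term frequency in the doc, df: docs containing the term,
// n_docs: total docs, doc_len/avg_len: document length normalization.
fn bm25_term_score(tf: f64, df: f64, n_docs: f64, doc_len: f64, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75); // common defaults (assumed, not MuroDB's)
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
}

fn main() {
    // A rarer term (lower df) contributes more than a common one.
    let rare = bm25_term_score(2.0, 1.0, 1000.0, 50.0, 60.0);
    let common = bm25_term_score(2.0, 500.0, 1000.0, 50.0, 60.0);
    assert!(rare > common && common > 0.0);
}
```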
Phrase Matching
Phrase queries (e.g., "東京タワー") verify consecutive bigram positions:
- Tokenize the phrase into bigrams
- Find postings for each bigram
- Verify that positions are consecutive across all bigrams
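The consecutiveness check in step 3 can be sketched like this (illustrative shape: one position list per phrase bigram, within a single document):

```rust
// A phrase matches at position p iff bigram i occurs at position p + i
// for every bigram of the phrase.
fn phrase_match(positions_per_bigram: &[Vec<u32>]) -> Vec<u32> {
    let first = match positions_per_bigram.first() {
        Some(p) => p,
        None => return Vec::new(),
    };
    first
        .iter()
        .copied()
        .filter(|&start| {
            positions_per_bigram
                .iter()
                .enumerate()
                .skip(1)
                .all(|(i, pos)| pos.contains(&(start + i as u32)))
        })
        .collect()
}

fn main() {
    // Position lists for the 4 bigrams of "東京タワー" in one document.
    let postings = [vec![0, 5], vec![1, 9], vec![2], vec![3]];
    assert_eq!(phrase_match(&postings), [0]); // only start=0 is consecutive
}
```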
Snippet Generation
fts_snippet() uses a local scan approach with a UTF-8 offset map:
- Find matching positions in the document
- Build a char<->byte offset map for normalized text
- Convert match byte offsets to char windows via binary search
- Slice and apply highlight tags (open/close) around matched regions
- Truncate to the specified maximum length
Memory note:
- Offset map size is bounded by (normalized_chars + 1) * sizeof(usize) bytes per call.
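The offset map and binary-search conversion can be sketched in a few lines (illustrative names, not the actual internal API):

```rust
// map[i] is the byte offset where char i starts, plus an end sentinel,
// so a byte offset converts to a char index via binary search.
fn build_offset_map(text: &str) -> Vec<usize> {
    let mut map: Vec<usize> = text.char_indices().map(|(b, _)| b).collect();
    map.push(text.len()); // sentinel: one past the last char
    map
}

fn byte_to_char(map: &[usize], byte_off: usize) -> usize {
    // Exact boundary hit, or the char containing this byte.
    map.binary_search(&byte_off).unwrap_or_else(|i| i - 1)
}

fn main() {
    let map = build_offset_map("東京タワー"); // each char is 3 bytes in UTF-8
    assert_eq!(map, [0, 3, 6, 9, 12, 15]);
    assert_eq!(byte_to_char(&map, 6), 2); // byte 6 starts the third char
    // Matches the memory bound: (chars + 1) entries.
    assert_eq!(map.len(), "東京タワー".chars().count() + 1);
}
```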
Format Migration
Current Policy (as of 2026-02-22)
MuroDB supports database format v4 only.
- Opening v4 works.
- Opening v1/v2/v3 is rejected.
- Opening future versions (>v4) is also rejected.
This project currently has no production users on pre-v4 formats, so compatibility-migration code is intentionally removed to keep core storage logic simple and safer.
v4 Header Layout
Magic (8B) + Version (4B) + Salt (16B) + CatalogRoot (8B)
+ PageCount (8B) + Epoch (8B) + FreelistPageId (8B)
+ NextTxId (8B) + EncryptionSuiteId (4B) + CRC32 (4B)
- Header size: 76 bytes
- CRC32 covers bytes 0..72
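The field offsets implied by this layout can be checked mechanically. This sketch derives them from the field sizes and confirms the ranges quoted elsewhere in these docs (salt at 12..28, suite ID at 68..72):

```rust
// v4 header field byte ranges, derived from the declared field sizes.
fn header_field_ranges() -> Vec<(&'static str, std::ops::Range<usize>)> {
    let fields = [
        ("magic", 8), ("version", 4), ("salt", 16), ("catalog_root", 8),
        ("page_count", 8), ("epoch", 8), ("freelist_page_id", 8),
        ("next_tx_id", 8), ("encryption_suite_id", 4), ("crc32", 4),
    ];
    let mut off = 0usize;
    fields
        .iter()
        .map(|&(name, len)| {
            let range = off..off + len;
            off += len;
            (name, range)
        })
        .collect()
}

fn main() {
    let ranges = header_field_ranges();
    assert_eq!(ranges.last().unwrap().1.end, 76);           // total header size
    assert_eq!(ranges[2], ("salt", 12..28));                // matches KDF section
    assert_eq!(ranges[8], ("encryption_suite_id", 68..72)); // readable pre-passphrase
    // crc32 sits at 72..76 and covers bytes 0..72.
}
```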
Rejection Behavior
Opening an unsupported version returns:
WAL error: unsupported database format version N
WAL Format Version History
| Version | Changes |
|---|---|
| v1 | Initial: Begin, PagePut, MetaUpdate(catalog_root, page_count), Commit, Abort |
| v2 | MetaUpdate adds freelist_page_id field. Legacy v1 MetaUpdate (25 bytes) decoded with freelist_page_id=0 (backward compatible) |
| v3 | MetaUpdate adds epoch field. Legacy v1/v2 MetaUpdate records decode with epoch=0 (backward compatible) |
Backup
Overview
MuroDB provides a Database::backup() API for creating consistent, point-in-time snapshots of a running database. The backup file is a fully valid MuroDB database file that can be opened directly with the same key or password.
Design
Consistency Model
The backup API uses a write-lock + WAL checkpoint + byte-copy approach:
- Write lock: Acquires an exclusive lock to prevent concurrent writes during the backup window.
- WAL checkpoint: Calls checkpoint_truncate() to flush all committed WAL records into the main data file and truncate the WAL. This ensures the data file is self-consistent.
- Byte-copy: Copies the plaintext header (76 bytes) and all encrypted pages as raw bytes from the source file to the destination. No decryption or re-encryption occurs.
- fsync: The destination file is fsynced before the lock is released, ensuring durability.
What Gets Copied
The backup copies exactly HEADER_SIZE + page_count * page_size_on_disk bytes:
- Header (76 bytes): Magic, format version, salt, catalog root, page count, epoch, freelist page ID, next TxId, encryption suite ID, CRC32.
- Pages: All pages in their on-disk (encrypted) form, including data pages, B-tree nodes, freelist pages, and FTS posting pages.
The WAL file is not included in the backup because the checkpoint step ensures all committed data is already in the main file.
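The copy size is therefore easy to predict. A sketch for the default encrypted suite (plaintext mode uses a different on-disk page size, not shown here):

```rust
// Backup size = header + page_count * encrypted on-disk page size.
const HEADER_SIZE: u64 = 76;
const PAGE_SIZE_ON_DISK: u64 = 12 + 4096 + 16; // nonce + page + auth tag = 4124

fn backup_bytes(page_count: u64) -> u64 {
    HEADER_SIZE + page_count * PAGE_SIZE_ON_DISK
}

fn main() {
    // A 1000-page encrypted database copies just over 4 MB.
    assert_eq!(backup_bytes(1000), 4_124_076);
    assert_eq!(backup_bytes(0), 76); // header only
}
```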
Concurrency
- During backup, writers are blocked (exclusive lock held).
- Backup is a sequential byte copy, so the write-stall window is proportional to database size (typically brief at embedded scale).
- After the backup completes, normal read/write operations resume.
Usage
Rust API
use murodb::Database;

let mut db = Database::open(path, &master_key)?;
// ... normal operations ...

// Create a backup
db.backup("backup.db")?;
Restore
Restoring from a backup is a file-level operation:
- Stop the application (or close the Database handle).
- Replace the database file with the backup file.
- Delete the WAL file (.wal suffix) if present — the backup has no pending WAL.
- Reopen the database.
// The backup file is a valid MuroDB database
let mut restored = Database::open("backup.db", &master_key)?;
Constraints
| Constraint | Detail |
|---|---|
| Disk space | Requires free space equal to the full database size. |
| Write stall | Writers are blocked for the duration of the copy. |
| WAL | WAL is checkpointed and truncated before copy; not included in backup. |
| Encryption | Backup preserves the same encryption suite, key, and salt. |
| Atomicity | If the backup process crashes mid-copy, the destination file may be partial/corrupt. The source database is unaffected. |
Formal Verification
MuroDB uses TLA+ to formally verify crash/recovery protocol invariants.
Files
| File | Description |
|---|---|
| specs/tla/CrashResilience.tla | System model |
| specs/tla/CrashResilience.cfg | TLC configuration (small finite state space) |
| specs/tla/CrashResilience.large.cfg | Larger state space for deeper checking |
| specs/tla/run_tlc.sh | Helper script to run TLC |
What is modeled
- Transaction lifecycle: BeginTx, WritePage, SetMeta, DurableCommit
- Partial flush before crash: FlushSomeCommitted
- Crash and recovery: Crash, Recover
- WAL replay semantics at transaction granularity:
  - Committed writes are recovered
  - Uncommitted writes are ignored
  - Metadata (catalogRoot, pageCount) is recovered from committed records
Checked invariants
| Invariant | Description |
|---|---|
| TypeInv | Basic type safety of all state variables |
| RecoveredSound | After recovery, DB state equals replayed committed WAL state |
| NoUncommittedInfluence | Uncommitted transactions do not influence recovered state |
| CommitRequiresMeta | Committed transactions always have a metadata update |
| UniqueCommittedOrder | Each transaction appears at most once in commit order |
| FreelistPreserved | After recovery, freelist ID equals the last committed freelist ID |
Running TLC
Prerequisites
- Java runtime
- tla2tools.jar (TLC model checker)
Using Make
make tlc-tools # Download tla2tools.jar
make tlc # Run with small config (fast)
make tlc-large # Run with large config (deeper)
Manual
export TLA2TOOLS_JAR=/path/to/tla2tools.jar
./specs/tla/run_tlc.sh
Or with the tlc2 command:
tlc2 -config specs/tla/CrashResilience.cfg specs/tla/CrashResilience.tla
Scope and Limitations
- This is an abstract model, not byte-level WAL frame parsing
- It does not model OS/filesystem durability anomalies directly
- It validates protocol-level invariants and crash/recovery semantics
Correspondence with Implementation
See the WAL & Crash Resilience page for a detailed mapping between TLA+ invariants and their implementation in code.
Roadmap
Implemented
- Basic CRUD (INSERT, SELECT, UPDATE, DELETE)
- CREATE TABLE (PRIMARY KEY, UNIQUE, NOT NULL)
- CREATE INDEX / CREATE UNIQUE INDEX (single column)
- CREATE FULLTEXT INDEX (bigram, BM25, NATURAL/BOOLEAN mode, snippet)
- MySQL-compatible integer types (TINYINT, SMALLINT, INT, BIGINT)
- VARCHAR(n), VARBINARY(n), TEXT with size validation
- UUID type (16-byte native, UUID_V4/UUID_V7 generation)
- Hex literal (X'...') for VARBINARY data
- WHERE with comparison operators (=, !=, <, >, <=, >=)
- AND, OR logical operators
- ORDER BY (ASC/DESC, multi-column), LIMIT
- JOIN (INNER, LEFT, CROSS) with table aliases
- BEGIN / COMMIT / ROLLBACK
- SHOW TABLES
- Multi-row INSERT
- Hidden _rowid auto-generation for tables without explicit PK
- AES-256-GCM-SIV encryption, Argon2 KDF
- WAL-based crash recovery
- CLI with REPL
- DROP TABLE / DROP TABLE IF EXISTS
- DROP INDEX
- IF NOT EXISTS for CREATE TABLE / CREATE INDEX
- SHOW CREATE TABLE
- DESCRIBE / DESC table
- LIKE / NOT LIKE (% and _ wildcards)
- IN (value list)
- BETWEEN … AND …
- IS NULL / IS NOT NULL
- NOT operator (general)
- OFFSET (SELECT … LIMIT n OFFSET m)
- DEFAULT column values
- AUTO_INCREMENT
- Arithmetic operators in expressions (+, -, *, /, %)
- BOOLEAN type (alias for TINYINT)
- CHECK constraint
Phase 2 — Built-in Functions ✓
MySQL-compatible scalar functions.
- String: LENGTH, CHAR_LENGTH, CONCAT, SUBSTRING/SUBSTR, UPPER, LOWER
- String: TRIM, LTRIM, RTRIM, REPLACE, REVERSE, REPEAT
- String: LEFT, RIGHT, LPAD, RPAD, INSTR/LOCATE
- String: REGEXP / REGEXP_LIKE
- Numeric: ABS, CEIL/CEILING, FLOOR, ROUND, MOD, POWER/POW
- NULL handling: COALESCE, IFNULL, NULLIF, IF
- Type conversion: CAST(expr AS type)
- CASE WHEN … THEN … ELSE … END
Phase 3 — Aggregation & Grouping ✓
- COUNT, SUM, AVG, MIN, MAX
- COUNT(DISTINCT …)
- GROUP BY (single and multiple columns)
- HAVING
- SELECT DISTINCT
Phase 4 — Schema Evolution ✓
- ALTER TABLE ADD COLUMN
- ALTER TABLE DROP COLUMN
- ALTER TABLE MODIFY COLUMN / CHANGE COLUMN
- RENAME TABLE
- Composite PRIMARY KEY
- Composite UNIQUE / composite INDEX
Phase 5 — Advanced Query ✓
- Subqueries (WHERE col IN (SELECT …), scalar subquery)
- UNION / UNION ALL
- EXISTS / NOT EXISTS
- INSERT … ON DUPLICATE KEY UPDATE
- REPLACE INTO
- EXPLAIN (query plan display)
- RIGHT JOIN
- Shared-lock read path (Database::query) with CLI auto routing
Phase 6 — Types & Storage
- FLOAT / DOUBLE
- DATE, DATETIME, TIMESTAMP
- Scope: fully align parser/executor/CAST/default/literal behavior and edge-case validation.
- Done when:
- Temporal literals and string casts behave consistently across INSERT/UPDATE/WHERE.
- Arithmetic and comparison semantics are defined/documented for mixed temporal expressions.
- Timezone handling policy is explicit (especially TIMESTAMP input/output normalization).
- Invalid dates/times reject with deterministic errors.
- Date/time functions: NOW, CURRENT_TIMESTAMP, DATE_FORMAT, etc.
- UUID type with UUID_V4() and UUID_V7() generation functions
- DECIMAL(p,s) / NUMERIC(p,s) fixed-point exact numeric type
  - 96-bit mantissa via rust_decimal, precision 1-28, 16-byte storage
  - Full arithmetic, comparison, CAST, aggregation (SUM/AVG/MIN/MAX), ORDER BY, GROUP BY, INDEX support
  - MySQL-compatible: NUMERIC alias, default DECIMAL(10,0), DECIMAL+INT→DECIMAL, DECIMAL+FLOAT→FLOAT
- BLOB (skipped for now)
- Decision (2026-02-22): defer and move focus to Phase 7 performance work.
- Why skipped now:
- Current product priorities are query/index performance and planner improvements, not large-object type expansion.
  - BLOB adds non-trivial storage/operational surface area (limits, indexing semantics, comparison behavior) with low near-term user impact.
  - Existing VARBINARY(n)/TEXT coverage is sufficient for current workloads.
- Revisit when:
- There is a concrete workload requiring large binary payloads that cannot be handled acceptably by current types.
- The performance roadmap items in Phase 7 are complete or no longer the bottleneck.
- Overflow pages (posting list > 4096B)
- Scope: support values/postings that exceed single-page capacity.
- Progress:
  - Implemented FTS segment overflow chains (__segovf__) with a typed page format (OFG1).
  - Read/write/delete + vacuum path now reclaims overflow pages without orphaning.
  - Covered by unit/integration tests (cargo test green as of 2026-02-22).
  - Added WAL recovery integration tests for overflow chains (torn WAL tail and post-sync partial-write replay paths).
  - Benchmarked on 2026-02-22 (murodb_bench, commit 829ad18145c2) with no severe small-record regression signal.
  - Implemented B-tree value overflow pages (2026-02-23): large row values (>~4073 bytes) now spill to overflow page chains transparently. Format version bumped to 5 (backward-compatible with v4).
- Done when:
- Overflow chain format is versioned and crash-safe.
- WAL/recovery covers partial-write and torn-tail scenarios for overflow chains.
- Vacuum/reclaim path correctly frees overflow pages without orphaning.
- Benchmarks show no severe regressions for small records.
Phase 7 — Performance & Internals
- Auto-checkpoint (threshold-based WAL)
- Composite index range scan
- Progress:
  - Added planner/executor support for composite-index range seek on the last key part (e.g. (a,b) with a = ? and a range on b).
  - EXPLAIN now reports type=range for this access path.
  - EXPLAIN now reports estimated cardinality via rows.
- Done when:
  - Multi-column prefix ranges ((a,b) with predicates on a, optional range on b) use index scan.
  - EXPLAIN shows index-range choice and estimated cardinality.
  - Fallback path remains correct for unsupported predicate shapes.
- Query optimizer improvements (cost-based)
- Progress:
  - Added deterministic heuristic cost hints for PkSeek / IndexSeek / IndexRangeSeek / FullScan.
  - Planner now compares index candidates by cost instead of choosing the first matching index.
  - EXPLAIN now reports a cost column for the chosen plan.
  - Added persisted stats via ANALYZE TABLE (table_rows, index_distinct_keys) in catalog metadata.
  - EXPLAIN row estimation now prefers persisted table_rows when available.
  - Planner cost model now incorporates persisted table_rows / index_distinct_keys when available, with conservative fallback selectivity when stats are missing.
  - EXPLAIN rows/cost now uses the same planner estimation logic (with table-row fallback), so estimates reflect planner tradeoffs.
  - JOIN loop-order choice for INNER/CROSS now uses planner-side estimated row counts (stats-aware with runtime fallback) and keeps row shape (left + right) stable.
  - ANALYZE TABLE now persists numeric min/max bounds and equal-width histogram bins for single-column numeric B-tree indexes; range row estimation uses these stats when available.
  - EXPLAIN for JOIN now reports nested-loop outer-side choice with estimated left/right row counts in Extra.
- Done when:
  - Planner compares at least full-scan vs single-index vs join-order alternatives.
  - Basic column stats/histograms are persisted and refreshable.
  - Plan choice is deterministic under identical stats.
- FTS stop-ngram filtering
- Progress:
  - Added FULLTEXT options stop_filter and stop_df_ratio_ppm (ppm threshold).
  - NATURAL LANGUAGE MODE now supports skipping high-DF ngrams when enabled.
  - Default remains OFF for exact-behavior compatibility.
  - Recall/precision tradeoff example documented in the Full-Text Search guide.
- Done when:
  - Frequent low-information ngrams are skipped using configurable thresholds.
  - Recall/precision tradeoff is documented with benchmark examples.
  - Toggle exists for exact behavior compatibility.
- fts_snippet acceleration (pos-to-offset map)
- Progress:
  - Replaced snippet byte/char conversion loops with a UTF-8 position-to-offset map plus binary search.
  - Snippet assembly now slices by byte ranges instead of repeatedly collecting char vectors.
  - Added dedicated benchmark runner (murodb_snippet_bench) with legacy-vs-new comparison and offset-map memory estimate.
  - On 2026-02-22 (local, release build), the long-text tail-hit case showed a small p50 improvement (legacy 1245.52us -> new 1228.43us).
- Done when:
  - Snippet generation avoids repeated UTF-8 rescans for long docs.
  - Latency improvement is measured and documented on representative datasets.
  - Memory overhead remains bounded and observable.
Phase 8 — Security (Future)
- Key rotation (epoch-based re-encryption)
- Implemented API-based rekey (`Database::rekey_with_password`) for full page re-encryption.
- New random salt generated on each rotation; epoch incremented.
- Crash-safe via `.rekey` marker file with automatic recovery on next open.
- Rejects inside transactions and on plaintext databases.
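The `.rekey` marker protocol can be sketched with plain filesystem calls. This is an illustrative simplification (no fsync, no page rewriting) of the crash-safety idea, not MuroDB's actual implementation:

```rust
use std::fs;
use std::path::Path;

/// Persist intent before rewriting pages; remove the marker only after the
/// rotation completes, so a crash mid-rekey is detectable on the next open.
/// A real implementation would also fsync the marker and the directory.
fn rekey(db_path: &Path) -> std::io::Result<()> {
    let marker = db_path.with_extension("rekey");
    fs::write(&marker, b"rekey in progress")?; // 1. record intent durably
    // 2. ... re-encrypt every page under the new salt/epoch here ...
    fs::remove_file(&marker) // 3. rotation is complete only now
}

/// On open: a leftover marker means the previous rotation was interrupted
/// and recovery must run before serving queries.
fn needs_recovery(db_path: &Path) -> bool {
    db_path.with_extension("rekey").exists()
}

fn main() -> std::io::Result<()> {
    let db = std::env::temp_dir().join("murodb_rekey_demo.db");
    rekey(&db)?;
    assert!(!needs_recovery(&db));
    Ok(())
}
```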
Phase 9 — Practical Embedded DB (Next)
Real-world deployment features to make MuroDB easier to embed and operate.
- Encryption OFF mode
- Motivation: some embedded deployments prefer CPU savings and rely on disk/host-level protection.
- Done when:
- DB format can be created/opened in explicit plaintext mode.
- File header clearly records mode to avoid accidental mis-open.
- CLI/API require explicit opt-in (no silent downgrade from encrypted DB).
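The "no silent downgrade" rule amounts to comparing the header-recorded mode against what the caller explicitly requested. A hypothetical sketch of that check; the enum, function, and error strings are illustrative, since this feature is still planned:

```rust
/// At-rest mode as recorded in the file header (names illustrative).
#[derive(Debug, PartialEq, Clone, Copy)]
enum AtRestMode {
    Plaintext,
    Aes256GcmSiv,
}

/// Refuse any mode mismatch: no silent downgrade from an encrypted file,
/// and no accidental "encrypted" open of a plaintext file.
fn check_open(header: AtRestMode, requested: AtRestMode) -> Result<(), String> {
    if header == requested {
        return Ok(());
    }
    match header {
        AtRestMode::Aes256GcmSiv => {
            Err("file is encrypted; plaintext open refused".into())
        }
        AtRestMode::Plaintext => {
            Err("file was created in plaintext mode; open it with encryption off".into())
        }
    }
}

fn main() {
    assert!(check_open(AtRestMode::Aes256GcmSiv, AtRestMode::Plaintext).is_err());
    assert!(check_open(AtRestMode::Plaintext, AtRestMode::Plaintext).is_ok());
    println!("ok");
}
```

Recording the mode in the header is what makes the error deterministic: the same file always fails the same way regardless of caller configuration.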
- Pluggable encryption suite
- Motivation: allow policy-driven algorithm choice without forking storage engine.
- Done when:
- Algorithm + KDF are selected by explicit config at DB creation.
- Supported suites are versioned, discoverable, and recorded in metadata.
- Wrong-suite open errors are deterministic and actionable.
- Rekey / algorithm migration
- Rekey implemented via API (`Database::rekey_with_password`) and a dedicated CLI (`murodb-rekey`).
- Crash-recoverable via `.rekey` marker file.
- Algorithm migration (cipher suite change) deferred to future work.
- Backup API + consistent snapshot
- Decision (2026-02-22):
- Prioritize early in Phase 9 so embedded apps can take consistent backups without full writer quiesce windows.
- Why now:
- File-copy backup while writes are active is error-prone operationally.
- A first-class API can provide deterministic snapshot semantics and simpler restore contracts.
- Done when:
- Online consistent backup without long writer stalls.
- Restore path validated by integration tests.
- Snapshot metadata includes format/security parameters.
- Operational limits and safeguards
- Done when:
- Configurable caps for DB file size, WAL size, statement timeout, and memory budget.
- Error surfaces are clear and machine-parseable for host applications.
- Default limits are documented with recommended profiles (edge device / server / CI).
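One possible shape for such caps, sketched as a config struct; the field names and default values are invented placeholders, not MuroDB's recommended profiles:

```rust
/// Illustrative operational caps; `None` means unlimited.
#[derive(Debug)]
struct Limits {
    max_db_file_bytes: Option<u64>,
    max_wal_bytes: Option<u64>,
    statement_timeout_ms: Option<u64>,
    memory_budget_bytes: Option<u64>,
}

impl Default for Limits {
    fn default() -> Self {
        Limits {
            max_db_file_bytes: None,                      // unlimited
            max_wal_bytes: Some(64 * 1024 * 1024),        // 64 MiB
            statement_timeout_ms: Some(30_000),           // 30 s
            memory_budget_bytes: Some(256 * 1024 * 1024), // 256 MiB
        }
    }
}

fn main() {
    let limits = Limits::default();
    // A machine-parseable error surface can then reference the exceeded cap.
    if let Some(cap) = limits.max_wal_bytes {
        println!("{{\"error\":\"wal_limit\",\"cap_bytes\":{cap}}}");
    }
}
```

Keeping every cap optional lets the documented profiles (edge device / server / CI) differ only in which fields they set.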