
MuroDB

MuroDB is an embedded SQL database written in Rust.

This documentation is designed for people who already know SQL and want to start building quickly.

What You Can Do Quickly

  • Create a local database file and start querying right away.
  • Keep data durable with WAL-based crash recovery.
  • Use encrypted-at-rest mode by default, or explicitly opt into plaintext mode.
  • Add full-text search with bigram tokenization and relevance scoring.

If you want hands-on steps first, start here:

  1. First Session (10 minutes)
  2. Quick Start
  3. SQL Reference

Core Capabilities

  • At-rest mode: aes256-gcm-siv (default) or explicit off (plaintext)
  • Storage engine: B+tree, WAL, page cache, crash recovery
  • Transactions: ACID semantics
  • Concurrency model: multiple readers / single writer
  • Full-text search: MATCH(...) AGAINST(...), NATURAL/BOOLEAN modes, fts_snippet()

Architecture Map

Component     Description
crypto/       Page encryption, key derivation, and term-ID blinding
storage/      Pager, slotted pages, cache, and freelist management
btree/        Index/table structures, search, and mutation operations
wal/          Write-ahead log, reader/writer, recovery flow
tx/           Transaction lifecycle, dirty-page buffering, commit/rollback
schema/       Catalog metadata for tables and indexes
sql/          Lexer, parser, planner, executor, and session control
fts/          Tokenizer, postings, BM25 scoring, snippets
concurrency/  In-process and cross-process locking

Non-goals

  • Network server protocol
  • Full access-pattern obfuscation
  • Stored procedures / triggers

Installation

Requirements

  • Rust toolchain (stable)

Install from source

git clone https://github.com/tokuhirom/murodb.git
cd murodb
cargo install --path .

This installs the murodb binary to ~/.cargo/bin/.

Build only

cargo build --release

The binary will be at target/release/murodb.

First Session (10 Minutes)

This page is a fast path for SQL users who want to get a feel for MuroDB before reading the full details.

1. Install

cargo install --path .

2. Create a database and schema

murodb demo.db --create -e "CREATE TABLE notes (
  id BIGINT PRIMARY KEY,
  title VARCHAR NOT NULL,
  body TEXT
)"

Encrypted mode prompts for a password by default.

If you need plaintext mode, opt in explicitly:

murodb demo-plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"

3. Insert and query rows

murodb demo.db -e "INSERT INTO notes (id, title, body) VALUES
  (1, 'welcome', 'hello from murodb'),
  (2, 'next', 'test full text search')"

murodb demo.db -e "SELECT id, title FROM notes ORDER BY id"

4. Try full-text search

murodb demo.db -e "CREATE FULLTEXT INDEX notes_body_fts ON notes(body)
  WITH PARSER ngram
  OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000)"

murodb demo.db -e "SELECT id,
  MATCH(body) AGAINST('full text' IN NATURAL LANGUAGE MODE) AS score
FROM notes
WHERE MATCH(body) AGAINST('full text' IN NATURAL LANGUAGE MODE) > 0
ORDER BY score DESC"

5. Open interactive REPL

murodb demo.db

Start without -e to enter REPL mode.

Tip:

  • Press Ctrl-C while typing to clear the current buffer.
  • Press Ctrl-C during statement execution to cancel the running statement and keep the REPL open.

Quick Start

This page is a concise command reference.
For a guided first run, see First Session (10 Minutes).

Create a database

murodb mydb.db --create -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"

You will be prompted for an encryption password.

If you need plaintext mode, opt in explicitly:

murodb mydb_plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"

Insert rows

murodb mydb.db -e "INSERT INTO t (id, name) VALUES (1, 'hello'), (2, 'world')"

Query rows

murodb mydb.db -e "SELECT id, name FROM t ORDER BY id"

Show tables

murodb mydb.db -e "SHOW TABLES"

Run with JSON output

murodb mydb.db --format json -e "SELECT id, name FROM t ORDER BY id"

Interactive REPL

murodb mydb.db

Start without -e to enter the interactive REPL mode.

CLI Options

Basic usage

murodb <database-file> [options]

Options

Option                                Description
-e <SQL>                              Execute SQL and exit
--create                              Create a new database
--encryption <aes256-gcm-siv|off>     Encryption suite (off = plaintext, explicit opt-in)
--password <PW>                       Password for aes256-gcm-siv mode (prompts if omitted)
--recovery-mode <strict|permissive>   WAL recovery policy for open
--format <text|json>                  Output format for query results
--busy-timeout-ms <N>                 Lock wait timeout in milliseconds (0 = wait indefinitely)
--statement-timeout-ms <N>            Per-statement execution timeout in milliseconds (0 = no timeout)

Examples

# Create a new database
murodb mydb.db --create -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"

# Create plaintext DB (explicit)
murodb mydb_plain.db --create --encryption off -e "CREATE TABLE t (id BIGINT PRIMARY KEY, name VARCHAR)"

# Insert data
murodb mydb.db -e "INSERT INTO t (id, name) VALUES (1, 'hello')"

# Query
murodb mydb.db -e "SELECT * FROM t"

# Interactive REPL
murodb mydb.db

# Open with permissive recovery mode
murodb mydb.db --recovery-mode permissive

# JSON output for machine processing
murodb mydb.db --format json -e "SELECT * FROM t"

Query routing behavior

The CLI parses each statement and routes execution automatically:

  • Read-only statements (SELECT, UNION, EXPLAIN SELECT, SHOW ..., DESCRIBE) use the read path.
  • EXPLAIN UPDATE / EXPLAIN DELETE are supported, but route according to the inner statement category.
  • Write and transaction-control statements (INSERT, UPDATE, DELETE, DDL, BEGIN/COMMIT/ROLLBACK) use the write path.
  • While an explicit transaction is active (BEGIN ... COMMIT/ROLLBACK), all statements (including SELECT) run with execute semantics.

Ctrl-C behavior

  • In REPL input mode (while typing a statement), Ctrl-C clears the current input buffer.
  • While a statement is executing, Ctrl-C requests cancellation of the in-flight statement.
  • If cancellation succeeds, the statement returns an error and the REPL stays alive.
  • In one-shot mode (-e), Ctrl-C also requests cancellation; the command then exits after printing the statement error.

JSON output

When using --format json, results are emitted as a single JSON object per statement.

Result envelope

  • type - One of rows, rows_affected, ok, or error
  • columns - Column names in result order (only for rows)
  • rows - Array of row arrays in column order (only for rows)
  • row_count - Number of rows (only for rows)
  • rows_affected - Number of rows affected (only for rows_affected)
  • message - Error message string (only for error)

Example:

{"type":"rows","columns":["id","name"],"rows":[[1,"alice"]],"row_count":1}
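A consumer script can dispatch on the type field. Here is a minimal sketch of reading the envelope; the handle_result helper is illustrative, not part of MuroDB:

```python
import json

def handle_result(line: str):
    """Dispatch on the envelope's `type` field (illustrative helper)."""
    result = json.loads(line)
    if result["type"] == "rows":
        # `columns` and each entry of `rows` are parallel: zip them into dicts.
        return [dict(zip(result["columns"], row)) for row in result["rows"]]
    if result["type"] == "rows_affected":
        return result["rows_affected"]
    if result["type"] == "error":
        raise RuntimeError(result["message"])
    return None  # type == "ok"

envelope = '{"type":"rows","columns":["id","name"],"rows":[[1,"alice"]],"row_count":1}'
print(handle_result(envelope))  # [{'id': 1, 'name': 'alice'}]
```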

Types

INTEGER

Numbers are emitted as JSON numbers.

FLOAT

Finite values are emitted as JSON numbers. Non-finite values are emitted as JSON strings.

DATE

YYYY-MM-DD

DATETIME

ISO 8601 YYYY-MM-DDTHH:MM:SS

TIMESTAMP

ISO 8601 YYYY-MM-DDTHH:MM:SS

VARCHAR

JSON strings with standard escaping.

VARBINARY

Base64 string (standard alphabet with = padding), for example q80=.
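Because this is the standard base64 alphabet with padding, any stock decoder round-trips it; the q80= example above corresponds to the two bytes 0xAB 0xCD:

```python
import base64

# VARBINARY values in the JSON envelope use standard base64 with padding;
# the documented example "q80=" decodes to the two bytes 0xAB 0xCD.
raw = base64.b64decode("q80=")
print(raw.hex())  # abcd

# Encoding round-trips the same way.
print(base64.b64encode(bytes([0xAB, 0xCD])).decode())  # q80=
```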

NULL

null

WAL inspection

WAL inspection is handled by a dedicated command so the query CLI stays simple:

murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive

See WAL Inspection for exit codes and JSON schema.

Rekey Command

Use the dedicated command for password rotation:

murodb-rekey mydb.db

The command prompts for current password, new password, and confirmation via TTY.

Security Notes

  • Prefer interactive password prompt over --password to reduce secret exposure in process lists/history.
  • Password rotation uses murodb-rekey (interactive prompt), not SQL text.

WAL Inspection

WAL inspection is a separate command so the main murodb CLI stays focused on queries.

Basic usage

murodb-wal-inspect <database-file> --wal <WAL-PATH> [options]

Options

Option                                Description
--wal <PATH>                          WAL file path or quarantine file path
--password <PW>                       Password (prompts if omitted)
--recovery-mode <strict|permissive>   Recovery policy used during inspection
--format <text|json>                  Output format for inspection results

Examples

# Text output
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive

# JSON output (for automation)
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json

# Inspect a quarantine WAL file
murodb-wal-inspect mydb.db --wal mydb.wal.quarantine.20240101_120000

Exit codes

Exit code  Meaning
0          No malformed transactions detected
10         Malformed transactions detected (inspection succeeded)
20         Fatal error (decrypt/IO/strict failure, etc.)

JSON output

When using --format json, the output includes stable fields:

  • schema_version - Schema version for the JSON format
  • mode - Recovery mode used
  • wal_path - Path to the WAL file
  • generated_at - Timestamp of the inspection
  • status - ok, warning, or fatal
  • exit_code - Exit code
  • skipped[].code - Machine-readable classification of skipped transactions
  • fatal_error / fatal_error_code - Present on fatal failures

SQL Reference

Data Types

Type          Storage    Range
TINYINT       1 byte     -128 to 127
SMALLINT      2 bytes    -32,768 to 32,767
INT           4 bytes    -2,147,483,648 to 2,147,483,647
BIGINT        8 bytes    -2^63 to 2^63-1
BOOLEAN       1 byte     Alias for TINYINT
DATE          4 bytes    YYYY-MM-DD
DATETIME      8 bytes    YYYY-MM-DD HH:MM:SS
TIMESTAMP     8 bytes    YYYY-MM-DD HH:MM:SS (timezone-aware input, normalized to UTC)
VARCHAR(n)    variable   max n bytes (optional)
TEXT          variable   unbounded text
JSONB         variable   Canonical JSON text (validated on write)
VARBINARY(n)  variable   max n bytes (optional)
FLOAT         4 bytes    Single-precision IEEE 754
DOUBLE        8 bytes    Double-precision IEEE 754
DECIMAL(p,s)  16 bytes   Fixed-point exact numeric (precision 1-28, scale 0-p). Alias: NUMERIC(p,s). Default: DECIMAL(10,0)
UUID          16 bytes   128-bit UUID (RFC 9562), stored as fixed-length binary
NULL          0 bytes    null value

Temporal semantics:

  • DATE stores calendar date only.
  • DATETIME stores date-time as provided (no timezone conversion).
  • TIMESTAMP accepts timezone offsets in string input (for example +09:00, Z) and stores UTC-normalized value.
  • Invalid calendar/time values are rejected.

DDL (Data Definition Language)

CREATE TABLE

CREATE TABLE t (
  id BIGINT PRIMARY KEY,
  body VARCHAR,
  blob VARBINARY,
  uniq VARCHAR UNIQUE
);

-- With additional features
CREATE TABLE users (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR NOT NULL,
  email VARCHAR UNIQUE,
  age INT DEFAULT 0,
  active BOOLEAN DEFAULT 1,
  CONSTRAINT chk_age CHECK (age >= 0)
);

-- IF NOT EXISTS
CREATE TABLE IF NOT EXISTS t (id BIGINT PRIMARY KEY);

-- Composite PRIMARY KEY
CREATE TABLE orders (
  user_id INT,
  order_id INT,
  amount INT,
  PRIMARY KEY (user_id, order_id)
);

-- Composite UNIQUE constraint
CREATE TABLE t (
  id BIGINT PRIMARY KEY,
  a INT,
  b INT,
  UNIQUE (a, b)
);

-- FOREIGN KEY (default: RESTRICT)
CREATE TABLE children (
  id BIGINT PRIMARY KEY,
  parent_id BIGINT,
  FOREIGN KEY (parent_id) REFERENCES parents(id)
);

-- FOREIGN KEY with actions
CREATE TABLE child_actions (
  id BIGINT PRIMARY KEY,
  parent_id BIGINT,
  FOREIGN KEY (parent_id) REFERENCES parents(id)
    ON DELETE CASCADE
    ON UPDATE SET NULL
);

CREATE INDEX

CREATE UNIQUE INDEX idx_email ON users(email);

-- IF NOT EXISTS
CREATE INDEX IF NOT EXISTS idx_name ON users(name);

-- Composite index (multiple columns)
CREATE INDEX idx_ab ON t(a, b);
CREATE UNIQUE INDEX idx_ab ON t(a, b);

CREATE FULLTEXT INDEX

CREATE FULLTEXT INDEX t_body_fts ON t(body)
  WITH PARSER ngram
  OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000);

FULLTEXT is usable with any primary-key type. Internally, MuroDB maintains a separate FTS doc_id. stop_filter supports on/off (quoted or unquoted), 1/0, and true/false. stop_df_ratio_ppm range is 0..=1000000.
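To build intuition for what an n=2 (bigram) parser indexes, here is a rough sketch of NFKC-normalized bigram tokenization. This is not MuroDB's implementation; the lowercasing and whitespace splitting shown are assumptions for illustration only:

```python
import unicodedata

def ngram_tokens(text: str, n: int = 2) -> list[str]:
    """NFKC-normalize and lowercase, then emit overlapping n-grams per word."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    tokens = []
    for word in normalized.split():
        if len(word) < n:
            tokens.append(word)  # words shorter than n become one token
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens

print(ngram_tokens("Full Text"))  # ['fu', 'ul', 'll', 'te', 'ex', 'xt']
```

Because every query term is broken into the same overlapping bigrams, substring-style matches work without a language-specific word segmenter.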

DROP TABLE / DROP INDEX

DROP TABLE t;
DROP TABLE IF EXISTS t;
DROP INDEX idx_email;
DROP INDEX IF EXISTS idx_email;

ALTER TABLE

-- Add a new column (O(1), no row rewrite)
ALTER TABLE t ADD COLUMN email VARCHAR;
ALTER TABLE t ADD age INT DEFAULT 0;

-- Drop a column (full table rewrite)
ALTER TABLE t DROP COLUMN age;

-- Modify column type or constraints (full rewrite if type changes)
ALTER TABLE t MODIFY COLUMN name VARCHAR(255) NOT NULL;
ALTER TABLE t MODIFY name TEXT;

-- Rename and optionally change a column (CHANGE COLUMN)
ALTER TABLE t CHANGE COLUMN name username VARCHAR;

-- Add / drop FOREIGN KEY
ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent(id);
ALTER TABLE child DROP FOREIGN KEY (parent_id);

Performance notes:

  • ADD COLUMN is O(1) — only updates the catalog. Existing rows return the default value (or NULL) for the new column without rewriting data.
  • DROP COLUMN, MODIFY COLUMN (with type change), and CHANGE COLUMN (with type change) perform a full table rewrite.
  • MODIFY COLUMN / CHANGE COLUMN without a type change is catalog-only (O(1)).

Behavior details:

  • ADD COLUMN ... NOT NULL without DEFAULT fails if the table already has rows.
  • ADD COLUMN ... UNIQUE creates an automatic unique index (auto_unique_<table>_<column>).
  • ADD COLUMN ... UNIQUE with a non-NULL default fails for multi-row existing tables, because all rows would backfill to the same value.
  • MODIFY COLUMN / CHANGE COLUMN that adds NOT NULL validates existing rows and fails if NULL values are present.
  • MODIFY COLUMN / CHANGE COLUMN with a type change rewrites all rows and coerces values; conversion failures abort the statement.
  • CHANGE COLUMN updates index metadata to the new column name when indexes reference the old name.
  • MODIFY COLUMN / CHANGE COLUMN reconcile single-column UNIQUE: adding UNIQUE may create an index; removing UNIQUE drops the corresponding auto unique index.
  • ADD FOREIGN KEY validates existing rows; if orphan rows exist, it fails.
  • FK actions support RESTRICT, CASCADE, and SET NULL for both ON DELETE and ON UPDATE.

Limitations:

  • Cannot add a PRIMARY KEY column via ALTER TABLE.
  • Cannot drop a PRIMARY KEY column.
  • Cannot drop a column that has an index on it (drop the index first).
  • Cannot drop a table that is referenced by a foreign key.
  • DROP FOREIGN KEY is specified by child column list: DROP FOREIGN KEY (col1, col2).

RENAME TABLE

RENAME TABLE old_name TO new_name;

Renames a table. All indexes are automatically updated. No row data is rewritten.

Schema Inspection

SHOW TABLES;
SHOW CREATE TABLE t;
DESCRIBE t;
DESC t;

Operational Inspection

SHOW CHECKPOINT STATS;
SHOW DATABASE STATS;

Both commands return two columns: stat and value.

SHOW DATABASE STATS includes cache observability fields:

  • pager_cache_hits
  • pager_cache_misses
  • pager_cache_hit_rate_pct

It also exposes checkpoint policy/runtime fields:

  • deferred_checkpoints
  • checkpoint_pending_ops
  • checkpoint_policy_tx_threshold
  • checkpoint_policy_wal_bytes_threshold
  • checkpoint_policy_interval_ms

WAL observability:

  • wal_file_size_bytes

Runtime Configuration

Runtime options are documented in Runtime Configuration.

DML (Data Manipulation Language)

ANALYZE TABLE

Refreshes persisted planner statistics.

ANALYZE TABLE t;

Current persisted stats include:

  • table row count
  • index distinct-key count

INSERT

INSERT INTO t (id, name) VALUES (1, 'Alice');

-- Multi-row insert
INSERT INTO t (id, name) VALUES (1, 'Alice'), (2, 'Bob');

INSERT … ON DUPLICATE KEY UPDATE

If a row with the same PRIMARY KEY already exists, updates the existing row instead of inserting a new one.

INSERT INTO t (id, name) VALUES (1, 'Alice')
  ON DUPLICATE KEY UPDATE name = 'Alice Updated';

-- Expressions can reference existing column values
INSERT INTO counters (id, cnt) VALUES (1, 1)
  ON DUPLICATE KEY UPDATE cnt = cnt + 1;

Affected rows (MySQL-compatible):

  • New row inserted: 1
  • Existing row updated: 2

Limitations:

  • VALUES() function is not supported. Use column references to access the existing row’s values.
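The counters example above can be modeled as a small in-memory upsert. This sketch is not MuroDB code; it only mirrors the MySQL-compatible affected-row counts:

```python
def upsert(table: dict, key, update_fn, new_row):
    """Insert-or-update with MySQL-compatible affected-row counts."""
    if key in table:
        table[key] = update_fn(table[key])  # existing row updated -> 2
        return 2
    table[key] = new_row                    # new row inserted -> 1
    return 1

counters = {}
# First statement inserts (affected rows: 1), second updates (affected rows: 2).
print(upsert(counters, 1, lambda row: {"cnt": row["cnt"] + 1}, {"cnt": 1}))  # 1
print(upsert(counters, 1, lambda row: {"cnt": row["cnt"] + 1}, {"cnt": 1}))  # 2
print(counters[1]["cnt"])  # 2
```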

REPLACE INTO

Inserts a new row. If a row with the same PRIMARY KEY or UNIQUE index value already exists, deletes the old row first, then inserts the new one.

REPLACE INTO t (id, name) VALUES (1, 'Alice');

-- Multi-row replace
REPLACE INTO t (id, name) VALUES (1, 'Alice'), (2, 'Bob');

Unlike INSERT ... ON DUPLICATE KEY UPDATE, REPLACE deletes and re-inserts the entire row. This means:

  • All columns are replaced with the new values (columns not specified get defaults/NULL).
  • Conflicts on any UNIQUE index (not just PRIMARY KEY) also trigger deletion of the conflicting row.

SELECT

SELECT * FROM t WHERE id = 42 ORDER BY id DESC LIMIT 10;

-- With OFFSET
SELECT * FROM t LIMIT 10 OFFSET 20;

UPDATE

UPDATE t SET name = 'Alicia' WHERE id = 1;

DELETE

DELETE FROM t WHERE id = 1;

Index Hints (FORCE INDEX / USE INDEX / IGNORE INDEX)

MySQL-compatible index hints allow controlling which indexes the query planner considers.

-- Force the planner to use only the specified index (skips PK seek)
SELECT * FROM t FORCE INDEX (idx_age) WHERE age = 20;

-- Suggest the planner to use the specified index (PK seek still allowed)
SELECT * FROM t USE INDEX (idx_age) WHERE age = 20;

-- Exclude the specified index from consideration
SELECT * FROM t IGNORE INDEX (idx_age) WHERE age = 20;

-- Multiple index names
SELECT * FROM t FORCE INDEX (idx_age, idx_name) WHERE age = 20;

-- Also works with UPDATE and DELETE
UPDATE t FORCE INDEX (idx_age) SET name = 'updated' WHERE age = 20;
DELETE FROM t IGNORE INDEX (idx_age) WHERE age = 20;

Behavior:

  • FORCE INDEX: Only the specified indexes are candidates. Primary key seek is skipped. If the specified index cannot be used for the query, falls back to full table scan (matching MySQL behavior).
  • USE INDEX: The specified indexes are preferred, but full table scan is also a candidate. Primary key seek is still allowed.
  • IGNORE INDEX: The specified indexes are excluded from consideration. All other indexes and primary key seek remain available.

WHERE Clause

Comparison operators

WHERE id = 1
WHERE id != 1
WHERE id < 10
WHERE id > 5
WHERE id <= 10
WHERE id >= 5

Logical operators

WHERE id > 1 AND name = 'Alice'
WHERE id = 1 OR id = 2
WHERE NOT (id = 1)

LIKE / NOT LIKE

WHERE name LIKE 'Ali%'
WHERE name LIKE '_ob'
WHERE name NOT LIKE '%test%'

IN

WHERE id IN (1, 2, 3)

BETWEEN

WHERE id BETWEEN 1 AND 10

IS NULL / IS NOT NULL

WHERE name IS NULL
WHERE name IS NOT NULL

ORDER BY / LIMIT

SELECT * FROM t ORDER BY id ASC;
SELECT * FROM t ORDER BY name DESC, id ASC;
SELECT * FROM t LIMIT 10;
SELECT * FROM t LIMIT 10 OFFSET 5;

Literals

Hex Literal (Binary)

Binary data can be specified using the X'...' syntax (SQL standard / MySQL compatible):

-- Insert binary data
INSERT INTO t (id, data) VALUES (1, X'DEADBEEF');

-- Empty binary literal
INSERT INTO t (id, data) VALUES (2, X'');

-- Case-insensitive (both X and x are accepted)
INSERT INTO t (id, data) VALUES (3, x'cafebabe');

-- Use in WHERE clause
SELECT * FROM t WHERE data = X'DEADBEEF';

The hex string must contain an even number of hex digits (0-9, A-F, a-f). Odd-length hex strings and invalid characters produce a parse error.
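These rules match what a stock hex decoder enforces; for example, Python's bytes.fromhex is case-insensitive and rejects odd-length input. The parse_hex_literal wrapper below is illustrative, not MuroDB's parser:

```python
def parse_hex_literal(digits: str) -> bytes:
    """Decode the payload of an X'...' literal; odd length is an error."""
    # bytes.fromhex is case-insensitive and raises ValueError on odd-length
    # or non-hex input, mirroring the rules described above.
    return bytes.fromhex(digits)

print(parse_hex_literal("DEADBEEF"))        # b'\xde\xad\xbe\xef'
print(parse_hex_literal("cafebabe").hex())  # cafebabe
print(parse_hex_literal(""))                # b''  (empty literal X'')
```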

Expressions

Arithmetic operators

SELECT id, price * quantity AS total FROM orders;
SELECT id, (a + b) / 2 AS average FROM t;
-- Supported: +, -, *, /, %

Built-in Functions

String Functions

LENGTH(s)

Returns the byte length of a string.

SELECT LENGTH('hello');       -- 5
SELECT LENGTH('héllo');       -- 6 (é is 2 bytes in UTF-8)

CHAR_LENGTH(s) / CHARACTER_LENGTH(s)

Returns the character count of a string.

SELECT CHAR_LENGTH('hello');  -- 5
SELECT CHAR_LENGTH('héllo');  -- 5

CONCAT(s1, s2, …)

Concatenates two or more strings. Returns NULL if any argument is NULL.

SELECT CONCAT('hello', ' ', 'world');  -- 'hello world'
SELECT CONCAT('a', NULL);              -- NULL

SUBSTRING(s, pos [, len]) / SUBSTR(s, pos [, len])

Returns a substring starting at position pos (1-based). Optional len limits the length.

SELECT SUBSTRING('hello world', 7);     -- 'world'
SELECT SUBSTRING('hello world', 1, 5);  -- 'hello'
SELECT SUBSTRING('hello', -3);          -- 'llo'

UPPER(s) / LOWER(s)

Converts a string to upper or lower case.

SELECT UPPER('hello');  -- 'HELLO'
SELECT LOWER('HELLO');  -- 'hello'

TRIM(s) / LTRIM(s) / RTRIM(s)

Removes whitespace from both ends, the left end, or the right end.

SELECT TRIM('  hello  ');   -- 'hello'
SELECT LTRIM('  hello  ');  -- 'hello  '
SELECT RTRIM('  hello  ');  -- '  hello'

REPLACE(s, from, to)

Replaces all occurrences of from with to in s.

SELECT REPLACE('hello world', 'world', 'rust');  -- 'hello rust'

REVERSE(s)

Reverses a string.

SELECT REVERSE('hello');  -- 'olleh'

REPEAT(s, n)

Repeats a string n times.

SELECT REPEAT('ab', 3);  -- 'ababab'

LEFT(s, n) / RIGHT(s, n)

Returns the leftmost or rightmost n characters.

SELECT LEFT('hello', 3);   -- 'hel'
SELECT RIGHT('hello', 3);  -- 'llo'

LPAD(s, len, pad) / RPAD(s, len, pad)

Pads a string to length len using pad characters on the left or right.

SELECT LPAD('hi', 5, '*');  -- '***hi'
SELECT RPAD('hi', 5, '*');  -- 'hi***'

INSTR(s, sub)

Returns the position (1-based) of the first occurrence of sub in s. Returns 0 if not found.

SELECT INSTR('hello world', 'world');  -- 7
SELECT INSTR('hello', 'xyz');          -- 0

LOCATE(sub, s [, pos])

Returns the position (1-based) of sub in s, starting the search at pos.

SELECT LOCATE('hello', 'hello hello');     -- 1
SELECT LOCATE('hello', 'hello hello', 2);  -- 7

REGEXP

REGEXP / REGEXP_LIKE(s, pattern)

Tests whether a string matches a regular expression. Can be used as an operator or function.

-- Operator syntax
SELECT * FROM t WHERE name REGEXP '[0-9]+';

-- Function syntax
SELECT REGEXP_LIKE(name, '^hello') FROM t;

Numeric Functions

All numeric functions support INTEGER, FLOAT, DOUBLE, and DECIMAL types.

ABS(n)

Returns the absolute value.

SELECT ABS(-42);    -- 42
SELECT ABS(-3.14);  -- 3.14 (DECIMAL)

CEIL(n) / CEILING(n) / FLOOR(n)

Returns the ceiling or floor. (Identity for integer types.)

SELECT CEIL(3.14);   -- 4
SELECT FLOOR(3.14);  -- 3

ROUND(n [, decimals])

Rounds a number to decimals decimal places (default 0). Works with DECIMAL for exact rounding.

SELECT ROUND(3.1459, 2);  -- 3.15 (DECIMAL)
SELECT ROUND(42);         -- 42

MOD(a, b)

Returns the modulo (same as % operator).

SELECT MOD(10, 3);  -- 1

POWER(base, exp) / POW(base, exp)

Returns base raised to the power of exp.

SELECT POWER(2, 10);  -- 1024

UUID Functions

UUID_V4()

Generates a random UUID (version 4, RFC 9562).

SELECT UUID_V4();  -- e.g. '550e8400-e29b-41d4-a716-446655440000'

UUID_V7()

Generates a time-ordered UUID (version 7, RFC 9562). UUIDs generated later sort after earlier ones, making them suitable for primary keys with time-ordered insertion.

CREATE TABLE events (id UUID PRIMARY KEY, data VARCHAR);
INSERT INTO events VALUES (UUID_V7(), 'event data');

UUID values are displayed as lowercase hyphenated hex strings (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx). String literals in UUID format (with or without hyphens) are automatically parsed when inserted into UUID columns.

-- Both forms are accepted:
INSERT INTO t VALUES ('550e8400-e29b-41d4-a716-446655440000', 'with hyphens');
INSERT INTO t VALUES ('550e8400e29b41d4a716446655440000', 'without hyphens');

-- Cast between UUID and VARCHAR/VARBINARY:
SELECT CAST(id AS VARCHAR) FROM t;
SELECT CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID);
SELECT CAST(id AS VARBINARY) FROM t;  -- 16-byte binary
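To see why UUIDv7 values sort by creation time, here is a rough sketch of the RFC 9562 version-7 layout: a 48-bit millisecond timestamp in the top bits, then version and variant bits over random data. This is an illustration of the layout, not MuroDB's generator:

```python
import os
import time

def uuid_v7() -> str:
    """Sketch of RFC 9562 UUIDv7: 48-bit ms timestamp, then random bits."""
    ms = time.time_ns() // 1_000_000
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70  # version nibble = 7
    b[8] = (b[8] & 0x3F) | 0x80  # RFC variant bits = 10
    h = b.hex()
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

a = uuid_v7()
time.sleep(0.002)  # a later millisecond yields a lexicographically larger UUID
b2 = uuid_v7()
assert a < b2
```

Because the timestamp occupies the most significant bits, comparing the hyphenated lowercase strings is equivalent to comparing creation times at millisecond granularity.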

Date/Time Functions

NOW() / CURRENT_TIMESTAMP[()]

Returns the current UTC datetime as a DATETIME value.

SELECT NOW();
SELECT CURRENT_TIMESTAMP();
SELECT CURRENT_TIMESTAMP; -- parentheses are optional

DATE_FORMAT(dt, format)

Formats a date/datetime/timestamp value (or a string in one of those formats) using MySQL-style format specifiers.

SELECT DATE_FORMAT('2026-02-22 13:04:05', '%Y/%m/%d %H:%i:%s');
-- '2026/02/22 13:04:05'

Common specifiers:

  • %Y year (4 digits), %y year (2 digits)
  • %m month (01-12), %c month (1-12), %M month name, %b month abbreviation
  • %d day (01-31), %e day (1-31)
  • %H hour (00-23), %h/%I hour (01-12), %i minute, %s second
  • %W weekday name, %a weekday abbreviation
  • %T HH:MM:SS, %r 12-hour time with AM/PM, %% literal percent
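Several of these specifiers map directly onto C strftime. The small translation sketch below (a subset only; the mapping table is illustrative, not MuroDB's) reproduces the example above:

```python
from datetime import datetime

# MySQL specifier -> strftime specifier (subset; illustrative only)
SPEC = {"Y": "%Y", "y": "%y", "m": "%m", "d": "%d",
        "H": "%H", "i": "%M", "s": "%S", "T": "%H:%M:%S", "%": "%%"}

def date_format(dt: datetime, fmt: str) -> str:
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            out.append(SPEC.get(fmt[i + 1], fmt[i + 1]))  # unknown: literal
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return dt.strftime("".join(out))

print(date_format(datetime(2026, 2, 22, 13, 4, 5), "%Y/%m/%d %H:%i:%s"))
# 2026/02/22 13:04:05
```

Note the one real trap: MySQL's %i is the minute and %s the second, while strftime uses %M and %S for the same fields.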

NULL Handling & Conditional

COALESCE(a, b, …)

Returns the first non-NULL argument.

SELECT COALESCE(NULL, NULL, 'fallback');  -- 'fallback'

IFNULL(a, b)

Returns a if not NULL, otherwise b.

SELECT IFNULL(NULL, 'default');  -- 'default'
SELECT IFNULL('value', 'default');  -- 'value'

NULLIF(a, b)

Returns NULL if a = b, otherwise returns a.

SELECT NULLIF(0, 0);  -- NULL
SELECT NULLIF(5, 0);  -- 5

IF(cond, then, else)

Returns then if cond is truthy, otherwise else.

SELECT IF(1, 'yes', 'no');  -- 'yes'
SELECT IF(0, 'yes', 'no');  -- 'no'

CASE WHEN

-- Searched CASE
SELECT CASE
  WHEN val < 10 THEN 'low'
  WHEN val < 20 THEN 'mid'
  ELSE 'high'
END FROM t;

-- Simple CASE
SELECT CASE status
  WHEN 1 THEN 'active'
  WHEN 2 THEN 'inactive'
  ELSE 'unknown'
END FROM t;

CAST

Converts a value to a different data type.

SELECT CAST('42' AS INT);      -- 42
SELECT CAST(42 AS VARCHAR);    -- '42'
SELECT CAST(val AS BIGINT) FROM t;

Supported target types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL(p,s), DATE, DATETIME, TIMESTAMP, VARCHAR, TEXT, JSONB, VARBINARY.

JSON Functions

JSONB accepts valid JSON only. Values are canonicalized when stored.

CREATE TABLE docs (id BIGINT PRIMARY KEY, doc JSONB);
INSERT INTO docs VALUES (1, '{"a":{"b":[1,2,3]}}');

JSON_EXTRACT(json, path)

Evaluates path using jsonpath_lib JSONPath semantics and returns matched JSON (canonical text). If multiple values match, returns a JSON array.

SELECT JSON_EXTRACT('{"a":{"b":1}}', '$.a.b'); -- "1"

JSON_SET(json, path, value)

Sets value at path and returns updated JSON. Supported update-path syntax is root-based dot/index form ($.key, $.arr[0], chained combinations).

SELECT JSON_SET('{"a":1}', '$.b', 2); -- {"a":1,"b":2}

JSON_REMOVE(json, path)

Removes value at path and returns updated JSON. Missing path is a no-op.

SELECT JSON_REMOVE('{"a":1,"b":2}', '$.a'); -- {"b":2}

JSON_TYPE(json)

Returns one of: NULL, BOOLEAN, INTEGER, DOUBLE, STRING, ARRAY, OBJECT.

SELECT JSON_TYPE('[1,2,3]'); -- ARRAY

JSON_CONTAINS(json, value_or_path)

  • If second argument starts with $, it is evaluated as JSONPath (via jsonpath_lib); returns 1 when any match exists.
  • Otherwise it is treated as a JSON candidate value and checked for containment.

SELECT JSON_CONTAINS('{"a":{"b":1}}', '$.a.b'); -- 1
SELECT JSON_CONTAINS('{"a":1,"b":2}', '{"b":2}'); -- 1

JSON function behavior:

  • If any argument is NULL, returns NULL.
  • Invalid JSON input returns an error.
  • Invalid/unsupported update-path syntax in JSON_SET/JSON_REMOVE returns an error.

Aggregation & GROUP BY

Aggregate Functions

SELECT COUNT(*) FROM t;              -- count all rows
SELECT COUNT(col) FROM t;            -- count non-NULL values
SELECT COUNT(DISTINCT col) FROM t;   -- count distinct non-NULL values
SELECT SUM(amount) FROM orders;      -- sum (skips NULLs)
SELECT AVG(amount) FROM orders;      -- average (integer for integer inputs, float otherwise)
SELECT MIN(amount) FROM orders;      -- minimum (skips NULLs)
SELECT MAX(amount) FROM orders;      -- maximum (skips NULLs)

NULL semantics (SQL standard):

  • COUNT(*) counts all rows including NULLs
  • COUNT(col) counts non-NULL values only
  • SUM, AVG, MIN, MAX skip NULLs; return NULL if all values are NULL
  • On empty tables: COUNT returns 0, others return NULL
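These NULL rules can be modeled over a plain list, with None standing in for NULL (an illustrative sketch, not MuroDB code):

```python
def aggregate(values):
    """SQL-style aggregates over one column, modeling NULL as None."""
    non_null = [v for v in values if v is not None]
    return {
        "count_star": len(values),                   # COUNT(*) counts NULLs too
        "count_col": len(non_null),                  # COUNT(col) skips NULLs
        "sum": sum(non_null) if non_null else None,  # all-NULL -> NULL
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

print(aggregate([10, None, 30]))
# {'count_star': 3, 'count_col': 2, 'sum': 40, 'min': 10, 'max': 30}
print(aggregate([])["count_star"])  # 0  (COUNT on an empty table)
```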

GROUP BY

SELECT category, COUNT(*) AS cnt FROM orders GROUP BY category;
SELECT category, status, SUM(amount) FROM orders GROUP BY category, status;

NULLs in GROUP BY columns form their own group.

HAVING

Filters groups after aggregation (use WHERE to filter rows before aggregation).

SELECT category, COUNT(*) AS cnt
FROM orders
GROUP BY category
HAVING COUNT(*) > 2;

SELECT DISTINCT

SELECT DISTINCT category FROM orders;
SELECT DISTINCT category, status FROM orders;

Subqueries

Uncorrelated subqueries are supported in WHERE clauses and SELECT lists.

IN / NOT IN (SELECT …)

-- Find users who have placed orders
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders);

-- Find users who have NOT placed orders
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM orders);

EXISTS / NOT EXISTS

-- Check if any orders exist (uncorrelated)
SELECT * FROM users WHERE EXISTS (SELECT id FROM orders);

-- Check if no orders exist
SELECT * FROM users WHERE NOT EXISTS (SELECT id FROM orders WHERE amount > 1000);

Scalar Subqueries

A scalar subquery returns exactly one column and at most one row. If it returns zero rows, the result is NULL. If it returns more than one row, an error is raised.

-- Scalar subquery in SELECT list
SELECT id, (SELECT MAX(amount) FROM orders) AS max_order FROM users;

-- Scalar subquery in WHERE
SELECT * FROM users WHERE id = (SELECT MIN(user_id) FROM orders);

Nested Subqueries

Subqueries can be nested:

SELECT * FROM t1 WHERE id IN (
  SELECT id FROM t2 WHERE EXISTS (SELECT id FROM t3)
);

Limitations:

  • Only uncorrelated subqueries (no outer row references).
  • Subqueries are pre-materialized once per query (not per row).

JOIN

-- INNER JOIN
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.t1_id;

-- LEFT JOIN
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.t1_id;

-- RIGHT JOIN
SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.t1_id;

-- CROSS JOIN
SELECT * FROM t1 CROSS JOIN t2;

-- Table aliases
SELECT a.id, b.name FROM t1 AS a JOIN t2 AS b ON a.id = b.t1_id;

UNION / UNION ALL

Combines results from multiple SELECT statements.

-- UNION (removes duplicates)
SELECT id, name FROM t1 UNION SELECT id, name FROM t2;

-- UNION ALL (keeps duplicates)
SELECT id, name FROM t1 UNION ALL SELECT id, name FROM t2;

-- With ORDER BY and LIMIT (applies to the whole result)
SELECT id FROM t1 UNION SELECT id FROM t2 ORDER BY id LIMIT 10;

All SELECT statements in a UNION must return the same number of columns.

EXPLAIN

Shows the optimizer’s chosen access path and cardinality/cost estimates for a statement.

EXPLAIN SELECT * FROM t WHERE id = 1;
EXPLAIN UPDATE t SET name = 'Alicia' WHERE id = 1;
EXPLAIN DELETE FROM t WHERE id = 1;

Output Columns

Column       Description
id           Always 1 (single plan row output)
select_type  SIMPLE, UPDATE, or DELETE
table        Base table name
type         Access type: const (PK lookup), ref (index lookup), range (index range seek), ALL (full scan), fulltext (FTS)
key          Index used (NULL for full scan)
rows         Estimated candidate rows for the chosen access path
cost         Heuristic cost of the chosen plan
Extra        Additional diagnostics (Using where, Using index, Using fulltext, JOIN loop notes, etc.)

Access Type Meanings

  • const: primary-key equality lookup (WHERE pk = ...).
  • ref: secondary index equality lookup.
  • range: index range scan (single/composite range shape).
  • ALL: full table scan.
  • fulltext: FULLTEXT index path.

How rows Is Estimated

  • If table/index stats are present (ANALYZE TABLE), EXPLAIN uses:
    • table row stats,
    • index distinct-key stats,
    • numeric min/max bounds,
    • numeric histograms (single-column numeric B-tree indexes).
  • If stats are missing, EXPLAIN falls back to conservative heuristics (or table row scan fallback where applicable).

How cost Is Estimated

  • cost is a deterministic heuristic score used for plan comparison.
  • Lower is better.
  • It includes access-path cost and, for JOIN planning diagnostics, nested-loop alternative comparison cost.
  • Compare cost values primarily within the same query shape.

JOIN Diagnostics in Extra

For EXPLAIN SELECT ... JOIN ..., Extra can include join-loop notes such as:

Join loops: j1=right_outer (L=20,R=3,cL=620,cR=603)

  • j1 = first JOIN step.
  • left_outer / right_outer = chosen outer loop side.
  • L / R = estimated left/right input rows at that step.
  • cL / cR = compared heuristic costs for each outer-loop alternative.

Practical Workflow

-- 1) inspect plan
EXPLAIN SELECT * FROM t WHERE a >= 100 AND a <= 110;

-- 2) refresh stats after major data changes
ANALYZE TABLE t;

-- 3) inspect again (rows/cost should better reflect current data)
EXPLAIN SELECT * FROM t WHERE a >= 100 AND a <= 110;

Current Scope and Limits

Limitations:

  • Supported targets are SELECT, UPDATE, and DELETE.
  • Output is currently a single-row summary (not a full operator tree).
  • JOIN/subquery internals are summarized in Extra rather than emitted as multiple plan rows.

Rekey (Password Rotation)

Password rotation is not available as SQL syntax. Use the Rust API or the dedicated CLI command instead:

  • Rust API: Database::rekey_with_password("new_password")
  • CLI: murodb-rekey <db-file>

Transactions

BEGIN;
INSERT INTO t (id, name) VALUES (1, 'Alice');
INSERT INTO t (id, name) VALUES (2, 'Bob');
COMMIT;

-- Or rollback
BEGIN;
INSERT INTO t (id, name) VALUES (3, 'Charlie');
ROLLBACK;

-- Savepoints (MySQL-compatible)
BEGIN;
INSERT INTO t (id, name) VALUES (10, 'step1');
SAVEPOINT sp1;
INSERT INTO t (id, name) VALUES (11, 'step2');
ROLLBACK TO SAVEPOINT sp1; -- keeps step1, reverts step2
RELEASE SAVEPOINT sp1;
COMMIT;

Savepoint notes:

  • SAVEPOINT <name>, ROLLBACK TO [SAVEPOINT] <name>, RELEASE SAVEPOINT <name> are valid only inside an active transaction.
  • ROLLBACK TO keeps the transaction active and discards savepoints created after the target.
  • Reusing the same savepoint name overwrites the previous one (MySQL behavior).
  • COMMIT and full ROLLBACK clear all savepoints.

Rust API note:

  • Database::query() accepts read-only SQL only.
  • Database::query() takes &mut self because read execution may refresh pager/catalog state from disk before running.
  • For concurrent reads in one process, use multiple read-only handles (for example Database::open_reader()).
  • Inside an explicit transaction (BEGIN ... COMMIT/ROLLBACK), run statements through Database::execute(), including SELECT.
  • Database::set_busy_timeout_ms(ms) sets lock wait timeout (0 = wait indefinitely).
  • DatabaseReader::set_busy_timeout_ms(ms) does the same for read-only handles.
  • Database::cancel_handle() / DatabaseReader::cancel_handle() returns a QueryCancelHandle.
  • QueryCancelHandle::cancel() returns true when a statement is currently in flight, otherwise false.
  • Cancellation errors are reported as MuroError::Cancelled.
  • Database::set_statement_timeout_ms(ms) and DatabaseReader::set_statement_timeout_ms(ms) set per-statement execution timeout (0 = no timeout).
  • Timeout errors are reported as MuroError::StatementTimeout { timeout_ms }.
  • Cancellation safety for explicit transactions: cancellation checks in write paths are performed before row-application phases, so a cancelled statement does not commit partial row changes.

Hidden _rowid

Tables without an explicit PRIMARY KEY automatically get a hidden _rowid column with auto-generated values.

Runtime Configuration

MuroDB supports session-scoped runtime options for operational tuning. This page explains what each option does, when to use it, and how to observe it.

Scope and Behavior

  • Scope: session-only
  • Persistence: not persisted in the database file
  • Update timing: immediate for subsequent operations in the same session
  • Transaction rule: runtime SET is rejected inside explicit transactions (BEGIN ... COMMIT/ROLLBACK)

You can set runtime options with SQL:

SET checkpoint_tx_threshold = 8;
SET checkpoint_wal_bytes_threshold = 1048576;
SET checkpoint_interval_ms = 1000;

Or with Rust API:

#![allow(unused)]
fn main() {
use murodb::{Database, sql::session::RuntimeConfig};

let mut db = Database::open_plaintext("mydb.db".as_ref())?;
db.set_runtime_config(RuntimeConfig {
    checkpoint_tx_threshold: 8,
    checkpoint_wal_bytes_threshold: 1_048_576,
    checkpoint_interval_ms: 1_000,
})?;
let active = db.runtime_config()?;
}

Available Options

checkpoint_tx_threshold

  • SQL name: checkpoint_tx_threshold
  • Env default source: MURODB_CHECKPOINT_TX_THRESHOLD
  • Default value: 1
  • Type/range: u64 (0 or greater)

Meaning:

  • Trigger checkpoint after this many post-commit/post-rollback operations.
  • 1 means checkpoint every commit/rollback.
  • 0 disables the tx-count trigger.

Use when:

  • You want to trade lower commit overhead for larger WAL windows.

checkpoint_wal_bytes_threshold

  • SQL name: checkpoint_wal_bytes_threshold
  • Env default source: MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD
  • Default value: 0 (disabled)
  • Type/range: u64 (0 or greater)

Meaning:

  • Trigger checkpoint when WAL file size reaches this threshold in bytes.
  • 0 disables the size trigger.

Use when:

  • You want an upper guardrail on WAL growth.

checkpoint_interval_ms

  • SQL name: checkpoint_interval_ms
  • Env default source: MURODB_CHECKPOINT_INTERVAL_MS
  • Default value: 0 (disabled)
  • Type/range: u64 (0 or greater)

Meaning:

  • Trigger checkpoint when elapsed time since the last successful checkpoint reaches this threshold.
  • 0 disables the time trigger.

Use when:

  • Workload has bursts and you want a time-based checkpoint cadence.

Trigger Combination Rules

Checkpoint runs when any enabled trigger fires:

  • tx-count trigger (checkpoint_tx_threshold)
  • WAL-size trigger (checkpoint_wal_bytes_threshold)
  • interval trigger (checkpoint_interval_ms)
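The "any enabled trigger fires" rule can be sketched as a small predicate. This is a hypothetical illustration — the struct and function names below are not MuroDB's actual internals — but it captures the documented semantics, including the "0 disables this trigger" convention:

```rust
// Hypothetical sketch of the trigger combination rule; field names mirror
// the runtime options, but this is not MuroDB's actual implementation.
struct CheckpointPolicy {
    tx_threshold: u64,        // 0 disables the tx-count trigger
    wal_bytes_threshold: u64, // 0 disables the WAL-size trigger
    interval_ms: u64,         // 0 disables the interval trigger
}

fn should_checkpoint(p: &CheckpointPolicy, pending_ops: u64, wal_bytes: u64, elapsed_ms: u64) -> bool {
    // Checkpoint runs when ANY enabled trigger fires.
    (p.tx_threshold != 0 && pending_ops >= p.tx_threshold)
        || (p.wal_bytes_threshold != 0 && wal_bytes >= p.wal_bytes_threshold)
        || (p.interval_ms != 0 && elapsed_ms >= p.interval_ms)
}

fn main() {
    let p = CheckpointPolicy { tx_threshold: 8, wal_bytes_threshold: 1_048_576, interval_ms: 0 };
    assert!(!should_checkpoint(&p, 7, 100, 0));      // no trigger reached yet
    assert!(should_checkpoint(&p, 8, 100, 0));       // tx-count trigger fires
    assert!(should_checkpoint(&p, 1, 1_048_576, 0)); // WAL-size trigger fires
    println!("ok");
}
```

Note that with all three options set to 0, no trigger is enabled and checkpoints are fully deferred.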

Validation and Errors

  • Runtime values must be non-negative integers.
  • Unknown option names return a deterministic parse error.
  • Using runtime SET inside explicit transactions returns an execution error.

Observability

Use:

SHOW DATABASE STATS;

Relevant fields:

  • checkpoint_policy_tx_threshold
  • checkpoint_policy_wal_bytes_threshold
  • checkpoint_policy_interval_ms
  • deferred_checkpoints
  • checkpoint_pending_ops
  • failed_checkpoints
  • wal_file_size_bytes

See also: Checkpoint Policy Tuning.

Benchmarks

This page tracks MuroDB performance over time for a fixed embedded-DB style workload mix. Each entry is tied to a concrete version (git commit) so changes can be compared historically.

Benchmark Scope

Current benchmark runner: src/bin/murodb_bench.rs

Workloads:

  • point_select_pk: point lookup by primary key (SELECT ... WHERE id = ?)
  • point_update_pk: point update by primary key (UPDATE ... WHERE id = ?)
  • insert_autocommit: single-row insert per transaction (auto-commit)
  • range_scan_limit_100: range read (WHERE id >= ? ORDER BY id LIMIT 100)
  • mixed_80r_15u_5i: mixed OLTP-like workload (80% read / 15% update / 5% insert)
  • fts_select_natural: fulltext natural-language search (MATCH(body) AGAINST(... IN NATURAL LANGUAGE MODE))
  • fts_update_point: point update on FTS-indexed TEXT column
  • fts_mixed_70q_30u: FTS-focused mixed workload (70% search / 30% update)

Additional microbenchmark:

  • murodb_snippet_bench: compares legacy vs current fts_snippet() implementation on representative text sizes and reports approximate offset-map memory bytes.

Default dataset/config:

  • initial rows: 20,000
  • fts initial rows: 256
  • select ops: 20,000
  • update ops: 5,000
  • insert ops: 5,000
  • scan ops: 2,000
  • mixed ops: 10,000
  • fts select ops: 5,000
  • fts update ops: 2,000
  • fts mixed ops: 5,000
  • warmup ops: 200
  • batch size (initial load): 500

Run command:

cargo run --release --bin murodb_bench

Migration note: as of 2026-02-22, benchmark scope includes additional FTS workloads. Older entries without FTS metrics are not directly comparable on total runtime.

Versioned Results

| Date (UTC) | Commit | Environment | Notes |
|---|---|---|---|
| 2026-02-22 | a78694537f59 | local dev machine | first baseline |
| 2026-02-22 | 829ad18145c2 | local dev machine | after secondary-index root persistence fix |
| 2026-02-22 | 5c422b8b | local dev machine | added snippet microbenchmark and UTF-8 offset-map optimization |

2026-02-22 / a78694537f59

Raw output summary:

| Workload | Ops | Total sec | Ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|
| point_select_pk | 20,000 | 0.144532 | 138,377.80 | 0.0082 | 0.0096 | 0.0108 |
| point_update_pk | 5,000 | 27.098314 | 184.51 | 5.2210 | 6.9286 | 8.9318 |
| insert_autocommit | 5,000 | 8.785356 | 569.13 | 1.5480 | 2.4816 | 5.7406 |
| range_scan_limit_100 | 2,000 | 20.240664 | 98.81 | 9.6326 | 13.5310 | 13.9811 |
| mixed_80r_15u_5i | 10,000 | 10.417702 | 959.90 | 0.0112 | 6.2421 | 6.7669 |

Row counts:

  • start: 20,000
  • after insert phase: 25,000
  • final: 25,519

2026-02-22 / 829ad18145c2

Raw output summary:

| Workload | Ops | Total sec | Ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|
| point_select_pk | 20,000 | 0.123495 | 161,949.51 | 0.0072 | 0.0076 | 0.0087 |
| point_update_pk | 5,000 | 8.146347 | 613.77 | 1.4856 | 2.0578 | 5.4681 |
| insert_autocommit | 5,000 | 8.675668 | 576.32 | 1.5077 | 2.7074 | 5.8033 |
| range_scan_limit_100 | 2,000 | 18.577489 | 107.66 | 9.2207 | 12.1729 | 12.5408 |
| mixed_80r_15u_5i | 10,000 | 3.626337 | 2,757.60 | 0.0098 | 1.7256 | 2.4382 |
| fts_select_natural | 5,000 | 2.797700 | 1,787.18 | 0.5595 | 0.6087 | 0.6366 |
| fts_update_point | 2,000 | 11.462999 | 174.47 | 5.4321 | 7.0625 | 9.7803 |
| fts_mixed_70q_30u | 5,000 | 10.865190 | 460.19 | 0.6296 | 6.3375 | 7.4033 |

Row counts:

  • start: 20,000
  • after insert phase: 25,000
  • final: 25,519

2026-02-22 / 5c422b8b (Snippet Microbenchmark)

Command:

cargo run --release --bin murodb_snippet_bench

Raw output summary:

| Case | Iters | Legacy p50 (us) | New p50 (us) | p50 Speedup | Approx offset-map bytes |
|---|---|---|---|---|---|
| snippet_short_tail_hit | 2,000 | 14.10 | 15.09 | 0.93x | 4,936 |
| snippet_medium_tail_hit | 2,000 | 126.01 | 127.93 | 0.99x | 48,136 |
| snippet_long_tail_hit | 500 | 1245.52 | 1228.43 | 1.01x | 480,136 |

Notes:

  • Short/medium cases are near parity.
  • Long tail-hit case shows small but measurable p50 reduction.
  • Offset-map memory is linear in normalized char count: (chars + 1) * sizeof(usize).
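The memory formula above can be checked against the table's "Approx offset-map bytes" column. The sketch below assumes a 64-bit target (sizeof(usize) = 8) and back-derives the character counts from the reported byte values; those counts are illustrative inputs, not published benchmark parameters:

```rust
// Offset-map memory per the formula above: (chars + 1) * sizeof(usize).
fn offset_map_bytes(normalized_chars: usize) -> usize {
    (normalized_chars + 1) * std::mem::size_of::<usize>()
}

fn main() {
    // On a 64-bit target the table values correspond to these char counts:
    assert_eq!(offset_map_bytes(616), 4_936);      // snippet_short_tail_hit
    assert_eq!(offset_map_bytes(6_016), 48_136);   // snippet_medium_tail_hit
    assert_eq!(offset_map_bytes(60_016), 480_136); // snippet_long_tail_hit
    println!("ok");
}
```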

2026-02-22 / Stop-ngram behavior example

Reference scenario (same as SQL integration tests):

| Setting | Query | Expected behavior |
|---|---|---|
| stop_filter=off | MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE) | broader recall (東京* docs can match) |
| stop_filter=on, stop_df_ratio_ppm=500000 | same query | higher precision (mostly exact-intent doc remains) |

Adding New Entries

When updating this page for a new version:

  1. Run cargo run --release --bin murodb_bench.
  2. Record git rev-parse --short=12 HEAD.
  3. Append one row to the “Versioned Results” table.
  4. Add a new subsection with the raw metrics table for that commit.

Keep benchmark parameters constant unless intentionally changing the benchmark definition. If benchmark definitions change, include a short migration note in the new entry.

Full-Text Search

MuroDB provides MySQL-compatible full-text search with bigram tokenization.

Creating a fulltext index

CREATE FULLTEXT INDEX t_body_fts ON t(body)
  WITH PARSER ngram
  OPTIONS (n=2, normalize='nfkc', stop_filter=off, stop_df_ratio_ppm=200000);

WITH PARSER / OPTIONS syntax is available so parser variants can be expanded in future releases.

FTS uses an internal doc_id mapping, so it works with non-BIGINT primary keys too. If a table has no explicit primary key, MuroDB’s hidden _rowid is used.

Supported options:

  • n: ngram size (2 only for now)
  • normalize: normalization mode ('nfkc' only for now)
  • stop_filter: on/off (or 1/0, 'true'/'false')
  • stop_df_ratio_ppm: document-frequency threshold in ppm (0..=1000000)

stop_filter applies to NATURAL LANGUAGE MODE only. BOOLEAN MODE behavior is unchanged.
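To build intuition for what n=2 tokenization produces, here is a minimal character-bigram sketch. It omits the NFKC normalization step that MuroDB applies first, and it is an illustration rather than MuroDB's tokenizer code:

```rust
// Minimal character-bigram (n=2) sketch; NFKC normalization omitted.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // Each overlapping window of 2 characters becomes one ngram.
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(bigrams("東京タワー"), vec!["東京", "京タ", "タワ", "ワー"]);
    assert_eq!(bigrams("db"), vec!["db"]);
    assert!(bigrams("x").is_empty()); // a single character yields no bigram
    println!("ok");
}
```

Overlapping bigrams are why a query for 東京タワー can also recall other 東京* documents: they share the 東京 ngram. The stop-ngram filter reduces the weight of such high-frequency shared ngrams.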

Query semantics

NATURAL LANGUAGE MODE

  • BM25-based relevance ranking.

SELECT id, MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE) AS score
FROM t
WHERE MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE) > 0
ORDER BY score DESC
LIMIT 20;

With stop-ngram filtering enabled:

CREATE FULLTEXT INDEX t_body_fts ON t(body)
  WITH PARSER ngram
  OPTIONS (n=2, normalize='nfkc', stop_filter=on, stop_df_ratio_ppm=500000);

This skips very frequent low-information ngrams during scoring.

BOOLEAN MODE

Supports +term (required), -term (excluded), and "phrase" (exact phrase).

SELECT id
FROM t
WHERE MATCH(body) AGAINST('"東京タワー" +夜景 -混雑' IN BOOLEAN MODE) > 0;

| Operator | Meaning | Example |
|---|---|---|
| +term | Term must be present | +東京 |
| -term | Term must not be present | -混雑 |
| "phrase" | Exact phrase match | "東京タワー" |
| term | Optional (score contribution) | 夜景 |
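The operator semantics above amount to classifying each query token. The sketch below is illustrative only — it is not MuroDB's parser, and it handles only simple whitespace-free tokens (no escaping, no spaces inside phrases):

```rust
// Simplified classification of BOOLEAN MODE tokens: +term, -term, "phrase", term.
#[derive(Debug, PartialEq)]
enum BoolTerm {
    Required(String), // +term: must be present
    Excluded(String), // -term: must not be present
    Phrase(String),   // "phrase": exact phrase match
    Optional(String), // bare term: optional, contributes to score
}

fn classify(token: &str) -> BoolTerm {
    if let Some(rest) = token.strip_prefix('+') {
        BoolTerm::Required(rest.to_string())
    } else if let Some(rest) = token.strip_prefix('-') {
        BoolTerm::Excluded(rest.to_string())
    } else if token.len() >= 2 && token.starts_with('"') && token.ends_with('"') {
        BoolTerm::Phrase(token[1..token.len() - 1].to_string())
    } else {
        BoolTerm::Optional(token.to_string())
    }
}

fn main() {
    assert_eq!(classify("+夜景"), BoolTerm::Required("夜景".to_string()));
    assert_eq!(classify("-混雑"), BoolTerm::Excluded("混雑".to_string()));
    assert_eq!(classify("\"東京タワー\""), BoolTerm::Phrase("東京タワー".to_string()));
    assert_eq!(classify("夜景"), BoolTerm::Optional("夜景".to_string()));
    println!("ok");
}
```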

Snippet helper

Use fts_snippet() for highlighted excerpts.

SELECT id,
  fts_snippet(body, '"東京タワー"', '<mark>', '</mark>', 30) AS snippet
FROM t
WHERE MATCH(body) AGAINST('"東京タワー"' IN BOOLEAN MODE) > 0
LIMIT 10;

Recall/precision tradeoff example

Dataset:

  • doc1: 東京タワー
  • doc2: 東京駅
  • doc3: 東京大学
  • doc4: 東京ドーム

Query:

MATCH(body) AGAINST('東京タワー' IN NATURAL LANGUAGE MODE)

Observed behavior:

  • stop_filter=off: broad recall (multiple 東京* docs can match)
  • stop_filter=on, stop_df_ratio_ppm=500000: higher precision (mostly 東京タワー doc remains)

Internal design

See FTS Internals for implementation details.

Recovery

MuroDB uses WAL-based crash recovery with two recovery modes.

Recovery modes

strict (default)

Fails on any WAL protocol violation.

murodb mydb.db

Detects and rejects:

  • Records before BEGIN
  • Commit LSN mismatches
  • Duplicate terminal records
  • PagePut integrity mismatches

permissive

Skips invalid transactions and recovers only valid committed transactions. Useful for salvaging data from corrupted databases.

murodb mydb.db --recovery-mode permissive

When transactions are skipped, the original WAL is quarantined to *.wal.quarantine.*.

WAL Inspection

Analyze WAL consistency without modifying the database.

# Text output
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive

# JSON output (for automation)
murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json

Quarantine files can also be inspected directly:

murodb-wal-inspect mydb.db --wal mydb.wal.quarantine.20240101_120000

Exit codes

| Code | Meaning |
|---|---|
| 0 | No malformed transactions detected |
| 10 | Malformed transactions detected (inspection succeeded) |
| 20 | Fatal error (decrypt/IO/strict failure, etc.) |

API

#![allow(unused)]
fn main() {
use murodb::{Database, RecoveryMode};

// strict (default)
let db = Database::open("mydb.db", &master_key)?;

// permissive
let db = Database::open_with_recovery_mode(
    "mydb.db", &master_key, RecoveryMode::Permissive
)?;

// permissive with report
let (db, report) = Database::open_with_recovery_mode_and_report(
    "mydb.db", &master_key, RecoveryMode::Permissive
)?;
for skip in &report.skipped {
    eprintln!("Skipped tx {}: {:?}", skip.txid, skip.reason);
}
}

JSON Schema Versioning Policy

  • schema_version increments only on breaking changes (key removal, type changes)
  • New keys are added without version bump (consumers should ignore unknown keys)
  • RecoverySkipCode string values are frozen (regression-tested)
  • InspectFatalKind string values are frozen (regression-tested)

Backup & Restore

Overview

MuroDB provides a Database::backup() API for creating consistent, point-in-time backups of a running database. The backup file is a complete MuroDB database that can be opened directly with the same key or password.

Creating a Backup

Rust API

#![allow(unused)]
fn main() {
use murodb::Database;
use murodb::crypto::aead::MasterKey;

let mut db = Database::open(path, &master_key)?;

// Normal operations ...
db.execute("INSERT INTO users VALUES (1, 'alice')")?;

// Create a consistent backup
db.backup("/path/to/backup.db")?;
}

Password-based databases work the same way:

#![allow(unused)]
fn main() {
let mut db = Database::open_with_password(path, "my-password")?;
db.backup("/path/to/backup.db")?;
}

Plaintext databases too:

#![allow(unused)]
fn main() {
let mut db = Database::open_plaintext(path)?;
db.backup("/path/to/backup.db")?;
}

What Happens During Backup

  1. An exclusive lock is acquired (writers are blocked, but backup is typically fast).
  2. The WAL is checkpointed so all committed data is flushed to the main file.
  3. The database file is copied byte-by-byte to the destination.
  4. The destination file is fsynced for durability.
  5. The lock is released and normal operations resume.

Restoring from a Backup

The backup file is a standard MuroDB database file. To restore:

  1. Stop the application (or close the Database handle).
  2. Replace the original database file with the backup file.
  3. Delete the WAL file (<dbname>.wal) if present — the backup has no pending WAL.
  4. Reopen the database normally.
#![allow(unused)]
fn main() {
// Open the backup directly
let mut db = Database::open("/path/to/backup.db", &master_key)?;
}

Safety

  • Same-file protection: Attempting to back up to the source file itself (including via symlinks or hardlinks) returns an error without modifying the source.
  • Encryption preserved: Encrypted databases are copied as-is. No decryption or re-encryption occurs. The backup uses the same key, salt, and encryption suite.
  • Crash safety: If the process crashes during backup, the source database is unaffected. The destination file may be incomplete and should be discarded.

Limitations

| Item | Detail |
|---|---|
| Disk space | Requires free space equal to the full database size. |
| Writer blocking | Writers are blocked for the duration of the copy (proportional to DB size). |
| WAL not included | The WAL is checkpointed before copy; the backup file has no WAL dependency. |
| Incremental backup | Not supported. Each backup is a full copy. |

Checkpoint Policy Tuning

MuroDB can defer WAL truncate (checkpoint) using policy thresholds. This helps reduce commit-time overhead on write-heavy workloads.

Configuration Knobs

Checkpoint policy can be configured in two ways:

  • Environment variables (process default, applied at session construction)
  • SQL runtime options (SET ..., session scope, non-persistent)

For full option-by-option semantics, see Runtime Configuration.

Environment variable names:

  • MURODB_CHECKPOINT_TX_THRESHOLD
  • MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD
  • MURODB_CHECKPOINT_INTERVAL_MS

SQL runtime option names:

  • checkpoint_tx_threshold
  • checkpoint_wal_bytes_threshold
  • checkpoint_interval_ms

Semantics:

  • Checkpoint runs when any enabled trigger fires.
  • MURODB_CHECKPOINT_TX_THRESHOLD=1 (default): checkpoint every commit/rollback.
  • MURODB_CHECKPOINT_TX_THRESHOLD=0: disable tx-count trigger.
  • *_WAL_BYTES_THRESHOLD=0 / *_INTERVAL_MS=0: disabled.
  • Runtime SET values follow the same semantics, but are session-only and not persisted.

Runtime example:

SET checkpoint_tx_threshold = 8;
SET checkpoint_wal_bytes_threshold = 1048576;
SET checkpoint_interval_ms = 1000;

Low-latency / conservative

MURODB_CHECKPOINT_TX_THRESHOLD=8
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=1048576
MURODB_CHECKPOINT_INTERVAL_MS=1000

Use this when you want modest commit-speed gains with tight WAL growth control.

Write-throughput focused

MURODB_CHECKPOINT_TX_THRESHOLD=64
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=8388608
MURODB_CHECKPOINT_INTERVAL_MS=5000

Use this for update/insert-heavy workloads where throughput is prioritized over immediate WAL truncation.

Time-driven only

MURODB_CHECKPOINT_TX_THRESHOLD=0
MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD=0
MURODB_CHECKPOINT_INTERVAL_MS=1000

Useful when transaction size/shape is bursty and you want predictable checkpoint cadence.

Tuning Procedure

  1. Start from the conservative profile.
  2. Run workload benchmark and record throughput/latency.
  3. Increase MURODB_CHECKPOINT_TX_THRESHOLD stepwise (for example: 8 -> 16 -> 32 -> 64).
  4. Keep safety bounds with either MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD or MURODB_CHECKPOINT_INTERVAL_MS.
  5. Stop increasing when throughput gain flattens or WAL/recovery cost becomes unacceptable.

What to Monitor

Use:

SHOW DATABASE STATS;

Track at least:

  • failed_checkpoints
  • deferred_checkpoints
  • checkpoint_pending_ops
  • checkpoint_policy_tx_threshold
  • checkpoint_policy_wal_bytes_threshold
  • checkpoint_policy_interval_ms

And from filesystem:

ls -lh mydb.wal

Guardrails

  • failed_checkpoints > 0 means truncate is failing; investigate disk I/O.
  • checkpoint_pending_ops growing continuously with WAL size means policy is too loose for current workload.
  • Larger deferred windows can increase restart recovery time because WAL replay work increases.
  • Durability boundary is unchanged: commit durability still depends on WAL sync.

Alerting & Monitoring

MuroDB exposes internal health metrics via the SHOW DATABASE STATS SQL command. This page describes how to use these metrics for operational alerting.

Querying Stats

SHOW DATABASE STATS;

Returns a key-value table with columns: stat and value.

Related command:

SHOW CHECKPOINT STATS;

This returns the checkpoint-only subset for backward compatibility.

Key Metrics

commit_in_doubt_count

Alert threshold: > 0

Indicates the number of times a commit succeeded in the WAL (durable) but failed to write pages or metadata to the data file. The session is poisoned after this event.

Action: Close the session and reopen the database. WAL recovery will replay the committed transaction. Investigate the root cause (disk full, I/O errors, hardware failure).

IF commit_in_doubt_count > 0 THEN ALERT
  severity: CRITICAL
  message: "CommitInDoubt detected. Session is poisoned. Reopen database to recover."

failed_checkpoints

Alert threshold: > 0

Indicates checkpoint truncation failures. The WAL file is not being truncated after successful commits, which causes WAL growth.

Action: Monitor WAL file size. If it grows unboundedly, investigate disk I/O. The database remains correct (WAL replay is idempotent), but performance may degrade due to longer recovery times.

IF failed_checkpoints > 0 THEN ALERT
  severity: WARNING
  message: "Checkpoint failures detected. WAL may be growing. Monitor WAL file size on disk."

deferred_checkpoints / checkpoint_pending_ops

deferred_checkpoints is the number of transactions where checkpoint was intentionally skipped by policy. checkpoint_pending_ops is the current backlog since the last successful checkpoint.

High deferred_checkpoints is expected with batch policies. Alert only when checkpoint_pending_ops keeps growing together with WAL size.
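That alert condition — pending operations and WAL size both growing between samples — can be sketched as a comparison of two consecutive stats snapshots. The struct and function here are hypothetical monitoring glue, not part of the MuroDB API; the field names mirror the SHOW DATABASE STATS keys:

```rust
// Hypothetical backlog-alert rule: flag only when checkpoint_pending_ops AND
// wal_file_size_bytes both grow between two consecutive samples.
#[derive(Clone, Copy)]
struct StatsSample {
    checkpoint_pending_ops: u64,
    wal_file_size_bytes: u64,
}

fn backlog_alert(prev: StatsSample, curr: StatsSample) -> bool {
    curr.checkpoint_pending_ops > prev.checkpoint_pending_ops
        && curr.wal_file_size_bytes > prev.wal_file_size_bytes
}

fn main() {
    let a = StatsSample { checkpoint_pending_ops: 10, wal_file_size_bytes: 1_000 };
    let b = StatsSample { checkpoint_pending_ops: 50, wal_file_size_bytes: 9_000 };
    let c = StatsSample { checkpoint_pending_ops: 0, wal_file_size_bytes: 500 };
    assert!(backlog_alert(a, b));  // both growing: alert
    assert!(!backlog_alert(b, c)); // a checkpoint ran and cleared the backlog
    println!("ok");
}
```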

checkpoint_policy_*

The active checkpoint policy is surfaced as:

  • checkpoint_policy_tx_threshold
  • checkpoint_policy_wal_bytes_threshold
  • checkpoint_policy_interval_ms

Policy is configured via environment variables:

  • MURODB_CHECKPOINT_TX_THRESHOLD (default 1, 0 disables tx-count trigger)
  • MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD (default 0, disabled)
  • MURODB_CHECKPOINT_INTERVAL_MS (default 0, disabled)

Policy can also be changed at runtime per session:

  • SET checkpoint_tx_threshold = <u64>
  • SET checkpoint_wal_bytes_threshold = <u64>
  • SET checkpoint_interval_ms = <u64>

See Checkpoint Policy Tuning for recommended starting values and rollout procedure. Option details are documented in Runtime Configuration.

freelist_sanitize_count

Alert threshold: > 0 (informational)

Number of times the freelist was sanitized during page allocation to remove invalid entries. This is a self-healing mechanism, not necessarily an error.

Action: If the count is consistently non-zero across sessions, investigate potential freelist corruption. A one-time occurrence after crash recovery is normal.

freelist_out_of_range_total / freelist_duplicates_total

Alert threshold: > 0 (informational)

Breakdown counters for freelist sanitization events:

  • freelist_out_of_range_total: removed page IDs outside valid page range
  • freelist_duplicates_total: removed duplicate page IDs

Action: non-zero values are expected only when sanitization occurred. If values continue increasing across clean restarts, investigate possible on-disk corruption.

WAL Size Monitoring

SHOW DATABASE STATS exposes WAL size as wal_file_size_bytes. Use this SQL-native metric first and correlate with failed_checkpoints. Persistent growth together with failed_checkpoints > 0 indicates checkpoint truncate failures.

Monitoring Patterns

Polling Loop

For long-running applications, periodically query stats:

#![allow(unused)]
fn main() {
use murodb::Database;

fn check_health(db: &mut Database) {
    if let Ok(result) = db.execute("SHOW DATABASE STATS") {
        // Parse result and check thresholds
        // Alert on commit_in_doubt_count > 0
        // Alert on failed_checkpoints > 0
        // Track wal_file_size_bytes
    }
}
}

Post-Recovery Check

After opening a database that required WAL recovery, verify the recovery result:

#![allow(unused)]
fn main() {
use murodb::{Database, RecoveryMode};

let (db, report) = Database::open_with_recovery_mode_and_report(
    "mydb.db", &master_key, RecoveryMode::Strict
)?;

if !report.committed_txids.is_empty() {
    log::info!("Recovered {} committed transactions", report.committed_txids.len());
}
if !report.skipped.is_empty() {
    log::warn!("Skipped {} malformed transactions", report.skipped.len());
}
}

Summary

| Metric | Threshold | Severity | Meaning |
|---|---|---|---|
| commit_in_doubt_count | > 0 | Critical | Session poisoned; reopen required |
| failed_checkpoints | > 0 | Warning | WAL growing; checkpoint failing |
| wal_file_size_bytes | Increasing trend | Warning | WAL growth; correlate with checkpoint failures |
| freelist_sanitize_count | > 0 | Info | Freelist self-healed |
| freelist_out_of_range_total | > 0 | Info | Invalid freelist entries removed (range) |
| freelist_duplicates_total | > 0 | Info | Invalid freelist entries removed (duplicates) |

Incident Response Runbook

This runbook covers common failure scenarios and the recommended response procedures for MuroDB operators.

First Response: Gather Diagnostics

Before taking any corrective action, collect the current state:

SHOW DATABASE STATS;

Note the values of commit_in_doubt_count, failed_checkpoints, wal_file_size_bytes, freelist_sanitize_count, freelist_out_of_range_total, and freelist_duplicates_total.

If the database cannot be opened, inspect the WAL without modifying it:

murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json

Scenario: CommitInDoubt Detected

Symptom: commit_in_doubt_count > 0 in SHOW DATABASE STATS, or the application receives a CommitInDoubt error.

What happened: The transaction was durably written to the WAL, but the subsequent write of dirty pages or metadata to the data file failed (e.g., disk full, I/O error). The session is poisoned and will reject further operations.

Response:

  1. Close the current session / database handle immediately.
  2. Investigate the root cause — check disk space (df -h), kernel logs (dmesg), and storage health.
  3. Resolve the underlying issue (free disk space, replace failing disk).
  4. Reopen the database. WAL recovery will automatically replay the committed transaction.
  5. Run SHOW DATABASE STATS to confirm commit_in_doubt_count is 0 after recovery.

Do NOT delete or rename the WAL file — it contains the committed data that needs to be replayed.

Scenario: Checkpoint Failures / WAL Growth

Symptom: failed_checkpoints > 0, and wal_file_size_bytes is growing over time.

What happened: After committing, MuroDB truncates the WAL via a checkpoint. If truncation fails, the WAL keeps growing. The database remains correct (WAL replay is idempotent), but recovery time increases.

Response:

  1. Check disk I/O health and available space.
  2. If the WAL is very large but the database is otherwise healthy, restart the process. Recovery on startup will replay and then truncate the WAL.
  3. Monitor wal_file_size_bytes after restart to confirm the WAL was truncated.

Scenario: Freelist Corruption Suspected

Symptom: freelist_sanitize_count is consistently non-zero across multiple sessions (not just once after a crash recovery).

What happened: The freelist (which tracks reusable pages) contained invalid entries. MuroDB self-heals by removing invalid entries during allocation, but persistent occurrences may indicate deeper corruption.

Response:

  1. A single occurrence after crash recovery is normal — no action needed.
  2. If it recurs across sessions:
    • Back up the database file and WAL immediately.
    • Open with --recovery-mode permissive and check the recovery report for skipped transactions.
    • If data integrity is confirmed, the self-healing is working correctly. Continue monitoring.
    • If data loss is suspected, restore from backup and replay from the last known good state.

Scenario: Database Fails to Open (WAL Corruption)

Symptom: Opening the database fails with a recovery error in strict mode.

Response:

  1. Inspect first — do not delete any files:
    murodb-wal-inspect mydb.db --wal mydb.wal --recovery-mode permissive --format json
    
  2. Review the report. If only incomplete (uncommitted) transactions are malformed, they can be safely skipped.
  3. Open with permissive mode to recover valid data:
    murodb mydb.db --recovery-mode permissive
    
  4. The original WAL is automatically quarantined to *.wal.quarantine.* for forensic analysis.
  5. Verify recovered data integrity by querying critical tables.

Scenario: Process Crash / Kill During Operation

Symptom: The MuroDB process was killed (SIGKILL, OOM, power loss) mid-operation.

Response:

  1. Simply reopen the database. WAL recovery handles this automatically.
  2. Check SHOW DATABASE STATS after recovery:
    • commit_in_doubt_count should be 0.
    • freelist_sanitize_count may be > 0 once — this is normal.
  3. If strict recovery fails, follow the “Database Fails to Open” procedure above.

When to Restart vs. Quarantine WAL

| Situation | Action |
|---|---|
| Session poisoned (CommitInDoubt) | Restart — recovery replays committed data |
| WAL growing (checkpoint failures) | Restart — recovery truncates WAL |
| Strict recovery fails | Inspect WAL, then open with --recovery-mode permissive |
| Repeated freelist sanitization | Back up, then investigate with permissive mode |
| Corrupted WAL with data loss | Restore from backup |

Escalation Criteria

Escalate to the development team if:

  • commit_in_doubt_count > 0 persists after reopening the database (recovery failed to replay).
  • WAL inspection shows committed transactions that were not recovered.
  • Permissive mode skips transactions that should have been valid.
  • Freelist sanitization count grows across sessions without any preceding crash.
  • The database file size is inconsistent with expected data volume (possible page leak).

When escalating, include:

  1. Full SHOW DATABASE STATS output.
  2. WAL inspection JSON output (murodb-wal-inspect --format json).
  3. Kernel logs around the time of failure (dmesg, journalctl).
  4. The quarantined WAL file(s), if any.

Limits Reference

This page documents the known limits of MuroDB. These limits arise from the fixed page size, serialization formats, and design decisions of the storage engine.

Page & Row Limits

| Limit | Value | Notes |
|---|---|---|
| Page size | 4,096 bytes | Fixed; all data pages, B-tree nodes, and catalog entries use this size |
| Page header | 14 bytes | page_id (8) + cell_count (2) + free_start (2) + free_end (2) |
| Max inline row size | ~4,073 bytes | Rows within this limit are stored inline in a single page |
| Max row size (with overflow) | ~4 GB | Limited by u32 total_value_len; values exceeding inline limit use overflow pages |
| Max cell payload | ~4,073 bytes | 4,096 − 14 (header) − 5 (node header cell) − 4 (cell pointer + length prefix) |
| Overflow chunk size | 4,077 bytes | Per overflow page: 4,096 − 19 bytes header |

Rows with values that exceed the inline page capacity automatically use overflow pages. The value is stored in a chain of overflow pages, with the leaf cell containing only the key and a pointer to the first overflow page. Keys must still fit inline (max ~4,071 bytes).
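The inline limits above follow directly from the fixed page arithmetic. The constants below simply restate the document's numbers so the derivation can be checked; they are illustrative and not exported by MuroDB:

```rust
// Page-layout arithmetic behind the "~4,073 bytes" inline payload limit.
const PAGE_SIZE: usize = 4096;
const PAGE_HEADER: usize = 14;      // page_id(8) + cell_count(2) + free_start(2) + free_end(2)
const NODE_HEADER_CELL: usize = 5;  // node header cell
const CELL_PTR_AND_LEN: usize = 4;  // cell pointer + length prefix

fn main() {
    assert_eq!(PAGE_HEADER, 8 + 2 + 2 + 2);
    let max_cell_payload = PAGE_SIZE - PAGE_HEADER - NODE_HEADER_CELL - CELL_PTR_AND_LEN;
    assert_eq!(max_cell_payload, 4073);
    // Overflow chunk: a full page minus the 19-byte overflow-page header.
    assert_eq!(PAGE_SIZE - 19, 4077);
    println!("ok");
}
```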

Column Limits

| Limit | Value | Notes |
|---|---|---|
| Max column count | 65,535 (u16) | Serialization limit; practical limit is lower due to page capacity |
| Column name max length | 65,535 bytes (u16) | Limited in practice by catalog page capacity |
| Table name max length | 65,535 bytes (u16) | Limited in practice by catalog page capacity |

Data Type Ranges

| Type | Min | Max | Storage |
|---|---|---|---|
| TINYINT | −128 | 127 | 1 byte |
| SMALLINT | −32,768 | 32,767 | 2 bytes |
| INT | −2,147,483,648 | 2,147,483,647 | 4 bytes |
| BIGINT | −2^63 | 2^63 − 1 | 8 bytes |
| FLOAT | ±1.2×10^−38 | ±3.4×10^38 | 4 bytes (finite values only; NaN/Infinity rejected) |
| DOUBLE | ±2.2×10^−308 | ±1.7×10^308 | 8 bytes (finite values only; NaN/Infinity rejected) |
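The integer ranges above match Rust's fixed-width integer types of the same storage size (a natural assumption for a Rust storage engine, though the internal representation is not specified here). This can be verified directly:

```rust
// The documented type ranges coincide with Rust's fixed-width integer bounds.
fn main() {
    assert_eq!((i8::MIN, i8::MAX), (-128, 127));                       // TINYINT
    assert_eq!((i16::MIN, i16::MAX), (-32_768, 32_767));               // SMALLINT
    assert_eq!((i32::MIN, i32::MAX), (-2_147_483_648, 2_147_483_647)); // INT
    assert_eq!(i64::MAX as u64, (1u64 << 63) - 1);                     // BIGINT max = 2^63 − 1
    // FLOAT/DOUBLE accept finite values only; NaN/Infinity would be rejected.
    assert!(!f64::NAN.is_finite() && !f64::INFINITY.is_finite());
    println!("ok");
}
```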

String & Binary Limits

| Limit | Value | Notes |
|---|---|---|
| VARCHAR(n) max n | 4,294,967,295 (u32) | Values exceeding ~4,073 bytes use overflow pages |
| VARBINARY(n) max n | 4,294,967,295 (u32) | Values exceeding ~4,073 bytes use overflow pages |
| TEXT max size | ~4 GB | Limited by u32 value length; large values use overflow pages |
| VARCHAR(n) length check | Character-based | VARCHAR(100) allows up to 100 characters (MySQL-compatible) |

Note: MuroDB checks VARCHAR(n) against character count, consistent with MySQL. Multi-byte UTF-8 characters (e.g., Japanese characters at 3 bytes each, emoji at 4 bytes) each count as one character. VARBINARY(n) still uses byte-based length checking.
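The character-based check can be sketched in a few lines; helper names here are illustrative, not MuroDB's actual API:

```rust
/// VARCHAR(n): count Unicode scalar values, not bytes.
fn varchar_fits(value: &str, n: usize) -> bool {
    value.chars().count() <= n
}

/// VARBINARY(n): count raw bytes.
fn varbinary_fits(value: &[u8], n: usize) -> bool {
    value.len() <= n
}

fn main() {
    // "日本語" is 3 characters but 9 bytes in UTF-8.
    assert!(varchar_fits("日本語", 3));               // passes the character check
    assert!(!varbinary_fits("日本語".as_bytes(), 3)); // fails the byte check
}
```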

Internal Limits

| Limit | Value | Notes |
| --- | --- | --- |
| B-tree max depth | 64 | Exceeded depth indicates corruption |
| WAL max frame length | 5,120 bytes | |
| FTS inline segment limit | 3,000 bytes | Larger posting lists use overflow pages |
| FTS max payload | 65,536 bytes | |
| LRU cache default size | 256 pages | Configurable |

NULL Behavior

  • Primary key columns cannot be NULL
  • NULL = NULL evaluates to UNKNOWN (not TRUE); use IS NULL instead
  • Aggregate functions (SUM, AVG, MIN, MAX) skip NULL values
  • COUNT(*) counts all rows; COUNT(column) counts non-NULL values only
  • Multiple NULL values are allowed in UNIQUE indexes (SQL standard)
  • NULL values in ORDER BY are grouped together
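A minimal sketch of the three-valued logic behind the `NULL = NULL` rule, modeling UNKNOWN as Rust's `None` (illustrative only, not the executor's actual value representation):

```rust
/// SQL comparison result: Some(true)/Some(false) = TRUE/FALSE, None = UNKNOWN.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None, // any NULL operand makes the comparison UNKNOWN
    }
}

/// IS NULL is a genuine predicate: it always yields TRUE or FALSE.
fn is_null(a: Option<i64>) -> bool {
    a.is_none()
}

fn main() {
    assert_eq!(sql_eq(None, None), None);            // NULL = NULL -> UNKNOWN
    assert_eq!(sql_eq(Some(1), Some(1)), Some(true));
    assert!(is_null(None));                          // NULL IS NULL -> TRUE
}
```

A WHERE clause keeps a row only when its predicate is TRUE, which is why `WHERE col = NULL` matches nothing and `WHERE col IS NULL` must be used instead.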

Security Considerations

This page summarizes the current security model, known risks, and operational guidance.

Threat Model (Current)

  • MuroDB is an embedded database library/CLI, not a network server.
  • Encrypted mode (aes256-gcm-siv) targets at-rest confidentiality and tamper detection for DB/WAL pages.
  • Plaintext mode (--encryption off) is explicit opt-in and provides no cryptographic protection.

Known Risks

| Risk | Impact | Status |
| --- | --- | --- |
| Malformed page/cell metadata can currently trigger panic paths instead of clean corruption errors | Process abort (availability) when opening/querying corrupted files, especially relevant in plaintext mode | Tracked: #182 |
| Plaintext mode has no confidentiality/integrity guarantees | Data can be read/modified offline without cryptographic checks | By design |
| No built-in user authentication/authorization layer | Access control depends on host process + filesystem permissions | By design |

Operational Guidance

  • Prefer encrypted mode for production data.
  • Avoid passing secrets via CLI args (--password) when possible; use interactive prompt.
  • Treat database files as trusted inputs only until #182 is addressed.
  • Apply OS-level controls: file permissions, disk encryption, process isolation, and secrets management.

Architecture

MuroDB is an embedded SQL database with optional at-rest encryption. The runtime stack is organized in layers:

sql/ (lexer → parser → planner → executor)
  ↓
schema/ (catalog: table/index definitions)
  ↓
tx/ (transaction: dirty page buffer, commit/rollback)
  ↓
btree/ (B-tree: insert/split, delete, search, scan)
  ↓
wal/ (WAL: encrypted records, crash recovery)
  ↓
storage/ (pager: encrypted page I/O, LRU cache, freelist)
  ↓
crypto/ (AES-256-GCM-SIV, Argon2 KDF, HMAC-SHA256)

Additional modules:

  • fts/ - Full-text search (bigram tokenizer, postings B-tree, BM25, BOOLEAN/NATURAL mode)
  • concurrency/ - parking_lot::RwLock (thread) + fs4 file lock (process)

How To Read This Section

If your goal is “reconstruct internals after a long break”, read in this order:

  1. Reading Guide
  2. Files, WAL, and Locking
  3. Storage
  4. Catalog Format
  5. B-tree
  6. Query Planning & Execution
  7. Cryptography
  8. WAL & Crash Resilience

Module Map

| Module | Files | Role |
| --- | --- | --- |
| storage/ | page.rs, pager.rs, freelist.rs | 4096B encrypted page I/O |
| crypto/ | aead.rs, kdf.rs, hmac_util.rs | Encryption primitives |
| btree/ | node.rs, ops.rs, cursor.rs, key_encoding.rs | B-tree operations |
| wal/ | record.rs, writer.rs, reader.rs, recovery.rs | WAL + crash recovery |
| tx/ | transaction.rs, lock_manager.rs | Transactions |
| schema/ | catalog.rs, column.rs, index.rs | System catalog |
| sql/ | lexer.rs, parser.rs, ast.rs, planner.rs, executor.rs, eval.rs | SQL processing |
| fts/ | tokenizer.rs, postings.rs, index.rs, query.rs, scoring.rs, snippet.rs | Full-text search |
| concurrency/ | mod.rs | Concurrency control |

Concurrency Model

  • Thread-level: parking_lot::RwLock - multiple readers, single writer
  • Process-level: fs4 file lock - prevents concurrent access from multiple processes
  • API routing:
    • Database::query acquires a shared lock for read-only statements.
    • Database::execute acquires an exclusive lock for general SQL execution.
    • CLI routes read-only statements to query unless an explicit transaction is active.
  • Handle model:
    • Database::query takes &mut self because read execution may refresh pager/catalog state from disk before running.
    • For concurrent reads in one process, open additional read-only handles (Database::open_reader) and run query on each handle.

For on-disk file contracts (main DB file / .wal / .lock), see Files, WAL, and Locking. For catalog key/value binary layouts, see Catalog Format.

Reading Guide

This page defines how to use the internals docs as a long-term memory aid.

Current Focus Areas

If you return after weeks/months, start from these questions:

  1. How is data physically laid out on disk?
  2. How is a B+tree node stored inside a page?
  3. How does a query become a concrete access path?
  4. What is the exact .wal format and recovery state machine?
  5. What does .lock lock, and at what granularity?
  6. Which cryptographic primitives are used, and why these choices?

Target Documentation Shape

The internals section now follows this order:

  1. Architecture: module map and end-to-end data flow.
  2. Files, WAL, and Locking: main file / .wal / .lock contract.
  3. Storage: file header, generic page layout, and freelist format.
  4. Catalog Format: system catalog key/value encoding and compatibility.
  5. B-tree: in-page node format and mutation/scan algorithms.
  6. Query Planning & Execution: plan selection and execution mapping.
  7. Cryptography: encryption/KDF details and rationale.
  8. WAL & Crash Resilience: transaction protocol and recovery validation.
  9. Durability Matrix: crash-at-each-step outcomes.

Suggested Rebuild Path (for implementers)

If you are implementing an embedded RDBMS with an LLM agent, use this order:

  1. Implement fixed-size pages and a pager (storage/page.rs, storage/pager/mod.rs).
  2. Implement B+tree on top of slotted pages (btree/node.rs, btree/ops/mod.rs).
  3. Add SQL parser/planner/executor (sql/parser/*, sql/planner.rs, sql/executor/*).
  4. Add WAL append + replay (wal/writer.rs, wal/reader.rs, wal/recovery.rs).
  5. Add lock manager for thread/process safety (concurrency/mod.rs).
  6. Add encryption suite and key derivation (crypto/*).

Each chapter in this internals section maps directly to those steps.

Files, WAL, and Locking

This chapter defines the on-disk files and lock behavior.

At a Glance

For database path <db_path>, MuroDB uses three files:

<db_path> (main state) + <db_path>.wal (durability log) + <db_path>.lock (OS advisory lock target)

High-level flow:

  1. Append commit-intent records to .wal and wal.sync().
  2. Commit is durable at this point.
  3. Main DB file is checkpointed/flushed later.
  4. .lock coordinates readers/writers across processes during API calls.

The sections below expand each file in this order.

File Set

If you open database path <db_path>, MuroDB uses:

  • <db_path>: main database file (header + pages)
  • <db_path>.wal: write-ahead log
  • <db_path>.lock: lock file for cross-process coordination

Example:

  • if <db_path> = mydata, files are mydata, mydata.wal, mydata.lock
  • if <db_path> = mydb.db, files are mydb.db, mydb.db.wal, mydb.db.lock

Main DB File Layout

The main file starts with a 76-byte plaintext header (src/storage/pager/mod.rs):

| Offset | Size | Field |
| --- | --- | --- |
| 0 | 8 | Magic "MURODB01" |
| 8 | 4 | Format version (u32, current 4) |
| 12 | 16 | KDF salt |
| 28 | 8 | Catalog root page id |
| 36 | 8 | Page count |
| 44 | 8 | Epoch |
| 52 | 8 | Freelist root page id |
| 60 | 8 | Next transaction id |
| 68 | 4 | Encryption suite id |
| 72 | 4 | CRC32 over bytes 0..72 |

Immediately after the header, pages are stored sequentially. For page-level internals, see Storage.
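The layout above can be decoded with plain little-endian reads. The following is an illustrative sketch, not the actual pager code (the salt bytes are skipped and CRC32 verification is omitted):

```rust
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    catalog_root: u64,
    page_count: u64,
    epoch: u64,
    freelist_root: u64,
    next_tx_id: u64,
    suite_id: u32,
}

fn parse_header(buf: &[u8; 76]) -> Option<Header> {
    if &buf[0..8] != b"MURODB01" {
        return None; // bad magic
    }
    let u32_at = |o: usize| u32::from_le_bytes(buf[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());
    Some(Header {
        version: u32_at(8),
        // bytes 12..28 are the KDF salt (not decoded here)
        catalog_root: u64_at(28),
        page_count: u64_at(36),
        epoch: u64_at(44),
        freelist_root: u64_at(52),
        next_tx_id: u64_at(60),
        suite_id: u32_at(68),
        // bytes 72..76 are the CRC32 (verification omitted in this sketch)
    })
}

fn main() {
    let mut buf = [0u8; 76];
    buf[0..8].copy_from_slice(b"MURODB01");
    buf[8..12].copy_from_slice(&4u32.to_le_bytes());
    buf[28..36].copy_from_slice(&1u64.to_le_bytes());
    let h = parse_header(&buf).unwrap();
    assert_eq!(h.version, 4);
    assert_eq!(h.catalog_root, 1);
}
```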

.wal File Role

.wal stores commit-intent records before data-file flush. Binary framing details are in WAL & Crash Resilience.

Durability boundary:

  • commit is considered durable after wal.sync() succeeds.
  • data-file flush may happen after that; failures become CommitInDoubt.

.lock File Semantics

<db_path>.lock is created by LockManager::new (src/concurrency/mod.rs).

  • It is not a structured metadata file.
  • Its payload is not interpreted by MuroDB.
  • It exists as a stable file descriptor target for advisory file locks (fs4).

Lock Granularity

Locking has two layers:

  1. In-process: parking_lot::RwLock<()>
  2. Cross-process: fs4 shared/exclusive lock on .lock

API behavior:

  • Database::query(...) acquires shared read lock.
  • Database::execute(...) acquires exclusive write lock.
  • Database::query(...) is a &mut self API because read execution may refresh pager/catalog metadata from disk before running.
  • For multiple concurrent readers within one process, use separate read-only handles (for example Database::open_reader()).

Important granularity note:

  • Locks are acquired per API call, not globally for session lifetime.
  • During explicit transactions (BEGIN ... COMMIT), each statement still enters through execute(...) and takes the write lock for that call.

Visibility Refresh

When no explicit transaction is active, session execution calls pager.refresh_from_disk_if_changed() and reloads catalog metadata when header fields changed. This is how a process observes committed changes from other processes.

Why this split (main file + .wal + .lock)?

  • main DB file: stable state and efficient reads.
  • .wal: sequential append for crash-safe commit protocol.
  • .lock: avoids embedding lock bytes into data format and delegates arbitration to OS advisory locks.

Why .lock is separated from the main DB file:

  • Keep data format clean: lock state is operational state, not database state.
  • Avoid extra data-file churn: lock acquire/release does not force DB header/page writes.
  • Better crash behavior: lock lifetime is tied to OS file-lock semantics; stale lock bytes do not need cleanup from the DB payload.
  • Portability and tooling: advisory locking APIs (flock/fcntl-style via fs4) naturally target a lock file descriptor.
  • Lower format coupling: lock strategy can evolve without changing on-disk table/page format.

Storage

Data File Structure (At a Glance)

The main .db file is:

[plaintext file header (76B)] [page 0 on disk] [page 1 on disk] [page 2 on disk] ...

  • File header is always plaintext and fixed-size.
  • Each page is a logical 4096-byte page (PAGE_SIZE) stored at:
    • offset = 76 + page_id * page_size_on_disk
  • page_size_on_disk is:
    • plaintext mode: 4096
    • encrypted mode: 12 (nonce) + 4096 (ciphertext) + 16 (tag) = 4124
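The offset formula above can be sketched directly (constants taken from the text; function names are illustrative):

```rust
const HEADER_LEN: u64 = 76;
const PAGE_SIZE: u64 = 4096;

/// On-disk size of one page: plain 4096, or 4096 plus AEAD nonce + tag.
fn page_size_on_disk(encrypted: bool) -> u64 {
    if encrypted { 12 + PAGE_SIZE + 16 } else { PAGE_SIZE }
}

/// offset = 76 + page_id * page_size_on_disk
fn page_offset(page_id: u64, encrypted: bool) -> u64 {
    HEADER_LEN + page_id * page_size_on_disk(encrypted)
}

fn main() {
    assert_eq!(page_size_on_disk(true), 4124);
    assert_eq!(page_offset(0, false), 76);
    assert_eq!(page_offset(2, true), 76 + 2 * 4124);
}
```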

This chapter first explains the file header, then the page layout, then special page formats (freelist).

Data File Header

Main DB file header (src/storage/pager/mod.rs) is 76 bytes:

0..8    Magic "MURODB01"
8..12   Format version (u32 LE)
12..28  Salt (16B, Argon2 input)
28..36  Catalog root page ID (u64 LE)
36..44  Page count (u64 LE)
44..52  Epoch (u64 LE)
52..60  Freelist page ID (u64 LE, 0 = none)
60..68  Next TxId (u64 LE)
68..72  Encryption suite ID (u32 LE)
72..76  CRC32 over bytes 0..72
  • freelist_page_id persists the freelist root across restarts.
  • CRC32 protects header integrity before any page decryption.
  • This header exists once per file; everything after this is page data.
  • catalog_root points to the system catalog B-tree root (format in Catalog Format).

See Files, WAL, and Locking for .db / .wal / .lock lifecycle.

Generic Page Layout

  • Page size: 4096 bytes (PAGE_SIZE)
  • Page header: 14 bytes (src/storage/page.rs)
  • Cell pointer: 2 bytes per cell
  • Cell payload: [len:u16][payload bytes]
  • Cache: LRU page cache (default 256 pages)

Slotted-page structure:

[header(14)] [cell pointer array] [free space] [cell bodies (from tail)]

This slotted layout is generic; B+tree node format is layered on top of it.
See B-tree for node/header cell conventions.

Encryption

Encrypted mode stores each page as:

nonce(12) || ciphertext || tag(16)

  • Algorithm: AES-256-GCM-SIV (nonce-misuse resistant AEAD)
  • KDF: Argon2 derives the master key from the user’s passphrase + random salt
  • AAD binding: (page_id, epoch) prevents page-swap/misbinding attacks

See Cryptography for rationale and full details.

Freelist

Freed pages are tracked in a freelist for reuse.

Freelist In-Memory Model

Implementation (src/storage/freelist.rs) uses Vec<PageId>:

  • allocate() pops from tail (LIFO reuse)
  • free(page_id) pushes the page id after checking it is not already present
  • a duplicate free is detected and rejected as a double-free
  • undo_last_free() exists for speculative commit-time calculations

Freelist On-Disk Format

Freelist is stored in normal data pages, linked as a chain.

Per freelist page data area (after the generic 14-byte page header):

[magic "FLMP":4][next_page_id:u64][count:u64][entries:u64...]

Facts:

  • ENTRIES_PER_FREELIST_PAGE = 507 for 4096-byte pages with 14-byte page header
  • next_page_id = 0 marks chain end
  • DB header field freelist_page_id points to chain head
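A sketch of this data-area format, assuming little-endian integers like the rest of the file format (helper names are illustrative, not the actual freelist code):

```rust
const PAGE_SIZE: usize = 4096;
const PAGE_HEADER: usize = 14;
const DATA_AREA: usize = PAGE_SIZE - PAGE_HEADER;                  // 4082
const ENTRIES_PER_FREELIST_PAGE: usize = (DATA_AREA - 4 - 8 - 8) / 8; // 507

/// [magic "FLMP":4][next_page_id:u64][count:u64][entries:u64...]
fn encode_freelist_page(next: u64, entries: &[u64]) -> Vec<u8> {
    assert!(entries.len() <= ENTRIES_PER_FREELIST_PAGE);
    let mut out = Vec::with_capacity(DATA_AREA);
    out.extend_from_slice(b"FLMP");
    out.extend_from_slice(&next.to_le_bytes());
    out.extend_from_slice(&(entries.len() as u64).to_le_bytes());
    for &pid in entries {
        out.extend_from_slice(&pid.to_le_bytes());
    }
    out
}

fn decode_freelist_page(data: &[u8]) -> Option<(u64, Vec<u64>)> {
    if data.get(0..4)? != b"FLMP" {
        return None;
    }
    let next = u64::from_le_bytes(data.get(4..12)?.try_into().ok()?);
    let count = u64::from_le_bytes(data.get(12..20)?.try_into().ok()?) as usize;
    let mut entries = Vec::with_capacity(count);
    for i in 0..count {
        let off = 20 + i * 8;
        entries.push(u64::from_le_bytes(data.get(off..off + 8)?.try_into().ok()?));
    }
    Some((next, entries))
}

fn main() {
    assert_eq!(ENTRIES_PER_FREELIST_PAGE, 507);
    let page = encode_freelist_page(0, &[7, 9, 42]);
    assert_eq!(decode_freelist_page(&page), Some((0, vec![7, 9, 42])));
}
```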

Commit-Time Freelist Handling

During Transaction::commit (src/tx/transaction.rs):

  1. Build a speculative freelist snapshot (without permanent mutation).
  2. Determine how many freelist pages are needed.
  3. Reuse existing freelist head page when possible, allocate more page IDs if needed.
  4. Serialize freelist pages and emit them as WAL PagePut.
  5. Emit MetaUpdate with new freelist_page_id.
  6. After wal.sync() succeeds, apply freed pages to in-memory freelist.

This ordering avoids freelist state leaks when commit fails before WAL durability.

Open-Time Freelist Loading and Sanitize

At open/refresh (Pager::reload_freelist_from_disk):

  1. Read freelist chain from freelist_page_id.
  2. For multi-page chain, detect cycles and out-of-range next pointers.
  3. Deserialize entries.
  4. Run sanitize(page_count) to remove:
    • out-of-range entries (pid >= page_count)
    • duplicate entries

Sanitization results are exposed as diagnostics and warning counters.

Catalog Format

This page documents how schema metadata is persisted in the system catalog.

Where Catalog Metadata Lives

  • The system catalog itself is a B-tree (src/schema/catalog.rs).
  • The root page id of that B-tree is stored in the main DB header field catalog_root.
  • Catalog keys are UTF-8 bytes; values are binary-serialized structs.

Key namespace:

  • table:<table_name> -> serialized TableDef
  • index:<index_name> -> serialized IndexDef

TableDef Value Format

TableDef::serialize / deserialize in src/schema/catalog.rs.

Layout (length-prefixed, little-endian integers):

  1. name_len: u16
  2. name: [u8; name_len] (UTF-8)
  3. column_count: u16
  4. Repeated column_count times:
    • col_blob_len: u16
    • col_blob: [u8; col_blob_len] (ColumnDef payload)
  5. pk_tag: u8
  6. Primary-key names by pk_tag:
    • 0: no PK names
    • 1: single-column PK:
      • pk_len: u16
      • pk_name: [u8; pk_len]
    • 2: composite PK:
      • pk_count: u16
      • repeated pk_count times: pk_len: u16 + pk_name
  7. data_btree_root: u64
  8. next_rowid: i64 (optional tail; defaults to 0 if absent)
  9. row_format_version: u8 (optional tail; defaults to 0 if absent)
  10. stats_row_count: u64 (optional tail; defaults to 0 if absent)

Unknown pk_tag causes decode failure.

ColumnDef Embedded Blob Format

ColumnDef::serialize / deserialize in src/schema/column.rs.

Layout:

  1. name_len: u16
  2. name: [u8; name_len] (UTF-8)
  3. type_byte: u8
  4. flags: u8
  5. Optional size: u32 only when type is VARCHAR/VARBINARY
  6. default_tag: u8 + optional default payload
  7. check_len: u16 + optional check expression bytes

type_byte mapping:

  • 1 BIGINT
  • 2 VARCHAR
  • 3 VARBINARY
  • 4 TINYINT
  • 5 SMALLINT
  • 6 INT
  • 7 TEXT
  • 8 FLOAT
  • 9 DOUBLE
  • 10 DATE
  • 11 DATETIME
  • 12 TIMESTAMP

Flag bits:

  • 0x01 primary key
  • 0x02 unique
  • 0x04 nullable
  • 0x08 hidden
  • 0x10 auto_increment
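Decoding the flag byte is a plain bit test; a sketch with illustrative names (bit values from the list above):

```rust
#[derive(Debug, PartialEq)]
struct ColumnFlags {
    primary_key: bool,
    unique: bool,
    nullable: bool,
    hidden: bool,
    auto_increment: bool,
}

fn decode_flags(flags: u8) -> ColumnFlags {
    ColumnFlags {
        primary_key:    flags & 0x01 != 0,
        unique:         flags & 0x02 != 0,
        nullable:       flags & 0x04 != 0,
        hidden:         flags & 0x08 != 0,
        auto_increment: flags & 0x10 != 0,
    }
}

fn main() {
    let f = decode_flags(0x01 | 0x10); // PK + auto_increment
    assert!(f.primary_key && f.auto_increment);
    assert!(!f.nullable);
}
```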

default_tag mapping:

  • 0 no default
  • 1 NULL
  • 2 integer (i64)
  • 3 string (u16 length + bytes)
  • 4 float (f64)

Unknown type_byte or default_tag causes decode failure.

IndexDef Value Format

IndexDef::serialize / deserialize in src/schema/index.rs.

Layout:

  1. name_len: u16 + name
  2. table_len: u16 + table_name
  3. first_col_len: u16 + first_column_name (legacy position)
  4. index_type: u8 (1 BTree, 2 Fulltext)
  5. is_unique: u8 (0/1)
  6. btree_root: u64
  7. extra_col_count: u16
  8. Repeated extra columns: col_len: u16 + col_name
  9. stats_distinct_keys: u64 (optional tail; default 0)
  10. Numeric-bounds extension (optional):
    • stats_num_bounds_known: u8
    • stats_num_min: i64
    • stats_num_max: i64
  11. FULLTEXT stop-filter extension (optional):
    • fts_stop_filter: u8
    • fts_stop_df_ratio_ppm: u32
  12. Histogram extension (optional):
    • hist_bin_count: u16
    • repeated hist_bin_count times: u32

Unknown index_type causes decode failure.

Compatibility Policy in Code

Current decode strategy is append-only/tolerant:

  • New fields are generally appended at the tail.
  • Older records are accepted by defaulting missing tail fields.
  • Some old layouts are explicitly recognized (for example IndexDef stats tails).
  • Truncated/corrupt payloads fail decode (None) or ignore incomplete optional tails (histogram extension in IndexDef).

Executable Spec (Tests)

Primary roundtrip tests:

  • src/schema/catalog.rs (test_table_def_roundtrip)
  • src/schema/column.rs (test_column_roundtrip_all_types)
  • src/schema/index.rs (test_composite_index_roundtrip)

Backward-compat/malformed behavior:

  • src/schema/index.rs (test_deserialize_old_layout_keeps_fts_settings)
  • src/schema/index.rs (test_deserialize_truncated_index_returns_none)

Foreign Keys

This page describes how foreign-key metadata and enforcement are implemented in MuroDB.

Source Layout

  • Metadata types and serialization: src/schema/catalog.rs
  • DDL validation (CREATE TABLE): src/sql/executor/ddl.rs
  • ALTER integration (ADD/DROP FOREIGN KEY, column guards): src/sql/executor/alter.rs
  • Runtime enforcement core: src/sql/executor/foreign_key.rs
  • DML integration:
    • INSERT / REPLACE / ON DUPLICATE KEY UPDATE: src/sql/executor/insert.rs
    • UPDATE / DELETE: src/sql/executor/mutation.rs
  • Introspection output: src/sql/executor/show.rs

Metadata Model

Each table stores outgoing FKs in TableDef.foreign_keys:

  • child columns: columns
  • parent table: ref_table
  • parent columns: ref_columns
  • actions: on_delete, on_update (RESTRICT, CASCADE, SET NULL)

Metadata is persisted in TableDef::serialize / TableDef::deserialize as an optional tail. The current format uses FK_LAYOUT_V2_TAG and stores both actions explicitly.

DDL and ALTER Rules

CREATE TABLE and ALTER TABLE ... ADD FOREIGN KEY validate:

  • referenced table exists
  • child and parent column counts match
  • referenced parent columns exist
  • child/parent types are compatible
  • existing rows in child table satisfy the new constraint

ALTER TABLE also protects FK dependencies:

  • cannot drop/modify/change a column used by local FK definitions
  • cannot drop parent-side columns referenced by incoming FKs (including self-reference)
  • DROP FOREIGN KEY (cols) is rejected when ambiguous

Runtime Enforcement

Child-side check (outgoing FK)

enforce_child_foreign_keys(...) validates that each non-NULL child key has a matching parent row. Any NULL in FK columns skips the check (SQL nullable FK behavior).

Parent-side check (incoming FK)

  • enforce_parent_restrict_on_delete(...)
  • enforce_parent_restrict_on_update(...)

These locate incoming references by scanning catalog tables and selecting FKs where ref_table matches the parent table.

For RESTRICT, they fail the statement. For CASCADE / SET NULL, they mutate child rows through helper paths in foreign_key.rs.

Ordering and Atomicity Safeguards

The implementation uses ordering rules to avoid partial side effects on failed statements:

  • delete path validates the full delete set first, then applies pending cascade/set-null actions
  • update path validates incoming RESTRICT before applying pending child updates
  • mutation paths run local uniqueness/outgoing-FK checks before triggering parent-side cascades
  • REPLACE pre-validates conflict-delete effects needed for self-referential FK safety

Recursion and Cycles

Cascade logic tracks visited (table, pk) entries to break recursive loops in cyclic FK graphs. This is used for both delete and update cascades.

Introspection

  • SHOW CREATE TABLE renders FK clauses including ON DELETE and ON UPDATE.
  • DESCRIBE emits an FK row with action details in Extra:
    • ON DELETE <action> ON UPDATE <action>

Current Cost Characteristics

FK checks currently rely on scans in several places:

  • parent existence checks scan parent rows
  • incoming-reference discovery scans table metadata (list_tables)
  • child-reference matching scans child table rows

This is correct but not index-accelerated yet; heavy FK workloads may pay O(table size) costs.

B-tree

Overview

MuroDB uses one B+tree implementation (src/btree/*) for:

  • table primary data (clustered by primary key)
  • secondary B-tree indexes
  • some internal metadata trees

Core handle is BTree { root_page_id } in src/btree/ops/mod.rs.

Is B+tree Stored Inside Pages?

Yes. A B+tree node is encoded directly inside one slotted page (src/storage/page.rs + src/btree/node.rs).

Page structure:

  1. Slotted-page header/pointer area (generic storage layer)
  2. Cell payloads (B+tree node header + entries)

So the “page” is the storage primitive, and the B+tree node format is layered on top of it.

Node Layout on Page

src/btree/node.rs defines:

  • Cell 0 is always a node header.
  • Leaf header payload: [node_type=1]
  • Internal header payload: [node_type=2][right_child: u64]

Leaf entry cell:

  • [key_len: u16][key bytes][value bytes]

Internal entry cell:

  • [left_child: u64][key_len: u16][key bytes]

Internal nodes store N separator keys and N+1 child pointers:

  • left_child in each entry (N pointers)
  • right_child in header (last pointer)
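A sketch of these cell encodings, assuming little-endian integers as elsewhere in the format (helper names are illustrative, not the actual node.rs code):

```rust
/// Leaf entry cell: [key_len: u16][key bytes][value bytes]
fn encode_leaf_cell(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut cell = Vec::new();
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes());
    cell.extend_from_slice(key);
    cell.extend_from_slice(value);
    cell
}

/// Internal entry cell: [left_child: u64][key_len: u16][key bytes]
fn encode_internal_cell(left_child: u64, key: &[u8]) -> Vec<u8> {
    let mut cell = Vec::new();
    cell.extend_from_slice(&left_child.to_le_bytes());
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes());
    cell.extend_from_slice(key);
    cell
}

fn main() {
    let leaf = encode_leaf_cell(b"k", b"v");
    assert_eq!(leaf, vec![1, 0, b'k', b'v']);
    let internal = encode_internal_cell(7, b"k");
    assert_eq!(internal.len(), 8 + 2 + 1);
}
```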

Key/Value Semantics by Tree Type

Primary data tree

  • key: encoded primary key bytes
  • value: serialized row bytes

Unique secondary index

  • key: encoded indexed column(s)
  • value: primary key bytes

Non-unique secondary index

  • key: index_key || primary_key (appended PK disambiguates duplicates)
  • value: primary key bytes

That encoding is implemented in src/sql/executor/indexing.rs.

Key Encoding Rules

Order-preserving encoding is in src/btree/key_encoding.rs.

  • signed integers: sign-bit flip + big-endian
  • float/double: order-preserving bit transform
  • composite keys: null marker + per-column encoding
  • variable-length components: byte-stuffed terminator scheme

Because encoded bytes preserve logical order, tree comparison is plain lexicographic byte compare.
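The sign-bit-flip rule for signed integers can be verified in a few lines: flipping the top bit maps signed order onto unsigned order, and big-endian bytes then compare lexicographically in numeric order (illustrative sketch, not the key_encoding.rs code):

```rust
/// Order-preserving encoding for i64: flip sign bit, emit big-endian.
fn encode_i64(v: i64) -> [u8; 8] {
    ((v as u64) ^ (1u64 << 63)).to_be_bytes()
}

fn main() {
    let vals = [i64::MIN, -1, 0, 1, i64::MAX];
    let encoded: Vec<[u8; 8]> = vals.iter().map(|&v| encode_i64(v)).collect();
    // Byte-wise (lexicographic) order matches numeric order.
    for w in encoded.windows(2) {
        assert!(w[0] < w[1]);
    }
}
```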

Search and Scan Behavior

Point lookup

BTree::search walks internal nodes with separator comparison (find_child) until leaf, then linear-searches leaf cells.

Full scan

There are no leaf sibling links.
BTree::scan performs recursive in-order traversal from root:

  • visit each internal left subtree in key order
  • then rightmost subtree

Range scan (>= start_key)

BTree::scan_from prunes early subtrees then falls back to in-order traversal for remaining branches.

Insert Path

BTree::insert behavior:

  1. Descend to target leaf.
  2. Rebuild leaf page with new/updated cell in sorted position.
  3. If overflow, split node and return median separator upward.
  4. Parent inserts new separator; parent may split recursively.
  5. If root splits, allocate new internal root.

What Happens If It Does Not Fit in One Page?

Two different cases:

  1. Tree growth case (many entries):
    • normal behavior is page split (leaf/internal), with separator propagation up to root.
    • this is fully supported.
  2. Single-cell-too-large case (one key/value entry itself is huge):
    • value-only overflow pages store large values that exceed page capacity.
    • keys remain inline (max ~4,071 bytes); values spill to overflow page chains.

Overflow Pages

When a leaf cell (key + value) exceeds ~4,073 bytes, the value is stored in an overflow page chain.

Overflow cell format

Normal leaf cell: [key_len: u16][key][value]

Overflow leaf cell: [key_len|0x8000: u16][key][total_value_len: u32][first_overflow_page: u64]

The high bit of key_len (0x8000) signals an overflow cell. All value data is stored in the overflow chain; the leaf cell contains only the key and metadata pointer.
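A sketch of the flagged cell encoding, assuming little-endian integers; the function names echo those listed under Implementation below but are illustrative re-creations, not the actual code:

```rust
const OVERFLOW_FLAG: u16 = 0x8000;

/// [key_len|0x8000: u16][key][total_value_len: u32][first_overflow_page: u64]
fn encode_overflow_leaf_cell(key: &[u8], total_value_len: u32, first_page: u64) -> Vec<u8> {
    let key_len = key.len() as u16;
    assert!(key_len & OVERFLOW_FLAG == 0, "key too large for inline storage");
    let mut cell = Vec::new();
    cell.extend_from_slice(&(key_len | OVERFLOW_FLAG).to_le_bytes());
    cell.extend_from_slice(key);
    cell.extend_from_slice(&total_value_len.to_le_bytes());
    cell.extend_from_slice(&first_page.to_le_bytes());
    cell
}

/// The high bit of key_len distinguishes overflow cells from normal cells.
fn is_overflow_cell(cell: &[u8]) -> bool {
    u16::from_le_bytes([cell[0], cell[1]]) & OVERFLOW_FLAG != 0
}

fn main() {
    let cell = encode_overflow_leaf_cell(b"id", 10_000_000, 99);
    assert!(is_overflow_cell(&cell));
    assert_eq!(cell.len(), 2 + 2 + 4 + 8); // key_len + "id" + u32 + u64
}
```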

Overflow page layout

Overflow pages use a simple linked-list format (not slotted pages):

[page_id: u64]       bytes 0..8   (standard page header)
[0xFF marker: u8]    byte 8       (distinguishes from B-tree nodes)
[next_page: u64]     bytes 9..17  (next page in chain, u64::MAX = end)
[chunk_len: u16]     bytes 17..19 (length of data in this page)
[chunk data]         bytes 19..   (up to 4,077 bytes per page)

Operations

  • Insert: if needs_overflow(key, value), write value to overflow chain, store chain head pointer in leaf cell.
  • Search/Scan: on overflow cell, call read_overflow_chain to reconstruct the full value.
  • Delete: free all overflow pages before removing the leaf cell.
  • Update: free old overflow chain (if any), then create new cell (inline or overflow).
  • Split/Merge: work with raw cell bytes to preserve overflow pointers without touching overflow data.
  • collect_all_pages: includes overflow page IDs for each overflow cell.

Implementation

  • src/storage/overflow.rs: write_overflow_chain, read_overflow_chain, free_overflow_chain, collect_overflow_pages
  • src/btree/node.rs: needs_overflow, encode_overflow_leaf_cell, is_overflow_cell, decode_overflow_metadata
  • src/btree/ops/mod.rs: overflow-aware insert/search/scan/delete/split/merge

Delete/Rebalance Path

BTree::delete removes target entry and handles underflow:

  • if leaf underfull, attempt merge/rebalance with sibling
  • if root internal ends with zero entries, collapse root to its only child

Current rebalance path focuses on practical leaf merges; behavior is intentionally conservative.

Practical Mental Model

If you are rebuilding this design:

  1. Implement slotted page first.
  2. Reserve cell 0 as node metadata.
  3. Keep key bytes order-preserving so comparison is simple.
  4. Start with split-only insert and basic delete, then add rebalance.
  5. Keep scan correctness independent of leaf links (recursive in-order works immediately).

Query Planning & Execution

This chapter explains how SQL predicates are converted into access paths and then executed.

Pipeline

sql/parser produces AST (Statement / Select), then:

  1. plan_select(...) in src/sql/planner.rs chooses a Plan.
  2. Executor modules (src/sql/executor/select_query.rs, src/sql/executor/mutation.rs) dispatch by Plan.
  3. B+tree/index scans are performed via BTree::search, scan, scan_from.

Plan Types

Plan currently has these variants:

  • PkSeek: full primary-key equality (single or composite).
  • IndexSeek: equality lookup on a B-tree secondary index.
  • IndexRangeSeek: bounded/ranged lookup on index prefix + next column range.
  • FtsScan: full-text path using MATCH ... AGAINST.
  • FullScan: fallback table scan.

Candidate Extraction from WHERE

Planner heuristics extract:

  • equalities (col = expr)
  • numeric ranges (<, <=, >, >=, BETWEEN)
  • full-text predicates (MATCH(...) AGAINST(...))

Selection order:

  1. If FTS predicate is present, choose FtsScan.
  2. If all PK columns are equality-constrained, choose PkSeek.
  3. Otherwise evaluate index candidates and pick minimum cost.
  4. If none matches, use FullScan.

Cost Model (Deterministic Heuristic)

plan_cost_hint_with_stats uses a stable heuristic (smaller is better):

  • PkSeek: 100 + est_rows
  • IndexSeek: 1500 - 300*key_parts + 3*est_rows
  • IndexRangeSeek: 1400 - 250*prefix_parts - 250*bound_terms + 3*est_rows
  • FtsScan: 2000 + 2*est_rows
  • FullScan: 3000 + 5*est_rows

Tie-break uses a stable string key, so identical inputs keep deterministic plans.
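The heuristic can be transcribed as a plain function (coefficients from the list above; the saturating subtraction is a defensive assumption of this sketch, not necessarily what the planner does):

```rust
/// Smaller cost is better; variant names mirror the Plan types above.
fn plan_cost(plan: &str, key_parts: u64, prefix_parts: u64, bound_terms: u64, est_rows: u64) -> u64 {
    match plan {
        "PkSeek" => 100 + est_rows,
        "IndexSeek" => (1500u64).saturating_sub(300 * key_parts) + 3 * est_rows,
        "IndexRangeSeek" => {
            (1400u64)
                .saturating_sub(250 * prefix_parts)
                .saturating_sub(250 * bound_terms)
                + 3 * est_rows
        }
        "FtsScan" => 2000 + 2 * est_rows,
        _ => 3000 + 5 * est_rows, // FullScan fallback
    }
}

fn main() {
    // With 100 estimated rows, a single-column IndexSeek beats a FullScan.
    assert!(plan_cost("IndexSeek", 1, 0, 0, 100) < plan_cost("FullScan", 0, 0, 0, 100));
    assert_eq!(plan_cost("PkSeek", 0, 0, 0, 1), 101);
}
```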

Row Estimation Inputs

Estimator uses:

  • table row count (TableDef.stats_row_count)
  • index distinct count and optional numeric histogram (IndexDef stats)
  • fallback defaults when stats are missing

ANALYZE TABLE persists these stats and improves plan quality.

Plan-to-Executor Mapping

Main dispatch happens in src/sql/executor/select_query.rs:

  • PkSeek: encode PK bytes and do one data B-tree lookup.
  • IndexSeek: encode index key, fetch matching PKs from index B-tree, then fetch rows from data B-tree.
  • IndexRangeSeek: range-scan index keys, then fetch rows by PK.
  • FtsScan: evaluate FTS postings and scoring, then materialize matching rows.
  • FullScan: iterate data B-tree and filter with WHERE.

For UPDATE / DELETE, planner is reused, then matching PKs are collected before mutation to avoid in-place scan mutation hazards.

JOIN Strategy

Join execution is currently nested loop (src/sql/executor/select_join.rs). For INNER / CROSS, loop order is chosen from estimated cardinality:

  • smaller side tends to be outer loop (choose_nested_loop_order).

EXPLAIN includes join-loop notes in Extra.

EXPLAIN Mapping

src/sql/executor/select_meta.rs maps plan to EXPLAIN fields:

  • access type: const, ref, range, fulltext, ALL
  • key: PRIMARY or chosen index name
  • rows: estimated rows
  • cost: heuristic planner cost
  • Extra: e.g. Using where, Using index, Using fulltext

This is a planner/debug aid, not a precise runtime profiler.

DDL Execution: ALTER TABLE

ALTER TABLE is executed in src/sql/executor/alter.rs and is mostly planner-independent.

Operation Dispatch

exec_alter_table(...) dispatches by AST operation:

  • ADD COLUMN
  • DROP COLUMN
  • MODIFY COLUMN
  • CHANGE COLUMN (rename + optional type/constraint change)

Fast Path vs Rewrite Path

Implementation uses two paths:

  • metadata-only (catalog update only): no row rewrite
  • full rewrite (scan + rebuild data B-tree): required when row bytes must change

Rules in current code:

  • ADD COLUMN is metadata-only.
  • DROP COLUMN always rewrites all rows.
  • MODIFY / CHANGE rewrites only when column type changes.
  • MODIFY / CHANGE without type change is metadata-only.

Safety Checks and Validation

Before applying metadata/rewrite:

  • adding PRIMARY KEY via ADD COLUMN is rejected
  • dropping a PK column is rejected
  • dropping a column referenced by any index is rejected
  • adding NOT NULL checks existing rows for NULL and fails if found
  • ADD COLUMN ... NOT NULL without DEFAULT fails on non-empty tables

Rewrite Algorithm (when triggered)

Rewrite path is:

  1. Scan old data B-tree and decode each row.
  2. Transform row shape/value (drop column or type coercion).
  3. Collect all old data-tree page IDs and free them.
  4. Create a new data B-tree root and reinsert transformed rows.
  5. Update TableDef.data_btree_root and persist catalog metadata.

Rewritten rows are stored in row format v1.

Unique Index Reconciliation

After MODIFY / CHANGE, reconcile_unique_index(...) adjusts single-column unique index state:

  • add UNIQUE: validate duplicates first, then create auto unique index
  • remove UNIQUE: drop corresponding unique index and free its pages

For CHANGE COLUMN, index metadata that references the old column name is renamed to the new name.

Cryptography

MuroDB encrypts all data at rest. This chapter explains the full encryption pipeline — from a user’s password to bytes on disk — and the rationale behind each design choice.

Overview

User's passphrase
    │
    ↓  Argon2id (memory-hard KDF) + salt
256-bit MasterKey (zeroized on drop)
    │
    ├──→ Page encryption  (AES-256-GCM-SIV, AAD = page_id || epoch)
    └──→ WAL encryption   (AES-256-GCM-SIV, AAD = lsn || 0)

FTS term blinding (HMAC-SHA256, DB-scoped term key)

The encryption system has four layers:

  1. Key derivation (KDF) — Derive a cryptographic key from the user’s passphrase
  2. Page encryption — Encrypt and authenticate each data page
  3. WAL encryption — Encrypt write-ahead log records
  4. FTS term blinding — Hide full-text search tokens on disk (separate from page/WAL encryption; see below)

Encryption Suites

Defined in src/crypto/suite.rs:

| Suite ID | Name | Purpose |
|---|---|---|
| 1 | aes256-gcm-siv | Default. Production use |
| 0 | plaintext | Testing/development. Explicit opt-in |

The suite ID is stored in plaintext in the DB header (bytes 68..72), allowing MuroDB to determine the encryption mode before the user provides a passphrase.

Key Derivation (KDF)

src/crypto/kdf.rs

Flow

passphrase (arbitrary-length bytes) + salt (16-byte random)
    │
    ↓  Argon2id
256-bit MasterKey
  • Argon2id: A memory-hard KDF. If a database file is stolen, the attacker must spend significant memory and CPU per password guess, making offline brute-force expensive.
  • Salt: Generated randomly at database creation, stored in plaintext in the DB header (bytes 12..28). The salt must be readable before the passphrase is provided.
  • MasterKey: Implements ZeroizeOnDrop — key material is securely erased from memory when dropped.

Why Argon2id?

Argon2id won the Password Hashing Competition and is the password hashing function recommended by OWASP. Its memory-hard design provides strong resistance against GPU/ASIC-accelerated brute-force attacks, outperforming bcrypt and PBKDF2 in this regard.

Page Encryption

src/crypto/aead.rs

On-Disk Format

Each 4096-byte page is individually encrypted and stored as:

nonce (12B) || ciphertext (4096B) || auth tag (16B)
──────────────────────────────────────────────────
                  total: 4124B
  • Nonce: 12 bytes, randomly generated for each encryption operation
  • Authentication tag: 16 bytes, detects tampering of both ciphertext and AAD

Why AES-256-GCM-SIV?

MuroDB uses AES-256-GCM-SIV rather than standard AES-GCM.

With standard AES-GCM, nonce reuse is catastrophic: it leaks the XOR of the plaintexts and can expose the authentication key, enabling forgeries. AES-GCM-SIV degrades gracefully under nonce reuse: it reveals only whether two plaintexts are identical, and authentication remains intact. For a storage engine where nonce-management bugs can have catastrophic consequences, this nonce-misuse resistance provides a significantly safer failure mode.

AAD (Additional Authenticated Data)

type PageId = u64; // page IDs are 8-byte little-endian on disk

fn build_aad(page_id: PageId, epoch: u64) -> [u8; 16] {
    let mut aad = [0u8; 16];
    aad[0..8].copy_from_slice(&page_id.to_le_bytes()); // 8 bytes: which page
    aad[8..16].copy_from_slice(&epoch.to_le_bytes());  // 8 bytes: which key generation
    aad
}

AAD is data that is not encrypted but is included in the authentication tag computation. If the AAD provided during decryption does not match what was used during encryption, authentication fails and decryption is rejected.

By including page_id and epoch in the AAD, MuroDB detects the following attacks:

| Attack | Detected by |
|---|---|
| Page swapping (inserting page 5’s data at page 10’s location) | page_id mismatch |
| Downgrade (injecting a page from an older backup) | epoch mismatch |

Even if the ciphertext itself is untouched, decryption is rejected when the context (which page, which generation) does not match.

WAL Encryption

src/wal/writer.rs

WAL (Write-Ahead Log) frames are encrypted using the same PageCipher.

Frame Format

[frame_len: u32] [encrypted payload]

The payload before encryption is:

record_bytes || CRC32(record_bytes)

AAD Differences from Page Encryption

For WAL frames, the AEAD parameters differ:

  • page_id parameter → LSN (Log Sequence Number)
  • epoch parameter → always 0

Since LSN increases monotonically within a WAL session, this prevents swapping ciphertext between different WAL frames.

Note: After a checkpoint, the WAL is truncated and LSN resets to 0. This means a new WAL session may reuse the same LSN values as a previous session. This is not a vulnerability because each encryption uses a fresh random nonce, and AES-GCM-SIV’s nonce-misuse resistance provides an additional safety margin.

FTS Term Blinding

src/crypto/hmac_util.rs

Full-text search (FTS) stores bigram tokens in a B-tree index. Storing tokens as plaintext would directly expose search terms on disk. MuroDB blinds tokens with HMAC-SHA256:

term_id = HMAC-SHA256(term_key, "tokyo")  →  32-byte hash
  • Deterministic: The same token always produces the same term_id, enabling search
  • One-way: Recovering the original token from a term_id is computationally infeasible

Only hash values are stored on disk — no plaintext search terms ever reach the storage layer.

Important caveats:

  • In encrypted mode, term_key is derived from MasterKey + DB salt, so term IDs are database-scoped and secret-dependent.
  • In plaintext mode, term_key is derived from a public label + DB salt. This still hides raw tokens in casual inspection, but does not provide strong secrecy against offline dictionary guessing.
  • Legacy databases may still contain historical FTS key metadata. Open-time migration/backfill logic keeps them readable.

Key Rotation (Epoch)

When the user changes their passphrase, MuroDB re-encrypts all pages with the new key. The epoch — a u64 counter stored in the DB header — coordinates this process.

Rotation Flow

1. Derive a new MasterKey from the new passphrase
2. Increment epoch (e.g., 2 → 3)
3. Re-encrypt every page with the new key and new epoch
4. Update salt and epoch in the DB header

Why Full Re-encryption Instead of KEK?

Large-scale databases often use a KEK (Key Encryption Key) architecture: a master key encrypts per-page Data Encryption Keys (DEKs), and rotation only re-wraps the DEKs without touching the data. This avoids the cost of re-encrypting all data.

MuroDB is an embedded database. Typical data sizes range from a few MB to a few hundred MB. Full re-encryption completes in seconds, so the complexity of KEK — an extra DEK management layer, more complex crash recovery, increased attack surface — is not justified.

Full re-encryption also provides a security advantage that KEK lacks: all data is actually protected by the new key. If the old key is compromised, no data encrypted under it remains on disk.

Epoch-Based Attack Detection

Because epoch is included in the AAD, an attacker who extracts a page from an epoch=2 backup and inserts it into an epoch=3 database will trigger an authentication tag verification failure. The page is rejected without revealing any data.

In-Memory Key Protection

  • MasterKey implements ZeroizeOnDrop: memory is zeroed on drop
  • Key material is process-local (no external KMS/HSM dependency)

Non-Goals

The following are explicitly out of scope for MuroDB’s encryption design:

  • Traffic encryption: MuroDB is an embedded database with no network protocol
  • Access-pattern hiding: No ORAM or similar oblivious access mechanisms
  • HSM/KMS integration: Key material is process-local

WAL & Crash Resilience

MuroDB uses a write-ahead log (.wal) for crash recovery. All commits are WAL-first: durable intent is recorded before data-file flush.

.wal Binary Layout

WAL constants are in src/wal/mod.rs:

  • magic: "MUROWAL1" (8 bytes)
  • version: u32 (current 1)
  • header size: 12 bytes

File layout:

  1. Header: [magic:8][version:4]
  2. Repeating frames:
    • [frame_len: u32]
    • [encrypted_payload: frame_len bytes]

frame_len is bounded by MAX_WAL_FRAME_LEN (PAGE_SIZE + 1024).

Encrypted payload format before encryption (src/wal/writer.rs):

  • record_bytes = WalRecord::serialize(...)
  • payload = record_bytes || crc32(record_bytes)

Encryption uses PageCipher; frame nonce context is (lsn, 0).
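
The framing described above can be sketched in a few lines. This is an illustrative stand-alone version: `crc32` here is the standard bitwise IEEE CRC-32, and the little-endian placement of the checksum is an assumption for the sketch:

```rust
/// Bitwise CRC-32 (IEEE, reflected), table-less for brevity.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if LSB set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// payload = record_bytes || crc32(record_bytes)
fn build_payload(record_bytes: &[u8]) -> Vec<u8> {
    let mut payload = record_bytes.to_vec();
    payload.extend_from_slice(&crc32(record_bytes).to_le_bytes());
    payload
}

/// Returns the record if the trailing checksum matches, else None.
fn validate_payload(payload: &[u8]) -> Option<&[u8]> {
    if payload.len() < 4 {
        return None;
    }
    let (record, tail) = payload.split_at(payload.len() - 4);
    let stored = u32::from_le_bytes(tail.try_into().ok()?);
    (crc32(record) == stored).then_some(record)
}
```

In the real write path this payload is then encrypted with PageCipher before the `[frame_len: u32]` prefix is written.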

WAL Record Types

WalRecord (src/wal/record.rs) variants:

| Record | Payload |
|---|---|
| Begin | txid |
| PagePut | txid, page_id, full page image bytes |
| MetaUpdate | txid, catalog_root, page_count, freelist_page_id, epoch |
| Commit | txid, lsn |
| Abort | txid |

Record tags on wire:

  • 1=Begin, 2=PagePut, 3=Commit, 4=Abort, 5=MetaUpdate

Write Path

Read-Only Query Path (Database::query)

Database::query(sql):

  1. Acquire shared lock
  2. Parse/validate read-only statement
  3. Execute directly on pager/catalog (no implicit WAL transaction)

query is a &mut self API because the session may refresh pager/catalog state from disk before read execution. For concurrent readers in one process, open additional read-only handles (Database::open_reader) and query from each handle.

If an explicit transaction is active, read statements are executed in the transaction context (execute_in_tx) so uncommitted writes remain visible to that session.

Auto-Commit Mode (no explicit BEGIN)

Session::execute_auto_commit(stmt):

  1. Create implicit transaction + dirty-page buffer.
  2. Execute statement against transactional page store.
  3. tx.commit(...) writes:
    • Begin
    • all dirty PagePut
    • freelist PagePut pages (if needed)
    • MetaUpdate
    • Commit
  4. wal.sync() (fsync) establishes durability boundary.
  5. Flush pages + metadata to main DB file.

Explicit Transaction (BEGIN … COMMIT)

Explicit transaction (BEGIN ... COMMIT) follows the same commit primitive. ROLLBACK discards dirty state without WAL append (rollback_no_wal in session path).

Commit Point

Durability commit point is WAL fsync:

  • before wal.sync(): commit may be lost on crash
  • after wal.sync(): commit must be recoverable even if DB flush fails

If post-sync DB flush fails, transaction returns CommitInDoubt, session is poisoned, and next open recovers from WAL.

Recovery (Database::open)

Database::open(path, master_key)
  1. If WAL file exists, run recovery::recover()
     → Scan WAL and validate per-tx state machine
       (Begin -> PagePut/MetaUpdate* -> Commit/Abort)
     → Collect latest page images from committed transactions
     → Replay to data file
  2. Truncate WAL file (empty it)
     → fsync WAL file
     → best-effort fsync parent directory
  3. Build Session with Pager + Catalog + WalWriter

Validation is implemented in src/wal/recovery.rs with explicit skip/error codes.
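
The per-transaction state machine can be sketched as follows. The state names and record tags are illustrative; the real validation and its skip/error codes live in src/wal/recovery.rs:

```rust
// Illustrative per-transaction state machine for WAL validation.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TxState { Unknown, Active, MetaSeen, Committed, Aborted }

fn step(state: TxState, record: &str) -> Result<TxState, &'static str> {
    use TxState::*;
    match (state, record) {
        (Unknown, "Begin") => Ok(Active),
        (Active | MetaSeen, "PagePut") => Ok(state),
        (Active | MetaSeen, "MetaUpdate") => Ok(MetaSeen),
        (MetaSeen, "Commit") => Ok(Committed), // Commit requires a prior MetaUpdate
        (Active | MetaSeen, "Abort") => Ok(Aborted),
        (Unknown, _) => Err("record before Begin"),
        (Committed | Aborted, _) => Err("record after terminal"),
        _ => Err("invalid transition"),
    }
}
```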

Recovery Modes

  • strict (default): Fails on any WAL protocol violation
  • permissive: Skips invalid transactions, recovers only valid committed ones

See Recovery for user-facing documentation.

In permissive mode, if invalid transactions were skipped, WAL can be quarantined (*.quarantine.<ts>.<pid>) before reopening a clean WAL stream.

Inspect-WAL JSON Contract

murodb-wal-inspect --format json returns machine-readable diagnostics with a stable schema contract:

  • schema_version=1 for the current contract
  • status: ok / warning / fatal
  • exit_code: mirrors CLI exit code semantics (0, 10, 20)
  • skipped[].code: stable machine-readable skip classification
  • On fatal failures, fatal_error and fatal_error_code are included

Secondary Index Consistency

All index updates happen within the same transaction as the data update:

INSERT

  1. Check UNIQUE constraints before any insertion
  2. Insert row into data B-tree
  3. Insert entry into each secondary index (column_value → PK)

DELETE

  1. Scan for rows to delete (collect PK + all column values)
  2. Delete entries from each secondary index
  3. Delete row from data B-tree

UPDATE

  1. Scan for rows to update (collect PK + old column values)
  2. Compute new values
  3. Check UNIQUE constraints (for changed values)
  4. Update secondary indexes (delete old entry + insert new entry)
  5. Write new row data to data B-tree

Remaining Constraints

fsync granularity

Pager::write_page_to_disk() does not call sync_all() individually. Only flush_meta() calls sync_all(). WAL sync() guarantees data durability, so this is safe in normal operation.

allocate_page counter

Pager::allocate_page() increments in-memory page_count, which is not persisted until flush_meta() after WAL commit.

WAL file size

After successful commits and explicit ROLLBACK, the Session auto-checkpoints the WAL according to policy. Checkpoint is best-effort and does not affect commit success.

The default policy is per-transaction (MURODB_CHECKPOINT_TX_THRESHOLD=1); it can be tuned with:

  • MURODB_CHECKPOINT_TX_THRESHOLD
  • MURODB_CHECKPOINT_WAL_BYTES_THRESHOLD
  • MURODB_CHECKPOINT_INTERVAL_MS

The same knobs are available as session-scoped SQL runtime options:

  • SET checkpoint_tx_threshold = <u64>
  • SET checkpoint_wal_bytes_threshold = <u64>
  • SET checkpoint_interval_ms = <u64>

When checkpoint truncate fails, MuroDB emits a warning with wal_path and wal_size_bytes so operators can detect and triage WAL growth.

TLA+ Correspondence

See Formal Verification for the TLA+ model and its mapping to implementation.

| TLA+ Intent | Implementation | Regression Test |
|---|---|---|
| Only valid state transitions are recovered | State transition validation in recovery.rs | test_recovery_rejects_pageput_before_begin |
| Commit/Abort is terminal | Reject duplicate terminal / post-terminal records | test_recovery_rejects_duplicate_terminal_record_for_tx |
| Commit has consistent terminal info | Validate Commit.lsn == actual LSN | test_recovery_rejects_commit_lsn_mismatch |
| Commit requires metadata | Reject Commit without MetaUpdate | test_recovery_rejects_commit_without_meta_update |
| PagePut matches target page | Validate PagePut.page_id vs page header | test_recovery_rejects_pageput_page_id_mismatch |
| Tail corruption tolerated, mid-log rejected | Reader tolerates tail only | test_tail_truncation_tolerated, test_mid_log_corruption_is_error |
| Oversized frames handled safely | Frame length limit in Reader/Writer | test_oversized_tail_frame_tolerated |
| Freelist recovered from committed MetaUpdate | freelist_page_id in WAL MetaUpdate | test_freelist_wal_recovery |

Durability Matrix

This page describes the exact guarantees MuroDB provides at each stage of the commit pipeline and the post-crash outcome when a failure occurs at each step.

Commit Pipeline

The commit flow proceeds through these ordered steps:

1. WAL: Begin record
2. WAL: PagePut records (one per dirty page)
3. WAL: PagePut for freelist page
4. WAL: MetaUpdate record (catalog_root, page_count, freelist_page_id)
5. WAL: Commit record
6. WAL: fsync            ← COMMIT POINT (durability boundary)
7. Data file: write dirty pages
8. Data file: flush_meta (fsync)
9. WAL: checkpoint_truncate (fsync + directory fsync)

The commit point is step 6 (WAL fsync). Once wal.sync() returns successfully, the transaction is durable. Steps 7-9 are performance optimizations that apply the committed data to the main database file; if they fail, WAL recovery replays the committed transaction on next open.

Fsync Points

| Step | Fsync Target | What It Protects |
|---|---|---|
| WAL sync (step 6) | WAL file | All WAL records for the transaction reach stable storage. This is the commit point. |
| flush_meta (step 8) | Data file | Page data and metadata (catalog_root, page_count, freelist_page_id) are persisted to the main DB file. |
| checkpoint_truncate (step 9) | WAL file + directory | WAL is truncated to header-only. Directory fsync hardens the metadata change. |

Crash-at-Each-Step Outcome Matrix

| Crash Point | WAL State | Committed? | Post-Recovery Outcome |
|---|---|---|---|
| After Begin (step 1) | Begin only | No | Transaction discarded. Prior committed data intact. |
| After PagePut (step 2) | Begin + PagePut(s) | No | Transaction discarded. No pages applied. |
| After freelist PagePut (step 3) | Begin + PagePut(s) + freelist | No | Transaction discarded. Freelist unchanged. |
| After MetaUpdate (step 4) | Begin + PagePut(s) + MetaUpdate | No | Transaction discarded. Metadata unchanged. |
| After Commit record (step 5) | Complete WAL sequence | No | WAL not fsynced; the OS may or may not have flushed buffers to disk. If records survived: recovery replays. If not: transaction lost (no durability guarantee without fsync). |
| After WAL sync (step 6) | Complete + fsynced | Yes | Recovery replays committed pages and metadata to data file. |
| After page writes (step 7) | Complete + fsynced | Yes | Some or all pages written. Recovery replays any missing pages idempotently. |
| After flush_meta (step 8) | Complete + fsynced | Yes | Data file fully consistent. WAL replay is idempotent (re-applying same pages is safe). |
| After checkpoint_truncate (step 9) | Truncated | Yes | WAL is empty. Data file is self-consistent. Normal operation resumes. |

Post-WAL-Sync Failures (CommitInDoubt)

When steps 7 or 8 fail after the WAL has been synced, the commit is durable in the WAL but the in-process session cannot confirm it succeeded on the data file. MuroDB handles this as follows:

  1. Transaction::commit() returns Err(CommitInDoubt).
  2. The session is poisoned: all subsequent operations return SessionPoisoned.
  3. On reopen, WAL recovery replays the committed transaction, converging to the correct state.

This design ensures that a durable commit is never lost, even if the process crashes or encounters I/O errors after the commit point.

Checkpoint Truncate Failure

If checkpoint_truncate() fails (step 9), the WAL retains committed records. On next open, recovery replays them idempotently. The data file already has the correct state (from steps 7-8), so replay simply overwrites pages with identical content. WAL growth is the only concern; monitor WAL file size on disk and failed_checkpoints via SHOW DATABASE STATS.

Idempotent Recovery

WAL recovery is designed to be idempotent:

  • Running recover() multiple times on the same WAL produces identical database state.
  • page_count only increases, never decreases, during recovery.
  • Page data is overwritten with the WAL image regardless of current content.
  • Metadata (catalog_root, freelist_page_id) is set to the last committed values.

This property is critical for crash-during-recovery scenarios: if the process crashes during recovery, the next recovery attempt produces the same result.

Torn WAL Tail

A crash during WAL writes can leave a partially-written frame at the end of the WAL file. MuroDB’s WAL reader handles this gracefully:

  • Truncated frame: A frame header claiming more bytes than remain in the file is treated as end-of-log.
  • Garbage bytes: Bytes that fail decryption or CRC validation at the tail are ignored.
  • Zero-filled tail: Zero bytes (from filesystem pre-allocation) produce a zero frame length, treated as end-of-log.
  • Mid-log corruption: Corruption followed by valid frames is a hard error (prevents silent data loss).

The WAL reader uses a two-layer tail detection heuristic:

  1. Structural check: Is the next frame structurally plausible (non-zero length, fits in file)?
  2. Content probe: Even if structurally plausible, can any following frame be successfully decrypted and CRC-validated?

If both checks indicate no valid data follows the corrupt frame, it is treated as tail garbage and ignored.
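
The structural half of the check can be sketched as a pure predicate (names are illustrative; only the frame-length bound comes from the source):

```rust
// Structural plausibility of the next WAL frame header.
const PAGE_SIZE: u32 = 4096;
const MAX_WAL_FRAME_LEN: u32 = PAGE_SIZE + 1024;

fn frame_structurally_plausible(frame_len: u32, bytes_remaining: u64) -> bool {
    frame_len != 0                                 // zero-filled tail => end-of-log
        && frame_len <= MAX_WAL_FRAME_LEN          // oversized header => garbage
        && u64::from(frame_len) <= bytes_remaining // truncated frame => end-of-log
}
```

A frame that passes this check still has to survive the content probe (decrypt + CRC) before it is replayed.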

FTS Internals

Tokenization

  • Normalization: NFKC unicode normalization
  • Tokenizer: Bigram (n=2) - each text is split into overlapping 2-character sequences
  • Example: “東京タワー” → [“東京”, “京タ”, “タワ”, “ワー”]
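
A minimal bigram tokenizer looks like this; NFKC normalization is omitted because it requires an external crate (MuroDB normalizes before tokenizing):

```rust
/// Split text into overlapping 2-character (bigram) tokens.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // Each window of 2 consecutive chars becomes one token.
    chars.windows(2).map(|w| w.iter().collect()).collect()
}
```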

Term ID Blinding

Term IDs are computed using HMAC-SHA256:

  • No plaintext tokens are stored on disk
  • Term ID = HMAC-SHA256(term_key, normalized_token), where term_key is derived from the master key and DB salt (see Cryptography)
  • This provides privacy: the disk contents do not reveal what terms are indexed

Postings Storage

Postings lists are stored in B-tree with compression:

  • Delta encoding: Document IDs are stored as deltas from the previous ID
  • Varint compression: Deltas are encoded as variable-length integers
  • Postings are stored in the same B-tree infrastructure as regular data
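
A stand-alone sketch of delta + varint coding for a sorted postings list (LEB128-style continuation bits are assumed here; the actual wire format may differ):

```rust
/// Delta-encode sorted doc IDs, then varint-encode each delta.
fn encode_postings(doc_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &id in doc_ids {
        let mut delta = id - prev; // assumes ids sorted ascending
        prev = id;
        loop {
            let byte = (delta & 0x7F) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit
        }
    }
    out
}

/// Inverse: varint-decode deltas and re-accumulate doc IDs.
fn decode_postings(bytes: &[u8]) -> Vec<u64> {
    let (mut out, mut prev) = (Vec::new(), 0u64);
    let (mut delta, mut shift) = (0u64, 0);
    for &b in bytes {
        delta |= u64::from(b & 0x7F) << shift;
        if b & 0x80 == 0 {
            prev += delta;
            out.push(prev);
            (delta, shift) = (0, 0);
        } else {
            shift += 7;
        }
    }
    out
}
```

Small deltas dominate in dense postings lists, so most entries fit in a single byte.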

Scoring

  • Algorithm: BM25 (Okapi BM25)
  • Used in NATURAL LANGUAGE MODE for relevance ranking
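
For reference, the per-term Okapi BM25 contribution with the conventional k1 = 1.2 and b = 0.75 defaults (MuroDB's exact constants and IDF variant are assumptions in this sketch):

```rust
/// BM25 score contribution of one term in one document.
/// tf = term frequency in doc, df = docs containing term,
/// n_docs = total docs, doc_len / avg_doc_len in tokens.
fn bm25_term(tf: f64, df: f64, n_docs: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    // +1.0 inside the log keeps IDF positive for very common terms.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * doc_len / avg_doc_len))
}
```

The document's score for a query is the sum of these contributions over its matching terms.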

Phrase Matching

Phrase queries (e.g., "東京タワー") verify consecutive bigram positions:

  1. Tokenize the phrase into bigrams
  2. Find postings for each bigram
  3. Verify that positions are consecutive across all bigrams

Snippet Generation

fts_snippet() uses a local scan approach with a UTF-8 offset map:

  1. Find matching positions in the document
  2. Build a char<->byte offset map for normalized text
  3. Convert match byte offsets to char windows via binary search
  4. Slice and apply highlight tags (open/close) around matched regions
  5. Truncate to the specified maximum length

Memory note:

  • Offset map size is bounded by (normalized_chars + 1) * sizeof(usize) bytes per call.
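
A minimal version of the offset map and its binary-search lookup (illustrative; the real implementation also applies highlight tags and truncation):

```rust
/// Map from char index -> byte offset, with a sentinel one past the end.
fn char_byte_map(s: &str) -> Vec<usize> {
    let mut map: Vec<usize> = s.char_indices().map(|(b, _)| b).collect();
    map.push(s.len()); // sentinel: total byte length
    map
}

/// Convert a byte offset back to a char index via binary search.
fn byte_to_char(map: &[usize], byte_off: usize) -> usize {
    match map.binary_search(&byte_off) {
        Ok(i) => i,
        Err(i) => i - 1, // offset inside a multi-byte char: round down
    }
}
```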

Format Migration

Current Policy (as of 2026-02-22)

MuroDB supports database format v4 only.

  • Opening v4 works.
  • Opening v1/v2/v3 is rejected.
  • Opening future versions (>v4) is also rejected.

This project currently has no production users on pre-v4 formats, so compatibility-migration code is intentionally removed to keep core storage logic simple and safer.

v4 Header Layout

Magic (8B) + Version (4B) + Salt (16B) + CatalogRoot (8B)
+ PageCount (8B) + Epoch (8B) + FreelistPageId (8B)
+ NextTxId (8B) + EncryptionSuiteId (4B) + CRC32 (4B)
  • Header size: 76 bytes
  • CRC32 covers bytes 0..72

Rejection Behavior

Opening an unsupported version returns:

WAL error: unsupported database format version N

WAL Format Version History

| Version | Changes |
|---|---|
| v1 | Initial: Begin, PagePut, MetaUpdate(catalog_root, page_count), Commit, Abort |
| v2 | MetaUpdate adds freelist_page_id field. Legacy v1 MetaUpdate (25 bytes) decoded with freelist_page_id=0 (backward compatible) |
| v3 | MetaUpdate adds epoch field. Legacy v1/v2 MetaUpdate records decode with epoch=0 (backward compatible) |

Backup

Overview

MuroDB provides a Database::backup() API for creating consistent, point-in-time snapshots of a running database. The backup file is a fully valid MuroDB database file that can be opened directly with the same key or password.

Design

Consistency Model

The backup API uses a write-lock + WAL checkpoint + byte-copy approach:

  1. Write lock: Acquires an exclusive lock to prevent concurrent writes during the backup window.
  2. WAL checkpoint: Calls checkpoint_truncate() to flush all committed WAL records into the main data file and truncate the WAL. This ensures the data file is self-consistent.
  3. Byte-copy: Copies the plaintext header (76 bytes) and all encrypted pages as raw bytes from the source file to the destination. No decryption or re-encryption occurs.
  4. fsync: The destination file is fsynced before the lock is released, ensuring durability.

What Gets Copied

The backup copies exactly HEADER_SIZE + page_count * page_size_on_disk bytes:

  • Header (76 bytes): Magic, format version, salt, catalog root, page count, epoch, freelist page ID, next TxId, encryption suite ID, CRC32.
  • Pages: All pages in their on-disk (encrypted) form, including data pages, B-tree nodes, freelist pages, and FTS posting pages.

The WAL file is not included in the backup because the checkpoint step ensures all committed data is already in the main file.

Concurrency

  • During backup, writers are blocked (exclusive lock held).
  • Backup is a sequential byte copy, so the write-stall window is proportional to database size but typically short.
  • After the backup completes, normal read/write operations resume.

Usage

Rust API

use murodb::Database;

let mut db = Database::open(path, &master_key)?;

// ... normal operations ...

// Create a backup
db.backup("backup.db")?;

Restore

Restoring from a backup is a file-level operation:

  1. Stop the application (or close the Database handle).
  2. Replace the database file with the backup file.
  3. Delete the WAL file (.wal suffix) if present — the backup has no pending WAL.
  4. Reopen the database.
// The backup file is a valid MuroDB database
let mut restored = Database::open("backup.db", &master_key)?;

Constraints

| Constraint | Detail |
|---|---|
| Disk space | Requires free space equal to the full database size. |
| Write stall | Writers are blocked for the duration of the copy. |
| WAL | WAL is checkpointed and truncated before copy; not included in backup. |
| Encryption | Backup preserves the same encryption suite, key, and salt. |
| Atomicity | If the backup process crashes mid-copy, the destination file may be partial/corrupt. The source database is unaffected. |

Formal Verification

MuroDB uses TLA+ to formally verify crash/recovery protocol invariants.

Files

| File | Description |
|---|---|
| specs/tla/CrashResilience.tla | System model |
| specs/tla/CrashResilience.cfg | TLC configuration (small finite state space) |
| specs/tla/CrashResilience.large.cfg | Larger state space for deeper checking |
| specs/tla/run_tlc.sh | Helper script to run TLC |

What is modeled

  • Transaction lifecycle: BeginTx, WritePage, SetMeta, DurableCommit
  • Partial flush before crash: FlushSomeCommitted
  • Crash and recovery: Crash, Recover
  • WAL replay semantics at transaction granularity:
    • Committed writes are recovered
    • Uncommitted writes are ignored
    • Metadata (catalogRoot, pageCount) is recovered from committed records

Checked invariants

| Invariant | Description |
|---|---|
| TypeInv | Basic type safety of all state variables |
| RecoveredSound | After recovery, DB state equals replayed committed WAL state |
| NoUncommittedInfluence | Uncommitted transactions do not influence recovered state |
| CommitRequiresMeta | Committed transactions always have metadata update |
| UniqueCommittedOrder | Each transaction appears at most once in commit order |
| FreelistPreserved | After recovery, freelist ID equals the last committed freelist ID |

Running TLC

Prerequisites

  • Java runtime
  • tla2tools.jar (TLC model checker)

Using Make

make tlc-tools   # Download tla2tools.jar
make tlc         # Run with small config (fast)
make tlc-large   # Run with large config (deeper)

Manual

export TLA2TOOLS_JAR=/path/to/tla2tools.jar
./specs/tla/run_tlc.sh

Or with the tlc2 command:

tlc2 -config specs/tla/CrashResilience.cfg specs/tla/CrashResilience.tla

Scope and Limitations

  • This is an abstract model, not byte-level WAL frame parsing
  • It does not model OS/filesystem durability anomalies directly
  • It validates protocol-level invariants and crash/recovery semantics

Correspondence with Implementation

See the WAL & Crash Resilience page for a detailed mapping between TLA+ invariants and their implementation in code.

Roadmap

Implemented

  • Basic CRUD (INSERT, SELECT, UPDATE, DELETE)
  • CREATE TABLE (PRIMARY KEY, UNIQUE, NOT NULL)
  • CREATE INDEX / CREATE UNIQUE INDEX (single column)
  • CREATE FULLTEXT INDEX (bigram, BM25, NATURAL/BOOLEAN mode, snippet)
  • MySQL-compatible integer types (TINYINT, SMALLINT, INT, BIGINT)
  • VARCHAR(n), VARBINARY(n), TEXT with size validation
  • UUID type (16-byte native, UUID_V4/UUID_V7 generation)
  • Hex literal (X'...') for VARBINARY data
  • WHERE with comparison operators (=, !=, <, >, <=, >=)
  • AND, OR logical operators
  • ORDER BY (ASC/DESC, multi-column), LIMIT
  • JOIN (INNER, LEFT, CROSS) with table aliases
  • BEGIN / COMMIT / ROLLBACK
  • SHOW TABLES
  • Multi-row INSERT
  • Hidden _rowid auto-generation for tables without explicit PK
  • AES-256-GCM-SIV encryption, Argon2 KDF
  • WAL-based crash recovery
  • CLI with REPL
  • DROP TABLE / DROP TABLE IF EXISTS
  • DROP INDEX
  • IF NOT EXISTS for CREATE TABLE / CREATE INDEX
  • SHOW CREATE TABLE
  • DESCRIBE / DESC table
  • LIKE / NOT LIKE (% and _ wildcards)
  • IN (value list)
  • BETWEEN … AND …
  • IS NULL / IS NOT NULL
  • NOT operator (general)
  • OFFSET (SELECT … LIMIT n OFFSET m)
  • DEFAULT column values
  • AUTO_INCREMENT
  • Arithmetic operators in expressions (+, -, *, /, %)
  • BOOLEAN type (alias for TINYINT)
  • CHECK constraint

Phase 2 — Built-in Functions ✓

MySQL-compatible scalar functions.

  • String: LENGTH, CHAR_LENGTH, CONCAT, SUBSTRING/SUBSTR, UPPER, LOWER
  • String: TRIM, LTRIM, RTRIM, REPLACE, REVERSE, REPEAT
  • String: LEFT, RIGHT, LPAD, RPAD, INSTR/LOCATE
  • String: REGEXP / REGEXP_LIKE
  • Numeric: ABS, CEIL/CEILING, FLOOR, ROUND, MOD, POWER/POW
  • NULL handling: COALESCE, IFNULL, NULLIF, IF
  • Type conversion: CAST(expr AS type)
  • CASE WHEN … THEN … ELSE … END

Phase 3 — Aggregation & Grouping ✓

  • COUNT, SUM, AVG, MIN, MAX
  • COUNT(DISTINCT …)
  • GROUP BY (single and multiple columns)
  • HAVING
  • SELECT DISTINCT

Phase 4 — Schema Evolution ✓

  • ALTER TABLE ADD COLUMN
  • ALTER TABLE DROP COLUMN
  • ALTER TABLE MODIFY COLUMN / CHANGE COLUMN
  • RENAME TABLE
  • Composite PRIMARY KEY
  • Composite UNIQUE / composite INDEX

Phase 5 — Advanced Query ✓

  • Subqueries (WHERE col IN (SELECT …), scalar subquery)
  • UNION / UNION ALL
  • EXISTS / NOT EXISTS
  • INSERT … ON DUPLICATE KEY UPDATE
  • REPLACE INTO
  • EXPLAIN (query plan display)
  • RIGHT JOIN
  • Shared-lock read path (Database::query) with CLI auto routing

Phase 6 — Types & Storage

  • FLOAT / DOUBLE
  • DATE, DATETIME, TIMESTAMP
    • Scope: fully align parser/executor/CAST/default/literal behavior and edge-case validation.
    • Done when:
      • Temporal literals and string casts behave consistently across INSERT/UPDATE/WHERE.
      • Arithmetic and comparison semantics are defined/documented for mixed temporal expressions.
      • Timezone handling policy is explicit (especially TIMESTAMP input/output normalization).
      • Invalid dates/times reject with deterministic errors.
  • Date/time functions: NOW, CURRENT_TIMESTAMP, DATE_FORMAT, etc.
  • UUID type with UUID_V4() and UUID_V7() generation functions
  • DECIMAL(p,s) / NUMERIC(p,s) fixed-point exact numeric type
    • 96-bit mantissa via rust_decimal, precision 1-28, 16-byte storage
    • Full arithmetic, comparison, CAST, aggregation (SUM/AVG/MIN/MAX), ORDER BY, GROUP BY, INDEX support
    • MySQL-compatible: NUMERIC alias, default DECIMAL(10,0), DECIMAL+INT→DECIMAL, DECIMAL+FLOAT→FLOAT
  • BLOB (skipped for now)
    • Decision (2026-02-22): defer and move focus to Phase 7 performance work.
    • Why skipped now:
      • Current product priorities are query/index performance and planner improvements, not large-object type expansion.
      • BLOB adds non-trivial storage/operational surface area (limits, indexing semantics, comparison behavior) with low near-term user impact.
      • Existing VARBINARY(n)/TEXT coverage is sufficient for current workloads.
    • Revisit when:
      • There is a concrete workload requiring large binary payloads that cannot be handled acceptably by current types.
      • The performance roadmap items in Phase 7 are complete or no longer the bottleneck.
  • Overflow pages (posting list > 4096B)
    • Scope: support values/postings that exceed single-page capacity.
    • Progress:
      • Implemented FTS segment overflow chains (__segovf__) with typed page format (OFG1).
      • Read/write/delete + vacuum path now reclaims overflow pages without orphaning.
      • Covered by unit/integration tests (cargo test green as of 2026-02-22).
      • Added WAL recovery integration tests for overflow chains (torn WAL tail and post-sync partial-write replay paths).
      • Benchmarked on 2026-02-22 (murodb_bench, commit 829ad18145c2) with no severe small-record regression signal.
      • Implemented B-tree value overflow pages (2026-02-23): large row values (>~4073 bytes) now spill to overflow page chains transparently. Format version bumped to 5 (backward-compatible with v4).
    • Done when:
      • Overflow chain format is versioned and crash-safe.
      • WAL/recovery covers partial-write and torn-tail scenarios for overflow chains.
      • Vacuum/reclaim path correctly frees overflow pages without orphaning.
      • Benchmarks show no severe regressions for small records.

Phase 7 — Performance & Internals

  • Auto-checkpoint (threshold-based WAL)
  • Composite index range scan
    • Progress:
      • Added planner/executor support for composite-index range seek on the last key part (e.g. (a,b) with a = ? and b range).
      • EXPLAIN now reports type=range for this access path.
      • EXPLAIN now reports estimated cardinality via rows.
    • Done when:
      • Multi-column prefix ranges ((a,b) with predicates on a, optional range on b) use index scan.
      • EXPLAIN shows index-range choice and estimated cardinality.
      • Fallback path remains correct for unsupported predicate shapes.
  • Query optimizer improvements (cost-based)
    • Progress:
      • Added deterministic heuristic cost hints for PkSeek / IndexSeek / IndexRangeSeek / FullScan.
      • Planner now compares index candidates by cost instead of choosing the first matching index.
      • EXPLAIN now reports a cost column for the chosen plan.
      • Added persisted stats via ANALYZE TABLE (table_rows, index_distinct_keys) in catalog metadata.
      • EXPLAIN row estimation now prefers persisted table_rows when available.
      • Planner cost model now incorporates persisted table_rows/index_distinct_keys when available, with conservative fallback selectivity when stats are missing.
      • EXPLAIN rows/cost estimates now use the same planner estimation logic (with table-row fallback), so estimates reflect planner tradeoffs.
      • JOIN loop-order choice for INNER/CROSS now uses planner-side estimated row counts (stats-aware with runtime fallback) and keeps row shape (left + right) stable.
      • ANALYZE TABLE now persists numeric min/max bounds and equal-width histogram bins for single-column numeric B-tree indexes; range row estimation uses these stats when available.
      • EXPLAIN for JOIN now reports nested-loop outer-side choice with estimated left/right row counts in Extra.
    • Done when:
      • Planner compares at least full-scan vs single-index vs join-order alternatives.
      • Basic column stats/histograms are persisted and refreshable.
      • Plan choice is deterministic under identical stats.
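The stats-aware choice with conservative fallback described above can be sketched in a few lines: estimate rows from table_rows / index_distinct_keys when stats exist, fall back to a fixed selectivity when they don't, and pick the cheapest path deterministically. The cost weights and the 10% fallback selectivity are illustrative assumptions, not MuroDB's actual cost model.

```rust
#[derive(Debug, PartialEq)]
enum AccessPath {
    FullScan,
    IndexSeek,
}

/// Estimated rows for an equality seek: table_rows / distinct_keys when
/// ANALYZE-style stats exist, otherwise a conservative fixed selectivity.
fn est_seek_rows(table_rows: f64, distinct_keys: Option<f64>) -> f64 {
    match distinct_keys {
        Some(d) if d > 0.0 => table_rows / d,
        _ => table_rows * 0.10, // assumed fallback selectivity when stats are missing
    }
}

/// Compare candidate paths by estimated cost. Weights are assumptions:
/// a random index seek touches pages out of order, so each estimated row
/// costs more than a sequential scan row.
fn choose(table_rows: f64, distinct_keys: Option<f64>) -> AccessPath {
    let full_scan_cost = table_rows * 1.0;
    let index_seek_cost = est_seek_rows(table_rows, distinct_keys) * 4.0;
    if index_seek_cost < full_scan_cost {
        AccessPath::IndexSeek
    } else {
        AccessPath::FullScan
    }
}

fn main() {
    // High-cardinality index: the seek returns few rows and wins.
    assert_eq!(choose(100_000.0, Some(10_000.0)), AccessPath::IndexSeek);
    // Low-cardinality index (2 distinct keys): random seeks cost more
    // than just scanning, so the full scan wins.
    assert_eq!(choose(100_000.0, Some(2.0)), AccessPath::FullScan);
}
```

Because the inputs are only persisted stats and fixed constants, identical stats always yield the same plan, which is the determinism criterion above.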
  • FTS stop-ngram filtering
    • Progress:
      • Added FULLTEXT options stop_filter and stop_df_ratio_ppm (ppm threshold).
      • NATURAL LANGUAGE MODE now supports skipping high-DF ngrams when enabled.
      • Default remains OFF for exact-behavior compatibility.
      • Recall/precision tradeoff example documented in Full-Text Search guide.
    • Done when:
      • Frequent low-information ngrams are skipped using configurable thresholds.
      • Recall/precision tradeoff is documented with benchmark examples.
      • Toggle exists for exact behavior compatibility.
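The ppm-threshold check itself is small: a bigram whose document frequency exceeds stop_df_ratio_ppm parts-per-million of the corpus carries little ranking information and can be skipped. The option names mirror the bullets above; the integer-arithmetic check is an illustrative assumption about how the threshold could be applied.

```rust
/// Returns true if an ngram should be skipped for scoring because its
/// document-frequency ratio exceeds the configured ppm threshold.
/// With the filter disabled (the default), every ngram is kept.
fn is_stop_ngram(doc_freq: u64, total_docs: u64, stop_df_ratio_ppm: u64) -> bool {
    if total_docs == 0 {
        return false; // empty corpus: nothing to filter
    }
    // DF ratio in parts-per-million, computed in integer arithmetic
    // to stay deterministic across platforms.
    doc_freq * 1_000_000 / total_docs > stop_df_ratio_ppm
}

fn main() {
    // A bigram present in 90% of 10_000 docs, threshold 500_000 ppm (50%):
    // skipped as low-information.
    assert!(is_stop_ngram(9_000, 10_000, 500_000));
    // A rare bigram (12 docs) survives and contributes to scoring.
    assert!(!is_stop_ngram(12, 10_000, 500_000));
}
```

Skipping high-DF ngrams trades a little recall (documents matched only via common ngrams may drop out) for precision and query speed, which is the tradeoff the guide documents.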
  • fts_snippet acceleration (pos-to-offset map)
    • Progress:
      • Replaced snippet byte/char conversion loops with a UTF-8 position-to-offset map plus binary search.
      • Snippet assembly now slices by byte ranges instead of repeatedly collecting char vectors.
      • Added dedicated benchmark runner (murodb_snippet_bench) with legacy-vs-new comparison and offset-map memory estimate.
      • On 2026-02-22 (local, release build), long-text tail-hit case showed small p50 improvement (legacy 1245.52us -> new 1228.43us).
    • Done when:
      • Snippet generation avoids repeated UTF-8 rescans for long docs.
      • Latency improvement is measured and documented on representative datasets.
      • Memory overhead remains bounded and observable.
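The position-to-offset idea above can be sketched directly: build the char-index → byte-offset map once per document, then answer every char/byte conversion with an index lookup or binary search instead of rescanning the UTF-8 string for each snippet boundary. This is a minimal sketch, not MuroDB's actual snippet code.

```rust
/// Byte offset of every char boundary, plus the end offset.
/// Built once per document: O(n), one usize per char (the bounded,
/// observable memory overhead mentioned above).
fn char_offsets(text: &str) -> Vec<usize> {
    let mut offs: Vec<usize> = text.char_indices().map(|(byte, _)| byte).collect();
    offs.push(text.len()); // sentinel so `end` can be one past the last char
    offs
}

/// Slice `text` by char positions in O(1) using the precomputed map,
/// instead of re-walking the string with `chars().take(...)`.
fn slice_by_chars<'a>(text: &'a str, offs: &[usize], start: usize, end: usize) -> &'a str {
    &text[offs[start]..offs[end]]
}

fn main() {
    let text = "検索エンジンは速い"; // multi-byte UTF-8: 3 bytes per char here
    let offs = char_offsets(text);
    // Char positions 0..2 slice to "検索" without rescanning the string.
    assert_eq!(slice_by_chars(text, &offs, 0, 2), "検索");
    // The inverse direction (byte offset -> char position) is a binary search.
    assert_eq!(offs.binary_search(&6), Ok(2));
}
```

For ASCII-heavy documents the map is an identity-like array, so the win shows up mainly on long multi-byte texts, consistent with the long-text tail-hit improvement measured above.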

Phase 8 — Security (Future)

  • Key rotation (epoch-based re-encryption)
    • Implemented API-based rekey (Database::rekey_with_password) that re-encrypts every page with the new key.
    • New random salt generated on each rotation; epoch incremented.
    • Crash-safe via .rekey marker file with automatic recovery on next open.
    • Rejects inside transactions and on plaintext databases.
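The crash-safety mechanism above boils down to a marker-file protocol: persist intent before re-encryption starts, and remove the marker only after the rewrite completes, so a crash mid-rotation is detectable on the next open. This is a generic sketch of that protocol using the standard library; the marker name and steps are modeled on the bullets above, not MuroDB's exact implementation.

```rust
use std::fs;
use std::path::Path;

/// Rotate keys crash-safely: marker first, rewrite, then remove marker.
/// `reencrypt` stands in for the full-page rewrite pass.
fn rekey(db_path: &str, reencrypt: impl Fn() -> std::io::Result<()>) -> std::io::Result<()> {
    let marker = format!("{}.rekey", db_path);
    fs::write(&marker, b"rekey-in-progress")?; // step 1: persist intent
    reencrypt()?;                              // step 2: re-encrypt all pages
    fs::remove_file(&marker)                   // step 3: commit by removing the marker
}

/// On open: a surviving marker means the previous rotation may have died
/// mid-rewrite, so recovery must run before the database is usable.
fn needs_recovery(db_path: &str) -> bool {
    Path::new(&format!("{}.rekey", db_path)).exists()
}

fn main() -> std::io::Result<()> {
    let db = "demo.db";
    fs::write(db, b"pages")?;
    rekey(db, || Ok(()))?; // rotation completes cleanly
    assert!(!needs_recovery(db)); // marker gone, no recovery needed on next open
    fs::remove_file(db)?;
    Ok(())
}
```

If the process dies between steps 1 and 3, the marker survives, and the next open can roll the rotation forward or back deterministically instead of serving half-rekeyed pages.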

Phase 9 — Practical Embedded DB (Next)

Real-world deployment features to make MuroDB easier to embed and operate.

  • Encryption OFF mode
    • Motivation: some embedded deployments prefer CPU savings and rely on disk/host-level protection.
    • Done when:
      • DB format can be created/opened in explicit plaintext mode.
      • File header clearly records the at-rest mode to avoid accidentally opening a database in the wrong mode.
      • CLI/API require explicit opt-in (no silent downgrade from encrypted DB).
  • Pluggable encryption suite
    • Motivation: allow policy-driven algorithm choice without forking storage engine.
    • Done when:
      • Algorithm + KDF are selected by explicit config at DB creation.
      • Supported suites are versioned, discoverable, and recorded in metadata.
      • Wrong-suite open errors are deterministic and actionable.
  • Rekey / algorithm migration
    • Rekey implemented via API (Database::rekey_with_password) and dedicated CLI (murodb-rekey).
    • Crash-recoverable via .rekey marker file.
    • Algorithm migration (cipher suite change) deferred to future work.
  • Backup API + consistent snapshot
    • Decision (2026-02-22):
      • Prioritize early in Phase 9 so embedded apps can take consistent backups without full writer quiesce windows.
    • Why now:
      • Copying the database file while writes are active is operationally error-prone.
      • A first-class API can provide deterministic snapshot semantics and simpler restore contracts.
    • Done when:
      • Online consistent backup without long writer stalls.
      • Restore path validated by integration tests.
      • Snapshot metadata includes format/security parameters.
  • Operational limits and safeguards
    • Done when:
      • Configurable caps for DB file size, WAL size, statement timeout, and memory budget.
      • Error surfaces are clear and machine-parseable for host applications.
      • Default limits are documented with recommended profiles (edge device / server / CI).
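Since this item is still future work, here is a purely hypothetical sketch of what the configurable caps and a machine-parseable error surface could look like from the embedding side. None of these names exist in MuroDB today.

```rust
/// Hypothetical per-database operational limits (all names illustrative).
#[derive(Debug)]
struct Limits {
    max_db_bytes: u64,
    max_wal_bytes: u64,
    statement_timeout_ms: u64,
    memory_budget_bytes: u64,
}

/// A structured error the host application can match on, rather than
/// parsing a message string.
#[derive(Debug, PartialEq)]
enum LimitError {
    DbSizeExceeded { used: u64, cap: u64 },
}

fn check_db_size(used_bytes: u64, limits: &Limits) -> Result<(), LimitError> {
    if used_bytes > limits.max_db_bytes {
        Err(LimitError::DbSizeExceeded { used: used_bytes, cap: limits.max_db_bytes })
    } else {
        Ok(())
    }
}

fn main() {
    // An assumed "edge device" profile: tight caps for constrained hardware.
    let edge = Limits {
        max_db_bytes: 64 << 20,       // 64 MiB
        max_wal_bytes: 8 << 20,       // 8 MiB
        statement_timeout_ms: 1_000,  // 1 s per statement
        memory_budget_bytes: 16 << 20,
    };
    assert!(check_db_size(1 << 20, &edge).is_ok());    // 1 MiB: within cap
    assert!(check_db_size(128 << 20, &edge).is_err()); // 128 MiB: rejected
}
```

Shipping named profiles like this (edge device / server / CI) rather than bare numbers is one way to meet the "recommended profiles" criterion above.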