Roadmap

Implemented

Basic CRUD (INSERT, SELECT, UPDATE, DELETE)
CREATE TABLE (PRIMARY KEY, UNIQUE, NOT NULL)
CREATE INDEX / CREATE UNIQUE INDEX (single column)
CREATE FULLTEXT INDEX (bigram, BM25, NATURAL/BOOLEAN mode, snippet)
MySQL-compatible integer types (TINYINT, SMALLINT, INT, BIGINT)
VARCHAR(n), VARBINARY(n), TEXT with size validation
UUID type (16-byte native, UUID_V4/UUID_V7 generation)
Hex literal (X'...') for VARBINARY data
WHERE with comparison operators (=, !=, <, >, <=, >=)
AND, OR logical operators
ORDER BY (ASC/DESC, multi-column), LIMIT
JOIN (INNER, LEFT, CROSS) with table aliases
BEGIN / COMMIT / ROLLBACK
SHOW TABLES
Multi-row INSERT
Hidden _rowid auto-generation for tables without explicit PK
AES-256-GCM-SIV encryption, Argon2 KDF
WAL-based crash recovery
CLI with REPL
DROP TABLE / DROP TABLE IF EXISTS
DROP INDEX
IF NOT EXISTS for CREATE TABLE / CREATE INDEX
SHOW CREATE TABLE
DESCRIBE / DESC table
LIKE / NOT LIKE (% and _ wildcards)
IN (value list)
BETWEEN … AND …
IS NULL / IS NOT NULL
NOT operator (general)
OFFSET (SELECT … LIMIT n OFFSET m)
DEFAULT column values
AUTO_INCREMENT
Arithmetic operators in expressions (+, -, *, /, %)
BOOLEAN type (alias for TINYINT)
CHECK constraint

Phase 2 — Built-in Functions ✓

MySQL-compatible scalar functions.

String: LENGTH, CHAR_LENGTH, CONCAT, SUBSTRING/SUBSTR, UPPER, LOWER
String: TRIM, LTRIM, RTRIM, REPLACE, REVERSE, REPEAT
String: LEFT, RIGHT, LPAD, RPAD, INSTR/LOCATE
String: REGEXP / REGEXP_LIKE
Numeric: ABS, CEIL/CEILING, FLOOR, ROUND, MOD, POWER/POW
NULL handling: COALESCE, IFNULL, NULLIF, IF
Type conversion: CAST(expr AS type)
CASE WHEN … THEN … ELSE … END

Phase 3 — Aggregation & Grouping ✓

COUNT, SUM, AVG, MIN, MAX
COUNT(DISTINCT …)
GROUP BY (single and multiple columns)
HAVING
SELECT DISTINCT

Phase 4 — Schema Evolution ✓

ALTER TABLE ADD COLUMN
ALTER TABLE DROP COLUMN
ALTER TABLE MODIFY COLUMN / CHANGE COLUMN
RENAME TABLE
Composite PRIMARY KEY
Composite UNIQUE / composite INDEX

Phase 5 — Advanced Query ✓

Subqueries (WHERE col IN (SELECT …), scalar subquery)
UNION / UNION ALL
EXISTS / NOT EXISTS
INSERT … ON DUPLICATE KEY UPDATE
REPLACE INTO
EXPLAIN (query plan display)
RIGHT JOIN
Shared-lock read path (Database::query) with CLI auto routing

Phase 6 — Types & Storage

FLOAT / DOUBLE
DATE, DATETIME, TIMESTAMP
- Scope: fully align parser/executor/CAST/default/literal behavior and edge-case validation.
- Done when:
  - Temporal literals and string casts behave consistently across INSERT/UPDATE/WHERE.
  - Arithmetic and comparison semantics are defined/documented for mixed temporal expressions.
  - Timezone handling policy is explicit (especially TIMESTAMP input/output normalization).
  - Invalid dates/times reject with deterministic errors.
Date/time functions: NOW, CURRENT_TIMESTAMP, DATE_FORMAT, etc.
UUID type with UUID_V4() and UUID_V7() generation functions
DECIMAL(p,s) / NUMERIC(p,s) fixed-point exact numeric type
- 96-bit mantissa via rust_decimal, precision 1-28, 16-byte storage
- Full arithmetic, comparison, CAST, aggregation (SUM/AVG/MIN/MAX), ORDER BY, GROUP BY, INDEX support
- MySQL-compatible: NUMERIC alias, default DECIMAL(10,0), DECIMAL+INT→DECIMAL, DECIMAL+FLOAT→FLOAT
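
The promotion rules above can be sketched as a small lookup. This is an illustration only: `NumType` and `arithmetic_result_type` are hypothetical names, and the INT+INT, INT+FLOAT and DECIMAL+DECIMAL cases are assumptions that follow the usual MySQL convention; only DECIMAL+INT→DECIMAL and DECIMAL+FLOAT→FLOAT come from the roadmap.

```rust
/// Hypothetical numeric type tags, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NumType {
    Int,
    Decimal,
    Float,
}

/// Result type of a binary arithmetic expression.
/// FLOAT wins over everything, DECIMAL wins over INT.
fn arithmetic_result_type(lhs: NumType, rhs: NumType) -> NumType {
    use NumType::*;
    match (lhs, rhs) {
        // DECIMAL + FLOAT -> FLOAT (from the roadmap); INT + FLOAT is assumed.
        (Float, _) | (_, Float) => Float,
        // DECIMAL + INT -> DECIMAL (from the roadmap); DECIMAL + DECIMAL is assumed.
        (Decimal, _) | (_, Decimal) => Decimal,
        // Assumed: integer arithmetic stays integer.
        (Int, Int) => Int,
    }
}
```

Because the match is symmetric in its arguments, operand order does not affect the result type.
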
BLOB (skipped for now)
- Decision (2026-02-22): defer and move focus to Phase 7 performance work.
- Why skipped now:
  - Current product priorities are query/index performance and planner improvements, not large-object type expansion.
  - BLOB adds non-trivial storage/operational surface area (limits, indexing semantics, comparison behavior) with low near-term user impact.
  - Existing VARBINARY(n)/TEXT coverage is sufficient for current workloads.
- Revisit when:
  - There is a concrete workload requiring large binary payloads that cannot be handled acceptably by current types.
  - The performance roadmap items in Phase 7 are complete or no longer the bottleneck.
Overflow pages (posting list > 4096B)
- Scope: support values/postings that exceed single-page capacity.
- Progress:
  - Implemented FTS segment overflow chains (__segovf__) with typed page format (OFG1).
  - Read, write, delete, and vacuum paths now reclaim overflow pages without orphaning them.
  - Covered by unit/integration tests (cargo test green as of 2026-02-22).
  - Added WAL recovery integration tests for overflow chains (torn WAL tail and post-sync partial-write replay paths).
  - Benchmarked on 2026-02-22 (murodb_bench, commit 829ad18145c2) with no severe small-record regression signal.
  - Implemented B-tree value overflow pages (2026-02-23): large row values (larger than about 4073 bytes) now spill transparently to overflow page chains. Format version bumped to 5 (backward-compatible with v4).
- Done when:
  - Overflow chain format is versioned and crash-safe.
  - WAL/recovery covers partial-write and torn-tail scenarios for overflow chains.
  - Vacuum/reclaim path correctly frees overflow pages without orphaning.
  - Benchmarks show no severe regressions for small records.
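
As a rough mental model of an overflow chain, the sketch below splits a large value into fixed-size page payloads linked by a next pointer and reassembles it by walking the chain. It is not MuroDB's on-disk format: the 16-byte `CHAIN_HEADER`, the `OverflowPage` struct, and the use of vector indexes as page ids are all assumptions made for illustration.

```rust
const PAGE_SIZE: usize = 4096;
// Hypothetical per-page header (e.g. magic, next-page id, payload length).
const CHAIN_HEADER: usize = 16;
const PAGE_PAYLOAD: usize = PAGE_SIZE - CHAIN_HEADER;

/// One in-memory overflow page; `next` is the index of the following page.
struct OverflowPage {
    next: Option<usize>,
    payload: Vec<u8>,
}

/// Split `value` into a chain of pages, each holding at most PAGE_PAYLOAD bytes.
fn write_chain(value: &[u8]) -> Vec<OverflowPage> {
    let chunks: Vec<&[u8]> = value.chunks(PAGE_PAYLOAD).collect();
    let n = chunks.len();
    chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| OverflowPage {
            next: if i + 1 < n { Some(i + 1) } else { None },
            payload: chunk.to_vec(),
        })
        .collect()
}

/// Follow `next` pointers from the first page and concatenate the payloads.
fn read_chain(pages: &[OverflowPage]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut cur = if pages.is_empty() { None } else { Some(0) };
    while let Some(i) = cur {
        out.extend_from_slice(&pages[i].payload);
        cur = pages[i].next;
    }
    out
}
```

With these assumed sizes, a 10,000-byte value occupies three pages (4080 + 4080 + 1840 bytes). A real implementation would also have to log each page write in the WAL so that a torn tail can be detected on recovery.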

Phase 7 — Performance & Internals

Auto-checkpoint (threshold-based WAL)
Composite index range scan
- Progress:
  - Added planner/executor support for composite-index range seek on the last key part (e.g. (a,b) with a = ? and b range).
  - EXPLAIN now reports type=range for this access path.
  - EXPLAIN now reports estimated cardinality via rows.
- Done when:
  - Multi-column prefix ranges ((a,b) with predicates on a, optional range on b) use index scan.
  - EXPLAIN shows index-range choice and estimated cardinality.
  - Fallback path remains correct for unsupported predicate shapes.
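
To illustrate why a range on the last key part maps to a single contiguous index scan, here is a minimal sketch that uses the standard library's `BTreeMap` as a stand-in for a composite index on (a, b). The function name `range_seek` and the `u64` row ids are hypothetical; MuroDB's own B-tree and key encoding are not shown here.

```rust
use std::collections::BTreeMap;

/// Look up row ids for `a = ? AND b BETWEEN b_lo AND b_hi` in an index keyed by (a, b).
/// Tuples compare lexicographically, so all matching keys are adjacent.
fn range_seek(index: &BTreeMap<(i64, i64), u64>, a: i64, b_lo: i64, b_hi: i64) -> Vec<u64> {
    // BTreeMap::range panics when the start bound is greater than the end bound.
    if b_lo > b_hi {
        return Vec::new();
    }
    index
        .range((a, b_lo)..=(a, b_hi))
        .map(|(_, rowid)| *rowid)
        .collect()
}
```

The scan touches only keys whose first part equals `a`, which is the property that lets a planner report a range access path instead of a full scan.
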
Query optimizer improvements (cost-based)
- Progress:
  - Added deterministic heuristic cost hints for PkSeek / IndexSeek / IndexRangeSeek / FullScan.
  - Planner now compares index candidates by cost instead of choosing the first matching index.
  - EXPLAIN now reports a cost column for the chosen plan.
  - Added persisted stats via ANALYZE TABLE (table_rows, index_distinct_keys) in catalog metadata.
  - EXPLAIN row estimation now prefers persisted table_rows when available.
  - Planner cost model now incorporates persisted table_rows/index_distinct_keys when available, with conservative fallback selectivity when stats are missing.
  - EXPLAIN rows/cost now uses the same planner estimation logic (with table-row fallback), so estimates reflect planner tradeoffs.
  - JOIN loop-order choice for INNER/CROSS now uses planner-side estimated row counts (stats-aware with runtime fallback) and keeps row shape (left + right) stable.
  - ANALYZE TABLE now persists numeric min/max bounds and equal-width histogram bins for single-column numeric B-tree indexes; range row estimation uses these stats when available.
  - EXPLAIN for JOIN now reports nested-loop outer-side choice with estimated left/right row counts in Extra.
- Done when:
  - Planner compares at least full-scan vs single-index vs join-order alternatives.
  - Basic column stats/histograms are persisted and refreshable.
  - Plan choice is deterministic under identical stats.
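
As an illustration of how equal-width histogram bins can feed range row estimation, the sketch below sums each bin's count weighted by the fraction of the bin that the query range covers. The `Histogram` struct and `estimate_range` are hypothetical names, and the assumption that values are spread uniformly within a bin is a standard simplification, not a statement about MuroDB's cost model.

```rust
/// Equal-width histogram over [min, max]; bins[i] counts rows in the i-th slice.
struct Histogram {
    min: f64,
    max: f64,
    bins: Vec<u64>,
}

impl Histogram {
    /// Estimated number of rows with lo <= value <= hi.
    fn estimate_range(&self, lo: f64, hi: f64) -> f64 {
        if self.bins.is_empty() || lo > hi {
            return 0.0;
        }
        let total: u64 = self.bins.iter().sum();
        if self.max <= self.min {
            // Degenerate case: every indexed value is identical.
            return if lo <= self.min && self.min <= hi { total as f64 } else { 0.0 };
        }
        let width = (self.max - self.min) / self.bins.len() as f64;
        let mut estimate = 0.0;
        for (i, &count) in self.bins.iter().enumerate() {
            let bin_lo = self.min + width * i as f64;
            let bin_hi = bin_lo + width;
            // Length of the intersection between the query range and this bin.
            let overlap = (hi.min(bin_hi) - lo.max(bin_lo)).max(0.0);
            estimate += count as f64 * (overlap / width);
        }
        estimate
    }
}
```

A point predicate (lo equal to hi) yields zero under this model, so equality estimates would need a separate statistic such as the distinct-key count.
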
FTS stop-ngram filtering
- Progress:
  - Added FULLTEXT options stop_filter and stop_df_ratio_ppm (ppm threshold).
  - NATURAL LANGUAGE MODE now supports skipping high-DF ngrams when enabled.
  - Default remains OFF for exact-behavior compatibility.
  - Recall/precision tradeoff example documented in Full-Text Search guide.
- Done when:
  - Frequent low-information ngrams are skipped using configurable thresholds.
  - Recall/precision tradeoff is documented with benchmark examples.
  - Toggle exists for exact behavior compatibility.
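
A minimal sketch of a ppm-based document-frequency filter follows. The parameter names mirror the stop_filter and stop_df_ratio_ppm options, but the function names, the strict "greater than" comparison, and the shape of the input (n-gram paired with its document frequency) are assumptions for illustration.

```rust
/// True when doc_freq / total_docs exceeds stop_df_ratio_ppm / 1_000_000.
/// Integer cross-multiplication in u128 avoids floating point and overflow.
fn is_stop_ngram(doc_freq: u64, total_docs: u64, stop_df_ratio_ppm: u64) -> bool {
    if total_docs == 0 {
        return false;
    }
    (doc_freq as u128) * 1_000_000 > (stop_df_ratio_ppm as u128) * (total_docs as u128)
}

/// Keep the query n-grams that should be searched.
/// With `stop_filter` off, every n-gram is kept (exact-behavior compatibility).
fn select_query_ngrams<'a>(
    ngrams: &[(&'a str, u64)],
    total_docs: u64,
    stop_filter: bool,
    stop_df_ratio_ppm: u64,
) -> Vec<&'a str> {
    ngrams
        .iter()
        .filter(|(_, df)| !(stop_filter && is_stop_ngram(*df, total_docs, stop_df_ratio_ppm)))
        .map(|(gram, _)| *gram)
        .collect()
}
```

For example, with 1,000 documents and a threshold of 500,000 ppm (50%), an n-gram found in 900 documents is skipped while one found in 20 documents is kept. This sketch does not handle the case where every n-gram in a query is filtered out.
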
fts_snippet acceleration (pos-to-offset map)
- Progress:
  - Replaced snippet byte/char conversion loops with a UTF-8 position-to-offset map plus binary search.
  - Snippet assembly now slices by byte ranges instead of repeatedly collecting char vectors.
  - Added dedicated benchmark runner (murodb_snippet_bench) with legacy-vs-new comparison and offset-map memory estimate.
  - On 2026-02-22 (local, release build), the long-text tail-hit case showed a small p50 improvement (legacy 1245.52 µs, new 1228.43 µs).
- Done when:
  - Snippet generation avoids repeated UTF-8 rescans for long docs.
  - Latency improvement is measured and documented on representative datasets.
  - Memory overhead remains bounded and observable.
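
The position-to-offset idea can be sketched as follows: build the byte offset of every character once, then answer byte-to-character lookups by binary search and cut snippets by byte range. The function names are hypothetical and this is not the fts_snippet implementation, only an illustration of the technique under the assumption that positions are counted in characters.

```rust
/// Byte offset of every character start in `text`, plus a final sentinel
/// equal to `text.len()`, so entry `i` is where character `i` begins.
fn build_offset_map(text: &str) -> Vec<usize> {
    let mut map: Vec<usize> = text.char_indices().map(|(byte, _)| byte).collect();
    map.push(text.len());
    map
}

/// Character position that contains `byte_offset`, found by binary search.
fn char_pos_of_byte(map: &[usize], byte_offset: usize) -> usize {
    match map.binary_search(&byte_offset) {
        Ok(pos) => pos,
        // Offset falls inside a multi-byte character: round down to its start.
        Err(insert_at) => insert_at - 1,
    }
}

/// Slice characters [start, end) out of `text` using byte ranges from the map.
/// Out-of-range positions are clamped to the end of the text.
fn snippet<'a>(text: &'a str, map: &[usize], start: usize, end: usize) -> &'a str {
    let last = map.len() - 1;
    let s = start.min(last);
    let e = end.min(last).max(s);
    &text[map[s]..map[e]]
}
```

The map costs one `usize` per character, which is the memory overhead that the benchmark runner's offset-map estimate would need to track.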

Phase 8 — Security (Future)

Key rotation (epoch-based re-encryption)
- Implemented API-based rekey (Database::rekey_with_password) for full page re-encryption.
- New random salt generated on each rotation; epoch incremented.
- Crash-safe via .rekey marker file with automatic recovery on next open.
- Rekey is rejected inside transactions and on plaintext databases.

Phase 9 — Practical Embedded DB (Next)

Real-world deployment features to make MuroDB easier to embed and operate.

Encryption OFF mode
- Motivation: some embedded deployments prefer CPU savings and rely on disk/host-level protection.
- Done when:
  - DB format can be created/opened in explicit plaintext mode.
  - File header clearly records mode to avoid accidental mis-open.
  - CLI/API require explicit opt-in (no silent downgrade from encrypted DB).
Pluggable encryption suite
- Motivation: allow policy-driven algorithm choice without forking storage engine.
- Done when:
  - Algorithm + KDF are selected by explicit config at DB creation.
  - Supported suites are versioned, discoverable, and recorded in metadata.
  - Wrong-suite open errors are deterministic and actionable.
Rekey / algorithm migration
- Rekey implemented via API (Database::rekey_with_password) and dedicated CLI (murodb-rekey).
- Crash-recoverable via .rekey marker file.
- Algorithm migration (cipher suite change) deferred to future work.
Backup API + consistent snapshot
- Decision (2026-02-22):
  - Prioritize early in Phase 9 so embedded apps can take consistent backups without full writer quiesce windows.
- Why now:
  - File-copy backup while writes are active is error-prone operationally.
  - A first-class API can provide deterministic snapshot semantics and simpler restore contracts.
- Done when:
  - Online consistent backup completes without long writer stalls.
  - Restore path validated by integration tests.
  - Snapshot metadata includes format/security parameters.
Operational limits and safeguards
- Done when:
  - Configurable caps for DB file size, WAL size, statement timeout, and memory budget.
  - Error surfaces are clear and machine-parseable for host applications.
  - Default limits are documented with recommended profiles (edge device / server / CI).
