Roadmap

Implemented

  • Basic CRUD (INSERT, SELECT, UPDATE, DELETE)
  • CREATE TABLE (PRIMARY KEY, UNIQUE, NOT NULL)
  • CREATE INDEX / CREATE UNIQUE INDEX (single column)
  • CREATE FULLTEXT INDEX (bigram, BM25, NATURAL/BOOLEAN mode, snippet)
  • MySQL-compatible integer types (TINYINT, SMALLINT, INT, BIGINT)
  • VARCHAR(n), VARBINARY(n), TEXT with size validation
  • UUID type (16-byte native, UUID_V4/UUID_V7 generation)
  • Hex literal (X'...') for VARBINARY data
  • WHERE with comparison operators (=, !=, <, >, <=, >=)
  • AND, OR logical operators
  • ORDER BY (ASC/DESC, multi-column), LIMIT
  • JOIN (INNER, LEFT, CROSS) with table aliases
  • BEGIN / COMMIT / ROLLBACK
  • SHOW TABLES
  • Multi-row INSERT
  • Hidden _rowid auto-generation for tables without explicit PK
  • AES-256-GCM-SIV encryption, Argon2 KDF
  • WAL-based crash recovery
  • CLI with REPL
  • DROP TABLE / DROP TABLE IF EXISTS
  • DROP INDEX
  • IF NOT EXISTS for CREATE TABLE / CREATE INDEX
  • SHOW CREATE TABLE
  • DESCRIBE / DESC table
  • LIKE / NOT LIKE (% and _ wildcards)
  • IN (value list)
  • BETWEEN … AND …
  • IS NULL / IS NOT NULL
  • NOT operator (general)
  • OFFSET (SELECT … LIMIT n OFFSET m)
  • DEFAULT column values
  • AUTO_INCREMENT
  • Arithmetic operators in expressions (+, -, *, /, %)
  • BOOLEAN type (alias for TINYINT)
  • CHECK constraint

Phase 2 — Built-in Functions ✓

MySQL-compatible scalar functions.

  • String: LENGTH, CHAR_LENGTH, CONCAT, SUBSTRING/SUBSTR, UPPER, LOWER
  • String: TRIM, LTRIM, RTRIM, REPLACE, REVERSE, REPEAT
  • String: LEFT, RIGHT, LPAD, RPAD, INSTR/LOCATE
  • String: REGEXP / REGEXP_LIKE
  • Numeric: ABS, CEIL/CEILING, FLOOR, ROUND, MOD, POWER/POW
  • NULL handling: COALESCE, IFNULL, NULLIF, IF
  • Type conversion: CAST(expr AS type)
  • CASE WHEN … THEN … ELSE … END

Phase 3 — Aggregation & Grouping ✓

  • COUNT, SUM, AVG, MIN, MAX
  • COUNT(DISTINCT …)
  • GROUP BY (single and multiple columns)
  • HAVING
  • SELECT DISTINCT

Phase 4 — Schema Evolution ✓

  • ALTER TABLE ADD COLUMN
  • ALTER TABLE DROP COLUMN
  • ALTER TABLE MODIFY COLUMN / CHANGE COLUMN
  • RENAME TABLE
  • Composite PRIMARY KEY
  • Composite UNIQUE / composite INDEX

Phase 5 — Advanced Query ✓

  • Subqueries (WHERE col IN (SELECT …), scalar subquery)
  • UNION / UNION ALL
  • EXISTS / NOT EXISTS
  • INSERT … ON DUPLICATE KEY UPDATE
  • REPLACE INTO
  • EXPLAIN (query plan display)
  • RIGHT JOIN
  • Shared-lock read path (Database::query) with CLI auto routing

Phase 6 — Types & Storage

  • FLOAT / DOUBLE
  • DATE, DATETIME, TIMESTAMP
    • Scope: fully align parser/executor/CAST/default/literal behavior and edge-case validation.
    • Done when:
      • Temporal literals and string casts behave consistently across INSERT/UPDATE/WHERE.
      • Arithmetic and comparison semantics are defined/documented for mixed temporal expressions.
      • Timezone handling policy is explicit (especially TIMESTAMP input/output normalization).
      • Invalid dates/times reject with deterministic errors.
  • Date/time functions: NOW, CURRENT_TIMESTAMP, DATE_FORMAT, etc.
  • UUID type with UUID_V4() and UUID_V7() generation functions
  • DECIMAL(p,s) / NUMERIC(p,s) fixed-point exact numeric type
    • 96-bit mantissa via rust_decimal, precision 1-28, 16-byte storage
    • Full arithmetic, comparison, CAST, aggregation (SUM/AVG/MIN/MAX), ORDER BY, GROUP BY, INDEX support
    • MySQL-compatible: NUMERIC alias, default DECIMAL(10,0), DECIMAL+INT→DECIMAL, DECIMAL+FLOAT→FLOAT
  • BLOB (skipped for now)
    • Decision (2026-02-22): defer and move focus to Phase 7 performance work.
    • Why skipped now:
      • Current product priorities are query/index performance and planner improvements, not large-object type expansion.
      • BLOB adds non-trivial storage/operational surface area (limits, indexing semantics, comparison behavior) with low near-term user impact.
      • Existing VARBINARY(n)/TEXT coverage is sufficient for current workloads.
    • Revisit when:
      • There is a concrete workload requiring large binary payloads that cannot be handled acceptably by current types.
      • The performance roadmap items in Phase 7 are complete or no longer the bottleneck.
  • Overflow pages (posting list > 4096B)
    • Scope: support values/postings that exceed single-page capacity.
    • Progress:
      • Implemented FTS segment overflow chains (__segovf__) with typed page format (OFG1).
      • Read/write/delete and vacuum paths now handle overflow chains and reclaim overflow pages without orphaning them.
      • Covered by unit/integration tests (cargo test green as of 2026-02-22).
      • Added WAL recovery integration tests for overflow chains (torn WAL tail and post-sync partial-write replay paths).
      • Benchmarked on 2026-02-22 (murodb_bench, commit 829ad18145c2) with no severe small-record regression signal.
      • Implemented B-tree value overflow pages (2026-02-23): row values larger than roughly 4073 bytes now spill transparently to overflow page chains. Format version bumped to 5 (backward-compatible with v4).
    • Done when:
      • Overflow chain format is versioned and crash-safe.
      • WAL/recovery covers partial-write and torn-tail scenarios for overflow chains.
      • Vacuum/reclaim path correctly frees overflow pages without orphaning.
      • Benchmarks show no severe regressions for small records.
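
The overflow-page item above can be illustrated with a minimal sketch. It is not MuroDB's actual on-disk format: the roadmap only states a 4096-byte page and an inline threshold of roughly 4073 bytes, so the per-page payload size, the choice to keep an inline prefix, and the names `StoredValue`, `spill`, and `reassemble` are all assumptions made for illustration.

```rust
/// Inline threshold taken from the roadmap text ("roughly 4073 bytes").
const INLINE_CAP: usize = 4073;
/// Assumed usable payload per overflow page (hypothetical; the real
/// page header size is not documented here).
const OVERFLOW_PAYLOAD: usize = 4080;

/// A stored value: the inline part plus zero or more overflow pages.
#[derive(Debug, PartialEq)]
struct StoredValue {
    inline: Vec<u8>,
    overflow_pages: Vec<Vec<u8>>,
}

/// Split a value into an inline part and a chain of overflow pages.
/// Values at or below the threshold stay fully inline.
fn spill(value: &[u8]) -> StoredValue {
    if value.len() <= INLINE_CAP {
        return StoredValue {
            inline: value.to_vec(),
            overflow_pages: Vec::new(),
        };
    }
    let (head, tail) = value.split_at(INLINE_CAP);
    StoredValue {
        inline: head.to_vec(),
        overflow_pages: tail
            .chunks(OVERFLOW_PAYLOAD)
            .map(|chunk| chunk.to_vec())
            .collect(),
    }
}

/// Reassemble the original value by walking the chain in order.
fn reassemble(stored: &StoredValue) -> Vec<u8> {
    let mut out = stored.inline.clone();
    for page in &stored.overflow_pages {
        out.extend_from_slice(page);
    }
    out
}
```

Under these assumed sizes, a 10,000-byte value keeps 4073 bytes inline and spills the remaining 5927 bytes into two overflow pages, while small records never touch the overflow path. That second property is what the "no severe regressions for small records" criterion is about.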

Phase 7 — Performance & Internals

  • Auto-checkpoint (threshold-based WAL)
  • Composite index range scan
    • Progress:
      • Added planner/executor support for composite-index range seek on the last key part (e.g. (a,b) with a = ? and b range).
      • EXPLAIN now reports type=range for this access path.
      • EXPLAIN now reports estimated cardinality via rows.
    • Done when:
      • Multi-column prefix ranges ((a,b) with predicates on a, optional range on b) use an index scan.
      • EXPLAIN shows index-range choice and estimated cardinality.
      • Fallback path remains correct for unsupported predicate shapes.
  • Query optimizer improvements (cost-based)
    • Progress:
      • Added deterministic heuristic cost hints for PkSeek / IndexSeek / IndexRangeSeek / FullScan.
      • Planner now compares index candidates by cost instead of choosing the first matching index.
      • EXPLAIN now reports a cost column for the chosen plan.
      • Added persisted stats via ANALYZE TABLE (table_rows, index_distinct_keys) in catalog metadata.
      • EXPLAIN row estimation now prefers persisted table_rows when available.
      • Planner cost model now incorporates persisted table_rows/index_distinct_keys when available, with conservative fallback selectivity when stats are missing.
      • EXPLAIN rows/cost now uses the same planner estimation logic (with table-row fallback), so estimates reflect planner tradeoffs.
      • JOIN loop-order choice for INNER/CROSS now uses planner-side estimated row counts (stats-aware with runtime fallback) and keeps row shape (left + right) stable.
      • ANALYZE TABLE now persists numeric min/max bounds and equal-width histogram bins for single-column numeric B-tree indexes; range row estimation uses these stats when available.
      • EXPLAIN for JOIN now reports nested-loop outer-side choice with estimated left/right row counts in Extra.
    • Done when:
      • Planner compares at least full-scan vs single-index vs join-order alternatives.
      • Basic column stats/histograms are persisted and refreshable.
      • Plan choice is deterministic under identical stats.
  • FTS stop-ngram filtering
    • Progress:
      • Added FULLTEXT options stop_filter and stop_df_ratio_ppm (ppm threshold).
      • NATURAL LANGUAGE MODE now supports skipping high-DF ngrams when enabled.
      • Default remains OFF for exact-behavior compatibility.
      • Recall/precision tradeoff example documented in Full-Text Search guide.
    • Done when:
      • Frequent low-information ngrams are skipped using configurable thresholds.
      • Recall/precision tradeoff is documented with benchmark examples.
      • Toggle exists for exact behavior compatibility.
  • fts_snippet acceleration (pos-to-offset map)
    • Progress:
      • Replaced snippet byte/char conversion loops with a UTF-8 position-to-offset map plus binary search.
      • Snippet assembly now slices by byte ranges instead of repeatedly collecting char vectors.
      • Added dedicated benchmark runner (murodb_snippet_bench) with legacy-vs-new comparison and offset-map memory estimate.
      • On 2026-02-22 (local, release build), the long-text tail-hit case showed a small p50 improvement of about 1.4% (legacy 1245.52 µs → new 1228.43 µs).
    • Done when:
      • Snippet generation avoids repeated UTF-8 rescans for long docs.
      • Latency improvement is measured and documented on representative datasets.
      • Memory overhead remains bounded and observable.
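
The position-to-offset map described for fts_snippet can be sketched as follows. This is an illustrative reconstruction of the general technique, not the project's code: the function names (`build_offset_map`, `char_to_byte`, `byte_to_char`, `snippet`) and the radius-based snippet window are hypothetical.

```rust
/// Byte offset of every character start, plus the total byte length
/// as a sentinel, so that map[i] is the byte offset of character i.
fn build_offset_map(text: &str) -> Vec<usize> {
    let mut map: Vec<usize> = text.char_indices().map(|(byte, _)| byte).collect();
    map.push(text.len());
    map
}

/// Character position -> byte offset: a direct lookup, clamped to the end.
fn char_to_byte(map: &[usize], char_pos: usize) -> usize {
    map[char_pos.min(map.len() - 1)]
}

/// Byte offset -> character position via binary search.
/// An offset inside a multi-byte character maps to that character.
fn byte_to_char(map: &[usize], byte_off: usize) -> usize {
    match map.binary_search(&byte_off) {
        Ok(pos) => pos,
        Err(insert_at) => insert_at.saturating_sub(1),
    }
}

/// Cut a snippet of up to `radius` characters on each side of a hit,
/// slicing by byte range instead of collecting a Vec<char>.
fn snippet<'a>(text: &'a str, map: &[usize], hit_byte: usize, radius: usize) -> &'a str {
    let center = byte_to_char(map, hit_byte);
    let start = char_to_byte(map, center.saturating_sub(radius));
    let end = char_to_byte(map, center.saturating_add(radius).saturating_add(1));
    &text[start..end]
}
```

The map is built once per document in a single UTF-8 pass, after which each conversion is either an index lookup or a binary search. The cost is one `usize` per character, which is the memory overhead the "bounded and observable" criterion refers to.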

Phase 8 — Security (Future)

  • Key rotation (epoch-based re-encryption)
    • Implemented API-based rekey (Database::rekey_with_password) for full page re-encryption.
    • New random salt generated on each rotation; epoch incremented.
    • Crash-safe via .rekey marker file with automatic recovery on next open.
    • Rekey is rejected inside transactions and on plaintext databases.

Phase 9 — Practical Embedded DB (Next)

Real-world deployment features to make MuroDB easier to embed and operate.

  • Encryption OFF mode
    • Motivation: some embedded deployments prefer CPU savings and rely on disk/host-level protection.
    • Done when:
      • DB format can be created/opened in explicit plaintext mode.
      • File header clearly records mode to avoid accidental mis-open.
      • CLI/API require explicit opt-in (no silent downgrade from encrypted DB).
  • Pluggable encryption suite
    • Motivation: allow policy-driven algorithm choice without forking the storage engine.
    • Done when:
      • Algorithm + KDF are selected by explicit config at DB creation.
      • Supported suites are versioned, discoverable, and recorded in metadata.
      • Wrong-suite open errors are deterministic and actionable.
  • Rekey / algorithm migration
    • Rekey implemented via API (Database::rekey_with_password) and dedicated CLI (murodb-rekey).
    • Crash-recoverable via .rekey marker file.
    • Algorithm migration (cipher suite change) deferred to future work.
  • Backup API + consistent snapshot
    • Decision (2026-02-22):
      • Prioritize early in Phase 9 so embedded apps can take consistent backups without windows in which all writers must be quiesced.
    • Why now:
      • File-copy backup while writes are active is error-prone operationally.
      • A first-class API can provide deterministic snapshot semantics and simpler restore contracts.
    • Done when:
      • Online backup is consistent and does not cause long writer stalls.
      • Restore path validated by integration tests.
      • Snapshot metadata includes format/security parameters.
  • Operational limits and safeguards
    • Done when:
      • Configurable caps for DB file size, WAL size, statement timeout, and memory budget.
      • Error surfaces are clear and machine-parseable for host applications.
      • Default limits are documented with recommended profiles (edge device / server / CI).