FTS Internals
Tokenization
- Normalization: NFKC Unicode normalization
- Tokenizer: bigram (n=2); text is split into overlapping 2-character sequences
- Example: "東京タワー" → ["東京", "京タ", "タワ", "ワー"]
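A minimal sketch of this tokenizer in Rust, assuming the `unicode-normalization` crate for NFKC; the function name is illustrative:

```rust
// Sketch: NFKC-normalize, then emit overlapping 2-char windows.
// Assumes the `unicode-normalization` crate.
use unicode_normalization::UnicodeNormalization;

fn bigrams(text: &str) -> Vec<String> {
    // Normalize first so equivalent forms tokenize identically.
    let chars: Vec<char> = text.nfkc().collect();
    // windows(2) yields nothing for inputs shorter than 2 chars.
    chars.windows(2).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(
        bigrams("東京タワー"),
        vec!["東京", "京タ", "タワ", "ワー"]
    );
}
```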
Term ID Blinding
Term IDs are computed using HMAC-SHA256:
- Term ID = HMAC-SHA256(master_key, normalized_token)
- No plaintext tokens are stored on disk
- This provides privacy: the disk contents do not reveal which terms are indexed
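A sketch of the derivation, assuming the `hmac` and `sha2` crates; the key handling and `TermId` alias are illustrative:

```rust
// Sketch: blinded term ID = HMAC-SHA256(master_key, normalized_token).
// Assumes the `hmac` and `sha2` crates; `TermId` is an illustrative alias.
use hmac::{Hmac, Mac};
use sha2::Sha256;

type TermId = [u8; 32];

fn term_id(master_key: &[u8], normalized_token: &str) -> TermId {
    let mut mac = Hmac::<Sha256>::new_from_slice(master_key)
        .expect("HMAC accepts keys of any length");
    mac.update(normalized_token.as_bytes());
    // Without the key, the stored ID does not reveal the token.
    mac.finalize().into_bytes().into()
}
```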
Postings Storage
Postings lists are stored in a B-tree with compression:
- Delta encoding: Document IDs are stored as deltas from the previous ID
- Varint compression: Deltas are encoded as variable-length integers
- Postings are stored in the same B-tree infrastructure as regular data
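A sketch of the delta + varint step, assuming LEB128-style varints; this is illustrative, not the engine's exact on-disk layout:

```rust
// Sketch: delta-encode sorted doc IDs, then varint-compress the deltas.
// LEB128-style encoding is an assumption, not the exact on-disk format.

/// Append `value` as an unsigned LEB128 varint.
fn push_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}

/// Encode sorted doc IDs as varint-compressed deltas.
fn encode_postings(doc_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        push_varint(&mut out, id - prev); // delta from previous ID
        prev = id;
    }
    out
}

fn main() {
    // Small deltas (1, 2) compress to one byte each.
    assert_eq!(encode_postings(&[100, 101, 103]).len(), 3);
}
```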
Scoring
- Algorithm: BM25 (Okapi BM25)
- Used in NATURAL LANGUAGE MODE for relevance ranking
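For reference, a sketch of the per-term BM25 contribution; the `k1`/`b` defaults and the +1-smoothed, non-negative IDF variant are assumptions, not necessarily what the engine uses:

```rust
// Sketch: per-term Okapi BM25 score. k1/b are common defaults and the
// IDF uses the +1-smoothed non-negative variant; both are assumptions.
fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, n_docs: f64, doc_freq: f64) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    // Inverse document frequency with +0.5 smoothing.
    let idf = ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // Term-frequency saturation with document-length normalization.
    idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * doc_len / avg_doc_len))
}

fn main() {
    // Sanity check: a term occurring twice in a shorter-than-average
    // document should contribute a positive score.
    assert!(bm25_term(2.0, 100.0, 120.0, 1000.0, 10.0) > 0.0);
}
```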
Phrase Matching
Phrase queries (e.g., "東京タワー") verify consecutive bigram positions:
- Tokenize the phrase into bigrams
- Find postings for each bigram
- Verify that positions are consecutive across all bigrams
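A sketch of the position check, assuming each bigram's positions arrive as a sorted list for one candidate document:

```rust
// Sketch: verify consecutive bigram positions for a phrase query.
// `positions[i]` holds the sorted token positions of the i-th bigram
// within one candidate document; the data layout is illustrative.

/// True if some start position p has bigram i at position p + i for all i.
fn phrase_matches(positions: &[Vec<u32>]) -> bool {
    let first = match positions.first() {
        Some(p) => p,
        None => return false,
    };
    first.iter().any(|&start| {
        positions[1..].iter().enumerate().all(|(i, pos)| {
            // The (i+1)-th bigram must appear exactly i+1 tokens later.
            pos.binary_search(&(start + i as u32 + 1)).is_ok()
        })
    })
}

fn main() {
    // "東京タワー" -> bigrams at consecutive positions 0..=3 match.
    assert!(phrase_matches(&[vec![0], vec![1], vec![2], vec![3]]));
    // A gap in the positions breaks the phrase.
    assert!(!phrase_matches(&[vec![0], vec![5], vec![2], vec![3]]));
}
```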
Snippet Generation
fts_snippet() uses a local scan approach with a UTF-8 offset map:
- Find matching positions in the document
- Build a char<->byte offset map for normalized text
- Convert match byte offsets to char windows via binary search
- Slice and apply highlight tags (open/close) around matched regions
- Truncate to the specified maximum length
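A sketch of the offset-map and binary-search steps, with illustrative function names:

```rust
// Sketch: build a char<->byte offset map for normalized text, then
// convert match byte offsets to char indices via binary search.
// Function names are illustrative.

/// Byte offset at which each char starts, plus a sentinel end offset.
fn char_to_byte_map(text: &str) -> Vec<usize> {
    let mut map: Vec<usize> = text.char_indices().map(|(b, _)| b).collect();
    map.push(text.len()); // sentinel: one past the last char
    map
}

/// Index of the char containing the given byte offset.
fn byte_to_char(map: &[usize], byte_off: usize) -> usize {
    match map.binary_search(&byte_off) {
        Ok(i) => i,      // exactly on a char boundary
        Err(i) => i - 1, // inside a multi-byte char
    }
}

fn main() {
    let text = "東京タワー"; // 5 chars, 3 bytes each in UTF-8
    let map = char_to_byte_map(text);
    assert_eq!(map, vec![0, 3, 6, 9, 12, 15]);
    assert_eq!(byte_to_char(&map, 6), 2); // byte 6 starts "タ"
}
```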
Memory note:
- Offset map size is bounded by (normalized_chars + 1) * sizeof(usize) bytes per call; for example, a 10,000-character document needs about 80 KB on a 64-bit target.