Ngram Index
Ngram indexes accelerate pattern matching queries using the LIKE operator with wildcards (%), enabling fast substring searches without full table scans.
What Problem Does It Solve?
Pattern matching with LIKE queries faces significant performance challenges on large datasets:
Example: Searching for '%error log%' in 10M log entries. Without ngram index, it scans all 10M rows. With ngram index, it pre-filters to ~1000 relevant blocks instantly.
Ngram vs Full-Text Index: When to Use Which?
Choose Ngram Index when:
- You have existing
LIKE '%pattern%'queries to optimize - Need exact substring matching (case-insensitive)
- Working with structured data like logs, codes, or IDs
- Want to improve performance without changing query syntax
Choose Full-Text Index when:
- Building search functionality for documents or content
- Need fuzzy search, relevance scoring, or complex queries
- Working with natural language text
- Want advanced search capabilities beyond simple pattern matching
How Ngram Index Works
Ngram indexes break text into overlapping character substrings (n-grams) for fast pattern lookup:
Example with gram_size = 3:
Input: "The quick brown"
N-grams: "The", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own"
Query Processing:
SELECT * FROM t WHERE content LIKE '%quick br%'
- Pattern
'quick br'is tokenized into n-grams: "qui", "uic", "ick", "ck ", "k b", " br" - Index filters data blocks containing these n-grams
- Full
LIKEfilter applied only to pre-filtered blocks
Quick Setup
-- Create table with text content
CREATE TABLE logs(id INT, message STRING);
-- Create ngram index with 3-character segments
CREATE NGRAM INDEX logs_message_idx ON logs(message) gram_size = 3;
-- Insert data (automatically indexed)
INSERT INTO logs VALUES (1, 'Application error occurred');
-- Search using LIKE - automatically optimized
SELECT * FROM logs WHERE message LIKE '%error%';
Complete Example
This example demonstrates creating an ngram index for log analysis and verifying its performance benefits:
-- Create table for application logs
CREATE TABLE t_articles (
id INT,
content STRING
);
-- Create ngram index with 3-character segments
CREATE NGRAM INDEX ngram_idx_content
ON t_articles(content)
gram_size = 3;
-- Verify index creation
SHOW INDEXES;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ name │ type │ original │ definition │ created_on │ updated_on │
├───────────────────┼────────┼──────────┼──────────────────────────────────┼────────────────────────────┼─────────────────────┤
│ ngram_idx_content │ NGRAM │ │ t_articles(content)gram_size='3' │ 2025-05-13 01:02:58.598409 │ NULL │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
-- Insert test data: 995 irrelevant rows + 5 target rows
INSERT INTO t_articles
SELECT number, CONCAT('Random text number ', number)
FROM numbers(995);
INSERT INTO t_articles VALUES
(1001, 'The silence was deep and complete'),
(1002, 'They walked in silence through the woods'),
(1003, 'Silence fell over the room'),
(1004, 'A moment of silence was observed'),
(1005, 'In silence, they understood each other');
-- Search with pattern matching
SELECT id, content FROM t_articles WHERE content LIKE '%silence%';
-- Verify index usage
EXPLAIN SELECT id, content FROM t_articles WHERE content LIKE '%silence%';
Performance Results:
-[ EXPLAIN ]-----------------------------------
TableScan
├── table: default.default.t_articles
├── output columns: [id (#0), content (#1)]
├── read rows: 5
├── read size: < 1 KiB
├── partitions total: 2
├── partitions scanned: 1
├── pruning stats: [segments: <range pruning: 2 to 2>, blocks: <range pruning: 2 to 2, bloom pruning: 2 to 1>]
├── push downs: [filters: [is_true(like(t_articles.content (#1), '%silence%'))], limit: NONE]
└── estimated rows: 15.62
Key Performance Indicator: bloom pruning: 2 to 1 shows the ngram index successfully filtered out 50% of data blocks before scanning.
Best Practices
Essential Commands
For complete command reference, see Ngram Index.
When to Use Ngram Indexes
Ideal for:
- Log analysis and monitoring systems
- Code search and pattern matching
- Product catalog searches
- Any application with frequent
LIKE '%pattern%'queries
Not recommended for:
- Short pattern searches (less than
gram_sizecharacters) - Exact string matching (use equality comparison instead)
- Complex text search requirements (use Full-Text Index instead)
Ngram indexes are essential for applications requiring fast pattern matching with LIKE queries on large text datasets.