PostgreSQL Fuzzy Search
When working with human-generated text data (such as form fields, reviews, blogs, classified ads, or social media), several challenges arise that can affect your ability to analyze the data effectively:
- Common misspellings (e.g., “definitely” → “definitly”)
- Regional spelling variations (e.g., American vs. British English: “color”/“colour”, “analyze”/“analyse”)
- Stylistic variations for emphasis (e.g., “heyyy”, “whaaaaat!”, “noooo!“)
These variations make it difficult to perform accurate analysis, such as grouping similar topics together. Fortunately, PostgreSQL provides several powerful tools to handle these challenges. While this is a complex topic, we’ll cover the basics and some common use cases that should address most scenarios.
Simple Pattern Matching (A Little Fuzzy)
LIKE Operator
The LIKE operator is useful when you have a good understanding of your data patterns but need some flexibility in matching. It provides two wildcard characters:
%
: Matches zero or more characters of any value_
: Matches exactly one character of any value
PostgreSQL offers three variations:
LIKE
: Case-sensitive pattern matching with wildcards
SELECT 'Hello world' LIKE 'He__o %'; -- TRUE
ILIKE
: Case-insensitive pattern matching with wildcards
SELECT 'Hello world' ILIKE 'h_llo _%'; -- TRUE
NOT LIKE
/NOT ILIKE
: Inverse of LIKE or ILIKE operations
SELECT 'Hello world' NOT ILIKE '%_llo world%'; -- FALSE
Regular Expressions
Regular expressions provide more sophisticated pattern matching capabilities. PostgreSQL supports two types of regex operations:
[NOT] SIMILAR TO
: Uses SQL-standard expression syntax, combining elements of LIKE syntax with POSIX regular expressions
SELECT 'Hello world' SIMILAR TO 'H(e|a)l+o %'; --TRUE
~
/!~
/~*
/!~*
: Uses full POSIX regular expression syntax~
: Case-sensitive matching~*
: Case-insensitive matching!
: Negates the match
SELECT 'Hello world' ~ '^H(e|a)l{1,2}o [a-zA-Z]{5}$'; -- TRUE
Improving Performance
You can optimize pattern matching queries using special operator classes for BTREE indexes: text_pattern_ops
and varchar_pattern_ops
. These indexes are effective only for left-anchored patterns (no leading wildcard), such as:
WHERE text_field LIKE 'hell_ %'
For more details on pattern matching capabilities, see the PostgreSQL Pattern Matching Documentation.
Text Search Vectors (Fuzzy-ish)
Full-text search using tsvector is one of the most efficient options for searching text data. It works by:
- Removing stop words (common words like “it”, “the”, “as”, “by”)
- Eliminating duplicates
- Reducing words to their root form (stemming)
For example, words like “quick” and “quickly” are reduced to “quick”, while “product”, “production”, and “products” become “product”. This provides some fuzzy matching capability since exact word matches aren’t required.
Note: To use tsvector effectively, you need to specify the text language. While you can use the ‘simple’ configuration, it loses many of the efficiency benefits.
First, use to_tsvector(config, text)
to create a tsvector, which contains each word component and its position in the original text:
SELECT to_tsvector('english', 'the quick brown fox ran quickly to the other foxes');
to_tsvector
-------------------------------------
'brown':3 'fox':4,10 'quick':2,6 'ran':5
(1 row)
Next, you’ll need a query generator. PostgreSQL provides several options:
to_tsquery(config, text)
: Creates a basic query from tokens with boolean operators
SELECT to_tsquery('english', 'hello'); --> 'hello'
-- OR
SELECT to_tsquery('english', 'hello & worlds'); --> 'hello' & 'world'
plainto_tsquery(config, text)
: Handles generic search terms, connecting words with AND operations
SELECT plainto_tsquery('english', 'hello world'); --> 'hello' & 'world'
websearch_to_tsquery(config, text)
: Provides Google-style search syntax
SELECT websearch_to_tsquery('simple', '"hello there" -world'); --> 'hello' <-> 'there' & !'world'
Example Usage
SELECT message FROM mock_data
WHERE
to_tsvector('english', message)
@@
websearch_to_tsquery('english', 'product killer -content')
LIMIT 5;
message
---------------------------------
productize killer architectures
productize killer synergies
(2 rows)
Improving performance
To achieve optimal query performance, create a generated column with the tsvector data and add a GIN index to that column:
ALTER TABLE mock_data
ADD COLUMN ts_message_col tsvector
GENERATED ALWAYS AS (to_tsvector('english', message))
STORED;
CREATE INDEX idx_tsvector_message ON mock_data USING GIN(ts_message_col);
SELECT message FROM mock_data
WHERE
ts_message_col @@ websearch_to_tsquery('english', 'product or content')
LIMIT 5;
message
-----------------------------------
productize extensible initiatives
target value-added content
productize visionary content
monetize proactive content
synthesize cross-media content
(5 rows)
Trigrams (Fuzzier)
pg_trgm module required:
CREATE extension pg_trgm;
A trigram consists of three consecutive characters from a string. For example, here are the trigrams for the string “Hello world”:
SELECT show_trgm('Hello world');
show_trgm
---------------------------------------------------------------
{" h"," w"," he"," wo",ell,hel,"ld ",llo,"lo ",orl,rld,wor}
PostgreSQL uses trigrams to generate similarity scores between strings. The pg_trgm module provides three functions:
similarity(string, string)
: Calculates similarity between two complete strings
SELECT similarity('hello', 'Helo world'); --> 0.30769232
-- OR
SELECT 1 - ('hello' <-> 'Helo world'); --> 0.307692289352417
word_similarity(string, string)
: Returns the highest similarity between the first string and any substring of the second string
SELECT word_similarity('hello', 'Helo world'); --> 0.5714286
-- OR
SELECT 1 - ('hello' <<-> 'Helo world'); --> 0.5714285969734192
strict_word_similarity(string, string)
: Returns the highest similarity between the first string and any complete word in the second string
SELECT strict_word_similarity('hello', 'Helo world'); --> 0.5714286
-- OR
SELECT 1 - ('hello' <<<-> 'Helo world'); --> 0.5714285969734192
Boolean Results
For boolean comparisons:
SELECT ('hello' % 'Helo world'); --> similarity TRUE
SELECT ('hello' <% 'Helo world'); --> word_similarity FALSE
SELECT ('hello' <<% 'Helo world'); --> strict_word_similarity TRUE
The results depend on these GUC parameters:
pg_trgm.similarity_threshold
(default: 0.3)pg_trgm.word_similarity_threshold
(default: 0.6)pg_trgm.strict_word_similarity_threshold
(default: 0.5)
Improving Performance
The pg_trgm module provides both GiST and GIN index operator classes for indexing text columns. While I haven’t extensively tested this, GiST indexes reportedly provide better performance:
CREATE INDEX trgm_idx_text_column ON test_table USING GIST (text_column gist_trgm_ops);
-- OR
CREATE INDEX trgm_idx_text_column ON test_table USING GIN (text_column gin_trgm_ops);
Levenshtein Distance (Fuzzier)
fuzzystrmatch module required:
CREATE extension fuzzystrmatch;
Levenshtein distance measures the similarity between two strings by counting the minimum number of single-character edits required to transform one string into another.
SELECT
first_name,
levenshtein(first_name, 'Bobby') AS difference FROM mock_data
WHERE levenshtein(first_name, 'Bobby') < 3
ORDER BY 2
LIMIT 5;
first_name | difference
------------+------------
Bobby | 0
Bobbi | 1
Bobbi | 1
Bibby | 1
Toby | 2
(5 rows)
Performance Considerations
One limitation of the Levenshtein method is that it cannot be directly indexed since the index would need to know the input string. However, you can optimize performance by combining it with other fuzzy matching methods described below.
Phonetic Similarity (Very Fuzzy)
fuzzystrmatch module required:
CREATE extension fuzzystrmatch;
These methods are particularly interesting because they compare words based on their pronunciation rather than character-by-character similarity. The fuzzystrmatch module provides three functions:
soundex(string) -> text
: Converts a string to its Soundex code
SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
soundex | soundex | difference
---------+---------+------------
A500 | A500 | 4
(1 row)
metaphone(string, max_output_length) -> text
: Similar to Soundex, but uses a different algorithm to create a representative code
SELECT metaphone('brendan', 10), metaphone('brandon', 10);
metaphone | metaphone
-----------+-----------
BRNTN | BRNTN
(1 row)
dmetaphone(string) -> text
/dmetaphone_alt(string) -> text
: Computes two “sounds like” strings for a given input — a “primary” and an “alternate” code. While these are usually identical, they can differ for non-English names depending on pronunciation
SELECT dmetaphone_alt('brendan'), dmetaphone('Brandon');
dmetaphone_alt | dmetaphone
----------------+------------
PRNT | PRNT
(1 row)
Performance Optimization
Each of these methods can be indexed using a standard function-based index:
CREATE INDEX idx_sdx_first_name ON mock_data (soundex(first_name));
--OR
CREATE INDEX idx_mtf_first_name ON mock_data (metaphone(first_name, 10));
--OR
CREATE INDEX idx_dmtf_first_name ON mock_data (dmetaphone(first_name));
As mentioned earlier, you can improve Levenshtein performance by combining it with phonetic methods. After indexing a column with one of the phonetic functions, use that to reduce the dataset before applying Levenshtein filtering:
SELECT
first_name,
levenshtein(first_name, 'Bobby') AS difference FROM mock_data
WHERE
soundex(first_name) = soundex('bobby')
AND
levenshtein(first_name, 'Bobby') < 3
ORDER BY 2
LIMIT 5;
first_name | difference
------------+------------
Bobby | 0
Bobbi | 1
Bobbi | 1
Bibby | 1
Bobbie | 2
(5 rows)