PostgreSQL Fuzzy Search

February 7, 2021 • By Brendan Scullion

When working with human-generated text data (such as form fields, reviews, blogs, classified ads, or social media), several challenges arise that can affect your ability to analyze the data effectively:

Common misspellings (e.g., “definitely” → “definitly”)
Regional spelling variations (e.g., American vs. British English: “color”/“colour”, “analyze”/“analyse”)
Stylistic variations for emphasis (e.g., “heyyy”, “whaaaaat!”, “noooo!“)

These variations make it difficult to perform accurate analysis, such as grouping similar topics together. Fortunately, PostgreSQL provides several powerful tools to handle these challenges. While this is a complex topic, we’ll cover the basics and some common use cases that should address most scenarios.

Simple Pattern Matching (A Little Fuzzy)

LIKE Operator

The LIKE operator is useful when you have a good understanding of your data patterns but need some flexibility in matching. It provides two wildcard characters:

%: Matches zero or more characters of any value
_: Matches exactly one character of any value

PostgreSQL offers three variations:

LIKE: Case-sensitive pattern matching with wildcards

SELECT 'Hello world' LIKE 'He__o %';  -- TRUE

ILIKE: Case-insensitive pattern matching with wildcards

SELECT 'Hello world' ILIKE 'h_llo _%'; -- TRUE

NOT LIKE/NOT ILIKE: Inverse of LIKE or ILIKE operations

SELECT 'Hello world' NOT ILIKE '%_llo world%'; -- FALSE

Regular Expressions

Regular expressions provide more sophisticated pattern matching capabilities. PostgreSQL supports two types of regex operations:

[NOT] SIMILAR TO: Uses SQL-standard expression syntax, combining elements of LIKE syntax with POSIX regular expressions

SELECT 'Hello world' SIMILAR TO 'H(e|a)l+o %'; --TRUE

~/!~/~*/!~*: Uses full POSIX regular expression syntax
- ~: Case-sensitive matching
- ~*: Case-insensitive matching
- !: Negates the match

SELECT 'Hello world' ~ '^H(e|a)l{1,2}o [a-zA-Z]{5}$'; -- TRUE

Improving Performance

You can optimize pattern matching queries using special operator classes for BTREE indexes: text_pattern_ops and varchar_pattern_ops. These indexes are effective only for left-anchored patterns (no leading wildcard), such as:

WHERE text_field LIKE 'hell_ %'

For more details on pattern matching capabilities, see the PostgreSQL Pattern Matching Documentation.

Text Search Vectors (Fuzzy-ish)

Full-text search using tsvector is one of the most efficient options for searching text data. It works by:

Removing stop words (common words like “it”, “the”, “as”, “by”)
Eliminating duplicates
Reducing words to their root form (stemming)

For example, words like “quick” and “quickly” are reduced to “quick”, while “product”, “production”, and “products” become “product”. This provides some fuzzy matching capability since exact word matches aren’t required.

Note: To use tsvector effectively, you need to specify the text language. While you can use the ‘simple’ configuration, it loses many of the efficiency benefits.

First, use to_tsvector(config, text) to create a tsvector, which contains each word component and its position in the original text:

SELECT to_tsvector('english', 'the quick brown fox ran quickly to the other foxes');
               to_tsvector
-------------------------------------
 'brown':3 'fox':4,10 'quick':2,6 'ran':5
(1 row)

Next, you’ll need a query generator. PostgreSQL provides several options:

to_tsquery(config, text): Creates a basic query from tokens with boolean operators

SELECT to_tsquery('english', 'hello');  --> 'hello'
-- OR
SELECT to_tsquery('english', 'hello & worlds');  --> 'hello' & 'world'

plainto_tsquery(config, text): Handles generic search terms, connecting words with AND operations

SELECT plainto_tsquery('english', 'hello world'); --> 'hello' & 'world'

websearch_to_tsquery(config, text): Provides Google-style search syntax

SELECT websearch_to_tsquery('simple', '"hello there" -world');  -->  'hello' <-> 'there' & !'world'

Example Usage

SELECT message FROM mock_data
WHERE
    to_tsvector('english', message)
    @@
    websearch_to_tsquery('english', 'product killer -content')
LIMIT 5;
             message
---------------------------------
 productize killer architectures
 productize killer synergies
(2 rows)

Improving performance

To achieve optimal query performance, create a generated column with the tsvector data and add a GIN index to that column:

ALTER TABLE mock_data
    ADD COLUMN ts_message_col tsvector
    GENERATED ALWAYS AS (to_tsvector('english', message))
    STORED;

CREATE INDEX idx_tsvector_message ON mock_data USING GIN(ts_message_col);

SELECT message FROM mock_data
WHERE
    ts_message_col @@ websearch_to_tsquery('english', 'product or content')
LIMIT 5;
              message
-----------------------------------
 productize extensible initiatives
 target value-added content
 productize visionary content
 monetize proactive content
 synthesize cross-media content
(5 rows)

Trigrams (Fuzzier)

pg_trgm module required: CREATE extension pg_trgm;

A trigram consists of three consecutive characters from a string. For example, here are the trigrams for the string “Hello world”:

SELECT show_trgm('Hello world');
                           show_trgm
---------------------------------------------------------------
 {"  h","  w"," he"," wo",ell,hel,"ld ",llo,"lo ",orl,rld,wor}

PostgreSQL uses trigrams to generate similarity scores between strings. The pg_trgm module provides three functions:

similarity(string, string): Calculates similarity between two complete strings

SELECT similarity('hello', 'Helo world'); --> 0.30769232
-- OR
SELECT 1 - ('hello' <-> 'Helo world');  --> 0.307692289352417

word_similarity(string, string): Returns the highest similarity between the first string and any substring of the second string

SELECT word_similarity('hello', 'Helo world'); --> 0.5714286
-- OR
SELECT 1 - ('hello' <<-> 'Helo world'); --> 0.5714285969734192

strict_word_similarity(string, string): Returns the highest similarity between the first string and any complete word in the second string

SELECT strict_word_similarity('hello', 'Helo world'); --> 0.5714286
-- OR
SELECT 1 - ('hello' <<<-> 'Helo world'); --> 0.5714285969734192

Boolean Results

For boolean comparisons:

SELECT ('hello' % 'Helo world'); --> similarity TRUE
SELECT ('hello' <% 'Helo world'); --> word_similarity FALSE
SELECT ('hello' <<% 'Helo world'); --> strict_word_similarity TRUE

The results depend on these GUC parameters:

pg_trgm.similarity_threshold (default: 0.3)
pg_trgm.word_similarity_threshold (default: 0.6)
pg_trgm.strict_word_similarity_threshold (default: 0.5)

Improving Performance

The pg_trgm module provides both GiST and GIN index operator classes for indexing text columns. While I haven’t extensively tested this, GiST indexes reportedly provide better performance:

CREATE INDEX trgm_idx_text_column ON test_table USING GIST (text_column gist_trgm_ops);
-- OR
CREATE INDEX trgm_idx_text_column ON test_table USING GIN (text_column gin_trgm_ops);

Levenshtein Distance (Fuzzier)

fuzzystrmatch module required: CREATE extension fuzzystrmatch;

Levenshtein distance measures the similarity between two strings by counting the minimum number of single-character edits required to transform one string into another.

SELECT
    first_name,
    levenshtein(first_name, 'Bobby') AS difference FROM mock_data
WHERE levenshtein(first_name, 'Bobby') < 3
ORDER BY 2
LIMIT 5;
 first_name | difference
------------+------------
 Bobby      |          0
 Bobbi      |          1
 Bobbi      |          1
 Bibby      |          1
 Toby       |          2
(5 rows)

Performance Considerations

One limitation of the Levenshtein method is that it cannot be directly indexed since the index would need to know the input string. However, you can optimize performance by combining it with other fuzzy matching methods described below.

Phonetic Similarity (Very Fuzzy)

fuzzystrmatch module required: CREATE extension fuzzystrmatch;

These methods are particularly interesting because they compare words based on their pronunciation rather than character-by-character similarity. The fuzzystrmatch module provides three functions:

soundex(string) -> text: Converts a string to its Soundex code

SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
 soundex | soundex | difference
---------+---------+------------
 A500    | A500    |          4
(1 row)

metaphone(string, max_output_length) -> text: Similar to Soundex, but uses a different algorithm to create a representative code

SELECT metaphone('brendan', 10), metaphone('brandon', 10);
 metaphone | metaphone
-----------+-----------
 BRNTN     | BRNTN
(1 row)

dmetaphone(string) -> text/dmetaphone_alt(string) -> text: Computes two “sounds like” strings for a given input — a “primary” and an “alternate” code. While these are usually identical, they can differ for non-English names depending on pronunciation

SELECT dmetaphone_alt('brendan'), dmetaphone('Brandon');
 dmetaphone_alt | dmetaphone
----------------+------------
 PRNT           | PRNT
(1 row)

Performance Optimization

Each of these methods can be indexed using a standard function-based index:

CREATE INDEX idx_sdx_first_name ON mock_data (soundex(first_name));
--OR
CREATE INDEX idx_mtf_first_name ON mock_data (metaphone(first_name, 10));
--OR
CREATE INDEX idx_dmtf_first_name ON mock_data (dmetaphone(first_name));

As mentioned earlier, you can improve Levenshtein performance by combining it with phonetic methods. After indexing a column with one of the phonetic functions, use that to reduce the dataset before applying Levenshtein filtering:

SELECT
    first_name,
    levenshtein(first_name, 'Bobby') AS difference FROM mock_data
WHERE
    soundex(first_name) = soundex('bobby')
AND
    levenshtein(first_name, 'Bobby') < 3
ORDER BY 2
LIMIT 5;
 first_name | difference
------------+------------
 Bobby      |          0
 Bobbi      |          1
 Bobbi      |          1
 Bibby      |          1
 Bobbie     |          2
(5 rows)

See: PostgreSQL Docs: fuzzystrmatch