(Advanced) How Fuzzy Matching Works

Fuzzy matching allows MatchLogic to find records that are similar but not identical — handling typos, abbreviations, name variations, and data entry inconsistencies that would cause exact matching to fail. This article explains the algorithms behind fuzzy matching and how the configuration parameters control their behavior.

Edit Distance (Levenshtein)

The foundation of most fuzzy string matching is edit distance, also called Levenshtein distance after its inventor Vladimir Levenshtein. Edit distance is defined as the minimum number of single-character operations required to transform one string into another. The three permitted operations are:

  • Insertion — adding a character (e.g., "color" → "colour" requires inserting a "u")
  • Deletion — removing a character (e.g., "colour" → "color" requires deleting the "u")
  • Substitution — replacing one character with another (e.g., "cat" → "bat" requires substituting "b" for "c")

Some examples of edit distances between common name variations:

  • "Smith" vs "Smyth" — edit distance 1 (substitute 'i' with 'y')
  • "Jon" vs "John" — edit distance 1 (insert 'h')
  • "Catherine" vs "Kathryn" — edit distance 4
  • "Robert" vs "Roberto" — edit distance 1 (insert 'o')
  • "Williams" vs "Williamson" — edit distance 2 (insert 'o', insert 'n')
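
The definition above can be checked with a short dynamic-programming sketch. This is the textbook Levenshtein algorithm written in Python for illustration, not MatchLogic's internal code; the function name levenshtein is an assumption made for this example, and the comparison is case-sensitive.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


pairs = [
    ("Smith", "Smyth"),
    ("Jon", "John"),
    ("Catherine", "Kathryn"),
    ("Robert", "Roberto"),
    ("Williams", "Williamson"),
]
for left, right in pairs:
    print(f"{left} vs {right}: {levenshtein(left, right)}")
```

Run against the name pairs listed above, the sketch reports distances of 1, 1, 4, 1, and 2 respectively.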

Converting Edit Distance to a Similarity Score

Raw edit distance is an absolute count, which makes it difficult to compare across fields with different value lengths. A distance of 2 is significant for a 4-character string but negligible for a 20-character string. MatchLogic normalizes edit distance into a 0–100 similarity score using the following formula:

similarity = 1 - (edit_distance / max_length(string_a, string_b))
score = similarity × 100

For example, "Smith" (5 characters) vs "Smyth" (5 characters) with edit distance 1:

similarity = 1 - (1 / 5) = 0.80 → score = 80

And "Jon" (3 characters) vs "Jonathan" (8 characters) with edit distance 5:

similarity = 1 - (5 / 8) = 0.375 → score = 37.5
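
As a minimal sketch, the normalization can be written as a small Python function. The name similarity_score is hypothetical, and returning 100 when both values are empty is an assumption; the article does not say how MatchLogic treats that case.

```python
def similarity_score(edit_distance: int, a: str, b: str) -> float:
    """Convert an edit distance into a 0-100 score using the formula above."""
    max_length = max(len(a), len(b))
    if max_length == 0:
        # Assumption: two empty values are treated as identical.
        return 100.0
    # Algebraically the same as (1 - edit_distance / max_length) * 100.
    return 100 * (max_length - edit_distance) / max_length


print(similarity_score(1, "Smith", "Smyth"))   # 80.0
print(similarity_score(5, "Jon", "Jonathan"))  # 37.5
```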

The Level Parameter

When configuring a fuzzy match criterion in Match Definitions, you set a Level (1–5) that maps to a minimum similarity threshold. A pair that scores below the threshold receives a field-level score of 0 for that criterion. The mapping is:

  • Level 1 — Very Low: accepts matches with similarity ≥ 20%. Suitable when values may be heavily abbreviated or truncated. High false-positive risk.
  • Level 2 — Low: accepts matches with similarity ≥ 40%. Useful for free-text fields with significant variation.
  • Level 3 — Medium: accepts matches with similarity ≥ 60%. A balanced default for most name and address fields.
  • Level 4 — High: accepts matches with similarity ≥ 80%. Appropriate when values are expected to be nearly identical with only minor typos.
  • Level 5 — Very High: accepts matches with similarity ≥ 95%. Under the formula above, a single-character difference passes only when the longer value has at least 20 characters, so this level approaches exact matching.
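
The level mapping can be sketched as a lookup table plus a threshold check. LEVEL_THRESHOLDS and field_score are hypothetical names for this illustration. The article only states that pairs below the threshold score 0; that a passing pair keeps its similarity score is an assumption.

```python
# Minimum similarity (0-100) required at each Level, taken from the list above.
LEVEL_THRESHOLDS = {1: 20, 2: 40, 3: 60, 4: 80, 5: 95}


def field_score(similarity: float, level: int) -> float:
    """Return the field-level score for one criterion (hypothetical helper)."""
    if similarity >= LEVEL_THRESHOLDS[level]:
        # Assumption: a pair that meets the threshold keeps its similarity score.
        return similarity
    return 0.0


print(field_score(80.0, 4))  # 80.0: "Smith" vs "Smyth" passes Level 4
print(field_score(80.0, 5))  # 0.0: the same pair fails Level 5
print(field_score(37.5, 1))  # 37.5: "Jon" vs "Jonathan" passes only Level 1
print(field_score(37.5, 2))  # 0.0
```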

Choosing the right level is a trade-off: lower levels catch more genuine duplicates (higher recall) but also introduce more false positives (lower precision). It is common to run an initial match at Level 3 and adjust up or down based on the quality of results.

FastLevel

The FastLevel option uses a simplified, computationally cheaper similarity algorithm instead of full Levenshtein calculation. It applies the same Level 1–5 thresholds but uses heuristics rather than the full dynamic programming approach. FastLevel is most appropriate when:

  • The dataset is very large and matching performance is a priority
  • The field values are relatively short and the expected variation is minor
  • A slight reduction in matching accuracy is acceptable in exchange for faster processing

For most projects with standard dataset sizes, the full Levenshtein algorithm provides meaningfully better accuracy and is recommended over FastLevel.

Jaro-Winkler Similarity

Jaro-Winkler is an alternative string similarity algorithm that gives additional weight to matching characters at the beginning of strings (the prefix). This reflects the linguistic reality that the start of a word is often more stable than the end — typos and truncations are more common at the end of names and words.

The Jaro-Winkler score is calculated in two steps:

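  1. Compute the base Jaro similarity, which counts the characters the two strings share at nearby positions (in the standard definition, within floor(max_length / 2) - 1 positions of each other) and how many of those matched characters appear in a different order (transpositions).
  2. Apply the Winkler prefix bonus: if the strings share an identical prefix, the score is boosted in proportion to the prefix length, with at most the first four characters counted.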

For example, "JOHNSON" vs "JONSON" scores higher under Jaro-Winkler than under normalized Levenshtein, in part because the shared prefix "JO" is rewarded. Jaro-Winkler is particularly well-suited to personal names and is used by MatchLogic's similarity engine (backed by the SimMetrics.Net library) as an alternative to Levenshtein where prefix agreement is an important signal.
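
The two steps can be sketched in Python using the standard textbook definition of Jaro-Winkler. This is an illustration only: it is not the SimMetrics.Net implementation, the function names are hypothetical, and the prefix scale of 0.1 with a four-character cap is the commonly used default rather than a documented MatchLogic setting.

```python
def jaro(a: str, b: str) -> float:
    """Base Jaro similarity in the range 0.0-1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this many positions of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char_a in enumerate(a):
        start = max(0, i - window)
        end = min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == char_a:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order; each swap counts once.
    a_sequence = [c for c, hit in zip(a, a_matched) if hit]
    b_sequence = [c for c, hit in zip(b, b_matched) if hit]
    transpositions = sum(x != y for x, y in zip(a_sequence, b_sequence)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix characters."""
    base = jaro(a, b)
    prefix = 0
    for char_a, char_b in zip(a[:max_prefix], b[:max_prefix]):
        if char_a != char_b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro("JOHNSON", "JONSON"), 3))          # 0.952
print(round(jaro_winkler("JOHNSON", "JONSON"), 3))  # 0.962
```

For comparison, the same pair has a Levenshtein edit distance of 1 over a maximum length of 7, which normalizes to 1 - (1 / 7) ≈ 0.857. In this sketch the base Jaro score is already higher than that, and the two-character prefix "JO" adds a further small bonus.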

Practical Guidance

Fuzzy matching is most effective when combined with good data preparation. Running the Data Cleansing module to standardize casing (UpperCase or ProperCase), remove extra whitespace, and expand abbreviations before matching will significantly improve fuzzy match quality regardless of the level or algorithm used. A pair that differs only in casing or leading/trailing spaces will have its similarity artificially reduced if the values are not normalized first.
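
To illustrate the kind of normalization meant here, the following sketch trims and collapses whitespace and upper-cases a value. It is a stand-in written for this example, not the Data Cleansing module itself; the function name normalize is hypothetical, and abbreviation expansion is left out because it depends on a project-specific dictionary.

```python
import re


def normalize(value: str) -> str:
    """Trim the value, collapse runs of whitespace to one space, and upper-case it."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.upper()


print(normalize("  john   smith "))                          # JOHN SMITH
print(normalize("John Smith") == normalize(" JOHN  SMITH"))  # True
```

In a case-sensitive comparison, "smith" and "SMITH" differ in all five characters, so the unnormalized pair would score 0 under the edit-distance formula; after normalization the two values are identical.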