(Advanced) Understanding Match Scoring

Understanding how MatchLogic calculates match scores is essential for tuning your match configuration and interpreting results correctly. This article walks through the scoring formula, shows how multiple definitions interact, and describes how scores map to confidence bands and relate to the threshold setting.

Field-Level Scores

Each criterion in a match definition produces a field-level score between 0 and 100 when two records are compared. The meaning of this score depends on the match type:

  • Exact match: returns either 100 (values are identical after normalization) or 0 (values differ)
  • Fuzzy match: returns a value between 0 and 100 based on string similarity, with 0 returned if similarity falls below the configured Level threshold
  • Phonetic match: returns 100 if phonetic codes match, 0 if they do not
  • Numeric Range match: returns a graduated score between 0 and 100 based on how close the values are relative to the configured tolerance

A field-level score of 0 on one criterion does not prevent a pair from matching overall. It simply means that criterion adds nothing to the weighted sum, although its weight still counts toward the total. The overall score can still be high if the remaining criteria score well and have significant weights.
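The match types above can be sketched in a few lines of Python. This is an illustration only, not MatchLogic's implementation: the function names are hypothetical, the normalization step (trim and lowercase) is an assumption, `difflib.SequenceMatcher` stands in for whatever similarity measure the product uses, and the linear falloff in the numeric range score is one plausible reading of "graduated". Phonetic matching is omitted because it depends on the phonetic algorithm configured.

```python
from difflib import SequenceMatcher


def exact_score(a: str, b: str) -> int:
    # Assumed normalization: trim whitespace and ignore case.
    return 100 if a.strip().lower() == b.strip().lower() else 0


def fuzzy_score(a: str, b: str, level: float) -> float:
    # SequenceMatcher is a stand-in similarity measure (assumption).
    similarity = 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Similarity below the configured Level is reported as 0.
    return similarity if similarity >= level else 0.0


def numeric_range_score(a: float, b: float, tolerance: float) -> float:
    # Assumed linear falloff: 100 for equal values, 0 at or beyond the tolerance.
    if tolerance <= 0:
        return 100.0 if a == b else 0.0
    return max(0.0, 100.0 * (1 - abs(a - b) / tolerance))


print(exact_score("Smith ", "smith"))        # 100
print(fuzzy_score("Smith", "Jones", 80))     # 0.0 (similarity is below the Level)
print(numeric_range_score(100, 105, 10))     # 50.0
```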

The Weighted Average Formula

The overall match score for a pair under a given definition is calculated as a weighted average of all field-level scores:

overall_score = (Σ field_score_i × weight_i) / (Σ weight_i)

The denominator is the sum of all weights in the definition, not just the weights of criteria that produced non-zero scores. This means that a criterion with a high weight that scores 0 will significantly pull down the overall score, even if other criteria score perfectly. This is intentional: if a key field like last name is configured with a high weight, a pair where last names are completely dissimilar should receive a low overall score regardless of how well other fields match.

Example: Three-Criteria Definition

Consider a match definition with three criteria:

Criteria                 Weight   Field Score   Contribution
Last Name (Fuzzy)        50       80            50 × 80 = 4,000
First Name (Fuzzy)       30       100           30 × 100 = 3,000
Date of Birth (Exact)    20       0             20 × 0 = 0

overall_score = (4,000 + 3,000 + 0) / (50 + 30 + 20) = 7,000 / 100 = 70

The pair scores 70 overall. The date of birth mismatch costs 20 points: had DOB matched exactly, the score would have been (4,000 + 3,000 + 2,000) / 100 = 90. The analyst must decide whether 70 is sufficient to be considered a match given the project's threshold setting.
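The calculation can be reproduced with a short Python sketch. The function name `overall_score` and the tuple layout are illustrative choices, not part of MatchLogic. The code applies the weighted average formula to the three criteria in the example and then recomputes the score with a matching date of birth for comparison.

```python
def overall_score(criteria: list[tuple[str, float, float]]) -> float:
    """Weighted average over (name, weight, field_score) tuples."""
    # The denominator includes every weight, even for criteria that scored 0.
    total_weight = sum(weight for _, weight, _ in criteria)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    weighted_sum = sum(weight * score for _, weight, score in criteria)
    return weighted_sum / total_weight


example = [
    ("Last Name (Fuzzy)", 50, 80),
    ("First Name (Fuzzy)", 30, 100),
    ("Date of Birth (Exact)", 20, 0),
]
# Same pair, but with the date of birth matching exactly.
dob_matches = example[:2] + [("Date of Birth (Exact)", 20, 100)]

print(overall_score(example))      # 70.0
print(overall_score(dob_matches))  # 90.0
```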

Multiple Definitions: OR Logic

A project can have multiple match definitions, each representing a different matching strategy. When multiple definitions are active, MatchLogic evaluates all of them for each pair and uses the highest score among them as the pair's overall score. This is OR logic: if any definition produces a score that meets the threshold, the pair is considered a match.

For example, you might have:

  • Definition A: Name + Date of Birth + Address (high-confidence identity matching)
  • Definition B: Name + Phone Number (useful when address is not available)
  • Definition C: Email Address alone (for digital identity matching)

A pair that has no address or phone data but identical email addresses will score 0 under Definitions A and B but potentially 100 under Definition C. The pair's reported score is 100, and the winning definition is C. The Match Results table shows the score breakdown per definition, so analysts can see which definition drove each match.
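The OR logic amounts to taking the maximum over the per-definition scores. A minimal sketch, assuming the scores for one pair are already available as a dictionary keyed by definition name; the function name and the threshold of 60 are hypothetical, and the tie-breaking rule (first definition listed wins) is an assumption, since the article does not say how ties are resolved.

```python
def best_definition(scores_by_definition: dict[str, float], threshold: float):
    """Return (winning definition, score, is_match) for one record pair."""
    # max() keeps the first key encountered when scores tie (assumed tie-break).
    winner = max(scores_by_definition, key=scores_by_definition.get)
    score = scores_by_definition[winner]
    # The pair matches if its best score is not below the threshold.
    return winner, score, score >= threshold


# Scores from the example: no address or phone data, identical email.
pair_scores = {"A": 0, "B": 0, "C": 100}
print(best_definition(pair_scores, threshold=60))  # ('C', 100, True)
```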

Confidence Bands

To make scores more intuitive, MatchLogic maps numeric scores to named confidence bands displayed as color-coded badges in the results table and summary report:

Band        Score Range   Interpretation
Excellent   95–100        Near-certain match. Manual review rarely needed.
High        80–94         Strong match. Occasional review recommended for high-value data.
Good        60–79         Likely match. Review is advisable before merging.
Moderate    40–59         Possible match. Careful review required; false positives likely.
Low         20–39         Weak match. Probably not duplicates unless other evidence supports it.
Poor        0–19          Very unlikely to be a true match.

The Match Quality Report (Summary tab in Match Results) shows the distribution of pairs across confidence bands, giving a quick overview of overall match quality for the project.
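The band lookup and the distribution shown in the report can be approximated as follows. This is a sketch, not the product's code: the names are illustrative, the sample scores are made up, and treating each band's lower bound as inclusive for fractional scores (so 94.5 falls in High) is an assumption, since the table lists whole numbers only.

```python
from collections import Counter

# (inclusive lower bound, band name), highest band first.
BANDS = [
    (95, "Excellent"),
    (80, "High"),
    (60, "Good"),
    (40, "Moderate"),
    (20, "Low"),
    (0, "Poor"),
]


def confidence_band(score: float) -> str:
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    raise AssertionError("unreachable: the last band starts at 0")


# Hypothetical scores for a handful of pairs.
sample_scores = [100, 97, 88, 70, 70, 45, 12]
distribution = Counter(confidence_band(s) for s in sample_scores)
print(distribution["Excellent"], distribution["Good"], distribution["Low"])  # 2 2 0
```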

The Threshold Setting

The threshold is a minimum score below which pairs are excluded from results entirely. It is configured in the Match Definitions module and applied globally. Pairs that score below the threshold are never surfaced to analysts and are not included in groups, counts, or export data.
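In effect, the threshold is a filter applied to scored pairs before anything else sees them. A minimal sketch, assuming pairs are represented as (record ID, record ID, score) tuples; the record IDs, the function name, and the threshold value are hypothetical.

```python
def apply_threshold(pairs, threshold):
    """Keep only pairs whose score is not below the threshold."""
    return [pair for pair in pairs if pair[2] >= threshold]


scored_pairs = [
    ("R1", "R2", 92),
    ("R1", "R3", 70),
    ("R4", "R5", 38),
]
print(apply_threshold(scored_pairs, threshold=70))
# [('R1', 'R2', 92), ('R1', 'R3', 70)]
```

A pair scoring exactly at the threshold is kept here, which follows from the threshold being described as a minimum score.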

Setting the threshold correctly is one of the most important tuning decisions in a matching project:

  • A high threshold (e.g., 85) produces fewer, higher-confidence results but may miss genuine duplicates where data quality is lower. This is appropriate when false positives are costly — for example, when matched records will be automatically merged without review.
  • A low threshold (e.g., 40) captures more potential duplicates but requires more manual review to filter out false positives. This is appropriate for exploratory analysis or when completeness is critical and all possible matches must be identified.

A common approach is to start with a moderate threshold (60–70), review a sample of results, and then adjust up if too many false positives appear or down if genuine duplicates are being missed. The confidence band distribution in the Match Quality Report is a useful guide for this tuning process.
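When reviewing a sample outside the tool, for instance from exported scores, it can help to see how many pairs would survive at several candidate thresholds before changing the setting. The sketch below assumes a plain list of overall scores; the scores, the candidate thresholds, and the function name are made up for illustration.

```python
def surviving_counts(scores, thresholds):
    """Map each candidate threshold to the number of pairs it would keep."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Hypothetical overall scores from a sample run.
scores = [98, 91, 84, 77, 72, 66, 58, 43, 35]
print(surviving_counts(scores, [40, 60, 70, 85]))
# {40: 8, 60: 6, 70: 5, 85: 2}
```

A sharp drop between two adjacent candidates shows where many pairs cluster, which is a reasonable place to focus the manual review.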