Outlier Detection

Outlier Detection

The Outlier Detection panel identifies values that are significantly different from the rest of the data in a field. These anomalies may be data entry errors that need correction or legitimate edge cases that require special handling during matching.

Outlier Detection panel showing flagged values

How Outliers Are Detected

MatchLogic uses Z-score analysis to identify outliers. The Z-score measures how many standard deviations a value is from the mean. Values with Z-scores beyond a threshold (typically 2 or 3 standard deviations) are flagged as outliers.

For text fields, outlier detection is based on characteristics like:

  • Value length -- Names that are unusually long or short compared to the average
  • Character composition -- Values with unexpected character types (numbers in name fields)
  • Pattern deviation -- Values that do not match the dominant patterns for the field

For numeric fields, outlier detection works on the actual numeric values:

  • Extreme values -- Numbers far above or below the typical range
  • Zero or negative values -- When the field typically contains positive numbers

Common Types of Outliers

Here are examples of outliers you might encounter during data profiling:

  • Unusually long names -- A name field where most values are 5-20 characters but one entry is 200 characters, possibly containing an entire address or notes pasted into the wrong field.
  • Extreme numeric values -- An age field with values of 0 or 999, indicating placeholder or error values.
  • Rare patterns -- A phone number field where 99% of values follow standard formats but a few contain letters or special characters.
  • Encoding issues -- Values with non-printable characters, Unicode artifacts, or HTML entities that slipped in during import.

Reviewing Outliers

For each flagged outlier, the panel shows:

  • The actual value (or a truncated preview for long values)
  • The field it belongs to
  • How far it deviates from the norm

Click on an outlier to view the full record, which helps you determine whether it is a genuine error or a valid edge case.

What to Do with Outliers

Before proceeding to matching, decide how to handle each category of outlier:

  1. Data entry errors -- Correct these using data cleansing rules. For example, use #replace-and-remove-operations to fix known bad values.
  2. Placeholder values -- Values like "N/A", "Unknown", or "999" should be removed or converted to nulls so they do not create false matches.
  3. Legitimate edge cases -- Some outliers are valid (very long legal company names, unusually high transaction amounts). Leave these as-is but be aware they may score differently during matching.
  4. Wrong-field data -- Data entered in the wrong column should be corrected at the source if possible, or excluded from matching criteria.

Important

Outliers can cause unexpected matching behavior. A placeholder value like "Unknown" in a name field could match across hundreds of unrelated records, creating a massive false-positive group. Always review outliers before running matches.

Tip

Use the #detailed-analysis-view to sort fields by outlier count and focus your review on the fields with the most anomalies first.