Outlier Detection
Outlier Detection
The Outlier Detection panel identifies values that are significantly different from the rest of the data in a field. These anomalies may be data entry errors that need correction or legitimate edge cases that require special handling during matching.
How Outliers Are Detected
MatchLogic uses Z-score analysis to identify outliers. The Z-score measures how many standard deviations a value is from the mean. Values with Z-scores beyond a threshold (typically 2 or 3 standard deviations) are flagged as outliers.
For text fields, outlier detection is based on characteristics like:
- Value length -- Names that are unusually long or short compared to the average
- Character composition -- Values with unexpected character types (numbers in name fields)
- Pattern deviation -- Values that do not match the dominant patterns for the field
For numeric fields, outlier detection works on the actual numeric values:
- Extreme values -- Numbers far above or below the typical range
- Zero or negative values -- When the field typically contains positive numbers
Common Types of Outliers
Here are examples of outliers you might encounter during data profiling:
- Unusually long names -- A name field where most values are 5-20 characters but one entry is 200 characters, possibly containing an entire address or notes pasted into the wrong field.
- Extreme numeric values -- An age field with values of 0 or 999, indicating placeholder or error values.
- Rare patterns -- A phone number field where 99% of values follow standard formats but a few contain letters or special characters.
- Encoding issues -- Values with non-printable characters, Unicode artifacts, or HTML entities that slipped in during import.
Reviewing Outliers
For each flagged outlier, the panel shows:
- The actual value (or a truncated preview for long values)
- The field it belongs to
- How far it deviates from the norm
Click on an outlier to view the full record, which helps you determine whether it is a genuine error or a valid edge case.
What to Do with Outliers
Before proceeding to matching, decide how to handle each category of outlier:
- Data entry errors -- Correct these using data cleansing rules. For example, use #replace-and-remove-operations to fix known bad values.
- Placeholder values -- Values like "N/A", "Unknown", or "999" should be removed or converted to nulls so they do not create false matches.
- Legitimate edge cases -- Some outliers are valid (very long legal company names, unusually high transaction amounts). Leave these as-is but be aware they may score differently during matching.
- Wrong-field data -- Data entered in the wrong column should be corrected at the source if possible, or excluded from matching criteria.
Important
Outliers can cause unexpected matching behavior. A placeholder value like "Unknown" in a name field could match across hundreds of unrelated records, creating a massive false-positive group. Always review outliers before running matches.
Tip
Use the #detailed-analysis-view to sort fields by outlier count and focus your review on the fields with the most anomalies first.