Anomaly Detection


The Anomaly Detection panel identifies values that are significantly different from the rest of the data in a field. These anomalies may be data entry errors that need correction or legitimate edge cases that require special handling during matching.

How Anomalies Are Detected

MatchLogic uses Z-score analysis to identify anomalies. The Z-score measures how many standard deviations a value is from the mean. Values with Z-scores beyond a threshold (typically 2 or 3 standard deviations) are flagged as anomalies.

For text fields, anomaly detection is based on characteristics like:

  • Value length -- Names that are unusually long or short compared to the average
  • Character composition -- Values with unexpected character types (numbers in name fields)
  • Pattern deviation -- Values that do not match the dominant patterns for the field

For numeric fields, anomaly detection works on the actual numeric values:

  • Extreme values -- Numbers far above or below the typical range
  • Zero or negative values -- When the field typically contains positive numbers

Common Types of Anomalies

Here are examples of anomalies you might encounter during data profiling:

  • Unusually long names -- A name field where most values are 5-20 characters but one entry is 200 characters, possibly containing an entire address or notes pasted into the wrong field.
  • Extreme numeric values -- An age field with values of 0 or 999, indicating placeholder or error values.
  • Rare patterns -- A phone number field where 99% of values follow standard formats but a few contain letters or special characters.
  • Encoding issues -- Values with non-printable characters, Unicode artifacts, or HTML entities that slipped in during import.

Reviewing Anomalies

For each flagged anomaly, the panel shows:

  • The actual value (or a truncated preview for long values)
  • The field it belongs to
  • How far it deviates from the norm

Click on an anomalies to view the full record, which helps you determine whether it is a genuine error or a valid edge case.

What to Do with Anomalies

Before proceeding to matching, decide how to handle each category of outlier:

  1. Data entry errors -- Correct these using data cleansing rules. For example, use replace-and-remove-operations to fix known bad values.
  2. Placeholder values -- Values like "N/A", "Unknown", or "999" should be removed or converted to nulls so they do not create false matches.
  3. Legitimate edge cases -- Some outliers are valid (very long legal company names, unusually high transaction amounts). Leave these as-is but be aware they may score differently during matching.
  4. Wrong-field data -- Data entered in the wrong column should be corrected at the source if possible, or excluded from matching criteria.

Important

Anomalies can cause unexpected matching behavior. A placeholder value like "Unknown" in a name field could match across hundreds of unrelated records, creating a massive false-positive group. Always review anomalies before running matches.

Tip

Use the detailed-analysis-view to sort fields by anomaly count and focus your review on the fields with the most anomalies first.