Anomaly Detection

The Anomaly Detection panel identifies values that are significantly different from the rest of the data in a field. These anomalies may be data entry errors that need correction or legitimate edge cases that require special handling during matching.

How Anomalies Are Detected

MatchLogic uses Z-score analysis to identify anomalies. A Z-score measures how many standard deviations a value lies from the mean of its field. Values whose absolute Z-score exceeds a threshold (typically 2 or 3) are flagged as anomalies.
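
To make the Z-score idea concrete, here is a minimal Python sketch of the calculation. It is an illustration only, not MatchLogic's internal implementation: the function names (z_scores, flag_anomalies), the use of the population standard deviation, and the sample ages are all assumptions made for this example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def flag_anomalies(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical age field with one placeholder-like value.
ages = [34, 29, 41, 38, 25, 33, 45, 30, 36, 27, 39, 31, 999]
print(flag_anomalies(ages, threshold=3.0))  # [999]
```

Note that in very small samples a single extreme value inflates the standard deviation, so it may not reach a threshold of 3. A lower threshold such as 2 can be more useful there.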

For text fields, anomaly detection is based on characteristics like:

Value length -- Names that are unusually long or short compared to the average
Character composition -- Values with unexpected character types (for example, digits in a name field)
Pattern deviation -- Values that do not match the dominant patterns for the field
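
The pattern-deviation idea above can be sketched by reducing each value to a coarse "shape" and flagging values whose shape is rare in the field. This is a hypothetical illustration: the shape encoding (letters become A, digits become 9, repeated characters collapsed), the helper names, and the min_share cutoff are assumptions, not MatchLogic's documented behavior.

```python
import re
from collections import Counter

def shape(value):
    """Map letters to 'A' and digits to '9', then collapse repeated characters."""
    mapped = "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )
    return re.sub(r"(.)\1+", r"\1", mapped)

def rare_patterns(values, min_share=0.05):
    """Return values whose shape accounts for less than min_share of the field."""
    counts = Counter(shape(v) for v in values)
    total = len(values)
    return [v for v in values if counts[shape(v)] / total < min_share]

# Hypothetical name field: four values share the shape "A A", one does not.
names = ["Ann Lee", "Bob Stone", "Cara Diaz", "Dev Patel", "J0hn Sm1th"]
print(rare_patterns(names, min_share=0.3))  # ['J0hn Sm1th']
```

Value length can be checked the same way as any numeric field, by computing Z-scores over len(value) for each entry.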

For numeric fields, anomaly detection works on the actual numeric values:

Extreme values -- Numbers far above or below the typical range
Zero or negative values -- When the field typically contains positive numbers
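
The zero-or-negative check can be sketched as a simple rule: if nearly all values in a field are positive, treat the few that are not as suspicious. The function name and the min_positive_share parameter below are assumptions chosen for this example, not product settings.

```python
def non_positive_outliers(values, min_positive_share=0.95):
    """Flag zero or negative values, but only when the field is mostly positive."""
    if not values:
        return []
    positive = sum(1 for v in values if v > 0)
    if positive / len(values) < min_positive_share:
        # The field is not "typically positive", so non-positive values are normal.
        return []
    return [v for v in values if v <= 0]

# Hypothetical amount field: 99 positive values and a single zero.
amounts = [12.5] * 99 + [0.0]
print(non_positive_outliers(amounts))  # [0.0]
```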

Common Types of Anomalies

Here are examples of anomalies you might encounter during data profiling:

Unusually long names -- A name field where most values are 5-20 characters but one entry is 200 characters, possibly containing an entire address or notes pasted into the wrong field.
Extreme numeric values -- An age field with values of 0 or 999, indicating placeholder or error values.
Rare patterns -- A phone number field where 99% of values follow standard formats but a few contain letters or special characters.
Encoding issues -- Values with non-printable characters, Unicode artifacts, or HTML entities that slipped in during import.
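
Encoding issues in particular are easy to screen for before or after import. The sketch below uses only the Python standard library and checks for three symptoms: non-printable characters, the Unicode replacement character (U+FFFD), and HTML entities. The function name and the choice of checks are assumptions for illustration, and the list is not exhaustive.

```python
import html

def encoding_issues(value):
    """Return a list of likely encoding problems found in a text value."""
    issues = []
    if any(not ch.isprintable() for ch in value):
        issues.append("non-printable character")
    if "\ufffd" in value:
        issues.append("Unicode replacement character")
    if html.unescape(value) != value:
        # Unescaping changed the text, so it contained an HTML entity.
        issues.append("HTML entity")
    return issues

for sample in ["José García", "Smith &amp; Sons", "Jane\x00Doe", "Caf\ufffd"]:
    print(repr(sample), encoding_issues(sample))
```

Legitimate accented characters such as those in "José García" pass all three checks, so this kind of screen does not penalize valid non-ASCII data.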

Reviewing Anomalies

For each flagged anomaly, the panel shows:

The actual value (or a truncated preview for long values)
The field it belongs to
How far it deviates from the norm

Click an anomaly to view the full record, which helps you determine whether it is a genuine error or a valid edge case.

What to Do with Anomalies

Before proceeding to matching, decide how to handle each category of outlier:

Data entry errors -- Correct these using data cleansing rules. For example, use Replace and Remove operations to fix known bad values.
Placeholder values -- Values like "N/A", "Unknown", or "999" should be removed or converted to nulls so they do not create false matches.
Legitimate edge cases -- Some outliers are valid (very long legal company names, unusually high transaction amounts). Leave these as-is but be aware they may score differently during matching.
Wrong-field data -- Data entered in the wrong column should be corrected at the source if possible, or excluded from matching criteria.
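
If you clean data outside MatchLogic before import, converting placeholder values to nulls might look like the following sketch. The placeholder list and the function name are assumptions; adjust the list to the placeholders that actually appear in your data.

```python
# Hypothetical placeholder list, compared case-insensitively after trimming.
PLACEHOLDERS = {"n/a", "na", "unknown", "999", ""}

def null_placeholders(values, placeholders=PLACEHOLDERS):
    """Replace known placeholder values with None and keep everything else."""
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(None)
            continue
        key = str(v).strip().lower()
        cleaned.append(None if key in placeholders else v)
    return cleaned

raw = ["Alice", "N/A", " unknown ", 999, None, "Bob"]
print(null_placeholders(raw))  # ['Alice', None, None, None, None, 'Bob']
```

Treating "999" as a placeholder is only safe in fields where it cannot be a real value, such as age. In an amount or ID field it may be legitimate.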

Important

Anomalies can cause unexpected matching behavior. A placeholder value like "Unknown" in a name field could match across hundreds of unrelated records, creating a massive false-positive group. Always review anomalies before running matches.

Tip

Use the Detailed Analysis view to sort fields by anomaly count, and review the fields with the most anomalies first.