Uniqueness and Duplicate Indicators
The Uniqueness chart shows the number of distinct values in each field relative to the total record count. This metric is fundamental for understanding which fields make strong matching criteria and which fields contain too many repeated values to be useful on their own.
Understanding Uniqueness
Uniqueness is expressed as the ratio of distinct values to total records. If a datasource has 10,000 records and a field contains 9,800 distinct values, that field has 98% uniqueness. Here is how to interpret the spectrum:
- Very high uniqueness (90-100%) -- Nearly every record has a different value. These fields are excellent identifiers. Examples: email addresses, Social Security numbers, account IDs.
- High uniqueness (70-90%) -- Most values are unique with some repetition. Examples: full names, street addresses.
- Moderate uniqueness (30-70%) -- Significant repetition exists. Examples: city names, company names, last names.
- Low uniqueness (5-30%) -- Highly repetitive values. Examples: state abbreviations, country codes, job titles.
- Very low uniqueness (below 5%) -- Only a handful of distinct values. Examples: status fields, gender, boolean flags.
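The ratio and the bands above can be sketched in a few lines of Python. This is only an illustration: the function names `uniqueness` and `uniqueness_band` are hypothetical, the sketch counts every value (including empty ones) as a value, and it assigns boundary percentages such as exactly 90% to the higher band. The product's own handling of empty values and boundaries may differ.

```python
from collections.abc import Sequence


def uniqueness(values: Sequence[object]) -> float:
    """Return distinct values / total records as a fraction between 0 and 1."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def uniqueness_band(ratio: float) -> str:
    """Map a uniqueness ratio to the bands listed above.

    Assumption of this sketch: a boundary value (e.g. exactly 0.90)
    falls into the higher band.
    """
    if ratio >= 0.90:
        return "very high"
    if ratio >= 0.70:
        return "high"
    if ratio >= 0.30:
        return "moderate"
    if ratio >= 0.05:
        return "low"
    return "very low"


# 10,000 records with 9,800 distinct values, as in the example above.
emails = [f"user{i}@example.com" for i in range(9_800)] + ["user0@example.com"] * 200
ratio = uniqueness(emails)
print(f"{ratio:.0%} -> {uniqueness_band(ratio)}")
```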
Uniqueness and Matching Strategy
Uniqueness directly influences how you should use a field in your matching definitions:
- High-uniqueness fields make strong primary matching criteria. If two records share the same value in a highly unique field, that is strong evidence of a match.
- Low-uniqueness fields are poor standalone matching criteria. Matching on a field where thousands of records share the same value (like "California") would produce an overwhelming number of false positives.
- Moderate-uniqueness fields work well as supplementary criteria. Last name alone produces too many matches, but last name combined with first name and date of birth creates a strong composite criterion.
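One way to see why low-uniqueness fields over-match is to count how many record pairs agree on a given set of fields. The sketch below assumes simple exact-value comparison and uses made-up sample records and a hypothetical helper `candidate_pairs`. It is not a description of how the product's matching engine works.

```python
from collections import Counter
from math import comb

# Made-up sample data for illustration only.
records = [
    {"first": "Ana",  "last": "Smith", "dob": "1980-01-02", "state": "CA"},
    {"first": "Ana",  "last": "Smith", "dob": "1980-01-02", "state": "CA"},
    {"first": "Ben",  "last": "Smith", "dob": "1975-06-30", "state": "CA"},
    {"first": "Cara", "last": "Lee",   "dob": "1990-11-15", "state": "CA"},
    {"first": "Dev",  "last": "Lee",   "dob": "1988-03-09", "state": "NY"},
]


def candidate_pairs(records, fields):
    """Count record pairs that agree exactly on every field in `fields`."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    # A group of n records sharing the same key yields n * (n - 1) / 2 pairs.
    return sum(comb(n, 2) for n in groups.values())


for fields in (["state"], ["last"], ["last", "first", "dob"]):
    print(fields, candidate_pairs(records, fields))
```

In this sample, matching on state alone pairs up most of the records, last name alone still pairs unrelated people, and only the composite of last name, first name, and date of birth narrows the result to the single pair that looks like a genuine duplicate.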
Spotting Potential Duplicates
The uniqueness chart can also hint at existing duplicates in your data:
- If a field that should be unique (like email or account number) shows less than 100% uniqueness, duplicate records likely exist.
- Look at the gap between expected and actual uniqueness. An ID field with 95% uniqueness in a dataset of 10,000 records has only 9,500 distinct values, so roughly 500 records repeat a value already used by another record. Those records are candidates for duplicate entries.
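The gap can be made concrete by listing which values repeat and how many records are surplus (total records minus distinct values). The following is a minimal sketch with a hypothetical helper `duplicate_summary` and made-up account IDs. It assumes the field's values are available as a plain list.

```python
from collections import Counter


def duplicate_summary(values):
    """Return (values that occur more than once, number of surplus records)."""
    counts = Counter(values)
    repeated = {value: n for value, n in counts.items() if n > 1}
    # Surplus = records beyond the first occurrence of each distinct value.
    surplus = len(values) - len(counts)
    return repeated, surplus


# Made-up sample: 8 records, 5 distinct IDs.
account_ids = ["A1", "A2", "A2", "A3", "A4", "A4", "A4", "A5"]
repeated, surplus = duplicate_summary(account_ids)
print(repeated)  # which IDs repeat, and how often
print(surplus)   # how many records repeat an ID already seen
```

Note that the number of surplus records and the number of repeated values are different things: here three records are surplus, but only two IDs are repeated.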
Fields to Watch
Pay special attention to these field types:
- Identifier fields -- Should be close to 100%. Anything less suggests duplicates or data merges from multiple sources.
- Name fields -- Moderate to high uniqueness is expected, since last names repeat more often than full names. Very low uniqueness in a name field may indicate data quality issues (e.g., many records with placeholder names).
- Address fields -- Full addresses should have high uniqueness. Low uniqueness may indicate many records at the same location, which could be legitimate (apartment buildings) or a data issue.
Tip
Combine uniqueness information with the completeness data described at https://help.matchlogic.io/article/225-completeness-filled-vs-null. A field with high uniqueness but low completeness may still be a poor matching criterion because too many records lack values for it.
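As a rough illustration of this tip, the sketch below profiles a single field for both measures. It rests on several assumptions: the helper `profile` is hypothetical, `None` and the empty string are treated as missing, and uniqueness is computed only over filled values. The product's charts may define these measures differently.

```python
def profile(values):
    """Return (completeness, uniqueness among filled values) for one field."""
    # Assumption: None and "" both count as missing.
    filled = [v for v in values if v not in (None, "")]
    completeness = len(filled) / len(values) if values else 0.0
    uniqueness = len(set(filled)) / len(filled) if filled else 0.0
    return completeness, uniqueness


# Made-up sample: 10 records, only 2 with an email address.
emails = ["a@example.com", "b@example.com", None, "", None,
          None, None, None, None, None]
completeness, uniqueness = profile(emails)
print(f"completeness={completeness:.0%}, uniqueness={uniqueness:.0%}")
```

Every filled value in this sample is distinct, yet only two of ten records can take part in a match on this field, which is why it would be a weak criterion on its own.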