(Advanced) Data Profiling Metrics Reference
This reference document describes every metric produced by MatchLogic's Data Profiling module. Understanding these metrics helps you assess data quality, identify columns that need cleansing, and select the best fields for matching.
Completeness
What it measures: The proportion of records in the datasource that have a non-null, non-empty value for this field.
Formula:
completeness = (filled_count / total_records) × 100
A completeness of 100% means every record has a value for this field. A completeness of 60% means 40% of records are missing data.
Why it matters for matching: Fields with low completeness make poor matching criteria because they cannot contribute to the score for records where the value is absent. A field that is 50% complete will fail to score for half of all comparisons involving those records. As a general guideline, fields below 80% completeness should be used with caution as primary matching criteria.
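The completeness formula can be sketched in Python. The function name and the rule that whitespace-only strings count as empty are illustrative assumptions, not MatchLogic's exact implementation:

```python
def completeness(values):
    """Percentage of entries that are non-null and non-empty.

    Sketch only: treats None and whitespace-only strings as missing,
    which may differ from the profiler's exact null handling.
    """
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return 100.0 * filled / len(values)
```

For example, a field with values "Alice", "", None, "Bob" is 50% complete: two of four records carry a usable value.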
Uniqueness Ratio
What it measures: The proportion of distinct values among all non-null values in the field.
Formula:
uniqueness_ratio = distinct_value_count / non_null_record_count
A ratio of 1.0 means every non-null value is different (ideal for natural keys, email addresses, and primary identifiers). A ratio near 0.0 means almost all records share the same value (common in binary or low-cardinality fields like "Active/Inactive" flags or gender codes).
Why it matters for matching: High-uniqueness fields are more discriminating — a match on a field where every value is unique is strong evidence of a true match. A match on a field where 90% of records share the value "Unknown" carries almost no evidential weight. Fields with uniqueness below 0.05 (fewer than 5% of values are distinct) are generally not useful as matching criteria.
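The ratio can be computed as below; treating empty strings as null here is an illustrative assumption:

```python
def uniqueness_ratio(values):
    """Distinct non-null values divided by all non-null values.

    Sketch only: empty and whitespace-only strings are treated as null,
    which may differ from MatchLogic's definition.
    """
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)
```

A field with values "a", "b", "a", "c" has three distinct values among four non-null entries, giving a ratio of 0.75.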
Validity
What it measures: The percentage of values that conform to the detected or expected data type and format pattern for the field.
MatchLogic infers the dominant data type (text, integer, decimal, date, email, phone, etc.) and counts how many values match that type. Values that do not conform are flagged as invalid.
Example: A field where 950 out of 1,000 values are valid dates but 50 contain free-text entries like "unknown" or "N/A" would show 95% validity.
Why it matters: Low validity indicates a field with mixed content or poor data entry discipline. Numeric Range matching on a field with 20% non-numeric values will produce unreliable results. A validity below 90% on a field intended for matching is a signal to apply cleansing rules before running the match.
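A validity check for one expected type can be sketched as follows. The function name and the ISO date format are illustrative; the real profiler infers the dominant type per column rather than taking it as a parameter:

```python
from datetime import datetime

def date_validity(values, fmt="%Y-%m-%d"):
    """Percentage of values that parse as dates in the given format.

    Sketch for a single expected type; the format string is an
    illustrative assumption.
    """
    if not values:
        return 0.0
    valid = 0
    for v in values:
        try:
            datetime.strptime(str(v), fmt)
            valid += 1
        except ValueError:
            pass  # free-text entries like "unknown" or "N/A" fail to parse
    return 100.0 * valid / len(values)
```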
Entropy (Shannon)
What it measures: The information diversity of the field — how evenly values are distributed across distinct categories.
Formula:
H = -Σ p(x) × log₂(p(x))
where p(x) is the relative frequency of each distinct value x. Entropy is measured in bits.
Interpreting entropy values:
- A field where all records have the same value has entropy of 0 bits — no information, useless for matching.
- A binary field (Yes/No) with a 50/50 split has entropy of 1 bit — maximum for two categories.
- A field with 1,000 records and 1,000 distinct values has entropy approaching log₂(1,000) ≈ 10 bits — maximum diversity.
Why it matters: High entropy fields are good matching candidates because they carry more information. A name field with entropy of 8 bits contributes far more evidence per match than a country field with entropy of 2 bits. Entropy is related to, but not the same as, uniqueness ratio: a field can have low uniqueness but high entropy if values are distributed across many moderately common categories.
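The entropy formula translates directly into code; excluding nulls before computing frequencies is an illustrative assumption:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits over the distribution of distinct values.

    Sketch: nulls are dropped before relative frequencies are computed.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A 50/50 Yes/No field yields exactly 1 bit, a constant field yields 0 bits, and 1,000 distinct values yield log₂(1,000) ≈ 9.97 bits, matching the interpretations above.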
Data Type Confidence
What it measures: The percentage of values in the field that match the automatically detected data type.
MatchLogic's profiling engine attempts to infer the most likely data type for each column (integer, decimal, date, boolean, email, phone, free text, etc.) and reports what percentage of values are consistent with that inference.
Example: A column that is 98% numeric with 2% text entries would show data type "Integer" with 98% confidence.
Why it matters: Low data type confidence indicates a mixed-type column. Using such a column as a Numeric Range criterion when it contains significant non-numeric values will produce incorrect scores or errors.
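One plausible inference heuristic is to test candidate types from most to least specific and accept the first that covers enough values. The 90% cutoff and the Integer → Decimal → Text ordering below are illustrative assumptions, not MatchLogic's documented behaviour:

```python
def detect_type(values, threshold=90.0):
    """Report the most specific type covering at least `threshold` percent
    of values, falling back to free text.

    Sketch only: cutoff and type ordering are illustrative assumptions.
    """
    def parses(v, caster):
        try:
            caster(str(v).strip())
            return True
        except (TypeError, ValueError):
            return False

    n = len(values)
    for name, caster in (("Integer", int), ("Decimal", float)):
        pct = 100.0 * sum(parses(v, caster) for v in values) / n
        if pct >= threshold:
            return name, pct
    return "Text", 100.0  # every value trivially matches free text
```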
Outliers (Z-Score)
What it measures: For numeric fields, the count and percentage of values that are statistical outliers — more than 3 standard deviations from the mean.
Formula:
z = (value - mean) / standard_deviation
outlier if |z| > 3
Under a normal distribution, approximately 0.3% of values are expected to fall beyond ±3 standard deviations. A substantially higher outlier rate indicates either genuine extreme values (e.g., a very wealthy customer in an income field) or data entry errors (e.g., a date entered as year 9999).
Why it matters: Outlier values can disproportionately affect numeric matching. An age of 999 entered as a placeholder will never match any real age. Identifying and cleansing outliers before matching improves overall score quality.
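The detection rule can be sketched with the standard library; the function name is an illustrative assumption:

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Values whose Z-score magnitude exceeds the threshold (default 3).

    Sketch: uses the population standard deviation, matching the
    Standard Deviation formula later in this document.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # constant field: no outliers definable
    return [v for v in values if abs((v - mean) / sd) > threshold]
```

A placeholder like 9999 in a field of ages around 10 is flagged immediately, while the legitimate values are not.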
Character Composition
What it measures: For text fields, the average percentage breakdown of character types across all values:
- % Alphabetic: characters A–Z and a–z
- % Numeric: digits 0–9
- % Special characters: punctuation, symbols, and other non-alphanumeric, non-whitespace characters
- % Whitespace: spaces and other whitespace characters
Why it matters: Character composition reveals structural patterns in a field. A field that is 100% numeric probably contains codes or identifiers, not free text. A field with a high proportion of special characters may need a RemoveSpecialCharacters cleansing rule before matching. Unexpected compositions (e.g., alphabetic characters in an ID field) indicate data quality issues.
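A composition breakdown can be sketched as below. Two caveats are assumptions of this sketch: it pools all characters across values rather than averaging per-value percentages, and Python's isalpha()/isdigit() are Unicode-aware, so they accept more than the ASCII A–Z and 0–9 ranges described above:

```python
def char_composition(values):
    """Character-type breakdown across all values, as percentages.

    Sketch: pools characters across values; Unicode-aware classification.
    """
    text = "".join(str(v) for v in values if v is not None)
    if not text:
        return {k: 0.0 for k in ("alphabetic", "numeric", "special", "whitespace")}
    n = len(text)
    alpha = sum(c.isalpha() for c in text)
    digit = sum(c.isdigit() for c in text)
    space = sum(c.isspace() for c in text)
    special = n - alpha - digit - space
    return {
        "alphabetic": 100.0 * alpha / n,
        "numeric": 100.0 * digit / n,
        "special": 100.0 * special / n,
        "whitespace": 100.0 * space / n,
    }
```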
Min / Max / Mean / Median / Mode
What they measure: Standard descriptive statistics for numeric fields:
- Min: the smallest numeric value in the field
- Max: the largest numeric value in the field
- Mean: the arithmetic average
- Median: the middle value when all values are sorted; less sensitive to outliers than the mean
- Mode: the most frequently occurring value
Why they matter: These statistics provide a sanity check on numeric fields. A negative minimum age, a maximum year of birth in the future, or a mode of 0 in an amount field all indicate data quality problems. They also inform the appropriate UpperLimit and LowerLimit settings when configuring Numeric Range matching.
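All five statistics are available in Python's standard library; the function name is an illustrative assumption:

```python
import statistics

def describe(values):
    """Min / Max / Mean / Median / Mode for a numeric field (sketch).

    Note: statistics.mode() returns the first of several equally common
    values; a profiler may instead report all ties.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }
```

For the values 1, 2, 2, 9 this yields min 1, max 9, mean 3.5, median 2, mode 2 — note how the median and mode resist the pull of the outlying 9 while the mean does not.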
Standard Deviation
What it measures: How widely values are spread around the mean of a numeric field. A high standard deviation means values are widely distributed; a low standard deviation means most values are clustered close to the mean.
Formula:
σ = √[ Σ(value - mean)² / n ]
Why it matters: Standard deviation is used to calculate Z-scores for outlier detection (see above) and provides context for interpreting the mean. It also informs the choice of absolute vs. percentage tolerance in Numeric Range matching: if the standard deviation is large relative to the mean, a percentage-based tolerance often makes more sense than a fixed absolute tolerance.
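The formula above is the population standard deviation (divide by n, not n − 1), which can be sketched as:

```python
import math

def population_std(values):
    """Population standard deviation, matching the formula above
    (sum of squared deviations divided by n, then square root)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
```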
Pattern Frequency
What it measures: The top structural patterns discovered in the field's values, where each alphabetic character is replaced by A and each digit by 9, while punctuation and whitespace are preserved literally (as the examples below show), along with the count and percentage of values that match each pattern.
Examples:
- A Social Security Number field might show pattern "999-99-9999" covering 97% of records
- A phone number field might show "(999) 999-9999" (60%) and "999-999-9999" (35%) as the top two patterns
- A name field might show pattern "AAAAAAA AAAAAAAAAA" (first and last name) as the dominant pattern
Why it matters: Pattern frequency analysis reveals whether a field has a consistent structure (good for exact or fuzzy matching) or highly variable structure (may need cleansing). Discovering that a phone number field has 12 different patterns suggests a normalization cleansing step is needed before matching. It also enables the creation of targeted regex validation rules in the Data Profiling module to enforce the expected pattern going forward.
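Pattern extraction and ranking can be sketched as below. Keeping punctuation and whitespace literal matches the SSN and phone examples above; tools vary on this point, and the function names are illustrative:

```python
from collections import Counter

def value_pattern(value):
    """Map each character to its type symbol: alpha -> A, digit -> 9.

    Sketch: punctuation and whitespace are kept literally, matching the
    SSN/phone examples; other conventions substitute a symbol for them.
    """
    return "".join(
        "A" if c.isalpha() else "9" if c.isdigit() else c
        for c in value
    )

def pattern_frequency(values, top=3):
    """Return the most common patterns as (pattern, count, percent) tuples."""
    counts = Counter(value_pattern(str(v)) for v in values)
    total = sum(counts.values())
    return [(p, c, 100.0 * c / total) for p, c in counts.most_common(top)]
```

Running this over "123-45-6789", "987-65-4321", and "unknown" ranks "999-99-9999" first, covering two of the three values.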