Entropy and Information Content

Entropy measures how diverse and unpredictable the values in a field are. Borrowed from information theory, this metric tells you how much distinguishing power each field carries, which directly affects its usefulness in record matching.

[Figure: entropy chart showing entropy values per field]

What Entropy Means

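Entropy is measured in bits. A field whose records are spread evenly across N distinct values has an entropy of log2(N) bits, so each additional bit corresponds to twice as many equally likely values. A higher entropy value means the field contains more diverse, varied data. A lower entropy value means the field is more repetitive and predictable.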

  • Zero entropy -- Every record has the same value. This field carries no information and is useless for matching.
  • Low entropy (0-2 bits) -- Very few distinct values, or a distribution dominated by one or two values. Examples: a gender field with two values, a status field with three possible states.
  • Moderate entropy (2-5 bits) -- A reasonable range of distinct values. Examples: city names, state codes, department names.
  • High entropy (5-10 bits) -- Many distinct values with relatively even distribution. Examples: last names, street names, company names.
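  • Very high entropy (10+ bits) -- Extremely diverse values, approaching uniqueness. Examples: email addresses, full addresses, timestamps. Entropy cannot exceed log2 of the record count, so a field can only reach this range in datasets with more than about 1,000 records.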
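
To make the scale above concrete, here is a minimal sketch of how entropy in bits can be computed for one field using the standard Shannon formula. The function name shannon_entropy and the sample values are illustrative assumptions, not part of the product, and the product's own calculation may differ in details such as how empty values are handled.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of the values in one field."""
    values = list(values)
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # Each value with relative frequency p = c / total adds p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Illustrative sample fields (hypothetical data).
constant = ["US"] * 8                         # every record identical
two_way = ["M", "F"] * 4                      # two equally common values
all_distinct = [f"id{i}" for i in range(8)]   # eight unique values

print(shannon_entropy(constant))      # 0.0
print(shannon_entropy(two_way))       # 1.0
print(shannon_entropy(all_distinct))  # 3.0
```

The last case shows why small datasets cannot reach the higher bands: eight unique values yield exactly log2(8) = 3 bits, no matter how different the values look.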

Entropy and Matching

Fields with higher entropy are generally better candidates for matching because they provide more discriminating power:

  • A high-entropy field like an email address can strongly confirm or deny a match between two records. If two records share the same email, that is powerful evidence.
  • A low-entropy field like a country code (where 90% of records might be "US") provides very little matching value on its own -- matching on country alone would group nearly all records together.
  • Fields with moderate entropy work best as secondary criteria in combination with other fields.
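
The gap between these cases can be worked out directly. The sketch below assumes a country field split 90% "US" and 10% one other value, and an email field with 100 unique values; both distributions and the helper name entropy_bits are hypothetical and chosen only for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits for a list of value probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Assumed distributions, for illustration only.
country = entropy_bits([0.9, 0.1])       # 90% "US", 10% one other value
email = entropy_bits([1 / 100] * 100)    # 100 records, all unique

print(f"country: {country:.2f} bits")  # country: 0.47 bits
print(f"email:   {email:.2f} bits")    # email:   6.64 bits
```

Under these assumptions the country field carries less than half a bit, which is why it adds so little to a match on its own.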

When Entropy Signals Problems

Entropy values can also flag data quality issues:

  • Unexpectedly high entropy in name fields -- If a first name field has unusually high entropy, it might contain non-name data mixed in (full names, nicknames, company names, or random text). Investigate the actual values.
  • Unexpectedly low entropy in address fields -- If address fields show very low entropy, many records share the same address, which could indicate duplicate entries or a placeholder address entered in place of the real one.
  • Near-zero entropy in fields that should vary -- May indicate a data import problem where a single default value was applied to most records.
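
When a field that should vary shows near-zero entropy, a quick follow-up is to find the single value that dominates it. The sketch below is one way to do that outside the product. The function name dominant_value, the sample city data, and the 95% cutoff are all assumptions made for illustration.

```python
from collections import Counter

def dominant_value(values):
    """Return the most common value in a field and its share of all records."""
    values = list(values)
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Hypothetical city field where an import filled most rows with a default.
city = ["UNKNOWN"] * 97 + ["Austin", "Dallas", "Houston"]

SHARE_THRESHOLD = 0.95  # assumed cutoff; tune it for your data

value, share = dominant_value(city)
if share >= SHARE_THRESHOLD:
    print(f"{value!r} fills {share:.0%} of records; check the import")
```

Seeing the dominant value itself usually makes it obvious whether it is a legitimate common value or a default applied during import.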

Comparing Entropy Across Fields

Review entropy values for all fields together to prioritize which fields to include in your matching definitions. Fields with higher entropy should generally receive higher weights in your matching strategy, as they contribute more discriminating power.
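
As a rough sketch of this comparison, the example below ranks three fields by entropy and derives a starting weight for each. The field names and values are hypothetical, and weighting each field by its share of the total entropy is an assumed scheme for illustration, not how the product assigns weights.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of the values in one field."""
    values = list(values)
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical profile of eight records, one list of values per field.
fields = {
    "country": ["US"] * 4 + ["CA"] * 4,
    "last_name": ["Smith", "Smith", "Jones", "Jones", "Lee", "Lee", "Kim", "Kim"],
    "email": [f"user{i}@example.com" for i in range(8)],
}

entropies = {name: shannon_entropy(vals) for name, vals in fields.items()}
total_bits = sum(entropies.values())

# Assumed scheme: each field's weight is its share of the total entropy.
weights = {
    name: (h / total_bits if total_bits else 0.0)
    for name, h in entropies.items()
}

# Print fields from highest to lowest entropy.
ranked = sorted(entropies, key=entropies.get, reverse=True)
for name in ranked:
    print(f"{name:<10} {entropies[name]:.2f} bits  weight {weights[name]:.2f}")
```

With this sample data, email ranks first at 3 bits and half the total weight, followed by last_name and then country. Treat such weights as a starting point and adjust them against known matches in your own data.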

Tip

Combine entropy analysis with the uniqueness and duplicate indicators (https://help.matchlogic.io/article/227-uniqueness-and-duplicate-indicators). High uniqueness and high entropy together indicate a strong identifier field. High uniqueness but low entropy might indicate structured codes with little variation in parts of the value.