Character Composition

The Character Composition chart provides a breakdown of the types of characters found in each field. This analysis reveals hidden data quality issues that may not be apparent from looking at a few sample records, such as non-printable characters, unexpected numeric content in text fields, or punctuation that could interfere with matching.

Character Categories

For each field, values are analyzed and characters are classified into the following categories:

  • Alphabetic -- Letters, covering both A-Z and a-z and accented or international characters
  • Numeric -- Digits 0-9
  • Punctuation -- Periods, commas, hyphens, apostrophes, and other standard punctuation marks
  • Spaces -- Standard space characters, tabs, and other whitespace
  • Non-printable -- Control characters, zero-width characters, and other invisible characters that do not render visually but exist in the data

The chart displays the proportion of each character type across all values in a field, giving you a profile of that field's textual structure.
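
To make the classification concrete, here is a minimal Python sketch of how such a profile could be computed. It is not the product's implementation: the function names and the mapping from Unicode general categories to the five categories above are assumptions made for illustration. In particular, it counts any Unicode number as numeric (broader than 0-9) and treats symbols such as "$" as punctuation.

```python
import unicodedata
from collections import Counter

CATEGORIES = ("alphabetic", "numeric", "punctuation", "spaces", "non_printable")

def classify_char(ch):
    """Map one character to a category (hypothetical mapping, for illustration)."""
    if ch.isspace():                      # spaces, tabs, and other whitespace
        return "spaces"
    major = unicodedata.category(ch)[0]   # first letter of the Unicode general category
    if major == "C":                      # control, format (zero-width), unassigned
        return "non_printable"
    if major in "LM":                     # letters and combining marks (accents)
        return "alphabetic"
    if major == "N":                      # assumption: any Unicode number, not only 0-9
        return "numeric"
    return "punctuation"                  # assumption: punctuation and symbols together

def composition(values):
    """Return the share of each category across all characters in a field."""
    counts = Counter(classify_char(ch) for value in values for ch in value)
    total = sum(counts.values())
    if total == 0:
        return {cat: 0.0 for cat in CATEGORIES}
    return {cat: counts[cat] / total for cat in CATEGORIES}

# Example: a name field with a stray digit and a zero-width space
profile = composition(["Ann1", "Bo b\u200b"])
```

Checking whitespace first matters here, because tabs are control characters in Unicode terms but belong to the Spaces category in this chart.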

What Character Composition Reveals

Unexpected character types are a strong indicator of data quality problems. Here are the patterns to watch for:

  • Numbers in a name field -- A first name or last name field should be almost entirely alphabetic. If you see significant numeric content, records may contain account numbers, IDs, or other non-name data mixed in.
  • Letters in a numeric field -- A field expected to contain only numbers (zip codes, phone numbers, account numbers) should have minimal alphabetic content. Letters may indicate formatting artifacts or data entry errors.
  • High punctuation in address fields -- Some punctuation is normal (periods in "St.", hyphens in suite numbers), but excessive punctuation may indicate concatenated fields or encoding issues.
  • Non-printable characters -- Any field showing non-printable characters needs cleansing. These invisible characters cause exact matches to fail even when values appear identical on screen. Common sources include copy-paste from word processors, web scraping, and legacy system exports.
  • Excessive spaces -- High space content relative to other characters may indicate extra whitespace, tab characters, or padding from fixed-width data sources.
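
A few of the patterns above can be checked programmatically. The Python sketch below is illustrative only: the field kinds ("name", "numeric"), the 2% tolerance, and the function names are hypothetical and do not correspond to any product setting.

```python
def share(values, predicate):
    """Fraction of all characters in a field that satisfy the predicate."""
    chars = [ch for value in values for ch in value]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if predicate(ch)) / len(chars)

def is_hidden(ch):
    # Not printable and not ordinary whitespace: control and zero-width characters
    return not ch.isprintable() and not ch.isspace()

def check_field(values, expected, tolerance=0.02):
    """Return warnings for a field. `expected` is "name" or "numeric" (assumed kinds)."""
    warnings = []
    if expected == "name" and share(values, str.isdigit) > tolerance:
        warnings.append("numbers in a name field")
    if expected == "numeric" and share(values, str.isalpha) > tolerance:
        warnings.append("letters in a numeric field")
    if share(values, is_hidden) > 0:      # any hidden character is worth flagging
        warnings.append("non-printable characters")
    return warnings

name_warnings = check_field(["Mary", "J0hn 123"], "name")
zip_warnings = check_field(["12345", "9021O"], "numeric")   # letter O instead of zero
```

A small tolerance keeps occasional legitimate values from triggering a warning, while non-printable characters are flagged at any level.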

Impact on Matching

Character composition issues directly affect match quality:

  1. Non-printable characters cause exact string comparisons to fail. Two records that look identical will not match if one contains hidden characters.
  2. Extra spaces change string similarity scores. "John Smith" and "John  Smith" (double space) are not an exact match and will not score 100%.
  3. Mixed character types reduce the effectiveness of phonetic matching algorithms, which expect primarily alphabetic input.
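
The first two effects are easy to reproduce. The short Python sketch below uses the standard library's difflib as a stand-in similarity measure. That choice is an assumption for illustration and is not the scoring algorithm used by the product.

```python
import difflib

clean = "John Smith"
hidden = "John\u200b Smith"    # zero-width space after "John", invisible on screen
double = "John  Smith"         # two spaces between the names

# Exact comparison fails in both cases, although the values look alike
hidden_matches = (clean == hidden)
double_matches = (clean == double)

# A generic similarity ratio also drops below 1.0 for the double-space variant
similarity = difflib.SequenceMatcher(None, clean, double).ratio()

print(hidden_matches, double_matches, similarity < 1.0)
```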

Recommended Cleansing Actions

Based on character composition findings, consider these cleansing operations:

  • Use the operations described in Removing Characters by Type to strip numbers from name fields or letters from phone fields.
  • Apply Whitespace Cleaning to normalize spaces and remove extra whitespace.
  • Use the Remove Non-Printable operation to eliminate invisible characters.
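
For readers who want to see what these operations amount to, here is a rough Python equivalent. The function names are hypothetical and the logic is a sketch of the general technique, not the behavior of the product's cleansing operations.

```python
import unicodedata

def remove_non_printable(value):
    """Drop control and zero-width characters, but keep ordinary whitespace."""
    return "".join(
        ch for ch in value
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )

def normalize_whitespace(value):
    """Trim the ends and collapse runs of whitespace (including tabs) to one space."""
    return " ".join(value.split())

def remove_digits(value):
    """Strip numeric characters, for example from a name field."""
    return "".join(ch for ch in value if not ch.isdigit())

def remove_letters(value):
    """Strip alphabetic characters, for example from a phone field."""
    return "".join(ch for ch in value if not ch.isalpha())

# Remove hidden characters first, then normalize the remaining whitespace
cleaned = normalize_whitespace(remove_non_printable("John\u200b  Smith\t"))
```

Removing hidden characters before normalizing whitespace avoids leaving a double space where an invisible character sat between two spaces.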

Tip

Fields you plan to use for matching should ideally have a clean character composition. Run data cleansing to normalize character content, then re-profile to verify the improvement before proceeding to match configuration.