Whitespace Cleaning

Whitespace Cleaning

Whitespace issues are among the most common and insidious data quality problems. Extra spaces, leading or trailing whitespace, and tab characters are invisible to the naked eye but cause exact matches to fail. Two records that look identical on screen may not match because one has a trailing space. Whitespace cleaning should be one of the first operations in any cleansing workflow.

Available Whitespace Nodes

MatchLogic provides five whitespace-related operations:

  • Trim

    Removes both leading and trailing whitespace from values. " John Smith " becomes "John Smith". This is the most commonly used whitespace operation and should be applied to virtually every field before matching.

  • Remove Extra Whitespace

    Collapses multiple consecutive spaces into a single space. "John Smith" becomes "John Smith". Does not affect leading or trailing spaces -- combine with Trim for complete cleanup.

  • Remove Whitespace

    Removes all spaces from the value entirely. "John Smith" becomes "JohnSmith". Use with caution -- this is appropriate for fields like phone numbers or codes where spaces should not exist, but not for name or address fields where spaces separate words.

  • Remove Leading Whitespace

    Removes spaces only from the beginning of the value. " John Smith" becomes "John Smith" but "John Smith " retains the trailing space. Use when you specifically need to preserve trailing whitespace (rare).

  • Remove Trailing Whitespace

    Removes spaces only from the end of the value. "John Smith " becomes "John Smith" but " John Smith" retains the leading space. Use when you specifically need to preserve leading whitespace (rare).

Recommended Approach

For most datasources, apply these two operations to every field you plan to use for matching:

  1. Remove Extra Whitespace -- Normalizes internal spacing.
  2. Trim -- Strips leading and trailing spaces.

This combination handles the vast majority of whitespace issues. Apply them early in your workflow, before case conversion and other transformations.

Why Whitespace Matters for Matching

Consider these examples of how whitespace affects matching:

  • "John Smith" and "John Smith " (trailing space) -- These will not match exactly. The trailing space changes the string comparison.
  • "123 Main St" and "123 Main St" (double spaces) -- Fuzzy matching may still score these highly, but exact matching will fail, and even fuzzy scores will be lower than they should be.
  • "(555) 123-4567" and " (555) 123-4567" (leading space) -- The leading space reduces match scores across all matching types.

Tip

Apply Trim as the very first operation on every field in your workflow. It is fast, safe, and eliminates the most common invisible data quality issue. There is virtually no scenario where leading or trailing whitespace in a data field is intentional.

Important

The Remove Whitespace node removes ALL spaces, including spaces between words. Only use it on fields where spaces are unwanted (phone numbers, postal codes, account numbers). Do not apply it to name or address fields, or "John Smith" will become "JohnSmith".