Numeric Statistical Summary

The Numeric View within the Detailed Analysis tab focuses exclusively on fields that contain numeric data. It provides a set of classical statistical measures that help you understand the distribution, central tendency, and spread of values in each numeric field. These insights are especially useful for identifying data anomalies and understanding how numeric fields might behave during matching.

Advanced Feature

The Numeric Statistical Summary is most useful when your datasource contains fields with numeric data such as ages, amounts, scores, or counts. If your data is primarily text-based (names, addresses), the Standard View described at https://help.matchlogic.io/article/348-detailed-analysis-view will be more relevant.

Accessing the Numeric View

  1. Navigate to the Data Profiling page and ensure a profile has been generated for your datasource.
  2. Switch to the Detailed Analysis tab.
  3. Select Numeric View from the view toggle at the top of the table.

Only fields detected as numeric types will appear in this view. Text, date, and other non-numeric fields are filtered out automatically.

Statistical Columns

The Numeric View table displays the following columns for each numeric field:

  • Field -- The column name from your datasource.
  • Type -- The detected numeric subtype (integer, decimal, etc.).
  • Min -- The smallest value found in the field. Useful for spotting negative values, zeros, or unexpectedly low numbers that may indicate errors.
  • Max -- The largest value found. Extremely high maximums can indicate outliers or data entry mistakes (e.g., an age of 999).
  • Mean -- The arithmetic average of all non-null values. Provides a general sense of the typical value in the field.
  • Median -- The middle value when all values are sorted. Unlike the mean, the median is largely unaffected by a few extreme outliers, making it a more robust measure of the "typical" value.
  • Mode -- The most frequently occurring value. A mode that appears significantly more often than other values may indicate default entries or data quality issues.
  • Semantic Type -- The inferred meaning of the field (e.g., age, currency, percentage, count) based on value ranges and patterns.
  • Anomalies -- The number of statistically anomalous values detected in the field.
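To make the definitions above concrete, the following is a minimal sketch of how Min, Max, Mean, Median, and Mode can be computed for one field using Python's standard library. It is an illustration only, not the product's implementation: the summarize function and the sample ages list are hypothetical, and picking the first value when several tie for most frequent is an assumption.

```python
from statistics import mean, median, multimode

def summarize(values):
    """Return Min, Max, Mean, Median, and Mode for one field, ignoring nulls."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to summarize
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        # Assumption: when several values tie, report the first one encountered.
        "mode": multimode(present)[0],
    }

# Hypothetical age field with a null, repeated zeros, and one suspicious 999.
ages = [34, 29, 0, 41, 0, 999, 37, None, 0, 52]
summary = summarize(ages)
print(summary["min"], summary["max"])   # 0 999
print(round(summary["mean"], 1))        # 132.4
print(summary["median"])                # 34
print(summary["mode"])                  # 0
```

In this sample the single value 999 pulls the mean to about 132 while the median stays at 34, and the mode of 0 sits at the minimum, which is the kind of pattern the columns are meant to surface.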

Interpreting the Statistics

Use these statistical measures together to build a picture of each field's distribution:

  • Large gap between Min and Max -- Indicates a wide range of values, which may include outliers. Check the anomaly count for confirmation.
  • Mean significantly different from Median -- Suggests the data is skewed. A mean much higher than the median indicates a few very large values pulling the average up (right skew). A mean much lower than the median indicates a few very small values pulling it down (left skew).
  • Mode equals Min or Max -- The most common value is at an extreme end of the range, which may indicate placeholder or default values (e.g., a mode of 0 in an age field).
  • All statistics nearly identical -- Very low variance. The field may contain mostly the same value and would contribute little to matching discrimination.
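The checks in this list can be sketched as a small set of rules. The example below is a rough illustration under stated assumptions: the interpret function is hypothetical, the skew_ratio threshold (the mean-to-median gap as a share of the Min-to-Max range) is an arbitrary choice rather than a product setting, and only the exactly constant case is treated as "all statistics identical".

```python
def interpret(stats, skew_ratio=0.05):
    """Return plain-language observations for one field's statistics.

    skew_ratio is a hypothetical threshold: the gap between mean and median,
    divided by the Min-to-Max range, above which the field is called skewed.
    """
    value_range = stats["max"] - stats["min"]
    if value_range == 0:
        # Every value is the same, so the field cannot separate records.
        return ["constant field: little matching discrimination"]

    notes = []
    gap = (stats["mean"] - stats["median"]) / value_range
    if gap > skew_ratio:
        notes.append("mean well above median: a few large values pull the average up")
    elif gap < -skew_ratio:
        notes.append("mean well below median: a few small values pull the average down")

    if stats["mode"] in (stats["min"], stats["max"]):
        notes.append("mode at an extreme: possible placeholder or default value")
    return notes

# Hypothetical statistics for an age field.
age_stats = {"min": 0, "max": 999, "mean": 132.4, "median": 34, "mode": 0}
for note in interpret(age_stats):
    print(note)
```

For the sample statistics, both the skew rule and the mode-at-extreme rule fire, which would be a prompt to check the anomaly count and drill into the affected records.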

Click-to-Filter Behavior

Just like the Standard View, clicking on any cell in the Numeric View table filters the data preview panel below to show related records. For example:

  • Click the Min value to see all records with the minimum value for that field.
  • Click the Anomalies count to see the records flagged as statistical outliers.
  • Click the Mode value to see all records sharing the most common value.

This drill-down capability lets you quickly investigate suspicious statistics without leaving the profiling interface.

Using Numeric Insights for Matching

Numeric profiling results inform your matching strategy in several ways:

  1. Range-based matching -- Fields whose values spread across a wide range without heavy skew (similar mean and median, low anomaly count) are good candidates for numeric range matching, where two records match if their values fall within a configurable tolerance.
  2. Identifying poor candidates -- Fields where the mode dominates (e.g., 80% of records have the same value) provide little matching discrimination and should receive low weight or be excluded.
  3. Spotting data issues -- A Max of 9999 in an age field or a Min of -1 in a quantity field strongly suggests data errors that should be addressed through data cleansing before matching.
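The first two points can be illustrated with a short sketch. It assumes that range matching compares the absolute difference of two values against a tolerance and that mode dominance is measured as the share of non-null values equal to the most common value. Both functions and the sample data are hypothetical and do not describe how the product implements matching.

```python
from collections import Counter

def within_tolerance(a, b, tolerance):
    """True when two numeric values differ by no more than the tolerance."""
    return abs(a - b) <= tolerance

def mode_share(values):
    """Fraction of non-null values that equal the most common value."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    _, count = Counter(present).most_common(1)[0]
    return count / len(present)

# Range-based matching: ages 34 and 36 match with a tolerance of 2; 34 and 41 do not.
print(within_tolerance(34, 36, tolerance=2))  # True
print(within_tolerance(34, 41, tolerance=2))  # False

# Mode dominance: 8 of 10 quantities are 1, so the field discriminates poorly.
quantities = [1, 1, 1, 1, 1, 1, 1, 1, 3, 7]
print(mode_share(quantities))  # 0.8
```

A field with a mode share this high would add little evidence to a match decision, so it is a reasonable candidate for a low weight or for exclusion.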

Tip

Compare the Mean and Median values for each field. When they are close together, the distribution is likely to be roughly symmetrical. When they diverge significantly, investigate the outliers using https://help.matchlogic.io/article/230-outlier-detection to determine whether extreme values need cleansing.