Glossary of Terms
This glossary defines the key terms and concepts used throughout MatchLogic. Terms are listed alphabetically for quick reference.
- Blocking
- A performance optimization that pre-groups records into candidate pairs before detailed comparison begins. Rather than comparing every record against every other record (an O(n²) operation), blocking narrows the field to only those records that share a common key value — such as the first three letters of a surname or a ZIP code. Only pairs within the same block are compared, dramatically reducing the number of comparisons needed for large datasets while retaining most true matches.
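As a rough illustration of the blocking idea described above, the sketch below groups records by a key and only generates pairs inside each block. The function names, the record layout, and the three-letter surname key are assumptions made for this example, not MatchLogic's actual implementation.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_func):
    """Group records by a blocking key and yield pairs only within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    for block in blocks.values():
        # Pairs are generated per block, never across blocks.
        yield from combinations(block, 2)


def surname_prefix(record):
    """Hypothetical blocking key: first three letters of the surname, upper-cased."""
    return record["surname"][:3].upper()


records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smithe"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Smyth"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, surname_prefix)]
all_pairs = list(combinations(records, 2))
```

With this sample data, blocking yields one candidate pair instead of six. It also shows the trade-off: "Smyth" lands in a different block from "Smith", so that potential match is never compared.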
- Cleansing workflow
- A visual, node-based pipeline of data transformation rules applied to a datasource before matching begins. Built in the Data Cleansing module using a drag-and-drop canvas (powered by React Flow), a workflow chains together rule nodes — such as Trim, Replace, UpperCase, or WordSmith dictionary lookups — to standardize data in a repeatable, auditable way. Workflows are saved per datasource and can be re-applied after data refreshes.
- Confidence band
- A labeled tier applied to a match score range to help interpret result quality at a glance. MatchLogic uses six bands: Excellent (95–100), High (80–94), Good (60–79), Moderate (40–59), Low (20–39), and Poor (below 20). Confidence bands appear as color-coded badges in match results tables and summary reports.
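The six bands listed above can be expressed as a simple lookup. This is a minimal sketch: the function name is hypothetical, and the treatment of fractional scores (for example, 94.5 falling into High) is an assumption, since the ranges above are stated only as whole numbers.

```python
def confidence_band(score):
    """Return the band label for a match score between 0 and 100."""
    # Lower bounds taken from the band definitions above, highest first.
    bands = [
        (95, "Excellent"),
        (80, "High"),
        (60, "Good"),
        (40, "Moderate"),
        (20, "Low"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Poor"
```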
- Criteria
- A single comparison rule within a match definition. Each criterion specifies the field pair to compare (via field mappings), the match type (Exact, Fuzzy, Phonetic, or Numeric Range), the data type (Text, Number, or Phonetic), and a weight (0–100) that controls how much the criterion contributes to the overall match score. A match definition may contain multiple criteria, all of which are evaluated and combined to produce a final score.
- Cross-reference export
- An export action that outputs a mapping table showing which records from different datasources refer to the same real-world entity. Instead of exporting merged records, a cross-reference export produces a file containing the identifiers of matched record pairs or groups — useful for loading into downstream systems that need to know which source records are linked without altering the source data itself.
- Data profiling
- The automated analysis of a datasource's statistical and structural properties. MatchLogic's Data Profiling module examines each column and reports metrics including completeness (fill rate), data type confidence, uniqueness ratio, validity, entropy, character composition, outliers, and discovered patterns. Profiling results help users understand data quality before configuring match strategies and identify fields that may need cleansing.
- Datasource
- An imported dataset used as input to the MatchLogic pipeline. A datasource can originate from a CSV file, an Excel spreadsheet, or a database table (SQL Server, MySQL, PostgreSQL, or Snowflake). Each datasource belongs to a project and is assigned a unique identifier. Multiple datasources can exist within a single project, and matching can be performed between any pair of datasources or within a single datasource.
- Deduplication
- The process of identifying and consolidating duplicate records within or across datasources. In MatchLogic, deduplication is achieved by running matching on a single datasource (within-source matching), which compares every record to every other record in that same dataset to find pairs that represent the same real-world entity. The results can then be reviewed and merged using the Merge and Survivorship module.
- Entropy
- A measure of information diversity in a field, calculated using the Shannon entropy formula: H = −Σ p(x) × log₂(p(x)), where p(x) is the relative frequency of each distinct value. Low entropy means a field is dominated by a small number of values: a field where every record says "Unknown" has an entropy of 0, and a binary Yes/No field can reach at most 1 bit, which happens when the two values are evenly split. High entropy indicates many distinct values with a relatively even distribution. Entropy is reported in the Data Profiling module as a measure of field richness and match suitability.
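The formula above can be checked with a few lines of Python. This is a sketch of the standard Shannon entropy calculation, not the Data Profiling module's code; the function name is an assumption. The sum is written as p × log₂(1/p), which is algebraically the same as −p × log₂(p).

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the distribution of values in a column."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # p = count / total, so log2(1 / p) = log2(total / count).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


even_binary = shannon_entropy(["Yes", "No", "Yes", "No"])
constant = shannon_entropy(["Unknown"] * 5)
four_values = shannon_entropy(["A", "B", "C", "D"])
```

An evenly split Yes/No column gives 1 bit, a constant column gives 0, and four equally frequent values give 2 bits.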
- Field mapping
- An association between a field in one datasource and a semantically equivalent field in another datasource, established in the Match Definitions module. Field mappings tell MatchLogic which columns represent the same concept across sources (e.g., "FirstName" in Source A maps to "FIRST_NM" in Source B). They are a prerequisite for defining match criteria and can be created manually or generated automatically using the auto-mapping feature.
- Golden record
- The single, authoritative master record produced by the Merge and Survivorship module after processing a duplicate group. A golden record combines the best field values from all records in the group according to master record rules (which record "wins") and overwrite rules (which record provides each individual field value). The golden record represents the most complete and accurate representation of the real-world entity that the group of duplicates describes.
- Group
- A cluster of records determined to be duplicates of each other, formed by taking the transitive closure of matched pairs. If record A matches record B, and record B matches record C, then A, B, and C form a single group — even if A and C were never directly compared. Groups are the primary unit of review in the Match Results module when using Group view mode, and are the input to the Merge and Survivorship module.
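One common way to compute a transitive closure over matched pairs is a union-find (disjoint set) structure. The sketch below uses that technique as an illustration; whether MatchLogic builds groups this way is not stated here, and the function name is hypothetical.

```python
def build_groups(pairs):
    """Merge matched pairs into groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


matched_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = build_groups(matched_pairs)
```

Here A and C end up in the same group even though no (A, C) pair was supplied.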
- Lexicographic ordering
- Alphabetical ordering of text values, used by Max and Min survivorship operations when applied to non-numeric fields. Under lexicographic ordering, "Zebra" is greater than "Apple" because "Z" comes after "A" in the alphabet. When a Max operation is applied to a text field, MatchLogic selects the value that comes last alphabetically across all records in the group. This is distinct from numeric ordering, where 100 is greater than 9; lexicographically, "100" comes before "9" because "1" < "9".
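The difference between the two orderings is easy to reproduce in Python, whose string comparison is also lexicographic. One caveat: Python compares by Unicode code point, so upper-case letters sort before lower-case ones. Whether MatchLogic's comparison is case-sensitive is not covered here, so treat this only as an illustration of the general behavior.

```python
text_values = ["Apple", "Zebra", "Mango"]
numeric_strings = ["100", "9", "25"]

# Text comparison goes character by character.
text_max = max(text_values)

# The same rule applied to digits stored as text: "9" beats "100".
lexicographic_max = max(numeric_strings)

# Converting to numbers first restores numeric ordering.
numeric_max = max(numeric_strings, key=int)
```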
- Master record
- The representative record selected from a duplicate group to serve as the basis for the golden record. The Master Record Rules in the Merge and Survivorship module determine which record in a group earns "master" status (e.g., the record with the longest value in a key field, or the record from a preferred datasource). Once a master is selected, overwrite rules can further refine individual field values by pulling from other records in the group.
- Match definition
- A named set of criteria (field comparisons, match types, and weights) that determines how records are scored for similarity. A project can have multiple match definitions, each representing a different matching strategy. When multiple definitions are active, they operate with OR logic — the definition that yields the highest score for a given pair is the one used. Match definitions are configured in the Match Definitions module.
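The OR logic described above amounts to taking the maximum over per-definition scores. In this minimal sketch, the definition names and scores are made-up sample data and the function name is hypothetical.

```python
def winning_definition(scores_by_definition):
    """Return the (name, score) of the definition with the highest score."""
    name = max(scores_by_definition, key=scores_by_definition.get)
    return name, scores_by_definition[name]


# Hypothetical scores that three definitions produced for one record pair.
pair_scores = {"Name and DOB": 72.5, "Email only": 91.0, "Address": 40.0}
winner = winning_definition(pair_scores)
```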
- Match score
- A numeric value between 0 and 100 reflecting how similar two records are according to the active match definitions. The score is calculated as a weighted average of individual field-level scores across all criteria in the winning definition. A score of 100 indicates a perfect match across all criteria; a score of 0 indicates no similarity. Scores are used to filter results via the threshold setting and are displayed in match results tables with confidence band labels.
- Merge and Survivorship
- The MatchLogic module responsible for producing golden records from duplicate groups identified by matching. It operates in two stages: (1) Master Record Rules determine which record in each group is designated as the master; (2) Overwrite Rules refine individual field values by optionally replacing the master's field values with values from other records in the group, subject to configurable conditions. The output is a set of golden records ready for export.
- Overwrite rule
- A survivorship rule that refines a specific field in the master record by pulling a value from another record in the group. Overwrite rules are applied field by field and support operations such as Longest, Shortest, Max, Min, MostPopular, FromMaster, FromBestRecord, and MergeAllValues. Each rule can also specify conditions that must be met before the overwrite is applied (e.g., only overwrite if the master's field is empty, or only overwrite with data from a specific datasource).
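To make the mechanics concrete, the sketch below applies two of the operations named above (Longest and MostPopular) to a single field, with an optional "only if the master's field is empty" condition. The function names, the dictionary-based record layout, and the way conditions are passed are all assumptions for illustration, not MatchLogic's rule engine.

```python
from collections import Counter


def longest(values):
    """Longest operation: pick the value with the most characters."""
    return max(values, key=len)


def most_popular(values):
    """MostPopular operation: pick the most frequent value."""
    return Counter(values).most_common(1)[0][0]


def apply_overwrite(master, group, field, operation, only_if_empty=False):
    """Return a copy of master with one field refined from the group."""
    result = dict(master)
    if only_if_empty and master.get(field):
        return result  # condition not met, keep the master's value
    candidates = [r[field] for r in group if r.get(field)]
    if candidates:
        result[field] = operation(candidates)
    return result


master = {"name": "J. Smith", "phone": ""}
group = [
    master,
    {"name": "John Smith", "phone": "555-0100"},
    {"name": "Jon Smith", "phone": "555-0100"},
]

with_name = apply_overwrite(master, group, "name", longest)
with_phone = apply_overwrite(master, group, "phone", most_popular, only_if_empty=True)
```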
- Pair
- Two records that have been compared and assigned a match score. Each pair consists of one record from each datasource being matched (or two records from the same datasource in deduplication mode). Pairs are the fundamental output of the matching process and can be reviewed in the Match Results module in Pairs view mode. A pair includes the scores from each match definition as well as the overall maximum score.
- Pipeline
- The sequential set of processing modules in MatchLogic through which data flows from raw input to clean, deduplicated output: Import (load data) → Profile (analyze quality) → Cleanse (standardize and transform) → Configure (set match strategy) → Define (specify match criteria) → Match (run comparisons) → Merge (apply survivorship) → Export (output results). Each module builds on the work of the previous one, and most steps must be completed in order.
- Probabilistic matching
- A matching mode that uses statistical models to estimate the likelihood that two records refer to the same real-world entity, rather than requiring exact or near-exact string similarity. Probabilistic matching assigns different significance to different fields based on their discriminating power (e.g., a matching date of birth carries more evidential weight than a matching common first name). MatchLogic supports probabilistic matching via machine learning components in the matching engine.
- Project
- A workspace containing all datasources, configurations, and results for a single matching initiative. Every object in MatchLogic — datasources, match configurations, match definitions, results, survivorship rules, and export settings — belongs to a project. Users must create or select a project before accessing any other module. Projects can be named and described to keep multiple independent matching initiatives organized within the same MatchLogic instance.
- ProjectRun
- A background job execution record that tracks the status and progress of a long-running operation such as data import, profiling, matching, or export. Each ProjectRun has a unique ID and a status (NotStarted, InProgress, Completed, Failed, or Cancelled). The frontend polls the /run/status/{id} endpoint every 10 seconds to detect completion and notify the user. ProjectRuns are persisted so that job status survives page refreshes.
- Survivorship
- The process of deciding which field values survive into the golden record when merging duplicate records. Survivorship rules allow fine-grained control: the master record may supply most field values, but individual fields can be overridden by values from other records in the group based on configured operations (e.g., take the longest value, the most recent value, or the value from the most trusted datasource). See also: Overwrite rule and Merge and Survivorship.
- Threshold
- A minimum match score (0–100) below which record pairs are not considered duplicates and are excluded from results. Setting an appropriate threshold is a key tuning decision: too high and genuine duplicates are missed (false negatives); too low and unrelated records are incorrectly matched (false positives). The threshold is configured in the Match Definitions module and applied globally to all match results for a project.
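Applying a threshold is a plain filter over scored pairs. The sketch below assumes pairs are represented as (record_a, record_b, score) tuples and that a score exactly equal to the threshold is kept, which follows from "below which" in the definition above. The function name and sample data are hypothetical.

```python
def apply_threshold(scored_pairs, threshold):
    """Keep only pairs whose score is at or above the threshold."""
    return [pair for pair in scored_pairs if pair[2] >= threshold]


scored_pairs = [("r1", "r2", 92.0), ("r1", "r3", 70.0), ("r2", "r4", 55.5)]
kept = apply_threshold(scored_pairs, 70)
```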
- Uniqueness
- The proportion of distinct values in a field relative to the total number of non-null values. A uniqueness ratio of 1.0 means every record has a different value (ideal for ID fields and primary keys). A uniqueness ratio near 0.0 means nearly all records share the same value (common in low-cardinality fields like country codes or status flags). Uniqueness is reported by the Data Profiling module and is a useful indicator of a field's discriminating power for matching.
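The ratio described above is the number of distinct non-null values divided by the number of non-null values. In this minimal sketch the function name is hypothetical, and treating only `None` as null is an assumption; a real profiler may also count empty strings or placeholders as null.

```python
def uniqueness_ratio(values):
    """Distinct non-null values divided by total non-null values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


id_column = uniqueness_ratio([101, 102, 103, None])
country_column = uniqueness_ratio(["US", "US", "US", "CA"])
```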
- Weight
- A numeric value (0–100) assigned to a match criterion that determines how much that criterion contributes to the overall match score. Higher-weight criteria have greater influence on the final score. Weights are relative: a criterion with weight 80 contributes four times as much as one with weight 20. The overall score is the weighted average of all field-level scores: Σ(field_score × weight) / Σ weight. Weights are configured per criterion in the Match Definitions module.
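The weighted-average formula above can be worked through with a small example. The function name and the (field_score, weight) tuple layout are assumptions for illustration, as is returning 0 when all weights are zero.

```python
def weighted_score(criteria):
    """Weighted average of field-level scores.

    criteria is a list of (field_score, weight) tuples.
    """
    total_weight = sum(weight for _, weight in criteria)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in criteria) / total_weight


# A surname comparison scoring 90 with weight 80,
# and a city comparison scoring 50 with weight 20.
overall = weighted_score([(90.0, 80), (50.0, 20)])
```

The result is (90 × 80 + 50 × 20) / (80 + 20) = 82, which sits much closer to the heavily weighted surname score.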
- WordSmith
- MatchLogic's dictionary-based standardization feature for replacing variant spellings, abbreviations, and synonyms with canonical values before matching. A WordSmith dictionary contains replacement rules (e.g., "St." → "Street", "Intl" → "International"). Dictionaries are managed in the Data Cleansing module and can be applied as nodes in a cleansing workflow. WordSmith is particularly effective at normalizing address data, company names, and industry-specific terminology where free-text variation is common.
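A dictionary-based replacement of this kind can be sketched as a token lookup. The function name is hypothetical, and whole-token, case-sensitive matching on whitespace-separated words is an assumption made for this example; WordSmith's actual matching behavior is not described here.

```python
def apply_dictionary(text, dictionary):
    """Replace each whitespace-separated token that has a dictionary entry."""
    return " ".join(dictionary.get(token, token) for token in text.split())


# Replacement rules taken from the examples above.
rules = {"St.": "Street", "Intl": "International"}

address = apply_dictionary("12 Main St.", rules)
company = apply_dictionary("Acme Intl Holdings", rules)
```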