Quick Start: Your First Deduplication Project
This tutorial walks you through a complete deduplication project from start to finish. By the end, you will have imported a dataset, found duplicate records, reviewed the results, and exported a clean file. The entire process takes about 15 minutes.
What You Will Need
- A CSV or Excel file containing records you suspect have duplicates. A customer list, contact database, or vendor file works well.
- Access to a running MatchLogic instance.
Step 1: Create a Project
Open MatchLogic and navigate to Project Management. Click Create Project, give your project a descriptive name (for example, "Customer Dedup Q1"), and optionally add a description. Click Save. Your new project is now selected and ready to use.
Step 2: Import Your Data
Click Data Import in the sidebar. Select your file type (CSV or Excel), then drag your file onto the upload area or click to browse. MatchLogic will show a preview of the first few rows so you can verify the data looks correct. Confirm column names and data types, then click Import. The import runs as a background job; you will see a notification when it finishes.
Step 3: Profile Your Data
Navigate to Data Profiling. Select your imported data source and click Run Profile. Profiling analyzes every column for completeness, uniqueness, data types, patterns, and outliers. Review the results to identify quality issues. Pay attention to fields you plan to match on -- high uniqueness and completeness lead to better match accuracy.
Tip
Profiling is optional but highly recommended. It only takes a minute and can reveal issues like inconsistent formatting that would reduce match quality.
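To make the completeness and uniqueness measures concrete, here is a minimal Python sketch that computes both for each column of a small in-memory CSV. It illustrates the idea only and is not how MatchLogic implements profiling; the function name profile_columns and the sample data are hypothetical.

```python
import csv
import io


def profile_columns(rows):
    """Return completeness and uniqueness ratios for each column.

    completeness = share of rows with a non-blank value
    uniqueness   = share of non-blank values that are distinct
    """
    if not rows:
        return {}
    report = {}
    for column in rows[0].keys():
        values = [(row.get(column) or "").strip() for row in rows]
        filled = [v for v in values if v]
        completeness = len(filled) / len(values)
        uniqueness = len(set(filled)) / len(filled) if filled else 0.0
        report[column] = {
            "completeness": round(completeness, 2),
            "uniqueness": round(uniqueness, 2),
        }
    return report


# Hypothetical sample data standing in for an imported file.
SAMPLE_CSV = """FirstName,LastName,Email
Ann,Lee,ann.lee@example.com
Ann,Lee,ann.lee@example.com
Bob,Stone,
Cara,Diaz,cara@example.com
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
for column, stats in profile.items():
    print(column, stats)
```

In this sample, Email is only 75% complete, so a rule that relies heavily on Email would have nothing to compare for a quarter of the records.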
Step 4: Cleanse Data (Optional)
If profiling revealed quality issues, navigate to Data Cleansing. Use the visual flow builder to add transformation rules such as trimming whitespace, standardizing case, or replacing abbreviations. Apply the rules and preview the cleansed output. Skip this step if your data is already clean.
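The three example transformations can be pictured as a chain of small functions applied in order. The Python sketch below is an illustration under assumptions, not MatchLogic's flow builder: the rule names, the ABBREVIATIONS map, and the rule order are all hypothetical.

```python
import re

# Hypothetical abbreviation map; a real project would supply its own.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def trim_whitespace(value):
    """Strip leading/trailing spaces and collapse internal whitespace runs."""
    return re.sub(r"\s+", " ", value).strip()


def title_case(value):
    """Standardize case so 'MAIN' and 'main' both become 'Main'."""
    return value.title()


def expand_abbreviations(value):
    """Replace known abbreviations (with or without a trailing period)."""
    words = []
    for word in value.split(" "):
        key = word.rstrip(".").lower()
        words.append(ABBREVIATIONS.get(key, word))
    return " ".join(words)


def cleanse(value, rules):
    """Apply each rule in order, feeding the output of one into the next."""
    for rule in rules:
        value = rule(value)
    return value


ADDRESS_RULES = [trim_whitespace, title_case, expand_abbreviations]
cleaned = cleanse("  12   main st. ", ADDRESS_RULES)
print(cleaned)  # 12 Main Street
```

Order matters: trimming first means the later rules see a single space between words, which keeps the abbreviation lookup simple.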
Step 5: Configure Matching
Navigate to Match Configuration. Since you have a single data source, MatchLogic automatically sets up a within-source deduplication configuration. Verify that your data source appears and the strategy is set to find duplicates within it. Click Save.
Step 6: Define Match Rules
Navigate to Match Definitions. Here you tell the engine how to compare records:
- Map fields — For single-source dedup, fields are automatically mapped to themselves.
- Add a definition — Click Add Definition and give it a name (for example, "Name + Email").
- Add criteria — Select the fields to compare. For a customer list, try
  FirstName with Fuzzy match type (weight 30), LastName with Fuzzy (weight 30), and Email with Exact (weight 40).
- Save the definition.
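One way to picture how weighted criteria combine into a single score is the Python sketch below. It uses the field names and weights from the example above, but the scoring formula itself is an assumption: MatchLogic's actual fuzzy algorithm and score calculation are not described here, and difflib's SequenceMatcher is only a stand-in for a fuzzy comparison.

```python
from difflib import SequenceMatcher

# Criteria from the example above: (field, match type, weight).
CRITERIA = [
    ("FirstName", "fuzzy", 30),
    ("LastName", "fuzzy", 30),
    ("Email", "exact", 40),
]


def field_similarity(a, b, match_type):
    """Return a similarity between 0.0 and 1.0 for one pair of field values."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # assumption: a blank value never counts as a match
    if match_type == "exact":
        return 1.0 if a == b else 0.0
    # Stand-in fuzzy comparison; not MatchLogic's algorithm.
    return SequenceMatcher(None, a, b).ratio()


def match_score(record_a, record_b, criteria=CRITERIA):
    """Weighted average of field similarities, scaled to 0-100."""
    total_weight = sum(weight for _, _, weight in criteria)
    weighted = sum(
        weight * field_similarity(
            record_a.get(field, ""), record_b.get(field, ""), match_type
        )
        for field, match_type, weight in criteria
    )
    return 100.0 * weighted / total_weight


a = {"FirstName": "Jon", "LastName": "Smith", "Email": "jon.smith@example.com"}
b = {"FirstName": "John", "LastName": "Smith", "Email": "jon.smith@example.com"}
score = match_score(a, b)
print(round(score, 1))
```

Under this sketch, the pair scores high because the exact Email match carries the largest weight and the first names differ by a single letter. Giving the most reliable field the largest weight is a common starting point.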
Step 7: Run the Match
Navigate to Match Results and click Run Match. The matching engine processes your data in the background. Depending on dataset size, this may take a few seconds to several minutes. You will receive a notification when it finishes.
Step 8: Review Results
Once the match completes, the Summary tab shows a quality report: total pairs found, score distribution across confidence bands, and key statistics. Switch to the Detailed Analysis tab to see individual pairs and groups. Inspect scores, review the matched fields, and mark records as duplicates or non-duplicates where the engine needs correction.
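As a rough illustration of what a score distribution across confidence bands means, the Python sketch below buckets a list of pair scores and picks out the middle band for manual review. The band names and thresholds are hypothetical and are not MatchLogic's defaults.

```python
from collections import Counter

# Hypothetical bands as (inclusive lower bound, label), highest first.
BANDS = [(90, "high"), (75, "medium"), (0, "low")]


def band_for(score):
    """Return the label of the first band whose lower bound the score meets."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "low"


def summarize(pair_scores):
    """Count pairs per band and list the scores that merit manual review."""
    distribution = Counter(band_for(s) for s in pair_scores)
    return {
        "total_pairs": len(pair_scores),
        "distribution": {label: distribution.get(label, 0) for _, label in BANDS},
        "needs_review": [s for s in pair_scores if band_for(s) == "medium"],
    }


scores = [98.5, 91.0, 82.3, 77.0, 60.2]
summary = summarize(scores)
print(summary)
```

Pairs in the middle band are usually where manual review pays off most, since the top band is mostly true duplicates and the bottom band mostly is not.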
Step 9: Set Merge Rules
Navigate to Merge & Survivorship. Configure master record rules (for example, keep the most complete record as master) and field-level survivorship rules (for example, keep the longest address value). Preview the merged output to verify correctness.
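The two example rules can be sketched in Python as follows: choose the record with the most filled-in fields as master, then override selected fields with the longest value found in the group. This is a simplified illustration with hypothetical function names and sample data, not MatchLogic's merge engine.

```python
def completeness(record):
    """Count the fields that hold a non-blank value."""
    return sum(1 for value in record.values() if value and value.strip())


def pick_master(group):
    """Master record rule: keep the most complete record (first one on ties)."""
    return max(group, key=completeness)


def longest_value(group, field):
    """Field-level rule: keep the longest value seen for this field."""
    return max((record.get(field, "") for record in group), key=len)


def merge_group(group, longest_fields=("Address",)):
    """Start from the master, then apply field-level survivorship rules."""
    merged = dict(pick_master(group))
    for field in longest_fields:
        merged[field] = longest_value(group, field)
    return merged


# Hypothetical duplicate group of two records.
group = [
    {"Name": "Ann Lee", "Email": "", "Address": "12 Main Street, Springfield"},
    {"Name": "Ann Lee", "Email": "ann.lee@example.com", "Address": "12 Main St"},
]
merged = merge_group(group)
print(merged)
```

Here the second record becomes master because it has an email, while the fuller address survives from the first record.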
Step 10: Export Clean Data
Navigate to Final Export. Select an export action (such as "Suppress All Duplicate Records" to get a deduplicated file), choose your output format (CSV or Excel), and click Export. Download the file when the job completes.
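Conceptually, producing a deduplicated file means collapsing matched pairs into groups and writing one record per group. The Python sketch below assumes that the export keeps the earliest record of each group and drops the rest. The exact behavior of each MatchLogic export action may differ, and the function names and sample data are hypothetical.

```python
import csv
import io


def group_pairs(record_ids, matched_pairs):
    """Union-find: merge matched pairs into duplicate groups.

    Returns a dict mapping each record id to the smallest id in its group.
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]  # path halving
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as root so the earliest record represents the group.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {rid: find(rid) for rid in record_ids}


def export_deduplicated(records, matched_pairs):
    """Write one record per duplicate group to CSV text."""
    group_of = group_pairs(records.keys(), matched_pairs)
    kept = [
        dict(Id=rid, **fields)
        for rid, fields in records.items()
        if group_of[rid] == rid
    ]
    if not kept:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(kept[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return buffer.getvalue()


# Hypothetical records and confirmed duplicate pairs.
records = {
    1: {"Name": "Ann Lee", "Email": "ann.lee@example.com"},
    2: {"Name": "Anne Lee", "Email": "ann.lee@example.com"},
    3: {"Name": "Bob Stone", "Email": "bob@example.com"},
    4: {"Name": "A. Lee", "Email": "ann.lee@example.com"},
}
pairs = [(1, 2), (2, 4)]
output = export_deduplicated(records, pairs)
print(output)
```

Records 1, 2, and 4 end up in one group even though 1 and 4 were never matched directly, which is why pairs are grouped before anything is written.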
Congratulations -- you have completed your first deduplication project. For matching across multiple files, see #quick-start-cross-file-matching.