Scalable Binning for Big Data Deduplication

Published April 7, 2020

Assignee: not available in the USPTO data we have

Technical Abstract

Patent Claims

10 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computer implemented method for use in fast record deduplication comprising: inputting, into software running on one or more computer processors, data records having multiple attributes; inputting, into the software, local similarity functions of individual attributes with local similarity thresholds; generating, by the software, Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating, by the software, the local candidate pairs to produce a set of global candidate pairs; and filtering, by the software, the set of global candidate pairs by deciding whether a pair of data records represents a duplicate, wherein the generating and identifying further comprise: extracting building blocks from text within an attribute of a data record; mapping the extracted building blocks to a global pre-defined order; selecting subsets of the extracted building blocks as Bin IDs; repeating the extracting, mapping, and selecting steps for every data record containing one or more text attributes; and matching any two data records sharing a same Bin ID as a local candidate pair.

Plain English Translation

This invention relates to fast record deduplication in computer systems, addressing the challenge of efficiently identifying and removing duplicate data records from large datasets. The method processes data records with multiple attributes, using local similarity functions and thresholds to generate Bin IDs for each attribute. These Bin IDs are derived by extracting building blocks (e.g., tokens or substrings) from text attributes, mapping them to a predefined order, and selecting subsets of these blocks as Bin IDs. Records sharing the same Bin ID are identified as local candidate pairs. These local pairs are then aggregated into a set of global candidate pairs, which are further filtered to determine if they represent true duplicates. The approach improves deduplication speed by reducing the number of comparisons needed, leveraging attribute-level similarity functions to narrow down potential duplicates before global filtering. The method is particularly useful in applications requiring high-throughput data processing, such as databases, data warehouses, and record linkage systems.
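
The text-binning steps of claim 1 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the building blocks are lowercase whitespace tokens, that the global pre-defined order ranks tokens from rarest to most common, and that the selected subset is the first prefix_len tokens of each record in that order. The claim itself only requires "selecting subsets". The names tokenize, global_order, text_bin_ids, local_candidate_pairs, and prefix_len are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Extract building blocks; here, lowercase whitespace tokens (an assumption)."""
    return set(text.lower().split())


def global_order(texts):
    """Rank every token once for the whole dataset: rarest first, ties alphabetical."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    ranked = sorted(counts, key=lambda tok: (counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ranked)}


def text_bin_ids(text, order, prefix_len=2):
    """Sort a record's tokens by the global order and keep the first few as Bin IDs."""
    ordered = sorted(tokenize(text), key=order.__getitem__)
    return ordered[:prefix_len]


def local_candidate_pairs(records, prefix_len=2):
    """Pair up record ids whose text attribute shares at least one Bin ID."""
    order = global_order(records.values())
    bins = defaultdict(set)
    for rec_id, text in records.items():
        for bin_id in text_bin_ids(text, order, prefix_len):
            bins[bin_id].add(rec_id)
    pairs = set()
    for members in bins.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical sample data: records 1 and 2 share the Bin ID "acme".
records = {
    1: "Acme Widget Corp",
    2: "ACME Widget Corporation",
    3: "Globex Industries",
}
pairs = local_candidate_pairs(records)
```

With this sample data, records 1 and 2 become a local candidate pair and record 3 is never compared with either of them.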

Claim 2

Original Legal Text

2. A computer implemented method for use in fast record deduplication comprising: inputting, into software running on one or more computer processors, data records having multiple attributes; inputting, into the software, local similarity functions of individual attributes with local similarity thresholds; generating, by the software, Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating, by the software, the local candidate pairs to produce a set of global candidate pairs; and filtering, by the software, the set of global candidate pairs by deciding whether a pair of data records represents a duplicate, wherein the generating and identifying further comprise: creating two sets of numeric bins, where the length of each numeric bin equals two times a threshold, the bins within each set are disjoint and interleaved with overlap equal to the threshold, and assigning a unique Bin ID to each bin in each set; mapping a data record having a numeric value to two Bin IDs, one from each of the two sets of numeric bins, based on the first bin being 2*floor (numeric value/2*threshold) and the second bin being 2*floor (((numeric value+threshold)/2*threshold)+1); repeating the mapping for every data record having a numeric value; and matching any two data records sharing a same Bin ID as a local candidate pair.

Plain English Translation

This invention relates to fast record deduplication in computer systems, addressing the challenge of efficiently identifying and removing duplicate data records from large datasets. The method processes data records with multiple attributes, using local similarity functions and thresholds for each attribute to generate Bin IDs. These Bin IDs are used to identify candidate pairs of potentially duplicate records. For numeric attributes, the method creates two sets of numeric bins, where each bin's length equals twice the threshold. The bins within a set are disjoint, and the two sets are interleaved so that each bin in one set overlaps its neighbors in the other set by the threshold. Each bin is assigned a unique Bin ID. A record's numeric value is mapped to two Bin IDs, one from each set, using the floor-based formulas recited in the claim: the first is computed from the value divided by twice the threshold, and the second from the value shifted by the threshold. This mapping is repeated for every record having a numeric value. Records sharing the same Bin ID are marked as local candidate pairs. These local pairs are then aggregated into a global set of candidate pairs, which are further filtered to determine if they represent true duplicates. The approach improves deduplication efficiency because only records that share a bin are compared.
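
The numeric binning of claim 2 can be sketched as follows. This is a hedged illustration, not the patented implementation: the claim's formulas are ambiguous about operator precedence, and this sketch assumes the division is by (2*threshold) and that the "+1" is applied after doubling, so that the first set uses even Bin IDs and the second set uses odd Bin IDs. That reading is an assumption chosen so that IDs stay unique across the two sets. The function names numeric_bin_ids and share_bin are hypothetical.

```python
import math


def numeric_bin_ids(value, threshold):
    """Return one Bin ID from each of two interleaved sets of bins of width 2*threshold."""
    width = 2 * threshold
    # Even IDs: bins [k*width, (k+1)*width).
    first = 2 * math.floor(value / width)
    # Odd IDs: the same bins shifted down by threshold (assumed reading of the claim).
    second = 2 * math.floor((value + threshold) / width) + 1
    return first, second


def share_bin(a, b, threshold):
    """True when two numeric values land in at least one common bin."""
    return bool(set(numeric_bin_ids(a, threshold)) & set(numeric_bin_ids(b, threshold)))
```

Under this reading, two values that differ by at most the threshold always share a bin: if a boundary of the first set separates them, both fall inside the bin of the second set that is centered on that boundary. Values more than 2*threshold apart never share a bin, and values in between may or may not, which is why the later filtering step is still needed.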

Claim 3

Original Legal Text

3. A computer implemented method for use in fast record deduplication comprising: inputting, into software running on one or more computer processors, data records having multiple attributes; inputting, into the software, local similarity functions of individual attributes with local similarity thresholds; generating, by the software, Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs by a Cartesian product of all data records sharing a same Bin ID; aggregating, by the software, the local candidate pairs to produce a set of global candidate pairs; filtering, by the software, the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

Plain English Translation

This technical summary describes a computer-implemented method for fast record deduplication, addressing the challenge of efficiently identifying and removing duplicate records in large datasets. The method processes data records containing multiple attributes and applies local similarity functions to each attribute, along with corresponding local similarity thresholds. These functions and thresholds are used to generate Bin IDs, which group records into bins based on their attribute similarities. The method then identifies local candidate pairs by computing the Cartesian product of all records within the same bin, effectively narrowing down potential duplicates. These local candidate pairs are aggregated to form a set of global candidate pairs, which are further filtered to determine whether any pair of records represents a true duplicate. The filtering step evaluates the global candidate pairs to ensure accurate deduplication while maintaining computational efficiency. The approach leverages local similarity functions and binning to reduce the search space, making the deduplication process faster and more scalable for large datasets.
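
The pair-generation step of claim 3 can be sketched in Python. This is a minimal sketch under assumptions: it treats the Cartesian product within a bin as the set of unordered pairs of distinct records, dropping self pairs and mirrored duplicates, which the claim does not state explicitly. The name pairs_from_bins and the sample Bin IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairs_from_bins(record_bins):
    """record_bins maps record id -> iterable of Bin IDs; returns the candidate pair set."""
    members = defaultdict(set)
    for rec_id, bin_ids in record_bins.items():
        for bin_id in bin_ids:
            members[bin_id].add(rec_id)
    pairs = set()
    for recs in members.values():
        # All unordered pairs inside one bin: the Cartesian product
        # minus self pairs and mirrored duplicates.
        pairs.update(combinations(sorted(recs), 2))
    return pairs


# Hypothetical Bin IDs: r1/r2 share bin 0, r2/r3 share bin 3, r4 is alone.
record_bins = {"r1": [0, 1], "r2": [0, 3], "r3": [2, 3], "r4": [6, 7]}
pairs = pairs_from_bins(record_bins)
```

A bin holding n records contributes n*(n-1)/2 pairs, so the cost of this step is driven by the largest bins rather than by the total number of records.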

Claim 4

Original Legal Text

4. The method of claim 3 , wherein aggregating the local candidate pairs to produce a set of global candidate pairs further comprises: intersecting the local candidate pairs to obtain the set of global candidate pairs; or unioning the local candidate pairs to obtain the set of global candidate pairs; or intersecting the union of the local candidate pairs to obtain the set of global candidate pairs.

Plain English Translation

This claim depends on claim 3 and specifies how the local candidate pairs, obtained from the Bin IDs of the individual attributes, are combined into the set of global candidate pairs. Three options are recited. The first intersects the local candidate pairs, so a pair is kept only if it appears in every local set; this yields fewer candidates but can exclude duplicates that differ strongly on one attribute. The second unions the local candidate pairs, so a pair is kept if it appears in any local set; this maximizes coverage but passes more non-duplicate pairs to the filtering step. The third intersects the union of the local candidate pairs. The claim does not spell out the grouping; one natural reading is that the local sets are grouped, each group is unioned, and the resulting unions are intersected, which sits between the other two options in coverage and selectivity. The choice among the three options lets the method trade recall against the number of pairs that must be filtered.
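
The three aggregation options recited in claim 4 can be sketched with Python sets. This is an illustration under assumptions, not the patented implementation: the claim does not define "intersecting the union" in detail, so the sketch assumes it means unioning the local sets within each group and then intersecting the group results. The function names and the attribute names (name, email, phone) are hypothetical.

```python
from functools import reduce


def intersect_all(local_sets):
    """Keep a pair only if every local set contains it (expects a non-empty list)."""
    return reduce(set.intersection, local_sets)


def union_all(local_sets):
    """Keep a pair if any local set contains it."""
    return reduce(set.union, local_sets, set())


def intersect_of_unions(groups):
    """Union the local sets inside each group, then intersect the group results."""
    return intersect_all([union_all(group) for group in groups])


# Hypothetical local candidate pairs from three attributes.
name_pairs = {(1, 2), (1, 3), (4, 5)}
email_pairs = {(1, 2), (4, 5)}
phone_pairs = {(1, 3), (4, 5), (6, 7)}

strict = intersect_all([name_pairs, email_pairs, phone_pairs])
broad = union_all([name_pairs, email_pairs, phone_pairs])
# Read as: name AND (email OR phone).
mixed = intersect_of_unions([[name_pairs], [email_pairs, phone_pairs]])
```

In this sample, the intersection keeps one pair, the union keeps four, and the mixed form keeps three, which shows how the choice changes the number of pairs handed to the filtering step.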

Claim 5

Original Legal Text

5. A computer implemented method for use in fast record deduplication comprising: inputting, into software running on one or more computer processors, data records having multiple attributes; inputting, into the software, local similarity functions of individual attributes with local similarity thresholds; generating, by the software, Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating, by the software, the local candidate pairs to produce a set of global candidate pairs by: intersecting the local candidate pairs to obtain the set of global candidate pairs; or unioning the local candidate pairs to obtain the set of global candidate pairs; or intersecting the union of the local candidate pairs to obtain the set of global candidate pairs; and filtering, by the software, the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

Plain English Translation

This invention relates to fast record deduplication in computer systems, addressing the challenge of efficiently identifying and removing duplicate records from large datasets. The method processes data records with multiple attributes and uses local similarity functions for each attribute to generate Bin IDs, which group records based on similarity thresholds. These Bin IDs help identify local candidate pairs of potentially duplicate records. The method then aggregates these local candidate pairs into a set of global candidate pairs through intersection, union, or intersection of the union operations. Finally, the system filters these global candidate pairs to determine whether each pair represents a true duplicate. The approach optimizes deduplication by reducing the number of comparisons needed, improving efficiency in large-scale data processing. The local similarity functions and thresholds allow for flexible similarity assessments across different attributes, while the aggregation steps ensure comprehensive coverage of potential duplicates. The filtering step ensures accuracy by verifying whether the identified pairs are actual duplicates. This method is particularly useful in applications requiring high-speed deduplication of large datasets, such as data cleaning, database management, and big data analytics.
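
The final filtering step can be sketched as follows. The claims do not say how a pair is judged to be a duplicate, so everything here is an assumption made for illustration: the rule (names similar and ages close), the thresholds, the use of Python's difflib.SequenceMatcher as the name similarity, and the names is_duplicate and filter_pairs are all hypothetical.

```python
from difflib import SequenceMatcher


def is_duplicate(rec_a, rec_b, name_threshold=0.8, age_threshold=1):
    """Hypothetical global rule: names are similar AND ages are close."""
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    return name_sim >= name_threshold and abs(rec_a["age"] - rec_b["age"]) <= age_threshold


def filter_pairs(records, global_pairs):
    """Keep only the global candidate pairs that the rule accepts as duplicates."""
    return {(a, b) for a, b in global_pairs if is_duplicate(records[a], records[b])}


# Hypothetical records and global candidate pairs.
records = {
    1: {"name": "Jon Smith", "age": 41},
    2: {"name": "John Smith", "age": 41},
    3: {"name": "Joan Smythe", "age": 67},
}
duplicates = filter_pairs(records, {(1, 2), (1, 3), (2, 3)})
```

Only the candidate pairs are evaluated by the more expensive rule, which is the point of generating and aggregating candidates first.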

Claim 6

Original Legal Text

6. A system for performing fast record deduplication comprising at least one non-transitory computer-readable medium containing computer program instructions that when executed by at least one computer processor causes the at least one computer processor to perform the steps of: inputting data records having multiple attributes; inputting local similarity functions of individual attributes with local similarity thresholds; generating Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating the local candidate pairs to produce a set of global candidate pairs; and filtering the set of global candidate pairs by deciding whether a pair of data records represents a duplicate, wherein the generating and identifying further comprise: extracting building blocks from text within an attribute of a data record; mapping the extracted building blocks to a global pre-defined order; selecting subsets of the extracted building blocks as Bin IDs; repeating the extracting, mapping, and selecting steps for every data record containing one or more text attributes; and matching any two data records sharing a same Bin ID as a local candidate pair.

Plain English Translation

This invention relates to a system for fast record deduplication in databases or data processing systems. The problem addressed is the computational inefficiency of traditional deduplication methods, which often require comparing every record against every other record, leading to high processing time and resource consumption, especially with large datasets. The system processes data records containing multiple attributes, including text attributes. It uses local similarity functions and thresholds for individual attributes to generate Bin IDs, which are compact representations of the records. These Bin IDs are derived by extracting building blocks (e.g., tokens, n-grams) from text attributes, mapping them to a predefined global order, and selecting subsets of these blocks as Bin IDs. Records sharing the same Bin ID are identified as local candidate pairs, reducing the number of comparisons needed. The system then aggregates these local candidate pairs into a global set of candidate pairs, which are further filtered to determine if any pair represents a duplicate. This approach significantly reduces the computational overhead by narrowing down potential duplicates early in the process, making the deduplication faster and more scalable. The method is particularly useful in applications requiring real-time or near-real-time data processing, such as customer databases, financial transactions, or large-scale data integration tasks.

Claim 7

Original Legal Text

7. A system for performing fast record deduplication comprising at least one non-transitory computer-readable medium containing computer program instructions that when executed by at least one computer processor causes the at least one computer processor to perform the steps of: inputting data records having multiple attributes; inputting local similarity functions of individual attributes with local similarity thresholds; generating Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating the local candidate pairs to produce a set of global candidate pairs; and filtering the set of global candidate pairs by deciding whether a pair of data records represents a duplicate, wherein the generating and identifying further comprise: creating two sets of numeric bins, where the length of each numeric bin equals two times a threshold, the bins within each set are disjoint and interleaved with overlap equal to the threshold, and assigning a unique Bin ID to each bin in each set; mapping a data record having a numeric value to two Bin IDs, one from each of the two sets of numeric bins, based on the first bin being 2*floor (numeric value/2*threshold) and the second bin being 2*floor (((numeric value+threshold)/2*threshold)+1); repeating the mapping for every data record having a numeric value; and matching any two data records sharing a same Bin ID as a local candidate pair.

Plain English Translation

A system for fast record deduplication processes data records with multiple attributes to identify and filter duplicate records efficiently. The system inputs data records and local similarity functions for individual attributes, each with a corresponding local similarity threshold. It generates Bin IDs by creating two sets of numeric bins, where each bin's length equals twice the threshold. The bins within each set are disjoint but interleaved, with an overlap equal to the threshold, ensuring comprehensive coverage. Each bin in both sets is assigned a unique Bin ID. For numeric values in the data records, the system maps each value to two Bin IDs—one from each set. The first Bin ID is determined by 2*floor(numeric value / (2*threshold)), and the second by 2*floor(((numeric value + threshold) / (2*threshold)) + 1). This mapping is repeated for all numeric values in the records. Data records sharing the same Bin ID are identified as local candidate pairs. These local candidate pairs are aggregated to form a set of global candidate pairs, which are then filtered to determine if any pair of records represents a duplicate. The system optimizes deduplication by leveraging binning and similarity functions to reduce computational overhead while maintaining accuracy.

Claim 8

Original Legal Text

8. A system for performing fast record deduplication comprising at least one non-transitory computer-readable medium containing computer program instructions that when executed by at least one computer processor causes the at least one computer processor to perform the steps of: inputting data records having multiple attributes; inputting local similarity functions of individual attributes with local similarity thresholds; generating Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs by a Cartesian product of all data records sharing a same Bin ID; aggregating the local candidate pairs to produce a set of global candidate pairs; and filtering the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

Plain English Translation

This system is designed for fast data deduplication. It inputs data records, each having multiple attributes, along with specific local similarity functions and thresholds for these attributes. The system then generates "Bin IDs" for records based on these defined similarity functions and thresholds. To identify potential duplicates, it creates local candidate pairs by performing a Cartesian product of all data records that share the same Bin ID. These local candidate pairs are subsequently aggregated to form a set of global candidate pairs. Finally, the system filters this global set to determine which pairs of data records represent actual duplicates.

Claim 9

Original Legal Text

9. The system of claim 8 , wherein aggregating the local candidate pairs to produce a set of global candidate pairs further comprises: intersecting the local candidate pairs to obtain the set of global candidate pairs; or unioning the local candidate pairs to obtain the set of global candidate pairs; or intersecting the union of the local candidate pairs to obtain the set of global candidate pairs.

Plain English Translation

This claim depends on claim 8 and specifies how the system aggregates the local candidate pairs, obtained from the Bin IDs of the individual attributes, into the set of global candidate pairs. The aggregation uses one of three recited methods: intersecting the local candidate pairs, unioning the local candidate pairs, or intersecting the union of the local candidate pairs. The intersection method keeps only candidate pairs present in all local sets, which suits high-precision applications. The union method keeps every candidate pair from any local set, maximizing recall but passing more non-duplicate pairs to the filtering step. The intersection of the union provides a middle ground; the claim does not spell out the grouping, but one natural reading is that groups of local sets are unioned and the resulting unions are then intersected. The choice of method depends on whether the application needs strict matching (intersection) or broader inclusion (union).

Claim 10

Original Legal Text

10. A system for performing fast record deduplication comprising at least one non-transitory computer-readable medium containing computer program instructions that when executed by at least one computer processor causes the at least one computer processor to perform the steps of: inputting data records having multiple attributes; inputting local similarity functions of individual attributes with local similarity thresholds; generating Bin IDs based on the local similarity functions and the local similarity thresholds; identifying local candidate pairs based on data records that share Bin IDs; aggregating the local candidate pairs to produce a set of global candidate pairs by: intersecting the local candidate pairs to obtain the set of global candidate pairs; or unioning the local candidate pairs to obtain the set of global candidate pairs; or intersecting the union of the local candidate pairs to obtain the set of global candidate pairs; and filtering the set of global candidate pairs by deciding whether a pair of data records represents a duplicate.

Plain English Translation

A system performs fast record deduplication by processing data records with multiple attributes. The system inputs local similarity functions for individual attributes, each with a corresponding local similarity threshold. These functions generate Bin IDs for the records, grouping similar records into the same bins. The system then identifies local candidate pairs by comparing records within the same bins. These local candidate pairs are aggregated to form a set of global candidate pairs through intersection, union, or intersection of the union operations. The global candidate pairs are filtered to determine whether the pairs represent duplicates, ensuring accurate deduplication. The approach efficiently narrows down potential duplicates by leveraging attribute-level similarity before performing a final global comparison, reducing computational overhead while maintaining accuracy. The system is designed to handle large datasets by breaking down the deduplication process into manageable steps, optimizing performance and scalability.

Patent Metadata

Filing Date

Unknown

Publication Date

April 7, 2020

Inventors

George Beskales

Ihab F. Ilyas
