Most research into entity resolution (also known as
record linkage or data matching) has concentrated on
the quality of the matching results. In this paper, we
focus on matching time and scalability, with the aim
of achieving large-scale real-time entity resolution.
Traditional entity resolution techniques have assumed the matching of two static databases. In our
networked and online world, however, it is becoming
increasingly important for many organisations to be
able to conduct entity resolution between a collection
of often very large databases and a stream of query
or update records. The matching should be done in
(near) real-time, and be as automatic and accurate as
possible, returning a ranked list of matched records
for each given query record. This task therefore becomes similar to querying large document collections,
as done for example by Web search engines, but it is
based on a different type of document: structured
database records that, for example, contain personal
information, such as names and addresses.
In this paper, we investigate inverted indexing
techniques, as commonly used in Web search engines,
and employ them for real-time entity resolution. We
present two variations of the traditional inverted index approach, aimed at facilitating fast approximate
matching. We show encouraging initial results on
large real-world data sets, with the inverted index approaches being up to one hundred times faster than
the traditionally used standard blocking approach.
However, this improved matching speed currently
comes at a cost, in that matching quality for larger
data sets can be lower than when standard
blocking is used, and thus more work is required.
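To make the general idea of querying an inverted index with a single record concrete, the following is a minimal sketch. It is not the similarity-aware index evaluated in this paper: the class name BigramInvertedIndex, the use of character bigrams as index keys, and the Dice coefficient for ranking are all assumptions chosen for illustration only.

```python
from collections import defaultdict


def bigrams(value):
    """Return the set of character bigrams of a lower-cased string."""
    s = value.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


class BigramInvertedIndex:
    """Hypothetical illustration: maps bigrams to the records containing them."""

    def __init__(self):
        self.postings = defaultdict(set)  # bigram -> set of record ids
        self.grams = {}                   # record id -> bigram set of the record

    def add(self, rec_id, record):
        """Index one record (a dict of field name -> string value)."""
        grams = set()
        for value in record.values():
            grams |= bigrams(value)
        self.grams[rec_id] = grams
        for g in grams:
            self.postings[g].add(rec_id)

    def query(self, record, top_k=5):
        """Return up to top_k (record id, Dice score) pairs, best first."""
        q = set()
        for value in record.values():
            q |= bigrams(value)
        # Only records sharing at least one bigram with the query are compared.
        candidates = set()
        for g in q:
            candidates |= self.postings.get(g, set())
        scored = []
        for rec_id in candidates:
            r = self.grams[rec_id]
            dice = 2 * len(q & r) / (len(q) + len(r))
            scored.append((dice, rec_id))
        # Highest score first; ties broken by record id for a stable order.
        scored.sort(key=lambda t: (-t[0], t[1]))
        return [(rec_id, round(score, 3)) for score, rec_id in scored[:top_k]]


# Example records are invented purely for illustration.
index = BigramInvertedIndex()
index.add("r1", {"given": "peter", "surname": "christen"})
index.add("r2", {"given": "petra", "surname": "christensen"})
index.add("r3", {"given": "john", "surname": "smith"})
results = index.query({"given": "peter", "surname": "christen"})
```

In this sketch the query touches only the posting lists of its own bigrams, so records with nothing in common (such as "r3" here) are never compared, which is where the speed of an index-based approach comes from. It also pools the bigrams of all fields into one set per record, a simplification that a real system would likely avoid.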
Cite as: Christen, P. and Gayler, R. (2008). Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 51-60.