Object Identification by Statistical Methods
Numerical data fusion, or the merging of overlapping data files, becomes a hard problem when the corresponding data sets share no globally unique identifying key. Typical examples are the linkage of address files supplied by different sources for commercial purposes (a lucrative area), the merging of special offers published in various media (cf. duplicate detection), and an administrative record census (ARC) as planned in Germany, where several autonomous, heterogeneous registers are to be merged. We present a three-step procedure: conversion of attributes, comparison of the attribute values of a pair of objects, and classification of each pair (the 'matching problem') as either "same" ("matched") or "not same" ("not matched"). We pay special attention to the quality and the efficiency of the methodology. We briefly discuss correctness and completeness of the matching, as well as pre-selection techniques such as 'blocking', which reduce the computational cost of pairwise comparison by restricting it to candidate pairs that agree on a chosen key. The approach is illustrated on carefully composed benchmark data sets. We assume some basic knowledge of computer science and classification (supervised learning).
Author: Hans-Joachim Lenz, Free University, Berlin
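
To make the three steps and the role of blocking concrete, here is a minimal Python sketch. It is not the author's implementation: the record fields (`id`, `name`, `street`, `zip`), the choice of `zip` as blocking key, the string similarity from `difflib`, and the fixed threshold of 0.85 are all assumptions made for illustration. In particular, the threshold rule merely stands in for the supervised classifier the abstract refers to.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def convert(record):
    """Step 1 (conversion): normalize case and whitespace of every attribute."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def compare(a, b, fields=("name", "street")):
    """Step 2 (comparison): one similarity score in [0, 1] per attribute."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(scores, threshold=0.85):
    """Step 3 (classification): hypothetical rule on the mean similarity."""
    mean = sum(scores) / len(scores)
    return "matched" if mean >= threshold else "not matched"


def link(file_a, file_b, block_key="zip"):
    """Blocking: compare only pairs that agree on the blocking key."""
    blocks = defaultdict(list)
    for rec in map(convert, file_b):
        blocks[rec[block_key]].append(rec)
    results = []
    for rec_a in map(convert, file_a):
        for rec_b in blocks.get(rec_a[block_key], []):
            label = classify(compare(rec_a, rec_b))
            results.append((rec_a["id"], rec_b["id"], label))
    return results


# Invented toy records, not taken from the benchmark data sets.
file_a = [
    {"id": "a1", "name": "Hans Meier", "street": "Hauptstr. 5", "zip": "14195"},
    {"id": "a2", "name": "Anna Schmidt", "street": "Gartenweg 12", "zip": "10115"},
]
file_b = [
    {"id": "b1", "name": "hans  MEIER", "street": "Hauptstr. 5", "zip": "14195"},
    {"id": "b2", "name": "Peter Braun", "street": "Lindenallee 3", "zip": "14195"},
    {"id": "b3", "name": "Anna Schmid", "street": "Gartenweg 12", "zip": "10115"},
]

for pair in link(file_a, file_b):
    print(pair)
```

With these toy files, blocking on `zip` leaves three candidate pairs instead of the six in the full cross product. The price is completeness: a true match whose blocking key is misspelled in one file is never compared, which is why the choice of blocking key matters in practice.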