STA 290 Seminar Series
DATE: Wednesday, November 15th, 11:00am
LOCATION: MSB 1147, Colloquium Room
SPEAKER: Brenda Betancourt, Postdoctoral Associate, Duke University
TITLE: “Microclustering Models for Record Linkage Tasks”
ABSTRACT: In the age of data, record linkage tasks are increasingly pervasive in many areas of application, such as public health, official statistics, and human rights. Record linkage techniques are needed to combine multiple sources of information that lack shared unique identifiers in order to remove duplicate entries. Automated approaches from a machine learning perspective are widespread in the record linkage literature, but there is a gap in principled, model-based approaches that allow for probabilistic matching of the records as well as propagation of error into subsequent analyses of the merged data. As a more principled alternative, we propose a Bayesian generative model based on clustering techniques. In contrast to other clustering applications, in record linkage the number of data points in each cluster should remain small even for large data sets. In this work, we introduce prior models that are well suited to this small-cluster problem, called microclustering, and that can be used as a default prior specification to perform record linkage in noisy databases. The proposed models are robust and flexible enough to allow for the introduction of prior information when available, while maintaining the microclustering property even as the size of the data increases.