## A comparative view on statistical matching

- Statistical matching offers a way to broaden the scope of analysis without increasing respondent burden and costs. These would result from conducting a new survey or adding variables to an existing one. Statistical matching aims at combining two datasets A and B referring to the same target population in order to analyse variables, say Y and Z, together, that initially were not jointly observed. The matching is performed based on matching variables X that correspond to common variables present in both datasets A and B. Furthermore, Y is only observed in B and Z is only observed in A. To overcome the fact that no joint information on X, Y and Z is available, statistical matching procedures have to rely on suitable assumptions. Therefore, to yield a theoretical foundation for statistical matching, most procedures rely on the conditional independence assumption (CIA), i.e. given X, Y is independent of Z. The goal of this thesis is to encompass both the statistical matching process and the analysis of the matched dataset. More specifically, the aim is to estimate a linear regression model for Z given Y and possibly other covariates in data A. Since the validity of the assumptions underlying the matching process determine the validity of the obtained matched file, the accuracy of statistical inference is determined by the suitability of the assumptions. By putting the focus on these assumptions, this work proposes a systematic categorisation of approaches to statistical matching by relying on graphical representations in form of directed acyclic graphs. These graphs are particularly useful in representing dependencies and independencies which are at the heart of the statistical matching problem. The proposed categorisation distinguishes between (a) joint modelling of the matching and the analysis (integrated approach), and (b) matching subsequently followed by statistical analysis of the matched dataset (classical approach). Whereas the classical approach relies on the CIA, implementations of the integrated approach are only valid if they converge, i.e. if the specified models are identifiable and, in the case of MCMC implementations, if the algorithm converges to a proper distribution. In this thesis an implementation of the integrated approach is proposed, where the imputation step and the estimation step are jointly modelled through a fully Bayesian MCMC estimation. It is based on a linear regression model for Z given Y and accounts for both a linear regression model and a random effects model for Y. Furthermore, it yields its validity when the instrumental variable assumption (IVA) holds. The IVA corresponds to: (a) Z is independent of a subset X’ of X given Y and X*, where X* = X\X’ and (b) Y is correlated with X’ given X*. The proof, that the joint Bayesian modelling of both the model for Z and the model for Y through an MCMC simulation converges to a proper distribution is provided in this thesis. In a first model-based simulation study, the proposed integrated Bayesian procedure is assessed with regard to the data situation, convergence issues, and underlying assumptions. Special interest lies in the investigation of the interplay of the Y and the Z model within the imputation process. It turns out that failure scenarios can be distinguished by comparing the CIA and the IVA in the completely observed dataset. Finally, both approaches to statistical matching, i.e. the classical approach and the integrated approach, are subject to an extensive comparison in (1) a model-based simulation study and (2) a simulation study based on the AMELIA dataset, which is an openly available very large synthetic dataset and, by construction, similar to the EU-SILC survey. As an additional integrated approach, a Bayesian additive regression trees (BART) model is considered for modelling Y. These integrated procedures are compared to the classical approach represented by predictive mean matching in the form of multiple imputations by chained equation. Suitably chosen, the first simulation framework offers the possibility to clarify aspects related to the underlying assumptions by comparing the IVA and the CIA and by evaluating the impact of the matching variables. Thus, within this simulation study two related aspects are of special interest: the assumptions underlying each method and the incorporation of additional matching variables. The simulation on the AMELIA dataset offers a close-to-reality framework with the advantage of knowing the whole setting, i.e. the whole data X, Y and Z. Special interest lies in investigating assumptions through adding and excluding auxiliary variables in order to enhance conditional independence and assess the sensitivity of the methods to this issue. Furthermore, the benefit of having an overlap of units in data A and B for which information on X, Y, Z is available is investigated. It turns out that the integrated approach yields better results than the classical approach when the CIA clearly does not hold. Moreover, even when the classical approach obtains unbiased results for the regression coefficient of Y in the model for Z, it is the method relying on BART that over all coefficients performs best. Concluding, this work constitutes a major contribution to the clarification of assumptions essential to any statistical matching procedure. By introducing graphical models to identify existing approaches to statistical matching combined with the subsequent analysis of the matched dataset, it offers an extensive overview, categorisation and extension of theory and application. Furthermore, in a setting where none of the assumptions are testable (since X, Y and Z are not observed together), the integrated approach is a valuable asset by offering an alternative to the CIA.

Author: | Lisa Borsi |
---|---|

URN: | urn:nbn:de:hbz:385-1-19028 |

DOI: | https://doi.org/10.25353/ubtr-xxxx-aa87-6a4b |

Document Type: | Doctoral Thesis |

Language: | English |

Date of completion: | 2022/07/28 |

Publishing institution: | Universität Trier |

Granting institution: | Universität Trier |

Date of final exam: | 2022/06/24 |

Release Date: | 2022/08/16 |

GND Keyword: | Amtliche Statistik; Matching; Regressionsmodell; Statistik; Umfrage |

Number of pages: | xv, 158 Blätter |

First page: | i |

Last page: | 158 |

Licence (German): | CC BY: Creative-Commons-Lizenz 4.0 International |