Filtern
Sprache
- Englisch (2) (entfernen)
Volltext vorhanden
- ja (2) (entfernen)
Schlagworte
- Matching (2) (entfernen)
Statistical matching offers a way to broaden the scope of analysis without increasing respondent burden and costs. These would result from conducting a new survey or adding variables to an existing one. Statistical matching aims at combining two datasets A and B referring to the same target population in order to analyse variables, say Y and Z, together, that initially were not jointly observed. The matching is performed based on matching variables X that correspond to common variables present in both datasets A and B. Furthermore, Y is only observed in B and Z is only observed in A. To overcome the fact that no joint information on X, Y and Z is available, statistical matching procedures have to rely on suitable assumptions. Therefore, to yield a theoretical foundation for statistical matching, most procedures rely on the conditional independence assumption (CIA), i.e. given X, Y is independent of Z.
The goal of this thesis is to encompass both the statistical matching process and the analysis of the matched dataset. More specifically, the aim is to estimate a linear regression model for Z given Y and possibly other covariates in data A. Since the validity of the assumptions underlying the matching process determine the validity of the obtained matched file, the accuracy of statistical inference is determined by the suitability of the assumptions. By putting the focus on these assumptions, this work proposes a systematic categorisation of approaches to statistical matching by relying on graphical representations in form of directed acyclic graphs. These graphs are particularly useful in representing dependencies and independencies which are at the heart of the statistical matching problem. The proposed categorisation distinguishes between (a) joint modelling of the matching and the analysis (integrated approach), and (b) matching subsequently followed by statistical analysis of the matched dataset (classical approach). Whereas the classical approach relies on the CIA, implementations of the integrated approach are only valid if they converge, i.e. if the specified models are identifiable and, in the case of MCMC implementations, if the algorithm converges to a proper distribution.
In this thesis an implementation of the integrated approach is proposed, where the imputation step and the estimation step are jointly modelled through a fully Bayesian MCMC estimation. It is based on a linear regression model for Z given Y and accounts for both a linear regression model and a random effects model for Y. Furthermore, it yields its validity when the instrumental variable assumption (IVA) holds. The IVA corresponds to: (a) Z is independent of a subset X’ of X given Y and X*, where X* = X\X’ and (b) Y is correlated with X’ given X*. The proof, that the joint Bayesian modelling of both the model for Z and the model for Y through an MCMC simulation converges to a proper distribution is provided in this thesis. In a first model-based simulation study, the proposed integrated Bayesian procedure is assessed with regard to the data situation, convergence issues, and underlying assumptions. Special interest lies in the investigation of the interplay of the Y and the Z model within the imputation process. It turns out that failure scenarios can be distinguished by comparing the CIA and the IVA in the completely observed dataset.
Finally, both approaches to statistical matching, i.e. the classical approach and the integrated approach, are subject to an extensive comparison in (1) a model-based simulation study and (2) a simulation study based on the AMELIA dataset, which is an openly available very large synthetic dataset and, by construction, similar to the EU-SILC survey. As an additional integrated approach, a Bayesian additive regression trees (BART) model is considered for modelling Y. These integrated procedures are compared to the classical approach represented by predictive mean matching in the form of multiple imputations by chained equation. Suitably chosen, the first simulation framework offers the possibility to clarify aspects related to the underlying assumptions by comparing the IVA and the CIA and by evaluating the impact of the matching variables. Thus, within this simulation study two related aspects are of special interest: the assumptions underlying each method and the incorporation of additional matching variables. The simulation on the AMELIA dataset offers a close-to-reality framework with the advantage of knowing the whole setting, i.e. the whole data X, Y and Z. Special interest lies in investigating assumptions through adding and excluding auxiliary variables in order to enhance conditional independence and assess the sensitivity of the methods to this issue. Furthermore, the benefit of having an overlap of units in data A and B for which information on X, Y, Z is available is investigated. It turns out that the integrated approach yields better results than the classical approach when the CIA clearly does not hold. Moreover, even when the classical approach obtains unbiased results for the regression coefficient of Y in the model for Z, it is the method relying on BART that over all coefficients performs best.
Concluding, this work constitutes a major contribution to the clarification of assumptions essential to any statistical matching procedure. By introducing graphical models to identify existing approaches to statistical matching combined with the subsequent analysis of the matched dataset, it offers an extensive overview, categorisation and extension of theory and application. Furthermore, in a setting where none of the assumptions are testable (since X, Y and Z are not observed together), the integrated approach is a valuable asset by offering an alternative to the CIA.
Matching problems with additional resource constraints are generalizations of the classical matching problem. The focus of this work is on matching problems with two types of additional resource constraints: The couple constrained matching problem and the level constrained matching problem. The first one is a matching problem which has imposed a set of additional equality constraints. Each constraint demands that for a given pair of edges either both edges are in the matching or none of them is in the matching. The second one is a matching problem which has imposed a single equality constraint. This constraint demands that an exact number of edges in the matching are so-called on-level edges. In a bipartite graph with fixed indices of the nodes, these are the edges with end-nodes that have the same index. As a central result concerning the couple constrained matching problem we prove that this problem is NP-hard, even on bipartite cycle graphs. Concerning the complexity of the level constrained perfect matching problem we show that it is polynomially equivalent to three other combinatorial optimization problems from the literature. For different combinations of fixed and variable parameters of one of these problems, the restricted perfect matching problem, we investigate their effect on the complexity of the problem. Further, the complexity of the assignment problem with an additional equality constraint is investigated. In a central part of this work we bring couple constraints into connection with a level constraint. We introduce the couple and level constrained matching problem with on-level couples, which is a matching problem with a special case of couple constraints together with a level constraint imposed on it. We prove that the decision version of this problem is NP-complete. This shows that the level constraint can be sufficient for making a polynomially solvable problem NP-hard when being imposed on that problem. This work also deals with the polyhedral structure of resource constrained matching problems. For the polytope corresponding to the relaxation of the level constrained perfect matching problem we develop a characterization of its non-integral vertices. We prove that for any given non-integral vertex of the polytope a corresponding inequality which separates this vertex from the convex hull of integral points can be found in polynomial time. Regarding the calculation of solutions of resource constrained matching problems, two new algorithms are presented. We develop a polynomial approximation algorithm for the level constrained matching problem on level graphs, which returns solutions whose size is at most one less than the size of an optimal solution. We then describe the Objective Branching Algorithm, a new algorithm for exactly solving the perfect matching problem with an additional equality constraint. The algorithm makes use of the fact that the weighted perfect matching problem without an additional side constraint is polynomially solvable. In the Appendix, experimental results of an implementation of the Objective Branching Algorithm are listed.