The Eurosystem's Household Finance and Consumption Survey (HFCS) collects micro data on private households' balance sheets, income and consumption. It is a stylised fact that wealth is unequally distributed and that the wealthiest own a large share of total wealth. For sample surveys that aim to measure wealth and its distribution, this is a considerable problem. To overcome it, some of the country surveys under the HFCS umbrella try to sample a disproportionately large share of households that are likely to be wealthy, a technique referred to as oversampling. Ignoring such complex survey designs when estimating regression models can lead to severe problems. This thesis first illustrates these problems using data from the first wave of the HFCS and canonical regression models from the field of household finance. It then gives HFCS data users a first guideline on the use of replicate weight sets for variance estimation with a variant of the bootstrap. A further investigation of the issue necessitates a design-based Monte Carlo simulation study. To this end, the existing large, close-to-reality synthetic simulation population AMELIA is extended with synthetic wealth data. We discuss different approaches to the generation of synthetic micro data in the context of extending a synthetic simulation population that was originally based on a different data source. We propose an additional approach that is suitable for generating highly skewed synthetic micro data in such a setting using a multiply-imputed survey data set. After a description of the survey designs employed in the first wave of the HFCS, we construct new survey designs for AMELIA that share core features of the HFCS survey designs.
A design-based Monte Carlo simulation study shows that while more conservative approaches to oversampling do not pose problems for the estimation of regression models if sampling weights are properly accounted for, the same does not necessarily hold for more extreme oversampling approaches. This issue should be further analysed in future research.
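The role of replicate weights in variance estimation can be illustrated with a minimal sketch. It is not the procedure from the thesis or the official HFCS recipe: the function names, the toy data, the stand-in replicate weights (final weight times multinomial resampling counts) and the convention of averaging squared deviations around the full-sample estimate are all assumptions made for illustration.

```python
import numpy as np


def wls_coefficients(X, y, w):
    """Weighted least squares: minimise sum_i w_i * (y_i - x_i' beta)^2."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def replicate_weight_se(X, y, final_w, replicate_w):
    """Point estimate from the final weights; standard errors from the
    spread of the replicate estimates around that point estimate."""
    beta = wls_coefficients(X, y, final_w)
    reps = np.stack([
        wls_coefficients(X, y, replicate_w[:, r])
        for r in range(replicate_w.shape[1])
    ])
    variance = np.mean((reps - beta) ** 2, axis=0)  # assumed convention
    return beta, np.sqrt(variance)


# Hypothetical toy data, not HFCS data.
rng = np.random.default_rng(0)
n, n_rep = 200, 100
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])            # intercept and one regressor
y = 1.0 + 2.0 * x + rng.normal(size=n)
final_w = rng.uniform(1.0, 5.0, size=n)         # stand-in sampling weights

# Stand-in replicate weights: final weight times bootstrap resampling counts.
counts = rng.multinomial(n, np.full(n, 1.0 / n), size=n_rep).T  # (n, n_rep)
replicate_w = final_w[:, None] * counts

beta, se = replicate_weight_se(X, y, final_w, replicate_w)
print("coefficients:", beta)
print("replicate-weight standard errors:", se)
```

With real survey data, the replicate weight columns would come from the data provider rather than from the resampling step above, and the scaling of the variance formula should follow the survey's documentation.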
We study planned changes in protective routines after the COVID-19 pandemic: in a survey in Germany among more than 650 respondents, we find that the majority plans to use face masks in certain situations even after the end of the pandemic. We observe that this willingness is strongly related to the perception that there is something to be learned from East Asians’ handling of pandemics, even when controlling for the perceived protection offered by wearing masks. Given strong empirical evidence that face masks help prevent the spread of respiratory diseases, and given the considerable estimated health and economic costs of such diseases even before COVID-19, this would be a very positive side effect of the current crisis.
Statistical matching offers a way to broaden the scope of analysis without the additional respondent burden and costs that would result from conducting a new survey or adding variables to an existing one. Statistical matching aims at combining two datasets A and B that refer to the same target population in order to jointly analyse variables, say Y and Z, that were not initially observed together. The matching is based on matching variables X, i.e. common variables present in both datasets A and B, whereas Y is observed only in B and Z only in A. Because no joint information on X, Y and Z is available, statistical matching procedures have to rely on suitable assumptions. To provide a theoretical foundation, most procedures rely on the conditional independence assumption (CIA), i.e. given X, Y is independent of Z.
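The data situation described above can be sketched with a toy example. The code below uses nearest-neighbour distance hot deck, a common classical matching technique chosen here purely for illustration; it is not claimed to be the method used in the thesis. The function name and the toy arrays are hypothetical. Copying Y from the donor closest in X only yields a meaningful joint file if the CIA is plausible.

```python
import numpy as np


def nearest_neighbour_hot_deck(x_recipient, x_donor, y_donor):
    """For every recipient row, copy Y from the donor row whose X is
    closest in squared Euclidean distance."""
    diff = x_recipient[:, None, :] - x_donor[None, :, :]
    dist = np.sum(diff ** 2, axis=2)        # shape (n_recipient, n_donor)
    donor_index = np.argmin(dist, axis=1)   # closest donor per recipient
    return y_donor[donor_index]


# Hypothetical toy files: A holds (X, Z), B holds (X, Y).
x_a = np.array([[0.0], [10.0]])   # X in file A (recipient, Y missing)
x_b = np.array([[9.0], [1.0]])    # X in file B (donor, Y observed)
y_b = np.array([5.0, 7.0])        # Y in file B

y_imputed = nearest_neighbour_hot_deck(x_a, x_b, y_b)
print(y_imputed)  # [7. 5.]
```

After this step, file A contains X, Z and an imputed Y, so a model for Z given Y can be fitted, with the caveat that its validity rests entirely on the matching assumption.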
The goal of this thesis is to encompass both the statistical matching process and the analysis of the matched dataset. More specifically, the aim is to estimate a linear regression model for Z given Y and possibly other covariates in data A. Since the validity of the assumptions underlying the matching process determines the validity of the obtained matched file, the accuracy of statistical inference depends on the suitability of these assumptions. By putting the focus on these assumptions, this work proposes a systematic categorisation of approaches to statistical matching that relies on graphical representations in the form of directed acyclic graphs. These graphs are particularly useful for representing the dependencies and independencies that are at the heart of the statistical matching problem. The proposed categorisation distinguishes between (a) joint modelling of the matching and the analysis (integrated approach), and (b) matching followed by statistical analysis of the matched dataset (classical approach). Whereas the classical approach relies on the CIA, implementations of the integrated approach are only valid if they converge, i.e. if the specified models are identifiable and, in the case of MCMC implementations, if the algorithm converges to a proper distribution.
In this thesis, an implementation of the integrated approach is proposed in which the imputation step and the estimation step are jointly modelled through a fully Bayesian MCMC estimation. It is based on a linear regression model for Z given Y and accounts for both a linear regression model and a random effects model for Y. Furthermore, it is valid when the instrumental variable assumption (IVA) holds. The IVA states that (a) Z is independent of a subset X’ of X given Y and X*, where X* = X\X’, and (b) Y is correlated with X’ given X*. A proof that the joint Bayesian modelling of both the model for Z and the model for Y through an MCMC simulation converges to a proper distribution is provided in this thesis. In a first model-based simulation study, the proposed integrated Bayesian procedure is assessed with regard to the data situation, convergence issues, and underlying assumptions. Special interest lies in investigating the interplay of the Y and the Z model within the imputation process. It turns out that failure scenarios can be distinguished by comparing the CIA and the IVA in the completely observed dataset.
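For reference, the two assumptions can be written side by side. The notation below is a sketch and not taken from the thesis: the independence symbol, the density f and the use of a non-zero conditional covariance to formalise "correlated" are assumptions made here. The block requires the amsmath package.

```latex
\begin{align*}
\text{CIA:}\quad & Y \perp\!\!\!\perp Z \mid X
  \quad\Longleftrightarrow\quad
  f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x) \\
\text{IVA (a):}\quad & Z \perp\!\!\!\perp X' \mid (Y, X^{*}),
  \qquad X^{*} = X \setminus X' \\
\text{IVA (b):}\quad & \operatorname{Cov}(Y, X' \mid X^{*}) \neq 0
\end{align*}
```

Under the CIA, the joint distribution is fully determined by the two observable pieces (Y, X) and (Z, X). Under the IVA, X’ plays the role of an instrument: it is related to Y but affects Z only through Y and X*.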
Finally, both approaches to statistical matching, i.e. the classical approach and the integrated approach, are subjected to an extensive comparison in (1) a model-based simulation study and (2) a simulation study based on the AMELIA dataset, which is an openly available, very large synthetic dataset and, by construction, similar to the EU-SILC survey. As an additional integrated approach, a Bayesian additive regression trees (BART) model is considered for modelling Y. These integrated procedures are compared to the classical approach, represented by predictive mean matching in the form of multiple imputation by chained equations. If suitably chosen, the first simulation framework makes it possible to clarify aspects related to the underlying assumptions by comparing the IVA and the CIA and by evaluating the impact of the matching variables. Thus, within this simulation study two related aspects are of special interest: the assumptions underlying each method and the incorporation of additional matching variables. The simulation on the AMELIA dataset offers a close-to-reality framework with the advantage of knowing the whole setting, i.e. the complete data on X, Y and Z. Special interest lies in investigating the assumptions by adding and excluding auxiliary variables in order to enhance conditional independence and to assess the sensitivity of the methods to this issue. Furthermore, the benefit of having an overlap of units in data A and B for which information on X, Y and Z is available is investigated. It turns out that the integrated approach yields better results than the classical approach when the CIA clearly does not hold. Moreover, even when the classical approach obtains unbiased results for the regression coefficient of Y in the model for Z, the method relying on BART performs best across all coefficients.
In conclusion, this work constitutes a major contribution to the clarification of the assumptions essential to any statistical matching procedure. By introducing graphical models to identify existing approaches to statistical matching combined with the subsequent analysis of the matched dataset, it offers an extensive overview, categorisation and extension of theory and application. Furthermore, in a setting where none of the assumptions are testable (since X, Y and Z are not observed together), the integrated approach is a valuable asset because it offers an alternative to the CIA.