310 Collections of general statistics
Official business surveys form the basis for national and regional business statistics and are thus of great importance for analysing the state and performance of the economy. However, both the heterogeneity of business data and their high dynamics pose a particular challenge to the feasibility of sampling and the quality of the resulting estimates. A widely used sampling frame for creating the design of an official business survey is an extract from an official business register. However, if this frame does not accurately represent the target population, frame errors arise. Amplified by the heterogeneity and dynamics of business populations, these errors can significantly affect the estimation quality and lead to inefficiencies and biases. This dissertation therefore deals with design-based methods for optimising business surveys with respect to different types of frame errors.
First, methods for adjusting the sampling design of business surveys are addressed. These approaches integrate auxiliary information about the expected structures of frame errors into the sampling design. The aim is to increase the number of sampled businesses that are subject to frame errors. The element-specific frame error probability is estimated based on auxiliary information about frame errors observed in previous samples. The approaches discussed consider different types of frame errors and can be incorporated into predefined designs with fixed strata.
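As a rough illustration of this idea, the following minimal Python sketch estimates element-specific frame-error probabilities from a previous sample via logistic regression and then inflates a given stratum allocation in proportion to the expected share of error-prone units. All data, variable names and the simple inflation rule are hypothetical assumptions for the sketch; the adjustment methods developed in the dissertation are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Auxiliary data from a previous survey wave (illustrative):
# X_prev: register variables per unit (e.g. log turnover, employees)
# err_prev: 1 if the unit turned out to be affected by a frame error
n_prev = 5_000
X_prev = rng.normal(size=(n_prev, 2))
err_prev = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 1.2 * X_prev[:, 0]))))

# Estimate element-specific frame-error probabilities from the previous wave.
model = LogisticRegression().fit(X_prev, err_prev)

# Current frame with predefined, fixed strata (here built from one covariate).
n_frame = 20_000
X_frame = rng.normal(size=(n_frame, 2))
stratum = np.digitize(X_frame[:, 0], bins=[-1.0, 0.0, 1.0])   # strata 0..3
p_err = model.predict_proba(X_frame)[:, 1]                    # expected error prob.

# Inflate a given base allocation so that strata with many expected frame
# errors receive more sample; the total sample size is kept fixed.
n_total = 1_000
base_alloc = np.array([250, 250, 250, 250], dtype=float)
expected_errors = np.array([p_err[stratum == h].mean() for h in range(4)])
adjusted = base_alloc * (1 + expected_errors)
adjusted = adjusted / adjusted.sum() * n_total                 # rescale to n_total
print(np.round(adjusted))
```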
As the second main pillar of this work, methods for adjusting weights to correct for frame errors during estimation are developed and investigated. As a result of frame errors, the assumptions under which the original design weights were determined based on the sampling design no longer hold. The developed methods correct the design weights taking into account the errors identified for sampled elements. Case-number-based reweighting approaches, on the one hand, attempt to reconstruct the unknown size of the individual strata in the target population. In the context of weight smoothing methods, on the other hand, design weights are modelled and smoothed as a function of target or auxiliary variables. This serves to avoid inefficiencies in the estimation due to strongly dispersed weights or weak correlations between weights and target variables. In addition, possibilities of correcting frame errors by calibration weighting are elaborated. Especially when the sampling frame shows over- and/or undercoverage, the inclusion of external auxiliary information can provide a significant improvement in estimation quality. For those methods whose quality cannot be measured using standard procedures, a procedure for estimating the variance based on a rescaling bootstrap is proposed. This enables an assessment of the estimation quality when using the methods in practice.
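The case-number-based idea can be sketched in a few lines: the unknown sizes of the true strata are estimated from the sampled units whose frame errors were identified during data collection, and the estimated sizes are then redistributed over the units observed in each stratum. The toy data and the simple equal-weight redistribution below are purely illustrative assumptions, not the estimators developed in the dissertation.

```python
import numpy as np
import pandas as pd

# Illustrative sample of businesses: the stratum assigned from the frame, the
# stratum the unit actually belongs to (observed during data collection), and
# the original design weight. Deviations between the two stratum variables
# indicate frame errors ("stratum jumpers").
sample = pd.DataFrame({
    "frame_stratum": [1, 1, 1, 1, 2, 2, 2, 2],
    "true_stratum":  [1, 1, 2, 1, 2, 2, 1, 2],
    "d_weight":      [50.0, 50.0, 50.0, 50.0, 80.0, 80.0, 80.0, 80.0],
})

# (1) Estimate the unknown sizes of the true strata by projecting the design
#     weights onto the observed stratum classification.
N_hat = sample.groupby("true_stratum")["d_weight"].sum()

# (2) Redistribute each estimated stratum size equally over the sampled units
#     observed in that stratum.
n_g = sample.groupby("true_stratum")["d_weight"].transform("size")
sample["w_corrected"] = sample["true_stratum"].map(N_hat) / n_g
print(sample)
```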
In the context of two extensive simulation studies, the methods presented in this dissertation are evaluated and compared with each other. First, in the environment of an experimental simulation, it is assessed which approaches are particularly suitable with regard to different data situations. In a second simulation study, which is based on the structural survey in the services sector, the applicability of the methods in practice is evaluated under realistic conditions.
The Eurosystem's Household Finance and Consumption Survey (HFCS) collects micro data on private households' balance sheets, income and consumption. It is a stylised fact that wealth is unequally distributed and that the wealthiest own a large share of total wealth. For sample surveys which aim at measuring wealth and its distribution, this is a considerable problem. To overcome it, some of the country surveys under the HFCS umbrella try to sample a disproportionately large share of households that are likely to be wealthy, a technique referred to as oversampling. Ignoring such types of complex survey designs in the estimation of regression models can lead to severe problems. This thesis first illustrates such problems using data from the first wave of the HFCS and canonical regression models from the field of household finance and gives a first guideline for HFCS data users regarding the use of replicate weight sets for variance estimation using a variant of the bootstrap. A further investigation of the issue necessitates a design-based Monte Carlo simulation study. To this end, the already existing large close-to-reality synthetic simulation population AMELIA is extended with synthetic wealth data. We discuss different approaches to the generation of synthetic micro data in the context of the extension of a synthetic simulation population that was originally based on a different data source. We propose an additional approach that is suitable for the generation of highly skewed synthetic micro data in such a setting using a multiply-imputed survey data set. After a description of the survey designs employed in the first wave of the HFCS, we then construct new survey designs for AMELIA that share core features of the HFCS survey designs. A design-based Monte Carlo simulation study shows that while more conservative approaches to oversampling do not pose problems for the estimation of regression models if sampling weights are properly accounted for, the same does not necessarily hold for more extreme oversampling approaches. This issue should be further analysed in future research.
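As a generic illustration of variance estimation with replicate weights, the sketch below computes a bootstrap variance of a weighted mean from a matrix of replicate weights. The replicate weights and variable names here are random placeholders invented for the sketch; the HFCS user database ships its own replicate weights, and the combination formula to use with them is prescribed in the HFCS documentation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative data: net wealth for n households, a final survey weight, and
# R bootstrap replicate weights (placeholders; real HFCS files differ).
n, R = 500, 1000
wealth = np.exp(rng.normal(11, 1.5, size=n))                   # heavily skewed outcome
w_final = rng.uniform(500, 5000, size=n)                       # final survey weights
w_rep = w_final[:, None] * rng.exponential(1.0, size=(n, R))   # placeholder replicates

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

theta_hat = weighted_mean(wealth, w_final)
theta_rep = np.array([weighted_mean(wealth, w_rep[:, r]) for r in range(R)])

# Bootstrap variance: dispersion of the replicate estimates around the
# full-sample estimate (one common convention).
var_boot = np.mean((theta_rep - theta_hat) ** 2)
print(theta_hat, np.sqrt(var_boot))
```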
This dissertation deals with consistent estimation in household surveys. Household surveys are often drawn via cluster sampling, with households sampled at the first stage and persons selected at the second stage. The collected data provide information for estimation at both the person and the household level. Consistent estimates are desirable in the sense that the estimated household-level totals should coincide with the estimated totals obtained at the person level. Current practice in statistical offices is to use integrated weighting. In this approach, consistent estimates are guaranteed by assigning equal weights to all persons within a household and to the household itself. However, due to the forced equality of weights, the individual patterns of persons are lost and the heterogeneity within households is not taken into account. In order to avoid the negative consequences of integrated weighting, we propose alternative weighting methods in the first part of this dissertation that ensure both consistent estimates and individual person weights within a household. The underlying idea is to limit the consistency conditions to variables that appear in both the person-level and household-level data sets. These common variables are included in the person- and household-level estimators as additional auxiliary variables. This achieves consistency more directly and only for the relevant variables, rather than indirectly by forcing equal weights on all persons within a household. Further decisive advantages of the proposed alternative weighting methods are that original individual auxiliary variables rather than constructed aggregated ones are utilized, and that the variable selection process is more flexible because different auxiliary variables can be incorporated in the person-level estimator than in the household-level estimator.
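The mechanism by which consistency is limited to common variables can be sketched with a simple linear (GREG-type) calibration: the person-level and the household-level weights are both calibrated to the same benchmark total of a common variable, so the two estimated totals coincide by construction without forcing equal weights within a household. The data generation, the benchmark value and the way the common variable is spread to persons are illustrative assumptions, not the alternative weighting methods of the dissertation.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear (GREG-type) calibration: w = d + D X (X'DX)^(-1) (t - X'd)."""
    Xd = X * d[:, None]
    lam = np.linalg.solve(X.T @ Xd, totals - X.T @ d)
    return d + Xd @ lam

rng = np.random.default_rng(3)

# Two-level toy data: persons nested in households. The "common variable" is
# one that exists in both data sets (e.g. persons in a certain age group).
n_hh = 200
hh_size = rng.integers(1, 5, size=n_hh)
d_hh = np.full(n_hh, 50.0)                           # household design weights
common_hh = rng.integers(0, 3, size=n_hh)            # common variable, hh level

d_p = np.repeat(d_hh, hh_size)                       # person design weights
common_p = np.repeat(common_hh / hh_size, hh_size)   # spread to persons so that
                                                     # within-household sums match

t_common = 9_000.0                                   # known benchmark (e.g. register)

# Calibrate both estimators to the same benchmark of the common variable.
X_hh = np.column_stack([np.ones(n_hh), common_hh])
X_p = np.column_stack([np.ones(d_p.size), common_p])
w_hh = linear_calibration(d_hh, X_hh, np.array([10_000.0, t_common]))
w_p = linear_calibration(d_p, X_p, np.array([d_p.sum(), t_common]))

print(w_hh @ common_hh, w_p @ common_p)              # identical benchmark totals
```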
In the second part of this dissertation, the variances of a person-level GREG estimator and an integrated estimator are compared in order to quantify the effects of the consistency requirements in the integrated weighting approach. One of the challenges is that the estimators to be compared are of different dimensions. The proposed solution is to decompose the variance of the integrated estimator into the variance of a reduced GREG estimator, whose underlying model is of the same dimensions as the person-level GREG estimator, and add a constructed term that captures the effects disregarded by the reduced model. Subsequently, further fields of application for the derived decomposition are proposed such as the variable selection process in the field of econometrics or survey statistics.
The dissertation deals with methods to improve design-based and model-assisted estimation techniques for surveys in a finite population framework. The focus is on the development of the statistical methodology as well as on its implementation by means of tailor-made numerical optimization strategies. In that regard, the developed methods aim at computing statistics for several potentially conflicting variables of interest at aggregated and disaggregated levels of the population on the basis of a single survey. The work can be divided into two main research questions, which are briefly explained in the following.
First, an optimal multivariate allocation method is developed taking into account several stratification levels. This approach results in a multi-objective optimization problem due to the simultaneous consideration of several variables of interest. In preparation for the numerical solution, several scalarization and standardization techniques are presented, which represent the different preferences of potential users. In addition, it is shown that by solving the problem scalarized with a weighted sum for all combinations of weights, the entire Pareto frontier of the original problem can be generated. By exploiting the special structure of the problem, the scalarized problems can be efficiently solved by a semismooth Newton method. In order to apply this numerical method to other scalarization techniques as well, an alternative approach is suggested, which traces the problem back to the weighted sum case. To address regional estimation quality requirements at multiple stratification levels, the potential use of upper bounds for regional variances is integrated into the method. In addition to restrictions on regional estimates, the method enables the consideration of box-constraints for the stratum-specific sample sizes, allowing minimum and maximum stratum-specific sampling fractions to be defined.
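For the weighted-sum scalarization, a closed-form generalized Neyman solution exists when the objective is a weighted sum of stratified variances under a fixed total sample size. The sketch below uses this closed form instead of the semismooth Newton method described above, and it enforces box constraints only by a crude clip-and-rescale heuristic; all inputs are invented for illustration.

```python
import numpy as np

# Illustrative inputs for H strata and two variables of interest:
# stratum sizes N_h and stratum-specific standard deviations S_hj.
N = np.array([5_000, 20_000, 60_000, 15_000], dtype=float)
S = np.array([[120.0, 0.8],
              [ 60.0, 1.5],
              [ 15.0, 2.0],
              [200.0, 0.4]])
n_total = 2_000

def weighted_sum_allocation(N, S, weights, n_total, n_min=2, n_max=None):
    """Minimise sum_j weights_j * sum_h N_h^2 S_hj^2 / n_h  s.t.  sum_h n_h = n_total.

    Closed-form generalised Neyman solution of the weighted-sum scalarisation;
    box constraints are handled by a simple clip-and-rescale heuristic only.
    """
    score = N * np.sqrt(S ** 2 @ weights)          # N_h * sqrt(sum_j w_j S_hj^2)
    n_h = n_total * score / score.sum()
    if n_max is None:
        n_max = N
    for _ in range(10):                            # crude fixed-point clipping
        n_h = np.clip(n_h, n_min, n_max)
        n_h = n_h * n_total / n_h.sum()
    return n_h

# Scanning the preference weights traces (an approximation of) the Pareto
# frontier between the two variables of interest.
for w1 in (0.1, 0.5, 0.9):
    alloc = weighted_sum_allocation(N, S, np.array([w1, 1 - w1]), n_total)
    print(np.round(alloc))
```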
In addition to the allocation method, a generalized calibration method is developed that aims to achieve coherent and efficient estimates at different stratification levels. The developed calibration method takes into account a very large number of benchmarks at different stratification levels, which may be obtained from different sources such as registers, paradata or other surveys using different estimation techniques. In order to account for the heterogeneous quality and the multitude of benchmarks, a relaxation of selected benchmarks is proposed. In that regard, predefined tolerances are assigned to problematic benchmarks at low aggregation levels so that they do not have to be met exactly. In addition, the generalized calibration method allows the use of box-constraints for the correction weights in order to avoid an extremely high variation of the weights. Furthermore, variance estimation by means of a rescaling bootstrap is presented.
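One standard way to relax selected benchmarks is a penalised (ridge-type) calibration in which every benchmark receives a tolerance, zero for benchmarks that must be met exactly and positive for relaxed ones. The sketch below implements this generic device on invented data; it is not the generalised calibration method developed in the thesis, and box constraints on the weights are omitted.

```python
import numpy as np

def relaxed_calibration(d, X, totals, tol):
    """Penalised (ridge-type) calibration.

    tol[k] = 0 forces benchmark k to be met exactly; tol[k] > 0 relaxes it,
    with larger values allowing larger deviations from the benchmark.
    Solves w = d + D X (X'DX + C)^(-1) (t - X'd) with C = diag(tol).
    """
    Xd = X * d[:, None]
    A = X.T @ Xd + np.diag(tol)
    lam = np.linalg.solve(A, totals - X.T @ d)
    return d + Xd @ lam

rng = np.random.default_rng(5)
n = 1_000
d = np.full(n, 100.0)

# Benchmarks at two aggregation levels: one reliable national total (exact)
# and three noisier regional totals (relaxed).
region = rng.integers(0, 3, size=n)
X = np.column_stack([np.ones(n)] + [(region == r).astype(float) for r in range(3)])
totals = np.array([100_000.0, 36_000.0, 31_000.0, 33_000.0])
tol = np.array([0.0, 500.0, 500.0, 500.0])

w = relaxed_calibration(d, X, totals, tol)
print(X.T @ w)   # national total met exactly, regional ones only approximately
```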
Both developed methods are analyzed and compared with existing methods in extensive simulation studies on the basis of a realistic synthetic data set of all households in Germany. Due to the similar requirements and objectives, both methods can be successively applied to a single survey in order to combine their efficiency advantages. In addition, both methods can be solved in a time-efficient manner using very comparable optimization approaches. These are based on transformations of the optimality conditions. The dimension of the resulting system of equations is ultimately independent of the dimension of the original problem, which enables the application even for very large problem instances.
Surveys are commonly tailored to produce estimates of aggregate statistics with a desired level of precision. This may lead to very small sample sizes for subpopulations of interest, defined geographically or by content, which are not incorporated into the survey design. We refer to subpopulations where the sample size is too small to provide direct estimates with adequate precision as small areas or small domains. Despite the small sample sizes, reliable small area estimates are needed for economic and political decision making. Hence, model-based estimation techniques are used which increase the effective sample size by borrowing strength from other areas to provide accurate information for small areas.
The paragraph above introduced small area estimation as a field of survey statistics where two conflicting philosophies of statistical inference meet: the design-based and the model-based approach. While the first approach is well suited for the precise estimation of aggregate statistics, the latter furnishes reliable small area estimates. In most applications, estimates for both large and small domains based on the same sample are needed. This poses a challenge to the survey planner, as the sampling design has to reflect different and potentially conflicting requirements simultaneously. In order to enable efficient design-based estimates for large domains, the sampling design should incorporate information related to the variables of interest. This may be achieved using stratification or sampling with unequal probabilities. Many model-based small area techniques require an ignorable sampling design such that, after conditioning on the covariates, the variable of interest does not contain further information about the sample membership. If this condition is not fulfilled, biased model-based estimates may result, as the model which holds for the sample differs from the one valid for the population. Hence, an optimisation of the sampling design without investigating the implications for model-based approaches will not be sufficient. Analogously, disregarding the design altogether and focussing only on the model is prone to failure as well. Instead, a profound knowledge of the interplay between the sampling design and statistical modelling is a prerequisite for implementing an effective small area estimation strategy.
In this work, we concentrate on two approaches to address this conflict. Our first approach takes the sampling design as given and can be used after the sample has been collected. It amounts to incorporating the survey design into the small area model to avoid biases stemming from informative sampling. Thus, once a model is validated for the sample, we know that it holds for the population as well. We derive such a procedure under a lognormal mixed model, which is a popular choice when the support of the dependent variable is limited to positive values. In addition, we propose a three-pillar strategy to select the additional variable accounting for the design, based on a graphical examination of the relationship, a comparison of the predictive accuracy of the choices and a check of the normality assumptions.
Our second approach to deal with the conflict is based on the notion that the design should allow applying a wide variety of analyses using the sample data. Thus, if the use of model-based estimation strategies can be anticipated before the sample is drawn, this should be reflected in the design.
The same applies for the estimation of national statistics using design-based approaches. Therefore, we propose to construct the design such that the sampling mechanism is non-informative but allows for precise design-based estimates at an aggregate level.
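A generic way to account for an informative design, in the spirit of the first approach, is to include a design variable such as the log inclusion probability as an additional covariate in a mixed model on the log scale. The sketch below does this for simulated data with statsmodels; the simulated population, the Poisson-type informative design and the variable names are assumptions made for the illustration, and the concrete lognormal mixed model and three-pillar selection strategy of the work are not reproduced.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

# Finite toy population with positive outcome y, covariate x and area effects.
N, n_area = 50_000, 40
area = rng.integers(0, n_area, size=N)
x = rng.normal(size=N)
u = rng.normal(scale=0.3, size=n_area)
y = np.exp(1.0 + 0.5 * x + u[area] + rng.normal(scale=0.5, size=N))

# Informative design: inclusion probabilities increase with y itself.
pi = np.clip(2_000 * y / y.sum(), 0, 1)
s = rng.random(N) < pi                               # Poisson sampling

# Lognormal mixed model augmented with the design variable log(pi): under an
# informative design its coefficient is clearly non-zero, and omitting it
# would bias the model fitted on the sample relative to the population model.
df = pd.DataFrame({"logy": np.log(y[s]), "x": x[s],
                   "area": area[s], "logpi": np.log(pi[s])})
fit = smf.mixedlm("logy ~ x + logpi", df, groups=df["area"]).fit()
print(fit.params)
```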
On the influence of transformations of skewed distributions on the analysis with imputed data
(2015)
The correct treatment of missing data in empirical studies plays an increasingly important role in applied quantitative research. As a central and flexible instrument, multiple imputation was developed by Rubin (1987); under regular conditions it allows valid inference for the actual estimates of interest. A number of imputation methods rely essentially on the assumption of normally distributed data. In empirical work, this normality assumption is increasingly criticised: variables with very skewed distributions in particular prove problematic for imputation. This thesis focuses on the correct treatment of missing values with the aim of valid inference for the actual estimates. One instrument is the transformation of skewed distributions, so that imputations can be carried out under regular conditions using the transformed and approximately normally distributed data. The thesis introduces a multivariate approach. Several Monte Carlo simulation studies then show that the new approach dominates established procedures and that the transformation has a positive effect on the analysis with imputed data.
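The basic transform-impute-backtransform idea can be sketched as follows: a skewed variable is log-transformed, missing values are drawn from a normal regression model fitted on the transformed scale, and the imputed values are transformed back. The sketch is univariate and ignores parameter uncertainty, so it is a simplification of both proper multiple imputation and the multivariate approach introduced in the thesis; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(23)

# Skewed variable y with values missing at random and a fully observed
# covariate x correlated with y.
n = 2_000
x = rng.normal(size=n)
y = np.exp(1.0 + 0.8 * x + rng.normal(scale=0.6, size=n))      # lognormal, skewed
miss = rng.random(n) < 0.3
y_obs = np.where(miss, np.nan, y)

def impute_on_log_scale(y_obs, x, rng):
    """One imputation: regress log(y) on x among observed cases, draw the
    missing values from the fitted normal model, back-transform with exp().
    (Parameter uncertainty is ignored here, so this is not yet a 'proper'
    multiple imputation in Rubin's sense.)"""
    obs = ~np.isnan(y_obs)
    z = np.log(y_obs[obs])
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    sigma = np.std(z - X @ beta, ddof=2)
    z_mis = beta[0] + beta[1] * x[~obs] + rng.normal(scale=sigma, size=(~obs).sum())
    y_imp = y_obs.copy()
    y_imp[~obs] = np.exp(z_mis)
    return y_imp

# M completed data sets; analyses are run on each and combined with Rubin's rules.
M = 5
completed = [impute_on_log_scale(y_obs, x, rng) for _ in range(M)]
print([round(c.mean(), 1) for c in completed])
```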
In politics and economics, and thus in official statistics, the precise estimation of indicators for small regions or parts of populations, the so-called small areas or domains, is discussed intensively. The design-based estimation methods currently used are mainly based on asymptotic properties and are therefore reliable for large sample sizes. For small sample sizes, however, these design-based considerations often do not apply, which is why special model-based estimation methods, the small area methods, have been developed for this case. While these may be biased, they often have a smaller mean squared error (MSE) than the unbiased design-based estimators. In this work both classical design-based and model-based estimation methods are presented and compared. The focus lies on the suitability of the various methods for use in official statistics. First, theory and algorithms for the required statistical models are presented, which form the basis for the subsequent model-based estimators. Sampling designs suitable for small area applications are then presented. Based on these fundamentals, both design-based estimators and model-based estimation methods are developed. Particular consideration is given to the area-level empirical best predictor for binomial variables. Numerical and Monte Carlo estimation methods are proposed and compared for this analytically intractable estimator. Furthermore, MSE estimation methods are proposed and compared. A very popular and flexible resampling method that is widely used in small area statistics is the parametric bootstrap. One major drawback of this method is its high computational cost. To mitigate this disadvantage, a variance reduction method for the parametric bootstrap is proposed. Theoretical considerations prove the great potential of this proposal, and a Monte Carlo simulation study shows the substantial variance reduction, of up to 90%, that can be achieved with this method in realistic scenarios. This makes the parametric bootstrap feasible in applications in official statistics. Finally, the presented estimation methods are examined in a large Monte Carlo simulation study based on a specific application, the Swiss structural survey. Problems of high relevance for official statistics are discussed, in particular: (a) How small can the areas be without the precision of the estimates becoming inadequate? (b) Are the accuracy measures for the small area estimators reliable enough to be used for publication? (c) Do very small areas interfere with the modelling of the variables of interest, and could they thus cause a deterioration of the estimates for larger and therefore more important areas? (d) How can covariates available at different levels of aggregation be used in an appropriate way to improve the estimates? The data basis is the Swiss census of 2001. The main results are that, in the author's view, the use of small area estimators for producing estimates for areas with very small sample sizes is advisable in spite of the modelling effort. The MSE estimates provide a useful measure of precision, but do not reach, in all small areas, the level of reliability of the variance estimates for design-based estimators.
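As an illustration of the parametric bootstrap for MSE estimation, the sketch below applies it to a basic area-level (Fay-Herriot type) model with a crude moment-type variance estimate. The binomial empirical best predictor and the variance-reduction device proposed in the work are considerably more involved and are not shown; data, estimators and the number of replicates are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_fh(y, X, psi):
    """Crude Fay-Herriot fit: moment-type estimate of the random-effect
    variance followed by GLS for beta (illustrative only)."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma2_u = max(0.0, np.mean(resid ** 2) - np.mean(psi))
    W = 1.0 / (sigma2_u + psi)
    beta = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * y))
    return beta, sigma2_u

def eblup(y, X, psi, beta, sigma2_u):
    gamma = sigma2_u / (sigma2_u + psi)
    return gamma * y + (1 - gamma) * (X @ beta)

# Illustrative area-level data: D areas with direct estimates y, known
# sampling variances psi and one covariate.
D = 50
X = np.column_stack([np.ones(D), rng.normal(size=D)])
psi = rng.uniform(0.3, 1.0, size=D)
theta = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.6, size=D)
y = theta + rng.normal(scale=np.sqrt(psi))

beta_hat, s2u_hat = fit_fh(y, X, psi)

# Parametric bootstrap MSE estimation for the EBLUP.
B = 500
sq_err = np.zeros((B, D))
for b in range(B):
    theta_b = X @ beta_hat + rng.normal(scale=np.sqrt(s2u_hat), size=D)
    y_b = theta_b + rng.normal(scale=np.sqrt(psi))
    beta_b, s2u_b = fit_fh(y_b, X, psi)
    sq_err[b] = (eblup(y_b, X, psi, beta_b, s2u_b) - theta_b) ** 2

mse_hat = sq_err.mean(axis=0)
print(np.round(mse_hat[:5], 3))
```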
Synthetic simulation populations are artificial data used in simulations to reproduce real phenomena. This work presents requirements and methods for generating such data. Three examples illustrate how the generated synthetic data are used in a simulation.
The demand for reliable statistics has been growing over the past decades, because more and more political and economic decisions are based on statistics, e.g. regional planning, allocation of funds or business decisions. Therefore, it has become increasingly important to develop and to obtain precise regional indicators as well as disaggregated values in order to compare regions or specific groups. In general, surveys provide the information for these indicators only for larger areas like countries or administrative divisions. In practice, however, it is often more interesting to obtain indicators for specific subdivisions, such as the NUTS 2 or NUTS 3 levels. The Nomenclature of Units for Territorial Statistics (NUTS) is a hierarchical system of the European Union used in statistics to refer to subdivisions of countries. In many cases, the sample information on such detailed levels is not available. Thus, there are projects such as the European Census, which aim to provide precise numbers at the NUTS 3 or even community level. The European Census of 2011 is conducted, among other countries, in Germany and Switzerland. Most of the participating countries use sample and register information in a combined form for the estimation process. The classical estimation methods for small areas or subgroups, such as the Horvitz-Thompson (HT) estimator or the generalized regression (GREG) estimator, suffer from small area-specific sample sizes, which cause high variances of the estimates. The application of small area methods, for instance the empirical best linear unbiased predictor (EBLUP), reduces the variance of the estimates by including auxiliary information to increase the effective sample size. These estimation methods lead to higher accuracy for the variables of interest. Small area estimation is also used in the context of business data. For example, when estimating the revenues of specific subgroups, such as the NACE 3 or NACE 4 levels, small sample sizes can occur. The Nomenclature statistique des activités économiques dans la Communauté européenne (NACE) is a system of the European Union which defines an industry standard classification. Besides small sample sizes, business data have further special characteristics. The main challenge is that business data have skewed distributions, with a few large companies and many small businesses. For instance, in the automotive industry in Germany, there are many small suppliers but only few large original equipment manufacturers (OEM). Altogether, highly influential units and outliers can be observed in business statistics. These extreme values in connection with small sample sizes cause severe problems when standard small area models are applied. These models are generally based on the normality assumption, which does not hold in the case of outliers. One way to address these peculiarities is to apply outlier-robust small area methods. The availability of adequate covariates is important for the accuracy of the small area methods described above. However, in business data, the auxiliary variables are hardly available at the population level. One of several reasons for this is the fact that in Germany many enterprises are not reflected in business registers due to truncation limits. Furthermore, only listed enterprises or companies which exceed specific thresholds are obliged to publish their results. This limits the number of potential auxiliary variables for the estimation.
Even though there are issues with available covariates, business data often include spatial dependencies which can be used to enhance small area methods. In addition to spatial information based on geographic characteristics, group-specific similarities such as related industries based on NACE codes can be used. For instance, enterprises from the same NACE 2 level, e.g. sector 47 retail trade, behave more similarly than two companies from different NACE 2 levels, e.g. sector 05 mining of coal and sector 64 financial services. This spatial correlation can be incorporated by extending the general linear mixed model through the integration of spatially correlated random effects. In business data, outliers as well as geographic or content-wise spatial dependencies between areas or domains are closely linked. The coincidence of these two factors and the resulting consequences have not been fully covered in the relevant literature. The only approach that combines robust small area methods with spatial dependencies is the M-quantile geographically weighted regression model. In the context of EBLUP-based small area models, the combination of robust and spatial methods has not been considered yet. Therefore, this thesis provides a theoretical approach to this scientific and practical problem and shows its relevance in an empirical study.
For the first time, the German Census 2011 will be conducted via a new method, the register-based census. In contrast to a traditional census, where all inhabitants are surveyed, the German government will mainly attempt to count individuals using population registers of administrative authorities, such as the municipalities and the Federal Employment Agency. Census data that cannot be collected from the registers, such as information on education, training, and occupation, will be collected by an interview-based sample survey. Moreover, the new method reduces citizens' obligations to provide information and helps to reduce costs significantly. The use of sample surveys is limited if results with a detailed regional or subject-matter breakdown have to be prepared. Classical estimation methods are sometimes criticized, since estimation is often problematic for small samples. Fortunately, model-based small area estimators serve as an alternative. These methods help to increase the information, and hence the effective sample size. In the German Census 2011 it is possible to embed areas on a map in a geographical context. This may offer additional information, such as neighborhood relations or spatial interactions. Standard small area models, like Fay-Herriot or Battese-Harter-Fuller, do not account for such interactions explicitly. The aim of our work is to extend the classical models by integrating the spatial information explicitly into the model. In addition, the possible gain in efficiency will be analyzed.
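One common way to integrate spatial information into the Fay-Herriot model is to give the area-level random effects a simultaneous autoregressive (SAR) covariance structure based on a neighbourhood matrix. The sketch below computes the resulting BLUP for given variance parameters on invented data; in practice the parameters are estimated, e.g. by ML or REML. This is a generic illustration, not the specific model extension analysed in our work.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative areas on a line; W is a row-standardised contiguity matrix.
D = 30
W = np.zeros((D, D))
for d in range(D):
    for nb in (d - 1, d + 1):
        if 0 <= nb < D:
            W[d, nb] = 1.0
W = W / W.sum(axis=1, keepdims=True)

# Spatial Fay-Herriot: y = X beta + u + e,  u ~ N(0, sigma_u^2 * A(rho)),
# A(rho) = [(I - rho W)'(I - rho W)]^(-1)  (SAR random effects),
# e_d ~ N(0, psi_d) with known sampling variances psi_d.
X = np.column_stack([np.ones(D), rng.normal(size=D)])
psi = rng.uniform(0.2, 0.8, size=D)
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.7, size=D)

def spatial_blup(y, X, psi, W, sigma2_u, rho):
    """BLUP of the area means for *given* variance parameters; in practice
    sigma2_u and rho are estimated, e.g. by ML or REML."""
    I = np.eye(len(y))
    A = np.linalg.inv((I - rho * W).T @ (I - rho * W))
    G = sigma2_u * A                      # covariance of the spatial random effects
    V = G + np.diag(psi)                  # covariance of the direct estimates
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    return X @ beta + G @ Vinv @ (y - X @ beta)

print(np.round(spatial_blup(y, X, psi, W, sigma2_u=0.5, rho=0.4)[:5], 3))
```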