OPUS 4 | Suchen

An Integer Optimization Approach to k-Anonymity on Nominal Data (2023)

Cost, Vera Charlotte

The publication of statistical databases is subject to legal regulations, e.g. national statistical offices are only allowed to publish data if the data cannot be attributed to individuals. Achieving this privacy standard requires anonymizing the data prior to publication. However, data anonymization inevitably leads to a loss of information, which should be kept minimal. In this thesis, we analyze the anonymization method SAFE used in the German census in 2011 and we propose a novel integer programming-based anonymization method for nominal data. In the first part of this thesis, we prove that a fundamental variant of the underlying SAFE optimization problem is NP-hard. This justifies the use of heuristic approaches for large data sets. In the second part, we propose a new anonymization method belonging to microaggregation methods, specifically designed for nominal data. This microaggregation method replaces rows in a microdata set with representative values to achieve k-anonymity, ensuring each data row is identical to at least k − 1 other rows. In addition to the overall dissimilarities of the data rows, the method accounts for errors in resulting frequency tables, which are of high interest for nominal data in practice. The method employs a typical two-step structure: initially partitioning the data set into clusters and subsequently replacing all cluster elements with representative values to achieve k-anonymity. For the partitioning step, we propose a column generation scheme followed by a heuristic to obtain an integer solution, which is based on the dual information. For the aggregation step, we present a mixed-integer problem formulation to find cluster representatives. To this end, we take errors in a subset of frequency tables into account. Furthermore, we show a reformulation of the problem to a minimum edge-weighted maximal clique problem in a multipartite graph, which allows for a different perspective on the problem. Moreover, we formulate a mixed-integer program, which combines the partitioning and the aggregation step and aims to minimize the sum of chi-squared errors in frequency tables. Finally, an experimental study comparing the methods covered or developed in this work shows particularly strong results for the proposed method with respect to relative criteria, while SAFE shows its strength with respect to the maximum absolute error in frequency tables. We conclude that the inclusion of integer programming in the context of data anonymization is a promising direction to reduce the inevitable information loss inherent in anonymization, particularly for nominal data.

Consistent Estimation in Household Surveys (2019)

Konrad, Anne

This dissertation deals with consistent estimates in household surveys. Household surveys are often drawn via cluster sampling, with households sampled at the first stage and persons selected at the second stage. The collected data provide information for estimation at both the person and the household level. However, consistent estimates are desirable in the sense that the estimated household-level totals should coincide with the estimated totals obtained at the person-level. Current practice in statistical offices is to use integrated weighting. In this approach consistent estimates are guaranteed by equal weights for all persons within a household and the household itself. However, due to the forced equality of weights, the individual patterns of persons are lost and the heterogeneity within households is not taken into account. In order to avoid the negative consequences of integrated weighting, we propose alternative weighting methods in the first part of this dissertation that ensure both consistent estimates and individual person weights within a household. The underlying idea is to limit the consistency conditions to variables that emerge in both the personal and household data sets. These common variables are included in the person- and household-level estimator as additional auxiliary variables. This achieves consistency more directly and only for the relevant variables, rather than indirectly by forcing equal weights on all persons within a household. Further decisive advantages of the proposed alternative weighting methods are that original individual rather than the constructed aggregated auxiliaries are utilized and that the variable selection process is more flexible because different auxiliary variables can be incorporated in the person-level estimator than in the household-level estimator. In the second part of this dissertation, the variances of a person-level GREG estimator and an integrated estimator are compared in order to quantify the effects of the consistency requirements in the integrated weighting approach. One of the challenges is that the estimators to be compared are of different dimensions. The proposed solution is to decompose the variance of the integrated estimator into the variance of a reduced GREG estimator, whose underlying model is of the same dimensions as the person-level GREG estimator, and add a constructed term that captures the effects disregarded by the reduced model. Subsequently, further fields of application for the derived decomposition are proposed such as the variable selection process in the field of econometrics or survey statistics.

Data Fusion in Official Statistics: An Evaluation of Classical versus Statistical Learning Approaches (2024)

Schaller, Jannik

Data fusions are becoming increasingly relevant in official statistics. The aim of a data fusion is to combine two or more data sources using statistical methods in order to be able to analyse different characteristics that were not jointly observed in one data source. Record linkage of official data sources using unique identifiers is often not possible due to methodological and legal restrictions. Appropriate data fusion methods are therefore of central importance in order to use the diverse data sources of official statistics more effectively and to be able to jointly analyse different characteristics. However, the literature lacks comprehensive evaluations of which fusion approaches provide promising results for which data constellations. Therefore, the central aim of this thesis is to evaluate a concrete plethora of possible fusion algorithms, which includes classical imputation approaches as well as statistical and machine learning methods, in selected data constellations. To specify and identify these data contexts, data and imputation-related scenario types of a data fusion are introduced: Explicit scenarios, implicit scenarios and imputation scenarios. From these three scenario types, fusion scenarios that are particularly relevant for official statistics are selected as the basis for the simulations and evaluations. The explicit scenarios are the fulfilment or violation of the Conditional Independence Assumption (CIA) and varying sample sizes of the data to be matched. Both aspects are likely to have a direct, that is, explicit, effect on the performance of different fusion methods. The summed sample size of the data sources to be fused and the scale level of the variable to be imputed are considered as implicit scenarios. Both aspects suggest or exclude the applicability of certain fusion methods due to the nature of the data. The univariate or simultaneous, multivariate imputation solution and the imputation of artificially generated or previously observed values in the case of metric characteristics serve as imputation scenarios. With regard to the concrete plethora of possible fusion algorithms, three classical imputation approaches are considered: Distance Hot Deck (DHD), the Regression Model (RM) and Predictive Mean Matching (PMM). With Decision Trees (DT) and Random Forest (RF), two prominent tree-based methods from the field of statistical learning are discussed in the context of data fusion. However, such prediction methods aim to predict individual values as accurately as possible, which can clash with the primary objective of data fusion, namely the reproduction of joint distributions. In addition, DT and RF only comprise univariate imputation solutions and, in the case of metric variables, artificially generated values are imputed instead of real observed values. Therefore, Predictive Value Matching (PVM) is introduced as a new, statistical learning-based nearest neighbour method, which could overcome the distributional disadvantages of DT and RF, offers a univariate and multivariate imputation solution and, in addition, imputes real and previously observed values for metric characteristics. All prediction methods can form the basis of the new PVM approach. In this thesis, PVM based on Decision Trees (PVM-DT) and Random Forest (PVM-RF) is considered. The underlying fusion methods are investigated in comprehensive simulations and evaluations. The evaluation of the various data fusion techniques focusses on the selected fusion scenarios. The basis for this is formed by two concrete and current use cases of data fusion in official statistics, the fusion of EU-SILC and the Household Budget Survey on the one hand and of the Tax Statistics and the Microcensus on the other. Both use cases show significant differences with regard to different fusion scenarios and thus serve the purpose of covering a variety of data constellations. Simulation designs are developed from both use cases, whereby the explicit scenarios in particular are incorporated into the simulations. The results show that PVM-RF in particular is a promising and universal fusion approach under compliance with the CIA. This is because PVM-RF provides satisfactory results for both categorical and metric variables to be imputed and also offers a univariate and multivariate imputation solution, regardless of the scale level. PMM also represents an adequate fusion method, but only in relation to metric characteristics. The results also imply that the application of statistical learning methods is both an opportunity and a risk. In the case of CIA violation, potential correlation-related exaggeration effects of DT and RF, and in some cases also of RM, can be useful. In contrast, the other methods induce poor results if the CIA is violated. However, if the CIA is fulfilled, there is a risk that the prediction methods RM, DT and RF will overestimate correlations. The size ratios of the studies to be fused in turn have a rather minor influence on the performance of fusion methods. This is an important indication that the larger dataset does not necessarily have to serve as a donor study, as was previously the case. The results of the simulations and evaluations provide concrete implications as to which data fusion methods should be used and considered under the selected data and imputation constellations. Science in general and official statistics in particular benefit from these implications. This is because they provide important indications for future data fusion projects in order to assess which specific data fusion method could provide adequate results along the data constellations analysed in this thesis. Furthermore, with PVM this thesis offers a promising methodological innovation for future data fusions and for imputation problems in general.

Design and Estimation in Business Surveys - Selected Topics (2020)

Dörr, Patricia

Estimation and therefore prediction -- both in traditional statistics and machine learning -- encounters often problems when done on survey data, i.e. on data gathered from a random subset of a finite population. Additional to the stochastic generation of the data in the finite population (based on a superpopulation model), the subsetting represents a second randomization process, and adds further noise to the estimation. The character and impact of the additional noise on the estimation procedure depends on the specific probability law for subsetting, i.e. the survey design. Especially when the design is complex or the population data is not generated by a Gaussian distribution, established methods must be re-thought. Both phenomena can be found in business surveys, and their combined occurrence poses challenges to the estimation. This work introduces selected topics linked to relevant use cases of business surveys and discusses the role of survey design therein: First, consider micro-econometrics using business surveys. Regression analysis under the peculiarities of non-normal data and complex survey design is discussed. The focus lies on mixed models, which are able to capture unobserved heterogeneity e.g. between economic sectors, when the dependent variable is not conditionally normally distributed. An algorithm for survey-weighted model estimation in this setting is provided and applied to business data. Second, in official statistics, the classical sampling randomization and estimators for finite population totals are relevant. The variance estimation of estimators for (finite) population totals plays a major role in this framework in order to decide on the reliability of survey data. When the survey design is complex, and the number of variables is large for which an estimated total is required, generalized variance functions are popular for variance estimation. They allow to circumvent cumbersome theoretical design-based variance formulae or computer-intensive resampling. A synthesis of the superpopulation-based motivation and the survey framework is elaborated. To the author's knowledge, such a synthesis is studied for the first time both theoretically and empirically. Third, the self-organizing map -- an unsupervised machine learning algorithm for data visualization, clustering and even probability estimation -- is introduced. A link to Markov random fields is outlined, which to the author's knowledge has not yet been established, and a density estimator is derived. The latter is evaluated in terms of a Monte-Carlo simulation and then applied to real world business data.

Design-Based Methods for Compensating for the Effects of Frame Errors in Business Surveys (2023)

Manecke, Julia

Official business surveys form the basis for national and regional business statistics and are thus of great importance for analysing the state and performance of the economy. However, both the heterogeneity of business data and their high dynamics pose a particular challenge to the feasibility of sampling and the quality of the resulting estimates. A widely used sampling frame for creating the design of an official business survey is an extract from an official business register. However, if this frame does not accurately represent the target population, frame errors arise. Amplified by the heterogeneity and dynamics of business populations, these errors can significantly affect the estimation quality and lead to inefficiencies and biases. This dissertation therefore deals with design-based methods for optimising business surveys with respect to different types of frame errors. First, methods for adjusting the sampling design of business surveys are addressed. These approaches integrate auxiliary information about the expected structures of frame errors into the sampling design. The aim is to increase the number of sampled businesses that are subject to frame errors. The element-specific frame error probability is estimated based on auxiliary information about frame errors observed in previous samples. The approaches discussed consider different types of frame errors and can be incorporated into predefined designs with fixed strata. As the second main pillar of this work, methods for adjusting weights to correct for frame errors during estimation are developed and investigated. As a result of frame errors, the assumptions under which the original design weights were determined based on the sampling design no longer hold. The developed methods correct the design weights taking into account the errors identified for sampled elements. Case-number-based reweighting approaches, on the one hand, attempt to reconstruct the unknown size of the individual strata in the target population. In the context of weight smoothing methods, on the other hand, design weights are modelled and smoothed as a function of target or auxiliary variables. This serves to avoid inefficiencies in the estimation due to highly scattering weights or weak correlations between weights and target variables. In addition, possibilities of correcting frame errors by calibration weighting are elaborated. Especially when the sampling frame shows over- and/or undercoverage, the inclusion of external auxiliary information can provide a significant improvement of the estimation quality. For those methods whose quality cannot be measured using standard procedures, a procedure for estimating the variance based on a rescaling bootstrap is proposed. This enables an assessment of the estimation quality when using the methods in practice. In the context of two extensive simulation studies, the methods presented in this dissertation are evaluated and compared with each other. First, in the environment of an experimental simulation, it is assessed which approaches are particularly suitable with regard to different data situations. In a second simulation study, which is based on the structural survey in the services sector, the applicability of the methods in practice is evaluated under realistic conditions.

Finite Mixture Models for Small Area Estimation in Cases of Unobserved Heterogeneity (2018)

Articus, Charlotte

A basic assumption of standard small area models is that the statistic of interest can be modelled through a linear mixed model with common model parameters for all areas in the study. The model can then be used to stabilize estimation. In some applications, however, there may be different subgroups of areas, with specific relationships between the response variable and auxiliary information. In this case, using a distinct model for each subgroup would be more appropriate than employing one model for all observations. If no suitable natural clustering variable exists, finite mixture regression models may represent a solution that „lets the data decide“ how to partition areas into subgroups. In this framework, a set of two or more different models is specified, and the estimation of subgroup-specific model parameters is performed simultaneously to estimating subgroup identity, or the probability of subgroup identity, for each area. Finite mixture models thus offer a fexible approach to accounting for unobserved heterogeneity. Therefore, in this thesis, finite mixtures of small area models are proposed to account for the existence of latent subgroups of areas in small area estimation. More specifically, it is assumed that the statistic of interest is appropriately modelled by a mixture of K linear mixed models. Both mixtures of standard unit-level and standard area-level models are considered as special cases. The estimation of mixing proportions, area-specific probabilities of subgroup identity and the K sets of model parameters via the EM algorithm for mixtures of mixed models is described. Eventually, a finite mixture small area estimator is formulated as a weighted mean of predictions from model 1 to K, with weights given by the area-specific probabilities of subgroup identity.

Model-Based Prediction and Estimation Using Incomplete Survey Data (2023)

Wölwer, Anna-Lena

Survey data can be viewed as incomplete or partially missing from a variety of perspectives and there are different ways of dealing with this kind of data in the prediction and the estimation of economic quantities. In this thesis, we present two selected research contexts in which the prediction or estimation of economic quantities is examined under incomplete survey data. These contexts are first the investigation of composite estimators in the German Microcensus (Chapters 3 and 4) and second extensions of multivariate Fay-Herriot (MFH) models (Chapters 5 and 6), which are applied to small area problems. Composite estimators are estimation methods that take into account the sample overlap in rotating panel surveys such as the German Microcensus in order to stabilise the estimation of the statistics of interest (e.g. employment statistics). Due to the partial sample overlaps, information from previous samples is only available for some of the respondents, so the data are partially missing. MFH models are model-based estimation methods that work with aggregated survey data in order to obtain more precise estimation results for small area problems compared to classical estimation methods. In these models, several variables of interest are modelled simultaneously. The survey estimates of these variables, which are used as input in the MFH models, are often partially missing. If the domains of interest are not explicitly accounted for in a sampling design, the sizes of the samples allocated to them can, by chance, be small. As a result, it can happen that either no estimates can be calculated at all or that the estimated values are not published by statistical offices because their variances are too large.

Optimization for Multivariate and Multi-domain Methods in Survey Statistics (2018)

Rupp, Martin

The dissertation deals with methods to improve design-based and model-assisted estimation techniques for surveys in a finite population framework. The focus is on the development of the statistical methodology as well as their implementation by means of tailor-made numerical optimization strategies. In that regard, the developed methods aim at computing statistics for several potentially conflicting variables of interest at aggregated and disaggregated levels of the population on the basis of one single survey. The work can be divided into two main research questions, which are briefly explained in the following sections. First, an optimal multivariate allocation method is developed taking into account several stratification levels. This approach results in a multi-objective optimization problem due to the simultaneous consideration of several variables of interest. In preparation for the numerical solution, several scalarization and standardization techniques are presented, which represent the different preferences of potential users. In addition, it is shown that by solving the problem scalarized with a weighted sum for all combinations of weights, the entire Pareto frontier of the original problem can be generated. By exploiting the special structure of the problem, the scalarized problems can be efficiently solved by a semismooth Newton method. In order to apply this numerical method to other scalarization techniques as well, an alternative approach is suggested, which traces the problem back to the weighted sum case. To address regional estimation quality requirements at multiple stratification levels, the potential use of upper bounds for regional variances is integrated into the method. In addition to restrictions on regional estimates, the method enables the consideration of box-constraints for the stratum-specific sample sizes, allowing minimum and maximum stratum-specific sampling fractions to be defined. In addition to the allocation method, a generalized calibration method is developed, which is supposed to achieve coherent and efficient estimates at different stratification levels. The developed calibration method takes into account a very large number of benchmarks at different stratification levels, which may be obtained from different sources such as registers, paradata or other surveys using different estimation techniques. In order to incorporate the heterogeneous quality and the multitude of benchmarks, a relaxation of selected benchmarks is proposed. In that regard, predefined tolerances are assigned to problematic benchmarks at low aggregation levels in order to avoid an exact fulfillment. In addition, the generalized calibration method allows the use of box-constraints for the correction weights in order to avoid an extremely high variation of the weights. Furthermore, a variance estimation by means of a rescaling bootstrap is presented. Both developed methods are analyzed and compared with existing methods in extensive simulation studies on the basis of a realistic synthetic data set of all households in Germany. Due to the similar requirements and objectives, both methods can be successively applied to a single survey in order to combine their efficiency advantages. In addition, both methods can be solved in a time-efficient manner using very comparable optimization approaches. These are based on transformations of the optimality conditions. The dimension of the resulting system of equations is ultimately independent of the dimension of the original problem, which enables the application even for very large problem instances.

Optimization Methods and Large-Scale Algorithms in Small Area Estimation (2018)

Wagner, Julian

Sample surveys are a widely used and cost effective tool to gain information about a population under consideration. Nowadays, there is an increasing demand not only for information on the population level but also on the level of subpopulations. For some of these subpopulations of interest, however, very small subsample sizes might occur such that the application of traditional estimation methods is not expedient. In order to provide reliable information also for those so called small areas, small area estimation (SAE) methods combine auxiliary information and the sample data via a statistical model. The present thesis deals, among other aspects, with the development of highly flexible and close to reality small area models. For this purpose, the penalized spline method is adequately modified which allows to determine the model parameters via the solution of an unconstrained optimization problem. Due to this optimization framework, the incorporation of shape constraints into the modeling process is achieved in terms of additional linear inequality constraints on the optimization problem. This results in small area estimators that allow for both the utilization of the penalized spline method as a highly flexible modeling technique and the incorporation of arbitrary shape constraints on the underlying P-spline function. In order to incorporate multiple covariates, a tensor product approach is employed to extend the penalized spline method to multiple input variables. This leads to high-dimensional optimization problems for which naive solution algorithms yield an unjustifiable complexity in terms of runtime and in terms of memory requirements. By exploiting the underlying tensor nature, the present thesis provides adequate computationally efficient solution algorithms for the considered optimization problems and the related memory efficient, i.e. matrix-free, implementations. The crucial point thereby is the (repetitive) application of a matrix-free conjugated gradient method, whose runtime is drastically reduced by a matrx-free multigrid preconditioner.

Regression Modelling with Complex Survey Data: An Investigation Using an Extended Close-to-Reality Simulated Household Population (2021)

Ertz, Florian

The Eurosystem's Household Finance and Consumption Survey (HFCS) collects micro data on private households' balance sheets, income and consumption. It is a stylised fact that wealth is unequally distributed and that the wealthiest own a large share of total wealth. For sample surveys which aim at measuring wealth and its distribution, this is a considerable problem. To overcome it, some of the country surveys under the HFCS umbrella try to sample a disproportionately large share of households that are likely to be wealthy, a technique referred to as oversampling. Ignoring such types of complex survey designs in the estimation of regression models can lead to severe problems. This thesis first illustrates such problems using data from the first wave of the HFCS and canonical regression models from the field of household finance and gives a first guideline for HFCS data users regarding the use of replicate weight sets for variance estimation using a variant of the bootstrap. A further investigation of the issue necessitates a design-based Monte Carlo simulation study. To this end, the already existing large close-to-reality synthetic simulation population AMELIA is extended with synthetic wealth data. We discuss different approaches to the generation of synthetic micro data in the context of the extension of a synthetic simulation population that was originally based on a different data source. We propose an additional approach that is suitable for the generation of highly skewed synthetic micro data in such a setting using a multiply-imputed survey data set. After a description of the survey designs employed in the first wave of the HFCS, we then construct new survey designs for AMELIA that share core features of the HFCS survey designs. A design-based Monte Carlo simulation study shows that while more conservative approaches to oversampling do not pose problems for the estimation of regression models if sampling weights are properly accounted for, the same does not necessarily hold for more extreme oversampling approaches. This issue should be further analysed in future research.

Autor(en)
Titel
Weitere Person(en)
Gutachter
Zusammenfassung
Volltext

Filtern

Autor

Erscheinungsjahr

Sprache

Schlagworte

Institut

13 Treffer