OPUS 4 | Suchen

13 Treffer

11 bis 13

Sortieren nach

Statistical and Machine Learning Methods for Handling Selectivity in Non-Probability Samples (2023)

Lenau, Simon

Non-probability sampling is a topic of growing relevance, especially due to its occurrence in the context of new emerging data sources like web surveys and Big Data. This thesis addresses statistical challenges arising from non-probability samples, where unknown or uncontrolled sampling mechanisms raise concerns in terms of data quality and representativity. Various methods to quantify and reduce the potential selectivity and biases of non-probability samples in estimation and inference are discussed. The thesis introduces new forms of prediction and weighting methods, namely a) semi-parametric artificial neural networks (ANNs) that integrate B-spline layers with optimal knot positioning in the general structure and fitting procedure of artificial neural networks, and b) calibrated semi-parametric ANNs that determine weights for non-probability samples by integrating an ANN as response model with calibration constraints for totals, covariances and correlations. Custom-made computational implementations are developed for fitting (calibrated) semi-parametric ANNs by means of stochastic gradient descent, BFGS and sequential quadratic programming algorithms. The performance of all the discussed methods is evaluated and compared for a bandwidth of non-probability sampling scenarios in a Monte Carlo simulation study as well as an application to a real non-probability sample, the WageIndicator web survey. Potentials and limitations of the different methods for dealing with the challenges of non-probability sampling under various circumstances are highlighted. It is shown that the best strategy for using non-probability samples heavily depends on the particular selection mechanism, research interest and available auxiliary information. Nevertheless, the findings show that existing as well as newly proposed methods can be used to ease or even fully counterbalance the issues of non-probability samples and highlight the conditions under which this is possible.

An Integer Optimization Approach to k-Anonymity on Nominal Data (2023)

Cost, Vera Charlotte

The publication of statistical databases is subject to legal regulations, e.g. national statistical offices are only allowed to publish data if the data cannot be attributed to individuals. Achieving this privacy standard requires anonymizing the data prior to publication. However, data anonymization inevitably leads to a loss of information, which should be kept minimal. In this thesis, we analyze the anonymization method SAFE used in the German census in 2011 and we propose a novel integer programming-based anonymization method for nominal data. In the first part of this thesis, we prove that a fundamental variant of the underlying SAFE optimization problem is NP-hard. This justifies the use of heuristic approaches for large data sets. In the second part, we propose a new anonymization method belonging to microaggregation methods, specifically designed for nominal data. This microaggregation method replaces rows in a microdata set with representative values to achieve k-anonymity, ensuring each data row is identical to at least k − 1 other rows. In addition to the overall dissimilarities of the data rows, the method accounts for errors in resulting frequency tables, which are of high interest for nominal data in practice. The method employs a typical two-step structure: initially partitioning the data set into clusters and subsequently replacing all cluster elements with representative values to achieve k-anonymity. For the partitioning step, we propose a column generation scheme followed by a heuristic to obtain an integer solution, which is based on the dual information. For the aggregation step, we present a mixed-integer problem formulation to find cluster representatives. To this end, we take errors in a subset of frequency tables into account. Furthermore, we show a reformulation of the problem to a minimum edge-weighted maximal clique problem in a multipartite graph, which allows for a different perspective on the problem. Moreover, we formulate a mixed-integer program, which combines the partitioning and the aggregation step and aims to minimize the sum of chi-squared errors in frequency tables. Finally, an experimental study comparing the methods covered or developed in this work shows particularly strong results for the proposed method with respect to relative criteria, while SAFE shows its strength with respect to the maximum absolute error in frequency tables. We conclude that the inclusion of integer programming in the context of data anonymization is a promising direction to reduce the inevitable information loss inherent in anonymization, particularly for nominal data.

Data Fusion in Official Statistics: An Evaluation of Classical versus Statistical Learning Approaches (2024)

Schaller, Jannik

Data fusions are becoming increasingly relevant in official statistics. The aim of a data fusion is to combine two or more data sources using statistical methods in order to be able to analyse different characteristics that were not jointly observed in one data source. Record linkage of official data sources using unique identifiers is often not possible due to methodological and legal restrictions. Appropriate data fusion methods are therefore of central importance in order to use the diverse data sources of official statistics more effectively and to be able to jointly analyse different characteristics. However, the literature lacks comprehensive evaluations of which fusion approaches provide promising results for which data constellations. Therefore, the central aim of this thesis is to evaluate a concrete plethora of possible fusion algorithms, which includes classical imputation approaches as well as statistical and machine learning methods, in selected data constellations. To specify and identify these data contexts, data and imputation-related scenario types of a data fusion are introduced: Explicit scenarios, implicit scenarios and imputation scenarios. From these three scenario types, fusion scenarios that are particularly relevant for official statistics are selected as the basis for the simulations and evaluations. The explicit scenarios are the fulfilment or violation of the Conditional Independence Assumption (CIA) and varying sample sizes of the data to be matched. Both aspects are likely to have a direct, that is, explicit, effect on the performance of different fusion methods. The summed sample size of the data sources to be fused and the scale level of the variable to be imputed are considered as implicit scenarios. Both aspects suggest or exclude the applicability of certain fusion methods due to the nature of the data. The univariate or simultaneous, multivariate imputation solution and the imputation of artificially generated or previously observed values in the case of metric characteristics serve as imputation scenarios. With regard to the concrete plethora of possible fusion algorithms, three classical imputation approaches are considered: Distance Hot Deck (DHD), the Regression Model (RM) and Predictive Mean Matching (PMM). With Decision Trees (DT) and Random Forest (RF), two prominent tree-based methods from the field of statistical learning are discussed in the context of data fusion. However, such prediction methods aim to predict individual values as accurately as possible, which can clash with the primary objective of data fusion, namely the reproduction of joint distributions. In addition, DT and RF only comprise univariate imputation solutions and, in the case of metric variables, artificially generated values are imputed instead of real observed values. Therefore, Predictive Value Matching (PVM) is introduced as a new, statistical learning-based nearest neighbour method, which could overcome the distributional disadvantages of DT and RF, offers a univariate and multivariate imputation solution and, in addition, imputes real and previously observed values for metric characteristics. All prediction methods can form the basis of the new PVM approach. In this thesis, PVM based on Decision Trees (PVM-DT) and Random Forest (PVM-RF) is considered. The underlying fusion methods are investigated in comprehensive simulations and evaluations. The evaluation of the various data fusion techniques focusses on the selected fusion scenarios. The basis for this is formed by two concrete and current use cases of data fusion in official statistics, the fusion of EU-SILC and the Household Budget Survey on the one hand and of the Tax Statistics and the Microcensus on the other. Both use cases show significant differences with regard to different fusion scenarios and thus serve the purpose of covering a variety of data constellations. Simulation designs are developed from both use cases, whereby the explicit scenarios in particular are incorporated into the simulations. The results show that PVM-RF in particular is a promising and universal fusion approach under compliance with the CIA. This is because PVM-RF provides satisfactory results for both categorical and metric variables to be imputed and also offers a univariate and multivariate imputation solution, regardless of the scale level. PMM also represents an adequate fusion method, but only in relation to metric characteristics. The results also imply that the application of statistical learning methods is both an opportunity and a risk. In the case of CIA violation, potential correlation-related exaggeration effects of DT and RF, and in some cases also of RM, can be useful. In contrast, the other methods induce poor results if the CIA is violated. However, if the CIA is fulfilled, there is a risk that the prediction methods RM, DT and RF will overestimate correlations. The size ratios of the studies to be fused in turn have a rather minor influence on the performance of fusion methods. This is an important indication that the larger dataset does not necessarily have to serve as a donor study, as was previously the case. The results of the simulations and evaluations provide concrete implications as to which data fusion methods should be used and considered under the selected data and imputation constellations. Science in general and official statistics in particular benefit from these implications. This is because they provide important indications for future data fusion projects in order to assess which specific data fusion method could provide adequate results along the data constellations analysed in this thesis. Furthermore, with PVM this thesis offers a promising methodological innovation for future data fusions and for imputation problems in general.

11 bis 13

Autor(en)
Titel
Weitere Person(en)
Gutachter
Zusammenfassung
Volltext

Filtern

Autor

Erscheinungsjahr

Sprache

Schlagworte

Institut

13 Treffer