Refine
Language
- English (13) (remove)
Keywords
- Schätzung (4)
- Stichprobe (4)
- survey statistics (4)
- Erhebungsverfahren (3)
- Amtliche Statistik (2)
- Datenerhebung (2)
- Haushalt (2)
- Official Statistics (2)
- Optimierung (2)
- Unternehmen (2)
Institute
- Fachbereich 4 (9)
- Mathematik (1)
- Wirtschaftswissenschaften (1)
The publication of statistical databases is subject to legal regulations, e.g. national statistical offices are only allowed to publish data if the data cannot be attributed to individuals. Achieving this privacy standard requires anonymizing the data prior to publication. However, data anonymization inevitably leads to a loss of information, which should be kept minimal. In this thesis, we analyze the anonymization method SAFE used in the German census in 2011 and we propose a novel integer programming-based anonymization method for nominal data.
In the first part of this thesis, we prove that a fundamental variant of the underlying SAFE optimization problem is NP-hard. This justifies the use of heuristic approaches for large data sets. In the second part, we propose a new anonymization method belonging to microaggregation methods, specifically designed for nominal data. This microaggregation method replaces rows in a microdata set with representative values to achieve k-anonymity, ensuring each data row is identical to at least k − 1 other rows. In addition to the overall dissimilarities of the data rows, the method accounts for errors in resulting frequency tables, which are of high interest for nominal data in practice. The method employs a typical two-step structure: initially partitioning the data set into clusters and subsequently replacing all cluster elements with representative values to achieve k-anonymity. For the partitioning step, we propose a column generation scheme followed by a heuristic to obtain an integer solution, which is based on the dual information. For the aggregation step, we present a mixed-integer problem formulation to find cluster representatives. To this end, we take errors in a subset of frequency tables into account. Furthermore, we show a reformulation of the problem to a minimum edge-weighted maximal clique problem in a multipartite graph, which allows for a different perspective on the problem. Moreover, we formulate a mixed-integer program, which combines the partitioning and the aggregation step and aims to minimize the sum of chi-squared errors in frequency tables.
Finally, an experimental study comparing the methods covered or developed in this work shows particularly strong results for the proposed method with respect to relative criteria, while SAFE shows its strength with respect to the maximum absolute error in frequency tables. We conclude that the inclusion of integer programming in the context of data anonymization is a promising direction to reduce the inevitable information loss inherent in anonymization, particularly for nominal data.