Refine
Keywords
- Weighted Regression (1) (remove)
Estimation and therefore prediction -- both in traditional statistics and machine learning -- encounters often problems when done on survey data, i.e. on data gathered from a random subset of a finite population. Additional to the stochastic generation of the data in the finite population (based on a superpopulation model), the subsetting represents a second randomization process, and adds further noise to the estimation. The character and impact of the additional noise on the estimation procedure depends on the specific probability law for subsetting, i.e. the survey design. Especially when the design is complex or the population data is not generated by a Gaussian distribution, established methods must be re-thought. Both phenomena can be found in business surveys, and their combined occurrence poses challenges to the estimation.
This work introduces selected topics linked to relevant use cases of business surveys and discusses the role of survey design therein: First, consider micro-econometrics using business surveys. Regression analysis under the peculiarities of non-normal data and complex survey design is discussed. The focus lies on mixed models, which are able to capture unobserved heterogeneity e.g. between economic sectors, when the dependent variable is not conditionally normally distributed. An algorithm for survey-weighted model estimation in this setting is provided and applied to business data.
Second, in official statistics, the classical sampling randomization and estimators for finite population totals are relevant. The variance estimation of estimators for (finite) population totals plays a major role in this framework in order to decide on the reliability of survey data. When the survey design is complex, and the number of variables is large for which an estimated total is required, generalized variance functions are popular for variance estimation. They allow to circumvent cumbersome theoretical design-based variance formulae or computer-intensive resampling. A synthesis of the superpopulation-based motivation and the survey framework is elaborated. To the author's knowledge, such a synthesis is studied for the first time both theoretically and empirically.
Third, the self-organizing map -- an unsupervised machine learning algorithm for data visualization, clustering and even probability estimation -- is introduced. A link to Markov random fields is outlined, which to the author's knowledge has not yet been established, and a density estimator is derived. The latter is evaluated in terms of a Monte-Carlo simulation and then applied to real world business data.