Mixed-Integer Optimization for Semi-Supervised Learning with Cardinality Constraints: Support Vector Machines, Classification Trees, and Random Forests
- In machine learning, classification is the task of predicting a label for each point within a data set. When the class of each point in the labeled subset is already known, this information is used to recognize patterns and make predictions about the points in the remainder of the set, referred to as the unlabeled set. This scenario falls in the field of supervised learning.
However, the number of labeled points can be restricted, because, e.g., it is expensive to obtain this information. Besides, this subset may be biased, such as in the case of self-selection in a survey. Consequently, the classification performance for unlabeled points may be limited. To improve the reliability of the results, semi-supervised learning tackles the setting of labeled and unlabeled data. Moreover, in many cases, additional information about the size of each class can be available from undisclosed sources.
This cumulative thesis presents different studies to combine this external cardinality constraint information within three important algorithms for binary classification in the supervised context: support vector machines (SVM), classification trees, and random forests. From a mathematical point of view, we focus on mixed-integer programming (MIP) models for semi-supervised approaches that consider a cardinality constraint for each class for each algorithm.
Furthermore, since the proposed MIP models are computationally challenging, we also present techniques that simplify the process of solving these problems. In the SVM setting, we introduce a re-clustering method and further computational techniques to reduce the computational cost. In the context of classification trees, we provide correct values for certain bounds that play a crucial role for the solver performance. For the random forest model, we develop preprocessing techniques and an intuitive branching rule to reduce the solution time. For all three methods, our numerical results show that our approaches have better statistical performances for biased samples than the standard approach.