Mixed-Integer Optimization for Semi-Supervised Learning with Cardinality Constraints: Support Vector Machines, Classification Trees, and Random Forests

  • In machine learning, classification is the task of predicting a label for each point within a data set. When the class of each point in the labeled subset is already known, this information is used to recognize patterns and make predictions about the points in the remainder of the set, referred to as the unlabeled set. This scenario falls in the field of supervised learning. However, the number of labeled points can be restricted, because, e.g., it is expensive to obtain this information. Besides, this subset may be biased, such as in the case of self-selection in a survey. Consequently, the classification performance for unlabeled points may be limited. To improve the reliability of the results, semi-supervised learning tackles the setting of labeled and unlabeled data. Moreover, in many cases, additional information about the size of each class can be available from undisclosed sources. This cumulative thesis presents different studies to combine this external cardinality constraint information within three important algorithms for binary classification in the supervised context: support vector machines (SVM), classification trees, and random forests. From a mathematical point of view, we focus on mixed-integer programming (MIP) models for semi-supervised approaches that consider a cardinality constraint for each class for each algorithm. Furthermore, since the proposed MIP models are computationally challenging, we also present techniques that simplify the process of solving these problems. In the SVM setting, we introduce a re-clustering method and further computational techniques to reduce the computational cost. In the context of classification trees, we provide correct values for certain bounds that play a crucial role for the solver performance. For the random forest model, we develop preprocessing techniques and an intuitive branching rule to reduce the solution time. For all three methods, our numerical results show that our approaches have better statistical performances for biased samples than the standard approach.

Download full text files

Export metadata

Additional Services

Share in Twitter Search Google Scholar
Metadaten
Author:Maria Eduarda Pinheiro
URN:urn:nbn:de:hbz:385-1-23957
Referee:Jan Pablo Burgard, Martin Schmidt, Dolores Romero Morales
Advisor:Jan Pablo Burgard
Document Type:Doctoral Thesis
Language:English
Date of completion:2024/12/12
Publishing institution:Universität Trier
Granting institution:Universität Trier, Fachbereich 4
Date of final exam:2024/10/31
Release Date:2024/12/12
Number of pages:vi, 122 Blätter
First page:i
Last page:122
Institutes:Fachbereich 4
Licence (German):License LogoCC BY: Creative-Commons-Lizenz 4.0 International

$Rev: 13581 $