Determining the exact position of a forest inventory plot, and hence the position of the sampled trees, is often hampered by poor Global Navigation Satellite System (GNSS) signal quality beneath the forest canopy. Inaccurate geo-references impair the performance of models that aim to retrieve useful information from spatially high-resolution remote sensing data (e.g., species classification or timber volume estimation). This restriction is even more severe at the level of individual trees. The objective of this study was to develop a post-processing strategy to improve the positional accuracy of GNSS-measured sample-plot centers and a method to automatically match trees within a terrestrial sample plot to aerially detected trees. We propose a new method that uses a random forest classifier to estimate the matching probability of each pair of a terrestrial reference tree and an aerially detected tree, which makes it possible to assess the reliability of the results. We investigated 133 sample plots of the Third German National Forest Inventory (BWI, 2011–2012) within the German federal state of Rhineland-Palatinate. For training and objective validation, synthetic forest stands were modeled using the Waldplaner 2.0 software. Our method achieved an overall accuracy of 82.7% for co-registration and 89.1% for tree matching, and 60% of the investigated plots could be successfully relocated. The probabilities provided by the algorithm are an objective indicator of the reliability of a specific result and could be incorporated into quantitative models to improve forest attribute estimation.
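A minimal sketch of the pair-matching idea described above: a random forest is trained on features of candidate (terrestrial, aerial) tree pairs and outputs a matching probability per pair. The feature set, acceptance threshold, and data below are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Illustrative pair features, e.g. horizontal distance, height difference,
# crown-diameter difference (placeholders for the study's own feature set).
X_train = rng.normal(size=(500, 3))
y_train = rng.integers(0, 2, size=500)      # 1 = true match, 0 = mismatch

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

X_pairs = rng.normal(size=(10, 3))          # candidate pairs on one plot
p_match = clf.predict_proba(X_pairs)[:, 1]  # matching probability per pair

# Accept only confident matches; the probability doubles as a reliability score.
matches = np.flatnonzero(p_match > 0.8)
```

The key design point is that `predict_proba` yields a graded score rather than a hard decision, so downstream models can weight or discard unreliable matches.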
We consider a linear regression model in which we assume that some of the observed variables are irrelevant for the prediction. Including the wrong variables in the statistical model can lead either to too little information to properly estimate the statistic of interest, or to too much information and consequently to fictitious connections being described. This thesis considers discrete optimization to conduct variable selection. In this light, the subset selection regression method is analyzed. The approach has gained a lot of interest in recent years due to its promising predictive performance. A major challenge associated with subset selection regression is its computational difficulty. In this thesis, we propose several improvements to the efficiency of the method. Novel bounds on the coefficients of the subset selection regression are developed, which help to tighten the relaxation of the associated mixed-integer program, which relies on a Big-M formulation. Moreover, a novel mixed-integer linear formulation of the subset selection regression, based on a bilevel optimization reformulation, is proposed. Finally, it is shown that the perspective formulation of the subset selection regression is equivalent to a state-of-the-art binary formulation. We use this insight to develop novel bounds for the subset selection regression problem, which prove to be highly effective in combination with the proposed linear formulation.
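For orientation, the Big-M mixed-integer formulation referred to above is commonly written as follows, where the binary variable z_j indicates whether variable j is selected and M bounds the coefficient magnitudes; tighter valid values of M, such as the bounds developed in the thesis, strengthen the continuous relaxation. This is the standard textbook form, not necessarily the thesis's exact notation.

```latex
\min_{\beta \in \mathbb{R}^p,\, z \in \{0,1\}^p} \; \lVert y - X\beta \rVert_2^2
\quad \text{s.t.} \quad
-M z_j \le \beta_j \le M z_j \;\; (j = 1,\dots,p), \qquad
\sum_{j=1}^{p} z_j \le k
```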
In the second part of this thesis, we examine the statistical conception of the subset selection regression and conclude that it is misaligned with its intention. The subset selection regression uses the training error to decide which variables to select, i.e., it conducts validation on the training data, which is often a poor estimate of the prediction error; since the training error can only decrease as variables are added, the approach requires a predetermined cardinality bound. Instead, we propose to select variables with respect to the cross-validation value. The process is formulated as a mixed-integer program, with the sparsity itself becoming subject to the optimization. Usually, cross-validation is used to select the best model out of a few options; with the proposed program, the best model out of all possible models is selected. Since cross-validation is a much better estimate of the prediction error, the model can select the best sparsity itself.
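The following sketch illustrates the cross-validation criterion by brute force: each candidate subset is scored by its K-fold cross-validation error, and the search ranges over all cardinalities, so the sparsity is chosen by the optimization itself. The thesis encodes this as a mixed-integer program; the exponential enumeration below is only for illustration on small p.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_subset_selection(X, y, cv=5):
    """Select the variable subset (of any size) with the best CV error."""
    p = X.shape[1]
    best_score, best_vars = -np.inf, None
    for k in range(1, p + 1):               # sparsity is not fixed in advance
        for subset in combinations(range(p), k):
            score = cross_val_score(
                LinearRegression(), X[:, subset], y,
                scoring="neg_mean_squared_error", cv=cv,
            ).mean()
            if score > best_score:
                best_score, best_vars = score, subset
    return best_vars, -best_score           # chosen variables, CV-MSE
```

Contrast this with classical subset selection, which would rank subsets by training error and therefore needs the cardinality k fixed beforehand.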
The thesis concludes with an extensive simulation study, which provides evidence that discrete optimization can be used to produce highly valuable predictive models, with the cross-validation subset selection regression almost always producing the best results.
Natural hazards are diverse and unevenly distributed in time and space; understanding their complexity is therefore key to saving human lives and conserving natural ecosystems. Condensing the outputs obtained from each modelling analysis is essential for presenting the results to stakeholders, land managers, and policymakers. The main goal of this study was therefore to present a method for synthesizing three natural hazards into one multi-hazard map and to evaluate it for hazard management and land use planning. To test this methodology, we took the Gorganrood Watershed, located in the Golestan Province (Iran), as the study area. First, an inventory map of three different types of hazards, including floods, landslides, and gullies, was prepared using field surveys and different official reports. To generate the susceptibility maps, a total of 17 geo-environmental factors were selected as predictors using the MaxEnt (Maximum Entropy) machine learning technique. The accuracy of the predictive models was evaluated by drawing receiver operating characteristic (ROC) curves and calculating the area under the ROC curve (AUC). The MaxEnt model not only fitted the data very well but also achieved strong predictive performance. Variable importance for the three studied types of hazards showed that river density, distance from streams, and elevation were the most important factors for floods. Lithological units, elevation, and annual mean rainfall were most relevant for detecting landslides, while annual mean rainfall, elevation, and lithological units were used for gully erosion mapping in this study area. Finally, by combining the flood, landslide, and gully erosion susceptibility maps, an integrated multi-hazard map was created. The results showed that 60% of the area is subject to hazards, with landslides affecting up to 21.2% of the whole territory. We conclude that this type of multi-hazard map may be a useful tool for local administrators to identify areas susceptible to hazards at large scales, as demonstrated in this research.
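A toy sketch of the final aggregation step: three susceptibility rasters (values in [0, 1]) are thresholded and combined into a single multi-hazard map. The 0.5 threshold and the bit-encoding below are assumptions for illustration, not the study's actual reclassification scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
flood     = rng.random((100, 100))   # stand-ins for MaxEnt susceptibility maps
landslide = rng.random((100, 100))
gully     = rng.random((100, 100))

THRESHOLD = 0.5                      # assumed cut-off separating "susceptible"

# Encode each hazard as one bit: 1 = flood, 2 = landslide, 4 = gully; sums
# identify overlapping hazards (e.g. 3 = flood + landslide on the same cell).
multi_hazard = ((flood > THRESHOLD).astype(int)
                + 2 * (landslide > THRESHOLD).astype(int)
                + 4 * (gully > THRESHOLD).astype(int))

share_hazard = np.mean(multi_hazard > 0)   # fraction of area under any hazard
```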
There is no longer any doubt about the general effectiveness of psychotherapy. However, up to 40% of patients do not respond to treatment. Despite efforts to develop new treatments, overall effectiveness has not improved. Consequently, practice-oriented research has emerged to make research results more relevant to practitioners. Within this context, patient-focused research (PFR) addresses the question of whether a particular treatment works for a specific patient. PFR in turn gave rise to the precision mental health research movement, which aims to tailor treatments to individual patients through data-driven, algorithm-based predictions. These predictions are intended to support therapists in clinical decisions such as the selection of treatment strategies and the adaptation of treatment. The present work summarizes three studies that aim to generate different prediction models for treatment personalization that can be applied in practice. The goal of Study I was to develop a model for dropout prediction using data assessed prior to the first session (N = 2543). The usefulness of various machine learning (ML) algorithms and ensembles was assessed. The best model was an ensemble combining random forest and nearest-neighbor modeling. It significantly outperformed generalized linear modeling, correctly identifying 63.4% of all cases and uncovering seven key predictors. The findings illustrate the potential of ML to enhance dropout prediction, but also highlight that not all ML algorithms are equally suitable for this purpose. Study II built on Study I's findings to enhance the prediction of dropout. Data from the initial two sessions and observer ratings of therapist interventions and skills were used to develop a model with an elastic net (EN) algorithm. The model was significantly more effective at predicting dropout when observer ratings were included, with a Cohen's d of up to 0.65, and more effective than the model in Study I despite the smaller sample (N = 259). These results indicate that model generation can be improved by drawing on multiple data sources, which provide a better foundation for model development. Finally, Study III generated a model to predict therapy outcome after a sudden gain (SG) in order to identify crucial predictors of the upward spiral. EN was used to generate the model from data on 794 cases that experienced an SG. A control group of the same size was used to contextualize the identified predictors relative to their general influence on therapy outcome. The results indicated seven key predictors with varying effect sizes on therapy outcome, with Cohen's d ranging from 1.08 to 12.48. The findings suggest that a directive approach is more likely to lead to better outcomes after an SG and that alliance ruptures can be effectively compensated for; in the control group, however, these effects were reversed. The results of the three studies are discussed with regard to their usefulness in supporting clinical decision-making and their implications for the implementation of precision mental health.
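A minimal sketch of the kind of ensemble described in Study I: a soft-voting combination of a random forest and a nearest-neighbor classifier for dropout prediction. The feature matrix, labels, and hyperparameters are placeholders, not the study's dataset or tuned model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(2543, 20))     # stand-in for intake variables (pre-session)
y = rng.integers(0, 2, size=2543)   # 1 = dropout, 0 = completion

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("knn", make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=15))),
    ],
    voting="soft",                  # average the two models' probabilities
)
ensemble.fit(X, y)
p_dropout = ensemble.predict_proba(X[:5])[:, 1]   # per-patient dropout risk
```

Soft voting keeps the graded risk estimate, which is what a therapist-facing decision support tool would surface rather than a bare yes/no label.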
Data used for machine learning are often erroneous. In this thesis, p-quasinorms (p < 1) are employed as loss functions in order to increase the robustness of training algorithms for artificial neural networks. Numerical issues arising from these loss functions are addressed via enhanced optimization algorithms (proximal point methods and Frank-Wolfe methods) based on the (non-monotone) Armijo rule. Numerical experiments comprising 1100 test problems confirm the effectiveness of the approach: depending on the parametrization, an average reduction of the absolute residuals of up to 64.6% is achieved (aggregated over 100 test problems).
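A minimal sketch of a p-quasinorm (p < 1) loss for robust regression, with a small epsilon smoothing the non-differentiability at zero. This illustrates only the loss itself; the epsilon value, the linear model, and the plain SGD loop are assumptions, not the thesis's proximal-point or Frank-Wolfe solvers.

```python
import torch

def p_quasinorm_loss(pred, target, p=0.5, eps=1e-8):
    # sum_i (|r_i| + eps)^p : for p < 1, large residuals (outliers) are
    # penalized far less severely than under the squared-error loss.
    return ((pred - target).abs() + eps).pow(p).sum()

# Tiny usage example: fit a linear model despite gross label outliers.
torch.manual_seed(0)
X = torch.randn(200, 5)
w_true = torch.randn(5)
y = X @ w_true + 0.1 * torch.randn(200)
y[:10] += 20.0                       # corrupt 5% of the labels

w = torch.zeros(5, requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = p_quasinorm_loss(X @ w, y, p=0.5)
    loss.backward()
    opt.step()
```

The non-convexity and vanishing curvature of this loss are exactly why specialized methods such as those developed in the thesis are needed; plain SGD is used here only to keep the sketch self-contained.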
Stress position in English words is well known to correlate with both their morphological properties and their phonological organisation in terms of non-segmental, prosodic categories like syllable structure. While two generalisations capturing this correlation, directionality and stratification, are well established, the exact nature of the interaction of phonological and morphological factors in English stress assignment is a much-debated issue in the literature. The present study investigates whether and how directionality and stratification effects in English can be learned by means of Naive Discriminative Learning, a computational model that is trained using error-driven learning and that makes no a priori assumptions about the higher-level phonological organisation or morphological structure of words. Based on a series of simulation studies, we show that neither directionality nor stratification needs to be stipulated as an a priori property of words or as a constraint in the lexicon. Stress can be learned solely on the basis of very flat word representations. Morphological stratification emerges as an effect of the model learning that informativity with regard to stress position is unevenly distributed across the trigrams constituting a word. Morphological affix classes like stress-preserving and stress-shifting affixes are hence not predefined classes but sets of trigrams with similar informativity values with regard to stress position. Directionality, by contrast, emerges as spurious in our simulations; no syllable counting or recourse to abstract prosodic representations seems to be necessary to learn stress position in English.
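A toy sketch of the error-driven (Rescorla-Wagner style) update rule underlying Naive Discriminative Learning: letter trigrams of a word serve as cues, the stress position as the outcome, and weights are adjusted by the prediction error. The three-word dataset and learning rate are invented; the actual simulations train on full English lexicons.

```python
import numpy as np

def trigrams(word):
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# (word, stress position); "permit" appears with both noun and verb stress.
data = [("permit", 0), ("permit", 1), ("happy", 0)]
cues = sorted({t for w, _ in data for t in trigrams(w)})
outcomes = sorted({o for _, o in data})
W = np.zeros((len(cues), len(outcomes)))   # cue-to-outcome association weights
eta = 0.01                                 # learning rate

for _ in range(100):
    for word, stress in data:
        active = [cues.index(t) for t in trigrams(word)]
        target = np.zeros(len(outcomes))
        target[outcomes.index(stress)] = 1.0
        pred = W[active].sum(axis=0)       # summed activation of active cues
        W[active] += eta * (target - pred) # delta-rule (error-driven) update
```

Note that the representation is completely flat: no syllables, morphemes, or affix classes are given to the model; any stratification effect must emerge from unevenly informative trigrams.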
Situated in the field of information extraction, this thesis combines several machine learning techniques. It presents a new algorithm that couples semi-supervised learning with active learning. The starting point is an analysis of the data, which is split into multiple views; here, the inputs of different annotators are separated. For each view, the algorithm independently trains classifier models built from the individual annotators' labels. To obtain the required amount of data, crowdsourcing is used, which makes it possible to reach a large number of people. The participants are given the task of annotating texts. On the one hand, this is done initially for a historical text corpus, and the thesis describes the steps necessary to offer and run the annotation task on crowdsourcing platforms. On the other hand, a current data set of short messages is used. The algorithm is applied to these example data sets, experiments are carried out to determine the optimal parameter choices, and the results are compared with those of existing algorithms.
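A schematic sketch of the active learning component described above: a pool-based loop with uncertainty sampling, where the most ambiguous unlabeled item is queried next. This is a generic single-view illustration with synthetic data; the thesis's algorithm additionally trains separate models per annotator view and combines them with semi-supervised learning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

labeled = list(range(10))        # small seed set of labeled items
pool = list(range(10, 500))      # unlabeled pool (labels hidden until queried)

for _ in range(20):
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    margin = np.abs(proba - 0.5)           # distance from the decision boundary
    q_idx = int(np.argmin(margin))         # most uncertain pool item
    labeled.append(pool.pop(q_idx))        # query its label (here: known y)
```

In the crowdsourcing setting, the queried items would be sent to annotators as new tasks, so the loop directs the limited annotation budget to the examples the current models find hardest.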