Large-scale Data-Informed Screening Under Population Heterogeneity
The design of large-scale screening strategies is an essential aspect of accurate and efficient classification of large populations for a binary characteristic (e.g., presence of an infectious disease, a genetic disorder, or a product defect). The performance of screening strategies heavily depends on various population-level characteristics that often vary, sometimes substantially, across individuals. Available datasets provide crucial, yet often neglected, information on population heterogeneity, and incorporating such information into the testing designs is of utmost importance, as failing to do so leads to strategies with higher classification errors and costs. A natural question arises: How does one optimally incorporate population heterogeneity into the mathematical framework to obtain optimal data-informed screening schemes? This question has received limited attention in the literature because the resulting combinatorial optimization problems are difficult. In this project, we theoretically prove that a sub-class of these optimization problems, which were once claimed to be intractable, can be cast as network flow problems (e.g., constrained shortest path), significantly improving computational tractability. This, in turn, leads to polynomial-time algorithms capable of handling realistic instances of the problem. Further, we investigate important, yet often neglected, dimensions of screening, such as equity/fairness and robustness. Our empirical results, from publicly available datasets (e.g., Centers for Disease Control and Prevention), demonstrate how data-informed screening strategies have the potential to significantly reduce screening costs and improve classification accuracy over current screening practices.
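To illustrate the flavor of the network flow reduction, the following is a minimal sketch (not the project's actual formulation) of how partitioning a risk-sorted population into consecutive Dorfman pools can be viewed as a shortest path on a DAG: nodes 0..n mark positions in the sorted population, and edge (i, j) carries the expected number of tests for pooling individuals i..j-1, so the cheapest partition is a shortest 0-to-n path. The function names and the simple two-stage Dorfman cost model are illustrative assumptions.

```python
from math import prod

def dorfman_cost(probs):
    """Expected tests for one Dorfman pool under heterogeneous risks.

    One pooled test; if the pool is positive (probability
    1 - prod(1 - p_i)), all n members are retested individually.
    A pool of size 1 is just a single individual test.
    """
    n = len(probs)
    if n == 1:
        return 1.0
    return 1.0 + n * (1.0 - prod(1.0 - p for p in probs))

def optimal_partition(probs, max_pool=None):
    """Optimal partition of a risk-sorted population into consecutive pools.

    Shortest path on a DAG with nodes 0..n, where edge (i, j) costs
    dorfman_cost(probs[i:j]); solved by dynamic programming in O(n^2).
    Returns (minimum expected tests, pools as (start, end) index pairs).
    """
    n = len(probs)
    max_pool = max_pool or n
    best = [float("inf")] * (n + 1)
    best[0] = 0.0
    choice = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_pool), j):
            c = best[i] + dorfman_cost(probs[i:j])
            if c < best[j]:
                best[j], choice[j] = c, i
    pools, j = [], n
    while j > 0:
        pools.append((choice[j], j))
        j = choice[j]
    return best[n], pools[::-1]
```

For four low-risk individuals with probability 0.01 each, the optimal design is a single pool of four, with roughly 1.16 expected tests versus 4 under individual testing.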
Optimization-based Cluster Analysis: Exact and Scalable Algorithms
Cluster analysis, the most common unsupervised learning method, is an essential and important tool extensively used in various domains (e.g., pattern recognition, economics, marketing, medical research, and political science). The literature on clustering is rich, and several approaches and methods have been proposed throughout the years. These include, among others, partitioning methods (e.g., k-means), hierarchical methods (e.g., agglomerative and divisive), and graph clustering methods (e.g., spectral). Existing clustering methods rely on heuristic algorithmic procedures that have no performance guarantees and that depend on randomly generated initial clusters. This, in turn, leads to clustering algorithms that not only get stuck in local optima, which may be far from optimality, but whose output can be heavily influenced by the random initial conditions. Furthermore, the majority of these methods are not informed by the nature of the data they are clustering, leading to underperforming clusters. In this project, we investigate a novel class of optimization-based clustering procedures that are exact, highly efficient, and capable of handling large-scale datasets. When embedded within an optimization-based framework, cluster analysis leads to difficult multivariate set partitioning problems, which are known to be NP-complete in general. We provide novel theoretical results on a class of multivariate set partitioning problems that enable us to solve them in a highly efficient manner. This, in turn, allows us to construct multidimensional clustering procedures that run in quadratic time, as well as hierarchical heuristic algorithms that run in linear time. Consequently, our approach can handle very large datasets. Our empirical results, on disease risk categorizations, demonstrate how our proposed clustering schemes significantly outperform current practices and existing state-of-the-art clustering schemes, such as k-means.
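As a small, hedged illustration of how an exact, initialization-free clustering procedure can arise from a set partitioning view (a simplified one-dimensional analogue, not the project's multivariate method): for 1-D data, an optimal k-partition under within-cluster sum of squares always uses contiguous blocks of the sorted values, so the set partitioning problem reduces to dynamic programming over split points. The O(k n^2) implementation below, with O(1) block costs via prefix sums, is an assumed sketch.

```python
def exact_1d_clusters(values, k):
    """Exact minimum within-cluster sum of squares (WCSS) for 1-D data.

    Sorts the values, then runs DP over split points: dp[c][j] is the
    best WCSS for clustering the first j sorted values into c blocks.
    No random initialization; the global optimum is returned.
    Returns (optimal WCSS, clusters as lists of values).
    """
    xs = sorted(values)
    n = len(xs)
    # Prefix sums give each block's sum of squares in O(1).
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s[i + 1] = s[i] + x
        s2[i + 1] = s2[i] + x * x

    def block_cost(i, j):  # WCSS of the contiguous block xs[i:j]
        m = j - i
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / m

    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                v = dp[c - 1][i] + block_cost(i, j)
                if v < dp[c][j]:
                    dp[c][j], cut[c][j] = v, i
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    return dp[k][n], clusters[::-1]
```

Unlike k-means, the output is deterministic and provably optimal for this 1-D objective, which is the kind of guarantee heuristic procedures lack.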
Robust Testing Designs for Time-varying and Dynamic Environments
In a world that is constantly changing and dynamic, the task of constructing efficient and resilient testing procedures becomes challenging. In many settings, system parameters vary, sometimes substantially, over time. In the case of vector-borne diseases, for example, risk levels are highly seasonal and depend on many factors including, among others, climate and weather conditions. Ignoring such time dependencies and variations, by only relying on point estimates (e.g., moving average), can lead to testing schemes with poor and highly unpredictable performance. Therefore, multi-period robust testing designs, which can take into account the variation and uncertainty in system parameters, play a crucial role, as they can lead to designs that are highly resilient to system perturbations. In this project, we investigate novel multi-period adaptive and non-adaptive testing designs that incorporate time-dependent system parameters into the mathematical framework. Two fundamental questions need to be addressed: (i) At what points in time is it optimal for the decision-maker to intervene and modify the testing designs? (ii) What are the optimal testing designs for each of the resulting time intervals? Answering these questions requires the analysis of difficult regret-based robust optimization and stochastic dynamic programming formulations. Our analytical results on novel closed-form solutions to the inner regret problem enable us to reformulate a sub-class of these problems as narrowest path problems, drastically improving their tractability. Our empirical results, on a class of multi-period testing schemes, demonstrate the benefits of the proposed robust schemes, which substantially improve resilience and reduce costs over traditional testing practices.
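For readers unfamiliar with the narrowest path problem mentioned above: it asks for a source-target path that minimizes the largest edge weight encountered (a bottleneck objective), and it is solvable by a Dijkstra variant in which path labels combine with max() instead of addition. The sketch below is a generic illustration of that problem, under assumed graph and function names, not the project's actual reformulation.

```python
import heapq

def narrowest_path(graph, source, target):
    """Narrowest (bottleneck) path: minimize the maximum edge weight
    along a source-to-target path.

    Dijkstra variant: a path's label is its widest edge so far, and
    extending the path combines labels with max() rather than +.
    graph: {node: [(neighbor, weight), ...]}.
    Returns (bottleneck value, path) or (inf, []) if unreachable.
    """
    best = {source: 0.0}          # narrowest label found per node
    heap = [(0.0, source, [source])]
    while heap:
        width, u, path = heapq.heappop(heap)
        if u == target:
            return width, path    # labels pop in nondecreasing order
        if width > best.get(u, float("inf")):
            continue              # stale heap entry
        for v, w in graph.get(u, []):
            new_width = max(width, w)
            if new_width < best.get(v, float("inf")):
                best[v] = new_width
                heapq.heappush(heap, (new_width, v, path + [v]))
    return float("inf"), []
```

In a multi-period setting, one would read nodes as intervention time points and edge weights as the worst-case regret of keeping a design fixed over an interval, so the narrowest path picks the intervention schedule whose worst interval is as benign as possible.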