Browsing by Department "Statistics"
Now showing 1 - 19 of 19
Research Project: Advances in Data Science: Theory, Methods and Computation
Statistics; TAMU; https://hdl.handle.net/20.500.14641/249; National Science Foundation

Due to advancements in data acquisition techniques over the last two decades, new types of exceedingly complex datasets have emerged and present tremendous challenges that require a synergy of interdisciplinary ideas for analysis and decision making. As a result, data science is rapidly evolving as an interdisciplinary field, where advances often result from combinations of ideas from multiple disciplines. A convening of leading experts, early-career researchers, and students from varied disciplines to exchange ideas is essential for progress in this field. Texas A&M University will host a two-day conference on Advances in Data Science in February 2022. More information on the conference can be found at https://stat.tamu.edu/advances-in-data-science-conference/. The primary objective of the conference is to provide a much-needed platform for accelerating the depth and quality of research on the foundations of data science through interdisciplinarity. The conference will bring together researchers from three major disciplinary areas (Statistics, Mathematics, and Engineering) to present and disseminate their research, engage in discussions, and foster future collaborations. The conference will involve women, minorities, and young researchers from across the nation, and will present a tremendous opportunity for first-generation undergraduate students to be inspired to pursue careers in data science in both academia and industry. The conference will feature a number of activities to engage students and recognize their contributions through awards.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Research Project: ATD: A Statistical Geo-Enabled Dynamic Human Network Analysis
Statistics; TAMU; https://hdl.handle.net/20.500.14641/352; National Science Foundation

Recently, new tracking and sensor technologies such as the Global Positioning System (GPS) have been deployed on mobile objects to collect tracking positions with high spatio-temporal resolution. The availability of these massive amounts of tracking data brings great opportunities to many application fields that rely on knowledge of human movement. However, the size and the complex spatio-temporal dynamics of the data also impose challenges for statistical modeling and computation. There is a pressing need to develop computationally efficient quantitative models to handle massive spatio-temporal human trajectory data. This project combines theoretical methods and computational approaches to develop novel statistical models, along with efficient algorithms, to meet the increasing demand for efficient analytical tools for massive human trajectory data. The project has broad impact on multiple interdisciplinary fields: the results can be applied to a wide range of practical and important problems including military and national security operations, urban planning, transportation management, traffic forecasting, public health, and social behavioral studies. The increasing use of GPS and other location-aware devices has led to an increasing amount of available human trajectory data at high spatio-temporal resolution. Analysis of such data provides invaluable information for many important research problems in different fields. This project will focus on the following research thrusts.
First, a new class of trajectory models at the individual level will be developed to describe individual movement behavior in both space and time. With this method, trajectory data are denoised and compressed into a segmented representation, with a different homogeneous movement state within each segment. In addition, a spatio-temporal point process model is developed to recognize important and complex movement patterns from the segmented trajectories. Extensions beyond the individual trajectory model are then pursued to develop a new class of population-level trajectory models that involve a latent dynamic network to describe interactions among individual movements in space and time. Both individual and population trajectory models are carefully designed to allow scalable parallel and online inference algorithms for efficient, near-real-time computation. Finally, the developed methods are used to solve a problem in urban planning with human movement data collected from GPS.

Research Project: ATD: Collaborative Research: Predicting the Threat of Vector-Borne Illnesses Using Spatiotemporal Weather Patterns
Statistics; TAMU; https://hdl.handle.net/20.500.14641/620; National Science Foundation

Vector-borne diseases affect virtually everyone on earth. Mosquitoes are the most widely distributed disease vectors and are a serious threat to human life and health. West Nile virus (WNV) is one of the mosquito-borne diseases for which there is still no effective treatment; to date, the Centers for Disease Control and Prevention has reported over 40,000 cases across the United States. Temperature and precipitation are the two most important weather variables that affect mosquito populations and thus the WNV transmission cycle. The mosquito infection rate (MIR) is considered an important mediator for studying WNV risk.
Based on surveillance data for WNV in Illinois, this project aims to develop new methodologies and algorithms to study WNV and MIR using weather and environmental variables. Specifically, the investigators plan first to predict MIR and then to characterize the spatial pattern of temperature and precipitation in order to identify the risk level of WNV human illness and MIR. They will also establish a WNV Index to provide a reliable and interpretable warning for vector-borne disease risk. Finally, since mosquito-borne diseases are particularly affected by rising temperatures, changing precipitation patterns, and a higher frequency of extreme weather events, the project aims to project current risk into the future under climate change, both quantitatively and qualitatively. The research will foster fundamental statistical methodology development as well as collaborations between statistics and public health. Graduate and undergraduate students will be engaged in aspects of the scientific research. The project will provide new results on the impact of climate change on national security that are of general interest and importance to the wider public and policymakers. The methods of this project include a spatially-varying-coefficient model with functional weather covariates to predict MIR, as well as a multiple-testing approach to characterize the spatial pattern of temperature and precipitation and ultimately classify weather patterns into different risk levels with respect to WNV. The statistical models and algorithms learned from the historical data will be applied to downscaled future weather data to study the impact of climate change on WNV human illness and MIR. The analyses will be based on massive data including WNV human cases, MIR, current and future spatio-temporal stochastic weather processes, land cover, and the length of daylight.
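As a toy illustration of the varying-coefficient idea in this abstract, the sketch below (all data simulated and hypothetical; a single scalar covariate stands in for the functional weather covariates, and location is one-dimensional) fits a model whose regression coefficient changes with location, using ordinary least squares on a basis expansion of the coefficient:

```python
import numpy as np

# Hypothetical data: a response at n sites, one scalar weather covariate x,
# and a true coefficient that varies linearly with spatial location s.
rng = np.random.default_rng(0)
n = 500
s = rng.uniform(0.0, 1.0, n)        # spatial location (1-D for simplicity)
x = rng.normal(size=n)              # weather covariate (e.g., a temperature summary)
beta_true = 1.0 + 2.0 * s           # spatially varying coefficient beta(s)
y = beta_true * x + 0.1 * rng.normal(size=n)

# Expand beta(s) = b0 + b1*s, so the model becomes y = b0*x + b1*(s*x) + noise,
# an ordinary least-squares problem in (b0, b1).
X = np.column_stack([x, s * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates of (b0, b1), near the true values (1.0, 2.0)
```

The project's actual model is far richer (functional covariates, a genuine spatial domain), but the same trick — turning a coefficient surface into extra regression columns — underlies many varying-coefficient fits.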
The statistical methods used in the project are not only effective for this WNV study but can serve as a general methodology for a wide range of vector-borne diseases. The spatially-varying-coefficient model with functional covariates takes into account the continuous and dynamic influence of retrospective weather on MIR, while allowing the relationship between MIR and weather and other environmental variables to vary over a spatial domain. The characterization of the spatial weather pattern and the establishment of the WNV Index provide a new perspective for studying and preventing WNV risk. Compared to previous methods that evaluate the difference between two spatio-temporal random fields as a whole, the multiple-testing approach in this project can detect exactly where the differences occur. This feature is crucial for regional risk detection. Quantifying the impact of climate change on vector-borne diseases is essential to policymakers; the results of the project are expected to provide a reliable resource for such purposes. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project: CAREER: Data Assimilation for Massive Spatio-Temporal Systems Using Multi-Resolution Filters
Statistics; TAMU; https://hdl.handle.net/20.500.14641/215; National Science Foundation

The research supported by this award will produce powerful and scalable open-source software for data assimilation in large spatio-temporal systems with varying degrees of nonlinearity. It will lead to improved inference, forecasts, diagnostics, downscaling, and calibration using data assimilation in many fields of science with direct impact on society, including weather forecasting, climate studies, renewable energy, and pollution monitoring. Despite the great importance and highly statistical nature of data assimilation, few statisticians are involved in this research area.
Thus, the educational component of this project revolves around bridging the gap between the statistics and data-assimilation communities and getting more statisticians involved in the latter. The principal investigator will develop approaches for filtering inference on high-dimensional states that can outperform existing methods in linear and nonlinear settings. The novel approaches are based on the multi-resolution approximation, a state-of-the-art method for spatial covariance approximation that employs many adaptive, compactly supported basis functions at multiple resolutions. Algorithmic implementations of the methods are highly scalable and can take full advantage of massively parallel high-performance computing systems. Validation, testing, and comparison of the methods will be carried out using realistic observations simulated from models of varying complexity.
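For readers new to data assimilation, the textbook baseline that multi-resolution filters scale up is the Kalman filter. The sketch below (hypothetical one-dimensional example, not the project's algorithm) shows the forecast/analysis cycle on a random-walk state observed with noise:

```python
import numpy as np

# One-dimensional linear-Gaussian state-space model: the state follows a
# random walk and is observed with noise. The Kalman filter alternates a
# forecast (predict) step with an analysis (update) step.
rng = np.random.default_rng(1)
T = 200
q, r = 0.1, 1.0                               # process / observation noise variances
truth = np.cumsum(np.sqrt(q) * rng.normal(size=T))
obs = truth + np.sqrt(r) * rng.normal(size=T)

m, P = 0.0, 10.0                              # prior mean and variance
filtered = np.empty(T)
for i, y in enumerate(obs):
    P = P + q                                 # forecast: uncertainty grows
    K = P / (P + r)                           # Kalman gain
    m = m + K * (y - m)                       # analysis: shift toward the observation
    P = (1.0 - K) * P
    filtered[i] = m

rmse_filter = np.sqrt(np.mean((filtered - truth) ** 2))
rmse_obs = np.sqrt(np.mean((obs - truth) ** 2))
print(rmse_filter, rmse_obs)  # the filter should beat the raw observations
```

In realistic geophysical settings the state has millions of dimensions, so the covariance P cannot be stored or updated exactly; the multi-resolution approximation described in this abstract is one way to make that step tractable.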
Research Project: Collaborative Research: New Bayesian Methods for Modeling the Effect of Antiretroviral Drugs on Depressive Symptomatology in HIV Patients
Statistics; TAMU; https://hdl.handle.net/20.500.14641/605; National Science Foundation

Antiretroviral therapy (ART) has transformed HIV infection into a manageable chronic disease, thereby shifting the focus of care for people living with HIV more toward controlling the adverse effects of ART. Depression is the leading mental health comorbidity of HIV infection and may trigger negative consequences such as poor adherence to ART, more rapid HIV disease progression, and engagement in risky behaviors. Since ART is recommended for all HIV patients and must be continued indefinitely, minimizing its adverse effects has garnered increasing attention. Due to the rapid generation of drug-resistant mutations, modern ART typically combines three or four drugs with different mechanisms or against different targets. Understanding the effects of a single ART drug or of drug combinations can help physicians better manage patients' depression, guide treatment changes if needed, and facilitate individualized treatment.
This project aims to fill a critical gap in the availability of appropriate statistical models to systematically investigate the effects of ART on depression. Recent technological advances in the biomedical field have led to rapid accumulation of health- and disease-related data, which provide researchers with an unprecedented opportunity to make reliable and efficient inference from these complex and heterogeneous datasets using novel statistical models. This project will use data from the Women's Interagency HIV Study (WIHS), a prospective, observational, multi-center study which includes more than 4,000 women living with HIV or at risk for HIV infection in the United States. This project aims to develop novel Bayesian parametric and nonparametric models to estimate the effects of ART based on patients' longitudinal medication data and depression outcomes, adjusting for socio-demographic, behavioral, and clinical factors. Specifically, a new Bayesian longitudinal graphical model will be developed with nodes representing drugs and depression items, and weighted edges representing the strength of the drug-depression relationships, which may vary across different clinical visits and different patients. In addition, a novel Bayesian framework that incorporates the similarity between different drug combinations as well as accounts for patients' treatment histories will be developed to learn arbitrary drug combination effects. The proposed work will bridge the gap between the experience/knowledge acquired during basic research and day-to-day practice by facilitating the understanding of the adverse effects of individual drugs, guiding more informed and effective treatment regimen selection, and eventually helping to reduce the healthcare resource burden. 
The proposed models can be easily generalized to study other ART-related complications such as cognitive impairment, and may also be used in a wide range of applications across multiple biomedical fields and beyond, such as electronic health record data analysis for chronic conditions, the study of combination therapy for cancer treatment, and injury prevention in sports medicine. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project: Collaborative Research: New Directions in Multidimensional and Multivariate Functional Data Analysis
Statistics; TAMU; https://hdl.handle.net/20.500.14641/501; National Science Foundation

Functional data analysis, which deals with a sample of functions or curves, plays an important role in modern data analysis. In the era of "Big Data", multidimensional and multivariate functional data are becoming increasingly common, especially in biological, medical, and engineering applications. The very large dimension and complex structure of these data pose significant challenges. The proposed research will substantially narrow the gap between the increasing demand for handling such data in practice and the insufficient development of statistical methods and computational tools. This research has applications to neuroscience, climate science, and engineering. It will provide scientists, engineers, and doctors with tools to help understand problems in their areas, and enhance interdisciplinary collaborations. This project offers a comprehensive research plan to advance the understanding and applicability of multidimensional and multivariate functional data.
The research will focus on the following three sub-projects: (1) develop a data-adaptive and interpretable representation of the covariance function for multidimensional functional data; (2) develop a novel model-free procedure to detect dependency between components of multivariate functional data; and (3) address the modeling and prediction of multivariate functional time series. The resulting methods will be applied to neuroimaging and climate data. The integration of these three sub-projects will foster creative directions and strategies for multidimensional and multivariate functional data. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project: Collaborative Research: Scalable Bayesian Methods for Complex Data with Optimality Guarantees
Statistics; TAMU; https://hdl.handle.net/20.500.14641/395; National Science Foundation

Spectacular advances in data acquisition, processing, and storage present the opportunity to analyze datasets of ever-increasing size and complexity in various applications, such as social and biological networks, epidemiology, genomics, and Internet recommender systems. Underlying the massive size and dimension of these data, there is often a parsimonious structure. The Bayesian approach to statistical inference is attractive in this context for incorporating structural assumptions through prior distributions, enabling probabilistic modeling of complex phenomena, and providing an automatic characterization of uncertainty. This research project aims to advance the elicitation and translation of prior knowledge regarding the low-dimensional skeleton of big data, providing realistic uncertainty characterizations while maintaining computational efficiency. Bayesian computation poses a substantial challenge in high-dimensional and big-data problems.
The research aims to develop cutting-edge computational strategies and software packages, with implementations to be made publicly available. The project involves graduate students in the research. The research project focuses on theoretical foundations and computational strategies for Bayesian methods in high-dimensional and big-data problems motivated by applications in social networks and epidemiology. Techniques for systematically developing and evaluating prior distributions in high-dimensional problems will be investigated, with a special emphasis on the trade-off between statistical efficiency and computational scalability. Specific directions include efficient algorithms for posterior sampling with shrinkage priors, a theoretical framework for divide-and-conquer strategies in big-data problems, fast algorithms for clustering nodes in large networks with an unknown number of communities, and methods for discovering structure in sparse contingency tables. The algorithms will be motivated by a rigorous theoretical understanding of the behavior of the posterior distribution, with a particular emphasis on proper quantification of uncertainty in a distributed computing framework. Software will be developed for each application.

Research Project: Collaborative Research: Scalable Bayesian Methods for Complex Data with Optimality Guarantees
Statistics; TAMU; https://hdl.handle.net/20.500.14641/662; National Science Foundation

Spectacular advances in data acquisition, processing, and storage present the opportunity to analyze datasets of ever-increasing size and complexity in various applications, such as social and biological networks, epidemiology, genomics, and Internet recommender systems. Underlying the massive size and dimension of these data, there is often a parsimonious structure.
The Bayesian approach to statistical inference is attractive in this context for incorporating structural assumptions through prior distributions, enabling probabilistic modeling of complex phenomena, and providing an automatic characterization of uncertainty. This research project aims to advance the elicitation and translation of prior knowledge regarding the low-dimensional skeleton of big data, providing realistic uncertainty characterizations while maintaining computational efficiency. Bayesian computation poses a substantial challenge in high-dimensional and big-data problems. The research aims to develop cutting-edge computational strategies and software packages, with implementations to be made publicly available. The project involves graduate students in the research. The research project focuses on theoretical foundations and computational strategies for Bayesian methods in high-dimensional and big-data problems motivated by applications in social networks and epidemiology. Techniques for systematically developing and evaluating prior distributions in high-dimensional problems will be investigated, with a special emphasis on the trade-off between statistical efficiency and computational scalability. Specific directions include efficient algorithms for posterior sampling with shrinkage priors, a theoretical framework for divide-and-conquer strategies in big-data problems, fast algorithms for clustering nodes in large networks with an unknown number of communities, and methods for discovering structure in sparse contingency tables. The algorithms will be motivated by a rigorous theoretical understanding of the behavior of the posterior distribution, with a particular emphasis on proper quantification of uncertainty in a distributed computing framework.
Software will be developed for each application.

Research Project: Equilibrium in Multivariate Nonstationary Time Series
Statistics; TAMU; https://hdl.handle.net/20.500.14641/475; National Science Foundation

Nonstationary time series systems appear routinely in economics, seismology, neuroscience, and physics, where stationarity is usually synonymous with equilibrium. Such systems are usually multidimensional, and their modeling, prediction, and control have tremendous social and scientific impacts. Isolating and identifying equilibrium or stationary features is of fundamental importance in the prediction and control of such systems. This research project aims to develop methodologies for extracting aspects of multivariate nonstationary processes that display a sense of equilibrium or stationarity. It is interdisciplinary in nature and has immediate applications to the analysis of economics, seismology, and neuroscience data. A graduate student will be involved in the research. This research project aims to elevate the concept and theory of cointegration from multivariate integrated time series rooted in economic theory to the more general multivariate nonstationary time series setup in probability and statistics. In spite of its central role in econometrics over the last four decades and its well-founded motivations in economics, cointegration theory suffers from the requirements that the series be integrated (unit-root nonstationary) and satisfy a vector autoregressive moving average model. The goal of this project is to avoid such restrictions and focus on general multivariate nonstationary time series. Three distinct methods for computing analogues of cointegrating vectors and the cointegrating rank will be developed. The first is a time-domain method in line with the classical (Johansen's) approach, relying on reduced-rank regression and likelihood ratio tests. The second method is in the spectral domain and relies on the idea of projection pursuit.
It searches for coefficients of candidate linear combinations by minimizing a projection index measuring the discrepancy between time-varying and constant spectral density functions. The third method is concerned with a time-varying cointegration setup where the coefficients are piecewise constant over time. Its successful implementation rests on a good solution to the problem of change-point detection for nonstationary processes, and a novel solution is explored in this research. The results will have immediate impact in settings where multivariate time series data are collected, such as financial markets, epidemiology, environmental monitoring, and global change.

Research Project: Innovative Statistical Models for Development of First Huntingtons Disease Progression Risk Assessment Tool
Statistics; TAMU; https://hdl.handle.net/20.500.14641/569; DHHS-NIH-National Institute of Neurological Disorders and Stroke

Huntington's disease (HD) is a progressive, neurodegenerative disorder that can be genetically diagnosed years before the onset of clinical symptoms. This presents groundbreaking opportunities to learn the overall, dynamic progression of HD, which is critical to the timing of therapeutic interventions and the design of effective clinical trials. Despite advancements in this area, significant gaps exist regarding the transitional period from premanifest to manifest HD, particularly how and when overt clinical symptoms and neurological deterioration develop. As part of the candidate's long-term goal to become an independent, lead expert biostatistician for neurodegenerative diseases, the overarching goal of this K01 is to acquire training in the disease-related background and quantitative analytical skills to develop innovative methods that target new discoveries of HD progression. The candidate, Dr. Tanya P.
Garcia, is a Huntington's Disease Society of America (HDSA) Human Biology Project Fellow (2013-2015) and has assembled a team of outstanding mentors and collaborators who will provide training to acquire the skills she lacks for an independent, biostatistically focused neuroscience career. Her two primary mentors are Dr. Karen Marder and Dr. Raymond J. Carroll. Dr. Marder is the Sally Kerlin Professor of Neurology at Columbia University, with over 300 publications in behavioral neurology, neuroepidemiology, and neurodegenerative diseases including Huntington's, Alzheimer's, Parkinson's, and HIV dementia. Dr. Carroll is Distinguished Professor of Statistics at Texas A&M University, with over 400 publications and 5 books in multiple areas of statistics, particularly those needed for this proposal. To conduct high-level research that fills significant gaps in knowledge of HD progression, Dr. Garcia proposes in-depth training (i) to learn the latest developments and challenges in the clinical and neurological understanding of HD in order to fine-tune statistical methodology; (ii) to obtain proficiency in the analysis of correlated, longitudinal, big data; and (iii) to develop programming expertise to make the proposed methods accessible to neuroscience investigators in user-friendly software. Training in these areas directly supports Dr. Garcia's research aims, which are (i) to improve prediction of HD motor diagnosis by modeling the time-varying effects of multiple clinical performance measures; (ii) to improve identification of disease-relevant brain regions in relation to HD motor diagnosis by modeling the spatial-temporal brain structure; and (iii) to develop the first generation of an HD Progression Risk Assessment Tool (HD-PRAT).
Expected research outcomes include models that support President Obama's Precision Medicine Initiative in that they adhere to two of the "4P's" of the NIH New Strategic Vision of Medicine: they will offer promising ways to Predict the pattern and intensity of an individual's clinical and neurological changes over time, and they will increase the capacity to Personalize early intervention based on these learned predictions. Having the models available in user-friendly HD-PRAT is of high

Research Project: Leveraging Covariate and Structural Information for Efficient Large-Scale and High-Dimensional Inference
Statistics; TAMU; https://hdl.handle.net/20.500.14641/620; National Science Foundation

The proliferation of big data is accompanied by a vast number of questions, in the form of hypothesis tests, which call for effective methods to conduct large-scale and high-dimensional inference. These methods must involve statistical analysis of many study units simultaneously. Conventional simultaneous inference procedures often assume that hypotheses for different units are exchangeable. However, in many scientific applications, external covariate and structural information regarding the patterns of signals is available. Exploiting such side information efficiently and accurately will lead to improved statistical power, as well as enhanced interpretability of research results. The main thrust of this research is to advance statistical methodologies and theories for large-scale and high-dimensional inference, with a particular focus on integrating potentially useful external covariate and structural information into inferential procedures. This research aims to develop innovative methodologies and theories to address several significant problems in large-scale and high-dimensional inference.
In Project 1, the PI will introduce a new multiple testing procedure that can automatically select relevant covariates to improve efficiency when a large number of external covariates are available. In Project 2, the PI will develop a new multiple testing framework that can integrate various forms of structural information. Because prior information is seldom perfectly accurate, a particular focus will be on developing procedures that are robust to misspecified or imperfect prior information. In Project 3, the PI will propose new procedures for simultaneous inference in high-dimensional regression with side information. The statistical tools will be used to identify skilled fund managers, assess the performance of climate field reconstructions, and analyze genomic data in an integrative way. Methods and computer code developed will be made publicly available. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project: Prior Calibration and Algorithmic Guarantees Under Parameter Restrictions
Statistics; TAMU; https://hdl.handle.net/20.500.14641/662; National Science Foundation

Statistical learning of many real systems can be significantly enhanced by harnessing and translating domain knowledge into meaningful parameter restrictions. With the advent of high-throughput datasets, such restrictions are often present on high-dimensional parameter spaces, thereby complicating inference. This research aims to develop novel statistical methods and computational algorithms for such problems, drawing motivation from a number of real applications. Working within a nonparametric Bayes framework, the first part of the research project emphasizes the importance of calibrating prior distributions in these constrained problems and theoretically quantifying the impact of the constraints on parameter learning.
The second part aims to develop efficient Markov chain Monte Carlo and variational algorithms and to analyze their convergence behavior for these problems. The PIs will also propose undergraduate courses that focus on the modeling and applied components of Bayesian methods. When teaching the courses, the PIs will use examples from daily life as well as scientific examples across different disciplines to inspire students' learning. The Activity-Based Learning (ABL) courses aim to enrich students' academic experience and learning outcomes by connecting theory with practice and concepts with methods, using data and insights obtained through engagement with the larger world.

The research project is motivated by statistical and computing challenges posed by a number of real scientific applications in which various complex restrictions are placed on key parameters, necessitating novel statistical methods and associated computational algorithms. Operating in a Bayesian paradigm, which enables incorporation of various constraints in a principled framework and provides readily available uncertainty estimates often sought in scientific applications, a major emphasis will be placed on the calibration of prior distributions on these constrained spaces. Examples will be provided where seemingly innocuous prior choices routinely used in practice can lead to biased inferences in specific situations. A rigorous theoretical understanding of such phenomena will be provided, along with the development of alternative default priors on these constrained spaces. The methodological and theoretical developments will be accompanied by efficient computational algorithms, using novel approximation techniques in the context of Markov chain Monte Carlo and variational methods, that meet the scalability demanded by the specific applications and beyond. The algorithm development will be paralleled by novel convergence analysis, bridging ideas between the optimization and sampling literatures.
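For intuition only, the flavor of posterior sampling under a parameter restriction can be sketched with a minimal random-walk Metropolis sampler in which a mean parameter is constrained to be nonnegative. This is a generic toy illustration, not one of the project's algorithms; the data, step size, and truncated-flat prior are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations whose mean is known a priori to be nonnegative.
y = rng.normal(loc=1.2, scale=1.0, size=50)

def log_post(theta):
    """Log posterior: N(theta, 1) likelihood with a flat prior truncated to [0, inf)."""
    if theta < 0:                       # the parameter restriction
        return -np.inf
    return -0.5 * np.sum((y - theta) ** 2)

# Random-walk Metropolis: proposals violating the constraint are rejected outright.
theta, chain = 1.0, []
for _ in range(5000):
    prop = theta + 0.3 * rng.normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

samples = np.array(chain[1000:])        # drop burn-in
print(samples.mean(), samples.min())
```

Because proposals outside the constrained region receive zero prior mass, every retained draw satisfies the restriction; the prior-calibration questions studied in this project concern how such truncation choices can subtly bias the resulting inference.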
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project Records of Fused and Assimilated Satellite Carbon Dioxide Observations and Fluxes From Multiple Instruments SRS# 1707951
Statistics; TAMU; NASA - Jet Propulsion Lab
Statement of Work: The Texas A&M team consists of Dr. Matthias Katzfuss, an assistant professor in the Department of Statistics. As part of this project, he would work with Dr. Braverman and Dr. Nguyen on fusing XCO2 and other ancillary fields from different instruments. He would lead the effort on (1) a parallel version of the SSDF algorithm and (2) compression of the error covariance matrix so that it can be provided as part of the fused and gap-filled products. Additionally, he would collaborate with Dr. Hobbs on validation and modification of the EFDR algorithm for identifying spatio-temporal anomalies in the flux maps obtained from DA systems.

Research Project Regression with Time Series Regressors
Statistics; TAMU; https://hdl.handle.net/20.500.14641/556; National Science Foundation
The projects to be investigated are motivated by two different problems. It is well known that sea ice in the Arctic is receding, and changes in climate are thought to be a contributing factor. It is therefore important to understand if, when, and how daily temperatures in the Arctic region affect sea ice. The second motivation comes from neuroscience: is it possible to predict the decision a person makes based on biometric and neurophysiological responses observed over time, such as eye dilation or an EEG? The objectives in the two examples are very different; however, they are bound by a common theme: to understand how data observed over time (usually called a time series) affects an outcome of interest.
These problems fall under the canopy of regression, a widely researched area of statistics. However, what distinguishes these problems from the classical regression framework is that the regressors have a well-defined structure, which is rarely exploited by classical regression techniques. By modelling the time series, methods will be developed that exploit its structure, facilitating estimation in models where it would otherwise not be possible. The approach will be used to identify salient periods in the time series that have the greatest impact on the outcome and can be used to physically interpret the data.

In recent years, there has been a growing number of data sets, from a wide spectrum of applications ranging from the neurosciences to the geosciences, where an outcome is observed together with a time series that is believed to influence it. Despite the clear need in applications, there exist surprisingly few results that exploit the properties of a time series in the prediction of outcomes. This project will bridge this gap by developing regression methods that utilize the fact that the regressors form a time series or are spatially dependent. To achieve these aims, many new statistical methods will be developed. In signal processing, deconvolution methods are often used to estimate the parameters of a two-sided linear filter because deconvolution is computationally fast to implement. However, there has been little exploration of deconvolution methods within the framework of estimating regression parameters. This project will develop deconvolution techniques for (i) linear regression models and (ii) generalized linear models, when the regressors are stationary or locally stationary. The focus will be on the realistic situation where the time series or spatial data is far larger than the number of responses.
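The frequency-domain deconvolution idea can be sketched in a toy simulation. The cross-moment curve g(t) = E[x_t y] is (approximately, under a circular-stationarity approximation) the convolution of the coefficient curve with the regressor autocovariance, so dividing Fourier transforms recovers the coefficients. Everything below, including the AR(1) regressor model and the Gaussian-bump coefficient curve, is a hypothetical illustration, not the estimators this project will develop.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2000, 64            # many short regressor series, one scalar response each

# Hypothetical data: stationary AR(1) regressor curves, smooth coefficient curve.
x = np.empty((n, T))
x[:, 0] = rng.normal(size=n) / np.sqrt(1 - 0.5 ** 2)    # stationary start
for t in range(1, T):
    x[:, t] = 0.5 * x[:, t - 1] + rng.normal(size=n)
x -= x.mean(axis=0)

beta = np.exp(-0.5 * ((np.arange(T) - 20) / 5.0) ** 2)  # "true" coefficient curve
y = x @ beta + 0.5 * rng.normal(size=n)

# Deconvolution: g(t) = E[x_t y] is (circularly, approximately) beta convolved
# with the autocovariance, so fft(g) = fft(beta) * spectral density.
g = x.T @ y / n                                          # estimated cross-moments
spec = np.mean(np.abs(np.fft.fft(x, axis=1)) ** 2, axis=0) / T  # averaged periodogram
beta_hat = np.fft.ifft(np.fft.fft(g) / spec).real        # divide and invert

print(np.corrcoef(beta, beta_hat)[0, 1])                 # agreement with the truth
```

The whole estimate costs only FFTs and one matrix-vector product, which is why deconvolution scales well when, as the abstract notes, each time series is far longer than the number of responses.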
Besides the computational simplicity of deconvolution, isolating the Fourier transform of the regression coefficients yields diagnostic tools for understanding the nature of the underlying regression coefficients. For example, the methods can tell whether the coefficients are smooth, contain periodicities, or are sparse. The project will develop inferential methods for the parameter estimators that allow for uncertainty quantification, construction of confidence intervals, and tests for linear dependence, including a new technique for estimating the variance of the regression coefficient estimators. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Research Project Scalable Methods For Classification Of Heterogeneous High-Dimensional Data
Statistics; TAMU; https://hdl.handle.net/20.500.14641/356; National Science Foundation
Recent technological advances have enabled routine collection of large-scale high-dimensional data in the biomedical fields. For example, in cancer research it is common to use multiple high-throughput technology platforms to measure genotype, gene expression levels, and methylation levels. One of the main challenges in the analysis of such data is the identification of key biological measurements that can be used to classify a subject into a known cancer subtype. While significant progress has been made in the development of computationally efficient classification methods to address this challenge, existing methods do not adequately take into account the heterogeneity across cancer subtypes and the mixed types of measurements (binary/count/continuous) across technology platforms. As such, existing methods may fail to identify relevant biological patterns. The goal of this project is to develop new classification methods that explicitly take into account the type and heterogeneity of measurements.
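The idea of classification with built-in variable selection can be illustrated with a shrunken-centroid toy classifier: soft-thresholding the class centroids makes most features drop out of the rule, so only the informative measurements drive the subtype assignment. This is a deliberate simplification in the spirit of regularized discriminant methods, not this project's methodology; the simulated data and the threshold value 0.4 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 100                       # many features, few informative ones

# Hypothetical two-class data: only the first 5 features differ in mean.
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[labels == 1, :5] += 1.5

# Shrunken-centroid classifier: soft-threshold the class centroid offsets,
# so uninformative features are zeroed out (variable selection).
mu = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
overall = X.mean(axis=0)
d = mu - overall                                         # per-class centroid offsets
shrunk = np.sign(d) * np.maximum(np.abs(d) - 0.4, 0.0)   # soft threshold
centroids = overall + shrunk

# Features surviving the threshold are the "selected" variables.
selected = np.where(np.any(shrunk != 0.0, axis=0))[0]

# Classify each sample to its nearest shrunken centroid.
pred = np.argmin(
    ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1
)
print(len(selected), (pred == labels).mean())
```

Features shrunk to zero leave both centroids identical, so they cancel out of the distance comparison; the abstract's methods go much further by handling unequal covariances and mixed binary/count/continuous measurements.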
While the primary focus is on methodology, high priority will be given to computational considerations and software development to encourage dissemination and ensure ease of use for domain scientists. Regularized linear discriminant methods are commonly used for simultaneous classification and variable selection because of their interpretability and computational efficiency. These methods, however, rely on unrealistic assumptions of equal group-covariance matrices and normality of measurements. This project aims to address the limitations of current discriminant approaches and has three objectives: (1) to develop computationally efficient quadratic classification rules that perform variable selection; (2) to generalize the discriminant analysis framework to non-normal measurements; and (3) to develop a classification framework for mixed-type data coming from multiple technology platforms collected on the same set of subjects. The key methodological innovation is the combination of sparse low-rank singular value decomposition, which enables computational efficiency, with a geometric interpretation of linear discriminant analysis, which allows for the construction of nonlinear classification rules by redefining the space for discrimination.

Research Project World Meeting of the International Society for Bayesian Analysis 2022
Statistics; TAMU; https://hdl.handle.net/20.500.14641/215; National Science Foundation
This award provides travel support for US-based participants in the 2022 World Meeting of the International Society for Bayesian Analysis (ISBA), to be held from June 25 to July 1, 2022, in Montreal, Canada. The conference themes are theory, modeling, and applications of Bayesian statistics. The focus of this award is on funding for junior researchers (graduate students and postdoctoral researchers) from U.S.-based institutions to travel to the conference. Emphasis will be placed on supporting women and members of underrepresented groups.
Participating in the conference will inform junior statisticians about key problems and methods that shape research in modern Bayesian statistics and provide them with opportunities to learn from more established researchers and to build mentoring and collaborative relationships. More information on the conference is available on the meeting web page: https://isbawebmaster.github.io/ISBA2022/

Statisticians play an indispensable role in making decisions based on noisy, complex data structures. Statistical methods find application in myriad areas, including biomedical research, environmental science, finance, marketing, psychology, public health, and genomics. Within statistics, the Bayesian approach offers many appealing features: Bayesian methods provide a coherent framework for integrating information from different sources and communicating findings and conclusions using probabilities; Bayesian hierarchical models can capture different sources of variability in data and processes; relevant knowledge can be incorporated easily; and all relevant uncertainties are coherently propagated and incorporated into the final inference. Consequently, Bayesian analyses are commonplace across a wide variety of application areas. ISBA 2022 will bring together a diverse international community of researchers and practitioners who develop and use Bayesian statistical methods to share recent findings, exchange ideas, and discuss new, challenging questions. As a truly international meeting, it will provide participants with access to ideas and colleagues from other countries with whom they may not ordinarily interact. Meeting organizers hope to provide a venue that facilitates the exchange of ideas and cross-fertilization, is welcoming to young researchers, and promotes collaborations and interactions.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.