Developing Methods for Clustering Longitudinal Mixed-Type Data for Comparative Effectiveness Research

Health science and public health research have been transformed by technological advances, including those related to the collection and storage of massive amounts of data. These data are diverse in nature, in measurement scale and in form. For example, some are measured on categorical and some on interval/ratio scale (diversity in measurement scale), while other data are represented numerically or via images and text descriptions, such as doctors’ notes (diversity in form). Thoughtful processing and analysis of such heterogeneous data can enable the extraction of information and insights that may be useful in informing patient-centered treatment plans.

We are especially concerned with data from long-term studies, which include information collected and recorded from the same participants over long time periods (that is, longitudinal data). For this type of data, techniques to analyze heterogeneous scale and type datasets are not widely available. We propose novel methods of analysis by: 1. Providing the tools to obtain well-characterized data sets that integrate multi-modal heterogeneous longitudinal measurements, 2. Developing methods and systems, including software, based on strong statistical and mathematical foundations that allow informative, reproducible analysis of the created datasets. To achieve our goals, we integrate modern artificial intelligence tools (Large Language Models) with statistical methods of experimentation to transform and integrate the different data types into computable entities for use in downstream analyses. We concentrate on thoroughly studying the development of datasets and clustering methods, and study the reproducibility of all methods – data creation and data clustering – employing a variety of metrics. We develop and disseminate software solutions that encode our methods and demonstrate their use in the identification of social determinants of health that are potentially important factors in the development of treatment plans for patients with opioid use disorder.
Clinical Relevance: The new methods will aid the development of more comprehensive treatment plans for treating opioid use disorder and substance use in general. Furthermore, they will enable a comprehensive evaluation of programs and will provide evidence for thoughtful allocation of resources for combating the opioid use disorder.
Significance: Our work will improve the use of artificial intelligence and machine learning in clinical research. Our designs for text2data extraction consider the structure of LLMs and leverage their current limitations to obtain datasets for use in downstream analysis. Furthermore, the novel clustering methods for mixed-type longitudinal and cross-sectional data contribute significantly to the toolkit of methods able to handle heterogenous measurement scale and form data. Finally, our methods of analyses are applicable to studies of numerous diseases, ensuring reproducibility of results and generalizability.