Drug Discovery/Personalized Medicine

Research Members: Sijia Liu, Taruna Seth

Relation extraction from biomedical texts


In the era of big data in the biomedical domain, understanding relations among multiple entities remains one of the major challenges in biomedical Natural Language Processing (NLP) for the efficient use of unstructured text, such as scientific literature, clinical notes, and narrative text in knowledge bases. Detecting relations among entities is of key relevance to machine understanding of unstructured text. Despite the considerable number of existing systems, most rely on fine-tuned supervised machine learning with heavy feature engineering, which limits their generalizability and model portability. To address these common issues in real-world NLP applications, we proposed unsupervised learning methods that extend Latent Dirichlet Allocation (LDA) to extract coreference relations [1] and drug-drug interactions [2]. Our proposed methods achieve results comparable to state-of-the-art systems without using gold-standard annotations. This research also addresses other challenges in biomedical NLP, such as corpus availability, feature representation, and model interpretability. We leverage a variety of machine learning methodologies, including semantic word embeddings and deep neural networks, to solve a wide range of relation extraction problems, such as event temporal relations [3] and other biomedical semantic relations [4,5].
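For illustration, here is a minimal sketch of the topic-modeling idea behind [2], not the published pipeline: the text between a candidate entity pair is treated as a document, LDA infers its topic distribution, and the dominant topic serves as an unsupervised signal for classifying the relation. The corpus, topic count, and topic-to-label mapping below are placeholders.

```python
# Minimal sketch: LDA topic distributions over the text between two drug
# mentions, used as unsupervised features for drug-drug interaction
# classification. Corpus and settings are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is the context window between a candidate entity pair.
contexts = [
    "increased risk of bleeding when administered with warfarin",
    "no significant pharmacokinetic interaction was observed",
    "co-administration may elevate serum concentrations of the drug",
]

vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(contexts)

# Fit LDA; the topic count is a tunable hyperparameter, not a published value.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_dist = lda.fit_transform(bow)  # rows: contexts, cols: topic proportions

# Unsupervised decision: assign each candidate pair to its dominant topic;
# topics can then be mapped to interaction types without gold annotations.
dominant_topic = topic_dist.argmax(axis=1)
print(dominant_topic)
```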

Related Publications

  1. S Liu, H Liu, V Chaudhary, D Li. An Infinite Mixture Model for Coreference Resolution in Clinical Notes. AMIA Summits on Translational Science Proceedings 2016, 428–437. pdf
  2. D Li*, S Liu*, M Rastegar-Mojarad, Y Wang, V Chaudhary, T Therneau, H Liu. A Topic-modeling Based Framework for Drug-drug Interaction Classification from Biomedical Text. AMIA Annual Symposium Proceedings 2016, 789–798 (*equal contribution). pdf
  3. S Liu, L Wang, D Ihrke, V Chaudhary, C Tao, C Weng, H Liu. Correlating Lab Test Results in Clinical Notes with Structured Lab Data: A Case Study in HbA1c and Glucose. AMIA Summits on Translational Science Proceedings 2017 (AMIA CRI 2017). pdf
  4. S Liu, F Shen, V Chaudhary, H Liu. MayoNLP at SemEval 2017 Task 10: Word Embedding Distance Pattern for Keyphrase Classification in Scientific Publications. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). (Top-ranked system in the evaluation scenario in which we participated; tweet)
  5. S Liu, F Shen, Y Wang, M Rastegar-Mojarad, R Komandur Elayavilli, V Chaudhary, H Liu. Attention-based Neural Networks for Chemical Protein Relation Extraction. BioCreative VI Workshop Proceedings.

Interaction Analysis

Characterization of the pharmacological signal transduction that leads to drug-induced expression of genes and proteins requires the capability to identify interactions among different potential predictor components, e.g., genomic, clinical, and environmental data. Our work primarily focuses on the effective characterization and detection of critical gene-gene and gene-environment interactions associated with outcomes of interest. The problem is very challenging because its computational complexity is combinatorial: there are n-choose-k ways of selecting a subset of k attributes for assessing interactions among n attributes, and each added dimension exponentially increases the hypervolume over which the data are distributed. This combinatorial growth makes it computationally infeasible to exhaustively search the full range of genetic and environmental (predictor) variables for potential interactions associated with diseases or outcomes in epidemiologic studies. Moreover, during statistical analysis of interactions with methods such as logistic regression, a further combinatorially explosive problem arises because many different models of the data must be evaluated to select the best model under a criterion that balances goodness of fit and parsimony, as the worked example below illustrates.
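To make the combinatorics concrete, here is a worked example with illustrative numbers: the number of candidate k-subsets grows as

```latex
% Number of k-subsets of n predictors; e.g., third-order interactions
% among a million predictors (illustrative numbers):
\binom{n}{k} = \frac{n!}{k!\,(n-k)!},
\qquad
\binom{10^{6}}{3} \approx \frac{\left(10^{6}\right)^{3}}{3!} \approx 1.7 \times 10^{17}
```

so even an exhaustive scan of third-order interactions over a million predictors is far beyond brute-force computation.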
Our approach to interaction analysis is unique in leveraging the remarkable properties of two complementary information-theoretic metrics, the k-way interaction information (KWII) and the Target-Associated Information (TAI), which can be viewed as multivariate generalizations of the Kullback-Leibler divergence. Our methodology comprises two key algorithms: (i) a directed search algorithm, AMBIENCE, that detects the most promising combinations strongly associated with the outcomes for which predictors are to be identified in the training set; and (ii) a modeling strategy, AMBROSIA, that uses the output of AMBIENCE to identify a parsimonious set of interactions capable of explaining the outcome. A wide variety of predictors can be included in the analysis. The information-theoretic foundation of AMBIENCE is novel in that it enables detection of both linear and non-linear dependencies in the data.
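For reference, the KWII follows the standard alternating-sum form of interaction information, and the TAI can be written analogously as a difference of total correlations. These are textbook forms consistent with the description above; our exact definitions may differ in detail.

```latex
% K-way interaction information over S = {X_1, ..., X_k, P}, where H(T)
% is the joint entropy of subset T; for k = 1 this reduces to the
% mutual information I(X_1; P):
\mathrm{KWII}(X_1;\dots;X_k;P) = -\sum_{T \subseteq S} (-1)^{|S|-|T|}\, H(T)

% A target-associated form as a total-correlation difference, with
% TCI(S) = \sum_{X \in S} H(X) - H(S):
\mathrm{TAI}(X_1,\dots,X_k;P) = \mathrm{TCI}(X_1,\dots,X_k,P) - \mathrm{TCI}(X_1,\dots,X_k)
```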
In this project, we focus on enabling in-database interaction analysis by integrating these novel information-theoretic methods into our big-data warehouse frameworks (Netezza and XtremeData), together with other scalable big-data processing technologies such as Hadoop, to facilitate efficient high-order interaction analysis of large-scale epidemiological datasets that is not feasible with traditional data frameworks.
Our current work focuses on the development of scalable in-database algorithms that perform all the analytic processing required for interaction analysis within the data-warehouse platforms, thereby eliminating the need for separate compute environments and their associated network I/O bottlenecks. We have already implemented the Phenotype-Associated Information (PAI) and KWII metrics on Netezza and Hadoop. The core functionality of our algorithms is provided through custom-developed UDXs (user-defined functions and aggregates) in Netezza. We have observed performance improvements of several orders of magnitude, and through this approach we have been able to handle extremely large datasets that could not be processed previously.
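At its core, such a UDX reduces to contingency counting plus an inclusion-exclusion over subset entropies. The following is a minimal Python sketch of that computation; the function names and toy data are ours for illustration, not the Netezza UDX interface.

```python
# Minimal sketch of the entropy bookkeeping behind a KWII aggregate.
# An in-database UDX would perform the same counting over grouped rows;
# names and toy data here are illustrative.
from itertools import combinations
from collections import Counter
import math

def joint_entropy(rows, cols):
    """Shannon entropy (bits) of the joint distribution of `cols`."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def kwii(rows, variables):
    """K-way interaction information via inclusion-exclusion over the
    joint entropies of every non-empty subset of `variables`."""
    k = len(variables)
    total = 0.0
    for r in range(1, k + 1):
        for subset in combinations(variables, r):
            total += (-1) ** (k - r) * joint_entropy(rows, subset)
    return -total  # for k = 2 this equals the mutual information

# Toy records: two SNP genotypes and a binary phenotype.
data = [
    {"snp1": 0, "snp2": 1, "pheno": 1},
    {"snp1": 1, "snp2": 0, "pheno": 0},
    {"snp1": 0, "snp2": 0, "pheno": 0},
    {"snp1": 1, "snp2": 1, "pheno": 1},
]
print(kwii(data, ("snp1", "snp2", "pheno")))
```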
Presently, we are also developing novel libraries for information-theoretic interaction analysis that leverage the capabilities of these technologies, specifically Netezza and Hadoop. We have implemented some initial information-theoretic measures and are currently evaluating their performance on different architectures.
In the coming months, we aim to augment our information-theoretic library and algorithms and to develop expert knowledge bases that represent the phenotypes under study and their known associations. Our research will also focus on novel ways to effectively utilize the derived information during interaction analysis. Once fully integrated with our data-warehouse platforms, we expect our algorithms to enable computational identification of the key high-order interactions among 10^6-10^7 predictors or variables. As part of this work, we intend to make our tools publicly available so that others can apply our interaction analysis methods to their own research. This will be accomplished through the deployment of highly available, efficient, web-enabled services that allow dynamic interaction analysis harnessing the capabilities of our data-warehouse appliances and deployed scalable technologies.

Vascular Blood Flow Dynamics

In this project, we investigate blood flow dynamics using computational fluid dynamics (CFD). The problem is particularly cogent because of the ubiquity of 3D vascular data; as a result, a number of groups are developing patient-specific methods for calculating, via CFD, the flow field and flow parameters associated with vascular abnormalities. High-order CFD calculations run relatively automatically, but they require several hours on a standard computer for steady-state solutions and tens of hours for pulsatile-flow solutions. Because the boundary conditions are not well known, the reliability of a specific result for specific boundary conditions is uncertain. It may be more sensible, and more clinically relevant, to obtain multiple CFD solutions so that trends across the range of solutions can be perceived; unfortunately, multiple CFD solutions dramatically increase the computational cost.

We have developed CFD code that is being ported to a GPU implementation, along with the conversion of our 3D CFD code. From the 2D investigations, the changes in flow patterns with changing parameters, such as percent stenosis, length of stenosis, viscosity, and velocity, appear to follow relatively simple relationships: differential changes in the parameters result in differential changes in the flow patterns (see figure below). These results bode well for the larger strategy of interpolation over a large database of precomputed solutions, sketched below.
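The interpolation strategy can be prototyped directly: given a database of precomputed CFD solutions indexed by the varied parameters, a new case is estimated by interpolating in parameter space. Here is a minimal sketch, with an illustrative parameter set, made-up values, and a SciPy interpolator standing in for the eventual GPU-backed database.

```python
# Minimal sketch of interpolating a flow quantity from a database of
# precomputed CFD solutions; all parameters and values are illustrative.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Each row: (percent stenosis, stenosis length [mm], viscosity [mPa·s],
# inlet velocity [cm/s]) -> e.g., peak wall shear stress from one CFD run.
params = np.array([
    [50.0,  5.0, 3.5, 20.0],
    [60.0,  5.0, 3.5, 20.0],
    [50.0, 10.0, 3.5, 20.0],
    [50.0,  5.0, 4.0, 20.0],
    [50.0,  5.0, 3.5, 30.0],
    [70.0, 10.0, 4.0, 30.0],
])
peak_wss = np.array([12.1, 18.4, 10.7, 12.9, 19.5, 33.0])  # made-up values

interp = LinearNDInterpolator(params, peak_wss)

# Estimate an unseen case without running a new CFD solution.
query = np.array([[55.0, 7.0, 3.7, 25.0]])
print(interp(query))
```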
In addition to the CFD calculations, we are analyzing angiograms to determine whether the flow field, or at least a good approximation of it, can be recovered from high-frame-rate angiographic acquisitions as contrast flows through the vessel. GPUs will be a critical component here: our initial investigations involve generating and analyzing simulated angiograms for a variety of situations, and with standard CPUs, generating a single angiographic sequence takes several minutes, with the flow-field analysis taking tens of minutes. We will port these calculations to GPUs in the near future, both to evaluate the various situations investigated with CFD and to move to an iterative approach for flow-field calculation that uses our initial flow-field estimate as a starting point.
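One standard building block for this kind of analysis is bolus transit-time estimation: cross-correlate the time-intensity curves at two points along a vessel and convert the lag to a velocity. The following is a minimal sketch on synthetic curves, not our full iterative method.

```python
# Minimal sketch: estimate flow velocity from the frame lag between
# time-intensity curves at two points along a vessel. Synthetic data;
# the full angiographic analysis is considerably more involved.
import numpy as np

frame_rate = 100.0   # frames per second (high-frame-rate acquisition)
distance_mm = 20.0   # distance between the two sample points

t = np.arange(200) / frame_rate
bolus = np.exp(-((t - 0.5) ** 2) / 0.01)   # contrast bolus at point A
lag_frames = 12
curve_a = bolus
curve_b = np.roll(bolus, lag_frames)        # same bolus arriving later at B

# Cross-correlate and take the lag with maximum correlation.
corr = np.correlate(curve_b - curve_b.mean(), curve_a - curve_a.mean(), "full")
est_lag = corr.argmax() - (len(curve_a) - 1)

transit_s = est_lag / frame_rate
velocity_mm_s = distance_mm / transit_s
print(f"estimated lag: {est_lag} frames, velocity: {velocity_mm_s:.1f} mm/s")
```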