Research Member: Chao Feng
The Library for Information Theoretic Metrics
Information theory is now applied in many areas to mine significant information from datasets, and many new information-theoretic measures have been proposed to address problems in specific domains such as genetics. To make better use of these metrics, we developed an information-theoretic metrics library on both the Hadoop platform and the Netezza parallel database. It computes metrics such as entropy, K-way interaction information, and statistical significance for analyzing genetic datasets. We compared performance on the two platforms; in general, the library performs better on Netezza than on Hadoop. The library currently concentrates on genetics and will be generalized and applied to other areas such as finance.
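To make the metrics concrete, here is a minimal single-machine sketch in Python of two of them: Shannon entropy and K-way interaction information. It is an illustration only, not the library's API. The function names `entropy` and `kwii` are hypothetical, and the sign convention (positive values indicate synergy) is an assumption, since conventions differ across the literature.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Shannon entropy (in bits) of the joint distribution of the given columns."""
    if not cols:
        return 0.0
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def kwii(rows, cols):
    """K-way interaction information as an alternating sum of subset entropies.

    Assumed convention: KWII(S) = -sum over non-empty T in S of
    (-1)^(|S|-|T|) * H(T), so for two variables it equals their
    mutual information.
    """
    k = len(cols)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(cols, size):
            total += sign * entropy(rows, subset)
    return -total


# Toy data: the third column is the XOR of the first two.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
pairwise = kwii(xor_rows, (0, 1))      # no pairwise dependence
three_way = kwii(xor_rows, (0, 1, 2))  # purely three-way dependence
```

On the XOR toy data no pair of columns shares information, yet the three columns together are fully dependent, which is the kind of higher-order interaction these metrics are meant to expose.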
Genomic and Environmental Interactions Analysis
In this project we implemented the AMBIENCE algorithm using Hadoop MapReduce. AMBIENCE, proposed in 2008 by Pritam Chanda et al., is a computationally efficient algorithm for identifying the informative variables involved in gene–gene interactions (GGI) and gene–environment interactions (GEI) that are associated with disease phenotypes. It uses a novel information-theoretic metric called phenotype-associated information (PAI) to search for combinations of genetic variants and environmental variables associated with the disease phenotype, and it detected GEI effectively and efficiently in simulated data sets of varying size and complexity. Scalable distributed platforms such as the Hadoop framework can offer better scalability for such problems. Our Hadoop implementation runs in a distributed fashion on the cluster at the Center for Computational Research at the University at Buffalo and is compared with the available implementation on the IBM Netezza data warehouse appliance.
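The general shape of such a search can be sketched in a MapReduce style. The Python below is an in-process illustration under stated assumptions, not the project's Hadoop code and not the AMBIENCE algorithm itself: the map step emits a count for every variable combination of a fixed order, the reduce step sums those counts, and each combination is then scored by the mutual information between its joint values and the phenotype. Mutual information is used here as a stand-in score; the actual PAI definition is given in the AMBIENCE paper and is not reproduced here. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import log2


def map_record(record, phenotype, order):
    """Map step: emit ((combo, values, phenotype), 1) for each variable combination."""
    for combo in combinations(range(len(record)), order):
        values = tuple(record[i] for i in combo)
        yield (combo, values, phenotype), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals


def score_combinations(totals, n):
    """Stand-in score: mutual information (bits) between each combination and the phenotype."""
    joint = defaultdict(dict)  # combo -> {(values, phenotype): count}
    for (combo, values, phenotype), count in totals.items():
        joint[combo][(values, phenotype)] = count
    scores = {}
    for combo, table in joint.items():
        x_counts = defaultdict(int)  # marginal counts of the variable values
        y_counts = defaultdict(int)  # marginal counts of the phenotype
        for (values, phenotype), count in table.items():
            x_counts[values] += count
            y_counts[phenotype] += count
        scores[combo] = sum(
            (c / n) * log2(c * n / (x_counts[v] * y_counts[p]))
            for (v, p), c in table.items()
        )
    return scores


def rank_combinations(records, phenotypes, order=2):
    """Run map, reduce, and scoring, then sort combinations by descending score."""
    pairs = (
        pair
        for record, phenotype in zip(records, phenotypes)
        for pair in map_record(record, phenotype, order)
    )
    totals = reduce_counts(pairs)
    scores = score_combinations(totals, len(records))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: phenotype is the XOR of variables 0 and 1; variable 2 is irrelevant.
records = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
phenotypes = [a ^ b for a, b, _ in records]
ranked = rank_combinations(records, phenotypes, order=2)
```

Because each record contributes its counts independently, the map step parallelizes across data splits, which is why this kind of combination counting suits a framework like Hadoop. The number of combinations grows quickly with the order, so a practical search has to prune candidates rather than enumerate them exhaustively as this sketch does.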