Big Data Architecture

Research Members: Taruna Seth, Chao Feng

Big data sets are overwhelming the discovery process in science, industry, and healthcare. Today, most scientific supercomputing is done on parallelized server-based machines. The standard practice for deriving information from raw data typically involves a data warehouse used exclusively for storage and a separate file system and compute environment for mining and analysis, so the data must be moved from the warehouse to the compute environment before analytics can begin. Data extraction, movement over the network, and replication constitute one of the most time-consuming phases of data analysis and of deploying analytical models.

This approach works well for small data sets and for multidimensional analytical or transaction-processing (OLAP, OLTP) tasks on small-scale data, because moving low volumes of data does not introduce significant network constraints. It does not scale to voluminous data sets, however: repeatedly replicating large data from the warehouse to the compute environment over the network becomes impractical because of disk I/O and network performance bottlenecks. Moreover, data replication and redundancy make it difficult to enforce data governance and regulatory policies in one place. The inherent benefits of the RDBMS underlying a traditional warehouse system, such as data integrity, security, availability, reliability, and task management, are also lost once the data is extracted from the relational database to an intermediate platform for mining and analytics. Consequently, organizations following this practice are forced to analyze only a small sample of the available raw data and to rely on inferences drawn from that small fraction, because of the network bottlenecks and data warehouse constraints.
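The minimal Python sketch below illustrates the contrast described above: pulling raw rows out of a database for client-side analysis versus pushing the aggregation into the database so that only the small result crosses the network. It is an illustration only, not our pipeline; sqlite3 stands in for a warehouse purely to keep the example self-contained, and the table and column names are hypothetical.

# Minimal sketch: extract-then-analyze vs. in-database aggregation.
# sqlite3 is a stand-in for a data warehouse; schema and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE measurements (patient_id INTEGER, marker TEXT, value REAL)")
cur.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(i, "M%d" % (i % 3), float(i % 7)) for i in range(10_000)],
)

# (a) Extract-then-analyze: every raw row crosses the client/server boundary.
rows = cur.execute("SELECT marker, value FROM measurements").fetchall()
sums, counts = {}, {}
for marker, value in rows:
    sums[marker] = sums.get(marker, 0.0) + value
    counts[marker] = counts.get(marker, 0) + 1
client_side = {m: sums[m] / counts[m] for m in sums}

# (b) In-database analytics: only the small aggregate result is moved.
in_db = dict(
    cur.execute("SELECT marker, AVG(value) FROM measurements GROUP BY marker")
)

assert all(abs(client_side[m] - in_db[m]) < 1e-9 for m in in_db)

Both paths give the same answer; the difference is how much data has to move, which is exactly what dominates cost at warehouse scale.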
Our core focus in this research domain is to make applications run faster, through improved algorithms, architectures, and the confluence of the two, and to enable scientists to solve problems they otherwise could not tackle because of the size of the problem data or the time it takes to solve them.
Presently, we are actively working on the development of algorithms and techniques that rely on the tight integration of data and computation. At the center of this research initiative is the unique data-intensive computing infrastructure acquired through an NSF MRI grant and housed in the CCR. The DI2 infrastructure consists of two data-intensive supercomputers, from Netezza (shown in Figure 1) and XtremeData (shown in Figure 2), and about 1 PB of network storage. This Data Intensive SuperComputer (DISC) is able to meet the demanding needs of research projects involving “Big Data” and can easily scale up to petabytes of storage and computation. These facilities are already being used for several of our research projects, such as gene-gene and gene-environment interaction analysis (GGI, GEI) on very large biological data sets, vascular simulations, financial forensics, and the development of tools for big data.
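As a toy illustration of pushing computation to the data in this setting, the Python sketch below builds the genotype contingency table for one SNP pair inside the database with a single GROUP BY, so only a handful of counts, rather than subject-level records, leave the storage layer. This is not the actual GGI/GEI code running on the appliances; sqlite3 again stands in for the DISC, and the genotypes table, its columns, and the synthetic data are hypothetical.

# Toy illustration: in-database contingency counts for one SNP pair.
# Assumes a hypothetical `genotypes` table with genotype codes 0/1/2 per SNP.
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE genotypes "
    "(subject_id INTEGER, snp1 INTEGER, snp2 INTEGER, case_status INTEGER)"
)
cur.executemany(
    "INSERT INTO genotypes VALUES (?, ?, ?, ?)",
    [(i, i % 3, (i // 3) % 3, i % 2) for i in range(5_000)],
)

# One GROUP BY per SNP pair; the result set is at most 3 x 3 x 2 rows.
query = """
    SELECT snp1, snp2, case_status, COUNT(*)
    FROM genotypes
    GROUP BY snp1, snp2, case_status
"""
table = {(g1, g2, s): n for g1, g2, s, n in cur.execute(query)}

# The tiny contingency table can then feed any interaction statistic client-side.
for g1, g2 in itertools.product(range(3), repeat=2):
    cases = table.get((g1, g2, 1), 0)
    controls = table.get((g1, g2, 0), 0)
    print(f"genotype pair ({g1},{g2}): {cases} cases, {controls} controls")

The same pattern, applied per candidate SNP pair, is one way in-database screening can avoid shipping whole-cohort genotype data over the network for every test.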

DI2’s unique hybrid supercomputing architecture comprises:

• Hybrid cluster of 64 NVIDIA Tesla GPUs, 128 TFLOPS
• Netezza TwinFin 24 – 192 logical snippet blades combining CPU, FPGA, memory, and storage; 67 TB of user data (201 TB total), 225 TB compressed
• XtremeData dbX – 10 nodes, 40 cores, 20 large accessible FPGAs; 30 TB of user data, 100 TB compressed; all internally connected via InfiniBand
• High-speed network storage – ~1 PB raw (this summer)