Past projects

Virtual Surgery

Despite its promise, realistic telesurgery has so far eluded researchers and surgeons due to its computational challenges and to limitations in surgical robotics. Currently, surgeons are confined to the operating suite, which subjects them to potentially harmful radiation. By separating surgeons from their patients, telesurgery would let them limit their exposure to radiation during fluoroscopy.
Accomplishing this, however, requires advances in surgical robotics and haptics. By combining these advances with new computational accelerators and improved data compression and transmission, truly remote telesurgery becomes feasible.
Virtual training systems have been designed for minimally invasive procedures such as endoscopy and arthroscopic surgery. However, a realistic training environment for Orthopaedic surgery is limited by a huge computing requirement (>540 TFLOPS) that rivals the performance of supercomputers costing more than $1 million.
Our Virtual Surgery project initiates research to enhance current training modalities in Orthopaedic surgery through computerized training and assessment systems. Simulations will be designed so that residents can experience surgical procedures hands-on before operating on patients. Ex vivo training on virtual systems will familiarize residents with real-life scenarios, minimize the risk to patients, and allow for competence-based advancement of residents.
The goal of our research is to provide an inexpensive, multi-use solution that complements current training methods in Orthopaedic Surgery. The overarching goal of the project is to develop a complete surgical system that increases the accuracy of robotic surgery and enables remote surgery, whether from an adjacent room to reduce repeated fluoroscopy exposure or from a distant site to give patients access to spatially remote experts. The system can also be used for training on virtual data. It will allow a surgeon to complete the operation, keep a recorded copy of the procedure, and, if desired, have an assessment completed by one or a series of raters. Our assessment tool will allow for the demonstration of mastery at all stages of the operation in one or repeated trials and evaluations. By taking advantage of advances in computer technology, we aim to improve medical education while decreasing the risk to patients and, ideally, improving patient outcomes.
We are currently developing a virtual intervention system prototype, RoboPlan, with the aim of establishing an efficient collaborative framework for virtual telesurgery applications. The system is intended to provide a comprehensive platform for assembling the components crucial to virtual-reality-based training, invasive surgical procedures, and standardized surgical skills assessment. This versatile tool will facilitate patient-specific training and allow medical trainees to gain realistic hands-on experience with patient-specific surgical procedures in dynamic, interactive surgical scenarios within a safe learning environment.
RoboPlan will provide a comprehensive virtual open-surgery platform with surgical training and assessment capabilities. By providing real-time haptic/tactile and visual feedback, it will enable medical trainees to gain realistic hands-on experience with a range of patient-specific surgical procedures before operating on real patients.
RoboPlan will provide capabilities for pre-operative planning, 3D visualization, and intelligent surgical guidance based on pre-existing data and critical anatomical features. In addition, RoboPlan’s assessment feature will provide standardized feedback and evaluation scores, allowing for competence-based advancement of residents.
RoboPlan comprises three key elements: Visualization, Haptics, and Robotics.
 
Related Publications

  • Taruna Seth, Vipin Chaudhary, Cathy Buyea, Lawrence Bone: A Haptic Enabled Virtual Reality Framework for Orthopedic Surgical Training and Interventions. I. J. Comput. Appl. 21(4): 220-229 (2014) pdf
  • Taruna Seth, Vipin Chaudhary, Cathy Buyea, and Lawrence Bone, “A Virtual Interactive Navigation System for Orthopaedic Surgical Interventions”, In Proc. of the 4th Int. Conf. on Applied Sciences in Biomedical and Communication Technologies, ISABEL’11, Spain. pdf

Computer Aided Neurosurgery

Neurosurgery is a complex procedure in which the surgeon must integrate multi-modal data to produce an optimal surgical plan. Often the region of interest is surrounded by vital structures (such as the motor cortex, the temporal cortex, and visual and auditory areas) and has an irregular configuration. Slight damage to such eloquent brain structures can severely impair the patient.
CADI has developed CASMIL, an image-guided neurosurgery toolkit for producing optimal plans that result in minimally invasive surgeries. The system, originally developed at Wayne State University, has many innovative features needed by neurosurgeons that are not available in other academic or commercial systems. CASMIL integrates various vital modules, such as rigid and non-rigid co-registration (image-image, image-atlas, and image-patient), 3D segmentation, a brain shift predictor (BSP), knowledge-based query tools, intelligent planning, and augmented reality.

MigThread: Heterogeneous Thread Migration

As high-performance facilities shift from supercomputers to Networks of Workstations (NOWs), migration of computation from one node to another becomes indispensable. Thread/process migration enables dynamic load distribution, fault tolerance, eased system administration, data access locality, and mobile computing. Since multi-threading breaks the tasks of one or more applications into several threads and schedules them for the most efficient use of the processor, thread programming is becoming popular, and thread migration therefore becomes a necessity.
MigThread is a heterogeneous thread migration package that enables multi-threaded parallel computations to fit the grid computing paradigm and achieve idle-cycle utilization, resource sharing, fault tolerance, and mobile computing. MigThread is an application-level migration scheme and consists of two parts: a preprocessor and a runtime support module (a minimal sketch of the state-capture idea follows the feature list below).
MigThread has the following features:
* Portability: Thread state is taken out of kernels or libraries and moved up to the language level
* User-level memory management: Dynamic memory allocation is supported during migration
* Scalability: No restrictions on thread stacks and heaps
* Safety: Unsafe features in C are detected to make more programs “migration-safe”
* Efficiency: A light-weight solution
* Fault tolerance: Checkpointing
* Heterogeneity: Migration across all UNIX variants
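
To give a feel for the approach, here is a minimal C sketch of the kind of code such a preprocessor might emit (the names and the transformation are simplified illustrations, not MigThread's actual generated output): live variables are kept in an explicitly laid-out, portable record instead of raw stack bytes, and a resume label lets the computation restart from an adaptation point after the record has been shipped to another machine.

    #include <stdio.h>
    #include <stdint.h>

    /* Portable frame record: live variables travel in an explicitly
       laid-out struct, not raw stack bytes, so the state can be packed,
       byte-swapped, and restored on a different architecture. */
    typedef struct {
        uint32_t resume_point;  /* adaptation point to restart from */
        uint32_t i;             /* live loop index */
        double   partial_sum;   /* live accumulator */
    } frame_state;

    static volatile int migrate_requested = 0;  /* set by the runtime */

    /* Sum 0..n-1 with an adaptation point in the loop.  Returns 1 when
       done, 0 when *st should be shipped to another node and resumed. */
    static int summation(uint32_t n, frame_state *st, double *result)
    {
        switch (st->resume_point) {
        case 0:
            st->partial_sum = 0.0;
            for (st->i = 0; st->i < n; st->i++) {
        case 1:     /* on resume, control re-enters the loop here */
                if (migrate_requested) {
                    st->resume_point = 1;  /* remember where we stopped */
                    return 0;              /* runtime transfers *st */
                }
                st->partial_sum += (double)st->i;
            }
        }
        *result = st->partial_sum;
        return 1;
    }

    int main(void)
    {
        frame_state st = { 0, 0, 0.0 };
        double sum;
        if (summation(10, &st, &sum))
            printf("sum = %f\n", sum);
        return 0;
    }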

 
Presentation migthread.ppt

Related Publications

  • Hai Jiang and Vipin Chaudhary, “On Improving Thread Migration: Safety & Performance”, Proceedings of the International Conference on High Performance Computing (HiPC), IEEE and ACM, Bangalore, India, December 2002. pdf
  • Jiang, Hai, and Vipin Chaudhary. “MigThread: Thread migration in DSM systems.” Parallel Processing Workshops, 2002. Proceedings. International Conference on. IEEE, 2002. pdf
  • Jiang, Hai, and Vipin Chaudhary. “Compile/run-time support for thread migration.” Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM. IEEE, 2002. pdf

ADAM: A Distributed Adaptively-shared Memory System

Recent improvements in commodity processors and networks have provided an opportunity to support high-performance parallel applications within an everyday computing infrastructure. However, applications for such distributed systems are cumbersome to develop due to the need for programmers to handle communication primitives explicitly. Distributed shared memory (DSM) systems are gaining popularity for providing a logically shared memory over physically distributed memory. The programmer is given the illusion of a large global address space encompassing all available memory, eliminating the task of explicitly moving data between processes located on separate machines. DSM systems combine programming advantages of shared memory and the cost advantages of distributed memory.
Studies have indicated that a large fraction of workstations are unused for a large fraction of the time. To improve the performance of parallel computing and harvest the idle cycles in networks, we are developing an adaptive DSM system, ADAM (A Distributed Adaptively-shared Memory System), extended from our existing DSM Strings. ADAM adjusts itself dynamically to fit the dramatically changing environment of a network of workstations. When idle processors or workstations appear in the network, ADAM can distribute data and computation to them, adopting these idle resources. When a workstation is reclaimed by its primary user, ADAM jobs stop and vacate it to respect the primary user’s priority. Workstations can therefore join and leave ADAM on the fly.
Meanwhile, ADAM hides these unstable conditions from the parallel computation above it, which must be reconfigurable, especially in its degree of parallelism (the number of processors used). Reconfiguration may require data and loop repartitioning as well as data and/or computation migration. With ADAM's help, such data and computation redistribution is transparent to the parallel computation.
ADAM aims at:
* Improving parallel computation performance by load balancing
* Utilizing system resources efficiently by load sharing
Presentation adam.ppt

 
Related Publications

  • Roy, Sumit, and Vipin Chaudhary. “Design issues for a high-performance distributed shared memory on symmetrical multiprocessor clusters.” Cluster Computing 2.3 (1999): 177-186. pdf

Strings : Software Distributed Shared Memory

Applications for distributed memory systems are cumbersome to develop due to the need for programmers to handle communication primitives explicitly, as when coding in MPI. In addition, applications have to be tuned for each individual architecture to achieve reasonable performance. Since hardware shared memory machines do not scale well and are relatively expensive to build, software distributed shared memory (DSM) systems are gaining popularity for providing a logically shared memory over physically distributed memory. These software DSM systems combine the programming advantages of shared memory with the cost advantages of distributed memory. The programmer is given the illusion of a large global address space encompassing all available memory, thereby eliminating the task of explicitly moving data between processes located on separate machines.
DSMs share data at the relatively large granularity of a virtual memory page and can suffer from a phenomenon known as “false sharing”, wherein two processes simultaneously attempt to write to different data items that reside on the same page. If only a single writer is permitted, the page may ping-pong between the nodes. One solution to this problem is to “hold” a freshly arrived page for some time before releasing it to another requester. Relaxed memory consistency models that allow multiple concurrent writers have also been proposed to alleviate this problem: the system ensures that all nodes see the same data at well-defined points in the program, usually when synchronization occurs, although extra effort is then required to ensure program correctness. Another technique investigated for improving DSM performance is the use of multiple threads of control in the system. Third-generation DSM systems combine relaxed consistency models with multithreading.
Strings is built using POSIX threads, which can be multiplexed on kernel light-weight processes. The kernel can schedule these light-weight processes across the multiple processors of a symmetrical multiprocessor (SMP) for better performance. In Strings, therefore, each thread can be assigned to any processor on the SMP unless specially requested otherwise, and all local threads can run in parallel if there are enough processors. Strings is designed to exploit data parallelism at the application level, by allowing multiple application threads to share the same address space on a node, and task parallelism at the run-time level: the protocol handler is multi-threaded, and the overhead of interrupt-driven network I/O is avoided by using a dedicated communication thread.
Execution Model
Strings starts a master process that forks child processes on remote nodes using rsh(). Each of these processes creates a dsm_server thread and a communication thread. The forked processes then register their listening ports with the master. The master process enters the application proper and creates shared memory regions. It then creates application threads on remote nodes by sending requests to the dsm_server threads on the respective nodes. Shared memory identifiers and global synchronization primitives are sent as part of the thread create call. The virtual memory subsystem is used to enforce coherent access to the globally shared regions.
Kernel Threads
Thread implementations can be either user-level, usually implemented as a library, or kernel-level, in terms of light-weight processes. Kernel-level threads are more expensive to create, since the kernel is involved in managing them. User-level threads suffer from some limitations: since they are implemented as a user-level library, they cannot be scheduled by the kernel; if any thread issues a blocking system call, all associated threads are blocked as well; and on a multi-processor system, user-level threads bound to one light-weight process can run on only one processor at a time. User-level threads also do not allow the programmer to control their scheduling within the process, whereas kernel-level threads can be scheduled by the operating system across multiple processors.
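
A small POSIX threads example makes this distinction concrete (generic pthreads code, not Strings source): requesting system scheduling scope binds each thread to a kernel light-weight process, so the operating system can schedule the threads across the processors of an SMP.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("worker %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid[4];

        pthread_attr_init(&attr);
        /* Bound (kernel-scheduled) threads; PTHREAD_SCOPE_PROCESS would
           instead request library-scheduled, user-level contention. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        for (long i = 0; i < 4; i++)
            pthread_create(&tid[i], &attr, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }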
Shared Memory
Strings implements shared memory by using the mmap() call to map a file to the bottom of the stack segment. With dynamically linked programs, it was found that mmap() would map the same page to different addresses on different processors. Allowing multiple application threads on the same node also leads to a peculiar problem: once a page has been fetched from a remote node, its contents must be written to the corresponding memory region, so the protection has to be changed to writable, and at that moment no other thread should be able to access the page. Suspending all kernel-level threads can lead to deadlock and also reduces concurrency. In Strings, every shared page is therefore mapped at two different addresses: a primary address used by application threads and a shadow address used by the runtime. The runtime can write a fetched page's contents through the shadow address without changing the protection of the primary memory region.
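
The double-mapping trick is easy to demonstrate in isolation (generic POSIX code, not the Strings implementation): the same file page is mapped at two addresses, and a write through the writable shadow mapping becomes visible through the read-only primary mapping without any protection change.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        int  fd  = open("/tmp/dsm_page", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, psz);

        /* Primary mapping: what application threads see (read-only). */
        char *primary = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
        /* Shadow mapping of the SAME page, writable by the runtime. */
        char *shadow  = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

        /* Install freshly fetched remote contents via the shadow
           address; the primary mapping's protection never changes. */
        strcpy(shadow, "contents fetched from a remote node");
        printf("primary sees: %s\n", primary);  /* same physical page */

        munmap(primary, psz);
        munmap(shadow, psz);
        close(fd);
        return 0;
    }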
Memory Consistency Model
A release consistency model using an update protocol has been implemented. When a thread first writes to a page, a twin copy of the page is created. When a lock is released or a barrier is reached, the difference (diff) between the page's current contents and its twin is sent to the threads that share the page. Multiple diffs are aggregated to decrease the number of messages sent.
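
A minimal sketch of the twin/diff mechanics (illustrative only; the actual protocol also aggregates diffs and handles multiple concurrent writers):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_WORDS 1024

    typedef struct { int offset; long value; } diff_entry;

    /* Collect (offset, new value) pairs for words that changed since
       the twin was taken; returns the number of entries to transmit. */
    static int make_diff(const long *page, const long *twin,
                         diff_entry *out)
    {
        int n = 0;
        for (int i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i]) {
                out[n].offset = i;
                out[n].value  = page[i];
                n++;
            }
        return n;
    }

    int main(void)
    {
        static long page[PAGE_WORDS], twin[PAGE_WORDS];
        static diff_entry diffs[PAGE_WORDS];

        memcpy(twin, page, sizeof page);  /* twin taken at first write */
        page[3]   = 42;                   /* application writes */
        page[700] = -1;

        int n = make_diff(page, twin, diffs);
        printf("%d changed words; first at offset %d\n",
               n, diffs[0].offset);
        return 0;
    }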

Related Publications

  • Roy, Sumit, et al. “Application based evaluation of distributed shared memory versus message passing.” Proceedings of ISCA 12th Intl. Conference on Parallel and Distributed Computing Systems. 1999. pdf
  • Chaudhary, V., et al. “Parallelization of Radiation Therapy Treatment Planning (RTTP): A Case Study.” Proceedings of ISCA 12th Intl. Conference on Parallel and Distributed Computing Systems. 1999. pdf
  • Thaker, D., Chaudhary, V., Edjlali, G., & Roy, S. (1999). Cost-Performance Evaluation of SMP Clusters. In PDPTA (pp. 718-724). pdf
  • Roy, Sumit, and Vipin Chaudhary. “Design issues for a high-performance distributed shared memory on symmetrical multiprocessor clusters.” Cluster Computing 2.3 (1999): 177-186. pdf
  • Roy, Sumit, and Vipin Chaudhary. “Evaluation of cluster interconnects for a distributed shared memory.” Performance, Computing and Communications Conference, 1999 IEEE International. IEEE, 1999. pdf
  • S. Roy and V. Chaudhary. “Communication Requirements of Software Distributed Shared Memory.” In Proc. of the National Conference on Communications, pp. 409–416, Kharagpur, India, Jan. 30–31, 1999. pdf
  • Roy, Sumit, and Vipin Chaudhary. “Strings: A high-performance distributed shared memory for symmetrical multiprocessor clusters.” High Performance Distributed Computing, 1998. Proceedings. The Seventh International Symposium on. IEEE, 1998. pdf

APE4NOW: Automatic Parallelization Environment for Network of Workstations

APE4NOW integrates multiple facets of parallel computing into a single environment. It is a system for automatic parallelization of large-scale scientific and engineering applications on distributed memory computers; the target platform is a number of workstations connected by a scalable network. APE4NOW provides several paradigms for application programming. The goal is to let both expert and novice programmers write high-performance computing applications that execute on commodity computer networks, directly or indirectly, so that application execution can be sped up across multiple workstations.
Sequential Programs
Programmers who know nothing about parallel programming can still write sequential FORTRAN or C programs in whatever style they like. APE4NOW uses the extended Stanford SUIF compiler (implemented at Wayne) as the front end, which translates the sequential programs into multi-threaded C code. The compiler detects portions of the program suitable for automatic parallelization, and the multi-threaded output can run on workstations with multiple processors, normally with shorter execution time than the original sequential program. APE4NOW thus removes the requirement for parallel programming expertise.
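
The kind of transformation involved can be illustrated with a hand-written sketch (the SUIF-based front end emits more general code than this): a loop with no cross-iteration dependences is block-partitioned into chunks, each executed by its own thread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static double a[N], b[N];

    typedef struct { int lo, hi; } range;

    static void *body(void *arg)            /* original loop body */
    {
        range *r = arg;
        for (int i = r->lo; i < r->hi; i++)
            a[i] = 2.0 * b[i] + 1.0;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        range     rg[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {  /* block partitioning */
            rg[t].lo = t * (N / NTHREADS);
            rg[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, body, &rg[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("a[0] = %f\n", a[0]);
        return 0;
    }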
Hand-written Parallel Programs
Programmers who are experts in parallel computing can write parallel programs directly; such programs can be more efficient, or more readable to their authors, and APE4NOW accepts such code as well.
Hand-written Threaded MPI Programs
Experts can also write MPI programs and execute them on NOWs directly. The programmers then have to handle not only the parallelism but also the communication between workstations, which can be far more complicated, error-prone, and difficult to debug. This is the traditional parallel programming style; APE4NOW simply reduces the programmers' burden and lets them focus on the application problem itself. Note that each workstation needs to be a uniprocessor; otherwise one has to write multi-threaded code as well, which increases programming complexity drastically.
Parallelizing Compilers
SUIF handles uniform loops over dense, regular data structures. For loops with non-uniform dependences, we proposed the concept of the Complete Dependence Convex Hull, which contains the entire dependence information of the program. We also proposed the concepts of Unique Head Sets and Unique Tail Sets, which isolate the dependence information. The relationship between the unique head and tail sets forms the foundation for partitioning the iteration space; depending on the relative placement of these unique sets, various cases were considered and several partitioning schemes suggested.
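
As an example, a loop nest of the following shape (coefficients made up for illustration) carries non-uniform dependences: the distance between the write to A[2*i+1][j] and the read of A[i+j][i] varies across the iteration space, so constant-distance partitioning does not apply.

    #include <stdio.h>

    #define N 16

    int main(void)
    {
        static double A[4 * N][4 * N];

        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++)      /* non-uniform dependence */
                A[2 * i + 1][j] = A[i + j][i] + 1.0;

        printf("A[3][1] = %f\n", A[3][1]);
        return 0;
    }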
Runtime Parallelization
The multi-threaded code is shared-memory code and can run only on a single workstation with one or more processors. Moreover, some parallelizable portions of a program are not detected at compile time, or cannot be determined then because of their dynamic nature. For these cases, the Runtime Parallelization module converts sequential code into parallel code during execution. APE4NOW includes two new run-time techniques for parallelizing loops with indirect access patterns. The schemes can handle any type of loop-carried dependence; they follow the DOACROSS inspector/executor approach and improve upon previous algorithms of the same generality by allowing concurrent reads of the same location and by increasing the overlap of dependent iterations. The algorithms are implemented with time-stamping rules using multithreading tools. The difference between the two proposed algorithms is that one allows partially concurrent reads without extra overhead in its inspector, while the other allows fully concurrent reads at a slight overhead in the dependence analysis.
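
A stripped-down inspector/executor sketch conveys the idea (illustrative only, and deliberately conservative: unlike the algorithms above, it serializes reads of the same location instead of letting them proceed concurrently). The loop body is x[w[i]] += x[r[i]], whose dependences are unknown until the index arrays w[] and r[] are available at run time.

    #include <stdio.h>

    #define N 8     /* iterations */
    #define M 16    /* array size */

    int main(void)
    {
        int w[N] = {1, 2, 1, 5, 2, 7, 5, 1};  /* write indices */
        int r[N] = {0, 1, 2, 2, 5, 1, 7, 5};  /* read  indices */
        double x[M];
        int level[N];       /* wavefront number of each iteration */
        int last[M] = {0};  /* highest level touching each location */

        for (int k = 0; k < M; k++) x[k] = 1.0;

        /* Inspector: iteration i must follow every earlier iteration
           that touched x[w[i]] or x[r[i]]; assign wavefront levels. */
        int maxlvl = 0;
        for (int i = 0; i < N; i++) {
            int lvl = last[w[i]] > last[r[i]] ? last[w[i]] : last[r[i]];
            level[i] = lvl + 1;
            last[w[i]] = last[r[i]] = level[i];
            if (level[i] > maxlvl) maxlvl = level[i];
        }

        /* Executor: iterations in the same wavefront touch disjoint
           locations and could run concurrently; here they simply run
           wavefront by wavefront. */
        for (int lvl = 1; lvl <= maxlvl; lvl++)
            for (int i = 0; i < N; i++)
                if (level[i] == lvl)
                    x[w[i]] += x[r[i]];

        printf("x[1] = %f\n", x[1]);
        return 0;
    }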
Distributed vs. Shared Memory Code
Workstations are now omnipresent, and shared-memory code can be transformed into distributed-memory code to exploit the combined computing power of multiple workstations. APE4NOW provides a conversion module that performs this transformation automatically, shielding programmers from complicated communication issues. The APE4NOW environment thus makes parallel programming easy, parallelizes programs automatically, and provides supercomputer-like performance on a commodity network of workstations. The developed system will be made available to the entire computing community.

Presentation ape4now.ppt

Related Publications

  • Xu, C-Z., and Vipin Chaudhary. “Time stamp algorithms for runtime parallelization of DOACROSS loops with dynamic dependences.” IEEE Transactions on Parallel and Distributed Systems 12.5 (2001): 433-450. pdf
  • Ju, Jialin, and Vipin Chaudhary. “A fission technique enabling parallelization of imperfectly nested loops.” International Conference on High-Performance Computing. Springer, Berlin, Heidelberg, 1999. pdf
  • Ju, Jialin, and Vipin Chaudhary. “Unique sets oriented parallelization of loops with non-uniform dependences.” The Computer Journal 40.6 (1997): 322-339. pdf
  • Punyamurtula, Swamy, et al. “Compile time partitioning of nested loop iteration spaces with non-uniform dependences.” Parallel Algorithms and Applications 12.1-3 (1997): 113-141. pdf
  • Xu, Chengzhong, and Vipin Chaudhary. “Time-stamping algorithms for parallelization of loops at run-time.” Parallel Processing Symposium, 1997. Proceedings., 11th International. IEEE, 1997. pdf
  • Ju, Jialin, and Vipin Chaudhary. “Unique sets oriented partitioning of nested loops with non-uniform dependences.” Parallel Processing, 1996. Vol. 3. Software., Proceedings of the 1996 International Conference on. Vol. 3. IEEE, 1996. pdf
  • Chaudhary, Vipin, et al. “Design and evaluation of an environment APE for automatic parallelization of programs.” Parallel Architectures, Algorithms, and Networks, 1996. Proceedings., Second International Symposium on. IEEE, 1996. pdf
  • Chaudhary, V., Ju, J., Luo, L., Roy, S., Sinha, V., Xu, C. Z., & Konda, V. (1996). Automatic Parallelization of Non-uniform Dependences. In Proc. of the First SUIF Compiler Workshop (pp. 148-152). pdf
  • Punyamurtula, Swamy, and Vipin Chaudhary. “Minimum dependence distance tiling of nested loops with non-uniform dependences.” Parallel and Distributed Processing, 1994. Proceedings. Sixth IEEE Symposium on. IEEE, 1994. pdf

High Performance Computing Applications

It is well known that parallelizing application programs can improve their execution performance, and many supercomputers have been built for this task. Recently, a Network of Workstations (NOW) acting as a single large-scale computer has gained tremendous popularity with the advent of low-latency, high-bandwidth local area networking and the incessant increase in the computing power of individual workstations. The workstations of today (and increasingly of the future) use multiple processors sharing the same memory; such workstations are called symmetrical multiprocessors (SMPs). Clusters of workstations are therefore becoming clusters of SMPs, with increasing promise for cost/performance. A network of workstations or SMPs provides an opportunity to support high-performance parallel applications within an everyday computing infrastructure.
On this new cost-effective parallel computing platform, we have analyzed the most popular parallelization mechanisms, such as vendor-provided compilers, OpenMP, MPI, and software Distributed Shared Memory (DSM), to determine the most efficient method for parallel computation in science and industry. The evaluation criteria are system requirements, programming style, efficiency in exploiting parallelism, and the application characteristics each can handle. We have developed a DSM system, Strings, as part of this analysis.
Tribology Application
We have applied various parallelization techniques to a Tribology application, a Molecular Dynamics (MD) simulation of the friction forces of sliding hydroxylated α-aluminum oxide surfaces. In an MD simulation, the motion of each atom is governed by Newton's equation of motion, and the atoms' positions are determined by the time evolution of that equation. At each time integration step the forces between atoms and the potential and kinetic energies are evaluated. The computational effort grows linearly with the number of Newton's equations, so the method is well suited to mid-sized systems (e.g., ~10² atoms). However, two factors generally limit the application of MD to large-scale simulation (e.g., ~10⁶ atoms). First, the integration time step in an MD simulation is usually about a femtosecond (10⁻¹⁵ s) for numerical and physical reasons, whereas the time scale of tribology experiments is at least nanoseconds (10⁻⁹ s); the simulation time step is so small compared with experiment that a very large number of integration steps is required to reach a desired total evolution time. Second, as the number of atoms in the simulated system increases, the computation time for force evaluation grows rapidly. Parallel and distributed processing is a promising solution to these computational requirements.
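
To make the cost structure concrete, here is a generic velocity-Verlet integration step for a toy pairwise system (illustrative only; the actual simulation uses a force field for the hydroxylated surfaces, and the constants below are arbitrary). The O(N²) force evaluation inside every femtosecond-scale step is what parallelization targets.

    #include <stdio.h>

    #define N  64
    #define DT 1e-15                /* ~1 fs time step, as noted above */

    static double x[N], v[N], f[N], m[N];

    static void compute_forces(void)    /* toy pairwise spring forces */
    {
        for (int i = 0; i < N; i++) f[i] = 0.0;
        for (int i = 0; i < N; i++)      /* O(N^2): dominates run time */
            for (int j = i + 1; j < N; j++) {
                double fij = -1e-3 * (x[i] - x[j]); /* toy interaction */
                f[i] += fij;
                f[j] -= fij;
            }
    }

    static void verlet_step(void)
    {
        for (int i = 0; i < N; i++) {    /* half-kick, then drift */
            v[i] += 0.5 * DT * f[i] / m[i];
            x[i] += DT * v[i];
        }
        compute_forces();                /* forces at new positions */
        for (int i = 0; i < N; i++)      /* second half-kick */
            v[i] += 0.5 * DT * f[i] / m[i];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            x[i] = i * 1e-10;            /* ~1 angstrom spacing */
            v[i] = 0.0;
            m[i] = 1e-26;                /* toy mass (~1e-26 kg) */
        }
        compute_forces();
        for (int step = 0; step < 1000; step++)
            verlet_step();
        printf("x[0] = %e m\n", x[0]);
        return 0;
    }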

Related Publications

  • Chaudhary, V., Hase, W., Jiang, H., Sun, L., & Thaker, D. (2002, August). Comparing Various Parallelizing Approaches for Tribology. In 4th International Workshop on High Performance Scientific and Engineering Computing with Applications. pdf