
The objective of our ongoing research project is to develop an advanced computing and optimization infrastructure for extremely large-scale graphs on post-petascale supercomputers. Extremely large-scale graphs have recently emerged in various application fields, such as transportation, social networks, cyber-security, and bioinformatics, and they require fast and scalable analysis (Figure 1). For example, a graph representing the interconnections of all the neurons of the human brain has over 89 billion vertices and over 100 trillion edges. Analyzing graphs of this scale requires an exascale supercomputer, which is not expected to appear until the 2020s.

We have entered the Graph500 (http://www.graph500.org) and Green Graph 500 (http://green.graph500.org) benchmarks, which are designed to measure the performance of a computer system on applications that require irregular memory and network access patterns. Following its announcement in June 2010, the Graph500 list was first released in November 2010 and has been updated semiannually ever since. The Graph500 benchmark measures the performance of a supercomputer performing a breadth-first search (BFS), expressed in traversed edges per second (TEPS). We implemented the world's first GPU-based BFS on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology and came in 4th on the 4th Graph500 list in 2012. The rapid growth of such large-scale graphs and their applications has drawn significant attention to recent Graph500 lists (Figure 2).
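To illustrate the kernel that Graph500 times, the following is a minimal serial sketch of a level-synchronous BFS that reports a TEPS figure. It assumes a small in-memory adjacency list; the actual benchmark generates large synthetic Kronecker graphs and distributes the search across many nodes, so this is only a conceptual model of the metric, not the reference implementation.

```python
from collections import deque
import time

def bfs_teps(adj, source):
    """Level-synchronous BFS from `source`.

    adj: dict mapping vertex -> list of neighbors (undirected graph).
    Returns (parent tree, TEPS), where TEPS = traversed edges per
    second of wall-clock time, the Graph500 performance metric.
    """
    start = time.perf_counter()
    parent = {source: source}
    frontier = deque([source])
    traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            traversed += 1          # every edge inspection counts
            if w not in parent:     # first visit: record BFS parent
                parent[w] = v
                frontier.append(w)
    elapsed = time.perf_counter() - start
    return parent, traversed / elapsed

# Tiny hypothetical graph; Graph500 uses synthetic Kronecker graphs.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
parent, teps = bfs_teps(adj, 0)
```

Because the access pattern is driven by the graph's structure, memory and network accesses are irregular, which is exactly the behavior the benchmark is designed to stress.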

In 2013, our project team came in 1st in both the big and small data categories of the 2nd Green Graph 500 benchmark (Figure 3). The Green Graph 500 list ranks systems by power efficiency, measured in TEPS per watt.

We also present our parallel implementation for large-scale mathematical optimization problems. The semidefinite programming (SDP) problem is one of the most important problem classes in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it in their software packages. However, the algorithmic framework of PDIPM contains two well-known major bottlenecks: the generation of the Schur complement matrix (SCM) and its Cholesky factorization. We have developed a new version of SDPARA, a parallel implementation on multiple CPUs and GPUs for solving extremely large-scale SDP problems with over a million constraints. SDPARA automatically extracts the unique characteristics of an SDP problem and identifies its bottleneck. When the generation of the SCM is the bottleneck, SDPARA attains high scalability using a large number of CPU cores together with techniques for processor affinity and memory interleaving. When an SDP problem has over two million constraints and the Cholesky factorization is the bottleneck, SDPARA performs the factorization in parallel on thousands of GPUs, overlapping computation and communication. Through numerical experiments on the TSUBAME 2.5 supercomputer, we demonstrate that SDPARA is a high-performance general solver for SDPs in various application fields. We solved the largest SDP problem to date (over 2.33 million constraints), setting a new world record, and our implementation achieved 1.713 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs (Figure 4).
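The two PDIPM bottlenecks can be sketched on toy dense data. The element formula below, B_ij = tr(A_i X A_j S⁻¹), is one common choice of search direction (the SCM is m × m for m constraints, dense, and symmetric positive definite, so it admits a Cholesky factorization); the constraint matrices and iterates here are small hypothetical placeholders, whereas SDPARA distributes both steps across many CPUs and GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5  # matrix dimension, number of constraints

# Hypothetical SDP data: symmetric constraint matrices A_1..A_m and
# positive-definite interior-point iterates X (primal), S (dual slack).
A = [rng.standard_normal((n, n)) for _ in range(m)]
A = [(Ai + Ai.T) / 2 for Ai in A]
X = np.eye(n)
S = 2.0 * np.eye(n)

# Bottleneck 1: generate the m x m Schur complement matrix (SCM).
# Each element costs dense matrix products, so this step dominates
# when m is large relative to n.
S_inv = np.linalg.inv(S)
B = np.empty((m, m))
for i in range(m):
    Ti = A[i] @ X            # reuse A_i X across the whole row i
    for j in range(m):
        B[i, j] = np.trace(Ti @ A[j] @ S_inv)

# Bottleneck 2: Cholesky factorization of the SCM, B = L L^T.
L = np.linalg.cholesky(B)
```

In each PDIPM iteration the factorization is used to solve the linear system that yields the search direction; for problems with millions of constraints, B alone occupies terabytes of memory, which is why both steps must be parallelized across a supercomputer.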