Nvidia high-performance computing solution architect Axel Koehler recently introduced the new-generation Volta GPU architecture, along with the CUDA 9 parallel computing platform and programming model built for Volta, at the Nvidia GPU Technology Conference. Volta brings a new streaming multiprocessor (SM) architecture, and the CUDA 9 release adds a number of upgrades, including a new programming model aimed at improved performance.

According to an insideHPC report, Koehler noted that computing demand in the HPC field keeps growing while the complexity of neural networks is exploding. Addressing this trend, Nvidia introduced the Volta-based Tesla V100 processor for data-center artificial intelligence (AI), HPC, and graphics workloads, claiming it is the fastest and most productive graphics processor (GPU) for deep learning and HPC.

On the new SM microarchitecture, Koehler said the Volta GV100 SM was redesigned for productivity, with a new instruction set architecture (ISA), simplified issue logic, and a larger, faster L1 cache, improving the SIMT model and adding support for tensor acceleration. One feature, compared with the previous-generation Pascal SM, is that Volta SM combines the L1 cache with shared memory into a single block of up to 128KB, lowering latency and enabling streaming: the streaming L1 cache offers four times the bandwidth and five times the capacity, closing the gap with the previous generation Pascal's shared memory. Another feature Koehler mentioned is independent thread scheduling, which supports interleaved execution of statements from divergent branches and enables fine-grained parallel algorithms, while execution still follows the single-instruction, multiple-thread model.
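Independent thread scheduling changes how divergent code behaves inside a warp: on Volta, the two sides of a branch may interleave rather than run strictly one after the other, so CUDA 9 adds explicit warp-level synchronization for code that assumes reconvergence. The kernel below is a minimal illustrative sketch (the kernel name and data layout are hypothetical, not from the talk), showing a divergent branch followed by an explicit `__syncwarp()`:

```cuda
// Sketch: a divergent branch under Volta's independent thread scheduling.
// Kernel name and buffer layout are illustrative, not from the article.
__global__ void divergentScale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        if (i % 2 == 0) {
            data[i] *= 2.0f;   // even lanes take this path
        } else {
            data[i] += 1.0f;   // odd lanes take this one; on Volta the two
                               // paths may interleave instead of running
                               // serially as on Pascal
        }
    }
    // CUDA 9: explicitly reconverge the full warp before any code that
    // depends on all lanes being in lockstep (e.g. warp shuffles).
    __syncwarp();
}
```

The point of the sketch is the final `__syncwarp()`: pre-Volta code could rely on implicit reconvergence after a branch, whereas under independent thread scheduling that reconvergence point must be stated explicitly.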
The new HBM2 memory architecture integrates compute and memory into a single package, delivering more bandwidth at higher energy efficiency; V100 achieves 95% DRAM utilization, ahead of the 76% DRAM utilization of the previous-generation P100. The Volta GV100 architecture also enhances the Multi-Process Service (MPS), allowing MPS clients to submit running tasks directly to work queues inside the GPU, which reduces launch latency and improves throughput. Applied to inference, Nvidia claims Volta MPS enables efficient inference deployment without a batching system. In overall GPU performance comparisons, Nvidia claims V100 outpaces P100 in training speed, inference speed, HBM2 bandwidth, and NVLink bandwidth, including a 12.5-fold speedup in training.
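For context on how MPS is used in practice: it is driven by the `nvidia-cuda-mps-control` daemon from the CUDA toolkit. The snippet below is a generic operational sketch, not from the talk; the directory paths and device selection are illustrative assumptions.

```shell
# Start the MPS control daemon for GPU 0 (run as the user owning the GPU).
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # illustrative path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log   # illustrative path
nvidia-cuda-mps-control -d    # -d: spawn as a background daemon

# CUDA processes started with the same pipe directory now share the GPU
# through MPS rather than coarse time-sliced context switching.

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control
```

On Volta, clients set up this way submit work directly to hardware work queues, which is the launch-latency and throughput improvement the article describes.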