NVIDIA’s meteoric growth in the datacenter, where its business is now generating some $1.6B annually, has been largely driven by the demand to train deep neural networks for Machine Learning (ML) and Artificial Intelligence (AI)—an area where the computational requirements are simply mind-boggling. Much of this business is coming from the largest datacenters in the US, including Amazon, Google, Facebook, IBM, and Microsoft. Recently, NVIDIA announced new technology and customer initiatives at its annual Beijing GTC event to help drive revenue in the inference market for Machine Learning, as well as to solidify the company’s position in the huge Chinese AI market. For those unfamiliar, inference is where the trained neural network is used to predict and classify sample data. The inference market will likely end up being larger, in terms of chip unit volumes, than the training market; after all, once you train a neural network, you probably intend to use it, and use it a lot. It is therefore critical that NVIDIA capture its share of this market as AI moves from early R&D to commercial deployment, both in the cloud and at the edge.
What did NVIDIA announce?
As is typically the case, NVIDIA’s CEO, Jensen Huang, made these announcements during a keynote address at the GPU Technology Conference (GTC) in Beijing, the first stop on a worldwide tour of GTC events. First, and perhaps most importantly, Huang announced the new TensorRT 3 software, which optimizes trained neural networks for inference processing on NVIDIA GPUs. TensorRT 3 can be used to package, or compile, neural networks built with any ML framework for deployment across the NVIDIA portfolio of datacenter and edge devices; TensorRT is essentially the CUDA of inferencing. Huang also announced that TensorRT 3 is now being deployed by all of China’s largest Internet datacenters, namely Alibaba, Baidu, Tencent, and JD.com, for ML workloads.
Figure 1: TensorRT software is the cornerstone that should enable NVIDIA to deliver optimized inference performance in the cloud and at the edge. Source: NVIDIA
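To make the workflow concrete, here is a minimal, hedged sketch of how a trained model gets compiled into an optimized inference engine. It uses the current TensorRT Python API and its ONNX parser, which postdate the TensorRT 3 release discussed here (that generation shipped Caffe and TensorFlow/UFF importers instead), so treat it as an illustration of the build-then-deploy flow rather than the exact API Huang announced; the model filename is hypothetical.

```python
# Illustrative sketch: compile a trained, exported network into a TensorRT engine.
# Assumes a recent TensorRT Python package; the TensorRT 3-era API differed.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str) -> bytes:
    """Parse an ONNX model and return a serialized, optimized inference engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):          # parse the framework-exported graph
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)       # reduced precision is one source of the speedup
    return builder.build_serialized_network(network, config)

# engine_bytes = build_engine("resnet50.onnx")  # hypothetical exported model
```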
In addition to announcing the Chinese deployment wins, Huang provided some compelling benchmarks to demonstrate the company’s prowess in accelerating Machine Learning inference, both in the datacenter and at the edge. Note the ~20X increase in performance directly attributable to the new NVIDIA software (compare the two V100 (Volta) results in Figure 2).
Figure 2: TensorRT 3 performance for inference processing of images (ResNet-50). Source: NVIDIA
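For readers who want to reproduce the flavor of this metric, below is a minimal sketch of how ResNet-50 inference throughput is typically measured in images per second. It uses PyTorch and torchvision as stand-ins, not NVIDIA’s benchmark harness, and the batch size and iteration counts are arbitrary choices for illustration.

```python
# Minimal throughput sketch: images/sec for ResNet-50 inference on a GPU (or CPU fallback).
# This is not NVIDIA's benchmark methodology; it only illustrates the metric in Figure 2.
import time
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().eval().to(device)

batch_size, iters = 64, 50
images = torch.randn(batch_size, 3, 224, 224, device=device)  # synthetic input batch

with torch.no_grad():
    for _ in range(5):                    # warm-up passes
        model(images)
    if device == "cuda":
        torch.cuda.synchronize()          # make sure warm-up work has finished
    start = time.time()
    for _ in range(iters):
        model(images)
    if device == "cuda":
        torch.cuda.synchronize()          # wait for all GPU work before stopping the clock
    elapsed = time.time() - start

print(f"{batch_size * iters / elapsed:,.0f} images/sec")
```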
In addition to the TensorRT 3 deployments, Huang announced that the largest Chinese cloud service providers, Alibaba, Baidu, and Tencent, are all offering the company’s newest Tesla V100 GPUs to their customers for scientific and deep learning applications. For customers wanting to deploy deep learning in their own datacenters, he announced that Huawei, Inspur, and Lenovo would be selling HGX-based servers with Volta to their global customer bases. HGX is an 8-GPU chassis with the NVLink interconnect, used to provide high levels of GPU scaling in a dense package. HGX, announced earlier this year, was designed with Microsoft and is available as an open source hardware platform through the Open Compute Project. The Lenovo win is significant: the company has been seeking a high-density GPU server for large-scale training workloads and is, at least for now, the only global OEM to offer HGX.
Figure 3: The Open Compute HGX platform allows 8 P100 or V100 GPUs to connect to any server for Machine Learning acceleration. Source: NVIDIA
Continuing with the theme of inference processing in China, NVIDIA also announced that JD.com’s delivery subsidiary would be using the NVIDIA Jetson platform to guide and control the drones behind its land and air delivery services. Delivering products through China’s crowded highway infrastructure is unreliable and time-consuming. To address this growing challenge, JD.com plans to have a million drones, with NVIDIA Jetson on board, in service by 2020.
Figure 4: These self-piloting drones will help JD.com quickly deliver goods through or above the congested Chinese urban transportation system. Source: NVIDIA
Conclusions
As Machine Learning matures beyond the research and development stage, attention is turning to the processing needs of inference. The data being inferenced can be quite simple, such as text or images, or incredibly demanding, such as real-time spoken translation and high-definition video/Lidar. The corresponding processing requirements will therefore vary from simple mobile processors in our phones to miniature supercomputers in our autonomous vehicles. NVIDIA is not content with just being the brains behind the creation of these AIs; it is positioning itself to compete with CPUs, FPGAs, and ASICs for the coming explosion in datacenter and edge ML processing. The customer wins announced by Mr. Huang demonstrate that the company has what it takes to be a player in the next phase of Machine Learning and AI. However, unlike training, which has been an all-NVIDIA show, the diversity of inference data, latency, and power requirements will create a wide range of solutions and an interesting competitive landscape.
Disclosure: Moor Insights & Strategy, like all research and analyst firms, provides or has provided research, analysis, advising and/or consulting to many high-tech companies in the industry mentioned in this article, including AMD, Intel, Microsoft, NVIDIA, Xilinx, and others. The author does not have any investment positions in any of the companies named in this article, except Google.
When Google announced its second generation of ASICs to accelerate the company’s machine learning processing, my phone started ringing off the hook with questions about the potential impact on the semiconductor industry. Would the other members of the Super 7, the world’s largest datacenters, all rush to build their own chips for AI? How might this affect NVIDIA, a leading supplier of AI silicon and platforms, and potentially other companies such as AMD, Intel, and the many startups that hope to enter this lucrative market? Is it game over for GPUs and FPGAs just when they were beginning to seem so promising? To answer these and other questions, let us get inside the heads of these Goliaths of the Internet and see what they may be planning.
The Google Cloud TPU is a four-ASIC board that delivers 180 teraflops of performance. Source: Google
As I explored in an article earlier this year, there are four major types of technology that can be used to accelerate the training and use of deep neural networks: CPUs, GPUs, FPGAs, and ASICs. The good old standby CPU has the advantage of being infinitely programmable, with decent but not stellar performance; it is used primarily in inference workloads, where the trained neural network guides the computation to make accurate predictions about each input data item. FPGAs from Intel and Xilinx, on the other hand, offer excellent performance at very low power, plus the flexibility to change the underlying hardware to best support changing software. FPGAs are used primarily in Machine Learning inference, video algorithms, and thousands of small-volume specialized applications. However, the skills needed to program FPGA hardware are fairly hard to come by, and the performance of an FPGA will not approach that of a high-end GPU for certain workloads.
There are many types of hardware accelerators that are used in Machine Learning today, in training and inference, and in the cloud and at the edge. Source: Moor Insights & Strategy
Technically, a GPU is an ASIC built for processing graphics algorithms. The difference is that a GPU offers an instruction set and libraries that allow it to be programmed to operate on locally stored data, as an accelerator for many parallel algorithms. GPUs excel at the matrix operations (primarily matrix multiplications, if you remember your high school math) that underlie graphics, AI, and many scientific algorithms. Basically, GPUs are very fast and relatively flexible.
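As a small illustration of why these workloads map so well onto matrix hardware, here is a NumPy sketch of a single fully connected neural-network layer; the layer sizes are arbitrary and purely illustrative. The forward pass reduces to one large matrix multiplication, which is exactly the operation GPUs (and, as discussed below, the TPU) are built to parallelize.

```python
# A fully connected layer's forward pass is, at its core, a matrix multiplication.
import numpy as np

batch, in_features, out_features = 64, 1024, 1024   # arbitrary illustrative sizes

x = np.random.randn(batch, in_features).astype(np.float32)          # a batch of inputs
W = np.random.randn(in_features, out_features).astype(np.float32)   # learned weights
b = np.zeros(out_features, dtype=np.float32)                        # learned biases

y = x @ W + b            # one (64 x 1024) @ (1024 x 1024) matmul plus a broadcast add
y = np.maximum(y, 0.0)   # ReLU activation, cheap relative to the matmul

print(y.shape)  # (64, 1024)
```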
The alternative is to design a custom ASIC dedicated to performing a fixed set of operations extremely fast, since the entire chip’s logic area can be devoted to a few narrow functions. In the case of the Google TPU, those functions lend themselves to a high degree of parallelism, and processing neural networks is an “embarrassingly parallel” workload. Think of an ASIC as a drag racer: it can go very fast, but it can only carry one person in a straight line for a quarter mile. You couldn’t drive one around the block or take it out on an oval racetrack.
Here’s the catch: designing an ASIC can be an expensive endeavor, costing many tens or even hundreds of millions of dollars and requiring a team of fairly expensive engineers. Paying for all that development means many tens or hundreds of thousands of chips are needed to amortize those expenses across the useful lifetime of the design (typically 2-3 years). Additionally, the chip will need to be updated frequently to keep abreast of new techniques and manufacturing processes. Finally, because the designers froze the logic early in the development process, they will be unable to react quickly when new ideas emerge in a fast-moving field such as AI. An FPGA (and to a limited extent even a GPU), on the other hand, can be reprogrammed to implement a new feature.
If you think about Google’s business, it has three attributes that likely led it to invest in custom silicon for AI. Examining these factors may be helpful in assessing other companies’ potential likelihood of making similar investments.
  1. Strategic Intent: Google has repeatedly stated that it has become an “AI First” company. In other words, AI technology has a strategic role across the entire business: Search, self-driving vehicles, Google Cloud, Google Home, and many other new and existing products and services. It therefore makes sense that Google would want to control its own hardware accelerators (the TPU) and its own software framework (TensorFlow) on which it will build those products and services. The company is willing to invest to give itself an edge over others with similar, albeit perhaps less ambitious, aspirations.
  2. Required Scale: Google’s computing infrastructure is the largest in the world, and that scale means it may have the volume needed to justify the significant costs of developing and maintaining its own hardware platform for AI acceleration. In fact, Google claims that the TPU saved the company from building another 12 datacenters to handle the AI load. Let’s do some sensitivity analysis to understand the likely scale required for a single ASIC cycle (a quick numerical sketch of this break-even math follows this list). For the sake of argument, assume Google spent on the order of $100M, including mask production, and that each chip saves it on the order of $1K. For reference, a single Cloud TPU chip at 45 TFLOPS has a little more than one third the performance of an NVIDIA Volta GPU at a peak 120 TFLOPS, so roughly three TPU chips are needed to displace one high-end GPU. That implies Google just about breaks even if it deploys on the order of 100K TPUs, not accounting for the time value of money. That’s a lot of chips for most companies, even for Google. On the other hand, if it only cost Google around $60M to develop the chip and TPU board, and each chip saves around $2K, then only about 30K chips are needed to break even. A similar effort by another large datacenter operator may therefore require on the order of 30K-100K chips just to break even over a 2-3 year design lifetime.
  3. Importance of Google Cloud: Google execs can’t be satisfied to remain a distant 3rd in the global cloud computing market behind Amazon and Microsoft. They are investing a great deal in Google Cloud under the leadership of Diane Greene and are now enjoying some of the fastest growth in the industry. Google could use the pricing power and performance of the Cloud TPU, along with the popularity of TensorFlow, as a potentially significant advantage in capturing market share for the development of machine learning in the cloud. However, it is important to note that Google says Cloud TPU access will be priced at parity with a high-end GPU; Google does not intend to sell the TPU outright.
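The break-even arithmetic in point 2 is easy to sanity-check. Here is a tiny sketch using the article’s own order-of-magnitude assumptions (roughly $60M-$100M of development cost, roughly $1K-$2K saved per deployed chip); none of these figures are real Google numbers.

```python
# Back-of-the-envelope break-even for a custom AI ASIC program.
# All inputs are the rough order-of-magnitude assumptions from the text, not real data.

def chips_to_break_even(development_cost: float, savings_per_chip: float) -> float:
    """Chips that must be deployed before per-chip savings cover the development cost."""
    return development_cost / savings_per_chip

# Pessimistic case: ~$100M program cost, ~$1K saved per chip.
print(f"{chips_to_break_even(100e6, 1e3):,.0f} chips")   # 100,000

# Optimistic case: ~$60M program cost, ~$2K saved per chip.
print(f"{chips_to_break_even(60e6, 2e3):,.0f} chips")    # 30,000
```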
Frankly, while all of the other Super 7 members (Amazon, Alibaba, Baidu, Facebook, Microsoft, and Tencent) are capable of building their own accelerator, nobody exhibits all three attributes to the extent that Google does. Furthermore, of the companies that are actually close, several seem to be moving in different directions:
  1. Baidu has recently stated publicly that it is in partnership with NVIDIA for its AI initiatives in the Cloud, Home, and Autos. This doesn’t mean Baidu can’t and won’t build its own chip someday, but for now, the company seems to be satisfied to concentrate on its software and services, which the Chinese market already values. Also, Baidu’s cloud remains a relatively small part of its business.
  2. Microsoft is the 2nd-largest cloud services provider, has a large stable of AI engineers (more than 5,000), and is on a mission to “democratize” AI through its tools and APIs for enterprise customers. However, the company has decided (at least for now) to use Altera FPGAs from Intel in its Azure and Bing infrastructure, believing that it can benefit from a more flexible hardware platform in a fast-changing world. Also, Microsoft uses NVIDIA GPUs to train its neural networks.
  3. Amazon is perhaps the closest to the Google model outlined above; AWS is huge, and the company is investing heavily in AI. While Amazon may favor the Apache MXNet framework for AI development, its AWS cloud services for AI support all major frameworks, making it the open-software Switzerland of the AI development world. Also, being an NVIDIA-based yin to Google’s TPU-centric yang could be an effective strategy. However, Amazon has gone down the ASIC path before: it acquired Annapurna Labs in 2015, apparently to shave costs and latencies off the AWS infrastructure, so the company already has a chip team on board in Israel. Finally, Amazon, like Baidu, seems keen on using FPGAs for their all-programmable nature.
I’m not forecasting that none of the other Super 7 companies will jump the GPU ship and hop on board their own ASICs, but it seems highly unlikely to me that many will, at least not soon. They all seem to have their hands full developing Machine Learning models with their vast troves of data, and they are busy monetizing those models in a variety of products and services. Building an ASIC, and the software that enables it, is an ongoing and expensive proposition that could be a distraction. Alternatively, combining the performance of a GPU for training with the flexibility and efficiency of an FPGA for inference also holds a great deal of promise.
So I, for one, do not think the GPU sky is falling, at least not in the near future. AMD certainly believes there is plenty of demand for GPUs and is aiming its Vega technology right at it.
One final note: if one or more of these companies decides to go down the ASIC path, NVIDIA has made an attractive offer with its own Deep Learning Accelerator (DLA), an open source ASIC design that the company is building into its next-generation Drive PX platform for autonomous driving. AI hardware developers will be able to pair NVIDIA’s latest AI hardware and software technology (which is arguably best-in-class) with hardware IP that NVIDIA is contributing as open source, and NVIDIA could then potentially monetize that free open source hardware IP with software and services. I suspect this DLA technology will be better suited to IoT-type applications; NVIDIA CEO Jensen Huang half-jokingly spoke of TPUs for intelligent lawn mowers when he announced the DLA in April. But there’s nothing stopping a development team from taking it to the limit if it chooses to do so.
Disclosure: Moor Insights & Strategy, like all research and analyst firms, provides or has provided research, analysis, advising and/or consulting to many high-tech companies in the industry mentioned in this article, including AMD, Intel, Microsoft, and NVIDIA. The author does not have any investment positions in any of the companies named in this article except Amazon and Google.