There's a lot of hype around GPUs and NVIDIA, but how much do you know about TPUs?

[Image: Rack with TPUs (left) and GPUs (right)]

The article contains code examples, which you can find at the end.

Rise of GPUs

Graphics processing units have been around for quite some time. Their job is to render 2D and 3D graphics across millions of pixels, calculating each pixel's color, texture, and lighting and sending the result to your monitor. For a 60 Hz monitor, this means images are rendered 60 times per second.

Rendering graphics is one thing, but developing code for GPUs was a bit more difficult, at least until NVIDIA released CUDA (Compute Unified Device Architecture) in 2006, which allowed scientific researchers and developers in fields requiring extensive parallel mathematics to leverage the capabilities of a GPU. With the advent of machine learning in the early 2010s, it was discovered that massively parallel mathematics was exactly what ML engineers needed to train deep neural networks. Since then, CUDA's focus has shifted more and more toward optimization for machine learning and AI workloads.

Because GPUs were commercially available and relatively inexpensive at the time, the barrier to entry was low. An ML engineer could train models on their NVIDIA graphics card during the day and jump into a League of Legends game on the same hardware at night.

Honorable mention

AMD's GPUs come with Radeon Open Compute (ROCm), an open-source software stack designed to compete in the AI ecosystem. Although it is not as popular as CUDA, the gap is closing: Meta recently signed a deal to expand its existing partnership with AMD.

Tensor processing unit

In the early 2010s, Google predicted that the growing demands of its AI workloads, particularly the rapid adoption of deep learning in products such as Search and Photos, would require its data center computing capacity to double approximately every year and a half. Instead of scaling generic hardware indefinitely, Google looked for a more efficient solution designed specifically for neural network computation, and so the Tensor Processing Unit (TPU) was born. The TPU is a custom application-specific integrated circuit (ASIC) designed by Google specifically to accelerate AI workloads, deployed internally starting in 2015. By specializing hardware for the dense matrix operations at the heart of neural networks, TPUs achieve significantly better performance per watt than general-purpose CPUs or GPUs, reducing both power consumption and cooling requirements at data center scale.

Google has a history of exposing internally used tools to the world, and TPUs are another example of this. The existence of TPUs was first publicly announced at Google I/O in 2016. In 2018, Cloud TPU v2 became available to external users via Google Cloud. This was the first time that developers outside of Google were able to use the same accelerators that power Google's own AI systems. TPUs are also available in two performance variants, Efficiency and Performance, to meet different market needs.

NOTE: As of the 8th generation of TPUs announced during Google Next 2026, the Efficiency and Performance variants are renamed Inference and Training, respectively, in favor of a more descriptive, workload-based naming convention.

Architectural layout

From an architectural perspective, GPUs can be thought of as individual computers with accelerators (imagine your home gaming PC). If you want to connect them into a cluster, they do so over a network, and no matter how fast that network is, traffic still has to cross node boundaries and bandwidth drops as a result.

TPUs are designed from the ground up to be interconnected at scale, with a physical layout that includes thousands of TPU chips in a torus topology that gives each chip six neighbors (two per axis, one on each side). Knowing that interconnect bandwidth would be the biggest bottleneck at this scale, Google developed its own proprietary Inter-Chip Interconnect (ICI) network that provides consistent, high-bandwidth, low-latency connections between all chips in a pod, regardless of physical location. In a torus topology, there is no concept of crossing a node boundary. If you request TPUs, you will not receive the entire TPU cluster, or pod; rather, you only get a small subset, called a slice. To make this possible, Google developed Optical Circuit Switches (OCS) to rewire physical connections on the fly (entirely in software), allowing the same hardware to serve different workload shapes without physical reconfiguration.
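To make the slice idea concrete, here is a minimal sketch (assuming a Colab or Cloud TPU runtime with JAX installed) that lists the chips in whatever slice you were allocated; the device count and coordinates depend on the slice shape you requested:

```python
# Minimal sketch: inspect the TPU slice allocated to this runtime.
# Assumes a TPU runtime with JAX installed; output depends on slice shape.
import jax

devices = jax.devices()
print(f"Chips in this slice: {len(devices)}")
for d in devices:
    # On TPU, each device reports its (x, y, z) position in the torus.
    print(d.id, d.platform, d.coords)
```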

NOTE: Efficiency TPU versions use a 2D torus topology, while Performance TPUs use a 3D torus to deliver maximum performance with minimum latency.

Precision and range

A floating point number consists of three parts (see the short decoder sketch after this list):

  • Sign: positive or negative (represented by the first bit)
  • Exponent: determines the range of the number
  • Mantissa: the significant digits of a floating point number, which determine the precision
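To make this concrete, here is a small Python sketch (standard library only) that decodes these three parts from an FP32 value:

```python
# Minimal sketch: decode the sign, exponent, and mantissa bits of an FP32.
import struct

def decompose_fp32(x: float):
    # Reinterpret the float's 32 bits as an unsigned integer (big-endian).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits (biased by 127)
    mantissa = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2^2 -> sign 1, exponent 2 + 127 = 129
print(decompose_fp32(-6.5))  # (1, 129, 5242880)
```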

Traditionally, FP32 has been the standard for high-performance computing. When AI researchers moved to FP16 to save memory, they lost not only precision but also range. FP32 uses 8 bits for the exponent, while FP16 only uses 5. That 3-bit difference in the exponent translates into a range difference of almost 10³⁴ (FP32 covers up to about 3.4 × 10³⁸, while FP16 tops out at about 6.5 × 10⁴). In deep learning, where gradients can be incredibly small, FP16 often experiences underflow (numbers are rounded to 0 because they are too small for FP16's range to represent), which requires a technical workaround called "loss scaling" to keep the math stable.
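A quick NumPy sketch shows both the underflow and the loss-scaling workaround (the scale factor here is an arbitrary illustrative choice):

```python
# Minimal sketch: a gradient-sized value underflows in FP16 but not FP32.
import numpy as np

tiny_gradient = 1e-8  # far inside FP32's range, below FP16's
print(np.float32(tiny_gradient))  # 1e-08
print(np.float16(tiny_gradient))  # 0.0 -- silently flushed to zero

# Loss scaling: multiply up before casting to FP16, divide back after.
scale = 2.0 ** 14
scaled = np.float16(tiny_gradient * scale)  # now representable in FP16
print(float(scaled) / scale)                # ~1e-08, value recovered
```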

Google Brain (now part of Google DeepMind) solved this by inventing the brain floating point format (bfloat16), which simply shifts 3 bits from the mantissa to the exponent:

[Image: Table comparing FP32, FP16 and bfloat16]

By sacrificing precision for range, bfloat16 offers the same massive range as FP32, but with the reduced memory and bandwidth footprint of FP16. A big reason this works is that deep learning models are surprisingly noise-tolerant, and greater training stability matters more than a few extra decimal places of precision. Today, bfloat16 is the de facto standard for training modern LLMs on NVIDIA's GPUs and Google's TPUs.
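A short PyTorch sketch (assuming a recent PyTorch install) illustrates the trade: bfloat16 keeps FP32's range while FP16 overflows:

```python
# Minimal sketch: bfloat16 matches FP32's range; FP16 overflows early.
import torch

big = torch.tensor(1e30)
print(big.to(torch.float16))   # inf -- beyond FP16's max of ~6.5e4
print(big.to(torch.bfloat16))  # ~1e30 -- fits, just with fewer digits
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same ballpark as FP32
```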

Why XLA is important

Python's default execution model is eager: each step is performed as it occurs. This is great for debugging, because you can always insert print statements to check variables.

XLA (Accelerated Linear Algebra), on the other hand, is a domain-specific JIT compiler. Instead of executing steps one at a time, it analyzes the entire execution graph to optimize and merge operations before execution. This lazy approach results in an initial warm-up delay but is significantly faster than eager execution once training begins. The downside is transparency: your step-by-step Python code becomes an optimized black box, making traditional debugging strategies more difficult. That's why TPUs are the powerhouses for large-scale corporate training, while GPUs remain the flexible choice for rapid experimentation.

NOTE: Although XLA was designed for TPUs, since PyTorch 2.0 it has also found its way into the NVIDIA GPU ecosystem via tools like JAX and torch.compile.
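Here is a minimal JAX sketch of the eager-versus-compiled difference (the shapes are arbitrary assumptions); on a TPU runtime, the jitted function is what XLA actually compiles:

```python
# Minimal sketch: the first jitted call pays a one-time compile (warm-up)
# cost; later calls reuse the fused, XLA-optimized graph.
import jax
import jax.numpy as jnp

def step(w, x):
    # A few ops that XLA can fuse into a single kernel.
    return jnp.tanh(x @ w).sum()

compiled_step = jax.jit(step)

w = jnp.ones((512, 512))
x = jnp.ones((64, 512))

compiled_step(w, x).block_until_ready()  # slow: trace + compile
compiled_step(w, x).block_until_ready()  # fast: cached compiled graph
```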

TorchTPU

Google is developing a TorchTPU stack that provides native PyTorch support. This will allow you to run models on TPUs as-is, with full support for native PyTorch features. TorchTPU is currently in preview, and once it goes GA, I'll be sure to dive deeper into it!

Code examples

I'm attaching a few Jupyter notebooks that I ran via the Antigravity + Colab plugin so you can try them yourself:

As you can see from the results below, the TPU is indeed faster. However, my example isn't large or complex enough to really show the true speeds that TPUs can deliver.

NOTE: I have a Colab Pro account, which gives me access to additional GPUs and TPUs. The Colab free tier includes only the NVIDIA T4 and TPU v5e-1.

Interpreting the training results

Here are some benchmark training runs (epochs: 50, batch size: 512) in which I compared an NVIDIA T4 GPU using (standard) FP32 against a Google TPU v5e-1 (single-chip TPU) using bfloat16. As expected, the TPU was faster, but with lower precision:

[Image: T4 GPU (FP32) vs. TPU v5e-1 (bfloat16), epochs: 50, batch size: 512]
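For reference, a benchmark like this boils down to something like the following Keras sketch; this is a simplified reconstruction under my own assumptions about model and dataset, not the exact notebook code:

```python
# Simplified sketch of a bfloat16 training benchmark (illustrative only:
# the model, dataset, and sizes are assumptions, not the notebook's code).
import tensorflow as tf

# Compute in bfloat16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    # Keep the output layer in float32 for a numerically stable softmax.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=50, batch_size=512)
```

On a Colab TPU runtime, you would additionally build the model inside a tf.distribute.TPUStrategy scope so that training actually runs on the TPU cores.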

I then trained the same model on the T4 GPU with bfloat16 but noticed a massive drop in performance. This is because the T4 is an older GPU that does not support bfloat16 natively and has to emulate it, which causes a lot of overhead. When switching to a newer L4 GPU, I saw the expected (slight) performance gain along with the reduced precision:

[Image: T4 GPU (bfloat16) vs. L4 GPU (bfloat16), epochs: 50, batch size: 512]

Finally, I wanted to see how training would perform on a newer TPU v6e-1, and I was blown away by the improvement:

[Image: TPU v6e-1 (bfloat16), epochs: 50, batch size: 512]

Conclusion

Comparing GPUs and TPUs isn't exactly a walk in the park. They represent fundamentally different philosophies in architecture, memory management, and execution.

In modern companies, it's usually not about choosing one, but rather about using each where it shines. For fast iterations and smaller workloads, the flexibility of GPUs is unmatched. However, once a project reaches a certain scale, the TPU's domain-specific architecture becomes the clear winner in terms of efficiency and throughput.

TPUs are as fast as they are because they are specialized one-trick ponies, but truly harnessing that performance requires a deeper understanding of the stack. The biggest challenge is often not the computing power itself, but rather: "How do I feed the TPUs with data quickly and efficiently so that they don't become a bottleneck?" Your input pipeline has to be fast enough that the hardware never sits idle.
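As a taste of what that looks like in practice, here is a hedged tf.data sketch; the bucket path, feature spec, and sizes are all hypothetical placeholders, but the pattern of overlapping input work with device compute is the point:

```python
# Sketch of an input pipeline that overlaps data loading with training.
# All paths and feature names are hypothetical placeholders.
import tensorflow as tf

file_paths = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")

def parse_example(record):
    features = {
        "x": tf.io.FixedLenFeature([784], tf.float32),
        "y": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, features)
    return parsed["x"], parsed["y"]

dataset = (
    tf.data.Dataset.from_tensor_slices(file_paths)
    # Read many shards concurrently instead of one file at a time.
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder keeps batch shapes static, which XLA prefers.
    .batch(512, drop_remainder=True)
    # Prepare the next batches while the accelerator trains on this one.
    .prefetch(tf.data.AUTOTUNE)
)
```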

In future posts, I'll delve deeper into these advanced concepts to show how you can optimize data pipelines to get the most out of your TPUs.

BONUS: Google’s 8th generation TPUs were announced at Google Next


