Tensor Parallelism Explained: Scaling Large Models Across GPUs
As language models grow larger, serving them efficiently demands more than a single GPU can provide — not just in terms of memory capacity but in terms of computational throughput. A 70B-parameter model in FP16 requires approximately 140GB of VRAM to hold the weights alone, which exceeds the capacity of any single commercially available GPU. Even models that technically fit on a single GPU with aggressive quantization can benefit from multi-GPU parallelism when throughput requirements demand it.
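The arithmetic behind that 140GB figure is worth making explicit. A minimal sketch (the function name and the 80GB-GPU example are illustrative, not from any framework):

```python
def weight_memory_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate weight memory for a model; FP16/BF16 uses 2 bytes per parameter."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

# 70B parameters in FP16: weights alone need ~140 GB,
# far beyond a single 80 GB A100/H100.
print(weight_memory_gb(70))  # 140.0
```

Note this counts weights only; KV cache and activation memory come on top, which is why real deployments need headroom beyond this estimate.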
Tensor parallelism is the technique that enables efficient use of multiple GPUs for a single model forward pass. Unlike data parallelism (which runs independent copies of the model on different inputs) or pipeline parallelism (which distributes layers across GPUs sequentially), tensor parallelism distributes the computation of individual layers across multiple GPUs simultaneously. Understanding when and how to apply tensor parallelism — and how it compares to alternatives — is essential for anyone building inference infrastructure for large models.
The Fundamental Idea: Splitting Layer Computation
Transformer layers consist primarily of matrix multiplications: the attention projections (Q, K, V, and output) and the feedforward network (two linear transformations with a nonlinearity between them). These matrix multiplications are the primary computational bottleneck and the primary memory consumer. Tensor parallelism splits these matrices across GPUs so that each GPU holds and computes only a shard of the full computation.
For a column-parallel linear layer, each GPU holds a contiguous subset of the weight matrix's output columns and computes the corresponding output channels independently. For a row-parallel layer, each GPU holds a subset of the weight matrix's rows (a slice of the input features) and computes a partial sum of the output, which is then combined across GPUs using an all-reduce collective. The Megatron-LM paper (Shoeybi et al., 2019) showed how to split transformer attention and feedforward layers such that only two all-reduce operations are required per transformer layer in the forward pass, regardless of tensor parallel degree — a crucial efficiency that makes tensor parallelism practical at scale.
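The two split strategies can be verified numerically in a few lines of NumPy. This is a single-process sketch: the list of shards stands in for GPUs, the concatenation for an all-gather, and the sum for an all-reduce (the dimensions are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, tp = 8, 12, 4
x = rng.standard_normal((2, d_in))      # a small batch of activations
W = rng.standard_normal((d_in, d_out))  # the full weight matrix

# Column-parallel: each "GPU" holds a slice of output columns;
# results are concatenated (an all-gather in a real system).
col_shards = np.split(W, tp, axis=1)
y_col = np.concatenate([x @ w for w in col_shards], axis=1)

# Row-parallel: each "GPU" holds a slice of W's rows and sees only the
# matching slice of input features; outputs are partial sums combined
# by summation (an all-reduce in a real system).
row_shards = np.split(W, tp, axis=0)
x_shards = np.split(x, tp, axis=1)
y_row = sum(xs @ ws for xs, ws in zip(x_shards, row_shards))

assert np.allclose(y_col, x @ W)
assert np.allclose(y_row, x @ W)
```

Both strategies reproduce the unsharded matmul exactly; the difference is only in which collective is needed to reassemble the result.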
The key constraint is that all-reduce operations require all GPUs to communicate simultaneously and cannot overlap with computation. The time spent on all-reduce is therefore pure overhead added to the computation time. For tensor parallelism to be beneficial, the computation reduction from splitting the matrix must exceed the all-reduce overhead. This threshold depends on the interconnect bandwidth between GPUs: NVLink-connected GPUs on the same node can perform all-reduce at effective bandwidth of 300 to 600 GB/s, making tensor parallelism across 4 to 8 NVLink GPUs highly efficient. PCIe-connected GPUs have approximately 20x lower interconnect bandwidth, making tensor parallelism across PCIe connections viable only in edge cases.
Attention Layer Tensor Parallelism
In multi-head attention, tensor parallelism assigns a subset of attention heads to each GPU. With 32 attention heads and a tensor parallel degree of 4, each GPU handles 8 attention heads. Each GPU independently computes the Q, K, V projections and attention output for its assigned heads; the row-parallel output projection then produces a partial sum on each GPU, and a single all-reduce combines those partial sums into the full layer output. The memory reduction is proportional to the tensor parallel degree: a 4-way tensor parallel split stores 1/4 of the attention weights on each GPU.
For models using grouped-query attention (GQA) — including Llama 3 and Mistral — tensor parallelism must align with the GQA group structure. With 8 KV heads and a tensor parallel degree of 4, each GPU is responsible for 2 KV heads and the corresponding 8 query heads (assuming 32 query heads total). This alignment constraint means that the tensor parallel degree for GQA models must evenly divide the number of KV heads; degree 4 works for 8 KV heads (2 per GPU), while degree 8 leaves only 1 KV head per GPU, which some frameworks handle less efficiently, and degrees larger than the KV head count force frameworks to replicate KV heads across GPUs at a KV cache memory cost.
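Checking this divisibility constraint before launching a deployment is trivial to automate. A small helper (the function name is illustrative; the head counts match the example above, but always read the actual values from the model's config):

```python
def valid_tp_degrees(num_q_heads: int, num_kv_heads: int, max_gpus: int = 8) -> list[int]:
    """Tensor parallel degrees that evenly divide both the query and KV head counts."""
    return [tp for tp in range(1, max_gpus + 1)
            if num_q_heads % tp == 0 and num_kv_heads % tp == 0]

# 32 query heads, 8 KV heads (the GQA example from the text):
print(valid_tp_degrees(32, 8))  # [1, 2, 4, 8]
```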
Feedforward Layer Tensor Parallelism
The feedforward network in most transformer architectures consists of two linear transformations: an expansion projection that increases dimensionality (typically 4x or 8/3x the model dimension) and a contraction projection that reduces back to model dimension, with an activation function (GELU, SwiGLU, etc.) applied between them. Tensor parallelism splits the expansion along the output dimension (each GPU holds a subset of the expanded intermediate features) and the contraction along the input dimension (each GPU computes a partial sum contribution), with one all-reduce synchronizing the contraction output.
SwiGLU feedforward networks, used in Llama and Mistral models, have three weight matrices in the feedforward block rather than two: gate, up, and down projections. The gate and up projections are split column-parallel (each GPU computes a subset of intermediate features), the elementwise SiLU gate operation is applied locally on each GPU, and the down projection is split row-parallel with a final all-reduce. This requires careful implementation but achieves the same fundamental efficiency as two-matrix feedforward blocks.
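That the sharded three-matrix block reproduces the unsharded computation can be checked numerically. A single-process NumPy sketch with toy dimensions, where the final sum over shards stands in for the all-reduce:

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_ff, tp = 8, 16, 4
x = rng.standard_normal((2, d))
W_gate = rng.standard_normal((d, d_ff))
W_up = rng.standard_normal((d, d_ff))
W_down = rng.standard_normal((d_ff, d))

# Reference: unsharded SwiGLU feedforward block.
ref = (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Sharded: gate/up split column-parallel, SiLU and the elementwise gate
# applied locally per shard, down split row-parallel, partials summed.
out = sum(
    (silu(x @ g) * (x @ u)) @ dn
    for g, u, dn in zip(np.split(W_gate, tp, axis=1),
                        np.split(W_up, tp, axis=1),
                        np.split(W_down, tp, axis=0))
)

assert np.allclose(ref, out)
```

The key detail the sketch makes visible: the elementwise SiLU and gating touch only locally held intermediate features, so no communication is needed until the single all-reduce after the down projection.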
Tensor Parallelism vs Pipeline Parallelism: When to Use Each
Pipeline parallelism (PP) distributes transformer layers across GPUs sequentially: GPU 0 processes layers 1-10, GPU 1 processes layers 11-20, and so on. Each GPU receives the activations from the previous GPU, computes its layers, and passes activations to the next GPU. This approach has very different tradeoffs compared to tensor parallelism and is suited to different scenarios.
Tensor parallelism's all-reduce operations occur synchronously within each layer, requiring all GPUs to be available and communicating at each step. This makes it latency-sensitive but highly efficient for throughput when the computation per layer is large relative to communication volume. Pipeline parallelism's activations pass sequentially between stages, making it latency-insensitive but introducing pipeline bubble overhead — the time when some pipeline stages are idle waiting for their inputs. The pipeline bubble overhead is approximately (num_stages - 1)/num_microbatches of the useful compute time per batch, so large microbatch counts amortize the overhead well.
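The bubble formula is simple enough to evaluate directly when sizing a deployment. A sketch for a simple GPipe-style schedule (interleaved schedules reduce the bubble further):

```python
def bubble_overhead(num_stages: int, num_microbatches: int) -> float:
    """Pipeline bubble as a fraction of useful compute time (simple schedule)."""
    return (num_stages - 1) / num_microbatches

# 4 pipeline stages: small batches waste most of the pipeline,
# large microbatch counts amortize the bubble away.
print(bubble_overhead(4, 4))   # 0.75
print(bubble_overhead(4, 32))  # 0.09375
```

This is why pipeline parallelism favors large-batch, throughput-oriented serving: at 4 microbatches the stages spend most of their time idle, while at 32 the overhead falls below 10%.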
The practical recommendation: use tensor parallelism within a single NVLink-connected node for latency-sensitive inference workloads requiring 2 to 8 GPUs. Use pipeline parallelism or tensor-pipeline hybrid parallelism when spanning multiple nodes (where cross-node NVLink or InfiniBand bandwidth makes tensor parallelism less efficient) or when processing large batches where pipeline bubble overhead is a small fraction of total compute time. For most enterprise inference deployments using models up to 70B parameters on 4 to 8 NVLink GPUs, pure tensor parallelism is the right choice.
Practical Configuration for Production Inference
Most modern inference frameworks — vLLM, TGI, TensorRT-LLM — expose tensor parallelism degree as a simple configuration parameter (commonly tensor_parallel_size or num_shard). Setting this parameter correctly requires knowing the model's attention head count (for GQA alignment) and the interconnect topology of your GPU cluster.
For A100 SXM nodes with NVLink (8 GPUs per node connected via NVLink), tensor parallel degrees of 2, 4, or 8 are all valid. The optimal degree for a given workload is not always the maximum: tensor parallel degree 8 divides computation across more GPUs but introduces more all-reduce overhead and spreads the KV cache across more GPUs, increasing KV cache memory management complexity. For a 70B model requiring 4 GPUs for memory capacity, tensor parallel degree 4 is the natural choice. For a 13B model that fits on one GPU, tensor parallel degree 1 (no tensor parallelism) typically gives better per-request latency than distributing across multiple GPUs, though horizontal scaling across independent single-GPU replicas provides higher throughput.
Monitoring Tensor Parallel Efficiency
The efficiency of tensor parallel inference can be monitored through several metrics. GPU utilization across all GPUs in a tensor-parallel group should be approximately equal — significant imbalance suggests that the parallelism split is not aligned with the model architecture. All-reduce latency can be measured directly using GPU performance counters and should be less than 5 to 10% of total per-layer computation time for well-configured NVLink systems. NCCL profiling tools can diagnose all-reduce performance problems including bandwidth saturation, contention, and topology-related inefficiencies.
Memory utilization across the tensor-parallel group should also be approximately equal. Significant imbalance suggests that the model loading or KV cache allocation is not distributing work evenly, which can lead to out-of-memory failures on the most loaded GPU even when aggregate memory appears sufficient. The Latentforce platform provides per-GPU memory and utilization metrics in real time, making these diagnostics accessible without custom profiling tooling.
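A minimal imbalance check over per-GPU memory readings might look like the following. The function name and the 5% threshold are illustrative starting points, not framework defaults; readings would come from your metrics system or `nvidia-smi`:

```python
def memory_imbalance(used_gb: list[float], tolerance: float = 0.05) -> bool:
    """Flag a tensor-parallel group whose per-GPU memory use deviates from
    the group mean by more than `tolerance` (relative). Threshold is an
    illustrative starting point."""
    mean = sum(used_gb) / len(used_gb)
    return max(abs(u - mean) for u in used_gb) / mean > tolerance

print(memory_imbalance([61.2, 60.8, 61.0, 61.1]))  # False: well balanced
print(memory_imbalance([61.2, 60.8, 74.0, 61.1]))  # True: one hot GPU
```

A persistent `True` from a check like this is the early-warning signal for the out-of-memory failure mode described above, where one GPU exhausts memory while the group aggregate still looks healthy.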
Key Takeaways
- Tensor parallelism splits individual layer computations across GPUs, requiring only 2 all-reduce operations per transformer layer
- NVLink (300-600 GB/s) enables efficient tensor parallelism; PCIe (roughly 16-32 GB/s depending on generation) makes tensor parallelism impractical for most workloads
- For GQA models, tensor parallel degree must evenly divide the number of KV heads
- Use tensor parallelism within a node for latency-sensitive workloads; use pipeline or hybrid parallelism for multi-node deployments
- Tensor parallel degree 4 is the most common choice for 70B models on A100/H100 NVLink nodes
- Monitor all-reduce latency (target: <5-10% of computation time) and per-GPU memory balance to verify efficient tensor parallel operation
Conclusion
Tensor parallelism is the foundation of efficient large model inference on modern GPU hardware. Understanding its mechanics — how matrix multiplications are split, where communication occurs, and what interconnect requirements enable efficiency — gives infrastructure engineers the knowledge to configure multi-GPU inference deployments correctly rather than by trial and error. As models continue growing and hardware architectures evolve, the specific configurations will change, but the fundamental tradeoffs between computation and communication will remain the same.
The most important single piece of practical advice: always use NVLink-connected GPUs for tensor parallelism in production inference. The bandwidth difference between NVLink and PCIe is not a minor performance detail — it is the difference between a viable architecture and one that the all-reduce overhead makes impractical. This hardware constraint should be the starting point for any infrastructure procurement decision involving large model inference.