By Adrien Payong and Shaoni Mukherjee

Training trillion-parameter models is expensive, but inference is the ongoing operational cost. Each request takes up GPU memory, memory bandwidth, compute cycles, batching slots, and serving capacity. Model weights must live persistently in GPU memory. The key-value cache grows during generation. The serving engine must multiplex across many concurrent users with reasonable latency.
Let’s do a quick thought experiment.
The memory required to store the model weights equals the number of parameters multiplied by the per-parameter precision in bytes. For example, a seven‑billion‑parameter model in FP16 uses two bytes per weight, so the weights alone require about 7 × 10^9 × 2 ≈ 14 GB. This doesn’t include activation buffers, runtime overhead, batching, or KV caches. As you can imagine, large models rapidly exceed the VRAM of common GPUs, and with long contexts and large batches the KV cache can exceed the memory required by the weights. Buying larger GPUs is an unsustainable way to handle this growth.
Rather than constantly purchasing larger GPUs, teams can compress models so that they run more efficiently, which is why LLM compression is becoming critical for inference efficiency. There are several methods of compression: quantization, distillation, low-rank approximation, and pruning. In this article, we’ll focus on pruning, and more specifically, SparseGPT pruning and Wanda pruning. Both are post-training methods that can compress large language models without costly retraining.
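To make the thought experiment concrete, here is a minimal sketch of the arithmetic in plain Python. The model shape (32 layers, 32 KV heads, head dimension 128) and the sequence length and batch size are hypothetical values chosen for illustration, and the KV-cache formula assumes standard multi-head attention with one key and one value vector stored per token, per head, per layer.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GB; the leading 2 accounts for keys and values."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 1e9


# 7B parameters stored in FP16 (2 bytes each)
weights_gb = weight_memory_gb(7e9, 2)

# Hypothetical serving scenario: 8 concurrent sequences of 4096 tokens each
kv_gb = kv_cache_gb(num_layers=32, num_kv_heads=32, head_dim=128,
                    seq_len=4096, batch_size=8)

print(f"Weights:  {weights_gb:.1f} GB")
print(f"KV cache: {kv_gb:.1f} GB")
```

Under these assumptions the KV cache for a modest batch is already larger than the weights, which is why both need to be budgeted when sizing a GPU.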
LLM inference is constrained by four major production bottlenecks:
Pruning reduces the total number of non‑zero weights in the model. Sparse models store many zeros. In theory, this reduces memory and compute needs. If the inference stack can take advantage of this sparsity—using compressed storage formats and sparse matrix multiplication kernels—models can achieve lower VRAM footprint, higher throughput, and reduced latency. Real-world speedups depend on the hardware and software used to run the model. Zeroing weights without changing the kernel will not speed up inference.
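A rough storage calculation shows why zeros alone are not enough. The sketch below compares a dense FP16 matrix with the same matrix stored in CSR (compressed sparse row) form. The CSR layout with 32-bit indices is an assumption chosen for illustration, not the format of any particular inference engine.

```python
def dense_bytes(rows: int, cols: int, value_bytes: int = 2) -> int:
    """A dense matrix stores every entry, zero or not."""
    return rows * cols * value_bytes


def csr_bytes(rows: int, cols: int, sparsity: float,
              value_bytes: int = 2, index_bytes: int = 4) -> int:
    """CSR stores one value and one column index per non-zero, plus row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


rows = cols = 4096
dense = dense_bytes(rows, cols)
for sparsity in (0.5, 0.7, 0.9):
    sparse = csr_bytes(rows, cols, sparsity)
    print(f"sparsity {sparsity:.0%}: CSR uses {sparse / dense:.2f}x the dense size")
```

With these assumed widths, index overhead makes CSR larger than the dense matrix at 50% sparsity, and savings only appear at higher sparsity. Compact formats tied to a fixed pattern avoid most of this overhead.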
Beyond GPU servers, compression enables new modes of deployment. Pruned models can be deployed to smaller GPUs, edge devices, or even multi‑model servers. Memory savings let teams colocate several models on a single GPU, use cheaper instances, or fit larger batch sizes for the same hardware budget. Compression is directly tied to the economics of inference: reducing memory per token reduces the cost per token.
Neural networks are typically dense, meaning that most of their weights are non-zero. A sparse network, on the other hand, contains a large fraction of zero-valued weights. Network pruning increases sparsity by permanently setting weights to zero. For language models, there are two particularly important types of sparsity: unstructured and structured.
Unstructured pruning removes individual weights at arbitrary positions throughout the model. This gives the pruning algorithm maximum flexibility in choosing which weights to remove while minimizing the impact on model behavior. Because weights can be removed from any position in the matrix, unstructured sparsity can often maintain accuracy fairly well for a given level of sparsity.

However, GPUs are designed to efficiently handle dense matrix operations. Sparse matrices whose non-zero weights sit at irregular positions lead to irregular memory access patterns, indexing overhead, and low kernel utilization. If specialized sparse-matrix multiplication kernels are not used, dense matrix multiplication still processes the zero values. As a result, unstructured sparsity does not inherently reduce latency.
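As a small illustration of how irregular the result can be, here is a sketch of plain magnitude-based unstructured pruning on a toy matrix. It is the naive baseline rather than SparseGPT or Wanda, and the matrix values are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries anywhere in the matrix."""
    entries = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    num_prune = int(len(entries) * sparsity)
    pruned = [row[:] for row in weights]  # copy so the input is untouched
    for _, r, c in entries[:num_prune]:
        pruned[r][c] = 0.0
    return pruned


W = [
    [0.90, -0.10, 0.40, 0.05],
    [0.02, 0.03, -0.80, 0.01],
    [0.70, 0.60, -0.50, 0.30],
]
pruned = magnitude_prune(W, sparsity=0.5)
zeros_per_row = [sum(1 for w in row if w == 0.0) for row in pruned]

for row in pruned:
    print(row)
print("zeros per row:", zeros_per_row)
```

The overall sparsity is 50%, but the zeros are spread unevenly across rows. That irregularity is what makes this pattern hard for dense-oriented GPU kernels to exploit.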
Structured pruning eliminates weights in regular structures, such as rows, columns, blocks, channels, or fixed groups of weights. An example of a hardware-friendly structured pattern is 2:4 sparsity, also known as semi-structured sparsity. With 2:4 sparsity, at least two values within every block of four contiguous weights are zero. This results in a 50% sparsity rate.

Supported GPUs can accelerate these fixed sparse patterns. NVIDIA’s Ampere GPU architecture added support for Sparse Tensor Cores for fine-grained structured sparsity, including the 2:4 pattern. In theory, they can provide up to 2× matrix multiplication throughput compared with equivalent dense operations. But in practice, end-to-end latency improvement varies based on model, runtime, kernels, and workload.
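The 2:4 pattern described above can be sketched in a few lines of plain Python. This version uses weight magnitude to decide which two values survive in each group, which is one simple choice for illustration. The helper names are hypothetical and the example does not produce the compressed layout that sparse kernels consume.

```python
def prune_2_of_4(row):
    """Keep the 2 largest-magnitude values in each group of 4, zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest magnitudes within this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def is_2_of_4(row):
    """Check that every group of 4 contains at least 2 zeros."""
    return all(
        sum(1 for v in row[s:s + 4] if v == 0.0) >= 2
        for s in range(0, len(row), 4)
    )


row = [0.9, -0.1, 0.4, 0.05, 0.02, 0.03, -0.8, 0.01]
pruned = prune_2_of_4(row)
print(pruned)
print("valid 2:4 pattern:", is_2_of_4(pruned))
```

Because every group has the same number of zeros, the position of the kept values can be encoded compactly, which is what makes the pattern friendly to hardware.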
Traditional neural-network pruning workflows consisted of a three-step loop: (1) train a dense model, (2) remove the less important weights, and (3) fine-tune or retrain the sparse model to regain accuracy. Steps 2 and 3 are repeated until the desired level of sparsity is reached. Iterating this process can lead to very sparse networks, but it requires substantial computation.

For billion-parameter LLMs, this pruning workflow is rarely practical. You’d have to load enormous models across distributed GPU clusters and prepare appropriate recovery data. After each pruning step, you would also need to run fine-tuning and evaluate model quality before moving to the next compression stage.
Another limitation of some older pruning methods is that they require expensive second-order approximations or iterative weight updates. These methods may still maintain accuracy, but are more challenging to scale to larger models as we shift from millions to billions of parameters.
Modern LLM pruning methods focus on post-training, one-shot pruning. Instead of iterative fine-tuning, they take a pretrained model and run a small calibration dataset through it to estimate which weights are less important. The model is then pruned so that its outputs are kept “close” to the original outputs on a representative set of inputs, without full retraining.
SparseGPT and Wanda are two notable methods in this category. SparseGPT estimates importance using second-order layer reconstruction; Wanda uses a simpler activation-aware weight-importance score.
SparseGPT is a one-shot pruning method applicable to massive GPT-style models. It frames pruning as a layer-wise sparse regression problem. Given a linear layer with weights W and calibration activations X, we want to find a pruning mask M and a pruned matrix Ŵ that minimize the reconstruction error:

$$\min_{M,\,\widehat{W}} \; \lVert W X - (M \odot \widehat{W})\, X \rVert_2^2$$
The outputs of the pruned layer should closely match those of the original layer for a given set of calibration data. SparseGPT tries to match this layer’s output rather than naively dropping the smallest weights. It uses second-order information calculated from calibration activations to approximate the impact of pruning on a layer’s output. After identifying which weights to prune, SparseGPT updates the remaining weights to compensate for the removed ones and minimize reconstruction error.
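To see why matching layer outputs differs from dropping the smallest weights, the sketch below evaluates the reconstruction error on a toy layer. It only computes the objective for two candidate prunings. It does not implement SparseGPT's second-order solver or its weight updates, and the numbers are made up.

```python
def matmul(W, X):
    """Multiply W [out, in] by X [in, tokens] using plain lists."""
    return [
        [sum(w * x for w, x in zip(w_row, x_col)) for x_col in zip(*X)]
        for w_row in W
    ]


def reconstruction_error(W, W_pruned, X):
    """Squared error between dense and pruned layer outputs on X."""
    dense_out = matmul(W, X)
    sparse_out = matmul(W_pruned, X)
    return sum(
        (d - s) ** 2
        for d_row, s_row in zip(dense_out, sparse_out)
        for d, s in zip(d_row, s_row)
    )


W = [[1.0, 0.5]]
X = [
    [0.1, -0.1, 0.1],  # input feature 0: weak activations
    [2.0, -2.0, 2.0],  # input feature 1: strong activations
]

# Candidate A: drop the smaller weight (what magnitude pruning would do)
err_drop_small_weight = reconstruction_error(W, [[1.0, 0.0]], X)
# Candidate B: drop the larger weight, which sits on the weak input
err_drop_large_weight = reconstruction_error(W, [[0.0, 0.5]], X)

print(f"drop 0.5: error = {err_drop_small_weight:.3f}")
print(f"drop 1.0: error = {err_drop_large_weight:.3f}")
```

In this toy case, removing the larger weight changes the output far less than removing the smaller one, because the larger weight multiplies a weakly activated input. Reconstruction-based pruning captures this effect, while pure magnitude pruning misses it.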
SparseGPT balances accuracy and efficiency by combining several ideas:

Thanks to these properties, SparseGPT maintains quality far better than naive methods when pruning aggressively. Its complexity is higher, but the authors show it scales to models with tens to hundreds of billions of parameters.
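Wanda (Pruning by Weights and Activations) is a lightweight pruning approach that was introduced as an alternative to reconstruction-based methods such as SparseGPT. Unlike these methods, it doesn’t involve solving a layer-wise reconstruction problem or estimating Hessians. Instead, it uses a simpler activation-aware importance score:

$$S_{ij} = \lvert W_{ij} \rvert \cdot \lVert X_j \rVert_2$$

Here W_ij is the weight connecting input dimension j to output i, and ‖X_j‖_2 is the L2 norm of input dimension j over the calibration tokens.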
The intuition behind this score is that a weight will be important if it has a large magnitude and connects to an input dimension with strong activation. A weight with small magnitude, or one that connects to an input dimension with weak activation, is more likely to have a low importance score and can therefore be pruned. Wanda compares weights on a per-output basis: within each output row, it removes the weights with the smallest activation-scaled magnitudes. The authors highlight that Wanda requires no retraining or weight updates, so the pruned model can be used directly. In experiments on LLaMA and LLaMA-2, Wanda substantially outperforms magnitude-based pruning and is competitive with more complex pruning methods.
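Here is a small worked example of the score in plain Python. It accumulates the per-input-dimension norm across calibration batches and then prunes each output row independently. The batches, the weights, and the helper names are hypothetical, and the streaming accumulation is just one convenient way to gather the statistic.

```python
import math


def accumulate_sq_sums(sq_sums, batch):
    """Add squared activations per input dimension from one calibration batch.

    batch is a list of token vectors, each with one value per input dimension.
    """
    for token in batch:
        for j, x in enumerate(token):
            sq_sums[j] += x * x
    return sq_sums


def wanda_prune_rows(W, act_norms, sparsity):
    """Within each output row, zero the weights with the lowest |w| * norm."""
    pruned = []
    for row in W:
        scores = [abs(w) * n for w, n in zip(row, act_norms)]
        num_prune = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:num_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


batches = [
    [[3.0, 0.1, 1.0, 0.0], [4.0, 0.1, 1.0, 0.0]],
    [[0.0, 0.1, 1.0, 0.5], [0.0, 0.1, 1.0, 0.5]],
]
sq_sums = [0.0] * 4
for batch in batches:
    accumulate_sq_sums(sq_sums, batch)
act_norms = [math.sqrt(s) for s in sq_sums]

W = [
    [0.2, 0.9, 0.5, 0.4],
    [-0.1, 0.3, -0.6, 2.0],
]
pruned = wanda_prune_rows(W, act_norms, sparsity=0.5)
for row in pruned:
    print(row)
```

In the first row, the largest weight (0.9) is pruned because its input dimension is barely active, while the small weight 0.2 survives because it sits on the most active input. Magnitude pruning would make the opposite choice.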
Wanda’s simplicity yields several benefits:
The trade‑off is that Wanda may lose accuracy faster than SparseGPT at very high sparsity ratios. It does not support weight compensation and is designed primarily for unstructured pruning, although the implementation includes options for 2:4 and 4:8 patterns.
Let’s consider the following table:
| Feature | SparseGPT | Wanda |
|---|---|---|
| Method type | One-shot post-training pruning | One-shot post-training pruning |
| Main signal | Second-order reconstruction error minimization | Weight magnitude × activation norm |
| Hessian approximation | Yes | No |
| Weight update after pruning | Yes (optional) | No |
| Complexity | Higher | Lower |
| Runtime cost | Slower | Faster |
| Accuracy retention | Excellent | Very good |
| Implementation difficulty | Moderate | Easy |
| Best use case | High sparsity with strong quality retention | Fast baseline pruning |
SparseGPT shines when you need to retain accuracy and have engineering resources available. Wanda is better for rapid experiments, lighter‑weight deployments, or when approximate answers are good enough. Many teams benchmark both approaches to find the sweet spot for their model, sparsity target, and hardware.
Deploying pruned LLMs involves both model‑level and infrastructure‑level considerations. A typical workflow looks like this:

Below is a simplified PyTorch function that applies Wanda pruning to a single linear layer. In practice, you would extend this to all projection layers of the model and handle structured patterns.
```python
import torch
import torch.nn as nn


@torch.no_grad()
def wanda_prune_linear(
    layer: nn.Linear,
    input_activations: torch.Tensor,
    sparsity: float = 0.5,
) -> nn.Linear:
    """
    Apply Wanda-style unstructured pruning to one Linear layer.

    Args:
        layer: PyTorch Linear layer.
        input_activations: Calibration activations with shape
            [batch, seq_len, hidden_dim] or [num_tokens, hidden_dim].
        sparsity: Fraction of weights to prune per output row.

    Returns:
        The pruned layer, modified in-place.
    """
    if not isinstance(layer, nn.Linear):
        raise TypeError("wanda_prune_linear expects an nn.Linear layer.")
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1.")

    # Flatten activations to shape [n_tokens, input_dim]
    if input_activations.dim() == 3:
        X = input_activations.reshape(-1, input_activations.shape[-1])
    else:
        X = input_activations

    W = layer.weight
    if X.shape[-1] != W.shape[1]:
        raise ValueError(
            f"Activation dimension {X.shape[-1]} does not match "
            f"layer input dimension {W.shape[1]}."
        )

    # Compute the L2 norm of each input dimension.
    # float32 avoids overflow when the activations are half precision.
    activation_norm = torch.norm(X.float(), p=2, dim=0)

    # Compute Wanda importance scores: |W_ij| * ||X_j||
    scores = torch.abs(W).float() * activation_norm.unsqueeze(0)

    # Number of weights to prune per output row
    num_prune = int(W.shape[1] * sparsity)
    if num_prune == 0:
        return layer

    # Build pruning mask: mark the lowest-scoring weights in every row
    prune_indices = torch.topk(
        scores, k=num_prune, dim=1, largest=False
    ).indices
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, prune_indices, False)

    # Apply mask in-place
    W.mul_(mask)
    return layer


# Example usage
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_dim = 4096
linear = nn.Linear(hidden_dim, hidden_dim, bias=False).half().to(device)
calibration_activations = torch.randn(
    4, 128, hidden_dim,
    device=device,
    dtype=torch.float16,
)
pruned_layer = wanda_prune_linear(
    linear,
    calibration_activations,
    sparsity=0.5,
)
zero_count = torch.sum(pruned_layer.weight == 0).item()
total_count = pruned_layer.weight.numel()
print(f"Sparsity: {zero_count / total_count:.2%}")
```
This is an example of simplified Wanda-style pruning for a single nn.Linear layer in PyTorch. Calibration activations are used to compute the L2 norm of every input dimension. Activation norms are multiplied by the absolute value of the weight to calculate Wanda importance scores. The lowest-scoring weights are pruned for each row of the output according to the desired sparsity ratio by multiplying by a binary mask in place. In this example, a half-precision linear layer is created. Random calibration activations are generated. The layer is pruned to 50% sparsity, and the final percentage of zeros in the weight is printed.
Evaluating pruned models requires metrics that capture both quality and efficiency:

It is important to set realistic expectations: 50% sparsity does not guarantee 2× speedup. The real speedup depends on hardware support, kernel implementation, batch size, and whether the workload is dominated by weight computation or KV‑cache operations. Structured sparsity is easier to accelerate because hardware and software support specific patterns. Unstructured sparsity often requires custom CUDA kernels or frameworks like Triton.
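A back-of-the-envelope estimate makes this expectation concrete. The sketch below applies Amdahl's law: only the share of latency spent in weight matrix multiplications benefits from a faster sparse kernel. The fractions are hypothetical, and the 2× kernel speedup is the theoretical upper bound for 2:4 sparsity rather than a measured value.

```python
def end_to_end_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only the matmul share of latency gets faster."""
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - matmul_fraction) + matmul_fraction / kernel_speedup
    return 1.0 / remaining


# Hypothetical shares of latency spent in weight matmuls
for fraction in (0.3, 0.6, 0.9):
    speedup = end_to_end_speedup(fraction, kernel_speedup=2.0)
    print(f"matmul share {fraction:.0%}: end-to-end speedup {speedup:.2f}x")
```

The closer a workload is to being dominated by KV-cache reads, scheduling, or other non-matmul work, the further the end-to-end number falls below the kernel-level figure.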
Pruning is not just a model optimisation technique—it is an infrastructure problem. Several factors determine whether sparsity translates into speed:
Figure: pruning speeds up LLM inference only when the model, sparse checkpoint format, GPU kernels, and serving infrastructure are properly aligned. The figure uses a simple 2:4 sparsity example to show why pruning alone is not enough for real production gains.
Choose SparseGPT or Wanda when you want to reduce inference cost without retraining from scratch. Use them to fit larger models into smaller GPUs, to serve more models on a single GPU node, and to shrink the weight memory footprint so that more memory is left for the KV cache. They can also help you rapidly test compressed variants of your LLMs, prepare models for edge deployment, and improve the economics of deploying open-source LLMs.
Don’t rely on pruning alone if your inference engine can’t exploit sparse weights, your workload is dominated by KV-cache access, your model quality drops after pruning, or the deployment hardware doesn’t support sparse acceleration.
Choose SparseGPT if maintaining quality at a higher sparsity level is your primary goal. Choose Wanda if fast experimentation and low implementation complexity matter more.
The right production question is not simply: “Is the model sparse?” The better question is: “Does this sparse model reduce cost per useful token while preserving quality?”
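One way to frame that question is as a small calculation. The sketch below compares cost per million generated tokens for a dense and a pruned deployment and accepts the pruned model only if it is cheaper and stays within a quality budget. All prices, throughputs, and quality scores are hypothetical placeholders, not benchmark results.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def is_worth_deploying(dense_cost, sparse_cost, dense_quality,
                       sparse_quality, max_quality_drop):
    """Accept the sparse model only if it is cheaper and quality holds up."""
    cheaper = sparse_cost < dense_cost
    quality_ok = (dense_quality - sparse_quality) <= max_quality_drop
    return cheaper and quality_ok


# Hypothetical measurements from your own benchmark runs
dense_cost = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=1000)
sparse_cost = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=1300)

decision = is_worth_deploying(
    dense_cost, sparse_cost,
    dense_quality=0.70, sparse_quality=0.69,  # e.g. accuracy on your eval set
    max_quality_drop=0.02,
)
print(f"dense:  ${dense_cost:.3f} per 1M tokens")
print(f"sparse: ${sparse_cost:.3f} per 1M tokens")
print("deploy sparse model:", decision)
```

The throughput figures should come from the actual serving stack, since a pruned checkpoint run through dense kernels would show no cost difference at all.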
What is LLM pruning? LLM pruning is a compression technique that removes less important weights from a model. The goal is to reduce memory usage and inference cost while preserving model quality.
What is the difference between SparseGPT and Wanda? SparseGPT uses second-order reconstruction to preserve layer outputs after pruning. Wanda uses a simpler activation-aware score based on weight magnitude and input activation norm, making it easier and faster to apply.
Does pruning automatically make an LLM faster? No. Pruning only improves speed when the inference stack can exploit sparse weights through sparse formats, sparse kernels, and compatible hardware. If dense kernels still process zero weights, latency may not improve.
When should I choose SparseGPT? Choose SparseGPT when maintaining quality at higher sparsity levels is the main goal. It is more complex, but it is designed to preserve model behavior through reconstruction and weight compensation.
When should I choose Wanda? Choose Wanda when fast experimentation, simplicity, and low implementation complexity matter more. It is a strong baseline for quickly testing how pruning affects a model before investing in more complex optimization.
SparseGPT and Wanda both show that large language models can be pruned after training without retraining. SparseGPT uses a more involved reconstruction‑based approach with second‑order approximations to achieve this at scale while retaining accuracy. Wanda takes a much simpler activation‑aware approach, multiplying each weight’s magnitude by the norm of its input activation, which enables fast pruning with minimal engineering cost. Both methods produce sparse models that relieve memory pressure and, with the right kernels and hardware, improve inference throughput and lower latency.
Pruning isn’t the only consideration, though. Speedups and cost savings will only be realised with system‑level support for sparse formats, custom sparse kernels, smart memory optimisation, and other techniques such as quantization and KV‑cache management. Inference stacks deployed in production will likely use a mix of strategies (pruning, quantization, caching, batching, speculative decoding, routing) to serve responsive AI services at the lowest possible cost. In that future, pruning will be a standard part of the LLM optimisation pipeline: not merely a research novelty but a practical tool for reducing cost per token and scaling AI services efficiently.