By Joe Keegan and Anish Singh Walia

Deploying large language models on Kubernetes introduces a challenge that traditional application deployments rarely face: how do you get tens or hundreds of gigabytes of model weights loaded into your inference pods efficiently? A 7B parameter model requires roughly 14GB of storage, while a 70B model can exceed 140GB. These sizes fundamentally change how you think about pod startup, scaling, and storage architecture.
This tutorial provides an introductory survey of the common strategies for loading and caching model weights for vLLM pods running on Kubernetes. Our goal is to help you understand your options and their tradeoffs so you can make an informed architectural decision. This is not an exhaustive treatment as topics like model versioning strategies, multi-tenancy considerations, and detailed performance tuning deserve their own deep dives. Consider this a starting point for your exploration.
Before examining specific strategies, let's understand the key vLLM model loading and caching considerations that should guide your choice.
Startup latency measures how long before a pod can serve its first request. For autoscaling workloads, this directly impacts how quickly you can respond to traffic spikes.
Storage efficiency considers whether you’re duplicating model weights across nodes or sharing a single copy. With models measured in tens of gigabytes, duplication costs add up.
Network bandwidth encompasses both external pressure on model registries like HuggingFace (which may rate-limit aggressive downloaders) and internal network utilization within your cluster.
Operational complexity accounts for the moving parts in your system. More components mean more potential failure modes and more to debug when things go wrong.
Scaling behavior asks what happens when a new node joins your cluster or when pods get rescheduled. Does the new node need to download everything from scratch?
Model update and rollback considers how easy it is to deploy a new model version or revert to a previous one. Some strategies make this trivial; others require careful coordination.
Performance characteristics vary significantly by infrastructure provider and configuration. Whatever strategy you select, test it under realistic conditions to validate it meets your specific requirements.
Before diving into loading strategies, consider where your models will come from. The source you choose affects reliability, security, and operational complexity.
HuggingFace Hub is the most common starting point. vLLM integrates natively with it, making development frictionless. Just specify a model identifier and vLLM handles the rest. However, HuggingFace is an external dependency outside your control. If their service experiences issues during a traffic spike or node failure, your cluster can’t scale or recover.
Self-hosted object storage such as DigitalOcean Spaces gives you control. You download models once to your own storage, and your cluster pulls from infrastructure you manage. This adds a setup step but eliminates external dependencies during runtime.
Shared filesystems, for example, DigitalOcean Volumes with NFS or other ReadWriteMany PVC providers, allow you to create a central model repository that all pods in your cluster can access. With this setup, model files are downloaded once to the shared volume and can be used by every inference pod, dramatically improving storage efficiency. The exact performance will depend on your shared filesystem’s implementation and underlying storage infrastructure.
HTTP endpoints offer flexibility for custom setups including internal model registries, artifact servers, or CDNs. KServe’s StorageInitializer supports HTTP sources alongside object storage, making it straightforward to integrate with whatever infrastructure you have.
The production principle: control your model sources. For development and experimentation, pulling directly from HuggingFace is convenient and fast. For production, mirror the models you depend on to storage you control. You don’t want an external service’s availability to determine whether you can scale up, replace a failed pod, or recover from a node failure.
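To make this mirroring step concrete, here is one hedged sketch of a one-off Kubernetes Job that pulls a model from HuggingFace and syncs it to a DigitalOcean Spaces bucket. The model id, bucket, region, image, and Secret names are placeholders; the AWS CLI is used only as a generic S3-compatible client for Spaces.

```yaml
# Illustrative one-time mirroring Job: download from HuggingFace, upload to Spaces.
apiVersion: batch/v1
kind: Job
metadata:
  name: mirror-model-to-spaces
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: mirror
        image: python:3.11-slim            # any image with pip, bash, and network access
        command: ["/bin/bash", "-c"]
        args:
          - |
            set -euo pipefail
            pip install --quiet "huggingface_hub[cli]" awscli
            # Download the model snapshot to scratch space (placeholder model id).
            huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
              --local-dir /scratch/model
            # Upload to Spaces through its S3-compatible endpoint (placeholder bucket/region).
            aws s3 sync /scratch/model \
              s3://my-model-bucket/meta-llama/Llama-3.1-8B-Instruct \
              --endpoint-url https://nyc3.digitaloceanspaces.com
        env:
        - name: HF_TOKEN                   # only needed for gated models
          valueFrom:
            secretKeyRef: {name: hf-token, key: token}
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef: {name: spaces-creds, key: access_key}
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef: {name: spaces-creds, key: secret_key}
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        emptyDir: {}                       # needs enough node disk for the full model
```

You can run this Job once per model you depend on, then point the loading strategies below at your own bucket instead of HuggingFace.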
With model sources understood, let’s examine the strategies for loading those models into your pods.
With this approach, vLLM itself handles model downloads on startup. Point vLLM at a model identifier or path, and it pulls the weights before beginning to serve requests.
To local storage: Each pod downloads model weights to local ephemeral storage (emptyDir) or a pod-specific persistent volume. This is the simplest approach with no shared infrastructure to manage and no coordination between pods. vLLM’s native HuggingFace integration means this often works out of the box.
The appeal is simplicity. For development, testing, or single-replica deployments, this gets you running quickly. The downsides emerge at scale: every pod downloads independently, multiplying bandwidth consumption and putting pressure on upstream sources. Cold starts can take 10+ minutes for larger models.
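Here is a minimal sketch of this pattern, assuming the upstream vllm/vllm-openai image and a placeholder model id; the emptyDir volume gives vLLM pod-local scratch space for the downloaded weights.

```yaml
# Each replica downloads the model itself into pod-local ephemeral storage.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-local-download
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm-local-download}
  template:
    metadata:
      labels: {app: vllm-local-download}
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest     # upstream OpenAI-compatible vLLM server image
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]   # placeholder model id
        env:
        - name: HF_HOME                    # point the HuggingFace cache at the emptyDir
          value: /models                   # (vLLM's --download-dir flag is an alternative)
        - name: HF_TOKEN                   # only needed for gated models
          valueFrom:
            secretKeyRef: {name: hf-token, key: token}
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        emptyDir: {}                       # pod-local, discarded when the pod goes away
```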
To shared storage: All pods can point their cache directory at a shared filesystem (NFS or ReadWriteMany PVC). The first pod to start downloads the model, and subsequent pods find it already cached.
This improves efficiency by downloading once and using everywhere. However, it introduces concurrency risks. If multiple pods start simultaneously, they may all attempt to download the same model, potentially corrupting files or creating race conditions. You need to ensure only one pod writes to the cache at a time through file locks, a readiness mechanism, or scaling to one replica initially.
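One hedged sketch of the shared-cache variant: a ReadWriteMany PVC (the storage class name below is a placeholder for whatever NFS-backed class your cluster provides), swapped in for the emptyDir volume from the previous sketch.

```yaml
# Shared model cache: one ReadWriteMany volume used by every vLLM replica.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-shared
spec:
  accessModes: ["ReadWriteMany"]     # requires an RWX-capable provisioner (e.g. NFS)
  storageClassName: nfs-rwx          # placeholder: your cluster's RWX storage class
  resources:
    requests:
      storage: 200Gi
# In the Deployment above, replace the emptyDir volume with this claim:
#   volumes:
#   - name: model-cache
#     persistentVolumeClaim:
#       claimName: model-cache-shared
# and keep HF_HOME=/models so every pod reads (and the first pod writes) the same cache.
```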
For shared storage scenarios, the Central Job approach (covered below) is typically more robust. It separates the download step from pod startup entirely, avoiding the concurrency issues and giving you explicit control over when models are populated.
Init containers let you separate download logic from your inference runtime. An init container runs before the main vLLM container, downloading the model to a shared volume (typically an emptyDir). Once complete, the main container starts with the model already in place.
This separation of concerns can be valuable. Your init container can use specialized download tools, implement retry logic, or pull from private registries with different credentials than your inference container needs. The main vLLM container starts with the model ready, simplifying its configuration.
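A sketch of this separation, assuming a generic Python image running huggingface-cli in the init container and the upstream vLLM image for serving; names, the model id, and the Secret are placeholders.

```yaml
# Init container downloads the weights; the vLLM container starts with them in place.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-init-download
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm-init-download}
  template:
    metadata:
      labels: {app: vllm-init-download}
    spec:
      initContainers:
      - name: fetch-model
        image: python:3.11-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            pip install --quiet "huggingface_hub[cli]"
            huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
              --local-dir /models/meta-llama/Llama-3.1-8B-Instruct
        env:
        - name: HF_TOKEN                   # only needed for gated models
          valueFrom:
            secretKeyRef: {name: hf-token, key: token}
        volumeMounts:
        - name: model-volume
          mountPath: /models
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        # Serve from the local path populated by the init container.
        args: ["--model", "/models/meta-llama/Llama-3.1-8B-Instruct"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        emptyDir: {}
```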
Several tools and platforms support this approach:
Custom init containers using tools like huggingface-cli, s3cmd, or similar utilities let you download models before the main container starts, giving you full control over the download process. For example, you could use an S3-compatible client such as s3cmd or the AWS CLI against DigitalOcean Spaces to fetch models from your own object storage. This approach lets you tailor the logic to your needs but does require building and maintaining your own init container image.
DigitalOcean Spaces and Volumes can be leveraged for this init container pattern. Use an init container to download model files from DigitalOcean Spaces (an S3-compatible object store) directly into a DigitalOcean Volume-backed PVC. This centralizes storage on infrastructure you control and lets the init container pre-fetch the model weights before vLLM starts, avoiding redundant downloads when the volume persists across pod restarts. The pattern integrates well with DigitalOcean's managed Kubernetes, keeping your storage and deployment entirely under your control; a sketch of this variant appears below.
vLLM Production Stack implements this pattern through its Helm chart. The chart supports an initContainer configuration that can download models before the main vLLM container starts. You can configure the deployment to mount the PVC during initialization. The chart also provides flexibility in storage access modes using ReadWriteOnce volumes for single-node deployments and ReadWriteMany volumes for multi-node scenarios where pods need to share storage across nodes. For production deployments requiring horizontal scaling, you’ll want ReadWriteMany storage (such as NFS) and should consider pre-populating the shared storage to avoid race conditions when multiple pods start simultaneously.
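Here is the Spaces variant mentioned above as a pod-spec fragment, using the AWS CLI purely as an S3-compatible client; the bucket, region, PVC name, and credentials are placeholders, and only the init container and volume differ from the previous init-container sketch.

```yaml
      # Pod-spec fragment: swap into the previous init-container sketch.
      initContainers:
      - name: fetch-from-spaces
        image: amazon/aws-cli              # official AWS CLI image, used as an S3-compatible client
        command: ["/bin/sh", "-c"]
        args:
          - >
            aws s3 sync
            s3://my-model-bucket/meta-llama/Llama-3.1-8B-Instruct
            /models/meta-llama/Llama-3.1-8B-Instruct
            --endpoint-url https://nyc3.digitaloceanspaces.com
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef: {name: spaces-creds, key: access_key}
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef: {name: spaces-creds, key: secret_key}
        volumeMounts:
        - name: model-volume
          mountPath: /models
      # ...main vLLM container unchanged from the previous sketch...
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-volume-pvc      # placeholder: a PVC backed by a DigitalOcean Volume
```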
The main limitation of init container approaches is that downloads still happen per-pod unless you combine this with shared storage. It adds complexity to your pod specification, but for many teams the cleaner separation justifies this.
Rather than downloading models as part of pod startup, you can use dedicated Kubernetes Jobs to pre-populate storage before inference pods run.
With this approach, a Job (or DaemonSet) downloads models to node-local storage, typically a hostPath volume on the node’s local SSD, before inference pods are scheduled to that node.
The benefit is performance: once downloaded, pods on that node load models from fast local storage. No network filesystem bottlenecks, no shared storage dependencies. Download happens once per node regardless of how many pods run there.
The complexity lies in coordination. How do inference pods know the model is ready? You need scheduling constraints such as node selectors, taints and tolerations, or custom logic to ensure pods only schedule to nodes where the download Job has completed. hostPath volumes have security implications you’ll need to evaluate. You also need a strategy for cleaning up storage when nodes are removed.
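A hedged sketch of the per-node pattern using a DaemonSet rather than individual Jobs; the node label, hostPath directory, image, and model id are all placeholders, and you still need a readiness signal before scheduling inference pods onto the node.

```yaml
# Runs once per node: download the model to node-local disk, then sit idle.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-prefetch
spec:
  selector:
    matchLabels: {app: model-prefetch}
  template:
    metadata:
      labels: {app: model-prefetch}
    spec:
      nodeSelector:
        workload: vllm-inference            # placeholder: a label on your GPU node pool
      initContainers:
      - name: fetch-model
        image: python:3.11-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            pip install --quiet "huggingface_hub[cli]"
            huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
              --local-dir /models/meta-llama/Llama-3.1-8B-Instruct
        volumeMounts:
        - name: node-model-dir
          mountPath: /models
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9    # keeps the pod alive once the prefetch has finished
      volumes:
      - name: node-model-dir
        hostPath:
          path: /var/lib/models             # placeholder node-local path; evaluate the security tradeoffs
          type: DirectoryOrCreate
```

Inference pods scheduled to these nodes would mount the same hostPath and pass the local directory to vLLM's --model flag, with a node label, taint, or controller (such as the KServe option below) signaling when the prefetch has completed.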
If you want a Kubernetes-native solution for this approach without building custom tooling, KServe's LocalModelCache is purpose-built for it: a CRD designed for pre-caching models on nodes, with a controller that manages the download jobs and handles node affinity automatically. If you're evaluating per-node caching, LocalModelCache is worth serious consideration.
Alternatively, a single Job can download models to shared storage, such as an NFS-backed PVC, that all pods then access.
This centralizes model management: one download populates storage for your entire cluster. It scales to any number of pods and makes it straightforward to manage multiple model versions in a central location.
The tradeoff is dependency on shared storage performance. If your shared storage can’t deliver sufficient read throughput, model loading becomes a bottleneck. You’ll also need a strategy for managing storage lifecycle: cleaning up old model versions, handling storage capacity, and so on.
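A sketch of the central-job pattern under the same placeholder names used earlier: one Job populates the NFS-backed ReadWriteMany PVC that every inference pod then mounts.

```yaml
# One Job populates a shared RWX volume for the whole cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: populate-model-cache
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: fetch-model
        image: python:3.11-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            pip install --quiet "huggingface_hub[cli]"
            huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
              --local-dir /models/meta-llama/Llama-3.1-8B-Instruct
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-shared       # the RWX PVC from the earlier sketch
```

Inference pods then mount the same claim (read-only if you prefer) and pass the populated path to vLLM's --model flag; you can gate their rollout on `kubectl wait --for=condition=complete job/populate-model-cache`.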
A seemingly straightforward approach is including model weights directly in your container image. Your Dockerfile copies the model files, and every pod gets them automatically through the normal image pull process.
This approach offers conceptual simplicity: your deployment is fully self-contained with no external dependencies at runtime. Images are immutable, making rollbacks as simple as deploying a previous image tag.
However, the practical challenges are significant. Container images exceeding 100GB are unwieldy as they’re slow to build, slow to push, and slow to pull. Registry storage costs increase substantially. Some container registries impose layer size limits or total image size limits that may make this approach impossible for larger models. Image pulls that take 20+ minutes on cold nodes create unacceptable scaling latency for most production workloads.
Consider this approach only for very small models or development scenarios where image size is acceptable. For most production deployments, you’ll want to explore other options.
Some teams use specialized CSI drivers that mount object storage directly and stream data on demand. Examples include JuiceFS, SeaweedFS, and Mountpoint for Amazon S3.
These drivers present object storage as a filesystem to your pods. The appeal is transparency: your application doesn't need to know the models come from object storage. Some implementations can begin serving before the full model is downloaded, streaming weights as needed.
Performance varies significantly by implementation and workload, and this adds infrastructure complexity since it's another component to deploy, monitor, and troubleshoot. For most teams, this is an advanced option to evaluate carefully against simpler strategies. It may be the right choice for specific scenarios, but it's rarely the starting point.
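To show the general shape of this option, here is a deliberately generic sketch; the storage class and driver are hypothetical stand-ins, since each real driver (JuiceFS, SeaweedFS, Mountpoint for S3, and so on) has its own installation steps and parameters documented upstream.

```yaml
# A PVC backed by a (hypothetical) object-storage CSI storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-bucket-mount
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: object-store-csi   # hypothetical: created when you install the CSI driver
  resources:
    requests:
      storage: 200Gi
# Pods mount this PVC like any other volume; the driver streams objects on demand,
# so vLLM can point --model at a path inside the mount without a separate download step.
```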
| Strategy | Cold Start Time | Storage Efficiency | Complexity | Pros | Cons | Best For | Example Tools / Techniques |
|---|---|---|---|---|---|---|---|
| vLLM native (local) | Slow (minutes; model downloaded per pod) | Low (models duplicated on every pod’s local disk) | Low | Easiest setup, inline with vLLM defaults, no shared storage required | Wastes disk for each replica, slow multi-pod startup, bandwidth-intensive on large clusters | Dev/test, one replica or single-node prod | vLLM with --download-dir (or HF_HOME) pointing at pod-local storage |
| vLLM native (shared) | Medium (shared cache, faster subsequent pods) | High (one copy on shared storage for many pods) | Medium | Quick for repeat deployments, simplifies multi-pod model access | Needs shared storage (NFS, PVC), slower if sharing storage with other apps, possible race conditions on first download | Small clusters, quick launches | Shared ReadWriteMany PVC, NFS mounts |
| Init container | Slow-Medium (depends on storage, parallelization) | Medium-High (can leverage shared storage) | Medium | Clean separation of download logic, repeatable, flexible in scripting and sources | Extra container and pod spec complexity, per-pod downloads without shared storage, coordination for large clusters needed | CI/CD pipelines, separation of concerns | Init containers with s3cmd, huggingface-cli; shared PVC |
| Per-node job | Fast (warm start once downloaded per node) | Medium (one copy per node, not per pod) | High | High throughput from node-local SSDs, reduces network hot spots, independent of shared storage | Requires job coordination & node affinity, increased infra complexity, needs hostPath storage, security considerations | Large clusters, high performance | KServe LocalModelCache, DaemonSets, Job+hostPath |
| Central job | Medium (faster than per-pod, slower than local) | High (one copy serves all pods in cluster) | Medium | Centralizes management, easy version upgrades, only one download, robust handling of race conditions | Dependent on shared storage performance, some ops complexity with storage cleanup/versioning | Clusters running many replicas or models | K8s Job to PVC/NFS, Argo Workflows, custom model-populator |
| Container image | Very Slow (image pull can be 10+ minutes for 100GB+ images) | Low (duplicated in every image and pull) | Low-Medium | Easiest to deploy in principle, simple rollback/version control, no runtime network dependencies | Massive images, slow pushes/pulls, registry limits easily exceeded, slow rollouts and rollbacks on cold nodes, not scalable for large models | Tiny models, demo/prototyping, air-gapped env | Custom images with COPY, multi-stage Docker builds |
| CSI lazy loading | Varies (depends on driver and access patterns) | High (streaming from object storage) | High | Models stream directly from external or object storage, efficient for infrequent/random access, saves disk space | New infrastructure to manage/debug, unpredictable perf under heavy load, often complex to tune, caching strategy varies | Advanced/prod scenarios with object storage | JuiceFS, Mountpoint for S3, SeaweedFS, Alluxio CSI |
Start simple and add complexity as needed. vLLM’s native download to local storage may be perfectly adequate for development or low-scale production. Don’t build elaborate caching infrastructure until you’ve validated you need it.
Consider your scale. The calculus changes significantly between one replica and ten or more. Strategies that feel like overkill for a single pod become essential when you’re managing dozens across multiple nodes.
Think about update frequency. If you’re experimenting with models frequently, centralized approaches that make swapping models easy will save operational pain. If you deploy a model and leave it running for months, the simplicity of per-pod downloads may outweigh efficiency concerns.
Match your existing infrastructure. If your team already manages object storage extensively, using it as your model source with a Central Job approach will feel natural. If you have reliable NFS infrastructure, shared storage may be the path of least resistance.
Whatever strategy you select, validate it under realistic conditions. Measure cold-start times, observe bandwidth utilization, and test behavior during scaling events. The right strategy is the one that meets your actual requirements and not the one that looks best on paper.
1. How long does it take to load a vLLM model on Kubernetes?
Model loading times vary significantly based on model size, network bandwidth, and storage strategy. For a 7B parameter model (roughly 14GB), loading from HuggingFace can take 5-10 minutes on first startup. Larger 70B models (140GB+) can take 20+ minutes. Using shared storage or pre-populated caches can reduce this to seconds for subsequent pods, while local node storage provides the fastest warm starts.
2. What’s the best storage strategy for vLLM on Kubernetes?
The best strategy depends on your scale and infrastructure. For single-replica deployments, vLLM’s native download to local storage is simplest. For multiple replicas, shared storage (NFS or ReadWriteMany PVCs) with a central pre-population job offers the best balance of efficiency and simplicity. For high-performance requirements with local SSDs, per-node jobs provide the fastest warm starts. Avoid baking models into container images for anything larger than small development models.
3. How do I prevent race conditions when multiple pods start simultaneously?
If using shared storage, ensure only one pod writes to the cache at a time. Options include file locks, readiness mechanisms, or initially scaling to one replica to populate the cache. The Central Job approach is typically more robust as it separates download from pod startup entirely, avoiding concurrency issues. For production deployments requiring horizontal scaling, pre-populate shared storage before deploying inference pods.
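As one illustration of the file-lock option (reusing the placeholder names from earlier), an init container can take an exclusive flock on the shared volume so only the first pod actually downloads; note that lock semantics on network filesystems vary, which is one more reason the Central Job approach is often simpler.

```yaml
      initContainers:
      - name: fetch-model-once
        image: python:3.11-slim            # assumes flock (util-linux) is available, as in Debian-based images
        command: ["/bin/bash", "-c"]
        args:
          - |
            pip install --quiet "huggingface_hub[cli]"
            # Exclusive lock on the shared volume: concurrent pods block here until the winner finishes.
            flock -x /models/.download.lock -c '
              if [ ! -d /models/meta-llama/Llama-3.1-8B-Instruct ]; then
                huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir /models/meta-llama/Llama-3.1-8B-Instruct
              fi
            '
        volumeMounts:
        - name: model-cache
          mountPath: /models               # the shared ReadWriteMany PVC
```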
4. Can I use DigitalOcean Spaces for vLLM model storage?
Yes, DigitalOcean Spaces is S3-compatible object storage that works well as a model source for vLLM deployments. You can download models once to Spaces and configure your cluster to pull from it, giving you control over model availability and eliminating external dependencies. This approach works well with init containers or job-based pre-population strategies.
5. What’s the difference between init containers and job-based pre-population?
Init containers download models as part of pod startup, running before the main vLLM container starts. This keeps download logic close to your deployment but still happens per-pod unless combined with shared storage. Job-based pre-population uses dedicated Kubernetes Jobs to download models to storage before inference pods run, completely separating download from pod startup. Jobs are better for shared storage scenarios as they avoid race conditions and give explicit control over when models are populated.
There’s no single best approach to model loading and caching for vLLM on Kubernetes. The right choice depends on your scale, your existing infrastructure, your operational preferences, and your tolerance for complexity.
This introduction has covered the major strategies, but there’s much more to explore: detailed performance tuning, multi-model serving strategies, integration with CI/CD pipelines, and production hardening considerations. Use this as a starting point, experiment with the strategies that fit your constraints, and iterate based on what you learn.
For implementation details, consult the vLLM documentation, KServe storage and LocalModel documentation, and the vLLM Helm chart examples. The community has developed substantial tooling around these strategies. Leverage what exists before building custom solutions.
Now that you understand the model loading and caching strategies available for vLLM on Kubernetes, explore these resources to deepen your implementation:
Deploy on DigitalOcean Kubernetes: Get started with DigitalOcean Kubernetes (DOKS) to deploy your vLLM workloads with managed infrastructure and simplified operations.
Learn Kubernetes fundamentals: If you’re new to Kubernetes, review our Kubernetes tutorial series to understand core concepts like pods, volumes, and jobs that are essential for vLLM deployments.
Explore DigitalOcean Spaces: Use DigitalOcean Spaces as your model storage backend for S3-compatible object storage that integrates seamlessly with Kubernetes deployments.
Try DigitalOcean Gradient AI Platform: For a managed AI platform experience, explore DigitalOcean Gradient AI Platform which provides infrastructure optimized for AI workloads including vLLM.
Review related tutorials: Learn about Gateway API on DOKS for advanced traffic routing and HTTPS traffic routing for secure model serving.
For production deployments, consider implementing monitoring and observability for your vLLM setup. You can also explore advanced features like model versioning strategies, multi-tenancy considerations, and detailed performance tuning to further optimize your inference workloads.