vLLM Kubernetes: Model Loading & Caching Strategies

Published on December 15, 2025

Deploying large language models on Kubernetes introduces a challenge that traditional application deployments rarely face: how do you get tens or hundreds of gigabytes of model weights loaded into your inference pods efficiently? A 7B parameter model requires roughly 14GB of storage, while a 70B model can exceed 140GB. These sizes fundamentally change how you think about pod startup, scaling, and storage architecture.

This tutorial provides an introductory survey of the common strategies for loading and caching model weights for vLLM pods running on Kubernetes. Our goal is to help you understand your options and their tradeoffs so you can make an informed architectural decision. This is not an exhaustive treatment: topics like model versioning strategies, multi-tenancy considerations, and detailed performance tuning deserve their own deep dives. Consider this a starting point for your exploration.

Key Takeaways

Before diving into the detailed strategies, here are the essential points to understand:

  • Startup latency matters: Cold starts can take 10+ minutes for larger models, directly impacting your ability to scale and respond to traffic spikes.
  • Storage efficiency varies significantly: Strategies range from per-pod duplication to shared storage, with major cost implications for models measured in tens or hundreds of gigabytes.
  • Control your model sources: While HuggingFace is convenient for development, production deployments should mirror models to storage you control to avoid external dependencies.
  • No one-size-fits-all solution: The right strategy depends on your scale, infrastructure, update frequency, and operational preferences.
  • Test under realistic conditions: Performance characteristics vary by provider and configuration, so validate your chosen approach meets your specific requirements.

Key Considerations When Choosing a vLLM Model Loading and Caching Strategy

Before examining specific strategies, let’s look at the key considerations that should guide your choice.

Startup latency measures how long it takes before a pod can serve its first request. For autoscaling workloads, this directly impacts how quickly you can respond to traffic spikes.

Storage efficiency considers whether you’re duplicating model weights across nodes or sharing a single copy. With models measured in tens of gigabytes, duplication costs add up.

Network bandwidth encompasses both external pressure on model registries like HuggingFace (which may rate-limit aggressive downloaders) and internal network utilization within your cluster.

Operational complexity accounts for the moving parts in your system. More components mean more potential failure modes and more to debug when things go wrong.

Scaling behavior asks what happens when a new node joins your cluster or when pods get rescheduled. Does the new node need to download everything from scratch?

Model update and rollback considers how easy it is to deploy a new model version or revert to a previous one. Some strategies make this trivial; others require careful coordination.

Performance characteristics vary significantly by infrastructure provider and configuration. Whatever strategy you select, test it under realistic conditions to validate it meets your specific requirements.

Understanding vLLM Model Source Options

Before diving into loading strategies, consider where your models will come from. The source you choose affects reliability, security, and operational complexity.

HuggingFace Hub is the most common starting point. vLLM integrates natively with it, making development frictionless. Just specify a model identifier and vLLM handles the rest. However, HuggingFace is an external dependency outside your control. If their service experiences issues during a traffic spike or node failure, your cluster can’t scale or recover.

Self-hosted object storage such as DigitalOcean Spaces gives you control. You download models once to your own storage, and your cluster pulls from infrastructure you manage. This adds a setup step but eliminates external dependencies during runtime.

Shared filesystems, for example, DigitalOcean Volumes with NFS or other ReadWriteMany PVC providers, allow you to create a central model repository that all pods in your cluster can access. With this setup, model files are downloaded once to the shared volume and can be used by every inference pod, dramatically improving storage efficiency. The exact performance will depend on your shared filesystem’s implementation and underlying storage infrastructure.
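For a concrete starting point, here is a minimal sketch of a ReadWriteMany claim that later examples in this tutorial refer to as model-cache. The storage class name is an assumption; substitute whatever RWX-capable provisioner (NFS-backed or otherwise) your cluster offers, and size it for the models you plan to store.

```yaml
# Hypothetical ReadWriteMany PVC used as a shared model repository.
# storageClassName is an assumption: it must map to a provisioner that
# supports ReadWriteMany access (for example, an NFS provisioner).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # replace with your RWX-capable storage class
  resources:
    requests:
      storage: 200Gi             # room for one or more large models
```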

HTTP endpoints offer flexibility for custom setups including internal model registries, artifact servers, or CDNs. KServe’s StorageInitializer supports HTTP sources alongside object storage, making it straightforward to integrate with whatever infrastructure you have.

The production principle: control your model sources. For development and experimentation, pulling directly from HuggingFace is convenient and fast. For production, mirror the models you depend on to storage you control. You don’t want an external service’s availability to determine whether you can scale up, replace a failed pod, or recover from a node failure.
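If you decide to mirror a model, a one-off Kubernetes Job can handle the copy. The sketch below is illustrative only: it assumes a Spaces bucket named my-model-mirror, a Secret named spaces-creds holding your Spaces access keys, and the openly available Qwen/Qwen2.5-7B-Instruct model, and it uses the AWS CLI purely as a generic S3-compatible client against the Spaces endpoint. Make sure the pod has enough ephemeral storage under /tmp to hold the full model during the transfer.

```yaml
# Hedged sketch: mirror a HuggingFace model into object storage you control.
# Bucket name, endpoint, and Secret names are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: mirror-model-to-spaces
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: mirror
          image: python:3.12-slim
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom: { secretKeyRef: { name: spaces-creds, key: access_key } }
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom: { secretKeyRef: { name: spaces-creds, key: secret_key } }
            - name: AWS_DEFAULT_REGION
              value: us-east-1   # the AWS CLI requires a region; Spaces ignores it
          command: ["sh", "-c"]
          args:
            - |
              pip install --quiet "huggingface_hub[cli]" awscli && \
              huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir /tmp/model && \
              aws s3 sync /tmp/model/ s3://my-model-mirror/qwen2.5-7b-instruct/ \
                --endpoint-url https://nyc3.digitaloceanspaces.com
```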

With model sources understood, let’s examine the strategies for loading those models into your pods.

Model Caching Strategy Survey

vLLM Native Download

With this approach, vLLM itself handles model downloads on startup. Point vLLM at a model identifier or path, and it pulls the weights before beginning to serve requests.

To local storage: Each pod downloads model weights to local ephemeral storage (emptyDir) or a pod-specific persistent volume. This is the simplest approach with no shared infrastructure to manage and no coordination between pods. vLLM’s native HuggingFace integration means this often works out of the box.

The appeal is simplicity. For development, testing, or single-replica deployments, this gets you running quickly. The downsides emerge at scale: every pod downloads independently, multiplying bandwidth consumption and putting pressure on upstream sources. Cold starts can take 10+ minutes for larger models.
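A minimal sketch of this pattern looks like the Deployment below, where vLLM downloads the weights itself into an emptyDir volume. The image tag, the model (Qwen/Qwen2.5-7B-Instruct), and the flags are illustrative assumptions; check the vLLM documentation for the options supported by your version.

```yaml
# Sketch: vLLM native download to pod-local ephemeral storage (emptyDir).
# Every replica downloads its own copy of the weights on startup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-local-cache
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-local-cache }
  template:
    metadata:
      labels: { app: vllm-local-cache }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct   # any HuggingFace model identifier
            - --download-dir=/model-cache        # weights land on the emptyDir below
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 50Gi   # headroom for the downloaded weights
```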

To shared storage: All pods can point their cache directory at a shared filesystem (NFS or ReadWriteMany PVC). The first pod to start downloads the model, and subsequent pods find it already cached.

This improves efficiency by downloading once and using everywhere. However, it introduces concurrency risks. If multiple pods start simultaneously, they may all attempt to download the same model, potentially corrupting files or creating race conditions. You need to ensure only one pod writes to the cache at a time through file locks, a readiness mechanism, or scaling to one replica initially.

For shared storage scenarios, the Central Job approach (covered below) is typically more robust. It separates the download step from pod startup entirely, avoiding the concurrency issues and giving you explicit control over when models are populated.
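If you do point vLLM directly at a shared cache, the variation below (a sketch reusing the model-cache ReadWriteMany PVC from earlier, with illustrative names) mounts the PVC as the download directory so the first replica downloads and subsequent replicas reuse the cache:

```yaml
# Sketch: shared download cache on a ReadWriteMany PVC.
# Beware the first-start race condition described above: consider starting
# with replicas: 1 until the cache is populated, then scaling out.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-shared-cache
spec:
  replicas: 2
  selector:
    matchLabels: { app: vllm-shared-cache }
  template:
    metadata:
      labels: { app: vllm-shared-cache }
    spec:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache   # the ReadWriteMany PVC from the earlier example
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct
            - --download-dir=/model-cache   # shared across all replicas
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
```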

Init Container Download

Init containers let you separate download logic from your inference runtime. An init container runs before the main vLLM container, downloading the model to a shared volume (typically an emptyDir). Once complete, the main container starts with the model already in place.

This separation of concerns can be valuable. Your init container can use specialized download tools, implement retry logic, or pull from private registries with different credentials than your inference container needs. The main vLLM container starts with the model ready, simplifying its configuration.

Several tools and platforms support this approach:

Custom init containers using tools like huggingface-cli, s3cmd, or similar utilities let you download models before the main container starts, giving you full control over the download process. For example, you could point s3cmd or the AWS CLI at DigitalOcean Spaces to fetch models from your own object storage. This approach lets you tailor the logic to your needs, but it does require building and maintaining your own init container image.
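As an illustration of the custom approach, here is a hedged sketch that uses huggingface-cli in an init container to stage the model on an emptyDir before vLLM starts. The helper image, model, and paths are assumptions rather than a prescribed setup; the same structure works with s3cmd or another client pulling from your own object storage.

```yaml
# Sketch: init container downloads the model, then vLLM loads it from a local path.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-init-download
spec:
  volumes:
    - name: model-store
      emptyDir: {}
  initContainers:
    - name: fetch-model
      image: python:3.12-slim        # illustrative helper image with pip available
      command: ["sh", "-c"]
      args:
        - |
          pip install --quiet "huggingface_hub[cli]" && \
          huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
            --local-dir /model-store/qwen2.5-7b-instruct
      volumeMounts:
        - name: model-store
          mountPath: /model-store
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - --model=/model-store/qwen2.5-7b-instruct   # load from the pre-fetched path
      ports:
        - containerPort: 8000
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: model-store
          mountPath: /model-store
```

In a real deployment you would typically wrap this pod template in a Deployment and build a dedicated init image rather than installing tools at startup.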

DigitalOcean Spaces and Volumes can be leveraged for this init container pattern. Use an init container to download model files from DigitalOcean Spaces (an S3-compatible object store) directly into a shared DigitalOcean Volume. This approach centralizes storage and provides high reliability. The init container handles pre-fetching the model weights before vLLM starts, ensuring faster container startup and avoiding redundant downloads across pods. This pattern integrates well with DigitalOcean’s managed Kubernetes, keeping your storage and deployment entirely under your control.

vLLM Production Stack implements this pattern through its Helm chart. The chart supports an initContainer configuration that can download models before the main vLLM container starts. You can configure the deployment to mount the PVC during initialization. The chart also provides flexibility in storage access modes using ReadWriteOnce volumes for single-node deployments and ReadWriteMany volumes for multi-node scenarios where pods need to share storage across nodes. For production deployments requiring horizontal scaling, you’ll want ReadWriteMany storage (such as NFS) and should consider pre-populating the shared storage to avoid race conditions when multiple pods start simultaneously.

The main limitation of init container approaches is that downloads still happen per-pod unless you combine this with shared storage. It adds complexity to your pod specification, but for many teams the cleaner separation justifies this.

Job-Based Pre-Population

Rather than downloading models as part of pod startup, you can use dedicated Kubernetes Jobs to pre-populate storage before inference pods run.

Per-Node Job

With this approach, a Job (or DaemonSet) downloads models to node-local storage, typically a hostPath volume on the node’s local SSD, before inference pods are scheduled to that node.

The benefit is performance: once downloaded, pods on that node load models from fast local storage. No network filesystem bottlenecks, no shared storage dependencies. Download happens once per node regardless of how many pods run there.

The complexity lies in coordination. How do inference pods know the model is ready? You need scheduling constraints such as node selectors, taints and tolerations, or custom logic to ensure pods only schedule to nodes where the download Job has completed. hostPath volumes have security implications you’ll need to evaluate. You also need a strategy for cleaning up storage when nodes are removed.
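One way to sketch this pattern, assuming a hypothetical gpu: "true" node label and a /var/lib/models hostPath, is a DaemonSet that pre-pulls the weights onto each matching node and then stays Running as a crude readiness marker:

```yaml
# Hedged sketch: per-node pre-pull of model weights onto node-local storage.
# Label, path, helper image, and model are illustrative assumptions; review
# the security implications of hostPath for your cluster.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-prefetch
spec:
  selector:
    matchLabels: { app: model-prefetch }
  template:
    metadata:
      labels: { app: model-prefetch }
    spec:
      nodeSelector:
        gpu: "true"                  # assumption: label your GPU node pool accordingly
      volumes:
        - name: node-model-cache
          hostPath:
            path: /var/lib/models    # node-local SSD location (assumption)
            type: DirectoryOrCreate
      containers:
        - name: fetch
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install --quiet "huggingface_hub[cli]" && \
              huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
                --local-dir /models/qwen2.5-7b-instruct && \
              sleep infinity           # keep the pod Running as a simple "model ready" signal
          volumeMounts:
            - name: node-model-cache
              mountPath: /models
```

Inference pods scheduled to those nodes would mount the same hostPath (ideally read-only) and point vLLM at the downloaded directory.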

If you want a Kubernetes-native solution for this approach without building custom tooling, KServe’s LocalModelCache (LocalModel CRD) is purpose-built for this use case. It’s a CRD specifically designed for pre-caching models on nodes. The controller manages download jobs and handles node affinity automatically. If you’re evaluating this approach, LocalModelCache is worth serious consideration.

Central Job to Shared Storage

Alternatively, a single Job can download models to shared storage, such as an NFS-backed PVC, that all pods then access.

This centralizes model management: one download populates storage for your entire cluster. It scales to any number of pods and makes it straightforward to manage multiple model versions in a central location.

The tradeoff is dependency on shared storage performance. If your shared storage can’t deliver sufficient read throughput, model loading becomes a bottleneck. You’ll also need a strategy for managing storage lifecycle: cleaning up old model versions, handling storage capacity, and so on.
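A minimal sketch of the population step, writing into the model-cache ReadWriteMany PVC used in earlier examples (model and image are again illustrative), could look like this:

```yaml
# Sketch: one-shot Job that populates shared storage before inference pods start.
apiVersion: batch/v1
kind: Job
metadata:
  name: populate-model-cache
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
      containers:
        - name: download
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install --quiet "huggingface_hub[cli]" && \
              huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
                --local-dir /models/qwen2.5-7b-instruct
          volumeMounts:
            - name: model-cache
              mountPath: /models
```

Once the Job completes, inference pods mount the same PVC (read-only is a sensible default) and pass the local path, for example --model=/models/qwen2.5-7b-instruct, instead of a HuggingFace identifier.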

Baking Models into Container Images

A seemingly straightforward approach is including model weights directly in your container image. Your Dockerfile copies the model files, and every pod gets them automatically through the normal image pull process.

This approach offers conceptual simplicity: your deployment is fully self-contained with no external dependencies at runtime. Images are immutable, making rollbacks as simple as deploying a previous image tag.

However, the practical challenges are significant. Container images exceeding 100GB are unwieldy: slow to build, slow to push, and slow to pull. Registry storage costs increase substantially, and some registries impose layer or total image size limits that can make this approach impossible for larger models. Image pulls that take 20+ minutes on cold nodes create unacceptable scaling latency for most production workloads.

Consider this approach only for very small models or development scenarios where image size is acceptable. For most production deployments, you’ll want to explore other options.

CSI-Based Lazy Loading

Some teams use specialized CSI drivers that mount object storage directly and stream data on demand. Examples include JuiceFS and SeaweedFS.

These drivers present object storage as a filesystem to your pods. The appeal is transparency: your application doesn’t need to know the models come from object storage. Some implementations can begin serving before the full model is downloaded, streaming weights as needed.

Performance varies significantly by implementation and workload. This also adds infrastructure complexity: it’s another component to deploy, monitor, and troubleshoot. For most teams, this is an advanced option to evaluate carefully against simpler strategies. It may be the right choice for specific scenarios, but it’s rarely the starting point.

Comparison Summary

| Strategy | Cold Start Time | Storage Efficiency | Complexity | Pros | Cons | Best For | Example Tools / Techniques |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM native (local) | Slow (minutes; model downloaded per pod) | Low (models duplicated on every pod’s local disk) | Low | Easiest setup, inline with vLLM defaults, no shared storage required | Wastes disk for each replica, slow multi-pod startup, bandwidth-intensive on large clusters | Dev/test, one replica or single-node prod | vLLM with --download-dir pointing at pod-local storage |
| vLLM native (shared) | Medium (shared cache, faster subsequent pods) | High (one copy on shared storage for many pods) | Medium | Quick for repeat deployments, simplifies multi-pod model access | Needs shared storage (NFS, PVC), slower if sharing storage with other apps, possible race conditions on first download | Small clusters, quick launches | Shared ReadWriteMany PVC, NFS mounts |
| Init container | Slow-Medium (depends on storage, parallelization) | Medium-High (can leverage shared storage) | Medium | Clean separation of download logic, repeatable, flexible in scripting and sources | Extra container and pod spec complexity, per-pod downloads without shared storage, coordination needed for large clusters | CI/CD pipelines, separation of concerns | Init containers with s3cmd, huggingface-cli; shared PVC |
| Per-node job | Fast (warm start once downloaded per node) | Medium (one copy per node, not per pod) | High | High throughput from node-local SSDs, reduces network hot spots, independent of shared storage | Requires job coordination and node affinity, increased infra complexity, needs hostPath storage, security considerations | Large clusters, high performance | KServe LocalModelCache, DaemonSets, Job + hostPath |
| Central job | Medium (faster than per-pod, slower than local) | High (one copy serves all pods in cluster) | Medium | Centralizes management, easy version upgrades, only one download, robust handling of race conditions | Dependent on shared storage performance, some ops complexity with storage cleanup/versioning | Clusters running many replicas or models | K8s Job to PVC/NFS, Argo Workflows, custom model-populator |
| Container image | Very slow (image pull can be 10+ minutes at 100GB+) | Low (duplicated in every image and pull) | Low-Medium | Easiest to deploy in principle, simple rollback/version control, no runtime network dependencies | Massive images, slow pushes/pulls, registry limits easily exceeded, slow updates, not scalable for large models | Tiny models, demo/prototyping, air-gapped environments | Custom images with COPY, multi-stage Docker builds |
| CSI lazy loading | Varies (depends on driver and access patterns) | High (streaming from object storage) | High | Models stream directly from object storage, efficient for infrequent/random access, saves disk space | New infrastructure to manage/debug, unpredictable performance under heavy load, often complex to tune, caching behavior varies | Advanced/prod scenarios with object storage | JuiceFS, Mountpoint for S3, SeaweedFS, Alluxio CSI |

Making the Decision: Selecting the Right Model Loading and Caching Strategy for vLLM on Kubernetes

Start simple and add complexity as needed. vLLM’s native download to local storage may be perfectly adequate for development or low-scale production. Don’t build elaborate caching infrastructure until you’ve validated you need it.

Consider your scale. The calculus changes significantly between one replica and ten or more. Strategies that feel like overkill for a single pod become essential when you’re managing dozens across multiple nodes.

Think about update frequency. If you’re experimenting with models frequently, centralized approaches that make swapping models easy will save operational pain. If you deploy a model and leave it running for months, the simplicity of per-pod downloads may outweigh efficiency concerns.

Match your existing infrastructure. If your team already manages object storage extensively, using it as your model source with a Central Job approach will feel natural. If you have reliable NFS infrastructure, shared storage may be the path of least resistance.

Whatever strategy you select, validate it under realistic conditions. Measure cold-start times, observe bandwidth utilization, and test behavior during scaling events. The right strategy is the one that meets your actual requirements and not the one that looks best on paper.

Frequently Asked Questions

1. How long does it take to load a vLLM model on Kubernetes?

Model loading times vary significantly based on model size, network bandwidth, and storage strategy. For a 7B parameter model (roughly 14GB), loading from HuggingFace can take 5-10 minutes on first startup. Larger 70B models (140GB+) can take 20+ minutes. Using shared storage or pre-populated caches can reduce this to seconds for subsequent pods, while local node storage provides the fastest warm starts.

2. What’s the best storage strategy for vLLM on Kubernetes?

The best strategy depends on your scale and infrastructure. For single-replica deployments, vLLM’s native download to local storage is simplest. For multiple replicas, shared storage (NFS or ReadWriteMany PVCs) with a central pre-population job offers the best balance of efficiency and simplicity. For high-performance requirements with local SSDs, per-node jobs provide the fastest warm starts. Avoid baking models into container images for anything larger than small development models.

3. How do I prevent race conditions when multiple pods start simultaneously?

If using shared storage, ensure only one pod writes to the cache at a time. Options include file locks, readiness mechanisms, or initially scaling to one replica to populate the cache. The Central Job approach is typically more robust as it separates download from pod startup entirely, avoiding concurrency issues. For production deployments requiring horizontal scaling, pre-populate shared storage before deploying inference pods.

4. Can I use DigitalOcean Spaces for vLLM model storage?

Yes, DigitalOcean Spaces is S3-compatible object storage that works well as a model source for vLLM deployments. You can download models once to Spaces and configure your cluster to pull from it, giving you control over model availability and eliminating external dependencies. This approach works well with init containers or job-based pre-population strategies.

5. What’s the difference between init containers and job-based pre-population?

Init containers download models as part of pod startup, running before the main vLLM container starts. This keeps download logic close to your deployment but still happens per-pod unless combined with shared storage. Job-based pre-population uses dedicated Kubernetes Jobs to download models to storage before inference pods run, completely separating download from pod startup. Jobs are better for shared storage scenarios as they avoid race conditions and give explicit control over when models are populated.

Conclusion

There’s no single best approach to model loading and caching for vLLM on Kubernetes. The right choice depends on your scale, your existing infrastructure, your operational preferences, and your tolerance for complexity.

This introduction has covered the major strategies, but there’s much more to explore: detailed performance tuning, multi-model serving strategies, integration with CI/CD pipelines, and production hardening considerations. Use this as a starting point, experiment with the strategies that fit your constraints, and iterate based on what you learn.

For implementation details, consult the vLLM documentation, KServe storage and LocalModel documentation, and the vLLM Helm chart examples. The community has developed substantial tooling around these strategies. Leverage what exists before building custom solutions.

Next Steps

Now that you understand the model loading and caching strategies available for vLLM on Kubernetes, explore these resources to deepen your implementation:

  • Deploy on DigitalOcean Kubernetes: Get started with DigitalOcean Kubernetes (DOKS) to deploy your vLLM workloads with managed infrastructure and simplified operations.

  • Learn Kubernetes fundamentals: If you’re new to Kubernetes, review our Kubernetes tutorial series to understand core concepts like pods, volumes, and jobs that are essential for vLLM deployments.

  • Explore DigitalOcean Spaces: Use DigitalOcean Spaces as your model storage backend for S3-compatible object storage that integrates seamlessly with Kubernetes deployments.

  • Try DigitalOcean Gradient AI Platform: For a managed AI platform experience, explore DigitalOcean Gradient AI Platform which provides infrastructure optimized for AI workloads including vLLM.

  • Review related tutorials: Learn about Gateway API on DOKS for advanced traffic routing and HTTPS traffic routing for secure model serving.

For production deployments, consider implementing monitoring and observability for your vLLM setup. You can also explore advanced features like model versioning strategies, multi-tenancy considerations, and detailed performance tuning to further optimize your inference workloads.
