Deploying AI models into air-gapped or customer-managed infrastructure introduces a unique set of challenges—especially when those models include massive, resource-hungry large language models (LLMs). During a recent conversation with Tom Kraljevic, VP of Customer Engineering at H2O.ai, he walked us through their self-hosted architecture and offered a clear-eyed perspective on best practices for model distribution in such environments. In this blog, we’ll first walk through H2O.ai’s architecture for distributing open-weight AI models to their customers. Then we’ll cover the best practices their team learned while designing it, and the lessons you can adopt as you build or improve your own.
At a high level, H2O.ai’s architecture reflects a clear separation between model artifacts and serving infrastructure. Their goal is to make the platform deployable in entirely offline environments without compromising on performance or flexibility.
The architecture consists of seven key components:
The following steps explain how Tom and his team deploy large models within this architecture:
The key step in this process is taking model files—typically sourced from Hugging Face—and sideloading them into MinIO using the mc CLI. The models are bundled as tar files: each individual file stays under 10 GB, but together they add up to a very large footprint. To avoid container registry bloat and simplify deployment, these models are not packaged into Docker images.
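To make that concrete, here’s a minimal sketch of the sideloading step using the MinIO Python client rather than the mc CLI; the endpoint, credentials, bucket, and object names are placeholders, not H2O.ai’s actual configuration.

```python
# Minimal sketch: sideload a bundled model into an S3-compatible MinIO bucket.
# H2O.ai uses the `mc` CLI for this step; the MinIO Python client shown here
# is just an equivalent illustration. Endpoint, credentials, bucket, and
# object names are hypothetical placeholders.
from minio import Minio

client = Minio(
    "minio.internal.example:9000",   # in-cluster or air-gapped MinIO endpoint
    access_key="PLACEHOLDER_ACCESS_KEY",
    secret_key="PLACEHOLDER_SECRET_KEY",
    secure=False,                    # many offline installs run without TLS
)

bucket = "model-artifacts"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Upload the tar bundle produced from the Hugging Face model directory.
client.fput_object(
    bucket_name=bucket,
    object_name="example-llm/model-bundle.tar",
    file_path="/mnt/transfer/model-bundle.tar",
)
```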
While this architecture is designed to work entirely offline, H2O.ai also supports a range of flexible deployment options for customers with different preferences or existing infrastructure. “Instead of pointing to a locally hosted vLLM, you could point to a vLLM somewhere else—or just use a cloud-based service like OpenAI or Anthropic,” Tom noted. Similarly, customers can swap out the default S3-compatible MinIO bucket for their own managed object storage. “We have a number of customers who like to swap in their own homegrown storage solutions,” Tom explained. This flexibility is enabled by the choice to store model data in S3-compatible formats. Because S3 is widely adopted in enterprise environments, it offers both portability and integration ease—allowing customers to move storage out-of-cluster or plug into existing infrastructure without friction. For H2O.ai, this was a key reason to favor S3 over container registries for model artifact storage.
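To illustrate that swap, here’s a minimal sketch using the OpenAI Python SDK against vLLM’s OpenAI-compatible API; the endpoints, API keys, and model name are hypothetical, and the hosted option is shown only as a commented-out alternative.

```python
# Minimal sketch: the same client code can target a self-hosted vLLM server
# or a cloud-based service, differing only in base_url and credentials.
# Endpoints, keys, and model names below are hypothetical placeholders.
from openai import OpenAI

# Option 1: locally hosted vLLM exposing its OpenAI-compatible API.
client = OpenAI(
    base_url="http://vllm.internal.example:8000/v1",
    api_key="not-needed-for-local",
)

# Option 2: a managed endpoint instead (uncomment to swap).
# client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")

response = client.chat.completions.create(
    model="example-llm",  # whatever model the serving layer exposes
    messages=[{"role": "user", "content": "Summarize our deployment options."}],
)
print(response.choices[0].message.content)
```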
This setup allows H2O.ai to deliver a powerful and flexible stack into customer environments—whether connected or completely air-gapped.
“It’s a great way of being able to show to the customer: we can give you this capability. It’s fully air-gapped—no need to reach the internet to do anything once you’ve copied over your bundles of assets,” Tom explained.
This architecture isn’t just technically sound—it reflects the hard lessons H2O.ai has learned from delivering LLMs into demanding enterprise environments. From sizing strategies to sideloading models and handling GPUs offline, their approach offers a clear roadmap for anyone looking to distribute AI workloads reliably into self-hosted or air-gapped settings.
Drawing from this architecture, Tom also shared a set of practical best practices that help H2O.ai deliver reliable model-serving capabilities into self-hosted environments.
Not all models are created equal. For models under ~1GB, Tom does recommend packaging them directly into Docker containers. “We’ll try to just package it to make it be more tightly integrated with the software versioning... it’s easier to dockerize the applications,” he noted.
But LLMs are a different story. These models often span hundreds of gigabytes, making them unsuitable for container packaging.
“Any model that’s maybe less than a gigabyte, we’ll try to just package it... but the big ones? It’s just not feasible,” Tom emphasized.
As part of the bundling process, they prune extraneous formats, keeping only what’s needed—typically the safetensors files and relevant metadata.
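As a rough illustration of that pruning step, the sketch below keeps only safetensors weights plus tokenizer and config metadata before creating the tar bundle; the keep-list is an assumption for illustration, not H2O.ai’s actual script.

```python
# Minimal sketch of a bundling step: keep only safetensors weights and the
# metadata files needed to serve the model, then tar the result.
# The keep-list below is an assumption for illustration, not H2O.ai's exact set.
import tarfile
from pathlib import Path

KEEP_SUFFIXES = {".safetensors", ".json", ".txt", ".model"}  # weights + tokenizer/config metadata

def bundle_model(model_dir: str, output_tar: str) -> None:
    src = Path(model_dir)
    with tarfile.open(output_tar, "w") as tar:  # plain tar; weights are already compressed
        for path in sorted(src.rglob("*")):
            if path.is_file() and path.suffix in KEEP_SUFFIXES:
                tar.add(path, arcname=path.relative_to(src))

bundle_model("/mnt/models/example-llm", "/mnt/transfer/model-bundle.tar")
```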
H2O.ai designs with air-gapping in mind from the beginning. That means all software and artifacts must be deliverable offline, including drivers and Helm charts.
For GPU support, customers can choose between:
“Whatever the customer is able to support with the least amount of involvement from me is the way,” Tom quipped.
H2O.ai uses Helm as the core of their deployment system, wrapped in Replicated for enterprises that want an intuitive GUI, lifecycle tooling, and Embedded Cluster support. Replicated’s Embedded Cluster provides an all-in-one Kubernetes runtime that simplifies deployment into customer environments with no pre-existing Kubernetes infrastructure.
“We’ve taken our Helm-based install... and just put that inside of Replicated,” said Tom. “Now the Replicated stack is fully integrated with the standard Helm deployment.”
This gives H2O.ai the flexibility to deliver the same experience (and leverage the testing investment) whether a customer is using Helm directly or installing through Replicated's enterprise-grade interface.
Trying to load LLMs directly into your container registry is a recipe for disaster. “You can wait 30–45 minutes, and your reward is an out-of-disk failure,” Tom warned. “That takes out one of the worker nodes in your cluster.”
Instead, always store large model artifacts externally (e.g., in MinIO) and mount them dynamically at runtime.
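One way to realize that pattern, sketched below under assumed bucket and path names, is to pull and unpack the bundle onto a mounted volume at startup and then point the inference server at the local path; whether you pull at startup or mount the storage directly depends on the environment.

```python
# Minimal sketch: fetch the model bundle from external object storage at
# startup and unpack it onto a mounted volume, instead of baking hundreds of
# gigabytes into a container image. Names and paths are placeholders.
import tarfile
from pathlib import Path
from minio import Minio

def fetch_model(endpoint: str, bucket: str, key: str, dest_dir: str) -> Path:
    client = Minio(endpoint, access_key="PLACEHOLDER", secret_key="PLACEHOLDER", secure=False)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    bundle = dest / "model-bundle.tar"
    client.fget_object(bucket, key, str(bundle))  # stream the tar to local disk
    with tarfile.open(bundle) as tar:
        tar.extractall(dest)                      # unpack weights next to the bundle
    bundle.unlink()                               # free the space used by the tar
    return dest

model_path = fetch_model("minio.internal.example:9000",
                         "model-artifacts", "example-llm/model-bundle.tar",
                         "/models/example-llm")
# The inference server (e.g., vLLM) is then pointed at model_path.
```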
While many deployments start with open-weight models, fine-tuning introduces new ownership and security concerns.
“Once [customers] fine-tune a base model on their own data... they’re more interested in protecting it,” Tom explained.
If you expect to support model customization, consider how you’ll handle:
These requirements often evolve quickly as customers go from experimentation to production.
vLLM, like other inference engines, comes with dozens of runtime parameters—some of which can drastically impact performance, memory consumption, or model compatibility.
“There’s just a lot of things that you need to be able to optimize... It’s almost like its own whole operating system,” Tom said.
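To give a sense of that surface area, here’s a minimal sketch using vLLM’s Python API with a handful of commonly tuned parameters; the model path and values are illustrative, not tuning recommendations.

```python
# Minimal sketch of a few commonly tuned vLLM parameters. The model path and
# values are illustrative only; the right settings depend on the hardware and
# model in the target environment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/example-llm",   # local path loaded from object storage
    tensor_parallel_size=4,        # shard the model across 4 GPUs
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim
    max_model_len=8192,            # cap context length to bound KV-cache size
    dtype="bfloat16",              # weight/activation precision
)

outputs = llm.generate(
    ["Explain air-gapped deployments in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```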
Make sure your deployment:
Many customers will want to test models in a dev environment before promoting them to production. Supporting this requires more than just two clusters—it means enabling safe, repeatable promotion workflows.
“Typically, customers will have multiple environments… the prod environment will move less fast than dev,” Tom noted.
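One lightweight way to make promotion repeatable, sketched below as an illustration rather than H2O.ai’s tooling, is to keep a small per-environment release pointer in object storage, so promoting a model means updating which validated bundle prod serves rather than re-copying hundreds of gigabytes.

```python
# Minimal sketch: record which validated model bundle each environment should
# serve as a small JSON pointer in object storage. Promotion then updates the
# prod pointer rather than moving the weights themselves.
# Bucket, key, and field names are illustrative placeholders.
import io
import json
from minio import Minio

client = Minio("minio.internal.example:9000",
               access_key="PLACEHOLDER", secret_key="PLACEHOLDER", secure=False)

def set_release(env: str, bundle_key: str) -> None:
    pointer = json.dumps({"bundle": bundle_key}).encode()
    client.put_object(
        "model-releases",            # small bucket holding one pointer per environment
        f"{env}/current.json",
        io.BytesIO(pointer),
        length=len(pointer),
        content_type="application/json",
    )

# Dev has validated a new bundle; promote it to prod.
set_release("dev",  "example-llm/model-bundle-v2.tar")
set_release("prod", "example-llm/model-bundle-v2.tar")
```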
Consider:
If you’re building an application that needs to run inside customer-managed or air-gapped environments, Replicated gives you the tooling to package, distribute, and operate that software with ease. From embedded Kubernetes to automatic license enforcement and upgrade management, Replicated helps vendors like H2O.ai deliver modern apps into complex enterprise infrastructure.
Explore how Replicated can help you distribute your application today.