KeepTruckin (via Seldon)

Challenges & Goals

KeepTruckin (now Motive), a telematics and fleet management company, needed to serve a growing number of machine learning models to power smart features across its products. The challenge was that deploying and running many models simultaneously strained system resources, especially GPU memory, which led to throughput bottlenecks and rising infrastructure costs. In addition, inconsistencies between development, QA, and production Kubernetes environments made it difficult to reliably release new model versions. The company’s goal was to optimize its model serving infrastructure to handle higher loads (multiple models per server) without performance degradation, while implementing MLOps best practices to streamline deployments and monitoring across all environments. Another key goal was to minimize risk during model rollouts: introducing new or updated models should not jeopardize system stability, so strategies such as canary releases were needed to gradually test and validate changes before full deployment.

 

Solution

Ahmed Ali (in his role with Seldon) spearheaded a series of enhancements to KeepTruckin’s model serving architecture, which was built on Seldon Core. He conceived and implemented a novel caching mechanism within Seldon Core’s inference engine to address the performance bottlenecks. Specifically, he introduced an “Overcommit” model caching strategy with a least recently used (LRU) eviction policy. This allowed the system to load more ML models than the GPU memory could ordinarily accommodate by dynamically offloading the least recently used models to disk. In effect, Seldon Core could overcommit memory across many models and keep only the actively queried models in GPU memory. In benchmark tests, this enhancement boosted the throughput of KeepTruckin’s inference servers by roughly 30%, enabling faster responses even as the number of deployed models and incoming requests grew.
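
The idea can be illustrated with a minimal sketch of an LRU-evicting model cache. This is not Seldon Core’s actual implementation; the class name and the load/offload helpers are hypothetical stand-ins for real weight-loading and offloading logic.

```python
from collections import OrderedDict

class OvercommitModelCache:
    """Illustrative sketch: keep at most `capacity` models resident in GPU
    memory and offload the least recently used model to disk when a new
    model must be loaded. `load_fn` and `offload_fn` are hypothetical."""

    def __init__(self, capacity, load_fn, offload_fn):
        self.capacity = capacity        # max models resident in GPU memory
        self.load_fn = load_fn          # e.g. loads weights from disk to GPU
        self.offload_fn = offload_fn    # e.g. moves weights back to disk
        self._resident = OrderedDict()  # model_name -> loaded model object

    def get(self, model_name):
        # Cache hit: mark the model as most recently used and return it.
        if model_name in self._resident:
            self._resident.move_to_end(model_name)
            return self._resident[model_name]

        # Cache miss: evict the least recently used model if at capacity.
        if len(self._resident) >= self.capacity:
            evicted_name, evicted_model = self._resident.popitem(last=False)
            self.offload_fn(evicted_name, evicted_model)

        model = self.load_fn(model_name)
        self._resident[model_name] = model
        return model
```

Because eviction only touches models that have not been queried recently, the server can advertise (overcommit) far more models than fit in GPU memory while the hot set stays resident.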

 

Beyond core algorithm improvements, Ahmed focused on deployment reliability and observability. He developed standardized Helm charts and implemented a GitOps workflow for Kubernetes (using tools like ArgoCD) to manage the model deployment process. This ensured that development, QA, and production environments were configured identically via code, eliminating configuration drift and the “it works on my machine” problem. At the same time, he integrated Prometheus exporters for custom metrics and set up Grafana dashboards to monitor system performance in real time, tracking latency, throughput, GPU utilization, and memory usage for each model. With this end-to-end visibility, the team could immediately detect anomalies or resource contention issues and had data to guide scaling decisions and capacity planning.
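
As a hedged sketch of this kind of instrumentation, the snippet below exposes per-model request counts and latency using the standard prometheus_client library. The metric names and the predict wrapper are illustrative assumptions, not the exact exporters or dashboards used at KeepTruckin.

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative per-model metrics; the metric names are hypothetical.
REQUESTS = Counter("model_requests_total", "Inference requests", ["model"])
LATENCY = Histogram("model_latency_seconds", "Inference latency", ["model"])
GPU_MEMORY = Gauge("model_gpu_memory_bytes", "GPU memory held by a model", ["model"])

def timed_predict(model_name, predict_fn, features):
    """Wrap a model's predict call so request counts and latency are exported."""
    REQUESTS.labels(model=model_name).inc()
    start = time.perf_counter()
    try:
        return predict_fn(features)
    finally:
        LATENCY.labels(model=model_name).observe(time.perf_counter() - start)

if __name__ == "__main__":
    # Expose metrics on :8000/metrics for Prometheus to scrape;
    # Grafana dashboards then query them per model label.
    start_http_server(8000)
```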

 

To enable safer model rollouts, Ahmed established progressive delivery practices. He set up canary release workflows in the CI/CD pipeline so that new or updated models would first be deployed to a small subset of users or vehicles, where their performance could be measured against the baseline. Additionally, he introduced A/B testing capabilities into the serving platform, allowing a controlled percentage of traffic to be routed to alternative model versions and outcomes to be compared automatically. These strategies significantly reduced the risk of deploying new models: any issues could be identified on a limited sample and rolled back before affecting all users, which maintained system uptime and customer trust.
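
In Seldon Core such canaries are normally declared as traffic weights in the deployment spec rather than hand-coded, but the traffic-splitting idea itself can be sketched in a few lines. The router class and the 10% split below are purely illustrative assumptions.

```python
import random

class CanaryRouter:
    """Illustrative sketch: send a configurable fraction of inference
    traffic to a canary model and tag each response with the version
    that served it, so outcomes can be compared against the baseline."""

    def __init__(self, baseline_predict, canary_predict, canary_fraction=0.1):
        self.baseline_predict = baseline_predict
        self.canary_predict = canary_predict
        self.canary_fraction = canary_fraction  # e.g. 10% of requests

    def predict(self, features):
        # Randomly assign each request to baseline or canary.
        if random.random() < self.canary_fraction:
            return {"version": "canary", "prediction": self.canary_predict(features)}
        return {"version": "baseline", "prediction": self.baseline_predict(features)}
```

Logging the "version" field alongside downstream outcomes is what makes the A/B comparison and a quick rollback decision possible.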

 

Finally, Ahmed played a key role in strengthening MLOps practices across the engineering teams. He mentored developers on best practices such as proper model versioning, data drift detection, and continuous performance monitoring. He also authored internal best practice guides that were shared across Seldon’s client base. As a result, KeepTruckin not only gained a much more robust and efficient serving infrastructure, but the same standardized approaches were later adopted by over 50 other enterprise clients of Seldon, greatly improving the reliability and scalability of their own model deployment processes.
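
As a small example of one practice mentioned above, a basic data drift check might compare a live feature’s distribution against its training distribution. The sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy; it is an illustrative assumption, not the specific drift tooling adopted by the teams.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(training_values, live_values, alpha=0.05):
    """Flag drift when a two-sample KS test rejects the hypothesis that
    live feature values come from the same distribution as training data."""
    statistic, p_value = ks_2samp(training_values, live_values)
    return p_value < alpha

# Example with synthetic data where the live distribution has shifted.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.5, scale=1.0, size=1_000)
print(feature_drifted(train, live))  # True: the mean shift is detected
```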

 

Key Results

  • Achieved a 30% increase in GPU inference throughput by enabling dynamic multi-model caching, allowing more models to run per server without added hardware.
  • Eliminated environment drift across dev, QA, and prod with Helm and GitOps-driven deployments, resulting in more predictable and error-free releases.
  • Introduced real-time monitoring and safe rollout (canary/A/B testing) practices that improved system stability and reduced deployment risks for new ML models.
  • Standardized MLOps best practices across 50+ enterprise client deployments, significantly enhancing the reliability and scalability of machine learning services at scale.