Deploying LLMs at Scale

High reliability with spot instances: Because we distribute the load across spot instances from multiple cloud providers and regions, the chance of all spot instances going down at the same time becomes extremely low, and each pool can grow to absorb the load when another pool goes down.
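The reliability argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: the pool names and outage probabilities are made up, and it assumes pool outages are independent of each other, which real spot markets only approximate.

```python
from math import prod


def prob_all_pools_down(outage_probs):
    """Probability that every pool is unavailable at once.

    Assumes outages are independent across pools (an assumption,
    not a property of any particular cloud provider).
    """
    return prod(outage_probs)


def redistribute(load_by_pool, failed_pool):
    """Spread the failed pool's load evenly across the surviving pools."""
    survivors = {k: v for k, v in load_by_pool.items() if k != failed_pool}
    if not survivors:
        raise RuntimeError("no surviving pools to take the load")
    extra = load_by_pool[failed_pool] / len(survivors)
    return {k: v + extra for k, v in survivors.items()}


# Hypothetical numbers: three pools, each unavailable 5% of the time.
p_total_outage = prob_all_pools_down([0.05, 0.05, 0.05])

# Hypothetical load in requests per second per pool.
after_failure = redistribute({"pool-a": 30, "pool-b": 30, "pool-c": 30}, "pool-a")
```

With these illustrative numbers a single pool is down 5% of the time, while all three being down together is far rarer, and the two surviving pools each pick up half of the lost pool's load.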

Queue adds extra reliability under high load: An HTTP service will start returning 503s during sudden spikes. Because of the queue in between, we are able to tolerate spiky workloads - requests see a slight increase in latency during the spike, but they don't fail. The queue itself adds around 10-20 ms of latency, which is fine for the LLM inference use case since the overall inference latency is on the order of seconds.
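To see why a queue turns failures into waiting time, here is a minimal toy simulation. It is a sketch under simplifying assumptions, not the actual system: time advances in discrete ticks, the workers serve a fixed number of requests per tick, and the function name and arrival numbers are hypothetical.

```python
def simulate(arrivals, capacity, use_queue):
    """Replay per-tick arrivals against a fixed per-tick serving capacity.

    Without a queue, anything over capacity in a tick is rejected
    (the 503 case). With a queue, the overflow waits for a later tick.
    """
    backlog = 0
    served = 0
    rejected = 0
    max_backlog = 0
    for n in arrivals:
        pending = backlog + n
        done = min(pending, capacity)
        served += done
        leftover = pending - done
        if use_queue:
            backlog = leftover
        else:
            rejected += leftover
            backlog = 0
        max_backlog = max(max_backlog, backlog)
    # Drain whatever is still queued once arrivals stop.
    while backlog > 0:
        done = min(backlog, capacity)
        served += done
        backlog -= done
    return {"served": served, "rejected": rejected, "max_backlog": max_backlog}


# Hypothetical spike: 10 requests arrive in one tick, capacity is 4 per tick.
spike = [2, 2, 10, 2, 0, 0]
direct = simulate(spike, capacity=4, use_queue=False)
queued = simulate(spike, capacity=4, use_queue=True)
```

In this toy run the direct service rejects 6 of the 16 requests, while the queued service serves all 16 and the backlog peaks at 6 requests that wait a tick or two.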

Cost Reduction: Running on spot instances reduces the cost drastically compared to hosting LLMs on on-demand instances - we will do a detailed comparison of the costs below.

Detailed analytics: The LLM gateway layer lets us compute detailed analytics about the incoming requests, including the token distribution across requests. We can also start logging the requests to enable fine-tuning later.
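As a rough illustration of the kind of analytics a gateway layer could compute, the sketch below buckets per-request token counts into a histogram and serializes a request record as one JSON line for later fine-tuning. The function names, the record fields, and the bucket size are assumptions made for this example, not the gateway's actual schema.

```python
import json
from collections import Counter


def token_histogram(token_counts, bucket_size=256):
    """Count requests per token bucket, keyed by each bucket's lower bound."""
    buckets = Counter((n // bucket_size) * bucket_size for n in token_counts)
    return dict(sorted(buckets.items()))


def to_jsonl_line(record):
    """Serialize one request record as a single JSON line."""
    return json.dumps(record, sort_keys=True)


# Hypothetical per-request token counts observed at the gateway.
histogram = token_histogram([10, 300, 260, 1000], bucket_size=256)

# Hypothetical record layout for a logged request.
line = to_jsonl_line({"prompt": "Hello", "completion": "Hi there", "prompt_tokens": 1})
```

Appending one such line per request to a log file gives a dataset that can be filtered and reused for fine-tuning later.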

No dependency on one cloud provider: This architecture allows us to easily swap out one GPU provider for another with zero downtime if we find a better price somewhere. Also, in case one provider goes down, the other pools keep the system up and running smoothly!
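One way to picture a zero-downtime provider swap is a "make before break" sequence: bring up the replacement pool first, and retire the old pool only afterwards. The sketch below models pools as a plain mapping from a hypothetical pool name to its capacity; it illustrates the ordering only and does not call any real cloud API.

```python
def swap_provider(pools, old, new, new_capacity):
    """Return the pool states for a make-before-break swap.

    Step 1 adds the replacement pool; step 2 removes the old one.
    Refuses swaps that would leave less capacity than before.
    """
    if new_capacity < pools[old]:
        raise ValueError("replacement pool is smaller than the pool it replaces")
    state = dict(pools)
    state[new] = new_capacity  # 1. bring up the replacement first
    steps = [dict(state)]
    del state[old]  # 2. only then retire the old pool
    steps.append(dict(state))
    return steps


# Hypothetical pools, capacity in GPUs.
before = {"provider-a": 4, "provider-b": 4}
steps = swap_provider(before, old="provider-a", new="provider-c", new_capacity=4)
```

Because the new pool is added before the old one is removed, total capacity never drops below its starting value at any step of the swap.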
