TrueFoundry Architecture

tfy-agent (Required): This is the TrueFoundry agent. It initiates the connection to the control-plane and coordinates the instructions it receives from the control-plane.

ArgoCD (Required): ArgoCD applies all the manifests to the Kubernetes cluster. We prefer this to running helm install directly because the ArgoCD controller keeps the state of the cluster in sync with the desired state declared in the manifests, and it is not prone to Helm installation failures.
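To make the sync behaviour concrete, here is a minimal sketch of an ArgoCD Application manifest built as a Python dict and printed as JSON (kubectl accepts JSON as well as YAML). The application name, repository URL, path, and namespaces are hypothetical placeholders, and this is not necessarily how TrueFoundry configures its own applications.

```python
import json


def argocd_application(name, repo_url, path, dest_namespace):
    """Build an ArgoCD Application manifest as a plain dict.

    Every argument value used below is a hypothetical placeholder.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes ArgoCD itself is installed in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": "HEAD",
            },
            "destination": {
                # Address of the API server of the cluster ArgoCD runs in.
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                # prune deletes resources that were removed from the repo;
                # selfHeal reverts manual changes made directly on the cluster.
                "automated": {"prune": True, "selfHeal": True},
            },
        },
    }


app = argocd_application(
    name="example-addon",
    repo_url="https://example.com/org/manifests.git",
    path="addons/example",
    dest_namespace="example",
)
print(json.dumps(app, indent=2))
```

With an automated sync policy like this, the controller keeps re-applying the declared state, which is the property a one-off helm install does not give you.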

Istio (Required currently, will be optional in future): We currently rely on Istio as the ingress controller for the cluster. We do not mandate running Istio sidecars; they can be enabled if required for use cases like mutual TLS. We also plan to adopt the Kubernetes Gateway API, which will allow us to work with multiple ingress controllers like Nginx, Linkerd, Traefik, etc.
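As an illustration of opting in to sidecars for mutual TLS, the sketch below builds two manifests: a Namespace labelled for automatic sidecar injection and a PeerAuthentication policy that enforces strict mTLS in that namespace. The namespace name "payments" is a hypothetical example, and whether TrueFoundry exposes this through its own settings is not covered here.

```python
import json


def sidecar_enabled_namespace(name):
    """Namespace labelled so that Istio injects sidecars into new pods."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"istio-injection": "enabled"}},
    }


def strict_mtls(namespace):
    """PeerAuthentication that only accepts mutual-TLS traffic in a namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


# "payments" is a hypothetical namespace used only for illustration.
manifests = [sidecar_enabled_namespace("payments"), strict_mtls("payments")]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Namespaces without the injection label keep running without sidecars, which matches the opt-in model described above.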

Argo Workflows (Required only for running jobs): We use Argo Workflows to run all jobs inside the cluster because it provides more advanced options than Kubernetes Jobs.
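The text does not say which options are meant. As one example of something a plain Kubernetes Job does not offer, the sketch below builds an Argo Workflow whose steps form a dependency graph (DAG). The step names, the image, and the generated name prefix are hypothetical.

```python
import json


def dag_workflow(image):
    """Build an Argo Workflow that runs three steps in dependency order."""

    def task(name, dependencies=()):
        return {
            "name": name,
            "template": "run-step",
            "arguments": {"parameters": [{"name": "step", "value": name}]},
            # A task starts only after all of its dependencies succeed.
            "dependencies": list(dependencies),
        }

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName needs `kubectl create`; `kubectl apply` rejects it.
        "metadata": {"generateName": "example-job-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "dag": {
                        "tasks": [
                            task("prepare"),
                            task("train", ["prepare"]),
                            task("report", ["train"]),
                        ]
                    },
                },
                {
                    "name": "run-step",
                    "inputs": {"parameters": [{"name": "step"}]},
                    "container": {
                        "image": image,
                        "command": ["echo"],
                        "args": ["{{inputs.parameters.step}}"],
                    },
                },
            ],
        },
    }


workflow = dag_workflow("alpine:3.19")  # hypothetical image
print(json.dumps(workflow, indent=2))
```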

Argo Rollouts (Required): We use Argo Rollouts to support canary and blue-green rollouts on Kubernetes. It is currently a required prerequisite, but it will become optional in the future.
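For readers unfamiliar with Argo Rollouts, a canary strategy is expressed as a list of steps. The sketch below builds a Rollout that references an existing Deployment and shifts traffic in stages. The resource names, weights, and pause durations are hypothetical, and the manifests TrueFoundry generates may look different.

```python
import json


def canary_rollout(name, deployment, replicas=4):
    """Build an Argo Rollouts canary manifest for an existing Deployment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Rollout",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # Reuse the pod template of an existing Deployment.
            "workloadRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "strategy": {
                "canary": {
                    "steps": [
                        {"setWeight": 25},              # 25% to the new version
                        {"pause": {"duration": "5m"}},  # observe for five minutes
                        {"setWeight": 50},
                        {"pause": {}},                  # wait for manual promotion
                    ]
                }
            },
        },
    }


rollout = canary_rollout("example-service", deployment="example-service")
print(json.dumps(rollout, indent=2))
```

Once the last step completes, the rollout promotes the new version to receive all traffic.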

Prometheus (Optional): Needed to show metrics such as CPU, memory, and request counts for the services.

KEDA (Optional): Needed if you want to enable autoscaling for your workloads.
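As a rough idea of what autoscaling with KEDA involves, the sketch below builds a ScaledObject that scales a Deployment on a Prometheus query. The Prometheus trigger is just one of KEDA's scalers, chosen here for illustration. The server address, query, threshold, and resource names are all hypothetical.

```python
import json


def prometheus_scaled_object(name, deployment, query, threshold,
                             min_replicas=1, max_replicas=5):
    """Build a KEDA ScaledObject driven by a Prometheus query."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            # The Deployment that KEDA scales.
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    # KEDA expects trigger metadata values as strings.
                    "metadata": {
                        # Hypothetical in-cluster Prometheus address.
                        "serverAddress": "http://prometheus.monitoring.svc:9090",
                        "query": query,
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


scaled_object = prometheus_scaled_object(
    name="example-service-scaler",
    deployment="example-service",
    query='sum(rate(http_requests_total{service="example-service"}[2m]))',
    threshold=100,
)
print(json.dumps(scaled_object, indent=2))
```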

Loki (Optional): Loki handles log aggregation. You can use any other log aggregator you are comfortable with instead, such as the ELK Stack, CloudWatch, Datadog, etc.

Drivers (EFS, EBS, GPU): These are needed only if you want GPU or volume support in your cluster.

Notebook Controller (Optional): Needed if you want to provide notebooks on the Kubernetes cluster.

