Overall Vision: A developer platform that eases the creation and management of services following best practices and gives a complete picture of the infrastructure, including monitoring of systems, data, cost, and impact, with an initial focus on Machine Learning!
Vision for TrueFoundry (5–10 years)
TrueFoundry at its core aims to make the developer experience of running and managing microservices seamless: with the right level of abstraction, developers can focus purely on writing business logic at very high iteration speed.
Imagine a flow where, after writing the code, I can call a genie and describe my requirements: the kind of service (serverless, cron job, database, API service) and resource requirements like CPU and memory. The genie then creates the service following best practices like GitOps and Infrastructure as Code (IaC), and shows a dashboard with all the metrics.
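The genie flow above is essentially a declarative interface: the developer states what they need, and the platform fills in the rest. A minimal sketch of such a request, assuming hypothetical field names (this is illustrative, not an actual servicefoundry API):

```python
from dataclasses import dataclass

# Hypothetical service kinds drawn from the text: serverless, cron job,
# database, API service. All names here are illustrative.
SERVICE_KINDS = {"serverless", "cronjob", "database", "api"}

@dataclass
class ServiceSpec:
    """Declarative description of what the developer asks the 'genie' for."""
    name: str
    kind: str               # one of SERVICE_KINDS
    cpu: float = 0.5        # CPU cores requested
    memory_mb: int = 512    # memory requested

    def validate(self) -> None:
        if self.kind not in SERVICE_KINDS:
            raise ValueError(f"unknown service kind: {self.kind}")
        if self.cpu <= 0 or self.memory_mb <= 0:
            raise ValueError("resource requests must be positive")

spec = ServiceSpec(name="recommender", kind="api", cpu=2.0, memory_mb=4096)
spec.validate()
print(spec.kind)  # api
```

Everything downstream (GitOps wiring, IaC, dashboards) would be derived from a spec like this rather than configured by hand.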
We want to be able to achieve the following things with ServiceFoundry:
Centralized Infrastructure Provisioning using IAC
ServiceFoundry will provision and host the most commonly used open source infrastructure components on the user’s cloud. A few examples of this can be:
- Launch Kubernetes cluster with security best practices configured.
- Install centralized infrastructure components (or use managed services) like Kafka, Spark, Redis, Prometheus, Grafana, etc.
- We can use cloud managed services for some of them, like AWS Elasticsearch.
- Launch databases and storage layers (use managed versions for now).
- Pipeline orchestration systems like Airflow, Argo, etc.
- CI/CD (GitHub Actions, GitLab, AWS CodePipeline)
- Log Aggregation (ELK, EFK)
- Monitoring (Standard and custom metrics)
Build and deploy services based on configurable templates
ServiceFoundry will be an opinionated set of principles to automate the following:
- Dependency Management and Packaging (Docker, Zip)
- Configuration Management (Static and dynamically changing configs)
- Infrastructure provisioning (on top of centralized infra provisioned earlier)
- Autoscaling configuration
- Logs aggregation
- Dashboard Generation with standard metrics (Users can add custom metrics)
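The idea behind the list above is that a small set of user inputs expands into a full, opinionated deployment config. A rough sketch of that expansion, with hypothetical defaults and field names (not a real servicefoundry schema):

```python
# Hypothetical template expansion: the user supplies a few values, the
# platform fills in opinionated defaults for packaging, autoscaling,
# log aggregation, and dashboards.
def render_service_config(name, replicas=1, max_replicas=5, custom_metrics=None):
    return {
        "name": name,
        "packaging": {"type": "docker"},          # dependency packaging
        "autoscaling": {                          # autoscaling configuration
            "min_replicas": replicas,
            "max_replicas": max_replicas,
            "target_cpu_percent": 70,
        },
        "logging": {"aggregator": "efk"},         # log aggregation
        "dashboard": {                            # standard metrics + user's custom ones
            "metrics": ["latency_p99", "error_rate"] + (custom_metrics or []),
        },
    }

config = render_service_config("recommender", custom_metrics=["cache_hit_rate"])
print(config["autoscaling"]["max_replicas"])  # 5
```

The developer only ever touches the arguments; the generated structure is the platform's responsibility.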
Similar to the above, we also want to do the same for ML models and databases.
ServiceFoundry will aim to streamline the deployment and monitoring of the standard types of services:
- LoadBalanced API Service (with autoscaling on different parameters)
- Job Service (Cron jobs, jobs triggered by events)
- Stateful Services
- Static Website
Service Catalog and Graph
All services created using ServiceFoundry can be viewed in one place along with their complete metadata. This catalog will also show all the environments for each application, like dev, staging, and prod. This leads to a developer platform portal where developers and business leaders can view the services running in the organization. Some of the key metadata associated with each service:
- Link to Github Repository
- Monitoring Links
- Team and owners
- Ability to add members with different access control.
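A catalog entry bundling the metadata above could look something like the following sketch. The structure, field names, and URLs are all illustrative assumptions, not an actual servicefoundry schema:

```python
# Hypothetical service-catalog entry: repo link, monitoring links, team,
# owners, members with access-control roles, and environments.
catalog = {
    "recommender": {
        "repo": "https://github.com/example-org/recommender",   # illustrative URL
        "monitoring": ["https://grafana.example.internal/d/recommender"],
        "team": "ml-platform",
        "owners": ["alice"],
        "members": {"alice": "admin", "bob": "viewer"},  # role-based access
        "environments": ["dev", "staging", "prod"],
    },
}

def services_for_team(team):
    """The 'one place' view: all catalog services owned by a team."""
    return [name for name, meta in catalog.items() if meta["team"] == team]

print(services_for_team("ml-platform"))  # ['recommender']
```

A portal UI would simply render queries like this over the catalog.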
TrueFoundry MLOps (ML First Platform)
The initial focus of TrueFoundry will be to provide a seamless MLOps platform that focuses on the post-model-building pipeline and makes it really easy for data scientists to deploy, monitor, and retrain their models.
A machine learning pipeline comprises the following centralized infrastructure.
A brief explanation of the different steps involved:
- Data Pipeline and Feature Store: This is essentially a big-data problem wherein the features used by the model need to be computed from the data lake and made available, within the required time constraints, for both training and production without disparity. It usually uses a workflow orchestration engine like Airflow, Argo, or Kubeflow Pipelines.
- Model Training: Model training is essentially a compute-heavy distributed job that can run over multiple machines. It should also offer built-in resiliency via checkpoint saving and restoring.
- Model Serving: This is basically a microservice that receives requests for model predictions and can have varied requirements like GPUs or high compute and memory. Each model is usually hosted as its own microservice, so when a team scales to tens of models, it becomes a problem of managing tens of microservices, which is a big problem in itself.
- Model Monitoring: This includes both system-metrics monitoring and machine-learning-specific monitoring related to the performance and decay of the model. It also requires systems to store the logged data, run aggregations on it, and finally compute the metrics.
- Model Management: This tracks all the data related to models, their different versions, and experiments. It's highly useful for debugging issues later and rolling back.
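The built-in resiliency mentioned for model training is typically achieved with checkpointing: persist progress periodically so a restarted job resumes rather than starting over. A minimal stdlib-only sketch of the idea (real training frameworks have their own checkpoint formats):

```python
import json
import os
import tempfile

# Save/restore training state so a distributed job survives restarts.
def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}  # fresh run, no checkpoint yet

def train(path, total_epochs=5):
    state = load_checkpoint(path)      # resume from wherever we left off
    for epoch in range(state["epoch"], total_epochs):
        loss = 1.0 / (epoch + 1)       # stand-in for real training work
        state = {"epoch": epoch + 1, "loss": loss}
        save_checkpoint(path, state)   # checkpoint after every epoch
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(train(ckpt)["epoch"])  # 5
```

If the process is killed mid-run, calling `train` again with the same path picks up from the last completed epoch.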
Because of the many moving parts and different technologies involved, an ML project usually involves multiple personas: data engineers, data scientists, ML engineers, DevOps engineers, and product managers. A successful project requires coordination among all of them, which leads to a lot of delays and hampers the speed of a data scientist.
A typical workflow in companies for a machine learning pipeline looks something like:
Key Goal behind the ML platform
We want to automate the parts of the ML pipeline that can be automated and empower data scientists to test their models in production and iterate fast, with as few dependencies on other teams as possible. We draw our motivation from the products created by platform teams at top tech companies, which allow all teams to move much faster and deploy and iterate on their own.
We don’t handle any of the data related problems now — that section will be introduced later.
A key ML platform comprises the following services (apart from the central infrastructure):
- Training (A scheduled Job with different triggers)
- Model Service (A LoadBalanced API Service)
- Storage (Artifacts, datasets, model inference data)
- ML Monitoring Service (A service to compute metrics from data)
- Feature Engineering Service
If we can easily deploy these services, maintain versioning across different stages, and generate monitoring for each of them, the MLOps problem becomes much simpler.
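To make the ML Monitoring Service above concrete, here is a deliberately simple sketch of computing one metric (accuracy) and a naive decay signal from logged inference data. A real system would run these as aggregations over a data store; names and thresholds here are illustrative assumptions:

```python
# Compute model accuracy from an inference log and flag decay against a
# baseline. Plain Python for illustration only.
def accuracy(records):
    """records: list of {'prediction': ..., 'label': ...} dicts from the log."""
    labeled = [r for r in records if r.get("label") is not None]
    if not labeled:
        return None  # no ground truth yet
    hits = sum(1 for r in labeled if r["prediction"] == r["label"])
    return hits / len(labeled)

def decayed(baseline, current, tolerance=0.05):
    """Flag decay when accuracy drops more than `tolerance` below baseline."""
    return current is not None and baseline - current > tolerance

log = [
    {"prediction": 1, "label": 1},
    {"prediction": 0, "label": 1},
    {"prediction": 1, "label": 1},
    {"prediction": 0, "label": 0},
]
acc = accuracy(log)        # 0.75
print(decayed(0.9, acc))   # True
```

The same pattern (aggregate logged data, compare against a baseline, alert) generalizes to drift, latency, and other model-quality metrics.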
This blog was first published on Medium at https://abhishekch09.medium.com/d8e159743a4b