PipelineAI CLI for Distributed TensorFlow
PipelineAI supports Distributed TensorFlow using Kubernetes and the PipelineCLI or REST API.
The PipelineCLI or REST API will generate the appropriate Docker images and Kubernetes YAML using the
pipeline train-kube-build command.
You can then deploy/start the training job using these Docker images and YAML on any cluster - either manually or through the
pipeline train-kube-start command.
Distributed TensorFlow Scaling
Regarding single-node vs. multi-node training:
We encourage you to explore the large, single-node, multi-GPU cloud instances like AWS's x1e.32xlarge (128 vCPUs, ~4 TB RAM) and p3.16xlarge (64 vCPUs, 488 GB RAM, 8 Nvidia V100 GPUs).
While we understand the need to scale out (instead of up), we want to highlight that TensorFlow model training is much easier to troubleshoot and tune when running on a single node versus a cluster.
This is similar to Apache Spark, which has also been optimized for single-node deployments (since Spark 2.2, we believe).
Distributed TensorFlow Auto-Scaling
At this time (TF 1.5/1.6), TensorFlow does not handle autoscaling very well during training.
And from what we hear, this is not a priority for Google/TensorFlow.
We don't actually get a lot of real-world requests for autoscaling during model training. Usually, the workload - and server resource capacity - is well-known at the start of the training job.
It's worth noting that TensorFlow does support losing and restarting workers - as long as you're using asynchronous training, which is the default for the TF Estimator API.
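Each process in a Distributed TensorFlow cluster learns its role (chief/master, worker, or parameter server) from the standard TF_CONFIG environment variable, which is what lets a restarted worker rejoin the job. A minimal sketch of what that configuration looks like for one worker (hostnames, ports, and cluster size are hypothetical - the PipelineCLI generates the real values for you):

```python
import json
import os

# Hypothetical 5-process cluster: 1 chief, 2 workers, 2 parameter servers.
cluster = {
    "chief": ["chief-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
    "ps": ["ps-0:2222", "ps-1:2222"],
}

# Every process receives the same "cluster" map plus its own "task" entry.
# This is what would be set for the second worker (index 1).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 1},
})

# The TF Estimator API reads TF_CONFIG at startup to decide whether this
# process trains (worker), coordinates (chief), or holds variables (ps).
config = json.loads(os.environ["TF_CONFIG"])
print(config["task"]["type"], config["task"]["index"])  # worker 1
```

Because the cluster map is static for the life of the job, a restarted worker comes back with the same TF_CONFIG and resumes its role - which is also why adding or removing processes mid-job (autoscaling) is awkward.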
Losing a Master or PS server is a bit more disruptive, but PipelineAI handles these failures for you.
TensorFlow Serving Auto-Scaling
While not directly related to this article, it's worth noting that TensorFlow model serving does work well with autoscaling!
PipelineAI fully supports autoscaling of TensorFlow model servers using any runtime including TensorFlow Serving, Python, C++, Nvidia TensorRT on any hardware including CPUs, GPUs, TPUs, etc.
That's one of the key benefits of using PipelineAI: you can package and deploy the exact same model in different runtimes - and compare them in live production.
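On Kubernetes, serving autoscaling is typically expressed as a HorizontalPodAutoscaler targeting the model-server Deployment. A generic sketch of the idea (names, replica counts, and targets are hypothetical - this is standard Kubernetes, not PipelineAI's actual generated config):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving          # hypothetical model-server Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas when average CPU exceeds 70%
```

Serving scales cleanly this way because each replica is stateless; training workers, by contrast, are enumerated in TF_CONFIG, which is why the same trick doesn't apply during training.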
TensorFlow Estimator API
For TensorFlow, you should be using the Estimator API - otherwise you will be in a world of hurt when porting your code from a single node to a distributed cluster.
Your Kubernetes Cluster must have access to the dataset from all physical nodes that may participate in the training job. This is a problem for some Kubernetes cluster configurations, so make sure you work with your ops group to apply the appropriate RBAC and IAM roles for your environment.
In AWS land, for example, you'll likely want to apply EC2 IAM Instance Roles to the EC2 instances backing your Kubernetes Cluster. This lets the pods/containers access S3 on any node in the cluster - without exposing S3 keys in your training code or Kubernetes Cluster configuration.
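For illustration, a minimal read-only S3 policy of the kind you might attach to such an instance role (the bucket name is hypothetical - the PipelineAI tooling generates the equivalent configuration for you):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-training-datasets",
        "arn:aws:s3:::my-training-datasets/*"
      ]
    }
  ]
}
```

Because the policy lives on the EC2 instance role rather than in your code, every pod scheduled onto that node can read the dataset without any S3 keys appearing in the training script or cluster config.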
The PipelineAI Products handle all of this configuration for you, so please check them out.
What if I haven't been using the Estimator API?
At this point (TensorFlow 1.5+), you should always be using the Estimator API. The original single-node TensorFlow examples do not translate well to a Distributed TensorFlow Cluster.
This is likely because Distributed TensorFlow was released 3-4 months after the original TensorFlow release.
What if we already have a Kubernetes Cluster?
This is great!
As long as the PipelineCLI environment has access to deploy Kubernetes resources to the cluster, you're all set.
Remember that the PipelineCLI can deploy Distributed TensorFlow Jobs to the Kubernetes Cluster from anywhere - your local laptop, a bastion/gateway server, or a CI/CD server like Jenkins.
The PipelineCLI will generate the appropriate Kubernetes YAML for all resources that participate in the training job. You can inspect the generated Kubernetes YAML to understand what is happening beneath the CLI layer.
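As an illustration, the YAML for a single worker typically boils down to a Kubernetes resource along these lines (names, image, and field values are hypothetical - inspect your own generated files for the real details):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-worker-0
spec:
  backoffLimit: 4              # retry the pod a few times before giving up
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: tensorflow
        image: my-train-image:v1   # hypothetical image built by train-kube-build
        env:
        - name: TF_CONFIG          # tells this process its role in the cluster
          value: '{"cluster": {...}, "task": {"type": "worker", "index": 0}}'
```

One such resource is generated per process (master, each worker, each parameter server), differing mainly in the "task" portion of TF_CONFIG.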
What if a Master, Worker, Parameter Server, or Kubernetes Node Dies During Training?
You can inspect the PipelineCLI-generated Kubernetes YAML for the details of the Kubernetes training job resources, but in most cases, the resource will be restarted automatically.
Usually the job will continue once the resource is restarted; however, Distributed TensorFlow does have its limitations, and these may require the entire job to be restarted.
If you experience frequent node/process failures, you should - at a minimum - increase the frequency of your model checkpoints.
While frequent checkpoints will increase the overall training time of your job, they will improve the recovery time for your failing nodes/processes.
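The tradeoff can be sketched with a back-of-envelope model: each checkpoint costs some write time, but on a failure you only lose the work done since the last checkpoint. A rough illustration (all numbers are hypothetical, purely to show the shape of the tradeoff):

```python
def expected_overhead_minutes(total_steps, steps_per_checkpoint,
                              checkpoint_cost_min, expected_failures,
                              minutes_per_step):
    """Rough model: checkpoint write time + work re-done after failures.

    On average, a failure loses half a checkpoint interval of work.
    """
    num_checkpoints = total_steps // steps_per_checkpoint
    checkpoint_time = num_checkpoints * checkpoint_cost_min
    lost_work = expected_failures * (steps_per_checkpoint / 2) * minutes_per_step
    return checkpoint_time + lost_work

# Hypothetical job: 100k steps at 0.01 min/step, each checkpoint costs
# 0.5 minutes, and we expect 2 node failures over the course of training.
infrequent = expected_overhead_minutes(100_000, 10_000, 0.5, 2, 0.01)
frequent   = expected_overhead_minutes(100_000,  1_000, 0.5, 2, 0.01)
print(infrequent, frequent)  # 105.0 vs 60.0 - frequent checkpoints win here
```

With no failures at all the infrequent schedule would win, which is why checkpoint frequency is worth revisiting only when failures are actually observed.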
How Do We Schedule Recurring Distributed TensorFlow Training Jobs?
Like any other scheduled job, you can use the PipelineCLI or REST API to deploy a training job to your cluster at any interval.
Many of our users create Airflow DAGs to schedule and trigger recurring training jobs.
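Whatever the scheduler, the recurring job ultimately just invokes the CLI on an interval. A hypothetical nightly crontab entry, for example (the command is from this document; the schedule and log path are assumptions):

```
# Hypothetical crontab entry: kick off the training job at 2am daily
0 2 * * * pipeline train-kube-start >> /var/log/pipeline-train.log 2>&1
```

An Airflow DAG does the same thing with retries, alerting, and dependency handling layered on top.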