Erbli Kuka

Maximizing Efficiency in Distributed AI Training and Execution with Ray and DeepSpeed

Image: Ray, DeepSpeed, and Qwen logos on an abstract background

Introduction



Processing large datasets and fine-tuning complex models present significant challenges, especially in distributed systems. Managing tasks across multiple machines in clusters adds further complexity. This blog shows how Ray simplifies job execution across nodes, while DeepSpeed optimizes large-scale model training. Together, they offer an efficient solution for job scheduling and scaling. The linked GitHub repository contains a set of hands-on examples. By the end, you'll see how Ray and DeepSpeed make distributed task management and model training accessible for projects of any scale.



The Challenges of Distributed Execution



In distributed computing, managing task allocation, execution, and resource availability across nodes introduces synchronization challenges and coordination overhead. Ray's autoscaler adjusts compute resources in real time, rebalancing load as demand changes. Ray's Python API simplifies scalable execution with tools like @ray.remote to define distributed tasks, ray.get() to retrieve results, and Ray Train (ray.train) for distributed model training. This reduces the complexity typically involved in distributed setups, making Ray an ideal choice for efficiently utilizing available nodes (PCs) with GPUs or TPUs when a single powerful node is unavailable.
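As a quick illustration of those primitives, the short sketch below fans a toy preprocessing function out across whatever workers are available. The function and its inputs are placeholders; @ray.remote, ray.get(), and ray.init() are the actual Ray API calls.

import ray

# Start a local Ray instance; on a cluster you would use ray.init(address="auto")
# to attach to the running head node instead.
ray.init()

@ray.remote
def preprocess(batch):
    # Placeholder work; in practice this might tokenize or transform a data shard.
    return [x * 2 for x in batch]

# Launch tasks in parallel across available workers...
futures = [preprocess.remote(list(range(i, i + 4))) for i in range(0, 16, 4)]

# ...then block until every result is ready.
print(ray.get(futures))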



Ray uses a head node to coordinate worker nodes, enabling efficient, scalable task execution with simplified communication and minimal manual synchronization.


DeepSpeed further enhances this setup by optimizing how model memory is distributed across GPUs. Unlike alternatives such as Horovod or PyTorch's built-in model parallelism, DeepSpeed emphasizes memory and computational efficiency. ZeRO-3 (Zero Redundancy Optimizer, Stage 3) partitions model states (parameters, gradients, and optimizer states) across GPUs and can offload them to CPU memory, reducing per-GPU memory demands and enabling larger models and batch sizes.
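To make that concrete, here is a minimal sketch of what a ZeRO-3 configuration with CPU offloading might look like. The values are illustrative rather than the exact settings used in the repository; in a Hugging Face setup this dictionary (or an equivalent JSON file) is passed to TrainingArguments via its deepspeed parameter.

# Illustrative DeepSpeed ZeRO-3 configuration with CPU offload.
# "auto" lets the Hugging Face integration fill in values from TrainingArguments.
ds_config = {
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer states
        "offload_param": {"device": "cpu"},      # move idle parameters to CPU RAM
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in CPU RAM
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}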



A typical setup process involves defining a cluster configuration in a YAML file, which you can then deploy using the command:


ray up raycluster.yaml

This command launches the head node and provisions the worker nodes automatically. Ray’s autoscaling ensures that the cluster dynamically adjusts resource allocation based on the workload, providing cost efficiency.
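Once the cluster is up, a simple sanity check (assuming it runs on a node attached to the cluster) is to query Ray's view of the available resources from Python; under autoscaling, the totals grow and shrink as nodes are added or removed.

import ray

# Attach to the cluster started by `ray up`.
ray.init(address="auto")

# Aggregate CPUs, GPUs, and memory across all nodes currently in the cluster.
print(ray.cluster_resources())

# Per-node details, e.g. to verify that each worker is alive and exposes a GPU.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"], "alive:", node["Alive"])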



Example 1: Submitting Jobs to the Cluster



After initializing the head node, you can submit jobs across nodes for distributed processing. Ray’s dashboard, accessible with:


ray dashboard raycluster.yaml

provides real-time insights into node performance, task progress, and resource utilization from a local interface.

In this example, we fine-tune the Qwen2.5 0.5B model, chosen for its balance between performance and computational feasibility. Qwen2.5 was selected over LLaMA 3.2, which presented graph-related errors, likely tied to CUDA compatibility, that prevented reliable fine-tuning despite numerous attempts. In contrast, Qwen2.5 proved a more compatible choice for our distributed setup, with Ray and DeepSpeed handling data batching and load balancing seamlessly.
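The full training script lives in main.py in the repository; what follows is only a simplified sketch of how such a job can be wired together with Ray Train, Hugging Face Transformers, and a ZeRO-3 config like the one above. The dataset, hyperparameters, and worker count are placeholders, and the Ray-Transformers helpers assume a recent Ray release (2.7 or later).

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Runs once per Ray Train worker; each worker is assigned one GPU.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    import ray.train.huggingface.transformers as ray_transformers

    model_name = "Qwen/Qwen2.5-0.5B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Placeholder dataset; the real script loads and tokenizes its own data.
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=dataset.column_names)

    ds_config = {  # ZeRO-3 with CPU offload, as sketched earlier
        "zero_optimization": {"stage": 3,
                              "offload_param": {"device": "cpu"},
                              "offload_optimizer": {"device": "cpu"}},
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    args = TrainingArguments(
        output_dir="qwen-finetune",
        per_device_train_batch_size=2,   # illustrative; ZeRO-3 allows larger batches
        num_train_epochs=1,
        deepspeed=ds_config,
        report_to="none",
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))

    # Let Ray Train coordinate the distributed run and collect metrics/checkpoints.
    trainer.add_callback(ray_transformers.RayTrainReportCallback())
    trainer = ray_transformers.prepare_trainer(trainer)
    trainer.train()


ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=5, use_gpu=True),  # illustrative worker count
)
result = ray_trainer.fit()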



You can submit a fine-tuning job to the cluster with the following command:


ray job submit --address http://localhost:8265 --working-dir . -- python3 main.py

The trained model is then saved to an S3 bucket, making it easy to retrieve and analyze later.
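Exactly how the model lands in S3 depends on the training script; one common pattern, shown here as an assumption rather than what main.py necessarily does, is to hand Ray Train an S3 storage path through RunConfig so checkpoints and results are persisted to the bucket automatically.

from ray.train import RunConfig

# Hypothetical bucket and prefix; cluster nodes need AWS credentials with write access.
run_config = RunConfig(name="qwen-finetune", storage_path="s3://my-training-bucket/runs")

# Passing run_config=run_config to the TorchTrainer in the earlier sketch makes Ray Train
# write checkpoints and results to S3 instead of the local filesystem.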



Real-World Applications of Ray and DeepSpeed



Ray and DeepSpeed are effective for handling compute-intensive tasks, such as fine-tuning complex language models. In practical applications, Ray manages high-performance instance allocation and scheduling, while DeepSpeed optimizes memory usage and computation. This ensures reliability and reduces memory bottlenecks, allowing us to scale machine learning processes while maximizing resource utilization.



For example, by using multiple GPU-equipped instances, we can leverage Ray's distributed framework to reduce processing time. Though scaling does not yield a strictly linear decrease in time, the advantage over relying on a single powerful node is clear, as multiple nodes can significantly speed up the fine-tuning process. While DeepSpeed ZeRO-3 adds some processing time because of the communication needed to gather partitioned model states, it reduces GPU memory load significantly, enabling larger batch sizes and even CPU offloading for models that exceed VRAM limits.



Lessons Learned



One standout feature of Ray and DeepSpeed is how they simplify task coordination, data batching, and asynchronous execution, tasks that typically require custom logic in distributed systems. Ray abstracts away these complexities, while DeepSpeed's optimizations ease memory constraints. This combination is beneficial for scaling and flexibility, enabling easy switching between frameworks like Hugging Face, PyTorch, TensorFlow, and Keras without rewriting scaling logic. The frameworks' flexibility has proven valuable for distributed execution, allowing us to focus on training rather than infrastructure management.



However, it's worth noting that the combination of Ray and DeepSpeed is fragile with respect to the dependency versions required for CUDA and packages like transformers, accelerate, and torch. Although pinning compatible versions can complicate the initial setup, especially as new releases emerge, the result is a robust distributed training system once everything is in place.



Why Use Multiple Instances for Fine-Tuning?



Fine-tuning large models across multiple instances offers several key benefits:



1. Faster Results: By leveraging several worker nodes, you can achieve significantly faster training times. In our testing, five worker nodes yielded a 3.88x speed increase compared to a single node with identical resources and datasets. (A quick way to sanity-check that every worker's GPU is visible before launching such a run is sketched after this list.)



2. Memory Management: When dealing with large datasets and models, a single GPU often doesn't have enough memory to support training. DeepSpeed's ZeRO-3 optimizer enables distributed memory management, allowing larger models and datasets to fit by spreading memory usage across multiple GPUs and offloading to CPU memory when necessary.



3. Seamless Integration: Ray supports various machine learning libraries, such as Hugging Face Transformers and TensorFlow, while DeepSpeed makes full use of GPU resources, removing the need for custom parallelization logic. This ease of integration is one of the reasons Ray and DeepSpeed stand out in distributed training.
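Before relying on those benefits, it helps to confirm that every worker node actually exposes a GPU to Ray. The hedged sketch below schedules one single-GPU task per expected worker and reports the host name and device it landed on; the worker count of five matches the earlier example and is otherwise arbitrary.

import socket

import ray
import torch

ray.init(address="auto")  # attach to the running cluster

@ray.remote(num_gpus=1)
def gpu_check():
    # Each invocation is scheduled on a worker with one GPU reserved for it.
    return socket.gethostname(), ray.get_gpu_ids(), torch.cuda.get_device_name(0)

# One task per expected worker; results confirm which hosts and GPUs were used.
print(ray.get([gpu_check.remote() for _ in range(5)]))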



Conclusion



Ray and DeepSpeed offer a robust solution for distributed training and task execution, simplifying resource scaling and enhancing model training. By combining their strengths, ML engineers and MLOps teams can efficiently handle complex machine learning tasks without extensive infrastructure management. This blend of autoscaling, task orchestration, and memory optimization makes distributed computing approachable, paving the way for scaling machine learning operations in a cost-effective manner.
