The choice of scaling strategy is critical to performance. Horizontal scaling adds more servers to share the workload, while vertical scaling focuses on optimizing processing within a single server.
The efficiency of a parallelization strategy depends on several factors. For tensor parallelism, the speed of communication between GPUs is critical, because every layer's partial results must be exchanged between devices; NVLink bandwidth, for example, significantly affects the optimal number of GPUs to scale to. Model size and architecture also matter. For larger models, pipeline parallelism may be more advantageous because it reduces the amount of data transferred between GPUs, provided a continuous workload keeps the early pipeline stages busy so they do not become a bottleneck for the whole process.
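To make the distinction concrete, here is a minimal sketch of enabling tensor parallelism with the vLLM serving library. The model name and parallel size are illustrative assumptions, not the configuration behind our measurements.

```python
# A minimal sketch, assuming the vLLM library; model choice is hypothetical.
from vllm import LLM, SamplingParams

# Tensor parallelism shards each layer's weight matrices across GPUs,
# so every forward pass incurs all-reduce traffic between them. This is
# why a fast interconnect such as NVLink matters so much here.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # illustrative model
    tensor_parallel_size=4,             # shard each layer across 4 GPUs
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```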
Because the two types of parallelization are limited by different factors, combining them can sometimes be the best solution, as the sketch below illustrates.
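A common hybrid layout keeps tensor parallelism within a node, where GPUs share fast links, and uses pipeline parallelism across nodes, where bandwidth is lower. The sketch below assumes a vLLM version that exposes both parallel sizes in the offline API; the sizes and model are again illustrative.

```python
# A hedged sketch of combining both strategies, assuming vLLM supports
# pipeline_parallel_size in this API. All values are illustrative.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # hypothetical large model
    tensor_parallel_size=4,    # intra-node sharding over NVLink-class links
    pipeline_parallel_size=2,  # pipeline stages across 2 nodes (8 GPUs total)
)
```

The design intuition: communication-heavy tensor parallelism stays on the fastest links, while the pipeline boundary, which moves far less data per step, crosses the slower inter-node network.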
Our measurements show that an appropriate parallelization strategy can yield up to a 1.8x speedup with tensor parallelism and a 1.38x speedup with pipeline parallelism. This is a significant performance gain, especially considering that on larger systems it can translate into thousands of additional requests per minute. Of course, these figures come from a specific, though realistic, use case.
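To see how the speedups map to request volume, here is the arithmetic with a purely hypothetical baseline throughput; only the 1.8x and 1.38x factors come from our measurements.

```python
# Illustrative arithmetic only: baseline_rpm is a hypothetical figure,
# not a measured result. The speedup factors are from the text above.
baseline_rpm = 3000                    # hypothetical requests per minute
extra_tp = baseline_rpm * (1.8 - 1.0)  # +2400 req/min at the 1.8x speedup
extra_pp = baseline_rpm * (1.38 - 1.0) # +1140 req/min at the 1.38x speedup
print(extra_tp, extra_pp)              # 2400.0 1140.0
```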