While our study provides valuable insights, it also has several limitations that warrant further investigation. Our experiments used a single model size and hardware configuration; understanding how these patterns scale across different model sizes and hardware platforms remains an open question. In addition, while our synthetic workload was designed to be representative, it may not capture all the complexities of real-world usage patterns.
Future work should investigate how these performance characteristics change with different model architectures, prompt lengths, and more diverse workload patterns. Exploring the impact of optimisation techniques such as continuous batching or dynamic tensor parallelism could also inform system-level optimisation.
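To illustrate why continuous batching is a promising direction, the following toy simulation (not part of our study; the scheduler, request lengths, and batch size are all hypothetical) contrasts it with static batching. Static batching waits for an entire batch to finish decoding before admitting new requests, while continuous batching refills freed slots immediately:

```python
from collections import deque

def simulate(lengths, max_batch, continuous):
    """Count decode steps needed to serve all requests.

    lengths:    remaining decode steps per request, in arrival order
                (hypothetical workload).
    max_batch:  maximum number of requests decoded per step.
    continuous: if True, admit a waiting request as soon as a slot
                frees (continuous batching); if False, wait until
                the whole running batch drains (static batching).
    """
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        # Admit new requests: always when continuous, otherwise only
        # once the previous batch has fully drained.
        if continuous or not running:
            while queue and len(running) < max_batch:
                running.append(queue.popleft())
        steps += 1  # one decode step advances every running request
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

# A short request arriving alongside a long one frees its slot early;
# continuous batching reuses that slot instead of idling it.
print(simulate([4, 1, 3], max_batch=2, continuous=True))   # 4 steps
print(simulate([4, 1, 3], max_batch=2, continuous=False))  # 7 steps
```

Even this simplified model shows how interleaving admissions can reduce total decode steps when request lengths are heterogeneous, which is one reason the technique merits systematic study.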
The performance patterns we observe also raise interesting questions about the fundamental limits of LLM serving systems and how close current architectures come to those limits. Understanding them could help guide future hardware and software co-design efforts in this area.