AI infrastructure operational challenges: optimising and scaling

Understanding the growing need for purpose-built AI infrastructure
AIOps, AI DevOps
21 December 2024
Optimising AI infrastructure: strategies for performance and cost efficiency
With the rise of artificial intelligence, more and more organisations face the challenge of operating AI systems. In our research as AI researchers and engineers at LogiNet Systems, we explored the key considerations for designing and managing an effective AI infrastructure.

When is a dedicated AI infrastructure needed?

Before looking at optimisation issues, it is important to clarify when a dedicated AI infrastructure is needed. If an organisation only uses AI services occasionally, cloud-based APIs may be sufficient. However, when data confidentiality, critical response times, or cost optimisation of large-scale AI operations come into play, it becomes worthwhile to consider building and managing an in-house AI infrastructure.

Infrastructure choice and its impact on performance

There are three main approaches to AI infrastructure design: on-premises, cloud, and hybrid.
  • On-premises solutions offer complete control and predictable costs but require significant up-front investment and a dedicated team for maintenance.
  • Cloud solutions provide rapid deployment and flexible scalability, but can be more expensive in the long run.
  • Hybrid solutions combine the benefits of both approaches but result in a more complex architecture.

Scaling strategies and parallelisation

The choice of scaling strategy is critical to performance. Horizontal scaling adds more servers to share the processing load, while vertical scaling focuses on optimising processes within a single server.
The efficiency of parallelisation strategies depends on several factors. For tensor parallelisation, the speed of communication between GPUs is critical: the bandwidth of NVLink, for example, significantly affects the optimal number of GPUs to scale to. Model size and architecture also matter. For larger models, pipeline parallelisation may be more advantageous, as it reduces the need to transfer data between GPUs, especially when a continuous workload ensures that the first pipeline stages do not become a bottleneck for the entire process.
As the two types of parallelisation have different limitations, combining them can sometimes be the best solution, as the sketch below illustrates.
Our measurements show that an appropriate parallelisation strategy can yield up to a 1.8x speedup with tensor parallelisation and a 1.38x speedup with pipeline parallelisation. This is a significant performance gain, especially considering that on larger systems it can mean thousands of additional requests per minute. Of course, these figures are based on a specific but realistic use case.
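To make this concrete, below is a minimal sketch of how the two strategies can be combined in an open-source serving engine such as vLLM. The model name and parallel sizes are illustrative, not a recommendation, and must be tuned to the actual hardware; combining tensor and pipeline parallelism in the offline API may also require a recent vLLM version.

```python
# Minimal sketch: combining tensor and pipeline parallelism in vLLM.
# Model name and parallel sizes are illustrative; on NVLink-connected
# GPUs a larger tensor_parallel_size tends to pay off, while pipeline
# parallelism helps when inter-GPU bandwidth is the bottleneck.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 stages (8 GPUs total)
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

In this configuration, keeping the pipeline continuously fed with requests is what prevents the first stages from idling, which is exactly the bottleneck effect described above.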

Optimising performance and costs - a dynamic approach

Performance optimisation is closely linked to cost efficiency. For every AI infrastructure and model, there is an optimal operating point where the best price/performance ratio can be achieved. Optimising response time is particularly important for interactive applications. Our measurements show that with proper configuration and load balancing, response times as low as 100-200ms can be achieved, which is near real-time from a human interaction perspective. In a real-world use case, a single NVIDIA H100 GPU was able to handle 21 parallel requests at peak load with a First Token Latency (FTL) of 220-340ms.
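First Token Latency can also be verified from the client side. The sketch below assumes an OpenAI-compatible streaming endpoint (the base URL, API key, and model id are placeholders for your own deployment) and times the arrival of the first generated token.

```python
# Minimal sketch: measuring First Token Latency (FTL) against an
# OpenAI-compatible inference endpoint. base_url, api_key and model
# are placeholders for your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",  # placeholder model id
    messages=[{"role": "user", "content": "Summarise NVLink in one sentence."}],
    stream=True,
)
for chunk in stream:
    # The first chunk that carries actual content marks the first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```

Running several of these clients in parallel (for example with a thread pool) gives a rough picture of how FTL degrades as concurrency grows.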
Batch processing is a different story - the critical metric here is total system throughput. In one of our test systems, for example, we were able to increase the initial performance of 111 tokens per second to over 1,700 tokens per second with the right optimisation - a performance improvement of more than 15 times.
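Throughput is straightforward to measure as total generated tokens divided by wall-clock time. A minimal sketch, again assuming vLLM with an illustrative model and synthetic prompts:

```python
# Minimal sketch: measuring batch throughput (tokens/second) with vLLM.
# Model and prompts are illustrative; vLLM applies continuous batching
# internally, so submitting all prompts at once keeps the GPU saturated.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
prompts = [f"Write a product description for item {i}." for i in range(256)]
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s over {len(prompts)} requests")
```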

Efficient use of GPU memory

Effective utilisation of GPU memory is critical to performance. High memory utilisation enables more efficient operation of the KV cache, reduces the need for data transfers between the CPU and GPU, and optimises batch processing. In our experience, a well-tuned KV cache can provide a 2-3x speedup in response times, especially when processing long texts where both prompt and response lengths are significant.
By optimising the cache hit rate, not only can response times be reduced, but system throughput can be significantly increased.
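As an illustration, serving engines such as vLLM expose automatic prefix caching, which lets requests sharing a long common prefix (a system prompt, for instance) reuse KV-cache blocks instead of recomputing them. The model name and prompts below are placeholders.

```python
# Minimal sketch: raising the KV-cache hit rate with prefix caching.
# Requests that share the long system prompt reuse its cached KV blocks,
# so only the short user-specific suffix has to be computed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,    # reuse KV blocks across requests
    gpu_memory_utilization=0.90,   # keep most of the GPU memory available for the cache
)

system_prompt = "You are a support assistant for ACME Corp. <long policy text>"
questions = ["How do I reset my password?", "Where can I download my invoice?"]
prompts = [f"{system_prompt}\nUser: {q}\nAssistant:" for q in questions]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```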

Monitoring and operations - the key to continuous optimisation

A comprehensive monitoring system is essential for successful operations. In addition to hardware metrics, continuous tracking of service metrics is necessary. Monitoring not only helps to identify problems but also facilitates proactive optimisation. The data collected can be used to identify performance bottlenecks and predict the need for capacity expansion. This enables the scaling process to be automated and aligned with predicted demand.
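As a starting point, the underlying hardware metrics can be collected with NVIDIA's NVML bindings (the nvidia-ml-py package) and exported to whatever monitoring stack is in use; the sketch below simply prints them.

```python
# Minimal sketch: reading GPU utilisation and memory usage via NVML.
# In production these values would be exported to a time-series store
# (e.g. Prometheus) rather than printed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {util.gpu}% compute, {mem.used / mem.total:.0%} memory used")
finally:
    pynvml.nvmlShutdown()
```

Tracked over time, exactly these series, together with service-level metrics such as FTL and throughput, are what make capacity forecasts and automated scaling decisions possible.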

Future trends

Significant advances are expected in AI infrastructure solutions. In addition to more efficient models and optimisation techniques, the development of cost-effective hardware solutions and smarter resource management are likely to drive progress. Current trends point to increased adoption of hybrid solutions that combine the benefits of on-premises and cloud systems.
Operating an AI infrastructure is a complex task that requires continuous improvement and optimisation. Our experience shows that choosing the right architecture and applying optimisation strategies not only improves performance but also leads to significant cost savings. The key to success is to have a deep understanding of the workload, select the infrastructure and optimisation strategies that best suit it, and implement continuous monitoring and fine-tuning. Given the rapid evolution of AI technology, this is a dynamic process that requires constant attention and expertise.
Are you interested in optimising your company’s processes using artificial intelligence tools? Do you want to harness the potential of generative AI? Learn more about our AIOps / DevOps services!
