AI infrastructure operational challenges: optimising and scaling

Understanding the growing need for purpose-built AI infrastructure
AIOps, AI DevOps
21 December 2024
Optimising AI infrastructure: strategies for performance and cost efficiency
With the rise of artificial intelligence, more and more organisations face the challenge of operating AI systems. In our research as AI researchers and engineers at LogiNet Systems, we explored the key considerations for designing and managing an effective AI infrastructure.

When is a dedicated AI infrastructure needed?

Before looking at optimisation issues, it is important to clarify when a dedicated AI infrastructure is needed. If an organisation only uses AI services occasionally, cloud-based APIs may be sufficient. However, when data confidentiality, critical response times, or cost optimisation of large-scale AI operations come into play, it becomes worthwhile to consider building and managing an in-house AI infrastructure.

Infrastructure choice and its impact on performance

There are three main approaches to AI infrastructure design: on-premises, cloud, and hybrid.
  • On-premises solutions offer complete control and predictable costs but require significant up-front investment and a dedicated team for maintenance.
  • Cloud solutions provide rapid deployment and flexible scalability, but can be more expensive in the long run.
  • Hybrid solutions combine the benefits of both approaches but result in a more complex architecture.

Scaling strategies and parallelisation

The choice of scaling strategy is critical to performance. Horizontal scaling adds more servers to share the processing load, while vertical scaling focuses on optimising processes within a single server.
The efficiency of parallelisation strategies depends on several factors. For tensor parallelisation, the speed of communication between GPUs is critical: the bandwidth of NVLink, for example, significantly affects the optimal number of GPUs to scale to. Model size and architecture also matter. For larger models, pipeline parallelisation may be more advantageous, as it reduces the need to transfer data between GPUs, especially when a continuous workload ensures that the first pipeline stages do not become a bottleneck for the entire process.
As the two types of parallelisation have different limitations, combining them can sometimes be the best solution, as the sketch below illustrates.
Our measurements show that an appropriate parallelisation strategy can yield up to a 1.8x speedup with tensor parallelisation and a 1.38x speedup with pipeline parallelisation. This is a significant performance gain, especially considering that on larger systems it can mean thousands of additional requests per minute. Of course, these figures are based on a specific but realistic use case.
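To make this concrete, below is a minimal sketch of how the two strategies can be combined in an open-source serving engine such as vLLM. The model name and parallel sizes are illustrative, not a recommendation, and must be tuned to the actual hardware; combining tensor and pipeline parallelism in the offline API may also require a recent vLLM version.

```python
# Minimal sketch: combining tensor and pipeline parallelism in vLLM.
# Model name and parallel sizes are illustrative; on NVLink-connected
# GPUs a larger tensor_parallel_size tends to pay off, while pipeline
# parallelism helps when inter-GPU bandwidth is the bottleneck.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 stages (8 GPUs total)
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

In this configuration, keeping the pipeline continuously fed with requests is what prevents the first stages from idling, which is exactly the bottleneck effect described above.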

Optimising performance and costs - a dynamic approach

Performance optimisation is closely linked to cost efficiency. For every AI infrastructure and model, there is an optimal operating point where the best price/performance ratio can be achieved. Optimising response time is particularly important for interactive applications. Our measurements show that with proper configuration and load balancing, response times as low as 100-200ms can be achieved, which is near real-time from a human interaction perspective. In a real-world use case, a single NVIDIA H100 GPU was able to handle 21 parallel requests at peak load with a First Token Latency (FTL) of 220-340ms.
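First Token Latency can also be verified from the client side. The sketch below assumes an OpenAI-compatible streaming endpoint (the base URL, API key, and model id are placeholders for your own deployment) and times the arrival of the first generated token.

```python
# Minimal sketch: measuring First Token Latency (FTL) against an
# OpenAI-compatible inference endpoint. base_url, api_key and model
# are placeholders for your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",  # placeholder model id
    messages=[{"role": "user", "content": "Summarise NVLink in one sentence."}],
    stream=True,
)
for chunk in stream:
    # The first chunk that carries actual content marks the first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```

Running several of these clients in parallel (for example with a thread pool) gives a rough picture of how FTL degrades as concurrency grows.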
Batch processing is a different story - the critical metric here is total system throughput. In one of our test systems, for example, we were able to increase the initial performance of 111 tokens per second to over 1,700 tokens per second with the right optimisation - a performance improvement of more than 15 times.
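Throughput is straightforward to measure as total generated tokens divided by wall-clock time. A minimal sketch, again assuming vLLM with an illustrative model and synthetic prompts:

```python
# Minimal sketch: measuring batch throughput (tokens/second) with vLLM.
# Model and prompts are illustrative; vLLM applies continuous batching
# internally, so submitting all prompts at once keeps the GPU saturated.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
prompts = [f"Write a product description for item {i}." for i in range(256)]
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s over {len(prompts)} requests")
```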

Efficient use of GPU memory

Effective utilisation of GPU memory is critical to performance. High memory utilisation enables more efficient operation of the KV cache, reduces the need for data transfers between the CPU and GPU, and optimises batch processing. In our experience, a well-tuned KV cache can provide a 2-3x speedup in response times, especially when processing long texts where both prompt and response lengths are significant.
By optimising the cache hit rate, not only can response times be reduced, but system throughput can be significantly increased.
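As an illustration, serving engines such as vLLM expose automatic prefix caching, which lets requests sharing a long common prefix (a system prompt, for instance) reuse KV-cache blocks instead of recomputing them. The model name and prompts below are placeholders.

```python
# Minimal sketch: raising the KV-cache hit rate with prefix caching.
# Requests that share the long system prompt reuse its cached KV blocks,
# so only the short user-specific suffix has to be computed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,    # reuse KV blocks across requests
    gpu_memory_utilization=0.90,   # keep most of the GPU memory available for the cache
)

system_prompt = "You are a support assistant for ACME Corp. <long policy text>"
questions = ["How do I reset my password?", "Where can I download my invoice?"]
prompts = [f"{system_prompt}\nUser: {q}\nAssistant:" for q in questions]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```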

Monitoring and operations - the key to continuous optimisation

A comprehensive monitoring system is essential for successful operations. In addition to hardware metrics, continuous tracking of service metrics is necessary. Monitoring not only helps to identify problems but also facilitates proactive optimisation. The data collected can be used to identify performance bottlenecks and predict the need for capacity expansion. This enables the scaling process to be automated and aligned with predicted demand.
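As a starting point, the underlying hardware metrics can be collected with NVIDIA's NVML bindings (the nvidia-ml-py package) and exported to whatever monitoring stack is in use; the sketch below simply prints them.

```python
# Minimal sketch: reading GPU utilisation and memory usage via NVML.
# In production these values would be exported to a time-series store
# (e.g. Prometheus) rather than printed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {util.gpu}% compute, {mem.used / mem.total:.0%} memory used")
finally:
    pynvml.nvmlShutdown()
```

Tracked over time, exactly these series, together with service-level metrics such as FTL and throughput, are what make capacity forecasts and automated scaling decisions possible.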

Future trends

Significant advances are expected in AI infrastructure solutions. In addition to more efficient models and optimisation techniques, the development of cost-effective hardware solutions and smarter resource management are likely to drive progress. Current trends point to increased adoption of hybrid solutions that combine the benefits of on-premises and cloud systems.
Operating an AI infrastructure is a complex task that requires continuous improvement and optimisation. Our experience shows that choosing the right architecture and applying optimisation strategies not only improves performance but also leads to significant cost savings. The key to success is to have a deep understanding of the workload, select the infrastructure and optimisation strategies that best suit it, and implement continuous monitoring and fine-tuning. Given the rapid evolution of AI technology, this is a dynamic process that requires constant attention and expertise.
Are you interested in optimising your company’s processes using artificial intelligence tools? Do you want to harness the potential of generative AI? Learn more about our AIOps / DevOps services!
