The impact of concurrent requests on LLM inference performance: a case study with Llama-3.1-8B

Unlocking efficiency in LLM deployment: balancing latency and throughput under load
ai devops
17 January 2025
This study investigates the performance characteristics of Large Language Model (LLM) inference under varying levels of concurrent requests. Using Llama-3.1-8B-Instruct on an H100 GPU, we analyse the relationship between request concurrency and key performance metrics. Our results reveal distinct optimal operating ranges: interactive scenarios maintain sub-500ms first-token latency at up to 21 concurrent requests, while batch processing achieves a maximum throughput of 1,724 tokens/second at 45 concurrent requests, a 15.5-fold improvement over single-request performance. This research provides practical insights for deploying LLM services at scale.
Large Language Models have become an integral part of modern AI applications, so understanding their performance under different load conditions is critical for practical deployments. While single-request performance is well documented, the behaviour of these models under concurrent load presents unique challenges and trade-offs that warrant detailed investigation.

Methodology

Experimental setup

Hardware configuration
  • NVIDIA H100 SXM5 80GB GPU
  • CUDA version: 12.4
  • Driver version: 550.54.15
  • Intel(R) Xeon(R) Platinum 8462Y+ (30 cores)
  • 120GB System RAM
Software stack
  • Model: Llama-3.1-8B-Instruct
  • Inference framework: sglang
  • Model configuration: FP16 precision, context length 32000
  • KV-cache enabled
Network characteristics
  • Baseline network latency: 51ms
  • Stream mode enabled for response generation

Test configuration

Test parameters
  • Concurrent requests: 1 to 81
  • Fixed system prompt: 3000 tokens
  • Variable user input: 100-500 tokens
  • Test sequence repeated 100 times per concurrency level
Metrics measured
  • First Token Latency (FTL)
  • Total response time
  • Tokens per second
  • Resource utilisation
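To make the setup concrete, the sketch below shows one way such a concurrency sweep can be driven against an OpenAI-compatible endpoint like the one sglang exposes. The endpoint URL, model identifier, prompt construction and chunk-based token counting are illustrative assumptions, not our exact test harness.

```python
"""Minimal sketch of a concurrency sweep against an OpenAI-compatible endpoint.

Assumptions (not taken from the study itself): the server URL, the model name,
and the way prompts of roughly the right length are constructed.
"""
import asyncio
import json
import statistics
import time

import httpx

ENDPOINT = "http://localhost:30000/v1/chat/completions"  # assumed sglang URL
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

SYSTEM_PROMPT = "You are a helpful assistant. " * 400   # rough stand-in for the fixed ~3,000-token prompt
USER_PROMPT = "Summarise the trade-offs of concurrent LLM inference. " * 20  # variable user input


async def one_request(client: httpx.AsyncClient):
    """Send one streaming request; return (first-token latency, total time, token count)."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
        "stream": True,
        "max_tokens": 256,
    }
    start = time.perf_counter()
    first_token = None
    tokens = 0
    async with client.stream("POST", ENDPOINT, json=payload, timeout=120) as resp:
        async for line in resp.aiter_lines():
            if not line.startswith("data: ") or line.endswith("[DONE]"):
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                tokens += 1                      # approximate: one streamed chunk ~ one token
                if first_token is None:
                    first_token = time.perf_counter() - start
    return first_token, time.perf_counter() - start, tokens


async def sweep(concurrency_levels=range(1, 82, 5), repeats=100):
    """Fire `level` requests at once, repeat, and report median FTL and system throughput."""
    async with httpx.AsyncClient() as client:
        for level in concurrency_levels:
            ftls, per_request_rates = [], []
            for _ in range(repeats):
                results = await asyncio.gather(*[one_request(client) for _ in range(level)])
                ftls += [r[0] for r in results if r[0] is not None]
                per_request_rates += [r[2] / r[1] for r in results]
            print(f"{level:3d} concurrent | median FTL {statistics.median(ftls) * 1000:6.0f} ms "
                  f"| system throughput {level * statistics.median(per_request_rates):7.1f} tok/s")


if __name__ == "__main__":
    asyncio.run(sweep())
```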

Results and analysis

First Token Latency analysis

Our measurements reveal distinct phases in First Token Latency (FTL) behaviour as concurrency increases:
[Figure: Optimizing LLM latency and throughput at scale]
Low concurrency phase (1-10 requests)
  • Base FTL (single request): 113ms median (62ms excluding network latency)
  • Highly efficient initial performance for single requests
  • Graceful latency increase up to 10 concurrent requests, with median FTL reaching around 260ms, representing only a 2.3x increase (3.3x excluding network latency) despite 10x more concurrent requests
  • System shows excellent responsiveness in this range
Stable phase (11-21 requests)
  • Median FTL consistently remains between 220-340ms
  • P95 latencies remain below 500ms
  • Minimal variance between minimum and maximum latencies
  • Ideal range for interactive applications that require consistent response times
Transition phase (22-35 requests)
  • Sharp increase in latency variability
  • Median FTL rises to 400-600ms
  • P95 latencies start to exceed 1 second
  • First signs of system stress appear
High concurrency phase (36+ requests)
  • Exponential increase in both median and tail latency
  • At 70 concurrent requests:
    • Median FTL: 3.37 seconds
    • P95 FTL: 5.85 seconds
  • High variance between min and max latencies
  • Clear pattern of performance degradation
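For reference, the phase summaries above are derived from per-request FTL samples collected at each concurrency level. A minimal sketch of that aggregation, with illustrative sample values rather than our raw data, could look like this:

```python
# Sketch: summarising collected FTL samples for one concurrency level (values are illustrative).
import statistics

def summarise(ftl_ms: list) -> dict:
    qs = statistics.quantiles(ftl_ms, n=100)   # 99 percentile cut points
    return {
        "median_ms": statistics.median(ftl_ms),
        "p95_ms": qs[94],                      # 95th percentile
        "min_ms": min(ftl_ms),
        "max_ms": max(ftl_ms),
    }

# e.g. at 70 concurrent requests the study reports roughly 3,370 ms median and 5,850 ms P95
print(summarise([3370, 2900, 4100, 5850, 3100, 3600, 2800, 5200]))
```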

Throughput analysis

Our throughput analysis reveals complex relationships between concurrency and token generation speed:
[Figure: From latency to throughput: unlocking LLM efficiency]
Performance per request
  • Single-request baseline: 111.29 tokens/second
  • Per-request throughput declines gradually as concurrency increases:
    • 20 requests: ~67 tokens/second per request
    • 40 requests: ~42 tokens/second per request
    • 70 requests: ~22 tokens/second per request
System-wide throughput
  • Overall system throughput shows different characteristics:
    • Linear growth up to ~15 requests
    • Sub-linear increase from 15-45 requests
    • Maximum throughput of 1,724 tokens/second achieved at 45 concurrent requests
    • Plateau and slight decline beyond 45 requests
  • Efficiency sweet spot identified between 35-45 concurrent requests
  • Peak total system throughput approximately 15.5 times the throughput of a single request
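As a quick sanity check on these figures, system-wide throughput is simply the per-request rate multiplied by the number of concurrent requests, and the 15.5x figure follows directly from the peak and single-request numbers reported above:

```python
# Sketch: relating per-request and system-wide throughput using the figures reported above.
single_request = 111.29          # tokens/second at concurrency 1
per_request_at_40 = 42           # approx. tokens/second per request at 40 concurrent requests
peak_system = 1724               # tokens/second at 45 concurrent requests

print(40 * per_request_at_40)            # ~1,680 tok/s: concurrency x per-request rate, near the peak
print(peak_system / single_request)      # ~15.5x: peak system throughput vs a single request
```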

Resource usage and system behaviour

Memory usage patterns
  • Consistently high GPU memory utilisation (98.1%)
  • KV cache hit rate: 26.29%
  • Memory efficiency remains stable across concurrency levels
Response time distribution
  • Overall response time increases more slowly than FTL
  • At optimal concurrency (≤21 requests):
    • Mean response time: 1.1-1.5 seconds
    • P95 response time: 1.2-1.7 seconds
  • At high concurrency (70+ requests):
    • Median response time: 5.8-7.1 seconds
    • P95 response time: 7.1-8.7 seconds
System stability
  • No failed requests observed throughout testing
  • Consistent chunk interval times across concurrency levels
  • Test results show that the system remained stable under heavy load, albeit with degraded performance

Performance trade-offs

Our analysis shows clear trade-offs between different performance metrics:
1. Latency vs. concurrency
  • Inverse relationship between response speed and number of concurrent requests
  • Non-linear degradation pattern
  • Critical points identified at 22 and 45 concurrent requests
2. Throughput vs. latency
  • While performance per request decreases with concurrency, overall system throughput increases up to 45 requests
  • System achieves maximum throughput of 1,724 tokens/second at 45 concurrent requests
  • Sweet spot highly dependent on use case requirements
  • Clear distinction between interactive and batch processing sweet spots
3. Resource utilisation vs. performance
  • High GPU memory utilisation at all concurrency levels
  • KV cache hit rate suggests room for memory management optimisation
  • System remains stable at high concurrency levels, indicating robust resource management

Discussion

Our experimental results reveal complex performance characteristics of LLM inference under varying concurrent loads, with significant implications for both system design and deployment strategies. The observed 15.5-fold throughput improvement from single-request to optimal batch performance suggests significant opportunities for system optimisation, but also highlights the inherent trade-offs in LLM serving systems.

Performance characteristics and implications for system design

The performance patterns we observed suggest that LLM serving systems operate in different efficiency regimes. The initial sharp improvement in system throughput up to 15 concurrent requests, followed by sub-linear growth up to 45 requests, suggests a complex interaction between the GPU's parallel processing capabilities and the memory subsystem. This behaviour is likely due to the GPU's ability to hide latency through parallel execution up to a certain point, after which memory bandwidth and cache efficiency become limiting factors.
The relatively low KV cache hit rate of 26.29% is particularly interesting, as it suggests significant room for optimisation in memory usage patterns. This could be addressed through various mechanisms, such as improved prompt templating or request similarity analysis. However, the high GPU memory utilisation (98.1%) indicates that we're operating close to the limits of the hardware, suggesting that any improvements would have to come from better memory management rather than additional resource allocation.
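One practical way to raise the cache hit rate is to keep the shared part of every prompt byte-identical, so that prefix-caching mechanisms such as sglang's radix cache can reuse the corresponding KV entries across requests. The sketch below illustrates the idea; the prompt content and helper function are hypothetical:

```python
# Sketch: standardising prompts so that requests share an identical prefix.
# Prefix caches can only reuse KV entries for byte-identical prompt prefixes,
# so per-request details should come after the shared part.
SHARED_SYSTEM_PROMPT = (
    "You are a customer-support assistant for ExampleCorp. "   # hypothetical content
    "Answer concisely and cite the relevant policy section."
)

def build_messages(user_text: str, user_metadata: str) -> list:
    return [
        {"role": "system", "content": SHARED_SYSTEM_PROMPT},                 # identical across requests, cacheable
        {"role": "user", "content": f"{user_metadata}\n\n{user_text}"},      # variable part last
    ]
```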

Architectural implications

These results suggest that single-queue architectures may be suboptimal for LLM serving systems. A multi-tier architecture, with separate serving strategies for different types of workloads, could serve the different use cases better: interactive applications requiring low latency could be handled by a low-concurrency tier, while batch processing jobs could be routed to a high-concurrency tier optimised for throughput.
The sharp performance degradation beyond certain concurrency thresholds also underlines the importance of sophisticated queue management. Rather than allowing unlimited concurrent requests, systems should implement adaptive admission control based on current load and performance metrics, keeping the system operating in its most efficient range.
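A minimal sketch of this idea, using the thresholds observed in this study as tier limits, might look as follows. The tier-assignment rule and the queueing policy are illustrative assumptions, not a production design:

```python
# Sketch: two-tier admission control with per-tier concurrency limits.
import asyncio

INTERACTIVE_LIMIT = 21   # keeps P95 first-token latency under ~500 ms in our measurements
BATCH_LIMIT = 45         # system throughput peaks around this concurrency

interactive_slots = asyncio.Semaphore(INTERACTIVE_LIMIT)
batch_slots = asyncio.Semaphore(BATCH_LIMIT)

async def handle(request: dict, send_to_backend):
    """Admit a request into the tier matching its latency requirement.

    `request` is assumed to carry an "interactive" flag; excess requests simply
    wait for a free slot instead of overloading the backend.
    """
    slots = interactive_slots if request.get("interactive") else batch_slots
    async with slots:
        return await send_to_backend(request)
```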

Future optimisation opportunities

Several promising directions for optimisation emerge from our analysis. The relatively low cache hit rate suggests that investigating prompt standardisation and cache warming strategies could yield significant improvements. In addition, the distinct performance regimes we observed suggest that dynamic scaling and load balancing strategies could be highly effective if properly tuned to these characteristics.
The relationship between throughput and latency that we observed also suggests that custom scheduling algorithms designed specifically for LLM workloads could significantly improve system performance. These schedulers could take into account factors such as prompt length, expected response length and cache affinity to make more intelligent scheduling decisions.
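As an illustration, such a scheduler might rank waiting requests with a simple priority function over exactly those factors. The scoring rule and weights below are assumptions, included only to show the shape of the idea:

```python
# Sketch: a priority function for an LLM-aware scheduler. The weights and the
# cache-affinity signal are assumptions; the inputs follow the factors discussed above.
def priority(prompt_tokens: int, expected_output_tokens: int, cached_prefix_tokens: int) -> float:
    uncached_prefill = prompt_tokens - cached_prefix_tokens   # prefill work the GPU must still do
    decode_cost = expected_output_tokens                      # sequential decode steps
    # Favour requests that are cheap to prefill (good cache affinity) and short to decode.
    return 1.0 / (1.0 + 0.5 * uncached_prefill + decode_cost)

# Higher score = scheduled earlier; requests reusing a long cached prefix jump the queue.
```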

Limitations and future work

While our study provides valuable insights, it also has several limitations that warrant further investigation. Our tests used a specific model size and hardware configuration; understanding how these patterns scale across different model sizes and hardware platforms would be valuable. In addition, while our synthetic workload was designed to be representative, it may not capture all the complexities of real-world usage patterns.
Future work should investigate how these performance characteristics change with different model architectures, different prompt lengths, and more diverse workload patterns. In addition, exploring the impact of different optimisation techniques, such as continuous batching or dynamic tensor parallelism, could provide valuable insights for system optimisation.
The performance patterns we observe also raise interesting questions about the fundamental limits of LLM serving systems, and how close current architectures are to these theoretical limits. Understanding these limits could help guide future hardware and software co-design efforts in this area.

Conclusions and recommendations

Key findings and recommendations

Performance measurement and monitoring
  • Each deployment should perform specific performance measurements under expected load patterns
  • Regular monitoring of key metrics (FTL, throughput, resource utilisation) is essential to maintain optimal performance
  • Consider implementing real-time performance monitoring to detect degradation
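A degradation check along these lines can be as simple as comparing live metrics against the targets measured in this study; how the metrics are collected and where alerts are sent is deployment-specific:

```python
# Sketch: a periodic health check against the latency and throughput targets above.
def check_health(p95_ftl_ms: float, system_tok_per_s: float) -> list:
    alerts = []
    if p95_ftl_ms > 500:
        alerts.append("P95 first-token latency above interactive target (500 ms)")
    if system_tok_per_s < 0.8 * 1724:
        alerts.append("System throughput below 80% of measured peak (1,724 tok/s)")
    return alerts
```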
Use case-based optimisation
1. Latency critical applications:
  • Determine maximum concurrency based on worst-case latency requirements
  • Monitor tail latencies (p95, p99) to ensure consistent user experience
  • Consider using median latency targets for less stringent applications
2. Batch processing scenarios:
  • Optimise for maximum system throughput rather than individual request latency
  • Perform throughput measurements to find optimal concurrency levels
  • Balance resource utilisation with processing efficiency
Dynamic scaling strategies
  • Implement dynamic concurrency control based on real-time performance metrics
  • Consider using adaptive scaling based on:
    • Current load patterns
    • Latency requirements
    • Resource utilisation
    • Time-of-day patterns
  • Set up alerts for performance anomalies
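One possible form of such a controller is an AIMD-style rule that grows the admission limit while latency stays within target and backs off when it does not. The constants below are assumptions based on the thresholds observed in this study:

```python
# Sketch: adaptive concurrency control driven by measured P95 first-token latency.
def adjust_limit(current_limit: int, p95_ftl_ms: float,
                 target_ms: float = 500, floor: int = 1, ceiling: int = 45) -> int:
    if p95_ftl_ms > target_ms:
        return max(floor, current_limit // 2)    # multiplicative decrease under stress
    return min(ceiling, current_limit + 1)       # additive increase while latency is healthy
```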

Additional considerations

Although not directly derived from our measurements, the following practices can help organisations better manage their LLM deployments:
  • Regularly reassess performance characteristics as workload patterns evolve
  • A/B test different concurrency strategies
  • Document performance characteristics for different types of requests
  • Plan capacity based on both average and peak load scenarios
Future research should examine these patterns across different model sizes, hardware configurations and workload characteristics to develop more generalised optimisation strategies.
