Monday, March 14, 2011

Apache Optimal vCPU Analysis

Last week I posed a question in the VMware forums:

How do you determine the optimal vCPU count for Apache workloads, given a specified hardware and software configuration?

I prefaced the question by stating that we strictly adhere to the best practice of keeping the vCPU count at 1 unless the workload is multithreaded and capable of benefiting from the additional vCPUs.
Given the highly multithreaded nature of Apache we had set the vCPU count to 8, but without any numbers to show this was the optimal value, it was more of an intuitive configuration based on our workloads and knowledge of ESX.

With no feedback in the week following the posting, I took it as an opportunity to design an experiment to measure the effect of varying the vCPU count on Apache throughput, latency, response time, etc.

What follows are some unexpected observations that may or may not be useful to others looking at tuning vCPU for their environments.

Experiment design:
The goal of this experiment is to measure the effects of varying the vCPU setting of a CentOS 5 Apache web server VM. For generating the web server load I chose apachebench (ab).


Setup:
Cloned a production web server for testing (changing only its IP).

Config:

Apache version 2.2.3

running in a CentOS 5 VM

with 8 GB RAM allocated

Threads (via scoreboard) range from 50-100 active (with a max of 300) - see the spot-check sketch after this list

vSphere ESXi 4.1 U1 (build 348481)

Hardware: Dell R610 with 2 x 6-core Intel Xeon X5680 CPUs @ 3.33 GHz
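
As a quick sanity check on those scoreboard numbers, something along these lines can confirm the MPM in use and the live worker counts on the test VM. This is a sketch of my own, assuming mod_status is enabled on web-06 and that httpd and curl are on the PATH; it was not part of the original test harness.

#!/bin/csh
# Sketch: confirm the MPM and live worker counts behind the scoreboard numbers
# (assumes mod_status is enabled on web-06 and httpd/curl are on the PATH)
httpd -V | grep -i mpm
curl -s "http://web-06/server-status?auto" | grep -E "BusyWorkers|IdleWorkers"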


Starting with a vCPU setting of 1, I ran the following script to iterate from 25 to 175 concurrent requests in increments of 25 (the URL was an average page of about 50 KB, and the 10,000 requests took roughly 1 minute to serve in total):

#!/bin/csh
# Sweep apachebench from 25 to 175 concurrent requests in steps of 25

set x = 25
while ( $x < 200 )
    # 10000 requests against the test page at the current concurrency level
    ab -n 10000 -c $x http://web-06/
    @ x += 25
end

The output for each vCPU run was captured, then the VM was shut down to raise the vCPU setting to 2, 4, 6, and 8, capturing results for each of the five vCPU levels (I re-ran the sweeps later to confirm the data for each vCPU config was not varying wildly from run to run). Bringing the results into Excel tables, the following metrics were compared across the vCPU runs.
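
For reference, here is a rough sketch of how the headline numbers can be scraped out of the captured ab output before loading them into Excel; the per-run file naming is my own assumption, while the grep patterns match standard ab output fields.

#!/bin/csh
# Sketch: boil each captured ab run down to one CSV row
# (the ab_<vCPU>vcpu_c<concurrency>.txt naming is an assumption;
#  the grep patterns match standard ab output fields)
echo "file,req_per_sec,total_mean_ms,total_sd_ms,throughput_kbps" > ab_summary.csv
foreach f ( ab_*vcpu_c*.txt )
    set rps  = `grep "Requests per second" $f | awk '{print $4}'`
    set mean = `grep "^Total:" $f | awk '{print $3}'`
    set sd   = `grep "^Total:" $f | awk '{print $4}'`
    set kbps = `grep "Transfer rate" $f | awk '{print $3}'`
    echo "$f,$rps,$mean,$sd,$kbps" >> ab_summary.csv
end

Each row then lines up with the charts below: mean total time, its deviation, requests per second, and transfer rate for each concurrency level and vCPU setting.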

Total Connect Time:
This is the total time (Connect, Processing, Waiting) in milliseconds spent serving a request (we want this to be as low as possible). Observe that the 1 vCPU configuration is markedly higher than the 2, 4, 6, and 8 vCPU configurations.




Deviation: This apachebench metric captures how variable the total serve time is; ideally we want it to be as small as possible so our Apache performance is consistently predictable. Observe how the 2 and 4 vCPU deviation is markedly higher than in the 1 and 8 vCPU configurations (y-axis is deviation from the mean in milliseconds).



Requests per second: This metric measures the average requests per second served (x-axis is requests served per second). Below we can see the 2 and 4 vCPU configurations outperform the rest, but we have to remember this comes at the expense of increased variability. We are beginning to see the tradeoff: the 2 or 4 vCPU config will give most users slightly better response times, but the 8 vCPU config's behavior is more deterministic and gives a more consistently decent response time.




Throughput: This metric measures the KB/sec delivered by the Apache instance at each vCPU setting. It mirrors the requests/sec metric: the 2 and 4 vCPU configs deliver higher throughput on average, but at the expense of higher variability, giving some requests much lower throughput.






Conclusions:
At our average peak thread load (call it 75 requests/second per Apache instance) we see that while the 4 vCPU config delivers 9.4% higher throughput on average than the 8 vCPU config, the 8 vCPU config provides 45.3% better (lower) variability around that average response. All other things being equal, the decision to keep the 8 vCPU configuration for our Apache VM instances is therefore easily rationalized: most users get slightly lower throughput, but their worst response is guaranteed to be much better with the 8 vCPU config than with the 4 vCPU config.
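
For what it is worth, the two headline percentages fall out of the summary CSV sketched earlier along these lines. The choice of baselines (throughput gain relative to 8 vCPU, deviation reduction relative to 4 vCPU) and the mapping of the 75-request load to the c=75 run are my own reading of the paragraph above, so treat this as illustrative rather than the exact spreadsheet formula.

#!/bin/csh
# Sketch: compare the 4 vCPU and 8 vCPU runs at c=75 using ab_summary.csv from above
# (baselines are assumptions: throughput gain measured against 8 vCPU,
#  deviation reduction measured against 4 vCPU)
set t4 = `grep "4vcpu_c75" ab_summary.csv | awk -F, '{print $5}'`
set t8 = `grep "8vcpu_c75" ab_summary.csv | awk -F, '{print $5}'`
set s4 = `grep "4vcpu_c75" ab_summary.csv | awk -F, '{print $4}'`
set s8 = `grep "8vcpu_c75" ab_summary.csv | awk -F, '{print $4}'`
echo "$t4 $t8 $s4 $s8" | awk '{printf "throughput gain: %.1f%%   deviation reduction: %.1f%%\n", ($1-$2)/$2*100, ($3-$4)/$3*100}'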


3 comments:

Sean Foley said...

Very interesting stuff. Thanks for sharing. Could you please tell me what apache mpm is in use (worker, prefork, etc) and if the content was static or not? I am not sure how much either matters, but it would be helpful to know how closely it compares to our workloads. Thanks again, Sean.

vExpert2011 said...

Hi Sean - we use Prefork and track threads with cacti's scoreboard template:
http://forums.cacti.net/about9861.html

thanks

vExpert2011 said...

Please check out Josh's CPU ready and scale OUT analysis here:
http://joshodgers.com/2012/07/22/common-mistake-using-cpu-reservations-to-solve-cpu-ready