Monitoring the CPU steal time on AWS EC2 instances
The “CPU steal time” issue is a well-known phenomenon in the cloud world. We’ve recently been hit hard by this problem, so we’ve decided to start monitoring it.
In this article I’ll explain how to easily configure a custom AWS CloudWatch metric to report it.
Updated on 02/Nov/2020: Fixed reload (restart) logic.
Background and disclaimer
This article describes how to set up CPU steal time monitoring; for an explanation of the concept, there’s a good article by Scout APM.
However, there’s a major concern underlying this concept: which party is responsible for a given instance of (high) CPU steal time.
CPU steal time is, in a way, a bit of a dangerous concept, because it can be caused by either of two, in a sense opposite, parties:
- the provider, overselling the hardware;
- the customer, exceeding the allocated resources.
In this sense, I understand (though I don’t necessarily agree with) why AWS doesn’t offer the metric: while they’d certainly receive valid complaints, they’d likely be flooded by customers mistaking cause #2 for cause #1.
Therefore, this article has a big disclaimer:
Do not assume that CPU steal time is caused by AWS overselling their hardware; the data must be analyzed before making conclusions.
In our case, we observed that, in a stack of identically configured servers, one had a higher CPU load and was less responsive; while this was happening, there was a correlation with a higher steal time. The stack was also not particularly busy, and it was well within the limits of the instance type.
Procedure: preliminary information
Overview
The monitoring configuration is fairly simple:
- a Systemd unit is created and configured, which runs the `nmon` tool in the background;
- a script, periodically run by `cron`, parses the data and sends it to CloudWatch.
That’s all!
Requisites
A Debian derivative is required (including Ubuntu and so on); the procedure can be trivially adjusted for other O/Ss.
The script uses Ruby (which is installed as part of the procedure); however, the logic is (relatively) simple, and can be ported to any language.
All the commands need to be run as root, and in a single session (since variables are reused).
For simplicity, don’t use any character that requires quoting in the variables; in addition to being nonstandard and requiring boilerplate quoting functions, Systemd units have their own escaping specification (see the `systemd-escape` program).
Variable quoting is used nonetheless, in order to allow a clean static analysis of the script (commands).
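For illustration, here’s how Systemd’s escaping differs from ordinary shell quoting (a quick sketch; the sample value is arbitrary):

```sh
# Shell quoting: a space simply requires quotes.
echo "recording file.log"             # => recording file.log

# Systemd escaping: special characters are encoded as \xNN sequences.
systemd-escape "recording file.log"   # => recording\x20file.log
```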
nmon and its output
nmon has a batch (“recording”) mode, with output in CSV format:
```sh
nmon -F $filename -s $time_interval [-c $count]
```
There is no support for infinite execution. The maximum count, based on a source code review, is the int32 max (2^31 - 1 == 2147483647); higher values will cause undefined behavior, due to the `atoi` API used.
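As a back-of-the-envelope check, the maximum count is far beyond any practical concern; with a 60-second interval:

```sh
# Years covered by `-s 60 -c 2147483647` (int32 max):
echo $(( (2**31 - 1) * 60 / 86400 / 365 ))  # => 4085
```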
Before starting the capture, a snapshot is taken, with the stats metadata and some system information:
```
AAA,progname,nmon
AAA,command,nmon -s 60 -F test.log
AAA,version,16g
# ...
CPU001,CPU 1 web0-999,User%,Sys%,Wait%,Idle%,Steal%
CPU002,CPU 2 web0-999,User%,Sys%,Wait%,Idle%,Steal%
CPU_ALL,CPU Total web0-999,User%,Sys%,Wait%,Idle%,Steal%,Busy,CPUs
MEM,Memory MB web0-999,memtotal,hightotal,lowtotal,swaptotal,memfree,highfree,lowfree,swapfree,memshared,cached,active,bigfree,buffers,swapcached,inactive
# ...
BBBP,000,/etc/release
BBBP,001,/etc/release,"DISTRIB_ID=Ubuntu"
# ...
BBBP,132,/proc/cpuinfo
BBBP,133,/proc/cpuinfo,"processor : 0"
BBBP,134,/proc/cpuinfo,"vendor_id : GenuineIntel"
BBBP,135,/proc/cpuinfo,"cpu family : 6"
# ...
```
There are different stat groups; for any snapshot, each group is dumped on a line whose first field is the group name (`CPU_ALL` is the one needed in our case):
```
# ...
CPU_ALL,T0003,42.5,1.8,0.0,50.0,5.8,,2
CPUUTIL_ALL,T0003,84.98,0.00,3.50,99.97,0.00,0.00,0.00,11.50,0.00,0.00
CPUUTIL000,T0003,0.00,0.00,0.00,99.97,0.00,0.00,0.00,0.50,0.00,0.00
CPUUTIL001,T0003,84.98,0.00,3.50,0.00,0.00,0.00,0.00,11.50,0.00,0.00
MEM,T0003,7860.5,-0.0,-0.0,0.0,6787.5,-0.0,-0.0,0.0,-0.0,580.6,688.9,-1.0,63.2,0.0,246.3
VM,T0003,57,0,0,1740,28948,-1,0,0,0,0,0,0,0,131,0,0,0,0,0,0,0,0,0,132,0,0,0,0,0,0,0,0,0,0,0,0,0
# ...
ZZZZ,T0004,11:26:11,20-OCT-2020
```
The names of the corresponding fields are stored in the pre-capture snapshot.
The timestamp, whose group name is `ZZZZ`, is a bit confusing: does a timestamp line belong to the previous snapshot (which would justify the name), or to the following one?
Since `nmon -f -c 1` prints the labels section, with a timestamp line, then a data section, it can be deduced that the timestamp belongs to the section following it.
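This can be verified with a single-snapshot capture (a quick sketch; the filename and interval are arbitrary):

```sh
nmon -F /tmp/nmon-test.csv -s 2 -c 1
sleep 5  # nmon forks to the background; give it time to write and exit

# Shows the `CPU_ALL` labels line, then `ZZZZ,T0001`, then `CPU_ALL,T0001`:
# the timestamp precedes the data it refers to.
grep -E '^(ZZZZ|CPU_ALL),' /tmp/nmon-test.csv
```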
Procedure
Packages/libraries
Install nmon and Ruby:
```sh
apt update && apt install --yes nmon ruby
```
Then, install the CloudWatch SDK library:
```sh
gem install aws-sdk-cloudwatch
```
Variables
```sh
stats_path=/var/lib/nmon/recording.log
stats_interval=60 # seconds
nmon_location=$(which nmon)
processing_script_name=/opt/nmon_processing/send_nmon_stats_to_aws

# In some SDKs, the instance region can't be gathered programmatically.
#
aws_region=eu-west-1
```
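If hardcoding the region is undesirable, it can alternatively be derived from the instance metadata service (a sketch, assuming IMDSv1 is enabled; IMDSv2 requires a session token):

```sh
# The availability zone is the region plus a trailing letter, e.g. `eu-west-1a`.
az=$(curl --silent http://169.254.169.254/latest/meta-data/placement/availability-zone)
aws_region=${az%?}  # strip the trailing letter: eu-west-1a -> eu-west-1
```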
Systemd unit
We configure a Systemd unit, so that the system takes care of running nmon on boot, but also of handling (hypothetical) failures.
```sh
mkdir -p "$(dirname "$stats_path")"

cat > /etc/systemd/system/nmon.service <<UNIT
[Unit]
Description=Nmon system stats recording

[Service]
Type=forking
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nmon
ExecStart=$nmon_location -F $stats_path -s $stats_interval -c 2147483647
Restart=always

[Install]
WantedBy=multi-user.target
UNIT

systemctl enable nmon
systemctl start nmon
```
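Once started, it’s worth sanity checking that the unit is active, and that the recording file is growing:

```sh
systemctl status nmon --no-pager

# The line count should increase with each interval.
wc -l "$stats_path"; sleep "$stats_interval"; wc -l "$stats_path"
```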
The most important thing to consider is that nmon doesn’t support the `HUP` signal, which is a common way to ask a program to reload; for this reason, we don’t specify the `ExecReload` parameter, and we’ll rely on the `reload-or-restart` systemctl command when we need to perform a reload.
Regarding the `Restart` parameter, there are alternative, valid options; for example, `on-failure` is also appropriate. For conceptual consistency with the fact that nmon will necessarily stop at some point, `always` is used, although the specified (2^31 - 1) cycles, even with one run per second, would take almost a lifetime.
Note that, in case of restart, nmon will overwrite the existing log. Given the (supposed) rarity of the event, and the fact that the loss is minimal (the processing script sends the result every minute, so the previous log records can be discarded), we don’t need to handle this occurrence.
A tweak is that we send stdout and stderr to the syslog, and we assign a program name (`SyslogIdentifier`). This is just good practice in terms of system logging; using a program name allows sysadmins to separate a program’s logs, if required.
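For example, with the identifier in place, nmon’s messages can be inspected in isolation:

```sh
# Filter journal entries by the syslog identifier we assigned:
journalctl -t nmon

# Or by the unit itself:
journalctl -u nmon.service
```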
Log rotation
Since this monitoring is assumed to be permanent, we need to manage the logs; the logrotate standard tool will take care of this.
```sh
cat > /etc/logrotate.d/nmon <<LOGROTATE
$stats_path {
  rotate 12
  monthly
  compress
  missingok
  notifempty
  sharedscripts
  postrotate
    systemctl reload-or-restart nmon
  endscript
}
LOGROTATE
```
Due to the mentioned lack of `HUP` signal handling by nmon, we need to fully restart the process, by invoking the `systemctl reload-or-restart` command.
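The rotation logic can be tested without waiting a month (note that `--force` performs an actual rotation, including the postrotate restart):

```sh
# Dry run: prints what would happen, without rotating.
logrotate --debug /etc/logrotate.d/nmon

# Actually rotate, to verify that the postrotate restart works.
logrotate --force /etc/logrotate.d/nmon
```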
Processing script
This is a production script, so it needs to be solid; at the same time, for the sake of simplicity, we ignore potential edge cases (while still documenting them).
Something very important to consider is the metric name. Once the first value is sent to AWS, if the metric is subsequently abandoned, it will (currently) take 15 months for it to disappear (although it won’t cost anything, if unused).
The chosen namespace name follows the AWS standard, although it will be listed under “Custom Namespaces”.
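It’s therefore worth double-checking the names before sending the first value; the existing metrics can be inspected via the AWS CLI (assuming it’s installed and configured):

```sh
aws cloudwatch list-metrics --namespace "EC2/Per-Instance Metrics" --region "$aws_region"
```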
```sh
mkdir -p "$(dirname "$processing_script_name")"

cat > "$processing_script_name" <<'RUBY'
#!/usr/bin/env ruby

require 'time'
require 'aws-sdk-cloudwatch'

class NmonCpuStealProcessor
  METRIC_NAMESPACE = "EC2/Per-Instance Metrics"
  METRIC_NAME = "CPUStealTime"

  def initialize(aws_region, log_filename, stats_interval)
    @aws_region = aws_region
    @log_filename = log_filename
    @stats_interval = stats_interval
  end

  def execute
    stat_timestamp, steal_time = extract_latest_steal_time

    if steal_time && recent_enough?(stat_timestamp)
      metric_data = build_metric_data(METRIC_NAME, steal_time, stat_timestamp, "Percent")
      send_metric(METRIC_NAMESPACE, metric_data)
    end
  end

  private

  # Return [timestamp (Time), steal time (Float)].
  #
  def extract_latest_steal_time
    # nmon's output is small enough that we don't need to load only the log tail. Ruby doesn't
    # natively have such an API, so we'd either need to use a library, or use `tac`, with the
    # associated process output/control complexity.
    #
    log_content = IO.read(@log_filename)

    # Input sample:
    #
    #   ZZZZ,T0025,11:34:39,21-OCT-2020
    #   CPU_ALL,T0025,1.5,1.0,0.0,95.4,2.1,,2
    #
    timestamp_line, cpu_all_line = log_content.scan(/^(?:ZZZZ|CPU_ALL),.*/).last(2)

    # We assume that if there is a timestamp line, there is also a CPU line.
    #
    if timestamp_line
      timestamp_fields = timestamp_line.split(',')[2..3] || raise("Timestamp field not matching!: #{timestamp_line.inspect}")
      steal_time_field = cpu_all_line.split(',')[6] || raise("Steal time field not matching!: #{cpu_all_line.inspect}")

      raw_timestamp = timestamp_fields.join(" ")

      stat_timestamp = Time.strptime(raw_timestamp, "%T %d-%b-%Y")
      steal_time = steal_time_field.to_f

      [stat_timestamp, steal_time]
    end
  rescue Errno::ENOENT
    $stderr.puts "nmon processing: logfile #{@log_filename.inspect} not found"
  end

  def recent_enough?(stat_timestamp)
    # Don't be fussy about edge cases. If exactness is needed, this script needs to be stateful,
    # by storing the last processed timestamp (for example, in a file).
    #
    stat_timestamp >= Time.now - @stats_interval
  end

  def build_metric_data(metric_name, value, timestamp, unit)
    hostname = `hostname --fqdn`.strip

    [
      {
        metric_name: metric_name,
        dimensions: [
          {
            name: "Host",
            value: hostname,
          }
        ],
        timestamp: timestamp,
        value: value,
        unit: unit,
      }
    ]
  end

  def send_metric(metric_namespace, metric_data)
    Aws::CloudWatch::Client.new(region: @aws_region).put_metric_data(
      namespace: metric_namespace,
      metric_data: metric_data,
    )
  end
end

if __FILE__ == $PROGRAM_NAME
  if ARGV.size != 3
    puts "Usage: #{File.basename($PROGRAM_NAME)} <aws_region> <log_filename> <stats_interval_secs>"
    exit 1
  else
    aws_region, log_filename, stats_interval = ARGV
  end

  NmonCpuStealProcessor.new(aws_region, log_filename, stats_interval.to_i).execute
end
RUBY

chmod +x "$processing_script_name"
```
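Before scheduling it, the script can be run once manually, to confirm that it parses the log and sends the metric without errors (this assumes the instance has credentials, e.g. an IAM role, allowing `cloudwatch:PutMetricData`):

```sh
"$processing_script_name" "$aws_region" "$stats_path" "$stats_interval"
echo $?  # => 0 on success
```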
Processing script scheduling
Finally, we create a cron job to launch the script every minute; for simplicity and clarity, we use a standard cron job template.
```sh
cat > "/etc/cron.d/$(basename "$processing_script_name")" <<CRON
SHELL=/bin/bash
PATH=/usr/bin:/usr/sbin:/sbin:/bin:/usr/local/bin

# .---------------------------- minute: 0 - 59 | every n minutes: */n | each minute from x to y: x-y
# | .---------------------- hour: 0 - 23
# | | .---------------- day of month: 1 - 31
# | | | .---------- month: 1 - 12 | jan,feb, ...
# | | | | .---- day of week: 0 - 7 (Sunday: 0 or 7) | sun, mon, tue, wed, thu, fri, sat
# | | | | |    (once every x: */x, from x to y: x-y, a list: x, y, z)
# v v v v v    command to be executed
* * * * * root $processing_script_name $aws_region $stats_path $stats_interval
CRON
```
That’s all! cron will automatically pick up the change, and the statistics will be available in CloudWatch in a few minutes.
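The datapoints can also be pulled from the command line, for a quick check (a sketch, assuming the AWS CLI and GNU date):

```sh
aws cloudwatch get-metric-statistics \
  --region "$aws_region" \
  --namespace "EC2/Per-Instance Metrics" \
  --metric-name CPUStealTime \
  --dimensions Name=Host,Value="$(hostname --fqdn)" \
  --statistics Average \
  --period 60 \
  --start-time "$(date -u -d '15 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)"
```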
Conclusion
In this article, we’ve seen how to build a fully custom AWS metric, in an end-to-end fashion, from a systems perspective.
The code provided can be executed without any changes, and can also be used as a template for implementing other future metrics.
Happy monitoring!