I recently discovered a Linux shell script that was running slowly due to an inefficiently implemented loop. This innocent investigation ended up uncovering misleading information from time and a bad interaction between the Linux thread scheduler and its CPU power management algorithms.
How often does an eight-line shell script expose such rich information?
Conclusions first, explanations later
- Using $(expr) to do math is over a hundred times slower than using bash internals. This is not a new discovery, but it’s worth reiterating.
- Excessive use of sub processes (see point 1) can confuse the Linux kernel’s power management algorithms, leading to an additional three times slowdown. Interesting…
- time gives accurate elapsed time information but its CPU time information should not be trusted, especially if you make excessive use of sub processes (see point 1). Interesting…
- When time lies you can always use the Windows Task Manager to measure Linux CPU consumption. Amusing.
- Looping in Windows batch files can be faster than looping in Linux shell scripts.
- Process creation in Windows is slower than on Linux. Not surprising.
This is part of a series of posts about Linux. Previous posts include:
- Getting symbols for the C and C++ libraries on your machine
- Getting symbols for other Linux distributions and other versions of files
- Getting symbols: Linux versus Windows
- Fedora’s dark server: making symbol finding trivial
The setup
We recently added a new shell script to our build process and I noticed that it was causing a two to three times slowdown on our builds. Not good. One of my coworkers found that the use of $(expr) to loop over arguments was the culprit. A minimal repro is shown here:
# Warning: this code is excessively slow
ExprCount() {
    for (( i = $1 ; i > 0; i=$(expr $i - 1) )); do
        :
    done
    echo Just did $1 iterations using expr math
}
time ExprCount 1000
On my 3.2 GHz six-core twelve-thread Core i7-3930K this script took almost two seconds to run and gave results like this:
Just did 1000 iterations using expr math
real 0m1.883s
user 0m0.096s
sys 0m0.260s
This means that my machine can do only 530 iterations a second using this technique because each $(expr) statement is starting up a child process in order to decrement an integer. Linux process startup has a reputation for being fast, but this is a reminder that ‘fast’ is not the same as ‘free’. So far so good.
I tested whether expr was particularly slow by having a loop invoke $(nop) a thousand times, where nop was a C program that returns immediately. The cost of invoking $(nop) was measured to be about 20% lower than the cost of using $(expr) to do some math, so not really enough faster to be interesting.
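For readers who want to reproduce this, here is a sketch of that measurement using /bin/true as a stand-in for the do-nothing nop binary (an assumption; the original C program is not shown in the post):

```shell
# Sketch: measure process launch overhead by forking one do-nothing
# child per iteration. /bin/true stands in for the 'nop' C program.
TrueCount() {
    for (( i = $1 ; i > 0 ; i-- )); do
        : $(/bin/true)   # each $(...) forks and execs a child process
    done
    echo Just did $1 iterations launching /bin/true
}
time TrueCount 1000
```

As in the $(expr) case, almost all of the elapsed time goes to process creation and teardown rather than to the loop itself.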
$(expr) is commonly recommended for implementing counters in shell scripts, but this is bad advice.
Faster looping
The solution is to use the math operations that are built into bash like this:
# This code performs quite well
BashCount() {
    for (( i = $1 ; i > 0 ; i-- )); do
        :
    done
    echo Just did $1 iterations using bash math
}
time BashCount 150000
On the same test machine this script gave results like this:
Just did 150000 iterations using bash math
real 0m0.716s
user 0m0.684s
sys 0m0.032s
In other words, using the built-in decrement operation ran about 400 times faster than using $(expr)! Not a huge surprise – experts know this already – but still, nice.
Anomaly the first, CPU usage
A less OCD person might have called a 400 times speedup sufficient, but I’d noticed a few peculiarities along the way that I wanted to understand. Noticing and investigating anomalies is, I believe, a critical skill for a developer, as anomalies often reveal mistaken assumptions, bugs, or ‘jankiness’.
The results from time had been confusing throughout the investigation. They consistently showed that only about 15-25% of the elapsed time in the $(expr) loops was CPU time. This steered me away from CPU profilers and towards thinking about what could be causing stalls. Network delays? Disk I/O? Nothing made sense.
I eventually ran iostat 1 in another window and noticed that it said that CPU usage during my test runs was about 8.5%. On my twelve hardware thread (six cores, hyperthreaded) system this corresponds to one CPU core. In other words, total CPU time was equal to elapsed time, the script was CPU bound, and time was lying to me.
I then used a stopwatch to check that time’s elapsed time calculations were correct. Finally I used multiple other tools to confirm that my ExprCount() shell script, running on an otherwise idle machine, was indeed CPU bound:
- Running the test script under perf stat showed 0.945 CPUs utilized
- Running mpstat -P ALL 1 in another terminal showed one core worth of CPU time being consumed
- Context switch graphing (see screenshot in the next section) showed that the task was CPU bound
It also made sense that the loop would be CPU bound. After the first invocation of $(expr) everything is in the disk cache so there should be no I/O latency. All of process creation, reading from the disk cache, process execution, and process teardown is done by the CPU, so the overall task should be entirely CPU bound. And it is.
So, be aware that time underestimates CPU time, and that the underestimation is roughly proportional to how many sub processes you create per second. Presumably the process creation time is attributed to neither the parent nor the child process, even though it really should be counted, as sys time.
This anomaly occurs both with the bash time builtin and with /usr/bin/time. I assume that the problem is ultimately a limitation of what the Linux kernel reports back to time, so the proper fix here is probably to acknowledge the limitation in the man page. I’m discussing this with the man-page maintainers.
Since time is mostly just a wrapper around the wait3() call it makes sense that both time versions give the same results – they are just reporting what the kernel tells them.
To be crystal clear, my complaint/concern is that when time is used to measure my ExprCount() loop it makes the loop seem like it is not CPU bound, when it is. This can lead a performance investigation astray. I believe that this should be fixed, or documented.
Anomaly the second, CPU frequency
I was about ready to declare the case closed when I noticed another anomaly. The ExprCount() time was usually extremely stable, but occasionally it would run three times faster – usually it was 400 times slower than BashCount() but sometimes it was only ~133 times slower. Now that is an anomaly worth investigating.
I quickly discovered that if a busy-loop CPU hog was spinning away in another process then the ExprCount() loop would run three times faster. The loop that used bash math was unaffected. I immediately suspected power management, and I eventually verified this by running the test scripts under perf stat, which proved the most rewarding approach because it clearly shows the CPU frequency. The frequency was far lower when running the ExprCount() loop:
Performance counter stats for ‘counttest.sh ExprTest’:
1780.553888 task-clock # 0.979 CPUs utilized
3,183 context-switches # 0.002 M/sec
1,119 CPU-migrations # 0.001 M/sec
319,758 page-faults # 0.180 M/sec
1,948,359,438 cycles # 1.094 GHz

Performance counter stats for ‘counttest.sh BashTest’:
716.316659 task-clock # 0.997 CPUs utilized
74 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
354 page-faults # 0.000 M/sec
2,524,171,488 cycles # 3.524 GHz
The perf stat results also hint at why the CPU frequency is lower on the ExprCount() tests: perf stat shows over a thousand CPU migrations in the ExprCount() loop, compared to zero in the BashCount() loop.
Once I started to suspect lots of CPU migrations it was interesting to see what other tools might be useful. Running mpstat -P ALL 1 while running the two different count loops showed dramatic results. When the counting was done with $(expr) then the CPU usage was distributed relatively evenly across all of my CPU cores. However when the counting was done with bash internals then CPU usage was localized entirely to one core.
As a final bit of investigative reporting (I enjoy this sort of stuff way too much) I used the Zoom profiler to record context switches while running the $(expr) loop. The results, color coded by process and put on per-CPU racetracks, are shown below:
The bash process is in red and we can see that it spends very little time running. The other colors are various invocations of expr which bash is usually waiting for. The fascinating thing is how often the processes migrate between CPUs. The bash process alternates between CPUs 2 and 5, and each expr process starts on one thread and is then forcibly migrated to another.
We can see that the overall task is CPU bound. The bash process runs for a fraction of a millisecond, then expr launches, then expr moves to another core, then expr terminates and the cycle repeats.
After discussions with some kernel developers it appears that new processes are placed on the least-busy core; however, that doesn’t explain why bash keeps moving, or why each expr instance moves once.
The net result is that none of the CPUs are fully loaded which, apparently, means that the power management algorithms don’t realize that a CPU bound task is running. It’s kind of crazy awesome, and it is an example of when a multi-core system actually runs (a lot) more slowly than a single-core system.
Setting thread affinity is usually a bad idea because it reduces the kernel’s scheduling flexibility and may cause unexpected interactions with other processes. However, in the interests of science I knew I had to run my test script under taskset 1 to see what happens when it is forced to a single CPU. Doing this tripled the average CPU clock speed in ExprCount(), and performance improved by the same factor.
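The experiment can be sketched like this (taskset is part of util-linux; the mask 0x1 pins to CPU 0; the iteration count here is reduced to keep the run short):

```shell
# Compare the expr loop unpinned versus pinned to a single CPU.
# On an affected system the pinned run completes roughly 3x faster
# because the one busy core ramps up to full frequency.
ExprCount() {
    for (( i = $1 ; i > 0; i=$(expr $i - 1) )); do
        :
    done
    echo Just did $1 iterations using expr math
}
export -f ExprCount

time bash -c 'ExprCount 200'                   # unpinned
if command -v taskset >/dev/null; then
    time taskset 0x1 bash -c 'ExprCount 200'   # pinned to CPU 0
else
    echo taskset not available on this system
fi
```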
Please don’t work around this by setting your power management settings to High Performance. If every Linux user did that then the cumulative extra power draw would be significant and would make me regret writing this article. The behavior is an anomaly. It probably should be fixed, but in most situations I’m sure that the power management algorithms work fine. Just don’t go create hundreds of child processes per second.
I tested this on Ubuntu 12.04.1 (32-bit) and Ubuntu 12.04.2 (64-bit). The timings were slightly different, but the conclusions were fundamentally the same.
Measuring Linux performance using Windows
While trying to decide which tools to trust regarding CPU consumption I decided to let Windows be the tie breaker. I have a Linux VM on my laptop so I ran my tests in the VM and then monitored the CPU consumption of the VM with Task Manager. It correctly showed that the VM, and therefore the task running inside it, was CPU bound. Which made me laugh.
Measuring Windows Performance
In the interests of science I made some similar measurements on Windows. First, a simple batch file that counts down using internal commands.
@echo off
echo Starting for loop at %time%
for /L %%i in (400000,-1,0) do (
rem empty statement
)
echo Finishing at %time%
This ran in about a second, so 400,000 iterations per second.
Then I added a call to nop.exe to the loop to measure minimal process creation time (and reduced the iteration count to 1,000). Here’s the new batch file:
@echo off
echo Starting for loop at %time%
for /L %%i in (1000,-1,0) do (
nop.exe
)
echo Finishing at %time%
This showed that a Windows batch file can launch about 133 processes per second. A lot slower than Linux.
Counting the easy way
Maybe it would be better if I did my counting in PHP instead:
https://github.com/Herzult/SimplePHPEasyPlus
Or maybe counting in assembly language (~14 billion subtractions a second?) would be better.
Full test program
Here’s my full test script:
#!/bin/bash

# Warning: this code is excessively slow
ExprCount() {
    for (( i = $1 ; i > 0; i=$(expr $i - 1) )); do
        :
    done
    echo Just did $1 iterations using expr math
}

# This code performs quite well
BashCount() {
    for (( i = $1 ; i > 0 ; i-- )); do
        :
    done
    echo Just did $1 iterations using bash math
}

ExprCountWrapper() {
    time ExprCount $1
}

BashCountWrapper() {
    time BashCount $1
}

case "$1" in
    "ExprTest") ExprCountWrapper 1000;;
    "BashTest") BashCountWrapper 150000;;
    "") perf stat counttest.sh ExprTest && perf stat counttest.sh BashTest;;
esac
How about using a for loop in the Windows bat file as well (instead of a manual goto):
@echo off
echo Starting at %time%
for /L %%i in (10000,-1,0) do (
rem empty statement
)
echo Finishing at %time%
I didn’t use a for loop because I’m terrible at writing batch files. Wow — batch file for loops are extremely fast (400,000 iterations per second on my laptop). I’ll update the post.
As always, thanks for sharing these interesting tidbits of knowledge Bruce (:
why /bin/bash when /bin/sh would work just as well?
Also /bin/bash is always wrong for portable scripts. /usr/bin/env bash is slightly better but /bin/sh is best.
On my system /bin/sh links to /bin/dash which then fails on my script. It says:
/home/bruced/bin/counttest.sh: 5: /home/bruced/bin/counttest.sh: Syntax error: Bad for loop variable
So I guess my script (the ExprCount() one – hah) is less portable than I thought. I’m not sure what the fix is. Ah, the joys of being a Linux newbie.
If I recall, the C-style for loop you are using is a bashism and it will not work on a POSIX-compliant shell like sh (or dash, which is stricter than bash). Ubuntu symlinks sh to dash because dash is faster than bash at executing init scripts. Some people symlink sh to busybox for faster startups.
Here’s a POSIX sh version:
and how it measures up to the other variants on my system
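The commenter’s code and measurements were not preserved here, but a POSIX sh version along those lines might look like this (a sketch; it uses only a while loop and POSIX arithmetic expansion, so it runs under dash as well as bash):

```shell
# POSIX-compatible counter: no C-style for loop, no external expr.
# $((...)) arithmetic expansion is specified by POSIX, unlike $(expr ...).
PosixCount() {
    i=$1
    while [ "$i" -gt 0 ]; do
        i=$((i - 1))
    done
    echo Just did $1 iterations using POSIX math
}
time PosixCount 150000
```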
Awesome. I wish I had more time for digging in my system 😉
You can get a faster loop by using seq instead of using bash’s math routines. On my machine, “for i in $(seq $1 -1 1)” runs 25% faster than your “for (( i = $1 ; i > 0 ; i-- ))”. It should also be more portable and work with dash, which should be even faster than bash.
Using $(seq) would only be faster because my test does a single loop that runs for a very long time. In our actual test script we have lots of loops that execute fewer times — 20 or so. The sub-process overhead of $(seq) would then be enough to be a problem.
That leaves me still unsure as to what options I have for a fast and portable loop. I’ll experiment with dash.
If the bound is shared between multiple loops you could generate the list once and store it in a variable, so you don’t need to fork a new process every time. You could also generate the list in the shell (or hardcode it) if you want to avoid depending on seq at all, since seq isn’t standard.
And the only option for a portable for loop is “for some_variable in some_list”. Everything else is a non-standard extension.
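A sketch of that caching suggestion:

```shell
# Generate the iteration list once with seq and reuse it across loops,
# paying the fork cost of $(seq) only one time.
LIST=$(seq 1 1000)
for i in $LIST; do : ; done   # first loop
for i in $LIST; do : ; done   # second loop reuses the cached list
echo Ran two loops over $(echo $LIST | wc -w) cached values
```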
Good to know. In the original script the range was slightly different each time so it appears that there is no option for a portable and efficient loop. We’ve chosen efficient for now. Eventually we’ll probably recode it in C++.
I think you’re supposed to use the “seq” command for a reasonably fast loop. Something like:
for i in $(seq $1 -1 1); do …
runs almost twice as fast as the BashCount one on my system. Also,
for i in `seq $1 -1 1`; do …
runs even faster, but I have no idea why.
You can make your script more than twice as fast in two ways:
SeqCount ()
{
for I in $(seq $1); do : ; done
echo Just did $1 iterations using seq
}
or
for I in {1..150000}; do : ; done
echo just did 150000 iterations using brace expansions
Turning the latter into a function requires a bit of eval magic, which slows it down a bit again
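The eval magic mentioned might look like this (a sketch; eval is needed because brace expansion happens before parameter expansion, so a plain {1..$1} never expands):

```shell
# Wrap brace-expansion counting in a function. eval forces a second
# parse pass so that {1..$1} is brace-expanded after $1 is substituted.
BraceCount() {
    eval "for i in {1..$1}; do : ; done"
    echo Just did $1 iterations using brace expansion
}
time BraceCount 150000
```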
I’ll try using {1..150000}, and perhaps some tests to check nested loops. Our real script is still slowing down our build by about 15% and if we can avoid switching to C that would be nice.
But hey, what about the actual *interesting* stuff in the post — ‘time’ and CPU power management glitches!
In my testing, all methods (bash math, seq, brace expansion), except the “expr” one, spike my CPU (i7-3520M) to max clock (and 100% util on one core). The disparity you see with your loop using expr is most likely caused by the CFS scheduler needing to balance lots and lots of new processes in a very short time, trying to balance the run queue between cores, and since no single core ever reaches a utilization threshold the CPU is kept scaled down.
The disparity you see when using ‘time’ (I can only reproduce it using the ‘expr’ tests, btw.) is probably down to the bash process waiting for the ‘expr’ command to complete; time spent waiting like this does not count towards either user time or system time (just like I/O wait).
Your results match mine. The scheduler puts the child processes on idle cores (and when it returns to bash puts bash on an idle core!) which means that no core is heavily utilized which means that the CPU frequency is kept down. It’s a bad interaction between the scheduler and the frequency adjustment that happens anytime many sub-processes are used.
Your analysis of the ‘time’ results is not quite correct. You are correct that it only happens when using sub processes and that the time which bash spends waiting is (correctly) not counted as CPU time. However, the entire time that ‘bash’ is waiting the CPU is working very hard to create and run the expr process. It is this process creation time that is not accounted for. Since child-process execution time is counted (I tested that) it feels like a bug to not count child-process *creation* time. Showing a task as 80% idle when the CPU is working to perform that task 100% of the time is dangerously misleading.
The behaviour of time in these cases is expected: time accounts for wall-clock time (correct in these cases), and CPU time spent in userland and in the kernel _for the processes you are monitoring_. Time spent in the scheduler is not accounted for (since you can never be sure it was spent for your processes). If you were to read large amounts of data from something like /dev/urandom, or perform lots of system calls, you would see system time being accounted to your process.
“Expected” by whom? I can find it documented nowhere that process creation time is not counted.
I also don’t think it’s correct to describe the time as being spent in the ‘scheduler’. The time is being spent in process creation. Data from ‘perf’ (graphed by zoom) attributes the time to expr and I’m sure that wait3() (used by time) could attribute it similarly — the ownership of the time seems clear and unambiguous.
I haven’t actually looked at the kernel source to see what it would take to attribute the process creation time to the process being created, but it seems like it should be manageable.
There is a tiny fraction of time (<1%) in the migration process. It would be reasonable and expected for that to not be counted by 'time'.
As for the “expected”, no single version of time I have ever used (be it shell built-in or /usr/bin/time) be it on linux or one of the other *NIX-es has ever reported on time spent by the scheduler and/or process creation time.
If you want to dig deeper I would suggest attaching a SystemTap script to process creation and context switching, to find out _exactly_ where your time is going, and why time(1) is not reporting on it.
Oh, and limiting the expr test to only run on one core gets rid of the cpu scaling problem you experienced. (you still have the time disparity, but again, that’s probably due to the loop waiting for the expr command to exit):
$ taskset 0x0001 perf stat looptest.sh ExprTest
Yep, taskset 1 does the trick. I mention that in the post. It’s obviously a dangerous solution, but it was quite helpful in the diagnosis.
For those not using Linux: cpuset(1) works
Great post!
Not to diminish it, but I think the underlying problem is that using expressions at all in bash is nonidiomatic. I think I’d have written the loop as
for i in $(seq 30 -1 0); do …
but glancing at a page on bash scripting it appears that bash now has built-in support for ranges that don’t rely on an external binary (seq):
for i in {30..0}; do
TempleOS will blow you out of the water outputting counting numbers to the command-line–even C linux code –, but it’s a dumb benchmark. Also, I can change tasks 3,000,000 times in one second per CPU core.
Ring-0-only and identity-mapped is obviously untouchably more optimal.
How about adjusting /sys/devices/system/cpu/cpufreq/ondemand/up_threshold to work around the frequency stepping problem (if you are using the ondemand governor)?
Interesting — I didn’t know about that setting.
Since my machine has twelve hardware threads and the load is evenly distributed I’d have to set that threshold to ~8 (default is 95) to get the CPU to ramp up on demand, and that is too low for normal usage.
The only fix would be to get the scheduler to not spread the workload out so much. If it only used one or two hardware threads then the frequency management code would have a fighting chance.
Reblogged this on Elmar Klausmeier's Weblog and commented:
Very good article on timing CPU bound application.
On a Raspberry PI model B (512MB) the result was :
root@raspberrypi:~# bash benchmark.sh
Just did 1000 iterations using expr math
real 0m11.201s
user 0m0.810s
sys 0m1.110s
On a Raspberry PI model B (256MB) the result was :
root@raspberrypi:~# bash benchmark.sh
Just did 1000 iterations using expr math
real 0m11.113s
user 0m0.820s
sys 0m1.050s
basically the same…
Having let it percolate for a few months, what ended up happening on this topic? One thought that came to mind as I was reading is I wonder how a small embedded interpreter like Lua or Scheme would perform? And in the interest of joining the crowd, here are my own numbers from my 3770k:
wyatt@Yue ~/tmp $ ./sloloop ExprTest
Just did 1000 iterations using expr math
real 0m0.520s
user 0m0.257s
sys 0m0.262s
#That’s at 1.6-3.5GHz. Still slow, but I’m a bit curious why the same code is four times faster than on your system.
wyatt@Yue ~/tmp $ ./sloloop BashTest
Just did 150000 iterations using bash math
real 0m0.770s
user 0m0.733s
sys 0m0.036s
#No surprises here, and a POSIX-compliant while loop with static busybox is only maybe a smidgen faster from what I’m seeing. Using a bash range is not quite three times faster (~280ms for 150k)
Very little came of this, unfortunately. Updates on some of the points:
2) The Linux kernel’s power management is still confused by lots of short-lived processes. This is a tricky problem and I think Windows has the same flaw. This does make some real tasks run ~3x slower.
3) time still gives inaccurate information about CPU time consumed, with no documentation or disclaimers of this error. There was much discussion about this but I think that ultimately the maintainers didn’t believe that the task was CPU bound, despite all the evidence. That’s a pity. I hate tools that lie.
Lua or Scheme would probably be faster — anything that doesn’t involve crazy amounts of process creation will avoid most of the problems.
We reduced our use of child processes significantly so the slowdown is now acceptable, so I moved on.
I would seriously recommend always running any desktop machine with “performance” scheduler. As long as you don’t disable C1-C6 sleep states for the processor, an average desktop processor core will be halted anyway for most of the time. The idea is to keep the processor halted (somewhere in C1-C6 state) as much as possible and only run the CPU for short periods of time. See `sudo powertop` for statistics…
That means that the CPUs will always run at top frequency. If every desktop Linux PC did that then the increased worldwide power draw would probably be quite noticeable.
Disclaimer: I haven’t actually measured the power draw change. Maybe later.
I just wish that Linux could detect that a serially dependent chain of CPU bound processes was running and treat that like any CPU bound task, by ramping up the CPU frequency. But, it is a hard problem and I think Windows has the same limitation, just less frequently hit.
Hi, very interesting. Just one question: have you tried the BFS scheduler? It recently fixed some SMT Nice mechanism problems. The latest version has even more discussion and fixes; see here:
http://ck-hack.blogspot.com/2014/08/smthyperthreading-nice-and-scheduling.html
linux-pf is a good set of binary packages to test. https://pf.natalenko.name/
Schedulers comparison http://grayskysblog.blogspot.com
I haven’t tried any alternate schedulers. I don’t think my power management problems were related to SMT so BFS doesn’t necessarily sound encouraging.
Just tried the BashCount() code under bash and ksh93, and ksh93 is substantially faster (about a factor of four). Timings are on a mid-2010 15″ MacBook Pro (2.66 GHz core i7) under OS X 10.10.3, on mains power. I repeated the test five times with each shell and selected the fastest result in each case.
Bash (3.2.57(1)-release):
Just did 150000 iterations using bash math
real 0m1.669s
user 0m1.558s
sys 0m0.097s
Ksh93 (sh (AT&T Research) 93u+ 2012-08-01):
Just did 150000 iterations using ksh math
real 0m0.38s
user 0m0.37s
sys 0m0.00s
Just repeated the test with 64-bit Ubuntu GNU/Linux 14.04 under VMware Fusion on the above MacBook Pro:
Bash (4.3.11(1)-release):
Just did 150000 iterations using bash math
real 0m0.775s
user 0m0.768s
sys 0m0.004s
Ksh (sh (AT&T Research) 93u+ 2012-08-01):
Just did 150000 iterations using ksh math
real 0m0.32s
user 0m0.31s
sys 0m0.00s
And repeating the test natively under OS X 10.10.3 using a more up-to-date version of bash v4 installed using Homebrew results in a marginal speedup over bash v3:
Bash (4.3.33(1)-release):
Just did 150000 iterations using bash math
real 0m1.521s
user 0m1.498s
sys 0m0.008s
I have no explanation for the poor performance of both bash3 and bash4 relative to ksh93 under OS X; the margin appears to be halved under Linux, where ksh93 is still more than twice as fast as bash v4.