Power debugging

How to measure and optimize power consumption

Power debugging is a method that measures how much power your application consumes while it is running and correlates the measurements with the program's execution flow. Because it is ultimately software that controls the hardware's power consumption, there is much to gain for you as a software developer in being aware of how your code affects it. By coupling the source code to the power consumption, you can test and tune your application to optimize its power usage.

Introduction and technology

The technology for power debugging is based on the ability to sample the power consumption and correlate each power sample with the program's instruction sequence and hence with the source code. The key to accuracy is a good correlation between the instruction trace and the power samples, and this is only achieved when there is a close integration between the current measurement and the trace probe. In many cases, the target can be powered through the debug probe. In addition to handling the debug communication between the target and the host debugger, the probe can then also sample the current it is feeding to the target: the voltage drop across a small resistor in series with the supply power to the device is measured by a differential amplifier and then sampled by an A/D converter. With this setup the integration is very close and the accuracy correspondingly good.
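As a worked example, with a hypothetical 0.5 Ω series resistor, a measured voltage drop of 10 mV corresponds to a target current of I = 0.010 V / 0.5 Ω = 20 mA.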

Another difficulty is achieving high-precision sampling. Ideally, you would want to sample the power consumption of individual instructions, but the physical limitations of the target system, and even of the device itself, make this more or less impossible. A microcontroller running at 100 MHz would need to be sampled at the same frequency, but the system capacitance makes it meaningless to sample any faster than at approximately 40 kHz. When sampling the power consumed by the entire system, including off-chip peripherals and circuitry with all their combined capacitances, the meaningful sample rate will be even lower than that.

In practice this is not a problem though. As it is almost always more interesting to correlate the power consumption with various events in the program execution than with individual instructions, the resolution needed is much lower than one sample per instruction. Even though it would be interesting to see how much power the microcontroller consumes when executing individual instructions, your goal is most likely to make the battery last for as long as possible, and the battery powers the system, not just the microcontroller!

Power samples in IAR Embedded Workbench

In IAR Embedded Workbench, the power samples are visualized together with the interrupt activity on a shared timeline, giving you a quick understanding of how events and power consumption correlate, even in complex execution flows. The Timeline window is correlated with the source code, so double-clicking in the Timeline window highlights the corresponding line of source code.

In practice, and particularly in a task-oriented system, it is usually more interesting to see how a particular function affects power consumption than to follow, statement by statement, how the power consumption changes. The function profiler helps you find the functions where most time is spent during execution for a given stimulus, exposing the regions of the application where optimizations for power consumption are likely to pay off. With power profiling, the function profiling is combined with the power sampling to measure the power consumption per function and display it in the Function Profiler window. This can reveal functions that consume disproportionately much power. For the number crunchers, IAR Embedded Workbench can also display a textual log of all power samples, including their timestamps, which can be useful when studying what has been going on in more detail.

Turn off your peripherals!

In many systems, peripherals consume the major part of the power and the CPU only consumes a minor part. Exercising strict control of how peripherals are being used is a fundamental approach to minimizing power consumption.

The basic problem is that peripheral units in an embedded system can consume a lot of power even when they are not actively in use. If you are designing for low power, it is vital that you disable them, and not just leave them unattended, when they are not in use. A peripheral can be left with its power supply on for different reasons: it can be a careful and correct design decision, or it can be the result of an inadequate design or simply a mistake. In the latter cases, the system will consume more power than expected. This is easily revealed by the power graph in the Timeline window, which visualizes power consumption over time. Double-clicking in the Timeline window where the power consumption is unexpectedly high takes you to the corresponding source and disassembly code, allowing you to find the cause of the problem and modify the application. In many cases, it is enough to disable the peripheral when it is inactive, for example by turning off its clock, which in most cases shuts down its power consumption completely.
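As a minimal sketch of what such clock gating can look like at the register level (the register address and bit position here are hypothetical; the real ones are found in the device's reference manual):

#include <stdint.h>

// Hypothetical clock-enable register and bit for an SPI peripheral
#define PERIPH_CLK_EN  (*(volatile uint32_t *)0x40021018u)
#define SPI1_CLK_BIT   (1u << 12)

void spi1_clock_off(void)
{
    PERIPH_CLK_EN &= ~SPI1_CLK_BIT;  // gate the clock: dynamic power drops to near zero
}

void spi1_clock_on(void)
{
    PERIPH_CLK_EN |= SPI1_CLK_BIT;   // restore the clock before using the peripheral again
}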

There are some cases where clock gating will not be enough, though. Analog peripherals like converters or comparators can consume a substantial amount of power even when the clock is turned off. Again, the Timeline window will reveal that turning off the clock was not enough, and that you need to turn the peripheral off completely.

So far it is pretty straightforward, and good coding practices will help you avoid most power leaks of this kind, but the problems can be more complex than this. Let us take a look at an example of an event-driven system where it becomes more difficult to avoid unnecessary peripheral power consumption.

Consider a system where one task uses an analog comparator while executing but is suspended by a higher-priority task. Ideally, the comparator should be turned off when the task is suspended and then turned on again once the task is resumed. This would minimize the power being consumed during the execution of the high priority task.

Figure 1 shows a schematic diagram of the power consumption of the system, where at t0 it is in an inactive mode and the current is i0. At t1 the system is activated and the comparator is started, whereby the current rises to i2. At t2 the higher-priority task suspends the first task, but the comparator is not turned off. Instead, more peripheral devices are activated by the new task, resulting in an increase in current to i3 between t2 and t3, when control is handed back to the lower-priority task to finish its execution.


Figure 1 Schematic diagram of the power consumption in the example

The functionality of the system may be excellent, and it may be well optimized in terms of execution speed and code size, but in the power domain there is still room for optimization. The blue area represents the power that could have been saved if the comparator had been turned off between t2 and t3. In a case like this, the added code required to turn a peripheral off and on again is not always worth it, and you might have timing and availability constraints that have to be carefully considered.

Got some time to spare? Take a nap!

Most embedded systems spend a lot of time waiting for something to happen. It could, for example, be waiting for user input, for a peripheral unit, or just for some time to pass. While waiting, the microcontroller consumes power as long as the CPU is executing instructions, even if it is accomplishing nothing. Fortunately, modern low power microcontrollers provide a plethora of low power modes in which the application can take refuge while waiting, saving energy. The different power modes provide different combinations of power consumption, functionality, and wake-up time. As a general rule of thumb, when the application is waiting for the device status to change, it should use a low power mode and make sure it will be woken by either a timer or an interrupt. One common mistake is to ignore this rule and use a poll loop instead.

In the example below, we are waiting for a USB device to finish configuring and become ready for communication. The code construct executes without interruption until the status value changes to the expected state.
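A minimal sketch of such a poll loop (the status function and state constant are hypothetical names):

while (USB_GetState() != USB_STATE_CONFIGURED)  // busy-wait at full CPU speed
{
    ;  // nothing useful happens here, but power is consumed all the same
}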


For a system with an unlimited power budget, this is a straightforward construct, but when there are power constraints it wastes power unnecessarily, because the CPU and all on-chip systems stay active. A better idea is to enter a low power mode that keeps the USB powered but at least turns off the CPU while waiting, and to set up an interrupt to trigger once the USB device is done. Another way to do it, in particular when timing is not of vital importance, is to set a hardware timer to periodically trigger an interrupt. The timer approach will normally consume a little more power, depending on how often you need to wake up and check the device status.
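A sketch of the interrupt-driven alternative, assuming an ARM Cortex-M target where the CMSIS intrinsic __WFI() puts the CPU to sleep until an interrupt occurs (the USB function names are hypothetical):

USB_EnableConfiguredInterrupt();                // hypothetical: interrupt on configuration done
while (USB_GetState() != USB_STATE_CONFIGURED)
{
    __WFI();                                    // CPU sleeps; the USB interrupt wakes it
}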

In a multitasking system, the situation is a little different, as the scheduler may be executing other tasks while the USB device finishes. But the principle is the same: the application should not spend clock cycles in active mode while waiting. If the USB task uses an interrupt instead, the scheduler can put the device to sleep, the interrupt will put the CPU back in active mode, and the ISR can complete before the device goes back to sleep.

Another related situation where the application is waiting while doing nothing is when a time delay is implemented as a for or a while loop, like in the following example:

// SW delay; volatile keeps the compiler from optimizing the loop away
volatile unsigned int i = 10000;
do
  i--;
while (i != 0);

This piece of code keeps the CPU very busy executing instructions that do nothing except make time go by. Time delays are much better implemented using a hardware timer, unless they are meant to last only a few clock cycles.
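As a sketch, the same delay implemented with a hardware timer and sleep mode (the timer API and ISR names are hypothetical; __WFI() is the CMSIS sleep intrinsic on ARM Cortex-M, declared in the device header):

#include <stdint.h>

volatile int timer_expired = 0;

void TIMER0_IRQHandler(void)          // hypothetical timer ISR
{
    timer_expired = 1;
}

void delay_ticks(uint32_t ticks)
{
    timer_expired = 0;
    Timer_StartOneShot(ticks);        // hypothetical one-shot timer setup
    while (!timer_expired)
        __WFI();                      // sleep until the timer interrupt fires
}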

In both these situations, and many others not exemplified here, the code could be changed to minimize power consumption. To do this, the code must first be correlated with the power consumption, and the debugger has to be able to display this correlation.


Power profiling in IAR Embedded Workbench

The power debugging features available in IAR Embedded Workbench do this and can help you identify where in the application power is unnecessarily consumed. For example, the Profiler can be used to make a power profile of the application, allowing you to find the spots where any power optimization efforts should be focused. In event-driven systems, the Timeline window can display both events (interrupts) and power consumption on a common timeline, allowing you to visually monitor how various events affect the power consumption.

Haste makes waste

One of the many ways for you as a software engineer to reduce power consumption is to minimize the CPU frequency. If you have some hardware performance to spare, this can have a drastic effect, since power consumption in CMOS logic is roughly proportional to the clock frequency. The reason is that a digital CMOS circuit has very little static power consumption; it only consumes significant power when it switches states.

Before we go into what you can do to reduce the power consumption, we will take a quick look at what actually draws power in a CMOS circuit. For each clock cycle, the load capacitance (CL) is charged and then discharged. The total charge flowing from VDD to ground is then CL·VDD, and the energy drawn from the supply is CL·VDD². As this happens every clock cycle, the power consumption can be written as:

P = α · CL · VDD² · ƒ

where ƒ is the CPU frequency and α is the activity factor, representing the fraction of gates that switch each clock cycle. Adjusting the CPU frequency to get fewer switches per second is the easiest way for you as a software engineer to reduce power consumption, as the other factors are determined by the circuit design.

But keeping the CPU frequency low is not just a question of setting it as low as possible while maintaining acceptable performance; there is more to it than that. First of all, you need to make a trade-off between performance and power consumption: reducing the clock speed reduces performance by the same factor, but on the other hand it also reduces the power consumption.

There is probably a minimum level of performance required for the system to function properly, but because software can be written and optimized in many different ways, there are also opportunities to trade hardware performance against software performance. This means optimizing the application for speed so that fewer instructions are needed to get the same job done. You then free up hardware performance that can be used to lower the CPU frequency.

The debugger’s function profiler is a good place to start when tuning your application to execute faster. It shows you where the application spends most of its time and where performance is most critical. Once you have established which functions are critical for your application, you can get started. For example, you should make sure that any time-critical functions execute from RAM rather than flash memory, as this speeds up execution. You can also instruct the compiler to optimize critical sections of code for speed, even if you use size optimizations elsewhere to keep overall code size down. Then, of course, you need to look into the code you have written to see what improvements can be made; how to write compiler-friendly, optimized code is, however, a subject for other articles.
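A sketch of how these two hints can be expressed in IAR Embedded Workbench, using the IAR language extensions __ramfunc and #pragma optimize (the function itself is a made-up example, and the exact syntax and required linker configuration may vary between compiler versions, so check the compiler documentation):

#include <stddef.h>
#include <stdint.h>

#pragma optimize=speed                                // optimize this function for speed
__ramfunc void filter_block(int16_t *buf, size_t n)   // execute from RAM, not flash
{
    for (size_t i = 0; i < n; i++)
        buf[i] /= 2;                                  // placeholder for the real processing
}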

While trading performance for power works well for applications in active mode, it is rare that embedded applications need to be constantly active. More common is that the application, in between short bursts of execution, can be put into sleep mode. This changes a lot, because now you have to make a trade-off between setting a lower clock frequency, or keeping the high frequency and finishing the tasks earlier so that more time can be spent in low power mode.

For argument’s sake, we assume that the device consumes zero power in sleep mode and that there is no overhead in entering and leaving it. With doubled clock frequency, the device consumes twice as much power while executing, but for only half the time, so the total energy drawn is the same; in that sense there is no difference between the two strategies. Looked at from the CMOS perspective, the circuit switches states the same number of times to execute the instructions regardless of clock frequency, and since energy is consumed per switch, the total consumption is the same.
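In terms of the formula above: a task of N clock cycles executed at frequency ƒ takes N/ƒ seconds, so the energy it consumes is

E = P · N/ƒ = (α · CL · VDD² · ƒ) · N/ƒ = α · CL · VDD² · N

which is independent of ƒ.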

But this is not the entire truth. For example, analog peripheral units will not respond to a decrease in clock frequency in the same way as digital logic. An application making use of analog peripheral units when active may be better off using a high CPU frequency while executing and then entering sleep mode. On the other hand, when an analog peripheral like a comparator or converter is used, it may be required to be in use during a specific time period rather than for a specific number of clock cycles, or it could require a specific CPU frequency to operate properly, complicating things further.

Some applications need to stay in active mode between bursts even though they have little to do. In this case, clock throttling can be a good strategy: the CPU frequency can be high during execution of critical functions, and then lowered to a level where program execution is slower but still acceptable when there is less work to do.

The power debugging tools available in IAR Embedded Workbench correlate power consumption data with trace data to enable testing and tuning for minimized power consumption. We have already mentioned the function profiler as a tool to use to find the functions that need to be optimized for speed. In IAR Embedded Workbench, the function profiler also displays power consumption statistics together with execution statistics, allowing a close examination of the correlation between execution profiling and power profiling. These tools can help you find the CPU frequency that best meets your requirements on performance and minimized power consumption.

A trip down memory lane to save power

A very costly operation in terms of power consumption in embedded systems is accessing memory. If we can optimize memory accesses, we will save power. There are two general principles for doing this: the first is to minimize the total number of memory accesses; the second is to perform the accesses we do need from a memory as close to the CPU as possible.

Every system has a memory hierarchy and the further away from the core the memory is, the more expensive it is to access. This is regardless of whether the cost is measured as time or as energy. In a microcontroller, the physical memory hierarchy typically consists of registers, RAM and Flash. In more advanced processors, there is also a cache hierarchy, but for the application programmer it is usually transparent and we will not discuss it in this article.

Starting from the bottom of the memory hierarchy, the first thing we should do is move items from Flash into RAM, so that every access to them consumes less power. Given the often small RAM sizes available on microcontrollers, we seldom have the luxury of running the entire program from RAM; instead, we run it from Flash. But there is a lot of power to be saved if we arrange for at least the most frequently executed code to run from RAM instead of from Flash. This requires a decomposition of our code into functional subroutines. Using the function profiler, we can detect which functions execute most frequently and spend most time executing, if this is not already known to us. These are the functions we should run from RAM, at least if they have a small enough memory footprint. If they are too big to fit in RAM, we need to consider splitting large functions into several smaller ones.

We can easily see the results using the power profiling functionality in the function profiler. Alongside execution profiling, this window displays the power consumption for each function including sample average, max and min. If changing the location of a function from Flash to RAM results in a change in power consumption, we can instantly verify it by using the power profiling functionality.

Power profiling in IAR Embedded Workbench

Once we have put the critical functions into RAM, we can go one step further by making sure our functions handle data stored in memory as efficiently as possible. There are many practices and techniques for this, but the first thing we need to ask ourselves is: "Can we restructure the code to make fewer data accesses?". Rewriting algorithms is very specific to the application, but there are also some generic low-level practices to consider, and we will take a look at a few examples.

When a variable (or constant) is needed by the program, it is read into a register. While registers are the fastest and least expensive memories to access, they are also very limited in number, so not all variables will fit into the registers available at any given time. However, if the accesses to a specific variable are grouped together, the chances increase that it remains in a register between accesses, instead of being put on the stack while other data occupies the registers.

It is also possible to use a #pragma directive to place a variable in a specified register and keep it there. This can be useful in some cases, but it will also limit the compiler’s ability to optimize the code since it will have fewer registers available.

Using global variables is often considered bad coding practice, but they are still frequently used and, in some circumstances, for good reasons. If we access a global variable more than once during a function call, it can be a good idea to make a local copy of it, since the copy is normally stored in a register, which makes subsequent accesses to the variable much less costly.
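A sketch of this technique (illustrative names; this assumes nothing else modifies the global while the function runs):

int sensor_gain;                      // global variable, stored in RAM

int scale_samples(const int *buf, int n)
{
    int gain = sensor_gain;           // one RAM read; the copy typically lives in a register
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i] * gain;         // no repeated RAM accesses to the global
    return sum;
}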

Another example involving variables in registers concerns how they are handled when we make a function call. Normally, a function call means that the registers in use are saved to memory, and when the function returns, they are restored again. By grouping function calls together, fewer saves and restores of the caller context need to be made, and accesses to main memory are avoided.

The effect of changing how a variable is accessed and stored can be difficult to measure for a single read or write, and the impact on overall power consumption may seem negligible, but when this is done systematically, the aggregated effect of all the saved memory accesses is well worth the effort.

Optimizing memory accesses is one of the most fruitful things we can do from a software perspective to reduce power consumption in an embedded system. Memory accesses in themselves consume a lot of energy, so reducing their number saves power. It also takes time to read and write data, so with fewer memory accesses we speed up the execution of the program and can spend more time in low-power mode. As with any optimization work, it becomes much easier if we can immediately see the result of our efforts instead of groping in the dark and guessing at the effects of the changes we make. The power debugging technology in IAR Embedded Workbench measures the effects of any changes we make and gives us immediate feedback, without us having to take the system into the lab.

© IAR Systems 1995-2016 - All rights reserved.