The Allegro Wiki is migrating to github at https://github.com/liballeg/allegro_wiki/wiki

Optimization

From Allegro Wiki
Jump to: navigation, search
TODO: split this into sub-articles (edit)


From time to time, you may find that your program is not running as fast as you want it to, or using up more memory or other resources than you want it to. To solve this, you can rewrite parts of the program to achieve the same overall goal but quicker or using fewer resources. This process is known as optimization.

There are many layers to optimization. Some of them can reduce the ammount of work that needs to be done by many orders of magnitude. It is recommended to work your way down the list of optimization stratergies from the top (high level optimization) towards the bottom (low level optimization) - that is, attempt the high level optimizations before the low-level optimizations. But like all rules, there are exceptions. You may have thought of a low-level optimization that can be implemented rapidly, but have yet to consider the high-level optimizations. Also, some stratergies can be moved up or down depending on the type of project.

TODO: decide on the order of the stratergies, or mention when specific stratergies can be moved up or down (edit)


Some of the stratergies list some examples of using the stratergy.

TODO: This focuses on optimizing for speed. There should be more optimizing for memory stratergies (edit)



General

These stratergies are somewhat abstract and can be applied to several layers of optimization. It is recommended that the program has been debugged first (especially when it comes to the lower-level optimizations). This is because you may find that you optimized some buggy code, and the optimization cannot be applied to the bugfixed code. However if you've tracked down your bug to the implementation of a particular algorithm and you're considering replacing it with a different algorithm, you can automatically get rid of the bug by replacing the algorithm.


  • As optimization is about making the machine do less work, one way of thinking about it is by using the concept of lazyness. As well as the machine, it can also be applied to the programmer. Only optimize what needs to be optimized. Use a profiler to identify bottlenecks.
  • Consider high-level optimizations before low-level optimizations. You may want to spend ages trying to optimize a particular calculation, but ask yourself: Does this calculation need to be done in the first place (this is another example of where lazyness applies to optimization. In this case, it's about the machine being lazy)?
  • Think about what you are doing or trying to do rather than just looking for things to optimize.
  • Are you using the right tool for the right job? This concept could apply to a broad set of aspects of programming. It could be the right algorithm, the right API, or even the right programming language.
  • Look at what you are trying to do (algorithmically, mathematically, the way it is written, how it utilizes resources) and see if it can be done in a more efficient manner.
  • If writing low-level code, get to know your target platform. Find out what's likely to cause a bottlenecks on that particular platform. Note that optimizing for more than one platform where the bottlenecks are in different places involves doing extra work. Again, remember the difference between 'fast enough' and 'as fast as possible'.
  • With enough practice, the programmer learns to spot opportunities to make high-level optimizations, and lower-level optimizations are automatically taken into account by the programmer to the point that they don't have to think about them anymore.


Optimization strategies

  • Does it really need to be optimized? In other words, this is optimizing your schedule. Identify bottlenecks and concentrate on these (use the profiler to identify the bottlenecks). While any code that is called often such as vertex-manipulation code needs to be done as fast as possible, code for checking a keypress does not need to be optimized. Likewise, know the difference between "As fast as possible" and "Fast enough". If following the optimization strategies listed here in order, once you're satisfied with the speed and/or memory requirements, it's not necessary to optimize further.
  • Try and make sure the bottleneck is used as little as possible, and for that matter, does the bottleneck really need to be executed (Zen optimization)?
    • eg: Swapping two large blocks of data in an array: Consider using a linked list instead where the same effect can be achieved with four pointer-changes for a singly-linked-list.
    • eg: If doing collisions, instead of coming up with the fastest collision-detection algorithm, you should try and figure out if the collisions need to be done in the first place. If you divide your world into sectors and you work out which objects are in which sectors, if two objects are not in the same sector, it means a collision-check will always fail so it can be skipped. The simplest method is just to divide your world into an evenly spaced grid and use the modulus-operator (see further down for advice on choosing a moduland) to work out which sectors the object occupies. More complex methods include Quadtrees/Octrees and even BSP trees.
    • eg: With collisions, use a bounding-box or bounding-sphere (or some easier-to-check-for-intersection geometry) to determine if they are close enough to do a more detailed collision-check.
  • Think of different approach (thinking outside the box). Once you have a solution, think about how the problem and solution relate to eachother to gain a deeper understanding of the problem. You may then be more likely to think of an alternative approach - possibly by using a different method of abstracting the problem. The optimal approach to chose depends on the input-data size and should be experimented with until satisfaction. This is especially true if the approaches have a different rate of growth.
    • eg: Imagine objects spinning round the X axis in 3D space in an evenly-spaced circle on the Y-Z plane at the same speed. Which order are they drawn in? We could sort by Z position. However, if we look at the problem in reverse, we see that instead of finding out which position an object occupies, we find out which object should occupy a given position. We work out which object belongs at each position in the circle starting from the front-most and working my way to the back on both sides. We can very easily work out the draw-order this way.
    • eg. Polygon sorting vs. Z-buffering.
  • Consider solutions that don't give the correct answer but give a close-enough answer if it's quicker to compute.
    • eg. When working out the position of a moving object with forces applied to it, instead of using Newton's equations of motion, use Euler's integrator.
    • eg. When using collision bitmasks, you can cheat by not using a 1-1 mapping of sprite-pixels to collision-mask pixels (eg. 1 collision-mask pixel represents 2x2 sprite-pixels). Likewise in 3D, mesh-mesh collisions can be optimized by using collision-meshes (low-poly meshes that look similar to the high-poly models).
    • If calculating certain mathematical functions by hand (eg. trigonometric) using taylor-series (when using pixel shaders, this is often how they are done), find out how many terms you can get away with leaving out until the results deteriorate too much.
  • Think of same approach using different algorithm (appropriate complexity for input-size). Like with choosing the best approach, the optimal algorithm to chose depends on the input-data size and should be experimented with until satisfaction. This is especially true if the algorithms have a different rate of growth.
    • eg. Same problem as above with objects spinning: If we ignore solution above and just decide to go with sorting, if we know our sorting algorithms, we can pick one. QSort is good general-purpose one for randomly sorted data. However, if spinning round in a circle, lets assume for the time being that we're updating the position once per tick rather than using delta-time. If the objects are not moving so fast that they overtake the next object's position in the next tick, we can chose a sorting algorithm that works well when little or no change in the order of Z positions has taken place. Likewise, if we use delta-time, there's no upper-bound on the amount the objects can move between ticks, so the objects could have moved any distance (even more than a full revolution), so we'd have to chose a sorting algorithm that copes well with more or less random data.
    • eg. Think of an alternative to linear searching (eg. binary searching)
  • If the bottleneck makes use of an API, find out what the API is doing, and if you're not satisfied with it's performance, write the low-level code yourself. If your own implementation is more efficient and can be generalized enough to replace the existing function, consider proposing the changes to the API authors.


General platform-based optimizations

  • Bear in mind that on different platforms, the bottlenecks can sometimes be in different places. If there is no hardware rendering, the bottleneck is more likely to be in the rendering-code.
  • Try to use hardware if it's available (graphics acceleration, shaders, sound acceleration, physics-hardware, multiple processors, etc.).
  • On virtual machines, API calls to the platform-API can be a lot quicker than doing things yourself. This is because the platform-API may be implemented by the native machine rather than the virtual machine. Often, these optimizations can seem counter-intuitive to someone used to programing for physical machines.
    • eg. if the API supports a 2D tilemap, to get the tile to the left, GetTileAt(x-1,y) may actually be faster than *((&currenttile)-1).
  • Many bottlenecks are to be found in inner loops that are executed millions of times per second. When optimizing, these should be made to run as fast as possible. Also, if they fit inside the CPU cache (along with the data they are to access), it avoids cache-misses.
  • For bottlenecks that do not involve an inner-loop, consider optimizing for memory instead of speed.
  • Never make system calls.
  • Exceptions are slow.


Low level optimizations for non low-level languages

Do have faith in the compiler. This is really meant to be more of a workload optimization than anything. However, if you don't trust the compiler, see #Micro-optimizations. Also, some compilers are capable of doing the optimizations listed in this section.

TODO: Work out which optimizations belong under the Micro-optimizations header instead (edit)


  • Mathematical optimization (finding out if equations can be made simpler if the range of inputs is known (this is especially true for non-continuous functions)). In a similar vein, equations can be re-written to use fewer intermediate values, but on the other hand, using fewer values may be more likely to cause pipeline-stalls as we're more likely to rely on the result of recently-computed values.
    • eg: RGB<->HSV code can be simplified if we're certain the saturation is always going to be 1.
    • eg: The HSV->RGB code can be written in an easy-to-understand form, but by mathematically re-arranging it, it uses fewer values but is not so easy to figure out from looking at the code.
  • Once you've decided on the approach and algorithm, use code-based optimizing tricks
    • eg. Collision bitmasks instead of pixel-masks. You can check for collisions of as many pixels as there are bits in a machine-word jusing just a single 'AND' operator.
    • eg. Bob's useful code-snippets
  • Use iteration instead of recursion if possible. It reduces efficiency to keep pushing and popping return addresses, and you also risk overflowing the stack.
  • Pre-calculate as much as possible. For simple things, the compiler often does this for you, but for more complicated things, it's better to pre-calculate. This is especially true of things that remain constant throughout iteration or recursion.
    • Use interpolation. When iterating, it's quicker to add a pre-calculated constant than divide by a loop-counter.
  • With lots of branches (eg. switch() or a series of if()s), make sure conditions that are most likely to happen are tested for before unlikely conditions are checked. Sometimes, this is intuitive, but other times, you may have to count the number of times each condition is reached.
  • Use an appropriate iterator. When iterating through an array, increase a pointer instead of using an index into an array.
    TODO: explain why this is a bad idea on modern CPUs (edit)


  • Give the compiler a helping-hand.
    • Make use of keywords such as const, restrict, etc. (register is almost obsolete because compilers are good at register-juggling).
    • If a numeric integer-value is never going to be negative, use the unsigned keyword. The compiler knows for certain that the number is never going to be negative, and when writing code, does not have to include cases for handling negative numbers. Also, this has the advantage of being able to spot certain bugs at compile-time.
  • Get to know the target architecture. This is another way of giving the compiler a helping hand. Write code in such a way or choose values for arbitary values that the compiler is more likely to optimize.
    • eg. For unsigned integers, x/8 can be optimized to x>>3, but x/7 cannot be optimized this way.
    • eg. If the target-CPU supports SIMD instructions (instructions that work on several values at once), write code that's easier for the compiler to make use of SIMD instructions.
    • eg. If the CPU uses deep execution-pipelines, try and reduce the amount of branching (if's) in your code.
    • If iterating using an index, and the index starts at zero, count down instead of count up. This is because on many machines, there are specialised check-for-zero instructions that do not require loading a constant to check against. However, CPU-caches are more optimized for counting up than counting down (forwards iteration).


Micro-optimizations

TODO: Work out which optimizations belong under the Low level optimizations for non low-level languages header instead (edit)


TODO: Separate sub-sections according to different types of architecture (eg. machines where one approach is faster, and machines where another approach is faster) (edit)


Nowardays, compilers are good at optimizing, so they know many of the optimizing tricks, so attempting to out-optimize the compiler is futile unless you're a very good machine-code programmer (and even then, this will involve extra work). However, there's still a few tricks they don't know - especially when it comes to new hardware, or when optimization becomes more of an art than a science (that is, instead of using a methodical approach, you rely on gut-feeling). Optimizing for SIMD instructions can still cause problems for compilers and this is best done in machine code, but the resulting machine code is usually a very small portion of code that just calls the SIMD instruction on the data (although there will still be some higher level code to make sure the data is in the optimum layout to be SIMD'd). Even so, if you know machine code, you may want to check what your compiler is producing from time to time, but I'd only recommend this these days if you've got plenty of time on your development cycle, as if you discover something that's best written in machine code, the chances are that finding this out and writing the resulting code will take time.

Many compilers will do these for you, but if you are aware of these, you can make it easier for the compiler to optimize by giving it a helping hand (The trick is not to do the replacing yourself, but to recognise when there's an oppotunity for such replacement to occur).

Many of these only apply if you're writing in machine-code (thanks to the good standard of compilers) are machine-specific (but can be specific to many machines). At best, these types of optimization offer a linear speed-increase.

  • Micro-optimizations
    • Find out which is the faster of several similar statements. eg. The faster of n+n, n*2, n<<1 if it sees something similar in the code and even picks the appropriate choice for the sub-variant of CPU you wish to optimize for. However, if you're programming in machine-code, you have to make this choice yourself.
    • Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU).
    • Loop-unrolling.
      TODO: explain why this is a bad idea on modern CPUs (edit)
    • Use shifts instead of integer multiplies/divides by powers of two, and use bitwise AND for modulus by a power of two.
    • Make sure as much code/data as possible can fit into the cache. Some processors have seperate caches for data and code. However, if there are more processes running than available processors, the cache gets invalidated often. Also, bear in mind that sometimes, data can be padded to make access more efficient (some CPUs are better at accessing data aligned on a machine-word sized boundry), and that some CPUs are better at accessing data using the machine-native word-size (compilers usually do a good job of choosing the best balance of padding vs. fitting into the cache).
    • Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
    • If the CPU can execute parts of code in parallel, interleave the instructions in a way that the result is not immediately used by the next instruction. This prevents pipeline-stalls.
  • Machine code hacking. Extremely platform-specific. Requires in-depth knowledge of target architecture. Would even need to write differently for sub-classes of architecture. At best, performance optimization is linear. Only do this as a last resort! If a better algorithm or a bug is discovered, a lot of work goes to waste. Not worth it because all modern compilers are competent enough to handle this if they support the CPU in mind and know the subtleties of the CPU (preventing stalls, register management, etc). Also, can be a nightmare to maintain - especially if using self-modifying code. However, there are times when low-level optimization is an art rather than a science, and compilers may be scientific but are useless at art.
    • If your target architecture uses a non-symmetric set of machine code instructions (that is, different registers can be operated on in different ways), getting the most out of the processor is trickier. This means there may be several tricks the compiler may have missed.
    • Self-modifying code can be useful if both speed and compactness is desired, or to save on registers by hardcoding constants, but avoid it if the CPU uses separate code and data caches (unless the modification occurs only rarely).

External links