Function calls are expensive!

Recently, I tried to solve a performance problem and found that a function call costs more than 10 floating-point multiplies!

Here is the sample code:

-- empty function
local function test()
    return 0
end

local x = 77.1
local y = 11.1

local t0 = system.getTimer()
for i = 1, 100000 do
    local v = y * x
end
local t1 = system.getTimer()
for i = 1, 100000 do
    local v = test()
end
local t2 = system.getTimer()

print(string.format("t1-t0 = %.2f, t2-t1 = %.2f", t1 - t0, t2 - t1))

With an iPhone 4, we got: t1-t0 = 45.17, t2-t1 = 630.60

The cost of one function call is more than 13 floating-point multiplies!

So I have to reduce function calls in enterFrame() and avoid calling functions in loops.

But with that restriction, I can’t write more structured code using functions.

Or did I miss something?

are you SURE that a function call overhead of 6.3 microseconds (630 millis / 100,000 iters) is where your performance problem is?

function call overhead should be a tiny, tiny fraction of the total workload - unless you really ARE calling a function every time you need to do “y*x” 100K times per frame. In that case, yes, it might be time to refactor your code into fewer calls, instead of worrying about the function call overhead itself - which you can’t do anything about anyway (short of making everything local, which it already is in your example)
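For what it’s worth, the “making everything local” trick mentioned above usually means caching library functions in locals so each call skips a global-table lookup. A minimal sketch (the names `sqrt` and `distance` are illustrative, not from the original code):

```lua
-- Cache a global library function in a local: one table lookup here
-- instead of one per call. This trims lookup cost, though it does
-- not remove the call overhead itself.
local sqrt = math.sqrt

local function distance(x1, y1, x2, y2)
  local dx, dy = x2 - x1, y2 - y1
  return sqrt(dx * dx + dy * dy)  -- distance(0, 0, 3, 4) gives 5
end
```
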

it’s like if you were worried that your shellsort was too slow, so you went to the extreme of implementing it as native code.

meanwhile, your buddy’s quicksort is still 10X faster in interpreted Lua

fwiw, hth

We are developing a 2D shooting game, with more than 100 enemies in the final stages.

If we want them to run smoothly, we need 30 frames per second, so one frame has 33 ms to finish. On an iPhone 4, the 100+ enemies need about 20 ms to redraw, which leaves about 10~13 ms for enterFrame() and other event handlers.

In enterFrame(), the simple actions (calculation of locations and ranges, tracking, rotating, and other simple AI …) involve more than 20 function calls per enemy, conservatively (including the nested ones). So 100 enemies will spend 0.0063 * 20 * 100 = 12.6 ms on function calls alone in one frame. There is no room left for the actual calculations.

We just wondered why the function call overhead is more than 10 float multiplies. Now we realize it is a limitation of this engine, which needs the overhead to handle the stack and variable scopes, or … who knows.

Maybe we should consider other game strategies that need fewer sprites moving on the screen.

Thank you for replying.

Back in the olden days when we had to put images on the screen a pixel at a time there was a technique that could be used to really increase performance - unrolling the loops.

For a 20x20 sprite you’d normally do a loop of 20 inside a loop of 20, but that takes too much time. So you’d unroll the loop and end up with 400 lines of code, each one drawing a single pixel. Yes, it took more memory, but it was oodles (that’s a technical term we used back then) faster.
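A sketch of that idea in plain Lua, where `plot` is a hypothetical stand-in for whatever draws one pixel (not a Corona API):

```lua
-- Looped version: nested 20x20 loop, one plot() call per pixel,
-- plus loop bookkeeping (increment, compare, branch) on every pass.
local function drawSpriteLooped(plot, sprite)
  for y = 1, 20 do
    for x = 1, 20 do
      plot(x, y, sprite[y][x])
    end
  end
end

-- Unrolled version (abbreviated): the loop bookkeeping is gone,
-- at the cost of 400 explicit statements.
local function drawSpriteUnrolled(plot, s)
  plot(1, 1, s[1][1]); plot(2, 1, s[1][2]); plot(3, 1, s[1][3])
  -- ... 397 more statements, one per pixel ...
end
```
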

You can’t directly compare then to now, but maybe there are things you can do to make up your time difference? It might be nice to break down the operations into functions that make sense, but if you combine some of them can you get where you need to go?

Just something to think about, maybe.

 Jay

Function calls are “relatively” expensive in Lua because you’re also setting up a whole closure environment. (that is, it’s not quite as simple as the 80386-asm era “push args, call, pop return”, while the 80387 took relatively “forever” to do an fmul)

So, yes, on the one hand it would be advisable to not OVER-use functions in critical loops (like if you’ve set up a complex class system with getters/setters for every single atomic value, perhaps further confounded by a metatable search up some complicated inheritance chain, then you’ll likely feel the pain - welcome to mobile development in an interpreted environment! :D)

On the other hand, it may still be a structural issue, or a case of “algorithm” choice (continuing the shellsort/quicksort analogy). Many games solve such issues by allowing their agents to be relatively autonomous in between AI calls that are spaced farther apart. So, for instance, you call AI only once per 30 frames, for only 1/30th of your agents at a time in a cyclic manner, expecting each to acquire enough info to “survive on their own” for the next 29 frames. (or whatever period/cycle ends up working for you) Then your per-agent-per-frame workload perhaps comes down to manageable simple motion and basic collision detection, at which point doing 100 of them shouldn’t be any trouble.
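A minimal sketch of that time-slicing scheme, assuming a `think(agent)` function and `x/y/vx/vy` agent fields as hypothetical stand-ins for the game’s own AI and motion code:

```lua
-- Time-sliced AI: each frame, run the expensive think() for only
-- 1/CYCLE of the agents; everyone else just integrates motion.
local CYCLE = 30  -- each agent thinks once every 30 frames

local function updateAgents(agents, frame, think)
  local slot = frame % CYCLE
  for i = 1, #agents do
    local a = agents[i]
    if i % CYCLE == slot then
      think(a)            -- expensive AI, amortized across frames
    end
    a.x = a.x + a.vx      -- cheap per-frame motion for everyone
    a.y = a.y + a.vy
  end
end
```

Over any window of 30 frames, every agent gets exactly one think() call, so roughly 3-4 of 100 agents pay the AI cost on any given frame.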

Of course, that EXACT approach might not be applicable to your game, but there’s probably SOME approach along that line of thinking that could help.

fwiw, hth

@codingcake - Based on your own logic, the obvious solution is to change your equation into:

100 enemies will spend 0.0063 * 5 * 100 = 3.15 ms

This leaves almost all of the remaining enterFrame time for floating-point multiplies (which it is implied you need to do a lot of).

Both J.A.Whye and davebolinger have outlined strategies to try and help you move forward in changing your equation into what it must become (according to your description). I would think it would be easiest to first attempt J.A.Whye’s strategy and fold together many calls into single larger calls to reduce the overall numbers.
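As an illustration of that folding, here is a before/after sketch where several tiny per-enemy helpers collapse into one call per enemy per frame (all names are hypothetical stand-ins, not the poster’s actual code):

```lua
-- Fine-grained version: 4 calls per enemy per frame.
local function getDX(e) return e.tx - e.x end
local function getDY(e) return e.ty - e.y end
local function move(e, dx, dy)
  e.x = e.x + dx * e.speed
  e.y = e.y + dy * e.speed
end
local function updateFine(e) move(e, getDX(e), getDY(e)) end

-- Folded version: 1 call per enemy per frame, same result,
-- with the helper bodies inlined by hand.
local function updateFolded(e)
  local dx, dy = e.tx - e.x, e.ty - e.y
  e.x = e.x + dx * e.speed
  e.y = e.y + dy * e.speed
end
```

With ~20 calls per enemy folded down to ~5, the per-frame call overhead in the poster’s own equation drops from 12.6 ms to about 3.15 ms.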

If that doesn’t get you far enough, you may need to go full Bolinger - and more drastically rewrite your algorithm.

I think a new term has been coined.

“To go full Bolinger” means to rethink the solution to a problem, even if it means massive amounts of code changes.

Is that definition good for everyone? Tweaks? :slight_smile:

 Jay

ROFL :smiley:

BTW, I believe the OP was sincere, if perhaps a bit misguided, and I didn’t intend to beat up on them.

[also, technical correction: i was thinking “bubblesort” when i wrote “shellsort” if the O magnitude of my analogy was to make sense. maybe a better analogy (that many here have likely seen) is solving a problem like “steer towards target” with a handful of expensive trig, when a far simpler vector approach might do. so massive change (just for the sake of being massive) isn’t necessarily what i was suggesting]
