Custom shader effect using too much GPU

Hi everyone!

I have recently developed a shader that only performs a matrix-vector multiplication and a handful of additions and divisions, nothing fancy. To my surprise, GPU usage rose to around 50% (I’m running the game on a Samsung S7 and monitoring GPU usage with this app: https://play.google.com/store/apps/details?id=cz.chladek.profiler). I also tried one of the default shader effects (sepia, to be precise), but the result is the same.
I suspect this is caused by the fact that my scene is composed of a rather large number of tiles (on the order of 100-200). What really struck me, though, is that there are some 3D games in the store that use far less GPU. How is that possible?

Fancy visual effects in general seem to be quite strenuous in Corona. In one game, we had a large black-and-white image of a sunburst effect. We applied blendMode "add" to it, and also rotated and scaled the image in a transition. This effect alone was enough to make the game lag, but once we removed the blend mode and did the necessary editing in Photoshop instead, the game ran smoothly again.
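Roughly what our setup looked like, as a from-memory sketch; the filename, sizes and timing are made up for illustration:

```lua
-- Additive-blended sunburst, rotated and scaled in a transition (placeholder values)
local sunburst = display.newImageRect( "sunburst.png", 512, 512 )
sunburst.x, sunburst.y = display.contentCenterX, display.contentCenterY

-- The additive blend mode was the expensive part for us
sunburst.blendMode = "add"

-- Rotate and scale the image over a few seconds
transition.to( sunburst, { time = 4000, rotation = 360, xScale = 2, yScale = 2 } )
```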

It is also possible that you and I have done something wrong and that there are more optimal means of applying the effects.

I have to admit that I find it interesting that matrix-vector multiplications would be lagging on your device. I’m just adding some finishing touches to a shadow system that I’m working on; it relies heavily on matrix multiplications, and those are among the lightest calculations in the system.

Thank you for your reply!

I’m also very interested to understand why this small, GPU-oriented operation is taking so many resources. My fragment shader is really nothing more than a matrix-vector multiplication, but when I run it on my device, GPU usage is near 50%. On the other hand, when I run the game without the shader, GPU usage is close to 20%. I know that mobile GPUs are much “weaker” than desktop ones, but I still find it hard to believe that a 2D game with such a simple shader is this heavy for a fairly recent device.

Since I’m pretty new to shader programming, I might have done something terribly wrong. I’ve included the shader below so you can take a look at it if you want :slight_smile:

    // Reference colors for the similarity test. With GLSL's column-major mat3,
    // "colors * v" below gives the dot products of v with green (0,1,0),
    // yellow (0.7,0.7,0) and blue (0,0,1).
    P_COLOR mat3 colors = mat3( 0.0, 0.7, 0.0,
                                1.0, 0.7, 0.0,
                                0.0, 0.0, 1.0 );

    P_COLOR vec4 FragmentKernel( P_UV vec2 texCoord )
    {
        P_COLOR vec4 texColor = texture2D( CoronaSampler0, texCoord );

        // Grayscale version of the texel
        P_COLOR float avgColor = ( texColor.r + texColor.g + texColor.b ) * 0.33;
        P_COLOR vec4 bwColor = vec4( avgColor, avgColor, avgColor, texColor.a );

        // Similarity of the normalized texel to the reference colors
        P_COLOR vec3 similarities = colors * normalize( texColor.rgb );

        // Blend between grayscale and the original based on the total similarity
        P_COLOR vec4 outputColor = mix( bwColor, texColor, ( similarities.r + similarities.g + similarities.b ) / 2.0 );

        return CoronaColorScale( outputColor );
    }

The goal of this shader is just to turn colors that are not “similar” to yellow, green, or blue into grayscale. I know the dot product is not the best metric for telling whether two colors are similar, but every other metric I came up with runs too slowly to be a viable option.
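For completeness, this is roughly how the kernel gets wired up on the Lua side; a minimal sketch, where the effect name “selectiveGray”, the tile image and its size are just placeholders:

```lua
-- Register the fragment kernel above as a custom filter and apply it to a tile
local kernel = {
    category = "filter",
    group    = "custom",
    name     = "selectiveGray",          -- placeholder name
    fragment = [[
        // FragmentKernel code from the post above goes here
    ]],
}
graphics.defineEffect( kernel )

local tile = display.newImageRect( "tile.png", 64, 64 )   -- placeholder tile image
tile.x, tile.y = display.contentCenterX, display.contentCenterY
tile.fill.effect = "filter.custom.selectiveGray"
```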

How big are your tiles combined when rendered? Just a single native fullscreen overdraw on your device is quite a lot of pixels and the shader runs on every single one.

Many 3D games render to smaller render targets and upscale later to the required size. Even many AAA games on consoles do this, trying to find the sweet spot between hitting their target framerate and keeping the resolution as high as possible.

Another idea - maybe you get much worse batching when you use the shader and it’s actually not the fragment calculation that’s killing your performance?

A simple way to check this: limit your shader to a single tile, but render it as big as all your tiles combined would be.
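Something like this, for example; an untested sketch where “filter.custom.yourEffect” and “tile.png” are placeholders, and the full screen size stands in for the combined area of your tiles:

```lua
-- One tile stretched to roughly the combined size of all tiles, with the same effect applied.
-- If this is just as slow, the fragment shader is the cost; if it is much faster, batching is.
local w, h = display.actualContentWidth, display.actualContentHeight
local bigTile = display.newImageRect( "tile.png", w, h )
bigTile.x, bigTile.y = display.contentCenterX, display.contentCenterY
bigTile.fill.effect = "filter.custom.yourEffect"   -- placeholder effect name
```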

First of all I wanted to thank you all for your help.

I tried what Michael suggested to see whether I was getting worse batching, but even with a single tile using the shader, the result is the same (a bit better, but not significantly so). What I noticed, though, is that if I remove my background images (five images used for a parallax effect), performance is noticeably better. This is probably because the background images cover a large portion of the screen (and there are five of them). I also tried using snapshots (https://docs.coronalabs.com/api/library/display/newSnapshot.html), but again, same result.

I guess there is not much I can do about it now, but I’m still surprised that such a simple post-processing effect is so resource-demanding. The best course of action for me is probably to preprocess the background, changing the pixel colors on the CPU, since the background does not have to be dynamically shaded. However, if you have any other ideas, I’d be glad to dig further into this problem.

If I understand you correctly, the background (parallax) images are not related to the effect? If that’s right, then you have the classic situation where multiple parts are roughly equally demanding: you could remove either your backgrounds or the shaded tiles, and in both cases you’d get a measurable performance boost.

 

Also, again, an effect might seem simple on a per-pixel level, but at 60 fps it is executed more than 220 million times a second just for a single fullscreen draw on your S7 (2560 × 1440 × 60 ≈ 221 million). That is a lot of work.

The performance boost from removing the parallax images is likely because you’re at the bandwidth limit of your GPU. With five roughly fullscreen images you’re already pushing over a billion pixels a second just for the parallax effect (5 × 3.7 million pixels × 60 fps), and at 32 bits per pixel that’s up to about 8 GB/sec of memory that has to be read and written.

 

Here’s another detail where 3D engines may actually be less demanding. 3D games use the z-buffer to determine whether a pixel has to be drawn/shaded or not, and for that reason they usually render whatever is visible front to back. GPUs are highly optimized to do this z-buffer rejection, so in 3D even the same amount of overdraw (like your five parallax images) is usually much less demanding on GPU bandwidth than the typical 2D approach, which is drawing back to front for correct transparency. In 3D games that back-to-front ordering is only used for translucent geometry/materials.

 

If you don’t need the shader effect to run every frame, a preprocessed image is a great optimization. You could still do this with a shader and a snapshot, since snapshots don’t have to be updated each frame.
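For example, something along these lines; a rough sketch where the image, effect name and sizes are placeholders, and the point is that the snapshot is only invalidated once:

```lua
-- Bake the effect once: the shader only runs when the snapshot is invalidated,
-- so after this initial render it costs nothing per frame.
local w, h = display.actualContentWidth, display.actualContentHeight
local baked = display.newSnapshot( w, h )
baked.x, baked.y = display.contentCenterX, display.contentCenterY

local bg = display.newImageRect( "background.png", w, h )   -- placeholder image
bg.fill.effect = "filter.custom.yourEffect"                 -- placeholder effect name
baked.group:insert( bg )

baked:invalidate()   -- rendered once here; never called again afterwards
```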

You could also try optimizations like rendering your parallax layers into a snapshot at half the width and height of your actual display and then drawing that snapshot scaled up to full size. This should reduce the bandwidth used for the parallax images to slightly below 50% of what is consumed at the moment. Depending on your art style, users may not even notice the difference.
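A sketch of that idea, assuming five fullscreen parallax layers; the file names “parallax1.png” to “parallax5.png” and the per-frame invalidate are just for illustration:

```lua
-- Half-resolution snapshot for the parallax layers, scaled back up to full screen
local w, h = display.actualContentWidth, display.actualContentHeight
local parallaxSnap = display.newSnapshot( w * 0.5, h * 0.5 )
parallaxSnap.x, parallaxSnap.y = display.contentCenterX, display.contentCenterY
parallaxSnap:scale( 2, 2 )   -- drawn at full size, but shaded at a quarter of the pixel count

local layers = {}
for i = 1, 5 do
    layers[i] = display.newImageRect( "parallax" .. i .. ".png", w * 0.5, h * 0.5 )
    parallaxSnap.group:insert( layers[i] )
end

-- each frame, after moving the layers for the parallax scroll:
parallaxSnap:invalidate()
```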
