com.ansca.corona.CoronaService ANR affects 20k users

With that github repository you can download the .lua files and replace the built in widget library with your own versions, fixing any bugs/rare crashes you find. The readme’s in the repo detail how to do this.

There’s some more details here:

https://forums.coronalabs.com/topic/34183-widgets-20-open-source

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@everyone - any thoughts?
@rick_sherman, the latest rare-occuring widget.lua error is - Runtime Error: /Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget.lua:27: attempt to index field ‘?’ (a function value)

Looking at widget.lua in your git shows, “if group[i]._isWidget then” - which tells me that this object/group is nil, thereby causing the error. So I have two questions.

  1. are you saying that I don’t need Enterprise, that I can just copy and paste the widget folder into my project?
  2. what can be done to fix this? If it were me, I would change it to, “if group and type(group)==“table” and #group>=i and group[i] and group[i].x then” - such code would verify that a ) group was not nil; b ) that group is the proper type; a variable; c ) that the group array length is greater than or equal to i, d ) that the value in group[i] is not nil and e ) that the value in group[i] is a display object or group. I know this seems rediculous to some, but how else can you fix a bug like this with certainty? I don’t know of any other way.

Anyone, thoughts?

Not sure if that repo is 100% up to date.

Looks like a good belt and braces solution to me, these rare crashes were driving me crazy because they were so difficult to reproduce, but I’d see them come through on the logs. When you’ve got lots of users moving between scenes, creating and removing widgets within each scene a 1 in 1,000 shot adds up.

I’d replace any removeSelfs with display.remove as well.

Can someone please clarify which exact widgets are being discussed, or is it all the widgets? And does it only occur when using the built-in corona scene manager calls?

I only use the scrollview widget and not any type of scene manager calls. I have never been able to get the display.remove to properly clear/remove scrollviews, so I always code clearing/removing scrollviews manually, which may seem like overkill but at least I can then be sure it is completely removed if I jump to another section/function in my app.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@nick_sherman - thanks again for helping with this.

  1. Is the link you provided to the current version of widget.lua? And if not, can I get it?
  2. if its different from the version of widget.lua that you fixed with link above that one, could I compare the code of that used by Corona with your changes to update a new version, and then fix line 27 like above?

I’m still concerned that if we have to modify bugs that cause CRASHES in plug-in code, this should be where the Corona Team jumps in and says, “Ah, got it guys. Fixed in the next build.” right? Of course, I know they are hard at work finishing HTML 5. So, the trade-off decisions are difficult.

But line 27 identified above, requires a solution that Corona can’t fix, right? After all, comparing all of those values would require changes in the LUA interpreter, not just the Corona engine - so, it’s complicated.

Overall, I really think Corona needs to develop a “STRATEGY” and attack these issues. These CRASH issues occur in many different places for many different developers, and only rarely. Accordingly, Corona needs to be “hardened” IMHO. From the simple to the complex, across the board. I’m sure they themselves can create an entire list of issues of when the developer’s intent is clear, but either A ) LUA or B ) the Corona engine, or C ) a Corona plugin fail. And I’d be more than happy to participate in an online meeting, and likely many others impacted by the CRASH PROJECT, if Corona has interest.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@jacques - one of two crashing issues I have is in line 27 of widget.lua, likely pertaining to my use of scrollView. However, thse kinds of issues run throughout widget.lua and likely through other plugins, too - nick_sherman, above, had the same problem and chose to significantly harden (protect/strengthen) his version of widget.lua to solve this, and it worked for him.

The real difficulty here for Corona is that these bugs are quite rare. Only those of us pushing the engine hard are likely to see them, and even still, rarely. But as I wrote above, these issues are indicative of the fact that LUA, Corona engine and or Corona plugins are not like the killer-cant-crash code we want and desire. A strategy to harden all 3 are necessary to truly eliminate 99% of crashing bugs. Or course, the plugins should have the priority, IMHO.

I also found places in the individual widget files, like scrollView.lua or tableView.lua that benefitted from extra protection against nil references.

I’ll see if I can dig out the files, but I only ever used scrollView and tableView and it was a few years ago so I couldn’t tell you where the changes are!

Also, once you’ve gone down this road you will never get any future updates.

Indeed, but faced with crashes I had no control over happening to paying customers in that moment, I decided to just take ownership of the widgets code and if it crashed from then on, it was my problem.

Black boxes = bad!

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@SGS, you are referring to Corona Team updates to the “widget” plugin, right?

Your comment, “once you’ve gone down this road you will never get any future updates”

…is exactly why Corona needs to harden it for all of us.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@Rob, even if the team were to fix line 27 of widget.lua, that could result in another similar error, too - so what’s the chance that this can be done, and soon? Because we’re not talking about new feature development, simply hardening, might this be something one of the team can work on that is not dedicated to HTML 5, where most are likely focused?

Also, any insight as to staffing and platforms would be interesting, although I respect it could be confidential and you are unable to share.

Hey, guys!

There are a lot of valid concerns voiced in this discussion. Unfortunately there are bugs in widget.lua. For example, line 27 referred in previous post is following:

local cached\_displayNewGroup = display.newGroup

Error “widget.lua:27: attempt to index field ‘?’ (a function value)” would mean that there is defined function “display” in global namespace, before require "widget". Note, that functions declared in lua like:

function display() print "blah" end

would overwrite Corona’s display library in global scope. You can use local function display()...end, at least it wouldn’t bork global scope, only local. Still, best is not to use “display” as your function or variable name.

From what I see, on iOS, removing invalid object would result in no-op in device build. They’re just ignored. In Simulator, they will produce error message box.

Lastly: we have widget library in open source, so anyone can include it. Just place widget.lua into root of your project. Problem is that it is quite outdated. We will update widget repo soon.

Unfortunately, it is not as simple as just placing widget.lua in your root.  You need to download the entire widget library (https://github.com/coronalabs/framework-widget/tree/master/widgetLibrary) and there are quite a few internal references that need updating.  Also the source is 2 years old.

I would really appreciate a fix for this in current widgets library

/Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget\_tableview.lua:1266: attempt to index field 'parent' (a nil value) stack traceback: at (Unknown Function) (/Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget\_tableview.lua:1266) at (Unknown Function) ((tail call)) at ? (?:0) at (Unknown Function) (?:0)

And to add to this, part of the error handling you don’t like @Troy is Lua, not Corona. While we have an event that can help you trap some runtime errors in real time (see; https://docs.coronalabs.com/api/event/unhandledError/index.html), they are still errors.

Lua works in “chunks”. For instance all of main.lua is a chunk.  The guts of a for loop or function are a chunk. When a runtime error occurs, it aborts the chunk the error happened in and no further code will execute.  This may sound like a desirable outcome, but it’s not. Your code could be so borked from that function not completing properly, it gets into a worse state than crashing. It simply stops responding or corrupts data, all which can be interpreted by the app’s user far worse than a crash would.  Now in Google’s new “let’s rank you on how you perform” model, it probably wouldn’t do much as instead of getting a crash report, you’re going to get an ANR which honestly seems worse than a crash report.

Please note, if you do implement unhandledError events, make sure to read the docs. There are some gotchas about how the messages are reported and you may find it only partially useful.

Masking errors isn’t getting rid of them, it’s just hiding them which frankly as a developer I find more frighting.

Rob

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@Rob,

  1. The reason I’m not using the unhandledError events is because I need reports to go to Google, and as you stated above, there are gotchas about how the messages are reported. Fact is, we already know that line 27 of widget.lua is a problem for me and line 1266 in widget_tableview.lua is a problem for SGS above. Are these the only two line number error reports in widget that you’ve seen reported? Or do you have a long list of them collected in the past 2 years?

  2. I understand that some issues are Lua. Are you saying that the Corona Team is using a Lua interpreter and have never had an occassion to modify it when necessary?

If the Corona Team wasn’t busy at the moment, I have to believe these CRASH issues would have the highest of priority. if you agree, than this is not a matter of ‘should’ Corona Team fix it. It’s just a matter of priorities, am I right?

Do you agree that ‘hardening’ 1) Lua, 2) Corona engine; and 3) Corona plugins would be a good idea to preserve the integrity of the engine?

And lastly, I’ve noticed CRASH issues increasing with Android 7 and 8. Do you have any idea why, or are my observations circumstancial?

I’ll let @Rob answer from a Corona perspective but @troy where do you get “do not use unhandled error event” from?

I always use this to send the errors to my DB server.  AFAIK this has no bearing on what you see in Android vitals and ANR/crash logs.

Now saying this, it is responsible coding to check for valid data before trying to perform an action on it - things like “do you have a valid response to a network.request() before trying to process the data”, etc. 

I also use liberal use of pcalls if I suspect a code snippet is likely to fail.  It is not as slick as a try…catch…finally block which I so wish was available in lua.

Hey Troy. I don’t have a good list of widget bugs that crash and turn into reports you are seeing. So for #1, I don’t have a good answer for you.

And for #2, yes, we’ve made changes to our version of Lua when we’ve needed to address things, but what you’re asking would be a fundamental change in how Lua works. In other words, I seriously doubt we could or would  change how Lua error handle’s chunks since there is an expectation that Lua works that way.

The Corona team is busy at the moment, and it may not seem like but addressing these Android crashes is the highest priority of our engineering team right now. 

There seems to be two core problems:  1. OpenAL audio crashes.  2. Changes to Google Play core services libraries.  

In many cases, people are seeing jumps in ANR’s and Crashes even though they haven’t updated their apps in a long time. Corona can’t magically change your apps once they are on a device. So we have to look at what else may have changed. Many of these crashes as you note are happening on Android 7 and Android 8 which lends it to updates to Google Play. Many plugins touch Google Play’s base libraries. So we have to pretty much rebuild every plugin we are in control over. We have to get third parties to rebuild theirs now that we’ve updated our dependency library to use a more modern version of the Google Play libraries. This simply takes time and testing to make sure we don’t make the problem worse. It’s possible today, depending on what plugins you use, that you could push a new version and start seeing a decline in these reports. Have we gotten them all yet? No, but we are spending a great amount of engineering effort to address them as fast as we can.

The other crashes are around Android’s horrible support for the OpenAL library. Not only does that generate a bunch of crashes, it performs horribly as well. We have a new under-the-hood audio library in beta now for Android that is free of OpenAL and it’s call compatible with audio.* so when it’s released you should just need to rebuild and resubmit and wait on your community to update.  But it’s still got bugs we are working through and we don’t want to make it worse until we are comfortable that it’s ready to go.  

I can assure you we are working to get these addressed. As far as hardening Lua, I’m not sure what we can do there. We always look to harden Corona and we use plugins already to help maintain the integrity of the engine. But your device operating systems are a moving target and we always have to be agile to adapt to these changes. 

Rob

Thank you for the update Rob, really appreciate it!

I don’t speak for everyone but I guess one of our main concerns also is how fast can we deliver the fix to our players. Do you have any ETA on the fixes so that we can at least communicate this to our players?

In my case, my game is on Google Play’s Early Access and I’m in the process of coordinating my game launch with them but Google will not feature it unless I trim down the ANRs and crashes.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@Rob, thank you for the complete explaination. I understand your point about how hardening Lua could actually create a new behavior, making it worse. As for the widget.lua line 27, can you tell me if the new version in development will resolve this issue that nick_sherman and I both have experienced?

Here is a list of plugins that have been fixed: Revmob, AdColony, InMobi, Facebook Audience Network (Trying to determine if it’s the rev-share version, paid version or both), Unity Ads and Chartboost.

If you’re using these, it might be worth producing a release.

Rob