com.ansca.corona.CoronaService ANR affects 20k users

@troy: yes I understand you were also referring to to widget.lua which i agree might also be causing the ANRs, i was actually just reacting to the display:remove and calculating with nil errors you brought up. These ANRs are freaking us all out so bit stressed at the moment myself :slight_smile:

@jacques1 - I understand, no worries. And I’m sure that Rob understands, too.

The widget library is available on github isn’t it? Pretty sure I was seeing these issues a few years ago and put in the display.remove fix and nil checking in a lot of places myself.

https://github.com/coronalabs/framework-widget

I also wrapped a lot of calls like group:insert or checking display object positions in functions with built-in checking for certain properties, because I found maybe 1 in 1000 times a display object would exist (i.e. is not nil) but was in the process of being removed and was no longer a ‘proper’ display object. I found it odd because it was like a race condition but Corona is single-threaded…

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@rick_sherman, thanks so much for reaching out! You just described fixing the exact issue we are experiencing - be it rare, it still happens. So, when I create a build, why doesn’t Corona include your fixes? Or are there still more issues in widget.lua that we happen to be experiencing? If it helps, this problem seems more prevelent when we have more than one instance of widget.lua included.

With that github repository you can download the .lua files and replace the built in widget library with your own versions, fixing any bugs/rare crashes you find. The readme’s in the repo detail how to do this.

There’s some more details here:

https://forums.coronalabs.com/topic/34183-widgets-20-open-source

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@everyone - any thoughts?
@rick_sherman, the latest rare-occuring widget.lua error is - Runtime Error: /Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget.lua:27: attempt to index field ‘?’ (a function value)

Looking at widget.lua in your git shows, “if group[i]._isWidget then” - which tells me that this object/group is nil, thereby causing the error. So I have two questions.

  1. are you saying that I don’t need Enterprise, that I can just copy and paste the widget folder into my project?
  2. what can be done to fix this? If it were me, I would change it to, “if group and type(group)==“table” and #group>=i and group[i] and group[i].x then” - such code would verify that a ) group was not nil; b ) that group is the proper type; a variable; c ) that the group array length is greater than or equal to i, d ) that the value in group[i] is not nil and e ) that the value in group[i] is a display object or group. I know this seems rediculous to some, but how else can you fix a bug like this with certainty? I don’t know of any other way.

Anyone, thoughts?

Not sure if that repo is 100% up to date.

Looks like a good belt and braces solution to me, these rare crashes were driving me crazy because they were so difficult to reproduce, but I’d see them come through on the logs. When you’ve got lots of users moving between scenes, creating and removing widgets within each scene a 1 in 1,000 shot adds up.

I’d replace any removeSelfs with display.remove as well.

Can someone please clarify which exact widgets are being discussed, or is it all the widgets? And does it only occur when using the built-in corona scene manager calls?

I only use the scrollview widget and not any type of scene manager calls. I have never been able to get the display.remove to properly clear/remove scrollviews, so I always code clearing/removing scrollviews manually, which may seem like overkill but at least I can then be sure it is completely removed if I jump to another section/function in my app.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@nick_sherman - thanks again for helping with this.

  1. Is the link you provided to the current version of widget.lua? And if not, can I get it?
  2. if its different from the version of widget.lua that you fixed with link above that one, could I compare the code of that used by Corona with your changes to update a new version, and then fix line 27 like above?

I’m still concerned that if we have to modify bugs that cause CRASHES in plug-in code, this should be where the Corona Team jumps in and says, “Ah, got it guys. Fixed in the next build.” right? Of course, I know they are hard at work finishing HTML 5. So, the trade-off decisions are difficult.

But line 27 identified above, requires a solution that Corona can’t fix, right? After all, comparing all of those values would require changes in the LUA interpreter, not just the Corona engine - so, it’s complicated.

Overall, I really think Corona needs to develop a “STRATEGY” and attack these issues. These CRASH issues occur in many different places for many different developers, and only rarely. Accordingly, Corona needs to be “hardened” IMHO. From the simple to the complex, across the board. I’m sure they themselves can create an entire list of issues of when the developer’s intent is clear, but either A ) LUA or B ) the Corona engine, or C ) a Corona plugin fail. And I’d be more than happy to participate in an online meeting, and likely many others impacted by the CRASH PROJECT, if Corona has interest.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@jacques - one of two crashing issues I have is in line 27 of widget.lua, likely pertaining to my use of scrollView. However, thse kinds of issues run throughout widget.lua and likely through other plugins, too - nick_sherman, above, had the same problem and chose to significantly harden (protect/strengthen) his version of widget.lua to solve this, and it worked for him.

The real difficulty here for Corona is that these bugs are quite rare. Only those of us pushing the engine hard are likely to see them, and even still, rarely. But as I wrote above, these issues are indicative of the fact that LUA, Corona engine and or Corona plugins are not like the killer-cant-crash code we want and desire. A strategy to harden all 3 are necessary to truly eliminate 99% of crashing bugs. Or course, the plugins should have the priority, IMHO.

I also found places in the individual widget files, like scrollView.lua or tableView.lua that benefitted from extra protection against nil references.

I’ll see if I can dig out the files, but I only ever used scrollView and tableView and it was a few years ago so I couldn’t tell you where the changes are!

Also, once you’ve gone down this road you will never get any future updates.

Indeed, but faced with crashes I had no control over happening to paying customers in that moment, I decided to just take ownership of the widgets code and if it crashed from then on, it was my problem.

Black boxes = bad!

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@SGS, you are referring to Corona Team updates to the “widget” plugin, right?

Your comment, “once you’ve gone down this road you will never get any future updates”

…is exactly why Corona needs to harden it for all of us.

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@Rob, even if the team were to fix line 27 of widget.lua, that could result in another similar error, too - so what’s the chance that this can be done, and soon? Because we’re not talking about new feature development, simply hardening, might this be something one of the team can work on that is not dedicated to HTML 5, where most are likely focused?

Also, any insight as to staffing and platforms would be interesting, although I respect it could be confidential and you are unable to share.

Hey, guys!

There are a lot of valid concerns voiced in this discussion. Unfortunately there are bugs in widget.lua. For example, line 27 referred in previous post is following:

local cached\_displayNewGroup = display.newGroup

Error “widget.lua:27: attempt to index field ‘?’ (a function value)” would mean that there is defined function “display” in global namespace, before require "widget". Note, that functions declared in lua like:

function display() print "blah" end

would overwrite Corona’s display library in global scope. You can use local function display()...end, at least it wouldn’t bork global scope, only local. Still, best is not to use “display” as your function or variable name.

From what I see, on iOS, removing invalid object would result in no-op in device build. They’re just ignored. In Simulator, they will produce error message box.

Lastly: we have widget library in open source, so anyone can include it. Just place widget.lua into root of your project. Problem is that it is quite outdated. We will update widget repo soon.

Unfortunately, it is not as simple as just placing widget.lua in your root.  You need to download the entire widget library (https://github.com/coronalabs/framework-widget/tree/master/widgetLibrary) and there are quite a few internal references that need updating.  Also the source is 2 years old.

I would really appreciate a fix for this in current widgets library

/Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget\_tableview.lua:1266: attempt to index field 'parent' (a nil value) stack traceback: at (Unknown Function) (/Users/jenkins/slaveroot/workspace/Templates/label/android/subrepos/widget/widgetLibrary/widget\_tableview.lua:1266) at (Unknown Function) ((tail call)) at ? (?:0) at (Unknown Function) (?:0)

And to add to this, part of the error handling you don’t like @Troy is Lua, not Corona. While we have an event that can help you trap some runtime errors in real time (see; https://docs.coronalabs.com/api/event/unhandledError/index.html), they are still errors.

Lua works in “chunks”. For instance all of main.lua is a chunk.  The guts of a for loop or function are a chunk. When a runtime error occurs, it aborts the chunk the error happened in and no further code will execute.  This may sound like a desirable outcome, but it’s not. Your code could be so borked from that function not completing properly, it gets into a worse state than crashing. It simply stops responding or corrupts data, all which can be interpreted by the app’s user far worse than a crash would.  Now in Google’s new “let’s rank you on how you perform” model, it probably wouldn’t do much as instead of getting a crash report, you’re going to get an ANR which honestly seems worse than a crash report.

Please note, if you do implement unhandledError events, make sure to read the docs. There are some gotchas about how the messages are reported and you may find it only partially useful.

Masking errors isn’t getting rid of them, it’s just hiding them which frankly as a developer I find more frighting.

Rob

(Updated: This issue RE: runtime error caused by widget.lua has been moved to https://forums.coronalabs.com/topic/72782-runtime-error-caused-by-widgetlua/))

@Rob,

  1. The reason I’m not using the unhandledError events is because I need reports to go to Google, and as you stated above, there are gotchas about how the messages are reported. Fact is, we already know that line 27 of widget.lua is a problem for me and line 1266 in widget_tableview.lua is a problem for SGS above. Are these the only two line number error reports in widget that you’ve seen reported? Or do you have a long list of them collected in the past 2 years?

  2. I understand that some issues are Lua. Are you saying that the Corona Team is using a Lua interpreter and have never had an occassion to modify it when necessary?

If the Corona Team wasn’t busy at the moment, I have to believe these CRASH issues would have the highest of priority. if you agree, than this is not a matter of ‘should’ Corona Team fix it. It’s just a matter of priorities, am I right?

Do you agree that ‘hardening’ 1) Lua, 2) Corona engine; and 3) Corona plugins would be a good idea to preserve the integrity of the engine?

And lastly, I’ve noticed CRASH issues increasing with Android 7 and 8. Do you have any idea why, or are my observations circumstancial?

I’ll let @Rob answer from a Corona perspective but @troy where do you get “do not use unhandled error event” from?

I always use this to send the errors to my DB server.  AFAIK this has no bearing on what you see in Android vitals and ANR/crash logs.

Now saying this, it is responsible coding to check for valid data before trying to perform an action on it - things like “do you have a valid response to a network.request() before trying to process the data”, etc. 

I also use liberal use of pcalls if I suspect a code snippet is likely to fail.  It is not as slick as a try…catch…finally block which I so wish was available in lua.