Problem downloading large files via cellular network using network.download() (especially on Android)

@ Josh - thank you so much for the quick response. Didn’t mean to ignore you, we have just been running some more tests to better understand what may be happening. And it’s still been rather confusing.

Yes - you are right, we are using setIdleTimer() to prevent the device from going to sleep.

In the tests we ran today (at least on my test device), if I put the device to sleep (push the power button, or push the home button and wait for the device to go to sleep), wait a while, and then wake the device back up, the callbacks actually do come back! It is almost as though the OS is holding those callbacks for my app until my app wakes up again. So, at least based on this test, the problem is not caused by the device going to sleep…

Is it possible the behavior varies based on the OS version and device? I am running an HTC Inspire 4G with Android 2.2.1.

There is one spot in my house where the signal is weaker, and that’s where I seem to get most of my failures. I am away from the house today, but once I have a chance, I will run through the same tests there and let you know what I find.

@ Josh - I ran a bunch more tests from my house, where we get a weaker signal, and found that, indeed, sometimes we don’t get the “ended” callback from the API. I did make sure the device didn’t sleep (with setIdleTimer()). Also, I didn’t hit the home button, power button, or anything else that could have caused the device to become suspended. I confirmed through our logging that the app at no point went into suspend.

Here is an abbreviated (and somewhat masked) trace from our log of the responses we got from network.download().

Basically, it looks like we suddenly stopped getting callbacks before the file download was complete (the total file size should be 1771520). Is it possible that, due to the bad reception, some packets are being dropped, but rather than getting an error, we just don’t get anything at all?

We are running Android 2.2.1 and the latest build 2013.1094. Please let me know if there is more information I can provide to help with this.
 

    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: began url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 0 bytesEstimated: -1
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1024 bytesEstimated: -1
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 3500 bytesEstimated: -1
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 14261 bytesEstimated: -1
    D/dalvikvm(10034): GC_FOR_MALLOC freed 4022 objects / 163384 bytes in 64ms
    <... a lot more of the same .. >
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1231274 bytesEstimated: -1
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
    I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1248023 bytesEstimated: -1

 

Sorry about the late response.

Are you doing about 50 downloads at the same time?  If so, then I’m thinking we’re running into a thread pooling issue.  Threads are a finite resource and we can only spawn so many at once.  I’ve seen people run into this problem with our old network code too where they simply did too many network downloads at once.  For testing purposes, I would suggest that you try downloading up to 2 files at a time, and when one file download completes, download the next file in the chain.

The other possibility is that your Android device has a shaky network connection (my HTC device has that problem too), meaning the connection to the server was lost and communication was aborted.  If this is the case, then in order to make your app more fault tolerant, you may want to implement a “retry” on your end by attempting to download the same file one more time, and if it fails that 2nd time, give up and let the user know.

(From my past personal experience as a Windows desktop app developer, retrying network requests/downloads is common practice to make apps more fault tolerant.  It’s by no means a hack or work-around.  Communication failures are pretty common in my experience.  Especially on wifi networks.)
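For anyone reading along, the retry suggestion above could be sketched roughly like this. This is only an illustration, not Corona's own code: `downloadWithRetry` and `startDownload` are made-up names, and `startDownload(url, onDone)` is assumed to be a thin wrapper you write around network.download() that calls `onDone(event)` once the download has ended or failed.

```lua
-- Sketch: retry a failed download a limited number of times, then give up.
-- `startDownload(url, onDone)` is a hypothetical wrapper. In Corona it could be:
--   network.download(url, "GET", function(event)
--       if event.isError or event.phase == "ended" then onDone(event) end
--   end, "someFile.tar", system.TemporaryDirectory)   -- file name is a placeholder
local function downloadWithRetry(startDownload, url, maxAttempts, onResult)
    local attempt = 0

    local function tryOnce()
        attempt = attempt + 1
        startDownload(url, function(event)
            if not event.isError then
                onResult(true, attempt)       -- downloaded successfully
            elseif attempt < maxAttempts then
                tryOnce()                     -- failed, try the same file again
            else
                onResult(false, attempt)      -- out of attempts: let the user know
            end
        end)
    end

    tryOnce()
end
```

With `maxAttempts` set to 2 this matches the "try once more, then give up" behaviour described above. Note that it only helps when the listener is actually called with isError set; it does nothing for a download whose listener never fires.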

@ Josh thanks for the quick response! To your questions, we have already implemented a “download queue” concept where we limit our downloads to 2 concurrent file downloads at once. We also already have a retry mechanism in place where if a particular file download fails, we try again.

To elaborate, we currently recognize a download error if we get a response where event.isError == true. But since we don’t get a response callback at all, our code assumes that we are still waiting for a response, so it hangs rather than recognizing the failure and retrying.

I agree that it appears my Android device has a shaky network connection (I have an HTC too). In that situation, what do you recommend as the best way to determine if communication was aborted (given that we don’t get back a response at all)? We have thought about using a timer that basically waits 15 seconds (or some other interval), and if we don’t get a progress response within that time, then assume the communication was aborted. But our question is what interval to choose, and what if we are wrong? Say we assume the communication has been aborted when it hasn’t, so we start a new download… wouldn’t that cause a problem?

Really appreciate your insight!
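The timer idea in the post above could look something like the sketch below. It is a guess at one way to do it, not a confirmed fix: `newWatchdog` and its methods are hypothetical names, the 15-second value is just the figure from the post, and the clock is passed in so the logic does not depend on Corona. The "what if we are wrong?" case is handled by marking the attempt as abandoned, so that any late event from the old request can be ignored.

```lua
-- Sketch: detect a download that has silently stopped producing events.
-- `now` is any function returning the current time in milliseconds
-- (in Corona, system.getTimer would fit). `onStall` is called at most once.
local function newWatchdog(stallMs, now, onStall)
    local w = { lastActivity = now(), finished = false }

    -- Call from the network listener on every event for this attempt.
    -- Returns false if the attempt was already abandoned or finished,
    -- so the listener can ignore late events from an old request.
    function w.touch()
        if w.finished then return false end
        w.lastActivity = now()
        return true
    end

    -- Call when the download ends normally so later checks do nothing.
    function w.finish()
        w.finished = true
    end

    -- Call periodically. Reports a stall if nothing arrived for stallMs.
    function w.check()
        if w.finished then return end
        if now() - w.lastActivity >= stallMs then
            w.finished = true
            onStall()
        end
    end

    return w
end

-- Possible Corona wiring (assumption, not tested against the bug in this thread):
--   local w = newWatchdog(15000, system.getTimer, function()
--       network.cancel(requestId)   -- requestId returned by network.download()
--       -- then retry, ideally into a different temporary file name
--   end)
--   local checker = timer.performWithDelay(1000, w.check, 0)  -- 0 = repeat forever
--   -- remember to timer.cancel(checker) once the download finishes or stalls
```

Progress events only arrive if the download was started with `progress = "download"` in the params table, so without that the watchdog would only see the final event. Cancelling the old request before retrying is what keeps two downloads of the same file from running at once, assuming the cancel itself is not affected by the hang.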

Oh I see.  Your network Lua listener is sometimes not getting called at all, right?

If that’s happening, then that must be a bug on our end.

I do know that we are correctly calling the Lua listener with isError set to true if the connection suddenly breaks in the middle of a download.  For example, if you run away from your wifi network in the middle of the download.  So, I don’t think this is the cause.

Now, I did discover in our release build that there was a threading deadlock issue in our Android network code.  It can sometimes occur if 2 network requests/download/updates are happening at the same time, causing all network threads to hang.  What you are describing sounds like this particular issue.  We fixed this issue in daily build #1082.  Are you saying you still see this issue in build #1094?

@Josh -

  I am running #1094. But based on your description, I am not sure it’s the same deadlock issue. Because what we see is that, of the 2 concurrent downloads we are doing, only one of them will fail to get a response back, while the second one will get a response back…

  Also, I have tested that if I manually kill the network through settings, we do get an isError=true response.

  This particular issue appears to be related directly to network quality.

Thanks,

Andrew

Hmm… okay, well, I’ll raise this issue at the next team meeting here.  The hard part is figuring out how to reproduce this issue on our end.  If you can send us a test project that can reproduce this issue, then that will help expedite this.

I still want to believe that this is a deadlock issue.  Would you mind doing one more test please?  Try doing only 1 download at a time instead of 2.  Also, try removing “progress” support.  This would help us isolate this issue… and possibly find a work-around for this issue.  And if it still fails, then does it always fail after a certain number of downloads?

@Josh -

I have been running lots of testing based on your feedback. And this is what I am finding:

  • If I reduce the number of concurrent downloads to 1 file, the download appears to succeed consistently, regardless of whether progress is turned on. I’ve tried this 3 times so far and it succeeded each time.

  • If I allow up to 2 concurrent downloads, then I get an error about 50% of the time, again regardless of whether progress is turned on.

I am still running more tests to see if I can figure out more about the way the errors are happening. I’ll share once I have more data.

In terms of sharing the code, I’ll have to see if there is a way to do so. It’s part of a much bigger and complicated system, so it’s hard to isolate the code…

@Josh - We did some more testing (testing takes a while since a full download on my poor network takes a long time). Here are our observations:

If we do 2 concurrent downloads, we consistently run into issues. We tend to get more event.isError == true. We also get instances of corrupted downloads. Both of these are addressable. The real problem is that we sometimes will all of a sudden stop getting events for a particular download. We are not even getting the isError == true event to indicate that a timeout has happened. So, we don’t know if the download has finished or if we are still waiting for a download. So, our code hangs…

Is this the deadlock issue you are referring to? We are also thinking we may be able to mitigate it on our end by monitoring the “progress” events: if we notice that we have suddenly stopped getting events for more than 60 seconds (the timeout interval), we would assume that a timeout has happened.

Do you think this is a good idea? Or is this something to be addressed on your end?

Interestingly, I noticed that if we suspend the app, all of the events appear to be queued. And when we resume the app, all of the events (including all the “progress” updates) are suddenly delivered to my handler. At first we thought suspension might be causing the dropped events, but it doesn’t look like the culprit…

Thanks for your help.

Yeah, this definitely sounds like a deadlock issue, where 2 network threads are stuck in a Java synchronized block.  We’ll have to look into that issue, because it sounds like we haven’t fully resolved it.

But the good news is that there is a work-around by only doing 1 download at a time.

If you don’t mind, can you write up a bug report for this issue please?  You can do so by clicking the “Report a Bug” link at the top of this web page.  This way it’ll get into our queue and we can notify you when it is fixed.  Make sure to reference this forum thread since we have a lot of nice details posted here.  Thanks!
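As an illustration of the "1 download at a time" work-around, here is a rough sketch of a small download queue with a concurrency limit. The names (`newDownloadQueue`, `startDownload`) are hypothetical, and `startDownload(url, onDone)` is assumed to be your own wrapper around network.download() that calls `onDone(event)` when the download ends or fails.

```lua
-- Sketch: run at most `maxConcurrent` downloads at once, starting the next
-- queued download as soon as one completes.
local function newDownloadQueue(maxConcurrent, startDownload)
    local pending = {}   -- jobs waiting to start, in FIFO order
    local active = 0     -- number of downloads currently in flight
    local q = {}

    local function pump()
        while active < maxConcurrent and #pending > 0 do
            local job = table.remove(pending, 1)
            active = active + 1
            local done = false
            startDownload(job.url, function(event)
                if done then return end   -- ignore a duplicate completion
                done = true
                active = active - 1
                job.onDone(event)
                pump()                    -- start the next download in the chain
            end)
        end
    end

    -- Queue a download; it starts immediately if a slot is free.
    function q.add(url, onDone)
        pending[#pending + 1] = { url = url, onDone = onDone }
        pump()
    end

    return q
end
```

Creating the queue with `newDownloadQueue(1, startDownload)` gives the one-at-a-time behaviour. A slot is only freed when `onDone` is called, so a download whose listener never fires would block the queue unless it is combined with some kind of stall timeout.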

I’ve just submitted a bug report, case #23195

I want to check in to see if there is any update on this issue? Thx!

Not yet.  We have other issues in the queue that we have to take care of first.  The bug you submitted is still on our list, but not scheduled in yet.

Joshua -

   I want to check in to see if this has been addressed. Also, is there a way we can check bug status in the bug tracker directly? I can’t seem to find such a link.

Happy New Year!

Andrew

Unfortunately no.  We haven’t had time to address this issue yet.  We’ve been more focused on perfecting our new Graphics 2.0 rendering system.
