Problem downloading large files via cellular network using network.download() (especially on Android)

We have been running into this issue for a while, and we have now reproduced it on the new API as well. I would love to hear whether others are seeing the same thing, and whether the Corona guys can shed some light.

Basically, one feature of our app downloads upwards of 50 tar files of 1-3 MB each. These files are hosted on Google App Engine and fronted by a CDN. To avoid bogging down the device, we typically download only 2-3 files concurrently.

What we have found is that on Wifi or on iOS, the downloads go through with no problem. But on Android, when downloading over a cellular data network (we test on AT&T’s network) or a lower-end Wifi provider, we frequently run into download issues.

Many times, networkResponse.isError simply returns true. Other times, the downloaded file fails its checksum. We have received so many customer complaints about this that it is a really serious problem for us. A couple of other notes:

  • We do prevent the device from going to sleep, to avoid issues with “exit/suspend”. We also track progress in a file.

  • While we understand that this is a last-mile problem with a flaky network, we are still able to download apps from the Google Play Store consistently. So what I don’t understand is whether Corona’s network API uses a different stack than Google Play, or whether Google Play has implemented other download “tricks”.

Anyone else with similar issues? Any suggestions from the Corona Team?

My first gut feel is that maybe it’s the concurrent downloads and Android not playing nicely.  Is there a way you can test doing just one at a time?  I’ll ping the engineers and see if they have any thoughts.

Hi Rob -

   Thanks for the quick follow up. I actually haven’t tried that before. Let me try that real quick and let you know. I really appreciate any help you can lend.

Thx!

Andrew

Hi Rob -

    I just want to let you know that reducing the download to one file at a time doesn’t solve the problem. I am still getting the same kind of error. Basically, event.isError = true, event.response = nil, and event.url = requestURL (correctly).

    Things work fine on my Wifi (using cablemodem).

Thanks,

Andrew

@Rob - I am sure you are quite busy, but I just want to check in to see if there is any update on this issue? thx.

I’m thinking that this issue only happens when your Android device goes to sleep, as in the display turns off and the device goes into a lower power mode, suspending all background processes. In that case, our download process stops receiving responses from the server.

You said that you are preventing the device from going to sleep.  How exactly are you doing that?  Via our system.setIdleTimer() function?

http://docs.coronalabs.com/api/library/system/setIdleTimer.html

That function will prevent the device from going to sleep, but only if the app is in the foreground and displayed onscreen… as in, do not turn off the display with the power button.
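
For reference, a minimal sketch of how the idle timer might be toggled around a batch of downloads. system.setIdleTimer() is the documented call linked above; the wrapper name setDownloadsActive is hypothetical and only there for illustration.

```lua
-- Hypothetical helper: keep the device awake only while downloads are running.
-- system.setIdleTimer(false) disables the idle timer (the device stays awake);
-- system.setIdleTimer(true) restores the normal sleep behavior.
local function setDownloadsActive(active)
    system.setIdleTimer(not active)
end

-- Usage: call setDownloadsActive(true) before queuing the first download
-- and setDownloadsActive(false) once the last one has finished.
```

As noted above, this only helps while the app stays in the foreground with the display on.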

@ Josh - thank you so much for the quick response. Didn’t mean to ignore you, we have just been running some more tests to better understand what may be happening. And it’s still been rather confusing.

Yes - you are right, we are using setIdleTimer() to prevent the device from going to sleep.

In the tests we ran today, at least on my test device, if I put the device to sleep (push the power button, or push the home button and wait for the device to sleep), wait a while, and then wake the device back up, the callbacks actually do come back! It is almost as though the OS holds those callbacks for my app until the app wakes up again. So, at least based on this test, the problem is not caused by the device going to sleep…

Is it possible the behavior varies with the OS version and device? I am running an HTC Inspire 4G on Android 2.2.1.

There is one spot in my house where the signal is weaker, and that’s where I seem to get most of my failures. I am away from the house today, but once I have a chance, I will run through the same tests there and let you know what I find.

@ Josh - I ran a bunch more tests from my house, where we get a weaker signal, and found that, indeed, sometimes we don’t get the “ended” callback from the API. I made sure the device didn’t sleep (with setIdleTimer()). I also didn’t hit the home button, power button, or anything else that could have suspended the device. I confirmed through our logging that the app at no point went into suspend.

Here is an abbreviated (and somewhat masked) trace from our log of the responses we got from network.download().

Basically, it looks like we suddenly stopped getting callbacks before the file download was complete (the total file size should be 1771520). Is it possible that, due to the bad reception, some packets are dropped, but rather than getting an error we simply get nothing?

We are running Android 2.2.1 and the latest build 2013.1094. Please let me know if there is more information I can provide to help with this.
 

I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: began url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 0 bytesEstimated: -1
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1024 bytesEstimated: -1
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 3500 bytesEstimated: -1
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 14261 bytesEstimated: -1
D/dalvikvm(10034): GC_FOR_MALLOC freed 4022 objects / 163384 bytes in 64ms
<... a lot more of the same .. >
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1231274 bytesEstimated: -1
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() isError: falsephase: progress url: http://foobar.com/trains_citytrain.tar response: nil
I/Corona ( 9440): DEBUG: serverUtil.performFileDownloadResponse() bytesTransferred: 1248023 bytesEstimated: -1
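
Since bytesEstimated is -1 throughout the trace, the app cannot lean on it to tell whether a file arrived in full. A rough sketch of a completeness check, assuming the app already knows the expected size of each file (for example from its own manifest) and that bytesTransferred on the final event reflects the bytes received. The function name isCompleteDownload and the expectedBytes parameter are hypothetical.

```lua
-- Hypothetical check, called from the network.download() listener.
-- expectedBytes must come from the app itself; the event does not supply it
-- when bytesEstimated is -1.
local function isCompleteDownload(event, expectedBytes)
    return not event.isError
        and event.phase == "ended"
        and event.bytesTransferred == expectedBytes
end
```

With the numbers above, an "ended" event reporting 1248023 bytes against an expected 1771520 would be rejected. This does nothing for the case where the "ended" event never arrives; it only catches short files that are reported as finished.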

 

Sorry about the late response.

Are you doing about 50 downloads at the same time?  If so, then I’m thinking we’re running into a thread pooling issue.  Threads are a finite resource and we can only spawn so many at once.  I’ve seen people run into this problem with our old network code too where they simply did too many network downloads at once.  For testing purposes, I would suggest that you try downloading up to 2 files at a time, and when one file download completes, download the next file in the chain.

The other possibility is that your Android device has a shaky network connection (my HTC device has that problem too), meaning the connection to the server was lost and communication was aborted.  If this is the case, then to make your app more fault tolerant, you may want to implement a “retry” on your end: attempt to download the same file one more time, and if it fails that 2nd time, give up and let the user know.

(From my past personal experience as a Windows desktop app developer, retrying network requests/downloads is common practice to make apps more fault tolerant.  It’s by no means a hack or work-around.  Communication failures are pretty common in my experience.  Especially on wifi networks.)
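
The two suggestions above (at most 2 downloads in flight, one retry per file) could be combined roughly as follows. This is only a sketch: enqueueDownload, the job table, and the onDone(ok, job) callback are hypothetical names, the constants are assumptions taken from the numbers in this post, and HTTP status checks are left out.

```lua
local MAX_CONCURRENT = 2   -- assumption: "up to 2 files at a time"
local MAX_ATTEMPTS   = 2   -- first try plus one retry

local pending = {}         -- jobs waiting to start
local active  = 0          -- requests currently in flight
local pump                 -- forward declaration, defined below

local function startJob(job)
    active = active + 1
    job.attempts = job.attempts + 1
    network.download(job.url, "GET", function(event)
        local finished = false
        if event.isError then
            if job.attempts < MAX_ATTEMPTS then
                table.insert(pending, job)   -- queue the retry behind the others
            else
                job.onDone(false, job)       -- give up; caller can tell the user
            end
            finished = true
        elseif event.phase == "ended" then
            job.onDone(true, job)
            finished = true
        end
        if finished then
            active = active - 1
            pump()                           -- start the next file in the chain
        end
    end, {}, job.filename, system.DocumentsDirectory)
end

pump = function()
    while active < MAX_CONCURRENT and #pending > 0 do
        startJob(table.remove(pending, 1))
    end
end

local function enqueueDownload(url, filename, onDone)
    table.insert(pending, { url = url, filename = filename, onDone = onDone, attempts = 0 })
    pump()
end
```

Note that a queue like this only moves forward when the listener is called, so a request that never reports back will hold its slot indefinitely.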

@ Josh thanks for the quick response! To your questions, we have already implemented a “download queue” concept where we limit our downloads to 2 concurrent file downloads at once. We also already have a retry mechanism in place where if a particular file download fails, we try again.

To elaborate, we currently recognize a download error when we get a response where event.isError == true. But since we sometimes don’t get a response callback at all, our code assumes that it is still waiting for a response, and so it hangs rather than recognizing the failure and retrying.

I agree that my Android device appears to have a shaky network connection (I have an HTC too). In that situation, what do you recommend as the best way to determine that communication was aborted (given that we don’t get back a response at all)? We have thought about using a timer that waits 15 seconds (or some other interval) and, if we don’t get a progress response within that time, assumes the communication was aborted. But our question is what interval to choose, and what if we are wrong? Say we assume the communication was aborted when it wasn’t, and we start a new download… wouldn’t that cause a problem?
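
One way the timer idea could look, as a sketch only. It assumes that network.cancel() is available in the build being used and that progress events are requested with params.progress = "download". The name startWatchedDownload, the stallMs parameter, and the onDone(ok, reason) callback are hypothetical. Cancelling the old request before reporting the stall is meant to address the "what if we are wrong" case, so that a retry does not run alongside the original request.

```lua
-- Sketch: treat a download as failed if no event arrives for stallMs milliseconds.
local function startWatchedDownload(url, filename, stallMs, onDone)
    local finished = false
    local lastActivity = system.getTimer()   -- milliseconds since app launch
    local requestId, watchdog

    local function finish(ok, reason)
        if finished then return end          -- report each download only once
        finished = true
        if watchdog then timer.cancel(watchdog) end
        onDone(ok, reason)
    end

    -- Poll once a second; 0 iterations means "repeat until cancelled".
    watchdog = timer.performWithDelay(1000, function()
        if system.getTimer() - lastActivity > stallMs then
            network.cancel(requestId)        -- stop the old request before any retry
            finish(false, "stalled")
        end
    end, 0)

    requestId = network.download(url, "GET", function(event)
        if finished then return end          -- ignore events that arrive after a stall
        lastActivity = system.getTimer()     -- any event counts as a sign of life
        if event.isError then
            finish(false, "error")
        elseif event.phase == "ended" then
            finish(true)
        end
    end, { progress = "download" }, filename, system.TemporaryDirectory)
end
```

The interval is still a guess. A larger stallMs trades slower failure detection for fewer false alarms, and downloading to a temporary file name that is only renamed after verification may reduce the damage if the guess is wrong.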

Really appreciate your insight!

Oh I see.  Your network Lua listener is sometimes not getting called at all, right?

If that’s happening, then that must be a bug on our end.

I do know that we are correctly calling the Lua listener with isError set to true if the connection suddenly breaks in the middle of a download.  For example, if you run away from your wifi network in the middle of the download.  So, I don’t think this is the cause.

Now, I did discover in our release build that there was a threading deadlock issue in our Android network code.  It can sometimes occur if 2 network requests/download/updates are happening at the same time, causing all network threads to hang.  What you are describing sounds like this particular issue.  We fixed this issue in daily build #1082.  Are you saying you still see this issue in build #1094?

@Josh -

  I am running #1094. But based on your description, I am not sure it’s the same deadlock issue, because what we see is that, of the 2 concurrent downloads we are doing, only one fails to get a response back, while the second one does get a response…

  Also, I have tested that if I manually kill the network through settings, we do get an isError=true response.

  This particular issue appears to be directly related to network quality.

Thanks,

Andrew

Hmm… okay, well, I’ll raise this issue at the next team meeting here.  The hard part is figuring out how to reproduce this issue on our end.  If you can send us a test project that can reproduce this issue, then that will help expedite this.

I still want to believe that this is a deadlock issue.  Would you mind doing one more test please?  Try doing only 1 download at a time instead of 2.  Also, try removing “progress” support.  This would help us isolate this issue… and possibly find a work-around for this issue.  And if it still fails, then does it always fail after a certain number of downloads?

@Josh -

I have been running lots of testing based on your feedback. And this is what I am finding:

  • If I reduce the number of concurrent downloads to 1 file, the download appears to succeed consistently, regardless of whether progress is turned on. I’ve tried this 3 times so far and it succeeded each time.

  • If I allow up to 2 concurrent downloads, then I get an error about 50% of the time, again regardless of whether progress is turned on.

I am still running more tests to see if I can work out more about how the errors happen. I’ll share once I have more data.

In terms of sharing the code, I’ll have to see if there is a way to do so. It’s part of a much bigger and complicated system, so it’s hard to isolate the code…

@Josh - We did some more testing (testing takes a while since a full download on my poor network takes a long time). Here are our observations:

If we do 2 concurrent downloads, we consistently run into issues. We tend to get more event.isError == true responses, and we also get instances of corrupted downloads. Both of these are addressable. The real problem is that we sometimes suddenly stop getting events for a particular download. We don’t even get the isError == true event to indicate that a timeout has happened. So we don’t know whether the download has finished or whether we are still waiting for it, and our code hangs…

Is this the deadlock issue you are referring to? We are also thinking we may be able to mitigate it on our end by monitoring the “progress” events: if we stop getting events for more than 60 seconds (the timeout interval), we assume that a timeout has happened.

Do you think this is a good idea? Or is this something to be addressed on your end?

Interestingly, I noticed that if we suspend the app, all of the events appear to be queued, and when we resume the app, all of them (including all the “progress” updates) are suddenly delivered to my handler. At first we thought suspension might be causing the dropped events, but it does not look like the culprit…

Thanks for your help.