Closed franciscodr closed 8 years ago
@raulraja @noelmarkham @diesalbla Javi, Fede and I have detected some issues related to the performance of the backend apps while trying to get info of packages from Google Play.
We have summarized these issues here and suggested some possible improvements that would fix these issues.
We would like to know your thoughts about this proposal. Does it make sense? Are we missing something about the initial decisions about Redis and the behaviour of NineCards Backend GP app?
Thanks!
Nice write up @franciscodr, and I think your suggestions are great.
I like changing the keys; hopefully that change should "just work" as the Redis serialization I wrote should be able to convert pretty much any type into JSON with no extra work.
It makes sense to me that you should only have one key for a particular package in Redis: we don't care where the package has come from, just that we have the metadata for that package - if it came from the API, great; if it came from the web, great too.
If possible, I would argue against having a blacklist for apps - the fewer manual interventions we have, the better. One idea (not sure if it's a good one) might be to have some kind of exponential backoff for failed packages, so that the more it fails, the longer we can wait before trying again? Then the ones that will never be there (camera, Samsung etc) will eventually have such a long time between checks that it's effectively a blacklist. Like I said, that's just a thought, might be a bit crazy.
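A rough sketch of how such an exponential backoff could be computed. The base delay, the cap, and the function names are assumptions made for illustration; nothing in this thread fixes those values.

```python
from datetime import datetime, timedelta

# Hypothetical tuning values; the thread does not specify any.
BASE_DELAY = timedelta(minutes=5)
MAX_DELAY = timedelta(days=30)
MAX_DOUBLINGS = 20  # keeps the multiplication far away from timedelta overflow


def next_attempt(attempts: int, last_attempt: datetime) -> datetime:
    """Return the earliest time a failed package should be retried.

    The delay doubles with every failed attempt and is capped at MAX_DELAY.
    """
    doublings = min(max(attempts - 1, 0), MAX_DOUBLINGS)
    delay = min(BASE_DELAY * (2 ** doublings), MAX_DELAY)
    return last_attempt + delay


def should_retry(attempts: int, last_attempt: datetime, now: datetime) -> bool:
    """True once the backoff window for this package has elapsed."""
    return now >= next_attempt(attempts, last_attempt)
```

With a cap like this, a package that keeps failing ends up being checked about once a month, which behaves much like a blacklist entry without manual intervention.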
Maybe a static blacklist isn't our best option... I think we need a categorized list of apps that we are sure aren't on Google Play; then we'll obtain 2 things:
For example, Samsung Camera isn't on Google Play, but it's important that this app appears among your Photography apps.
Regardless, we should have a dynamic blacklist, and I think Noel's first proposal is good.
In our tests, if the user has 150 apps installed on their phone, 8 or 9 of them aren't on Google Play. When we search for apps by scraping, every app takes ~3 seconds... In this case, the endpoint takes ~25 seconds more to categorize.
I have 15 apps that aren't on Google Play and I get a timeout error when I want to categorize my apps :-(
When we search for apps by scraping, every app takes ~3 seconds
But this is only for the first attempt, no? A second query would be quicker because the succeeded or failed result would be in Redis? So once we have a critical mass of apps in Redis (failed and succeeded) it would be a lot faster? If this is not the case, we should evaluate if Redis is adding value.
@noelmarkham yes, it's only for the first attempt. But if we stop caching errors in order to avoid, for example, other users getting errors because of a temporarily invalid response from Google Play, we'll always go to the website to get the information.
I think caching errors is good when the user wants an app that isn't on Google Play, but it's dangerous if the user has problems with their account or Google has problems with its site (for example, yesterday Google Drive wasn't working, and it's possible that Google will have problems with its unofficial API in the future).
Would these concerns be removed with exponential backoff? One off problem: check again soon, bigger issue: a long time between checks. Perhaps if we can also look at "user issues" vs "server issues" then that might help too?
That sounds good @noelmarkham. We are going to think about this idea and propose a new approach based on it.
Thanks for your help!
@javipacheco One error in particular from the unofficial API, the 429 Too many requests one, could be solved if we assigned the quotas to each user.
Context
Currently we have two endpoints to ask for the Google Play info of a package:
- The first one returns the info of an individual package
- The second one allows us to request the Google Play info of a list of packages
After using both endpoints during the integration with the NineCards backend app, a problem has come up related to the second endpoint.
The number of packages included in the list is usually large (no fewer than 50) and the Google Play API is sometimes not able to resolve all of them (if there are a large number of requests in a short time, it returns a 429 Too many requests error). In this case, the process starts to do web scraping for each package that hasn't been resolved yet.
This process downloads the details web page for each package (the mean size of the page is about 300KB) and takes 2-3 seconds to finish. As a result:
- The waiting time is longer, so the user experience isn't entirely good
- Sometimes a timeout error is thrown because the server isn't able to send back a response in an acceptable time.
Redis cache management
The Redis cache will contain only one key per package and could store four types of values, depending on the result of getting the info from Google Play:
- resolved: Packages for which the process of getting the Google Play info has been successfully completed will have this tag.
- pending: If, during the execution of a multi-package request, the Google Play API returns an error (401 Unauthorized or 429 Too many requests), all the packages without info will be tagged as pending. An internal process will try to get the info later.
- error: If the request to get the Google Play information returns a 404 Not Found error, the package will be tagged as error.
- permanent: All the packages with this tag are known applications that are not published in Google Play, but that we can categorize (for instance, the Samsung camera).
The format of the key and value will be:
Key: "com.package.name"
Value:
Example of a resolved item
{
  "type": "resolved",
  "content": {
    "title": "Package title",
    "free": true,
    "icon": "http://lh3.googleusercontent.com/aYbdIM1abwyVSUZLDKoE0CDZGRhlkpsaPOg9tNnBktUQYsXflwknnOn2Ge1Yr7rImGk",
    "stars": 4.5,
    "downloads": "500,000,000 - 1,000,000,000",
    "categories": ["SOCIAL"]
  }
}
Example of a pending item
{
  "type": "pending",
  "content": {}
}
Example of an error item
{
  "type": "error",
  "content": {
    "attemptsNumber": 10,
    "lastAttempt": "2016-08-10T05:00:00.000Z"
  }
}
Example of a permanent item
{
  "type": "permanent",
  "content": {
    "categories": ["SOCIAL"]
  }
}
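To make the proposed value format concrete, here is a minimal Python sketch that decodes and encodes such items. The names CacheItem, decode_item and encode_item are hypothetical and only serve the illustration; the actual backend would rely on its own Redis serialization.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict

# The four item types described in the proposal.
ITEM_TYPES = {"resolved", "pending", "error", "permanent"}


@dataclass
class CacheItem:
    type: str
    content: Dict[str, Any]


def decode_item(raw: str) -> CacheItem:
    """Parse a Redis value in the proposed format and validate its type."""
    data = json.loads(raw)
    if data.get("type") not in ITEM_TYPES:
        raise ValueError(f"unknown item type: {data.get('type')!r}")
    return CacheItem(type=data["type"], content=data.get("content", {}))


def encode_item(item: CacheItem) -> str:
    """Serialize an item back into the proposed JSON shape."""
    return json.dumps({"type": item.type, "content": item.content})
```

Keeping the type inside the value means a single key per package is enough, at the cost of a read-modify-write whenever the type changes.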
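Proposed solution
- The endpoint to get the info of an individual package doesn't change.
- The workflow for getting the Google Play info of a list of packages will change a bit:
[image: google-play-flowchart] https://cloud.githubusercontent.com/assets/1200151/17552242/15becaca-5eff-11e6-92b8-30535df3dbbd.png
The described workflow will be performed for each package of the list:
1. Check if a key exists in Redis for the package and the type of the value is resolved or permanent
   - If the key/value exists, the process will return the stored package info as resolved
   - Otherwise the process will continue in step 2
2. Try to get the package info by using the Google Play API
   - If the API returns a valid response:
     - A new item of type resolved will be created in Redis
     - If a previous error or pending item exists, the type of the item will be changed to resolved and the content will be updated too
     - The process will return the package info as resolved
   - If an error is thrown (like 401 Unauthorized or 429 Too many requests), the process will continue in step 3
3. Check if the package exists in Google Play by requesting only the server headers (faster, and the response is small)
   - If the package exists, the process will continue in step 4
   - Otherwise the process will continue in step 5
4. Check if a key exists in Redis for the package and the type of the value is pending or error
   - If a value of type pending exists, the process will do nothing
   - If a value of type error exists, the process will change the type of the item to pending
   - Otherwise a new item of type pending will be created in Redis
   - In all cases, the package will be returned as pending
5. Check if a key exists in Redis for the package and the type of the value is error
   - If the key/value exists, the value will be updated by increasing the number of attempts and setting the date of the last attempt to now
   - Otherwise a new item of type error will be created in Redis
   - In all cases, the package will be returned as error
Conclusions
- By adding items of type permanent to the Redis cache, we'll be able to categorize known applications that are not published in Google Play but are widely used, like the Samsung camera.
- If an app is no longer published in Google Play, an error item will be created in Redis. Once the number of attempts reaches a specific limit, it could be changed to permanent in order to avoid unnecessary requests.
- Web scraping will only be performed for individual requests. If the Google API hasn't resolved all the packages in a multi-package request, the unresolved ones will be marked as pending.
- An internal process will run periodically and try to get the Google Play info for those packages by web scraping. It will also update the Redis cache with the new info.
- As the internal process runs periodically, different requests for the same package within that period will be unified, so the number of web scraping requests will decrease.
@noelmarkham @raulraja @diesalbla Any thoughts? Sorry for such a long comment...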
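The five steps of the proposed workflow can be sketched as a single function. This is only an illustration under stated assumptions: a plain dict stands in for Redis, and fetch_from_api, exists_in_store and ApiError are hypothetical placeholders for the Google Play API call, the header-only existence check, and the 401/429 failures.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


class ApiError(Exception):
    """Stands in for a 401 Unauthorized or 429 Too many requests response."""


def resolve_package(
    package: str,
    cache: Dict[str, dict],  # in-memory stand-in for Redis
    fetch_from_api: Callable[[str], dict],
    exists_in_store: Callable[[str], bool],
    now: Optional[datetime] = None,
) -> dict:
    now = now or datetime.now(timezone.utc)
    item = cache.get(package)

    # Step 1: serve resolved or permanent items straight from the cache.
    if item and item["type"] in ("resolved", "permanent"):
        return item

    # Step 2: try the API; on success, overwrite any pending/error item.
    try:
        info = fetch_from_api(package)
    except ApiError:
        pass
    else:
        cache[package] = {"type": "resolved", "content": info}
        return cache[package]

    # Step 3: cheap existence check (headers only in the proposal).
    if exists_in_store(package):
        # Step 4: keep an existing pending item, otherwise create or convert.
        if not item or item["type"] != "pending":
            cache[package] = {"type": "pending", "content": {}}
        return cache[package]

    # Step 5: create the error item or bump its attempt counter.
    previous = item["content"]["attemptsNumber"] if item and item["type"] == "error" else 0
    cache[package] = {
        "type": "error",
        "content": {"attemptsNumber": previous + 1, "lastAttempt": now.isoformat()},
    }
    return cache[package]
```

Note that the sketch reads and then writes the cache without any locking, so it does not address concurrent requests for the same package.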
Would moving the web scraping and some of the Google play API calls to the client help in some of those cases?
@franciscodr
To get more insight into what may cause the delays, I have done some experiments from my computer against the web scraper.
First, I ran the following command several times:
    time curl -s "https://play.google.com/store/apps/details?id=air.fisherprice.com.shapesAndColors&hl=en_GB" | wc -c
where -s removes the curl progress status, | wc -c just gives the size in bytes of the answer, and time gives the time it takes. The answer is about 280 kB long.
Next, I ran the command:
    time curl -s --head "https://play.google.com/store/apps/details?id=air.fisherprice.com.shapesAndColors&hl=en_GB" | wc -c
where --head uses the HEAD HTTP method. This gives a much smaller answer, of only about 1.5 kB.
I noticed that the times do not change much despite the difference in size, which suggests that latency is creating some of the problems.
Since latency is usually mitigated with parallelism, I ran:
    time curl -s "https://play.google.com/store/apps/details?id={air.fisherprice.com.shapesAndColors,com.google.android.youtube,com.mojang.minecraftpe,com.wallapop,com.spotify.music,com.shazam.android,com.google.earth,org.telegram.messenger}&hl=en_GB" | wc -c
to download several pages with a single command. What I found was that the average time per request was below that of each single package.
In conclusion, it may be worth looking into parallelism and the batching of HTTP requests through the web scraper.
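As a hedged illustration of the parallelism idea, the sketch below fans requests out over a thread pool. The fetch argument is a hypothetical callable (for example, a function that downloads the details page of one package); it is injected so that the sketch does not depend on any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(
    packages: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,  # arbitrary default, not a measured optimum
) -> Dict[str, str]:
    """Fetch every package concurrently and return {package: page}."""
    names = list(packages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each name with its page.
        return dict(zip(names, pool.map(fetch, names)))
```

Because the work is dominated by network latency rather than CPU, threads are enough here; the total time should approach that of the slowest request per batch instead of the sum of all requests.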
@franciscodr, another great write-up, thanks. One thing that concerns me with multiple states and writes to Redis is the possibility of race conditions.
@noelmarkham
Maybe it could be fixed by creating different keys (one for each possible status) and removing them when they are no longer necessary, instead of changing the value?
For instance:
If the package isn't resolved, a new Redis key is created:
{ "package": "com.package.name", "type": "pending" }
After that, the internal process would search for the pending packages by asking Redis for the keys matching "*\"type\": \"pending\"".
If the package has been resolved this time, a new Redis key is created:
{ "package": "com.package.name" }
And the pending key would be removed.
But I'm not an expert with Redis and I'm not sure if it makes sense
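A small sketch of the key-per-status idea, using a plain dict in place of Redis and fnmatch to imitate glob-style key matching. All function names are hypothetical, and the trailing wildcard in the pattern is an assumption needed to match the closing brace of the key.

```python
import json
from fnmatch import fnmatchcase
from typing import Dict, List


def pending_key(package: str) -> str:
    return json.dumps({"package": package, "type": "pending"})


def resolved_key(package: str) -> str:
    return json.dumps({"package": package})


def mark_pending(store: Dict[str, str], package: str) -> None:
    # Creating the key is idempotent: a second request changes nothing.
    store.setdefault(pending_key(package), "")


def mark_resolved(store: Dict[str, str], package: str, info: dict) -> None:
    # Write the resolved key first, then drop the pending marker.
    store[resolved_key(package)] = json.dumps(info)
    store.pop(pending_key(package), None)


def pending_packages(store: Dict[str, str]) -> List[str]:
    pattern = '*"type": "pending"*'
    return [json.loads(k)["package"] for k in store if fnmatchcase(k, pattern)]
```

On a real Redis instance the KEYS command walks the whole keyspace, so an incremental SCAN with a MATCH pattern is usually the safer way to run this kind of search.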
@raulraja I'm not sure if moving this business logic to the client is our best option:
429 Too many requests error if the user tries to resolve a large list of packages in a short while.
Hey @noelmarkham, what's your opinion about creating a new key in Redis with the type in the key? We think that this way we can resolve the race condition problems. Thanks!
That sounds good, LGTM - sorry, I missed this the first time around.
Current status of NineCards Backend Google Play
We currently have three ways of getting the Google Play info about a package (ordered by priority):
1. Redis cache
For each package, we are storing up to two keys into Redis cache depending on the source where we are getting the info of the package:
The format of the key is:
"{\"serviceName\":\"serviceNameValue\",\"packageName\":\"com.dropbox.android\",\"locale\":{\"value\":\"en-US\"}}"
where the value of the serviceName field could be apiClient_app or webScrape_app
On the other hand, the format of the value can be:
"{\"Right\":{\"b\":{\"packageName\":\"com.dropbox.android\",\"title\":\"Dropbox\",\"free\":true,\"icon\":\"http://lh6.ggpht.com/fR_IJDfD1becp10IEaG2ly07WO4WW0LdZGUaNSrscqpgr9PI53D3Cp0yd2dXOgyux8w\",\"stars\":4.4024128913879395,\"downloads\":\"500,000,000 - 1,000,000,000\",\"categories\":[\"PRODUCTIVITY\"]}}}"
"{\"Left\":{\"a\":{\"message\":\"com.dropbox.android\"}}}"
2. Unofficial Google Play API
If there are no stored keys in Redis cache for a specific package, the next step is to try to get the info by using the unofficial Google Play API.
429 Too many requests error
3. Web scraping from Google Play website
Finally, if none of the previous steps were successful, the last chance is to get the info from Google Play website by web scraping.
404 Not Found error because the app is not published in Google Play market.
Found issues
Apps that are not published in Google Play market
There are several apps, like the Contacts app or the Android Camera, that aren't published in Google Play market. We have also found apps made by manufacturers like Samsung or Motorola that aren't published either.
Each time we try to get the Google Play info for one of these apps, all the steps are performed without getting a successful response. This means we are making unnecessary requests and adding more latency.
Timeout while getting info for a large bunch of packages
Currently we are getting a timeout error when a large bunch of packages is sent (about 150 packages) and almost none of them are cached in Redis. The operation takes more than 60 seconds.
This could be caused by:
429 Too many requests from Google Play API (too many requests are sent to the API in a short time). From that moment on, the rest of the packages need to go through all the steps to get the info.
Caching errors in Redis
For now, all the errors that are returned by Google Play API and web scraping are cached in Redis.
By doing that, we avoid performing all the steps repeatedly for apps that are not published in Google Play market. However, this behaviour can be misleading: if a request fails because of an invalid Google auth token, all subsequent requests will fail too, even if the app exists and the Google auth token is valid.
Package keys in Redis
We are not sure if we need two keys for each package in Redis. We want to cache the info about the applications, no matter which source we get that info from.
In addition, the keys are longer because we have to add info about the source, and this could increase the space that Redis requires.
Possible improvements
Only one key per application
As we commented above, we currently don't need to know the source of the info about the app, so maybe we should store just one key for each app.
The stored info for that key could come from either Google Play API or Google Play scraping
Avoid caching errors
Redis will only contain information about apps. In this way, the values will be a bit shorter and we'll fix the issue of storing temporary errors (invalid Google auth token, unavailable Google Play site...)
Create a black list for apps
This black list would contain the known apps that aren't published in Google Play Market, so no calls would be made for the apps included in the list.
This list could have a static part with the applications that we know will always fail, but we could also add new apps dynamically if we detect that requests for a specific application are constantly failing.
The apps included in this list could be categorized. In this way, although we wouldn't have any info about the logo or the number of downloads, at least we could classify them in the correct category.
Add a header to avoid web scraping
In some situations where a fast response is better than a more exhaustive one (for instance, the initial wizard process), we could add a header to the request to indicate that we don't want to use web scraping if the unofficial Google Play API fails.
In this case, the response will contain information only for those apps that are already cached or for which the API returns a valid response. The rest of the apps would be included in the unresolved section.
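A minimal sketch of how such a flag could shape the response. The allow_scraping parameter stands in for the proposed header, and fetch_from_api and scrape are hypothetical callables that are assumed to return None when they cannot resolve a package.

```python
from typing import Callable, Dict, Iterable, List, Optional

Info = Dict[str, object]


def resolve_list(
    packages: Iterable[str],
    cache: Dict[str, Info],
    fetch_from_api: Callable[[str], Optional[Info]],
    scrape: Callable[[str], Optional[Info]],
    allow_scraping: bool = True,  # False models the "no web scraping" header
) -> Dict[str, object]:
    resolved: Dict[str, Info] = {}
    unresolved: List[str] = []
    for package in packages:
        info = cache.get(package)
        if info is None:
            info = fetch_from_api(package)
        if info is None and allow_scraping:
            info = scrape(package)  # slow path, skipped for fast responses
        if info is None:
            unresolved.append(package)
        else:
            cache[package] = info
            resolved[package] = info
    return {"resolved": resolved, "unresolved": unresolved}
```

A client such as the initial wizard could call this with allow_scraping=False and retry the unresolved packages later.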