Improve NineCards Backend GP

franciscodr commented 8 years ago

Current status of NineCards Backend Google Play

We currently have three ways of getting the Google Play info about a package (ordered by priority):

1. Redis cache

For each package, we store up to two keys in the Redis cache, one for each source the package info can come from:

Unofficial Google Play API
Web scraping from Google Play website

The format of the key is:

"{\"serviceName\":\"serviceNameValue\",\"packageName\":\"com.dropbox.android\",\"locale\":{\"value\":\"en-US\"}}"

where the value of the serviceName field is either apiClient_app or webScrape_app.

The value takes one of two formats:

For valid responses: "{\"Right\":{\"b\":{\"packageName\":\"com.dropbox.android\",\"title\":\"Dropbox\",\"free\":true,\"icon\":\"http://lh6.ggpht.com/fR_IJDfD1becp10IEaG2ly07WO4WW0LdZGUaNSrscqpgr9PI53D3Cp0yd2dXOgyux8w\",\"stars\":4.4024128913879395,\"downloads\":\"500,000,000 - 1,000,000,000\",\"categories\":[\"PRODUCTIVITY\"]}}}"
If something went wrong (the package wasn't found, the API wasn't available and so on): "{\"Left\":{\"a\":{\"message\":\"com.dropbox.android\"}}}"
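
A minimal sketch of how such keys and values could be built and read, assuming the key is the compact JSON shown above and the value uses the Right/Left wrapping from the two examples. The function names cache_key and decode_value are hypothetical and are not taken from the NineCards code base.

```python
import json


def cache_key(service_name, package_name, locale):
    """Build the cache key as compact JSON, keeping the field order shown above."""
    return json.dumps(
        {
            "serviceName": service_name,
            "packageName": package_name,
            "locale": {"value": locale},
        },
        separators=(",", ":"),
    )


def decode_value(raw):
    """Return ("ok", app_info) for a Right value and ("error", message) for a Left one."""
    data = json.loads(raw)
    if "Right" in data:
        return "ok", data["Right"]["b"]
    return "error", data["Left"]["a"]["message"]
```

With this layout, the same package can appear under two keys that differ only in the serviceName field.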

2. Unofficial Google Play API

If there are no stored keys in Redis cache for a specific package, the next step is to try to get the info by using the unofficial Google Play API.

The response returned by the API is always cached in Redis, regardless of whether the call was successful.
We can get an unsuccessful response for several reasons:
- The API returned a 429 Too many requests error
- The package is not published in Google Play market

3. Web scraping from Google Play website

Finally, if none of the previous steps were successful, the last chance is to get the info from Google Play website by web scraping.

As in the previous step, the response is always cached in Redis.
To get the information about the package, we have to download the whole HTML page (each page is about 300KB).
We can get a 404 Not Found error because the app is not published in Google Play market.

Found issues

Apps that are not published in Google Play market

Several apps, such as the Contacts app or the Android Camera, aren't published in Google Play market. We have also found apps made by manufacturers like Samsung or Motorola that aren't published either.

Each time we try to get the Google Play info for one of these apps, all the steps are performed without getting a successful response. This means we are making unnecessary requests and adding latency.

Timeout while getting info for a large bunch of packages

Currently we get a timeout error when a large batch of packages is sent (about 150 packages) and almost none of them are cached in Redis. The operation takes more than 60 seconds.

This could be caused by:

Some of the packages aren't published, so all the steps are performed for them.
We get a 429 Too many requests from the Google Play API (too many requests are sent to the API in a short time). From that moment on, the rest of the packages need all the steps to get their info.

Caching errors in Redis

For now, all the errors that are returned by Google Play API and web scraping are cached in Redis.

By doing that, we avoid repeating all the steps for apps that are not published in Google Play market. However, this behaviour can be misleading: if a request fails because of an invalid Google auth token, all the following requests will fail too, even if the app exists and the Google auth token is valid.

Package keys in Redis

We are not sure if we need two keys for each package in Redis. We want to cache the info about the applications, no matter which source it comes from.

In addition, the keys are longer because they have to include info about the source, and this could increase the space that Redis requires.

Possible improvements

Only one key per application

As noted above, we currently don't need to know which source the app info came from, so maybe we should store just one key per app.

The stored info for that key could come from either the Google Play API or Google Play scraping.

Avoid caching errors

Redis will only contain information about apps. This way, the values will be a bit shorter and we'll fix the issue of storing temporary errors (invalid Google auth token, unavailable Google Play site...).

Create a black list for apps

This black list would contain the known apps that aren't published in Google Play market, so no calls would be made for the apps included in the list.

The list could have a static part with applications that we know will always fail, but we could also add new apps dynamically if we detect that requests for a specific application are failing constantly.

The apps included in this list could be categorized. This way, although we wouldn't have any info about the logo or the number of downloads, at least we could classify them in the correct category.

Add a header to avoid web scraping

In some situations where a fast response is better than a more exhaustive one (for instance, the initial wizard process), we could add a header to the request to indicate that we don't want to use web scraping if the unofficial Google Play API fails.

In this case, the response will contain information only for the apps that are already cached or for which the API returns a valid response. The rest of the apps would be included in the unresolved section.

franciscodr commented 8 years ago

@raulraja @noelmarkham @diesalbla Javi, Fede and I have detected some issues related to the performance of the backend apps while trying to get info of packages from Google Play.

We have summarized these issues here and suggested some possible improvements that would fix these issues.

We would like to know your thoughts about this proposal. Does it make sense? Are we missing something about the initial decisions about Redis and the behaviour of NineCards Backend GP app?

Thanks!

noelmarkham commented 8 years ago

Nice write up @franciscodr, and I think your suggestions are great.

I like changing the keys; hopefully that change should "just work" as the Redis serialization I wrote should be able to convert pretty much any type into JSON with no extra work.

It makes sense to me that you should only have one key for a particular package in Redis: we don't care where the package has come from, just that we have the metadata for that package - if it came from the API, great; if it came from the web, great too.

If possible, I would argue against having a blacklist for apps - the fewer manual interventions we have, the better. One idea (not sure if it's a good one) might be to have some kind of exponential backoff for failed packages, so that the more it fails, the longer we can wait before trying again? Then the ones that will never be there (camera, Samsung etc) will eventually have such a long time between checks that it's effectively a blacklist. Like I said, that's just a thought, might be a bit crazy.
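
The backoff idea can be sketched as follows. This is only an illustration: the one-minute base delay, the 30-day cap and the function names are assumptions made for the example and were not proposed in the thread.

```python
from datetime import datetime, timedelta


def retry_delay_seconds(failed_attempts, base_seconds=60, cap_seconds=30 * 24 * 3600):
    """Seconds to wait before the next check: the delay doubles per failure, up to a cap."""
    if failed_attempts <= 0:
        return 0
    return min(base_seconds * 2 ** (failed_attempts - 1), cap_seconds)


def should_retry(failed_attempts, last_attempt, now):
    """True once enough time has passed since the last failed attempt."""
    wait = timedelta(seconds=retry_delay_seconds(failed_attempts))
    return now - last_attempt >= wait
```

A package that keeps failing quickly reaches the cap, so it is checked so rarely that it behaves like a blacklisted entry, while a one-off failure is retried after a short wait.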

javipacheco commented 8 years ago

Maybe a static blacklist isn't our best option... I think that we need a categorized list of apps that we are sure aren't on Google Play. That gives us two things:

We won't search these apps on Google Play
We will be able to categorize these apps

For example, Samsung Camera isn't on Google Play, but it's important that this app appears among your Photography apps.

Regardless, we should have a dynamic blacklist, and I think that Noel's first proposal is good.

In our tests, if the user has 150 apps installed on their phone, 8 or 9 of them aren't on Google Play. When we search apps by scraping, every app takes ~3 seconds... In this case, the endpoint takes ~25 seconds more for categorizing.

I have 15 apps that aren't on Google Play and I get a timeout error when I want to categorize my apps :-(

noelmarkham commented 8 years ago

> When we search apps by scraping, every app takes ~3 seconds

But this is only for the first attempt, no? A second query would be quicker because the succeeded or failed result would be in Redis? So once we have a critical mass of apps in Redis (failed and succeeded) it would be a lot faster? If this is not the case, we should evaluate if Redis is adding value.

javipacheco commented 8 years ago

@noelmarkham yes, it's only for the first attempt, but if we avoid caching errors (for example, so that other users don't get errors because of a temporarily invalid response from Google Play), we'll always go to the website to get the information.

I think caching errors is good when the user wants an app that isn't on Google Play, but it's dangerous if the user has problems with their account or Google has a problem with its page (for example, yesterday Google Drive wasn't working, and it's possible that Google will have problems in the future with its non-official API).

noelmarkham commented 8 years ago

Would these concerns be removed with exponential backoff? One off problem: check again soon, bigger issue: a long time between checks. Perhaps if we can also look at "user issues" vs "server issues" then that might help too?

franciscodr commented 8 years ago

That sounds good @noelmarkham. We are going to think about this idea and propose a new approach based on it.

Thanks for your help!

diesalbla commented 8 years ago

@javipacheco One error in particular from the unofficial API, the Too many requests one, could be solved if we assigned quotas to each user.

franciscodr commented 8 years ago

Context

Currently we have two endpoints to ask for the Google Play info of a package:

The first one returns the info of an individual package
The second one allows us to request the Google Play info of a list of packages

After using both endpoints during the integration with the NineCards backend app, a problem has come up related to the second endpoint.

The number of packages included in the list is usually large (not less than 50) and the Google Play API is sometimes not able to resolve all of them (if there are a large number of requests in a short time it returns a 429 Too many requests error). In this case, the process starts to do web scraping for each package that hasn't been resolved yet.

This process downloads the details web page for each package (the mean size of the page is about 300KB) and takes 2-3 seconds per package. As a result:

The waiting time is longer, so the user experience isn't entirely good.
Sometimes a timeout error is thrown because the server isn't able to send back a response in an acceptable time.

Redis cache management

The Redis cache will contain only one key per package and could store four types of value, depending on the result of getting the info from Google Play:

resolved: Packages for which the Google Play info has been successfully retrieved will have this tag.
pending: If, during the execution of a multi-package request, the Google Play API returns an error (401 Unauthorized or 429 Too many requests), all the packages still without info will be tagged as pending. An internal process will try to get the info later.
error: If the request to get the Google Play information returns a 404 Not Found error, the package will be tagged as error.
permanent: All the packages with this tag are known applications that are not published in Google Play but that we can still categorize (for instance, the Samsung camera).

The format of the key and value will be:

Key: "com.package.name"

Value:

Example of a resolved item

{
    "type": "resolved",
    "content": {
        "title": "Package title",
        "free": true,
        "icon": "http://lh3.googleusercontent.com/aYbdIM1abwyVSUZLDKoE0CDZGRhlkpsaPOg9tNnBktUQYsXflwknnOn2Ge1Yr7rImGk",
        "stars": 4.5,
        "downloads": "500,000,000 - 1,000,000,000",
        "categories": ["SOCIAL"]
    }
}

Example of a pending item

{
    "type": "pending",
    "content": {}
}

Example of an error item

{
    "type": "error",
    "content": {
        "attemptsNumber": 10,
        "lastAttempt": "2016-08-10T05:00:00.000Z"
    }
}

Example of a permanent item

{
    "type": "permanent",
    "content": {
        "categories": ["SOCIAL"]
    }
}
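
As an illustration of the value format, the four item types can be written and read with a small pair of helpers. This is a sketch only: make_item and read_item are hypothetical names, and the check on the type tag is an assumption about how invalid values might be rejected.

```python
import json

# The four tags described above.
ITEM_TYPES = ("resolved", "pending", "error", "permanent")


def make_item(item_type, content=None):
    """Serialize a cache value as {"type": ..., "content": ...}."""
    if item_type not in ITEM_TYPES:
        raise ValueError("unknown item type: " + item_type)
    return json.dumps({"type": item_type, "content": content or {}})


def read_item(raw):
    """Parse a cache value back into a (type, content) pair."""
    item = json.loads(raw)
    return item["type"], item["content"]
```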

Proposed solution

The endpoint to get the info of an individual package doesn't change.
The workflow for getting the Google Play info of a list of packages will change a bit:

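[image: google-play-flowchart] https://cloud.githubusercontent.com/assets/1200151/17552242/15becaca-5eff-11e6-92b8-30535df3dbbd.png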

The described workflow will be performed for each package of the list:

1. Check whether a key exists in Redis for the package and the type of its value is resolved or permanent
- If the key/value exists, the process will return the stored package info as resolved
- Otherwise the process will continue in step 2
2. Try to get the package info by using the Google Play API
- If the API returns a valid response:
  - A new item of type resolved will be created in Redis
  - If a previous error or pending item exists, its type will be changed to resolved and its content will be updated too
  - The process will return the package info as resolved
- If an error is thrown (like 401 Unauthorized or 429 Too many requests), the process will continue in step 3
3. Check whether the package exists in Google Play by requesting only the server headers (faster, and the response is small)
- If the package exists, the process will continue in step 4
- Otherwise the process will continue in step 5
4. Check whether a key exists in Redis for the package and the type of its value is pending or error
- If a value of type pending exists, the process will do nothing
- If a value of type error exists, the process will change the type of the item to pending
- Otherwise a new item of type pending will be created in Redis
- In all cases, the package will be returned as pending
5. Check whether a key exists in Redis for the package and the type of its value is error
- If the key/value exists, the value will be updated by increasing the number of attempts and setting the date of the last attempt to now
- Otherwise a new item of type error will be created in Redis
- In all cases, the package will be returned as error
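
The five steps above can be sketched for a single package as follows. This is a rough illustration under several assumptions: a plain dict stands in for Redis, api_fetch and exists_on_play are hypothetical callables for the API call and the headers-only check, ApiError is a made-up exception standing in for 401/429 failures, and the content of an item moved from error to pending is reset to empty.

```python
from datetime import datetime, timezone


class ApiError(Exception):
    """Hypothetical stand-in for API failures such as 401 or 429."""


def resolve_in_batch(package, cache, api_fetch, exists_on_play, now=None):
    now = now or datetime.now(timezone.utc)
    item = cache.get(package)

    # Step 1: resolved and permanent items are served straight from the cache.
    if item and item["type"] in ("resolved", "permanent"):
        return "resolved", item["content"]

    # Step 2: a valid API response creates the item or overwrites an
    # earlier error/pending item as resolved.
    try:
        info = api_fetch(package)
        cache[package] = {"type": "resolved", "content": info}
        return "resolved", info
    except ApiError:
        pass

    # Step 3: cheap existence check (for example a headers-only request).
    if exists_on_play(package):
        # Step 4: the package exists but has no info yet, so mark it as pending.
        if not (item and item["type"] == "pending"):
            cache[package] = {"type": "pending", "content": {}}
        return "pending", {}

    # Step 5: the package was not found, so create or update the error item.
    previous = item["content"].get("attemptsNumber", 0) if item and item["type"] == "error" else 0
    content = {"attemptsNumber": previous + 1, "lastAttempt": now.isoformat()}
    cache[package] = {"type": "error", "content": content}
    return "error", content


def rate_limited(package):
    raise ApiError("429 Too many requests")


cache = {}
first = resolve_in_batch("com.a", cache, lambda p: {"title": "A"}, lambda p: True)
second = resolve_in_batch("com.b", cache, rate_limited, lambda p: True)
resolve_in_batch("com.c", cache, rate_limited, lambda p: False)
third = resolve_in_batch("com.c", cache, rate_limited, lambda p: False)
```

Note that no web scraping happens in this path: a package that exists but could not be resolved is only marked as pending.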

Conclusions

By adding items of type permanent to the Redis cache, we'll be able to categorize known applications that are not published in Google Play but are widely used, like the Samsung camera.
If an app is no longer published in Google Play, an error item will be created in Redis. Once the number of attempts reaches a specific limit, it could be changed to permanent in order to avoid unnecessary requests.
Web scraping will only be performed for individual requests. If the Google API hasn't resolved all the packages in a multi-package request, the unresolved ones will be marked as pending.
An internal process will run periodically and try to get the Google Play info for those packages by web scraping later. It will also update the Redis cache with the new info.
As the internal process runs periodically, different requests for the same package within one period will be unified, so the number of web scraping requests will decrease.
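
One pass of such an internal process might look like the sketch below. It assumes a dict standing in for Redis with one item per package, and a hypothetical scrape_fetch callable that returns the package info or None when scraping fails. Scheduling and the promotion of error items to permanent are left out.

```python
def process_pending(cache, scrape_fetch):
    """Scrape every pending package once and store the ones that resolve."""
    resolved = []
    for package, item in list(cache.items()):
        if item["type"] != "pending":
            continue
        info = scrape_fetch(package)  # assumed to return None on failure
        if info is not None:
            cache[package] = {"type": "resolved", "content": info}
            resolved.append(package)
    return resolved


calls = []


def fake_scrape(package):
    # Test double: only com.a can be scraped successfully.
    calls.append(package)
    return {"title": package} if package == "com.a" else None


cache = {
    "com.a": {"type": "pending", "content": {}},
    "com.b": {"type": "resolved", "content": {"title": "B"}},
    "com.c": {"type": "pending", "content": {}},
}
resolved = process_pending(cache, fake_scrape)
```

Because there is a single pending item per package, any number of requests for the same package within one period leads to one scraping call.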

@noelmarkham @raulraja @diesalbla Any thoughts? Sorry for such a long comment...

raulraja commented 8 years ago

Would moving the web scraping and some of the Google Play API calls to the client help in some of those cases?


diesalbla commented 8 years ago

@franciscodr

To get more insight into what may cause the delays, I have done some experiments from my computer against the web scraper.

First, I ran the command time curl -s "https://play.google.com/store/apps/details?id=air.fisherprice.com.shapesAndColors&hl=en_GB" | wc -c several times, where -s suppresses curl's progress output, | wc -c gives the size of the answer in bytes, and time gives the time it takes. The answer is about 280 kB long. Next, I ran the command time curl -s --head "https://play.google.com/store/apps/details?id=air.fisherprice.com.shapesAndColors&hl=en_GB" | wc -c, where --head makes curl use the HTTP HEAD method. This gives a much smaller answer, of only about 1.5 kB.

I noticed that the times do not change much despite the difference in size, which means that latency is causing some of the problems.

Since latency is usually mitigated with parallelism, I ran time curl -s "https://play.google.com/store/apps/details?id={air.fisherprice.com.shapesAndColors,com.google.android.youtube,com.mojang.minecraftpe,com.wallapop,com.spotify.music,com.shazam.android,com.google.earth,org.telegram.messenger}&hl=en_GB" | wc -c, to download several pages in one go. What I found was that the average time per request was below that of a single-package request.

In conclusion, it may be worth looking into parallelism and the batching of HTTP requests through the web scraper.
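
A minimal sketch of what parallel fetching could look like, using a thread pool. The names fetch_all and fetch_one are hypothetical, and the pool size of 8 is an arbitrary assumption; fetch_one stands for whatever function downloads or scrapes a single package page.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(packages, fetch_one, max_workers=8):
    """Call fetch_one for every package on a thread pool and return {package: result}."""
    packages = list(packages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input packages.
        results = pool.map(fetch_one, packages)
        return dict(zip(packages, results))
```

When each call is dominated by network latency, the total time tends towards that of the slowest request in each group of max_workers rather than the sum of all of them. An exception raised by fetch_one propagates to the caller, so a real version would need to decide how to handle individual failures.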

noelmarkham commented 8 years ago

@franciscodr, another great write-up, thanks. One thing that concerns me with multiple states and writes to Redis is the possibility of race conditions.

franciscodr commented 8 years ago

@noelmarkham

Maybe it could be fixed by creating different keys (one for each possible status) and removing them when they are no longer necessary, instead of changing the value?

For instance:

If the package isn't resolved, a new Redis key is created:

{ "package": "com.package.name", "type": "pending" }

After that, the internal process would find the pending packages by asking Redis for the keys that match "*\"type\": \"pending\"".

If the package has been resolved this time, a new Redis key is created:

{ "package": "com.package.name" }

And the pending key would be removed.

But I'm not an expert with Redis and I'm not sure if it makes sense
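
The type-in-key idea can be sketched with a dict standing in for Redis. All function names here are hypothetical, and the empty string stored under a pending key is an assumption, since the comment above does not say what the value would be.

```python
import json


def pending_key(package):
    return json.dumps({"package": package, "type": "pending"})


def resolved_key(package):
    return json.dumps({"package": package})


def mark_pending(store, package):
    """Create the pending key if it does not exist yet."""
    store.setdefault(pending_key(package), "")


def mark_resolved(store, package, info):
    """Write the resolved key, then drop the pending one."""
    store[resolved_key(package)] = json.dumps(info)
    store.pop(pending_key(package), None)


def pending_packages(store):
    """List the packages that still have a pending key."""
    parsed = (json.loads(key) for key in store)
    return [key["package"] for key in parsed if key.get("type") == "pending"]


store = {}
mark_pending(store, "com.package.name")
mark_pending(store, "com.other")
mark_resolved(store, "com.package.name", {"title": "Package title"})
```

Against a real Redis server, the write and the delete in mark_resolved are two separate commands, so they would probably need to be wrapped in a MULTI/EXEC transaction to be applied together. Listing keys by pattern with KEYS walks the whole keyspace, so an incremental SCAN may be a better fit for a large cache.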

franciscodr commented 8 years ago

@raulraja I'm not sure that moving this business logic to the client is our best option:

We wouldn't take advantage of caching the results. Different users could do web scraping for the same packages.
Given that the web scraping process is slow, the user experience could get worse. This is one reason why we propose running an internal process periodically. In addition, we would reduce the number of web scraping tasks by removing duplicated tasks for the same package.
The Google Play API could keep throwing a 429 Too many requests error if the user tries to resolve a large list of packages in a short time.

javipacheco commented 8 years ago

hey @noelmarkham

what's your opinion about creating a new key in Redis with the type in the key? We think that it can resolve the race condition problems. Thanks!

noelmarkham commented 8 years ago

That sounds good, LGTM - sorry, I missed this the first time around.

xebia-functional / nine-cards-v2