This one needs to be addressed now. The file is growing and the request will become more and more painful. These data should now be stored in a DB.
I guess the work by @laghee would be very useful.
Shall we deploy the DB with the needsdiagnosis timeline first, then add the other data?
What do you think @laghee ?
That will help us make progress on https://github.com/mozilla/webcompat-team-okrs/issues/28
Checking the add-ons on Heroku. The volume of all our issues stored as JSON as sent by GitHub is around 14 MB (verbose data) for ~20,000 issues.
@miketaylr @laghee some ideas about this previous comment?
Looking at the MongoDB API, I think it would be safe to go with this. My main reasons are:
- A single instance of MongoDB can support multiple independent databases.
- Optionally, there is Flask-PyMongo. Optional because it is just an abstraction layer with a couple of extra features.
Ah, another nice module for Mongo/Python is MongoEngine, with a better API than PyMongo.
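Just to illustrate what this could look like (a rough sketch, not actual code; the connection URI and the document/field names are assumptions on my side), with MongoEngine we could write one record per measurement instead of rewriting a whole JSON file:

```python
# Minimal sketch, assuming MongoEngine and a Mongo URI from Heroku/local config.
# TimelinePoint and its fields are hypothetical names for illustration.
from datetime import datetime, timezone

import mongoengine as me

me.connect(host="mongodb://localhost:27017/dashboard")  # assumption: connection URI

class TimelinePoint(me.Document):
    """One timestamped measurement for a milestone (e.g. needsdiagnosis)."""
    milestone = me.StringField(required=True)
    open_issues = me.IntField(required=True)
    timestamp = me.DateTimeField(default=lambda: datetime.now(timezone.utc))

# Each hourly run appends a single small document.
TimelinePoint(milestone="needsdiagnosis", open_issues=42).save()
```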
Oooh! This is so exciting! It's like I'm getting called up to the Show! 😁 (Sorry, @karlcow, I would have used a basketball metaphor, but my sports knowledge pretty much begins and ends with baseball... ⚾)
I discussed the row limit with @miketaylr when I was setting up the configuration a couple of months back. At the time, he gave me the go-ahead, so I believe we're currently paying the $9/month for the >10,000-row PostgreSQL plan. (I don't think I can double-check this right now because I no longer have access to the Heroku account.)
The back-end work I've done up to now was all based on PostgreSQL, using Flask-SQLAlchemy, so switching to a non-relational DB like Mongo would mean (I believe) basically throwing out everything that's been done so far. Which is certainly fine if we decide that's a markedly better plan. But given our always limited resources, it seems like we could get up and running faster if we stick to the framework we have currently.
If we wanted to go with another flavor of SQL, however, that wouldn't be hard -- just a couple of small changes. I think the only Postgres-dependent point is the json type used for "details" in the events table.
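For context, the Postgres-dependent bit would be roughly this (a sketch with hypothetical model and column names, assuming Flask-SQLAlchemy; not the actual schema):

```python
# Sketch only: a hypothetical Event model, assuming Flask-SQLAlchemy.
# The JSON column is the PostgreSQL-specific piece; moving to another SQL
# flavor would mainly mean changing that one column type.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects.postgresql import JSON

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://localhost/dashboard"  # assumption
db = SQLAlchemy(app)

class Event(db.Model):
    __tablename__ = "events"
    id = db.Column(db.Integer, primary_key=True)
    action = db.Column(db.String(50))
    details = db.Column(JSON)  # PostgreSQL-dependent type
```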
As far as how to deploy (or when to deploy what), I was aiming for the database itself to be dashboard-agnostic. I had envisioned that we'd store a pared-down backup of the web-bugs repo as our db. Then by building some specific modules on the server side, we could have a query to grab the data needed for each dashboard.
ok thanks for the clarification! I didn't know that.
> (I don't think I can double check this right now because I no longer have access to the Heroku account.)
@laghee I wonder how we can solve this.
@karlcow Ah, I didn't realize it was possible to ask for access to Heroku without a Mozilla email. I'll follow the same procedure as last time and see what happens!
EDIT: Aha -- I forgot -- looks like I'm just supposed to ask you, @karlcow 😊 (since I don't need Moz membership). I'll PM you my email...
Hmm. Tough call. MongoDB is pretty simple to use, and great if you're just storing JSON documents.
Is it the worst idea ever to experiment with... both? I do agree asking @laghee to re-write everything feels counter-productive. But if @karlcow is able to get Mongo up and running to solve this in a day or two (or five), it might be worth it short-term. We could migrate the other dashboards to Kate's work when they're ready. But we're opting into some technical debt there.
WDYT, @karlcow?
(Also, yes, we're paying for the PostgreSQL add-on in Heroku. Nobody yelled at me yet...)
@miketaylr Trying both doesn't seem crazy 😁, but just to clarify -- it sounds like we have two distinct storage concepts?
1. Distill the necessary data for each particular dashboard first, and then store that formatted JSON (currently returned from the server endpoint) in the database to be retrieved when the specific dashboard is displayed (@karlcow).
AND
2. Back up a condensed version of all issues as they come in (with webhook updating) and use individual db queries to fetch the necessary data for each dashboard at display time (me).
FWIW, having written this down, it starts to sound pretty interesting to have both, especially since they're not fulfilling the same functions (assuming I've understood correctly). We could see which system is more efficient in which cases -- easier to work with, makes future scaling simpler, etc -- and we'd have more flexibility if we decided to change directions on either front.
Does it sound nuts/redundant to think of having some cron jobs (Heroku Scheduler or clock jobs with Python/APScheduler) that crunch numbers from the Postgres bugs DB (saves on GitHub calls) and save the results to MongoDB(s) (saves data manipulation at load time) for each dashboard?
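Something in this spirit, as a very rough sketch (the SQL, table/collection names, and connection strings are all assumptions, just to show the shape of a clock job; assuming APScheduler, psycopg2 and PyMongo):

```python
# Rough sketch of a clock process, assuming APScheduler, psycopg2 and PyMongo.
# crunch_needsdiagnosis() and all DB/collection names are hypothetical.
from datetime import datetime, timezone

import psycopg2
from apscheduler.schedulers.blocking import BlockingScheduler
from pymongo import MongoClient

scheduler = BlockingScheduler()
mongo = MongoClient("mongodb://localhost:27017")  # assumption: Mongo URI

@scheduler.scheduled_job("interval", hours=1)
def crunch_needsdiagnosis():
    # Count open needsdiagnosis issues from the Postgres backup of web-bugs...
    with psycopg2.connect("dbname=webbugs") as conn:  # assumption: DSN
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM issues "
                "WHERE milestone = 'needsdiagnosis' AND state = 'open'"
            )
            (count,) = cur.fetchone()
    # ...and store the pre-crunched number where the dashboard can read it cheaply.
    mongo.dashboards.needsdiagnosis_timeline.insert_one(
        {"timestamp": datetime.now(timezone.utc), "open_issues": count}
    )

if __name__ == "__main__":
    scheduler.start()
```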
I guess we need to look at the shape of the data and what the data source is.
For needsdiagnosis, for example, the data comes from an hourly request to https://api.github.com/repos/webcompat/web-bugs/milestones/3, and we just read the "open_issues" parameter from there, which we store with a timestamp in a JSON file.
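In other words, something roughly like this today (a sketch of the flow, not the actual script; the timestamp format and file handling here are assumptions):

```python
# Sketch of the current hourly flow, assuming the requests library.
# The timestamp format and file handling are assumptions for illustration.
import json
import time

import requests

MILESTONE_URL = "https://api.github.com/repos/webcompat/web-bugs/milestones/3"
TIMELINE_FILE = "needsdiagnosis-timeline.json"

def record_open_issues():
    milestone = requests.get(MILESTONE_URL, timeout=30).json()
    point = {"timestamp": int(time.time()), "open_issues": milestone["open_issues"]}
    try:
        with open(TIMELINE_FILE) as f:
            timeline = json.load(f)
    except FileNotFoundError:
        timeline = []
    timeline.append(point)
    # The pain point: the whole (growing) file is rewritten on every run.
    with open(TIMELINE_FILE, "w") as f:
        json.dump(timeline, f)
```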
@laghee Is there a version of the code which is accessible with regards to the way the DB is working currently and how it stays up to date with GitHub? Indeed I don't think we should venture in rewriting things. But it would be cool if we could start deploying the DB and server hooks in small individual pieces before even starting to integrate the dashboards.
At the same time, storing the full needsdiagnosis-timeline.json in a single record in a MongoDB seems counter-effective too, because it would mean rewriting all the data in the store. So writing individual records, in any kind of store (SQL or not), will still be the same amount of work.
For the issues themselves, I can see a benefit of storing them as a backup solution in a Mongo store (minus the sync issue). How do we
I feel we are going beyond the scope of this issue. :)
Let's open #63 for just deploying the DB part.
> @laghee Is there a version of the code which is accessible with regards to the way the DB is working currently and how it stays up to date with GitHub?
@karlcow can you clarify what you're asking for here? Do you mean a pointer to where this happens in the repo? Thanks!
oh… my… memory… 😱
I didn't even remember I had reviewed it… 😭
Let me close here and start fresh with an issue about creating a table for needsdiagnosis data.
I opened #67
We might need to switch to a DB at some point. Or maybe we should do it right away.
That would allow requesting things like the following, instead of all the data in one shot:
/api/needsdiagnosis-timeline?from=20180401&to=20180501
@magsout it probably requires similar UI work on the client side for the routes.
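For illustration, a minimal sketch of what such a parameterized endpoint could look like (assuming Flask; `timeline_between()` is a hypothetical query helper, not something that exists in the repo):

```python
# Minimal sketch, assuming Flask. timeline_between() is a hypothetical helper
# standing in for whatever DB query we end up with.
from flask import Flask, jsonify, request

app = Flask(__name__)

def timeline_between(milestone, date_from, date_to):
    """Hypothetical query helper: return timeline points between two dates."""
    # Placeholder: a real implementation would query the DB here.
    return []

@app.route("/api/needsdiagnosis-timeline")
def needsdiagnosis_timeline():
    # e.g. /api/needsdiagnosis-timeline?from=20180401&to=20180501
    date_from = request.args.get("from")
    date_to = request.args.get("to")
    points = timeline_between("needsdiagnosis", date_from, date_to)
    return jsonify({"milestone": "needsdiagnosis", "timeline": points})
```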