This one needs to be addressed now. The file is growing and the request will become more and more painful. These data should now be stored in a DB.
I guess the work by @laghee would be very useful.
Shall we deploy the DB with the needsdiagnosis timeline first, then add the other data?
What do you think @laghee ?
That will help us make progress on https://github.com/mozilla/webcompat-team-okrs/issues/28
Checking the add-ons on Heroku. The volume of all our issues stored as JSON as sent by GitHub is around 14 MB (verbose data) for ~20,000 issues.
@miketaylr @laghee some ideas about this previous comment?
Looking at the MongoDB API, I think it would be safe to go with this. My main reasons are:
- A single instance of MongoDB can support multiple independent databases.
- Optionally, there is Flask-PyMongo. Optional because it is just an abstraction layer with a couple of extra features.
Ah, another nice module for Mongo/Python is MongoEngine, with a better API than PyMongo.
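Just to illustrate what this could look like (a rough sketch, not actual code; the connection URI and the document/field names are assumptions on my side), with MongoEngine we could write one record per measurement instead of rewriting a whole JSON file:

```python
# Minimal sketch, assuming MongoEngine and a Mongo URI from Heroku/local config.
# TimelinePoint and its fields are hypothetical names for illustration.
from datetime import datetime, timezone

import mongoengine as me

me.connect(host="mongodb://localhost:27017/dashboard")  # assumption: connection URI

class TimelinePoint(me.Document):
    """One timestamped measurement for a milestone (e.g. needsdiagnosis)."""
    milestone = me.StringField(required=True)
    open_issues = me.IntField(required=True)
    timestamp = me.DateTimeField(default=lambda: datetime.now(timezone.utc))

# Each hourly run appends a single small document.
TimelinePoint(milestone="needsdiagnosis", open_issues=42).save()
```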
Oooh! This is so exciting! It's like I'm getting called up to the Show! 😁 (Sorry, @karlcow, I would have used a basketball metaphor, but my sports knowledge pretty much begins and ends with baseball... ⚾)
I discussed the row limit with @miketaylr when I was setting up the configuration a couple of months back. At the time, he gave me the go-ahead, so I believe we're currently paying the $9/month for the >10,000-row PostgreSQL plan. (I don't think I can double-check this right now because I no longer have access to the Heroku account.)
The back-end work I've done up to now was all based on PostgreSQL, using Flask-SQLAlchemy, so switching to a non-relational DB like Mongo would mean (I believe) basically throwing out everything that's been done so far. Which is certainly fine if we decide that's a markedly better plan. But given our always limited resources, it seems like we could get up and running faster if we stick to the framework we have currently.
If we wanted to go with another flavor of SQL, however, that wouldn't be hard -- just a couple of small changes. I think the only Postgres-dependent point is the json type used for "details" in the events table.
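For context, the Postgres-dependent bit would be roughly this (a sketch with hypothetical model and column names, assuming Flask-SQLAlchemy; not the actual schema):

```python
# Sketch only: a hypothetical Event model, assuming Flask-SQLAlchemy.
# The JSON column is the PostgreSQL-specific piece; moving to another SQL
# flavor would mainly mean changing that one column type.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects.postgresql import JSON

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://localhost/dashboard"  # assumption
db = SQLAlchemy(app)

class Event(db.Model):
    __tablename__ = "events"
    id = db.Column(db.Integer, primary_key=True)
    action = db.Column(db.String(50))
    details = db.Column(JSON)  # PostgreSQL-dependent type
```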
As far as how to deploy (or when to deploy what), I was aiming for the database itself to be dashboard-agnostic. I had envisioned that we'd store a pared-down backup of the web-bugs repo as our db. Then by building some specific modules on the server side, we could have a query to grab the data needed for each dashboard.
ok thanks for the clarification! I didn't know that.
> (I don't think I can double check this right now because I no longer have access to the Heroku account.)
@laghee I wonder how we can solve this.
@karlcow Ah, I didn't realize it was possible to ask for access to Heroku without a Mozilla email. I'll follow the same procedure as last time and see what happens!
EDIT: Aha -- I forgot -- looks like I'm just supposed to ask you, @karlcow 😊 (since I don't need Moz membership). I'll PM you my email...
Hmm. Tough call. MongoDB is pretty simple to use, and great if you're just storing JSON documents.
Is it the worst idea ever to experiment with... both? I do agree asking @laghee to re-write everything feels counter-productive. But if @karlcow is able to get Mongo up and running to solve this in a day or two (or five), it might be worth it short-term. We could migrate the other dashboards to Kate's work when they're ready. But we're opting into some technical debt there.
WDYT, @karlcow?
(Also, yes, we're paying for the PostgreSQL add-on in Heroku. Nobody yelled at me yet...)
@miketaylr Trying both doesn't seem crazy 😁, but just to clarify -- it sounds like we have two distinct storage concepts?
1. Distill the necessary data for each particular dashboard first, and then store that formatted JSON (currently returned from the server endpoint) in the database to be retrieved when the specific dashboard is displayed (@karlcow).
AND
2. Back up a condensed version of all issues as they come in (with webhook updating) and use individual db queries to fetch the necessary data for each dashboard at display time (me).
FWIW, having written this down, it starts to sound pretty interesting to have both, especially since they're not fulfilling the same functions (assuming I've understood correctly). We could see which system is more efficient in which cases -- easier to work with, makes future scaling simpler, etc -- and we'd have more flexibility if we decided to change directions on either front.
Does it sound nuts/redundant to think of having some cron jobs (Heroku Scheduler or clock jobs with Python/APScheduler) that crunch numbers from the Postgres bugs DB (saves on GitHub calls) and save the results to MongoDB(s) (saves data manipulation at load time) for each dashboard?
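Something in this spirit, as a very rough sketch (the SQL, table/collection names, and connection strings are all assumptions, just to show the shape of a clock job; assuming APScheduler, psycopg2 and PyMongo):

```python
# Rough sketch of a clock process, assuming APScheduler, psycopg2 and PyMongo.
# crunch_needsdiagnosis() and all DB/collection names are hypothetical.
from datetime import datetime, timezone

import psycopg2
from apscheduler.schedulers.blocking import BlockingScheduler
from pymongo import MongoClient

scheduler = BlockingScheduler()
mongo = MongoClient("mongodb://localhost:27017")  # assumption: Mongo URI

@scheduler.scheduled_job("interval", hours=1)
def crunch_needsdiagnosis():
    # Count open needsdiagnosis issues from the Postgres backup of web-bugs...
    with psycopg2.connect("dbname=webbugs") as conn:  # assumption: DSN
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM issues "
                "WHERE milestone = 'needsdiagnosis' AND state = 'open'"
            )
            (count,) = cur.fetchone()
    # ...and store the pre-crunched number where the dashboard can read it cheaply.
    mongo.dashboards.needsdiagnosis_timeline.insert_one(
        {"timestamp": datetime.now(timezone.utc), "open_issues": count}
    )

if __name__ == "__main__":
    scheduler.start()
```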
I guess we need to look at the shape of the data and what the data source is.
For needsdiagnosis, for example, the data comes from an hourly request to https://api.github.com/repos/webcompat/web-bugs/milestones/3, and we just read the "open_issues" parameter from there, which we store with a timestamp in a JSON file.
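In other words, something roughly like this today (a sketch of the flow, not the actual script; the timestamp format and file handling here are assumptions):

```python
# Sketch of the current hourly flow, assuming the requests library.
# The timestamp format and file handling are assumptions for illustration.
import json
import time

import requests

MILESTONE_URL = "https://api.github.com/repos/webcompat/web-bugs/milestones/3"
TIMELINE_FILE = "needsdiagnosis-timeline.json"

def record_open_issues():
    milestone = requests.get(MILESTONE_URL, timeout=30).json()
    point = {"timestamp": int(time.time()), "open_issues": milestone["open_issues"]}
    try:
        with open(TIMELINE_FILE) as f:
            timeline = json.load(f)
    except FileNotFoundError:
        timeline = []
    timeline.append(point)
    # The pain point: the whole (growing) file is rewritten on every run.
    with open(TIMELINE_FILE, "w") as f:
        json.dump(timeline, f)
```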
@laghee Is there a version of the code which is accessible with regards to the way the DB is working currently and how it stays up to date with GitHub? Indeed I don't think we should venture in rewriting things. But it would be cool if we could start deploying the DB and server hooks in small individual pieces before even starting to integrate the dashboards.
At the same time, storing the full needsdiagnosis-timeline.json in a single record in a MongoDB seems counter-effective too, because it would mean rewriting all the data in the store. So writing individual records, in any kind of store (SQL or not), will still be the same amount of work.
For the issues themselves, I can see a benefit of storing them as a backup solution in a Mongo store (minus the sync issue). How do we
I feel we are going beyond the scope of this issue. :)
Let's open #63 for just deploying the DB part.
> @laghee Is there a version of the code which is accessible with regards to the way the DB is working currently and how it stays up to date with GitHub?
@karlcow can you clarify what you're asking for here? Do you mean a pointer to where this happens in the repo? Thanks!
oh… my… memory… 😱
I didn't even remember I had reviewed it… 😭
Let me close here and start fresh with an issue about creating a table for needsdiagnosis data.
I opened #67
We might need to switch to a DB at some point. Or maybe we should do it right away.
That would allow requesting things like the following, instead of all the data in one shot:
/api/needsdiagnosis-timeline?from=20180401&to=20180501
@magsout it probably requires similar UI work on the client side for the routes.
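For illustration, a minimal sketch of what such a parameterized endpoint could look like (assuming Flask; `timeline_between()` is a hypothetical query helper, not something that exists in the repo):

```python
# Minimal sketch, assuming Flask. timeline_between() is a hypothetical helper
# standing in for whatever DB query we end up with.
from flask import Flask, jsonify, request

app = Flask(__name__)

def timeline_between(milestone, date_from, date_to):
    """Hypothetical query helper: return timeline points between two dates."""
    # Placeholder: a real implementation would query the DB here.
    return []

@app.route("/api/needsdiagnosis-timeline")
def needsdiagnosis_timeline():
    # e.g. /api/needsdiagnosis-timeline?from=20180401&to=20180501
    date_from = request.args.get("from")
    date_to = request.args.get("to")
    points = timeline_between("needsdiagnosis", date_from, date_to)
    return jsonify({"milestone": "needsdiagnosis", "timeline": points})
```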