webcompat / webcompat.com

Source code for webcompat.com
https://webcompat.com

[feature] Service for duplicate domain names detection on webcompat.com #2545

Open karlcow opened 6 years ago

karlcow commented 6 years ago

Options:

Context

We had

We should probably define this as a "micro-service", i.e. a separate shell that our site sends requests to. This would let us plug in whichever solution we find adequate behind it, be it an independent DB, a direct request to GitHub, etc. webcompat.com could then use that service for searches and reporting.
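A minimal sketch of that "separate shell with a pluggable backend" idea, in Python. Everything named here (`IssueBackend`, `InMemoryBackend`, `DuplicateService`, the `domain` and `number` fields) is hypothetical and only illustrates the shape; it is not code from webcompat.com.

```python
from typing import Iterable, List, Protocol


class IssueBackend(Protocol):
    """Hypothetical interface: anything that can return issues for a domain."""

    def issues_for_domain(self, domain: str) -> List[dict]: ...


class InMemoryBackend:
    """Stand-in backend; a real one might wrap a DB or the GitHub API."""

    def __init__(self, issues: Iterable[dict]) -> None:
        self._issues = list(issues)

    def issues_for_domain(self, domain: str) -> List[dict]:
        wanted = domain.lower()
        return [i for i in self._issues if i.get("domain", "").lower() == wanted]


class DuplicateService:
    """Thin facade the site would call; the backend behind it is swappable."""

    def __init__(self, backend: IssueBackend) -> None:
        self._backend = backend

    def find_candidates(self, domain: str) -> List[int]:
        # Return issue numbers already filed against this domain, sorted.
        return sorted(i["number"] for i in self._backend.issues_for_domain(domain))
```

The site would only ever talk to `DuplicateService`, so swapping the in-memory stand-in for a database-backed or GitHub-backed class should not require changes on the webcompat.com side.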

miketaylr commented 6 years ago

Let's build it.

miketaylr commented 6 years ago

Also related to (and could probably power some of these requests):

https://github.com/webcompat/webcompat.com/issues/723 https://github.com/webcompat/webcompat.com/issues/132 https://github.com/webcompat/webcompat.com/issues/60

miketaylr commented 6 years ago

This is probably something @johngian will work on, which will also serve to power dashboards that show open webcompat issues for a given domain.

johngian commented 6 years ago

I created a new repo with some code here: https://github.com/johngian/webcompat-search For now what it does is:

johngian commented 6 years ago

@miketaylr While working on that issue a couple of questions came up:

miketaylr commented 6 years ago

Heya @johngian, sorry for the delay in responding!

Is there some sort of DB for all the issues or do we continue querying the GH API?

Not right now... we've talked about it at different times, but haven't really ever tackled it (mostly because we're lazybusy and don't want to solve all the problems of syncing)

Do we want to maintain some sort of instance/service to query issues instead? The way I see it, a service similar to webcompat-search backed by ES can be a good fit.

Yes, that sounds like a good plan.

Have you considered using ghtorrent to get GitHub-related data for webcompat?

I've... never heard of it! It might be interesting for @karlcow and @laghee to look at, as they're building something similar for our https://github.com/webcompat/webcompat-metrics-server repo (not to suggest they should stop near the end :P).

It looks like it contains issues, so perhaps it could power such a server backed by elasticsearch.

miketaylr commented 6 years ago

@johngian is this hosted anywhere for us to start exploring it next quarter?

karlcow commented 5 years ago

GitHub already has a feature for this, relying on a GraphQL getRelatedIssues query. It is probably not a public API yet.

karlcow commented 5 years ago

@miketaylr Just a thought.

If the goal is to have a tool that helps identify previous similar issues, and not necessarily to catch duplicates immediately (long-term scale), we can:

  1. Create a daily incremental backup of issues as JSON
  2. Feed this to a DB

Then a service can expose this DB for different purposes.
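The two steps above could be sketched roughly as follows, assuming each daily backup is a JSON array of issue objects carrying the `number`, `title`, `updated_at` and `body` fields that GitHub returns for issues. SQLite, the table layout and the helper names `open_db` and `load_backup` are assumptions made for illustration only.

```python
import json
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the issues DB; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issues ("
        "number INTEGER PRIMARY KEY, title TEXT, updated_at TEXT, body TEXT)"
    )
    return conn


def load_backup(conn: sqlite3.Connection, backup_json: str) -> int:
    """Upsert the issues from one day's JSON dump; returns rows written."""
    issues = json.loads(backup_json)
    rows = [
        (i["number"], i["title"], i["updated_at"], i.get("body") or "")
        for i in issues
    ]
    # INSERT OR REPLACE keeps the load incremental: an issue that changed
    # since the previous backup overwrites its old row instead of duplicating.
    conn.executemany(
        "INSERT OR REPLACE INTO issues (number, title, updated_at, body) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Keying on the issue number means the same backup can be replayed safely, which matters if a daily job fails halfway and is run again.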

For the specific goal of finding what you need, the issue is still the same. You first need to define what a "duplicate issue" is when searching.
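As one possible starting definition (an assumption, not something decided in this thread), two reports could be treated as duplicate candidates when their URLs reduce to the same hostname. The helpers `domain_key` and `same_domain` are hypothetical names, and the rule deliberately ignores harder cases such as mobile subdomains or different bugs on the same site.

```python
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Reduce a reported URL to a comparable hostname (one possible rule)."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat a bare host as the netloc
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def same_domain(url_a: str, url_b: str) -> bool:
    """True when both URLs yield the same non-empty domain key."""
    key = domain_key(url_a)
    return bool(key) and key == domain_key(url_b)
```

A match on the domain key only says "look at these issues first"; deciding whether they describe the same bug would still need a human or a stricter rule.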