wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Search by phrase in 'Browse Policy Documents' #439

Open dd207 opened 4 years ago

dd207 commented 4 years ago

In 'Browse Policy Documents', search currently works by retrieving words found anywhere in any of the available policy documents.

E.g. 'Wellcome Collection' is searched as 'Wellcome' and 'Collection', retrieving all documents that contain those two separate words and therefore returning lots of irrelevant results.

Users need to be able to search by a phrase such as 'Wellcome Collection' and find policy documents that have those words next to each other.

What effort is required to implement this, given the current architecture? @jdu @SamDepardieu

jdu commented 4 years ago

We would need more sophisticated search operators akin to Google search syntax, for instance allowing a user to input a phrase wrapped in quotes, "Wellcome Collection", to force the engine to treat the words in quotes as a complete phrase.

Effectively, this means implementing a DSL (domain-specific language) on our search input and writing a simple interpreter that translates the query into a valid Elasticsearch query. That is what's needed if you want a search for measles mumps to bring up topics on both concerns while also supporting complete phrases such as "Wellcome Collection".

This has the potential to get complicated fairly quickly, so I think we should document what syntax and features we want the search to support overall. That will give us a clearer idea of how to design the DSL and interpreter appropriately.
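As a rough illustration of the simplest version of such an interpreter, here is a minimal Python sketch that turns quoted spans into Elasticsearch `match_phrase` clauses and bare words into `match` clauses. The function name `build_query` and the field name `text` are placeholders rather than Reach's actual code or index mapping, and combining the clauses with `must` (all terms required) is an assumption about the desired behaviour.

```python
import re

# Matches either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def build_query(user_input, field="text"):
    """Translate a search string into an Elasticsearch bool query body.

    Quoted spans become match_phrase clauses; bare words become match
    clauses. The field name "text" is a placeholder, not Reach's mapping.
    """
    clauses = []
    for phrase, word in TOKEN_RE.findall(user_input):
        if phrase:
            # Words inside quotes must appear next to each other, in order.
            clauses.append({"match_phrase": {field: phrase}})
        else:
            # A bare word is matched on its own, anywhere in the field.
            clauses.append({"match": {field: word}})
    # Assumption: every clause is required (AND semantics).
    return {"query": {"bool": {"must": clauses}}}
```

With this sketch, `build_query('"Wellcome Collection" measles')` yields one `match_phrase` clause for the quoted phrase and one `match` clause for `measles`. Anything beyond quoting (operators, field filters) needs a proper grammar.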

dd207 commented 4 years ago

We can't predict how all users are going to use the search in 'Browse Policy'.

A lot of research topics contain more than one word, e.g. "sugar tax" from the usability session today.

So we at least need to have functionality for that.

Maybe it's worth us having a conversation about the types of syntax available and how they map to search terms?

Also involving @aoifespenge for info.

jdu commented 4 years ago

It's less about predicting what their needs will be and more about deciding on some standard syntax rules to adopt for a search engine in terms of querying.

For instance, out of the gate we have two possible queries, sugar tax and "sugar tax": one where the search treats the words as a series of distinct terms, and one where it treats the phrase as a single term. But much more functionality can be exposed through this. For instance, what about:

"sugar tax" or "soft drinks"
"sugar tax" and source: "WHO"

What we don't want to do, if the path we want to go down is to support some fairly complex querying, is to only think about it as and when a requirement for a new syntax, operator or other feature rears its head. We want to think now about the types of things we'll need. By understanding where we might have to go with it, we can put down a baseline search system that can be extended with additional functionality as we move along.
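To make the extensibility point concrete, here is a hedged Python sketch of a tokenizer for the kinds of queries mentioned above: quoted phrases, `and`/`or` operators, and a `field: "value"` filter. The token names (`PHRASE`, `OP`, `FIELD`, `TERM`) and the `tokenize` function are made up for illustration. It only splits the input into tokens; operator precedence and translation into an Elasticsearch query would be separate steps.

```python
import re

# One alternative per token type; adding a new operator or syntax form
# means adding an alternative here and a branch in tokenize().
TOKEN_RE = re.compile(
    r'(?:(?P<field>\w+):\s*)?"(?P<phrase>[^"]+)"'  # optional field:, then "phrase"
    r'|(?P<op>\b(?:and|or)\b)'                      # boolean operator
    r'|(?P<word>\S+)',                              # any other bare term
    re.IGNORECASE,
)


def tokenize(query):
    """Split a search string into (type, value, ...) tuples."""
    tokens = []
    for m in TOKEN_RE.finditer(query):
        if m.group("phrase") is not None:
            if m.group("field"):
                # e.g. source: "WHO" restricts the phrase to one field.
                tokens.append(("FIELD", m.group("field").lower(), m.group("phrase")))
            else:
                tokens.append(("PHRASE", m.group("phrase")))
        elif m.group("op"):
            tokens.append(("OP", m.group("op").upper()))
        else:
            tokens.append(("TERM", m.group("word")))
    return tokens
```

For example, `tokenize('"sugar tax" and source: "WHO"')` produces a `PHRASE` token, an `OP` token and a `FIELD` token, which a later parsing stage could combine into a query tree.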

Some other search DSLs to look at:

DuckDuckGo - https://help.duckduckgo.com/duckduckgo-help-pages/results/syntax/
Google - https://www.lifewire.com/advanced-google-search-3482174
KQL - https://docs.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference

Really what I'm saying is that it would help immensely if we had a think and jotted down what we think we might need, similar to the search systems above and others. That will help us to avoid large refactors to the search system further down the line to support new things.