Open woodj22 opened 7 years ago
Nice, so when we write a URL for a key, we could also add a weight metric? So for a given key we would write a tuple of (URL, weight)?
Simple examples, I think, are the `h1` tags: these would have a higher weight than, say, other tags.
How exactly do we quantify this weight? The algorithm we use is essentially our "ranking" algorithm, which we should define.
Yeah, so you would have a word in the header tag weighing more than a word found in the copyright info in the footer of an HTML page.
Does HTML have an existing, official ranking system with some order-of-magnitude weighting? That would be something to go on. Otherwise we would essentially make our own. I suppose that could be something you could use to train a neural network.
Well, the `h1`, `h2`, `h3`, `h4`, `h5`, `h6` tags are supposed to be used for headings, but web developers are not bound to create their web pages following these rules. It is possible to make a `span` containing a page heading look just like an `h1` would, using purely CSS or JavaScript. In fact, web developers are encouraged to use HTML tags properly so that search engines can better assess the content of their pages.

This makes it difficult to assess the importance or "semantic" meaning of pages. But, to limit our scope, I think we should assume web pages are well written. Certainly the markup on http://bb.co.uk looks fairly decent.
So we should come up with some basic heuristics to determine ranking. A simple start would be to rank `h1`-`h6` accordingly; anything other than these would then rank lower.

There are other tags we could look at too: `header`, `section`, `blockquote`, `nav`. But for a proof of concept I think we should initially only concern ourselves with `h1`-`h6`.

What do you think?
I think to start with it should be `h1`-`h6` ranked by some basic heuristics; they are nice and already ranked in a simple order. Then we can start adding other details. Of course, as soon as we add other tags it will get more complicated, and the ranking will reflect our opinion of where different tags sit rather than a universal standard.
I have just done a quick search for ranking HTML tags and there's a lot of academic material on it. Bloody white papers but no actual implementation details, just graphs and numbers. It will not take much more to implement this, and for the basics I can just read a ranking from an array as the scraper processes each word. Maybe it would even be best just to scrape the tags that we have weighted; then we won't have extra information that is un-ranked and, I suppose, null data.
I think this functionality of the scraper is quite important and will add a lot of depth when searching the db. Adding heuristic algorithms will be new territory for me, but it's something I am super keen on.

Have you got an idea of the algorithm, or a way of 'weighting' the tags?
> Maybe it would even be best just to scrape the tags that we have weighted then we won't have extra information that is un-ranked and i suppose null data.
This is a really interesting idea, and I like it especially because it somewhat simplifies things. However, it will drastically reduce the number of database writes we need to make, which sounds good but makes it less useful for capacity testing: we would have to scrape more pages in the same amount of time to produce the same number of writes.
> have you got an idea of the algorithm or way of 'weighting' the tags ?
Here's a simple idea which could be a starting point:

First we define the metric `weight`, a measure of how important a word is with respect to the general intention of the article. `weight` ranges from -Infinity to +Infinity, where a higher value means the word is more important. This open-ended range leaves room for future changes to the `weight` metric: we can later come up with heuristics that make words more important as well as heuristics that make words less important.

For now, let's use our basic `h1`-`h6` heuristic. Here is how I propose it could work. For reference, I will use `hn` to refer to a generic heading tag, which could be any of `h1`-`h6`.
Let `weight` be a function which takes a word and returns its weight. For a general word `w`, `weight(w)` can be calculated as follows:

```
If TightestHTag(w) = h1   then weight(w) = h1Weight
If TightestHTag(w) = h2   then weight(w) = h2Weight
...
If TightestHTag(w) = h6   then weight(w) = h6Weight
If TightestHTag(w) = None then weight(w) = defaultWeight
```
Where `h1Weight`..`h6Weight` are constants such that `h1Weight > h2Weight > h3Weight > h4Weight > h5Weight > h6Weight > defaultWeight`. For now, the actual values are arbitrary, since they are not combined with anything else; we just need to choose them so that the above inequality holds.
`TightestHTag(w)` is defined to be the `hn` tag which is the nearest ancestor of `w` in the HTML document tree.

Below are some example HTML documents showing the behaviour of `TightestHTag` for `w = tiger`. I've tried to make different examples to shed light on some of the different cases.
`TightestHTag(tiger) = None`

```html
<html>
tiger
</html>
```

`TightestHTag(tiger) = hn`

```html
<html>
<hn>tiger</hn>
</html>
```

`TightestHTag(tiger) = h3`

```html
<html>
<h1><h3>tiger</h3></h1>
</html>
```

`TightestHTag(tiger) = h1`

```html
<html>
<h2><h1>tiger</h1></h2>
</html>
```

`TightestHTag(tiger) = h2`

```html
<html>
<h1>Hello <h2>Tiger</h2></h1>
</html>
```

`TightestHTag(tiger) = h2`

```html
<html>
<h2><p>Some text here <span>Tiger</span></p></h2>
</html>
```
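To make the spec above concrete, here is a minimal Python sketch assuming we use the standard library's `html.parser`. The names mirror the spec (`TightestHTag`, `h1Weight`, `defaultWeight`), but the concrete weight values are arbitrary placeholders, not a decided design:

```python
from html.parser import HTMLParser

# Placeholder constants satisfying h1Weight > h2Weight > ... > h6Weight > defaultWeight.
H_WEIGHTS = {"h1": 60, "h2": 50, "h3": 40, "h4": 30, "h5": 20, "h6": 10}
DEFAULT_WEIGHT = 0


class TightestHTagParser(HTMLParser):
    """Records the nearest enclosing h1-h6 tag for each word in the document."""

    def __init__(self):
        super().__init__()
        self.h_stack = []   # currently open h1-h6 tags, innermost last
        self.tightest = {}  # lowercased word -> tightest h tag name, or None

    def handle_starttag(self, tag, attrs):
        if tag in H_WEIGHTS:
            self.h_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in H_WEIGHTS and self.h_stack and self.h_stack[-1] == tag:
            self.h_stack.pop()

    def handle_data(self, data):
        # Each word seen here is tagged with the innermost open heading (if any).
        current = self.h_stack[-1] if self.h_stack else None
        for word in data.split():
            self.tightest[word.lower()] = current


def tightest_h_tag(html, word):
    """TightestHTag(w): nearest h1-h6 ancestor of the word, or None."""
    parser = TightestHTagParser()
    parser.feed(html)
    return parser.tightest.get(word.lower())


def weight(html, word):
    """weight(w) per the spec: hnWeight if inside an hn tag, else defaultWeight."""
    tag = tightest_h_tag(html, word)
    return H_WEIGHTS.get(tag, DEFAULT_WEIGHT)


print(weight("<html><h2><h1>tiger</h1></h2></html>", "tiger"))  # -> 60 (h1Weight)
print(weight("<html>tiger</html>", "tiger"))                    # -> 0 (defaultWeight)
```

One simplification worth noting: if a word appears several times, this sketch keeps the tag of its last occurrence; a real implementation would probably keep the maximum weight across occurrences.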
What I have proposed above is more a specification than an implementation. In reality, I think this could be implemented with a simple CSS query using the HTML parsing library.

What do you think of this weighting algorithm as a starting point? Eventually, I imagine the `weight` function could also take into account the context of where the word appears on the page, along with other surrounding tags, but for now this is a start.
Note that any word which is not in an `hn` tag has weight `defaultWeight`. We should also make some basic effort to filter out non-important words, like `the`, `is`, `here`, `some`, etc. If these appear in an `hn` tag it doesn't really make them any more important. Anyway, this detail could be added later. For now it's fair to expect that actual search queries are made of single words, like `tiger`, and are not full sentences containing non-important words like the ones above.
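The stop-word filtering could be sketched like this. Note the word set here is purely illustrative; a real implementation would use a proper stop-word list:

```python
# Illustrative stop-word set -- not a real, complete list.
STOP_WORDS = {"the", "is", "here", "some", "a", "an", "and", "of", "to"}


def important_words(words):
    """Drop non-important (stop) words before weighting/indexing."""
    return [w for w in words if w.lower() not in STOP_WORDS]


print(important_words(["The", "tiger", "is", "here"]))  # -> ['tiger']
```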
Sweet, I can definitely build the scraper with this in mind, and I will make sure to still scrape all words but add in a default metric value.
A feature of the scraper could be to assess how strong a word is depending on which HTML tag it was found in. This could be sent to the db to help it weight how strongly to index the word.