xdvom03 / klaus

Bayesian text classification of websites in a nested class system
Creative Commons Zero v1.0 Universal

Malicious links #76

Open xdvom03 opened 3 years ago

xdvom03 commented 3 years ago

The bot has followed links that exist only to confuse bots and are hidden from human users. Case study: https://ucarehi.com/telemedicine-services-clinic-hawaii-honolulu/ contains links, hidden in the page markup, that lead to irrelevant sites. This appears to be the result of a hack, and such links are often obfuscated.

There isn't that much we can do about this (and nothing will save us if the spam links are visible and thus indistinguishable from regular links), but we could at least try not to follow hidden links.
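A minimal sketch of what "not following hidden links" could look like, using only Python's standard `html.parser`. This is not Backbot's code: the class name `VisibleLinkExtractor`, the helper `split_links`, and the choice of hiding signals (an inline `display:none` or `visibility:hidden` style, or the `hidden` attribute, on the link or any ancestor) are assumptions made for illustration. It cannot see hiding done through external stylesheets, CSS classes, or scripts, so it would only catch the simplest cases.

```python
import re
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumed hiding signals: inline styles only.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


class VisibleLinkExtractor(HTMLParser):
    """Sorts <a href> targets into visible and hidden lists (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # (tag, is_hidden) for every currently open element
        self.visible_links = []
        self.hidden_links = []

    def _inside_hidden(self):
        return any(hidden for _, hidden in self.stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        hidden = "hidden" in attrs or bool(HIDDEN_STYLE.search(style))
        if tag == "a" and attrs.get("href"):
            if hidden or self._inside_hidden():
                self.hidden_links.append(attrs["href"])
            else:
                self.visible_links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def split_links(html):
    """Return (visible_links, hidden_links) for one page of HTML."""
    parser = VisibleLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.visible_links, parser.hidden_links
```

The crawler would then queue only the first list. The second list could be logged, since a page with many hidden outbound links is itself a hint that the site has been compromised.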

Since Backbot classes every site it moves to, a class for spam links could let us backtrack if it seems that the link clicked was malicious. Just beware of possible collateral damage.

xdvom03 commented 3 years ago

Another form of maliciousness: http://www.kidscountonme.org/category/stock has a link (next to the footer, visible only in the source code) to http://seotemplates.net/blog/wordpress-theme/exray-wordpress-theme/. This leads to a wild-goose chase through "hotels you might like" links across hundreds of hotels, with no way out except by exhausting the hotel supply and then backtracking the whole way. I guess the whole point of the thing is SEO. Not sure what can be done about it. Classification won't help much: the same trap can just as well be built from a ring of arbitrary sites. There is something very off about the chase, though: the bot suddenly finds workable links dozens of domains in a row. That could serve as a trigger for some kind of SEO check, but the details are hard to pin down.
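One way the "dozens of domains in a row" symptom might be turned into a trigger, as a rough sketch. It assumes the symptom can be read as a long run of consecutive hops that each land on a host the bot has never seen before. The function names and the threshold of 24 are made up for illustration and would need tuning against real crawl logs.

```python
from urllib.parse import urlsplit


def new_domain_streak(urls):
    """Longest run of consecutive hops that each land on a never-before-seen host."""
    seen = set()
    longest = current = 0
    for url in urls:
        host = urlsplit(url).hostname
        if host is None or host in seen:
            current = 0          # unparseable or familiar host breaks the run
            continue
        seen.add(host)
        current += 1
        longest = max(longest, current)
    return longest


def looks_like_link_ring(urls, threshold=24):
    """Flag a crawl path whose streak of fresh hosts reaches the (assumed) threshold."""
    return new_domain_streak(urls) >= threshold
```

A flagged path does not prove anything by itself, since a directory page can legitimately fan out to many new domains. It would be a cue to run a closer check or to backtrack early rather than a verdict.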

xdvom03 commented 3 years ago

The focused bot will not follow these any further (since it now considers the whole queue), unless it either starts inside the trap and has no other option, or the spam happens to be exactly what it is looking for. The unfocused crawler does not fare as well.
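To illustrate why considering the whole queue helps, here is a small best-first frontier sketch. It is not the focused bot's actual implementation: the `Frontier` class and the relevance scores are hypothetical, and the sketch only assumes that every discovered link gets some score and that the crawler always visits the best-scoring link seen so far.

```python
import heapq


class Frontier:
    """Best-first queue of links; the highest-scoring link is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = 0   # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://trap.example/hotel-1", 0.05)   # irrelevant spam link
frontier.push("http://good.example/topic", 0.80)     # on-topic link found earlier
frontier.push("http://trap.example/hotel-2", 0.04)
first = frontier.pop()
```

Here `first` is the on-topic link, so the trap links only get visited once nothing better is left in the queue. That matches the two failure cases above: a queue that contains nothing but trap links, or trap links that score highly for the target class.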