xdvom03 / klaus

Bayesian text classification of websites in a nested class system
Creative Commons Zero v1.0 Universal

Recognise URLs with nearly identical content #18

Closed by xdvom03 3 years ago

xdvom03 commented 4 years ago

This would prevent accidentally classing the same page twice because of http vs. https, a trailing slash, or moved content.
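A minimal sketch of how this could work, not taken from the klaus codebase. The function names `normalize_url` and `content_fingerprint` are hypothetical. The first one reduces a URL to a key that ignores the scheme, the port, the fragment, and a trailing slash. The second one hashes the page text, which is one possible way to catch moved content that lives under a different URL.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Return a comparison key for a URL (hypothetical helper).

    Ignores the scheme (http/https), the port, the fragment, and any
    trailing slashes on the path. The query string is kept as-is.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key


def content_fingerprint(text: str) -> str:
    """Hash page text after collapsing whitespace and lowercasing.

    Two URLs with the same fingerprint hold identical text, which covers
    the "moved content" case. Nearly identical text is NOT detected.
    """
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this, `normalize_url("http://git-scm.com/book/en/v2/")` and `normalize_url("https://git-scm.com/book/en/v2")` give the same key, so the second URL could be rejected as already classed.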

xdvom03 commented 3 years ago

Relatedly, add domain blacklists so that a domain does not become tied to a class.

xdvom03 commented 3 years ago

Here is an example of boilerplate text for a Git book (e.g. https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository):

about branching and merging small and fast distributed data assurance staging area free and open source trademark documentation reference book videos external links downloads gui clients logos community this book is available in english full translation available in azrbaycan dili deutsch espaol franais nederlands slovenina tagalog partial translations available in etina polski translations started for indonesian italiano bahasa melayu portugus brasil portugus portugal svenska trke the source of this book is hosted on github patches suggestions and comments are welcome chapters 1 getting started 1 1 about version control 1 2 a short history of git 1 3 what is git 1 4 the command line 1 5 installing git 1 6 first time git setup 1 7 getting help 1 8 summary 2 git basics 2 1 getting a git repository 2 2 recording changes to the repository 2 3 viewing the commit history 2 4 undoing things 2 5 working with remotes 2 6 tagging 2 7 git aliases 2 8 summary 3 git branching 3 1 branches in a nutshell 3 2 basic branching and merging 3 3 branch management 3 4 branching workflows 3 5 remote branches 3 6 rebasing 3 7 summary 4 git on the server 4 1 the protocols 4 2 getting git on a server 4 3 generating your ssh public key 4 4 setting up the server 4 5 git daemon 4 6 smart http 4 7 gitweb 4 8 gitlab 4 9 third party hosted options 4 10 summary 5 distributed git 5 1 distributed workflows 5 2 contributing to a project 5 3 maintaining a project 5 4 summary 6 github 6 1 account setup and configuration 6 2 contributing to a project 6 3 maintaining a project 6 4 managing an organization 6 5 scripting github 6 6 summary 7 git tools 7 1 revision selection 7 2 interactive staging 7 3 stashing and cleaning 7 4 signing your work 7 5 searching 7 6 rewriting history 7 7 reset demystified 7 8 advanced merging 7 9 rerere 7 10 debugging with git 7 11 submodules 7 12 bundling 7 13 replace 7 14 credential storage 7 15 summary 8 customizing git 8 1 git configuration 8 2 git attributes 8 3 git 
hooks 8 4 an example git enforced policy 8 5 summary 9 git and other systems 9 1 git as a client 9 2 migrating to git 9 3 summary 10 git internals 10 1 plumbing and porcelain 10 2 git objects 10 3 git references 10 4 packfiles 10 5 the refspec 10 6 transfer protocols 10 7 maintenance and data recovery 10 8 environment variables 10 9 summary a1 appendix a git in other environments a1 1 graphical interfaces a1 2 git in visual studio a1 3 git in visual studio code a1 4 git in intellij pycharm webstorm phpstorm rubymine a1 5 git in sublime text a1 6 git in bash a1 7 git in zsh a1 8 git in powershell a1 9 summary a2 appendix b embedding git in your applications a2 1 command line git a2 2 libgit2 a2 3 jgit a2 4 go git a2 5 dulwich a3 appendix c git commands a3 1 setup and config a3 2 getting and creating projects a3 3 basic snapshotting a3 4 branching and merging a3 5 sharing and updating projects a3 6 inspection and comparison a3 7 debugging a3 8 patching a3 9 email a3 10 external systems a3 11 administration a3 12 plumbing commands 2nd edition

This is copied in the folder dozens of times. It should really only be included once.

Sometimes, boilerplate text may be all over a site, but in this case (and, possibly, many others), it is always in this long blob of text. Such a case should be filterable.

xdvom03 commented 3 years ago

A naive overlap test detects most of this. The problem is that we don't want to remove everything that overlaps - that would go against the point of the program (pattern-finding). So where is the limit for a repeated phrase?
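For reference, a naive overlap test could look roughly like the sketch below. It is an illustration under assumptions, not the code that ended up in klaus. The names `find_boilerplate` and `strip_boilerplate` are made up, and the two parameters `n` (phrase length in words) and `min_share` (fraction of pages a phrase must appear on) are exactly the "limit" in question: short or rare repeated phrases are kept as patterns, long and widespread ones are treated as boilerplate.

```python
from collections import Counter


def shingles(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def find_boilerplate(docs, n=5, min_share=0.5):
    """Return the n-word phrases found in at least min_share of docs.

    Each phrase is counted once per document, so a phrase repeated many
    times on a single page does not count as boilerplate by itself.
    """
    counts = Counter()
    for doc in docs:
        counts.update(set(shingles(doc.split(), n)))
    cutoff = min_share * len(docs)
    return {phrase for phrase, count in counts.items() if count >= cutoff}


def strip_boilerplate(doc, boilerplate, n=5):
    """Drop every word that lies inside a boilerplate phrase.

    n must match the value used in find_boilerplate.
    """
    words = doc.split()
    covered = [False] * len(words)
    for i, phrase in enumerate(shingles(words, n)):
        if phrase in boilerplate:
            for j in range(i, i + n):
                covered[j] = True
    return " ".join(w for w, hit in zip(words, covered) if not hit)
```

Raising `n` or `min_share` makes the filter more conservative. A long chapter list like the one above would still be caught with fairly strict settings, while a two-word phrase shared by a few pages would survive.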

xdvom03 commented 3 years ago

One obvious problem is that once we remove boilerplate from the training data, the text that crawlers fetch will still contain it. Many sites have privacy policy links in a footer. These are now mostly ignored, but they might become a false clue towards legal if boilerplate is removed from training data.

xdvom03 commented 3 years ago

Basic boilerplate detection added. Might need some tweaking for non-universal boilerplate.