There are many popular repositories on GitHub that contain Java tutorials and samples (for example, leeowenowen/rxjava-examples). Because they are popular, they end up in our dataset. It would be great to find a way to filter them out in the discover-repos.rb script, perhaps by the GitHub topics (tags) that they usually have.
Maybe we could also use some ML/LLM techniques for such filtering, relying on the repository descriptions and the content of their README files?
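As a starting point before any ML, the tag-based idea could be sketched as a plain keyword heuristic. This is only a sketch under assumptions: `sample_repo?`, `SAMPLE_TOPICS`, and `SAMPLE_WORDS` are hypothetical names, the marker lists are guesses that would need tuning, and the repository is assumed to be available as a Hash with `:full_name`, `:description`, and `:topics` keys, which may not match what discover-repos.rb actually holds.

```ruby
require 'set'

# Hypothetical marker lists (assumptions, not taken from discover-repos.rb).
SAMPLE_TOPICS = Set.new(%w[tutorial tutorials example examples sample samples demo demos]).freeze
SAMPLE_WORDS = /\b(tutorials?|examples?|samples?|demos?)\b/i.freeze

# Returns true when the repository looks like a tutorial/sample collection.
# `repo` is assumed to be a Hash with :full_name, :description and :topics.
def sample_repo?(repo)
  topics = Array(repo[:topics]).map { |t| t.to_s.downcase }
  return true if topics.any? { |t| SAMPLE_TOPICS.include?(t) }

  text = [repo[:full_name], repo[:description]].compact.join(' ')
  # Treat '/', '_' and '-' as word separators so "rxjava-examples" is matched.
  SAMPLE_WORDS.match?(text.gsub(%r{[/_-]}, ' '))
end

repos = [
  { full_name: 'leeowenowen/rxjava-examples', description: nil, topics: [] },
  { full_name: 'apache/kafka', description: 'Mirror of Apache Kafka', topics: %w[kafka streaming] }
]

# Keep only the repositories that do not look like tutorials or samples.
kept = repos.reject { |r| sample_repo?(r) }
puts kept.map { |r| r[:full_name] }
```

A heuristic like this will produce false positives (a real library whose description mentions "examples", say), so it may work better as a first pass that flags candidates, with the description/README-based ML/LLM check applied only to the flagged ones.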
This can help, maybe: