yegor256 / cam

Classes and Metrics (CaM): a dataset of Java classes from public open-source GitHub repositories
http://cam.yegor256.com
MIT License

filter out repositories with samples #227

Closed yegor256 closed 5 months ago

yegor256 commented 7 months ago

There are many popular repositories on GitHub that contain Java tutorials and samples (for example, leeowenowen/rxjava-examples). Because they are popular, they end up in our dataset. It would be great to find a way to filter them out in the discover-repos.rb script, perhaps by the GitHub tags that they usually have.
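A minimal sketch of what such a tag-based filter might look like, in Ruby to match discover-repos.rb. Everything here is an assumption made for illustration: the `SAMPLE_WORDS` list, the `looks_like_samples?` helper, and the shape of the repository hash (`:name`, `:description`, `:topics`) are hypothetical and are not taken from the actual script.

```ruby
# Hypothetical keyword list; a real filter would need tuning against the dataset.
SAMPLE_WORDS = %w[
  example examples sample samples tutorial tutorials demo demos
].freeze

# Returns true when the name, description, or tags of a repository
# contain one of the keywords above. The hash layout is assumed.
def looks_like_samples?(repo)
  text = [repo[:name], repo[:description], *Array(repo[:topics])]
         .compact.join(' ').downcase
  # Split on anything that is not a letter or digit, so that
  # "rxjava-examples" yields the token "examples".
  tokens = text.split(/[^a-z0-9]+/)
  !(tokens & SAMPLE_WORDS).empty?
end

# Illustrative input only; descriptions and tags are made up for the sketch.
repos = [
  { name: 'leeowenowen/rxjava-examples', description: nil, topics: [] },
  { name: 'yegor256/cam', description: 'a dataset of Java classes', topics: ['dataset'] }
]

kept = repos.reject { |r| looks_like_samples?(r) }
puts kept.map { |r| r[:name] }
```

Matching on whole tokens rather than substrings avoids rejecting a repository merely because a longer word happens to contain "demo" or "sample". It is still a blunt heuristic and would miss tutorial repositories that do not say so in their name, description, or tags.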

Maybe we can use some ML/LLM techniques for such filtering, relying on the descriptions of the repositories and the content of their README files?

This can help, maybe:

h1alexbel commented 6 months ago

@yegor256 please assign me

yegor256 commented 6 months ago

@h1alexbel go ahead!