Somehow torvalds/linux is in Fronterra, next to JS projects, awesome-X lists, and frontend checklists.
Either kernel hackers unexpectedly love frontend, or, more likely, the people who write the code don't overlap much with the people who star GitHub projects!
I wonder if code embeddings might have been a better way to organize the projects, although that's probably infeasible given the resources required to download every file and compute embeddings for it.
People have been critiquing the collaborative filtering aspect of this work vs content analysis ("[why use stars instead of code similarity]"), but there's something elegant about the simplicity of using fewer priors here.
TF-IDF weighting could be applied to the star-feature matrix too. Document = GitHub repo. Term = name of a user who starred it.
Thus, users who overstar simply count for less when computing similarities.
This would mitigate the phenomenon of massively popular GitHub repos being clustered together because of folks who blithely star the most well-known stuff.
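To make the TF-IDF idea concrete, here's a rough sketch in plain Python. Everything in it is an assumption for illustration: the users and star data are made up, tf is just binary (starred or not), idf is the unsmoothed log(N / df), and cosine similarity is my pick for comparing repos, not necessarily what the project uses.

```python
import math

# Hypothetical star data: repo -> set of users who starred it.
# "alice" stars everything; "bob" and "carol" are more selective.
stars = {
    "linux": {"alice", "bob"},
    "git":   {"alice", "bob"},
    "react": {"alice", "carol"},
    "vue":   {"alice", "carol"},
}

def tfidf_vectors(stars):
    """Weight each (repo, user) star by idf = log(N_repos / repos starred by user)."""
    n_repos = len(stars)
    # Document frequency: how many repos each user has starred.
    df = {}
    for users in stars.values():
        for user in users:
            df[user] = df.get(user, 0) + 1
    # Binary tf (1 if starred) times idf; a user who stars everything gets weight 0.
    return {
        repo: {user: math.log(n_repos / df[user]) for user in users}
        for repo, users in stars.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[user] for user, w in a.items() if user in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def raw_cosine(users_a, users_b):
    """Unweighted cosine on the binary star sets, for comparison."""
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

vectors = tfidf_vectors(stars)
linux_react_raw = raw_cosine(stars["linux"], stars["react"])
linux_react_tfidf = cosine(vectors["linux"], vectors["react"])
linux_git_tfidf = cosine(vectors["linux"], vectors["git"])
```

In this toy data the only thing linux and react share is alice, who stars everything. Unweighted, that alone gives them a similarity of 0.5; with the idf weighting her stars drop to zero weight, so linux and react no longer look related while linux and git still do.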