Given a set of source code files collected from various open source projects, how well can unseen source code files from the same set of open source projects be classified?
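To make the question concrete, here is a minimal sketch of one possible setup. It assumes that the class label is the project a file came from and that a simple token-based Naive Bayes model is used. Neither assumption is stated in the problem description. The names `tokenize`, `train`, `classify`, and `accuracy`, the labels `project_a` and `project_b`, and the toy snippets are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers/keywords, or single punctuation characters (digits are ignored).
TOKEN = re.compile(r"[A-Za-z_]\w*|[^\s\w]")


def tokenize(source):
    """Split source text into a flat list of crude lexical tokens."""
    return TOKEN.findall(source)


def train(labelled_files):
    """Build per-label token counts from (label, source_text) pairs."""
    token_counts = defaultdict(Counter)
    file_counts = Counter()
    for label, text in labelled_files:
        token_counts[label].update(tokenize(text))
        file_counts[label] += 1
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, file_counts, vocab


def classify(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    token_counts, file_counts, vocab = model
    total_files = sum(file_counts.values())
    tokens = [tok for tok in tokenize(text) if tok in vocab]
    best_label, best_score = None, -math.inf
    for label in file_counts:
        counts = token_counts[label]
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(file_counts[label] / total_files)
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, held_out):
    """Fraction of held-out (label, source_text) pairs classified correctly."""
    correct = sum(classify(model, text) == label for label, text in held_out)
    return correct / len(held_out)


# Toy data standing in for files collected from two projects (hypothetical).
training = [
    ("project_a", "def parse(node):\n    return node.value\n"),
    ("project_a", "def visit(node):\n    return parse(node)\n"),
    ("project_b", "int main(void) {\n    printf(\"hi\");\n    return 0;\n}\n"),
    ("project_b", "void log_msg(char *s) {\n    printf(s);\n}\n"),
]
unseen = [
    ("project_a", "def walk(node):\n    return node.children\n"),
    ("project_b", "int run(void) {\n    printf(\"x\");\n}\n"),
]

model = train(training)
print(accuracy(model, unseen))
```

"How well" is measured here as plain accuracy on files held out from training. Other metrics, such as per-project precision and recall, may fit better when the projects contribute very different numbers of files.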
Possible real-world applications:
- Protecting intellectual property
- Data Loss Prevention (DLP)
- Automatic categorization of source code repositories
www.kaggle.com/c/emc-data-science