A public dataset of 5,574 SMS (text) messages collected for mobile phone spam research, each tagged as legitimate or spam.
Date: Jun 9, 2011
The SMS Spam Collection v.1 is a public set of labeled SMS (text) messages collected for mobile phone spam research.
It consists of a single dataset of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.
The collection is free for all purposes, and it is publicly available at:
This corpus was collected from free or free-for-research sources on the Internet, including the Grumbletext Web site, the NUS SMS Corpus, Caroline Tag's PhD Thesis, and a smaller previous collection (SMS Spam Corpus v.0.1, www.esp.uem.es/jmgomez/smsspamcorpus/, available for historical comparison).
A comprehensive study of this corpus can be found in the following paper, which offers a number of statistics, studies and baseline results for several machine learning methods:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (ACM DOCENG'11), Mountain View, CA, USA, 2011. (Accepted)
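
As a rough illustration of working with the ham/spam tags described above, the following Python sketch counts labels in a handful of lines. It assumes each record is stored as one line of the form label, TAB, message text; that layout is an assumption of this example and is not stated in this description. The function names and the sample messages are hypothetical and are not taken from the corpus.

```python
from collections import Counter

VALID_LABELS = ("ham", "spam")  # the two tags named in the description


def parse_line(line):
    """Split one 'label<TAB>message' line into (label, message).

    The tab-separated layout is an assumption of this sketch.
    """
    label, _, text = line.rstrip("\n").partition("\t")
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label, text


def label_counts(lines):
    """Count how many messages carry each label, skipping blank lines."""
    return Counter(parse_line(line)[0] for line in lines if line.strip())


# Made-up sample lines for illustration only (not from the corpus).
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize, reply now\n",
    "ham\tRunning late, sorry\n",
]

counts = label_counts(sample)
print(counts["ham"], counts["spam"])
```

To apply the same counting to a real copy of the collection, the lines of the downloaded file could be passed to label_counts in place of the sample list, provided the file actually follows the assumed layout.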