The smileys have been extracted with the following regexes, applied to all Reddit comments up to 2016:

    import re

    # The 'cheeky', 'confused', and 'laughing' patterns require a
    # non-alphanumeric character before the smiley, so letter sequences
    # inside words (e.g. a 'P' or 'XD' in the middle of a token) are not
    # mistaken for smileys; the remaining patterns only require that the
    # smiley is not at the very start of the comment.
    smiley_dict = {
        'cheeky'      : re.compile(r'.*[^a-zA-Z0-9]+(:P|;P)'),
        'confused'    : re.compile(r'.*[^a-zA-Z0-9]+(o_O|O_o|O_O)'),
        'disapproval' : re.compile(r'.+(ಠ_ಠ)', flags=re.UNICODE),
        'disgust'     : re.compile(r'.+(-_-)'),
        'happy'       : re.compile(r'.+(:\)|:-\)|:\)\))'),
        'happy_asian' : re.compile(r'.+(\^\^)'),
        'laughing'    : re.compile(r'.*[^a-zA-Z0-9]+(XD|xD)'),
        'sad'         : re.compile(r'.+(:\(|:-\(|:\(\()'),
        'surprised'   : re.compile(r'.+(:o|:-o)'),
        'shy'         : re.compile(r'.+(:\$)'),
        'wink'        : re.compile(r'.+(;\)|;-\))'),
        'romance'     : re.compile(r'.+(<3|♡)', flags=re.UNICODE),
        'lenny'       : re.compile(r'.+(\( ͡° ͜ʖ ͡°\))', flags=re.UNICODE),
    }

Full sentences (training_removed_smileys_full_sent):

    cheeky confused disapproval disgust happy laughing romance sad surprised wink

Full sentences, tokenized, with all tokens separated by spaces (training_removed_smileys_tokenized):

    cheeky confused disapproval disgust happy laughing romance sad surprised wink

Full sentences, tokenized, and split into train/dev/test sets (training_removed_smileys_tokenized_split):

    cheeky.dev confused.dev disapproval.dev disgust.dev happy.dev laughing.dev romance.dev sad.dev surprised.dev wink.dev
    cheeky.test confused.test disapproval.test disgust.test happy.test laughing.test romance.test sad.test surprised.test wink.test
    cheeky.train confused.train disapproval.train disgust.train happy.train laughing.train romance.train sad.train surprised.train wink.train

The data source is https://files.pushshift.io/reddit/comments/

To generate new data, use the Python scripts in the scripts/ folder.
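
As a rough sketch of how the extraction works, the snippet below streams a pushshift dump and yields labeled comments, stripping the matched smiley from the text. It assumes the dumps are bz2-compressed JSON with one comment object per line and the comment text under the 'body' key; the file name RC_2015-01.bz2 and the first-match labeling are illustrative choices, not the exact procedure of the scripts in scripts/.

    import bz2
    import json
    import re

    # Abbreviated; use the full smiley_dict defined above.
    smiley_dict = {
        'happy': re.compile(r'.+(:\)|:-\)|:\)\))'),
        'sad'  : re.compile(r'.+(:\(|:-\(|:\(\()'),
    }

    def labeled_comments(dump_path):
        """Yield (label, text) pairs for comments matching a smiley pattern.

        Assumes a bz2-compressed pushshift dump with one JSON object per
        line and the comment text under the 'body' key.
        """
        with bz2.open(dump_path, mode='rt', encoding='utf-8') as fh:
            for line in fh:
                body = json.loads(line).get('body', '')
                for label, pattern in smiley_dict.items():
                    m = pattern.match(body)
                    if m:
                        # Crude removal of the matched smiley; the scripts
                        # in scripts/ may clean the text differently.
                        yield label, body.replace(m.group(1), '').strip()
                        break  # keep only the first matching label

    if __name__ == '__main__':
        for label, text in labeled_comments('RC_2015-01.bz2'):
            print(label, text[:60])

Note that '.' does not match newlines by default, so with pattern.match only the first line of a multi-line comment is scanned; the real extraction scripts may handle this differently.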