The smileys have been extracted with the following regexes, applied to all Reddit comments up to 2016:

    import re

    # The 'cheeky', 'confused', and 'laughing' patterns require a
    # non-alphanumeric character before the smiley, so letter sequences
    # inside words (e.g. a 'P' or 'XD' in the middle of a token) are not
    # mistaken for smileys; the remaining patterns only require that the
    # smiley is not at the very start of the comment.
    smiley_dict = {
        'cheeky'      : re.compile(r'.*[^a-zA-Z0-9]+(:P|;P)'),
        'confused'    : re.compile(r'.*[^a-zA-Z0-9]+(o_O|O_o|O_O)'),
        'disapproval' : re.compile(r'.+(ಠ_ಠ)', flags=re.UNICODE),
        'disgust'     : re.compile(r'.+(-_-)'),
        'happy'       : re.compile(r'.+(:\)|:-\)|:\)\))'),
        'happy_asian' : re.compile(r'.+(\^\^)'),
        'laughing'    : re.compile(r'.*[^a-zA-Z0-9]+(XD|xD)'),
        'sad'         : re.compile(r'.+(:\(|:-\(|:\(\()'),
        'surprised'   : re.compile(r'.+(:o|:-o)'),
        'shy'         : re.compile(r'.+(:\$)'),
        'wink'        : re.compile(r'.+(;\)|;-\))'),
        'romance'     : re.compile(r'.+(<3|♡)', flags=re.UNICODE),
        'lenny'       : re.compile(r'.+(\( ͡° ͜ʖ ͡°\))', flags=re.UNICODE),
    }

Full sentences (training_removed_smileys_full_sent):

    cheeky confused disapproval disgust happy laughing romance sad surprised wink

Full sentences, tokenized, with all tokens separated by spaces (training_removed_smileys_tokenized):

    cheeky confused disapproval disgust happy laughing romance sad surprised wink

Full sentences, tokenized, and split into train/dev/test sets (training_removed_smileys_tokenized_split):

    cheeky.dev confused.dev disapproval.dev disgust.dev happy.dev laughing.dev romance.dev sad.dev surprised.dev wink.dev
    cheeky.test confused.test disapproval.test disgust.test happy.test laughing.test romance.test sad.test surprised.test wink.test
    cheeky.train confused.train disapproval.train disgust.train happy.train laughing.train romance.train sad.train surprised.train wink.train

The data source is https://files.pushshift.io/reddit/comments/

To generate new data, use the Python scripts in the scripts/ folder.
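
As a rough sketch of how the extraction works, the snippet below streams a pushshift dump and yields labeled comments, stripping the matched smiley from the text. It assumes the dumps are bz2-compressed JSON with one comment object per line and the comment text under the 'body' key; the file name RC_2015-01.bz2 and the first-match labeling are illustrative choices, not the exact procedure of the scripts in scripts/.

    import bz2
    import json
    import re

    # Abbreviated; use the full smiley_dict defined above.
    smiley_dict = {
        'happy': re.compile(r'.+(:\)|:-\)|:\)\))'),
        'sad'  : re.compile(r'.+(:\(|:-\(|:\(\()'),
    }

    def labeled_comments(dump_path):
        """Yield (label, text) pairs for comments matching a smiley pattern.

        Assumes a bz2-compressed pushshift dump with one JSON object per
        line and the comment text under the 'body' key.
        """
        with bz2.open(dump_path, mode='rt', encoding='utf-8') as fh:
            for line in fh:
                body = json.loads(line).get('body', '')
                for label, pattern in smiley_dict.items():
                    m = pattern.match(body)
                    if m:
                        # Crude removal of the matched smiley; the scripts
                        # in scripts/ may clean the text differently.
                        yield label, body.replace(m.group(1), '').strip()
                        break  # keep only the first matching label

    if __name__ == '__main__':
        for label, text in labeled_comments('RC_2015-01.bz2'):
            print(label, text[:60])

Note that '.' does not match newlines by default, so with pattern.match only the first line of a multi-line comment is scanned; the real extraction scripts may handle this differently.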