Last week we introduced you to the very basics of the Flask framework, and now we will show you how to create your own model for sentiment analysis. To build our training dataset, we will use the well-known Twitter API. There are two prerequisites for using the Twitter API in Python: the tweepy package and your own Twitter API keys.

  1. To install tweepy, follow this link: https://github.com/tweepy/tweepy
  2. To get your own API keys for Twitter, you have to create an application at https://apps.twitter.com/app/new. After you create your application, go to the application management console, click on the Keys and Access Tokens tab and you will find all the required keys there. You will need the following four: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret. We recommend exposing them to Python as environment variables, as shown in the sketch below.
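
A minimal sketch for checking that the keys are available as environment variables before you run the streaming script; the variable names are only a suggestion and must match whatever names you export in your shell:

import os

# hypothetical variable names -- they must match what you export in your shell,
# e.g. export twitter_consumer_key="..."
required = [
    "twitter_consumer_key",
    "twitter_consumer_secret",
    "twitter_access_token",
    "twitter_access_token_secret",
]

missing = [name for name in required if name not in os.environ]
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All Twitter API keys are available.")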
     

STREAMING OF TWEETS

Once you have tweepy installed and have received your API keys, you can connect to the Twitter API and stream tweets directly to your PC with the following code:

#Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import os  # to access environment variables

#Variables that contain the user credentials to access the Twitter API
# HINT: Never put the values of your API keys directly into the code. In Python you can read environment variables with the os module.
access_token = os.environ["twitter_access_token"]
access_token_secret = os.environ["twitter_access_token_secret"]
consumer_key = os.environ["twitter_consumer_key"]
consumer_secret = os.environ["twitter_consumer_secret"]

# We need to create the listener which will print the streamed tweets.
class StdOutListener(StreamListener):

    # this method is called when data comes in
    def on_data(self, data):
        # print the tweet
        print(data)
        return True

    # this method is called when an error happens
    def on_error(self, status):
        print(status)

# this is executed when the script is launched
if __name__ == '__main__':

    # This handles authentication and connects to the Twitter API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # In this line we filter the stream of tweets to track only specific words like 'nba'
    stream.filter(track=['nba'])


The code above only prints the streamed tweets, so now we have to write them to a file instead. For this, we simply adjust the on_data method:
 

def on_data(self, data):
    # parse the raw tweet JSON (requires `import json` at the top of the script)
    data = json.loads(data)
    record = {}
    # extract only the important information
    record["text"] = data["text"]
    record["favorite_count"] = data["favorite_count"]
    record["retweet_count"] = data["retweet_count"]
    record["created_at"] = data["created_at"]
    record["language"] = data["lang"]
    # append the record to a file
    with open("/Users/jurajkapasny/Data/Tweets/tweetstest.txt", 'a') as fp:
        json.dump(record, fp)
    return True


The last step before we run the script is to adjust the filter. We will collect tweets with positive and negative smileys so we can label the sentiment in our training dataset:

stream.filter(track=[':)',':('], languages=["en"])

We can now start our streaming application and let it run for a while to save enough tweets.
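
If you do not want to leave the stream running indefinitely, one option is to stop after a fixed number of tweets; returning False from on_data tells tweepy to disconnect the stream. The subclass and the limit below are our own additions, not part of the original script:

class LimitedStdOutListener(StdOutListener):
    # sketch of a listener that disconnects after collecting a fixed number of tweets

    def __init__(self, limit=10000):  # hypothetical limit, adjust to your needs
        super(LimitedStdOutListener, self).__init__()
        self.limit = limit
        self.count = 0

    def on_data(self, data):
        # write the tweet to the file exactly as before
        StdOutListener.on_data(self, data)
        self.count += 1
        # returning False makes tweepy disconnect the stream
        return self.count < self.limit

# use it in place of StdOutListener:
# l = LimitedStdOutListener(limit=50000)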


PROCESSING TWEETS

First, we need to load our tweets into Python. We will use pandas DataFrames and the json module:

import pandas as pd
import json

data_path = "path-to-your-file"
full_file = open(data_path + 'tweets.txt')
# we read the full content of the file
# the result will be one huge string
x = full_file.read()
# we split the string on the char "}" (which marks the end of a tweet)
# and add the same char back at the end of each substring
parsed_file = [i + "}" for i in x.split("}")]

We have now parsed all the tweets into a list, where each element represents one tweet. Using the code below, we can load this list of JSON objects into a pandas DataFrame.

# for each tweet, we parse the JSON and create a dictionary
data = []
for tweet in parsed_file:
    try:
        data.append(json.loads(tweet))
    except ValueError:
        # skip fragments that are not valid JSON (e.g. the leftover piece from the split)
        continue
# we transform the list of dictionaries into a DataFrame
df = pd.DataFrame(data)
df.head()


REMOVING STOP WORDS

We will use the corpus of English stop words from the nltk package:

from nltk.corpus import stopwords
# if you have never downloaded the corpus before, run nltk.download('stopwords') once
stop = stopwords.words('english')

# remove the stopwords
df['clean_text'] = df['text'].apply(lambda x: [item for item in x.split() if item.lower() not in stop])
# remove the word RT (means retweet)
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if item.find("RT") == -1])
# remove words containing @ (mentions)
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if item.find("@") == -1])
# remove :)
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if item.find(":)") == -1])
# remove :(
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if item.find(":(") == -1])
# remove words longer than 20 characters
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if len(item) < 20])
# remove words with 3 characters or fewer
df["clean_text"] = df["clean_text"].apply(lambda x: [item for item in x if len(item) > 3])


Now we can create an indicator for positive and negative tweets. We will use it later to mark our features:
 

df["target"] = -1
df.loc[(df.text.str.find(":)") != -1) & (df.text.str.find(":(") == -1), "target"] = 1
df.loc[(df.text.str.find(":)") == -1) & (df.text.str.find(":(") != -1), "target"] = 0


Using the target indicator we can create two subsets of our data: one with positive and one with negative tweets.
 

positive = df[df.target == 1][["text","clean_text","target"]]
negative = df[df.target == 0][["text","clean_text","target"]]
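
Before generating features, it is worth a quick check that both classes contain a reasonable number of tweets, since a heavily imbalanced sample can skew the classifier:

print("positive tweets: " + str(len(positive)))
print("negative tweets: " + str(len(negative)))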


We will use these two samples separately during feature generation to mark whether a feature indicates positive or negative sentiment:
 

# function that takes a list of words as an argument and returns a dictionary of features
def word_feats(words):
    return dict([(word, True) for word in words])

negfeats = [(word_feats(negative.clean_text[i]), 'neg') for i in list(negative.index)]
posfeats = [(word_feats(positive.clean_text[i]), 'pos') for i in list(positive.index)]

# we merge the positive and negative features together
trainfeats = negfeats + posfeats


MODELLING

We will use the NaiveBayesClassifier from the nltk package to build a classifier that can identify positive or negative sentiment:

from nltk.classify import NaiveBayesClassifier 
classifier = NaiveBayesClassifier.train(trainfeats)
classifier.show_most_informative_features()

We can check the most informative features using the method show_most_informative_features(); the output will look like this:
 

Most Informative Features
                  resell = True              neg : pos    =   8099.7 : 1.0
            constructive = True              pos : neg    =   2526.6 : 1.0
              bodyguards = True              neg : pos    =    905.7 : 1.0
                    xbox = True              neg : pos    =    733.1 : 1.0
                   steal = True              neg : pos    =    695.1 : 1.0
                 tomorro = True              pos : neg    =    639.7 : 1.0
                     ps4 = True              neg : pos    =    624.2 : 1.0
                arrived! = True              pos : neg    =    578.4 : 1.0


Your values will be different because you will use different tweets. Now we can test whether the model works as it should. Let's create a positive and a negative example:
 

text_pos = "this is amazing"
text_neg = "my dog has died"

classifier.classify(word_feats(text_pos.split()))
classifier.classify(word_feats(text_neg.split()))


When we run the commands above, the classifier returns 'pos' for text_pos and 'neg' for text_neg, which is what we expect.
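
These two examples are only a spot check. For a more systematic evaluation, you can hold out part of the features and measure accuracy with nltk's helper; the shuffling and the 80/20 split below are our own choices, not part of the original workflow:

import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# shuffle so positive and negative examples are mixed before splitting
random.shuffle(trainfeats)
cutoff = int(len(trainfeats) * 0.8)
train_set, test_set = trainfeats[:cutoff], trainfeats[cutoff:]

# train a separate classifier on the training split only and score it on the rest
eval_classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy on held-out tweets: " + str(accuracy(eval_classifier, test_set)))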
 

CONCLUSION

This week we showed you how to build your own model for sentiment analysis. Let us know if your results are what you expected. Next week we will continue with a tutorial on integrating this classifier into the API we started building last week. In the meantime, you can check out online.basecamp.com for more Data Science education.
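
One last practical note: if you want to have the classifier ready for next week's API tutorial, you can serialize it with pickle (a minimal sketch; the file name is arbitrary) so the Flask app can load it without retraining:

import pickle

# save the trained classifier to disk
with open("sentiment_classifier.pkl", "wb") as fp:
    pickle.dump(classifier, fp)

# later, e.g. inside the Flask app, load it back
with open("sentiment_classifier.pkl", "rb") as fp:
    classifier = pickle.load(fp)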

 
