Category Archives: twitter

Collecting Twitter Data: Getting Started

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

The R code used in this post can be found on my git-hub.

After getting R, Python or whatever programming language you prefer, the next steps will require API keys from Twitter. This requires you have to have a Twitter account and to create an ‘application’ using the following steps.

Getting API Keys

  1. Sign into Twitter
  2. Go to and create a new application

    twitter register app

  3. Click on “Keys and Access Tokens” on the your application’s page

    twitter access keys

  4. Get and copy your Consumer Key, Consumer Secret Key, Access Token, and Secret Token

    twitter oauth screen

Those four complex strings of case-sensitive letters and numbers are your API keys. Keep them secret, because they are more powerful than your Twitter password. If you are wondering what the keys are for, they are really two pairs of keys consisting of secret and non-secret, and this is done for security purposes. The consumer key pair authorizes your program to use the Twitter API, and the access token essentially signs you in as your specific Twitter user account. This framework makes more sense in the context of third party Twitter developers like TweetDeck where the application is making API calls but it needs access to each user’s personal data to write tweets, access their timelines, etc.

Getting Started in R

If you don’t have a preference for a certain programming environment, I recommend that people with less programming experience start with R for tweet scraping since it is simpler to collect and parse the data without having to understand much programming. The Streaming API authentication I use in R is slightly more complicated than what I normally do with Python. If you feel comfortable with Python, I recommend using the tweepy package for Python. It’s more robust than R’s streamR but has a steeper learning curve.

First, like most R scripts, the libraries need to be installed and called. Hopefully you already installed if not the install.packages commands are commented out for reference.

The first part of the actually code for a Twitter scraper will use the API keys obtained from Twitter’s development website. You insert your personal API keys where the **KEY** is in the code. For this method of authentication in R it only uses the CONSUMER KEY and CONSUMER SECRET KEY and it gets your ACCESS TOKEN from a PIN number from using an web address you open in your browser.

After this is executed properly, R will give you output in your console that looks like the following:

twitter handshake

  1. Copy the https:// URL into a browser
  2. Log into Twitter if you haven’t already
  3. Authorize the application
  4. Then you’ll get the PIN number to copy into the R console and hit Enter

    twitter pin

Now that the authentication handshake was completed, the R program is able to use those credentials to make API calls. A basic call using the Streaming API is the filterStream() function in the streamR package. This will connected you to Twitter’s stream for a designated amount of time and/or for a certain number of tweets collected.

The track parameter tells Twitter want you want to ‘search’ for. It’s technically not really a search since you are filtering the Twitter stream and not searching…technically. Twitter’s dev site has a nice explanation of all the Streaming APIs parameters. For example, the track parameter is not case sensitive, it will treat hashtags and regular words the same, and it will find tweets with any of the words you specify, not just when all the words are present. The track parameter ‘apple, twitter’ will find tweets with ‘apple’, tweets with ‘twitter’, and tweets with both.

The filterStream() function will stay open as long as you tell it to in the timeout parameter [in seconds], so don’t set it too long if you want your data quickly. The data Twitter returns to you is a .json file, which is a JavaScript data file.

twitter json

The above is an excerpt from a tweet that’s been formatted to be easier to read. Here’s a larger annotated version of a tweet JSON file. Thes are useful in some contexts of programming, but for basic use in R, Tableau, and Excel it’s gibberish.

There are a few different ways to parse the data into something useful. The most basic [and easiest] is to use the parseTweets() function that is also in streamR.

This is a pretty simple function that takes the JSON file that filterStream() produced, reads it, and creates a wide data frame. The data frame can be pretty daunting, since there is so much metadata available.

twitter data frame

You might notice some of the ?-mark characters. These are text encoding errors. This is one of the limitations of using R to parse the tweets, because the streamR package doesn’t handle utf-8 characters well in its functions. This means that R can only read basic A-Z characters and can’t translate emoji, foreign languages, and some punctuation. I’d recommend using something like MongoDB to store tweets or create your own parser if you want be able to use these features of the text.

Quick Analysis

This tutorial focuses on how to collect Twitter data and not the intricacies of analyzing it, but here are a few simple examples of how you can use the tweet data frame.

The different columns within the data frame can be called separately. Calling the created_at field gives you the tweet’s time stamp, and the text field is the content of the tweet. Generally, there will be some correlation between the number of followers a person has [followers_count] and the number of accounts a person follows [friends_count]. When I ran my script I got a correlation of about 0.25. The scatter plot will be heavily impacted by the Justin Biebers of the world where they have millions of followers but follow only a few themselves.


This is quick-start tutorial to collecting Twitter data. There are plenty of resources to be found on Twitter’s developer site and all over the internet. While this tutorial is useful to learn the basics of how the OAuth process works and how Twitter returns data, I recommend using a tool like Python and MongoDB which can give you greater flexibility for analysis. Collecting tweets is the foundation of using Twitter’s API, but you can also get user objects, trends, or accomplish anything that you can in a Twitter client with the REST and Search APIs.


The R code used in this post can be found on my git-hub.

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Introduction

Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter data is a great exercise in data science and can provide interesting insights in how people behave on the social media platform. Below is an overview of the steps to build a Twitter analysis from scratch. This tutorial will go through several steps to arrive at being able to analyze Twitter data.

  1. Overview of Twitter API does
  2. Get R or Python
  3. Install Twitter packages
  4. Get Developer API Key from Twitter
  5. Write Code to Collect Tweets
  6. Parse the Raw Tweet Data [JSON files]
  7. Analyze the Tweet Data


Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data from their site, I want to give a general overview of what the Twitter API is and isn’t capable of doing. First, data collection on Twitter doesn’t necessarily produce a representative sample to make inferences about the general population. And people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what’s trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] This means that most Twitter analysis has to be planned beforehand or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

  1. Authenticate with OAuth
  2. Make API call
  3. Receive JSON file back
  4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.

The API call is an http call that has the parameters incorporated into the URL like
This Streaming API call is asking to connect to Twitter and tracks the keyword ‘twitter’. Using prebuilt software packages in R or Python will hide this step from you the programmer, but these calls are happening behind the scenes.

JSON files are the data structure that Twitter returns. These are rather comprehensive with the amount of data, but hard to use without them being parsed first. Some of the software packages have built-in parsers or you can use a NoSQL database like MongoDB to store and query your tweets.

Get R or Python

While there are many different programing languages to interface with the API, I prefer to use either Python or R for any Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don’t have either of these, I’d recommend installing then learning to do some basic things before tackling Twitter data.

Download R:
R Studio: [optional]

Download Python:

Install Twitter Packages

The easiest way to access the API is to install a software package that has prebuilt libraries that makes coding projects much simpler. Since this tutorial will primarily be focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program to help you organize your projects like PyCharm, but that is beyond the scope of this tutorial.

[in the R environment]

[in the terminal, assuming you have pip installed]


Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

SOTU Title

2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume of tweets and the sentiment of reactions were the highest during the latter half of the speech when Obama made the remark “I should know; I won both of them” referring to the 2008 & 2012 elections he won.

2015 State of the Union Tweet Volume

Throughout the day before the speech, there weren’t many tweets and they tended to be neutral. These tweets typically contained links to news articles previewing the SOTU address or reminders about the speech. Both of these types of tweets are factual but bland when compared to the commentary and emotional reaction that occurred during the SOTU address itself. The huge spike in Twitter traffic didn’t happen until the President walked onto the House floor which was just before 9:10 PM. When the speech started, the sentiment/number of positive words per tweet increased to about 0.3 positive words/tweet suggesting that the SOTU address was well received. [at least to the people who bothered to tweet]

Around 7:45-8:00 PM the largest negative sentiment occurred during the day. I’ve looked back through the tweets from that time and couldn’t find anything definitive that happened to cause that. My conjecture would be that is when news coverage started [and strongly opinionated] people started watching the news and began to tweet.

The highest sentiment/number of positive words came during the 15-minute polling window where the President quipped about winning two elections. Unfortunately, that sound bite didn’t make a great hashtag, so it didn’t show up else where in my analysis. However, there are many news articles and discussion about that off-the-cuff remark, and it will probably be the most memorable moment from the SOTU address.


Once again [Emoji Popularity Link], the crying-my-eyes-out emoji proved to be the most used emoji in SOTU tweets, typically being used in tweets which aren’t serious and generally sarcastic. Not surprisingly, the clapping emoji was the second most popular emoji mimicking the copious ovations the SOTU address receives. Other notable popular emoji are the fire, US flag, the zzzz emoji and skull. The US flag reflects the patriotic themes of the entire night. The fire is generally reflecting praise for Obama’s speech. The skull and zzzz are commenting on spectators in the crowd.

2015 State of the Union Twitter Emoji Use

Two topic-specific emoji counts were interesting. For the most part in all of my tweet collections, the crying-my-eyes-out emoji is exponentially more popular than any other emoji. Understandably, the set of tweets that contained language associated with terrorism had more handclaps, flags, and angry emoji reflecting the serious nature of the subject.

2015 State of the Union Subject Emojis

Then tweets corresponding to the GOP response had a preponderance of pig-related emojis due to Joni Ernst’s campaign ad.


The following hashtag globe graphic is rather large. Please enlarge to see the most popular hashtags associated with the SOTU address. I removed the #SOTU hashtag, because it was use extensively and overshadowed the rest. For those wondering what #TCOT means, it stands for Top Conservatives on Twitter. The #P2 hashtag is its progressive counterpart. [Source]

2015 State of the Union Hashtag Globe


The White House staff won the retweeting war by being the two most retweeted [RT] accounts during the speech last night. This graph represents the total summed RTs over all the tweets they made. Since the White House and the Barack Obama account tweeted constantly during the speech, they accumulated the most retweets. Michael Clifford has the most retweeted single tweet stating he is just about met the President. If you are wondering who Michael Clifford is, you aren’t alone, because I had to look him up. He’s the 19-yo guitarist from 5 Seconds of Summer. The tweet is from August, however, people did retweet it during the day. [I was measuring the max retweet count on the tweets.] Rand Paul was the most retweeted non-President politician, and the Huffington Post had the most for a news outlet.

2015 State of the Union Popular Retweets

The Speech

Obama released his speech online before starting the State of the Union address. I used this for a quick word-count analysis, and it doesn’t contain the off-the-cuff remarks just the script, which he did stick to with few exceptions. The first graph uses the count of single words with ‘new’ being by far the most used word.

2015 State of the Union Address Word Frequency

This graph shows the most used two-word combinations [also known as bi-grams].

2015 State of the Union Address Bigram Frequency

Further Notes

I was hoping this would be the perfect opportunity to test out my sentiment analysis process, and the evaluation results were rather moderate achieving about 50% accuracy on three classes [negative, neutral, positive]. In this case 50% is an improvement over a 33% random guess, but not very encouraging overall. For the sentiment portion in the tweet volume graph, I used the bag-of-words approach that I have used many times before.

A more interesting and informative classifier might look try to classify the tweet into sarcastic/trolling, positive, and angry genres. I had problems classifying some tweets as positive and negative, because there were many news links, which are neutral, and sarcastic comments, which look positive but feel negative. For politics, classifying the political position might be more useful, since a liberal could be mocking Boehner one minute, then praising Obama the next. Having two tweets classified as liberal rather than a negative tweet and a positive tweet is much more informative when aggregating.

2015 Steelers-Ravens Playoffs Hashtag Use

2015 Steelers-Ravens Playoff Twitter Infographics

The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes:

There were a lot of Steelers or Ravens colored emojis, black and gold hearts or buttons and the purple devils. Though for some reasons the ‘crying my eyes out’ emoji is by far the most popular in this collection of tweets. The yellow line represents how many unique tweets there were featuring that emoji. For example, 14 of the same of emoji in one tweet would count for 14 in the blue bar, while it would count for just 1 in the context of the yellow line.

2015 Steelers-Ravens Playoffs Emoji Use

Here’s the hashtag use. The #steelers exceeded the #ravens. This looks cool, but it doesn’t tell you much.

2015 Steelers-Ravens Playoffs Hashtag Use

Here’s a bar chart that’s a lot easier to read if you want the information.

2015 Steelers-Ravens Playoffs Hashtag Bar Chart

emoji header

The Most Popular Emoji Characters on Twitter

On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons, which are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji [] can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.


Changes Meaning:

How I Did This

I collected tweets using a sampled stream from Twitter. In order to get a general representative sample of tweets, I tracked five popular, basic words: ‘the’, ‘and’, ‘to’, ‘you’, and ‘it’. These words are good search words, since there aren’t many sentences or thoughts that don’t use them. A Python script was used to find and count all the the emojis present in a collection of over 100,000 tweets. To avoid skewing due to a popular celebrity or viral tweet, I removed any retweets which were obvious retweets, and not retweets which function more like mentions.


Emoji Use on Twitter

In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. So there isn’t an overwhelming number of tweets which contain an emoji, but 10% of Twitter content is a significant portion. The ‘Emoji Selection’ graph shows the percentage of tweets containing that particular emoji out of the tweets that HAD an emoji in it. The most popular emoji by and far was the ‘tears of joy’ emoji followed by the ‘loudly crying’ emoji . Heart-related emoji [the ones I thought would prove most popular] was third and fourth.

Emoji Selection on Twitter

Since I only collected these over the course of a day and not over several weeks or months, I would be hesitant to think these results would hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular being present in 2.16% of tweets that included emojis; this would be expected to get larger as we get closer to Christmas and smaller after Christmas. Another interesting find is that the emoji ranks high. My pure conjecture is that this emoji’s high use rate is due to protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement or track the use as time passes.

Further analysis could utilize emoji groups or clusters. Emojis with similar meanings would not necessarily produce a high number if people spread their selection over 5 emoji instead of one. I plan to update this and expand on this as time passes and I’m able to collect more data.


In order to avoid any conflicts with ASCII conversions that some Python or R packages do on Twitter data, I stored tweets from the Twitter Streaming API directly into a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as a JSON object, they can be naturally stored in the document-orientated database with each metadata field in the tweet being accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with ‘RT’ in the first two characters of the text entry. This is how Twitter represents automatic retweets in JSON format.

Also since I collected 103,416 tweets the margin of error for any of the proportions given are well below 1%. Events within the social network would definitely outweigh any margin of error.

Emoji, UTF-8, and Python

I have updated [better] code that allows for easy counting of emoji’s in string objects in Python, it can be found on my GitHub. I have a two counting classes in a mini-package loaded there.

Emoji [], those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5 are a different set of characters aside from the traditional alphanumeric and punctuation characters. These are essentially another alphabet, and this concept will be useful when using the emoji in Python. Emoji are NOT a font like Wingdings from Windows95, they are unique characters with no corresponding letter or symbol representation. If you have a document or webpage that has the Wingding font, you can simply change the font to a typical Latin font to see the normal characters the Wingding font represents.

Technical Background

Without getting into the technical encoding problems, emoji are defined in Unicode and UTF-8, which can represent just about a million characters. A lot of applications or software packages default to ASCII, which only encodes the typical 128 characters. Some Python IDEs, csv writing packages, or parsing software default to or translate to ASCII, so they don’t necessarily handle the emoji characters properly.

I wrote a Python script [or this Python ‘package’] that takes tweets that are stored in a MongoDB database (more on that later) and counts the number of different emoji in the tweet corpus. To make sure Python plays nice with the emojis, first I loaded in the data by making sure I had UTF-8 encoding specified otherwise you’ll get this encoding error:

I loaded an emoji key I made using all the emoji’s in Apple’s implementation by loading this code into a Panda’s data frame:

If Python loads you data in correctly with UTF-8 encoding, each emoji will be treated as separate unique character, so string function and regular expressions can be used to find the emoji’s in other strings such as Twitter text. In some IDEs emoji’s don’t display [Canopy] or don’t display well [PyCharm]. I remedied the invisible/messy emoji’s by running the script in Mac OS X’s terminal application, which displays emoji . Python can also produce an ASCII compliant string by using a unicode escape encoding:

The escape encoded string will display something like this:

All IDEs will display the ASCII string. You would need to decode it from the unicode escape to get it back into a unicode object. Ultimately I had a Pandas data frame containing unicode objects. To make sure the correct encoding was used on the output text file, I used the following code:

Emoji Counter Class

I made an emoji counter class in Python to simplify the process of counting and aggregating emoji counts. The code [socialmediaparse] is on my GitHub along with the necessary emoji data file, so it can load the key when the instance is created. Using the package, you can repeatedly call the add_emoji_count() method to change the internal count for each emoji. The results can be retrieved using the .dict, dict_total, and .baskets attributes of the instance. I wrote this because it organizes and simplifies the analysis for any social media or emoji application. Separate emoji dictionary counter objects can be created for different sets of tweets that someone would want to analyze.


MongoDB was used for this project because the data stores the JSON files very well, not needing a parser or a csv writer. It also has the advantage of natively storing strings in UTF-8. If I used R’s StreamR csv parser, there would be many encoding errors and virtually no emoji’s present in the data. There might be possible work arounds, but MongoDB was the easiest way I’ve found to work with Twitter JSON, UTF-8 encoded data.

AL Wild Card Game Twitter Map

2014 ALWCG Twitter Graphs

The Royals and A’s had quite the entertaining 12-inning game Tuesday night. These are a few graphs I made from Twitter data. Yellow is Oakland; blue is Kansas City. The proportions of tweets between teams might be off, but I would venture to guess the Royals had much more social media activity than the A’s. The map shows geotagged tweets from 5PM to 1AM EDT from yesterday. The middle of the country was solid blue, California was pretty yellow, and the East Coast was rather mixed.

AL Wild Card Game Twitter Map

The volume of tweets per minute is a pretty cool view of what happened during the game. It looks like the Royals really outpaced the A’s for volume, but I’d have to use some controls to determine that for sure. These are just for fun.

AL Wild Card Game Twitter Time Series

I used Twitter’s streaming API to collect tweets with keywords like “Royals”, “A’s”, “TakeTheCrown”, “GreenCollar”, etc. I could have missed a crucial element of discussion, and none of this takes into account sentiment just frequency of mention in a tweet.

Twitter Retweet Analysis

Twitter Retweet Decay

This uses the same data set I obtained from my NU Data Mining final project [summary].

Recently, @MLBcathedrals tweeted a photo I submitted to them:

I got a bunch of retweet/favorite notifications at first then I got fewer as the day went on. Now a month later, I’ll get a favorite or retweet [RT] notification every so often. The process of getting a retweet follows a Poisson process, where there is a discrete and somewhat small outcome that can be thought of as count data — you can count retweets per minute.

I used the tweets I just had lying around from my project and pulled out several collections of native RTs that had their first RT in the data set and a high volume of retweets. This was to ensure I had the first part of the tweet’s life and not just had captured it in the middle. Time is measured in seconds from the first retweet event. This simplifies things by giving each collection of a RTed tweets a time base relative to when RTing started.

The common time base enabled me to make this comparison chart of different RTed tweets:

Twitter Retweet Analysis

Not every RT pattern is the same. Some have many more RTs, some take a little while to get momentum, but generally they start off strong, then slowly die out taking the shape of a logarithmic function. The total number of RTs over time is interesting, but this problem works better if we look at the rate of RTing. The reason why the RT pattern flattens out is that there is steadily decreasing RTing rate over time. This makes intuitive sense if you have ever used Twitter, people react to things as they happen then rarely go back to it.

It turns out you can mathematically model the RT rate with a Poisson generalized linear model reasonably well. The following three graphs show the actual RT rate data points as red dots, the expected value regression as the black line, and a probability range as blue bands.

Twitter Retweets Per Minute

The model for this particular RTed tweet is described by the equation:

ln(E[Y|t])  = 2.980154 - 0.0017236*t

Y is the number of RT per minute. t is time. And E[Y|t] is read as the expected value of Y given t, or what is the most likely number of RTs per minute at a given time. The constant [2.980154] represents the rate at t=1, and the negative regression coefficient [-0.0017236] indicates that rate will decline. This regression line represents the expected value, which is essentially an average of possible outcomes. Using the Poisson distribution and the expected value, I constructed a probability distribution showing a band where 50% of all data points should be located, and another band that should encompass 90% of them.

The bands, in my opinion, are more important than the regression line, because we are dealing with count data. So having an expected value of say 2.5342 doesn’t mean much if you don’t know the probability for getting a value of 0, 1, 2, 3, 4, etc. For this reason the last graph in the series of three has only the actual data points in the probable area bands. For each minute, the data point has a 50% chance to be in the dark blue region, a 40% chance to be in the light blue region [90%-50%], and a 10% chance to be in the white.

This is all predicated on RTing being a random process with a fixed audience. This described most of the RTs I looked at fairly well, but there will always be other factors such as viral growth and time of day. Viral growth means it starts off slow then grows large. If this were to happen, it would not follow this pattern; it would look more like an S. For better or worse, most RTs come from accounts with a large number of followers, so they aren’t actually viral, they are propagations of already popular tweets.

This specific regression by itself won’t predict how many RTs a tweet might get before it’s tweeted, but it describes what happens after people have begun retweeting.

Average Sentiment Per Tweet -- Corporations

Twitter Sentiment Analysis

This is a summary of a final project I did from my Introduction to Data Mining class at NU. The goal of the project was to find a business need and execute a data mining process. The general process I used is outlined here and the sentiment lexicon is found here. The lexicon is from a paper: Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge.Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

My experiences using social media, the business-centric focus of my grad classes, and my love for burritos inspired me to look into Twitter sentiment analysis. [I also needed to research something that wasn’t baseball.] Imagine every time you’ve misinterpreted a text message from your friend. Or every time an irate Twitter follower takes a sarcastic tweet seriously. That’s how hard it is for normal people to correctly interpret sentiment of written communication. So now picture trying to get a computer to do the same thing. Not easy. But at the very least we can find a way to categorize more tweets correctly than we misidentify.

On August 25th & 26th, I scraped tweets containing Twitter handles of some companies:Company Twitter Handles

I choose these based on my personal preferences or companies that I thought might have strong sentiment. The tweets were scraped using R and the package streamR. [These are pretty easy to use, if anyone wants to start doing any Twitter research, I’d start here.] The tweets are saved as JSON files, which are a mess, but human-readable. R parses the tweets into a data frame then that can go into a SQL database if so desired.

The way I determined a tweet’s sentiment was by matching words in the tweet to a predetermined sentiment lexicon. This process is outlined in this presentation, if you are curious about how to execute it. The algorithm is simple enough to write with one FOR loop. The toughest problem I had was dealing with the tweet’s data object. The scoring system was simple, each word from a tweet that matched to the lexicon list will count as a +1 [positive] or -1 [negative]. The words are added together to give a sentiment score. The sentiment score can be interpreted like algebra [0 is neutral, greater than 0 is positive, and less than 0 is negative].

I grouped the results by company:

Absolute Twitter Sentiment Corporations

The bar graph sums the sentiment scores of every tweet I captured. So for example, two negative words in one tweet will drop the company’s total score 2 points. No surprises here that Comcast is in last place. I’ve never heard anyone say anything nice about Comcast, and apparently Twitter is not much kinder. News about the Burger King-Tim Horton’s merger broke while I was capturing tweets, so that accounts for the great discontent about Burger King. I’m not quite sure what BK’s baseline should be since this is the only data I have on hand. Verizon has very positive scores. I am a little skeptical of this, because Verizon (and Apple) had a lot tweets that were ad-based. While it’s great to know who is advertising your product, it’s not the goal of this particular project. Ads and news stories gets retweeted and regurgitated a lot, but the tweets from real customers don’t.

One way to account for large volume of tweets is to look at average sentiment per tweet:

Average Sentiment Per Tweet -- Corporations

This graph takes the total sentiment score and divides it by the total number of tweets mentioning that company. This will give more weight to a company’s Twitter-customer base that’s strongly opinionated one way or another. To no one’s surprise, Comcast is dead last again. But to my delight, Chipotle ranks first! [I love burritos…and apparently a lot of Twitter users do too.] Chipotle is a good example of a company that does not have the volume that Comcast or Verizon has, but their users feel strongly about their product and tweet positively about it.

Luckily, there is a bunch of metadata available with the tweets including my favorite variable, a timestamp. First, here’s a time series baseline for August 25, 2014:

Total Tweets Per Hour -- 8/25/14

The volume of tweets increased through out the day, and peaked around lunch in EDT. Let’s look at Chipotle’s time series graph broken up by sentiment classification:

Chipotle Sentiment Per Hour 8/25/14

Neutral and positive tweets peaked during the 1PM EDT hour, with no corresponding spike in negative tweets. This is very encouraging for Chipotle since you would expect the negative tweets to follow the same pattern, and they don’t. They actually rise later in the day. Further research would have to be done to determine if this trend is real and what the source of it is. It could be a time zone delayed problem, general staffing/production issues in non-lunch hours, selection bias of people who go to a later lunch, or a random fluctuation that happened that day.

Comcast also has some interesting patterns:

Comcast Sentiment Per Hour -- 8/25/2014

There are two spikes in neutral tweet volume early in the day. I think these are results of mentions in a news related tweet that was retweeted a lot. The large spike at 2PM EDT is probably also caused by retweets as well. However, during the early afternoon, there are distinctive negative customer tweets accounting for the surge in negative sentiment after lunch. My conjecture for this surge would be an increase of people dealing with Comcast’s customer service. It would be interesting to see if call center data matched up.

This is just the most basic implementation of sentiment analysis. There are more advanced machine learning techniques that can weight words differently and looks at consecutive word groups (n-grams) in addition to individual words. The advantage of this can be seen easily. The phrase ‘not good’ is negative, but only looking at the constituent words it would be scored neutral. There is a lot of other processes I can try to get more accurate results, but unfortunately, not in this post.