An NLP Approach to Analyzing Twitter, Trump, and Profanity
Who swears more? Do Twitter users who mention Donald Trump swear more than those who mention Hillary Clinton? Let’s find out by taking a natural language processing (NLP) approach to analyzing tweets.
By Stephanie Kim, Algorithmia.
This walkthrough provides a basic introduction to help developers of all backgrounds and abilities get started with the NLP microservices available on Algorithmia. We’ll show you how to chain them together to perform light analysis on unstructured text. Unfamiliar with NLP? Our gentle introduction to NLP will help you get started.
We know that getting started with a new platform or developer tool is an investment in time and energy. Sometimes it can be hard to find the information you need in order to start exploring on your own. That’s why we’ve centralized all our information in the Algorithmia Developer Center and API Docs, where users will find helpful hints, code snippets, and getting started guides. These guides are designed to help developers integrate algorithms into applications and projects, learn how to host their trained machine learning models, or build their own algorithms for others to use via an API endpoint.
Now, let’s tackle a project using some algorithms to retrieve content, and analyze it using NLP. What better place to start than Twitter, and analyzing our favorite presidential candidates?
Twitter, Trump, and Profanity: An NLP Approach
First, let’s find the Twitter-related algorithms on Algorithmia. Go to the search bar at the top of the navigation and type in “Twitter”.
You’ll get quite a few results. Find the one called Retrieve Tweets with Keyword and check out its algorithm page, which lists information such as the algorithm’s description, pricing, and permissions.
The algorithm description provides information about the input and output data structures expected, as well as the details regarding any other requirements. For instance, Retrieve Tweets with Keyword requires your Twitter API authentication keys.
At the bottom of every algorithm page, we provide code samples for your input and output, and for calling the algorithm in Python, Rust, Ruby, JavaScript, NodeJS, cURL, CLI, Java, or Scala. If you have questions about the details of using the Algorithmia API, check out the API docs.
Alright, let’s get started!
Here’s the overall structure of our project:
+-- profanity_demo
|   +-- data
|       +-- Donald-Trump-OR-Trump.csv
|       +-- Hillary-Clinton-OR-Hillary.csv
|   +-- logs
|       +-- twitter_data_pull.log
|   +-- profanity_analysis.py
|   +-- twitter_pull_data.py
You’ll need a free Algorithmia account to complete this project. Sign up for free and receive an extra 10,000 credits. Overall, the project will process around 700 tweets, with emoticons and other special characters stripped out. This means that a tweet containing only URLs and emoticons won’t be analyzed. Once we pull our data from the Twitter API, we’ll clean it up with some regex, remove stop words, and then find our swear words.
Step One: Retrieve Tweets by Keyword
We’ll use the Retrieve Tweets with Keyword algorithm first in order to query tweets from the Twitter Search API:
Okay, let’s go over the obvious parts of the code snippet. This algorithm takes a nested dictionary called ‘input’ that contains the keys ‘query’, ‘numTweets’, and ‘auth’, which is itself a dictionary. The key ‘query’ is set to a global variable called q_input, which holds the system argument passed when executing the script. In our case it will hold a presidential nominee’s name. The key ‘numTweets’ is set to the number of tweets you want to extract, and the dictionary ‘auth’ holds the Twitter authentication keys and tokens that you got from Twitter.
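The input described above can be sketched roughly as follows. This is a minimal sketch rather than the original snippet: the field names inside ‘auth’ and the placeholder credential strings are assumptions, so check the algorithm page for the exact names it expects. The default of 700 simply echoes the figure from the project overview, and the helper name build_input is hypothetical.

```python
import sys

# The search term comes from the command line, for example:
#   python twitter_pull_data.py "Donald Trump OR Trump"
# The fallback value exists only so this sketch runs without arguments.
q_input = sys.argv[1] if len(sys.argv) > 1 else "Donald Trump OR Trump"


def build_input(query, num_tweets=700):
    """Return the nested dictionary that the algorithm takes as input."""
    return {
        "query": query,
        "numTweets": num_tweets,
        # Twitter credentials. These field names are assumed for
        # illustration; confirm them on the algorithm page.
        "auth": {
            "app_key": "your_consumer_key",
            "app_secret": "your_consumer_secret",
            "oauth_token": "your_access_token",
            "oauth_token_secret": "your_access_token_secret",
        },
    }


# The walkthrough calls this dictionary 'input'; it is renamed here to
# avoid shadowing Python's built-in input() function.
tweet_input = build_input(q_input)
```

Keeping the credentials in a nested ‘auth’ dictionary separates what you are searching for from who is asking, which makes it easy to reuse the same keys for both candidate queries.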
As you write the pull_tweets() function, pay attention to the line that sets the variable ‘client’ to ‘Algorithmia.client(algorithmia_api_key)’. This is where you pass in the API key you were assigned when you signed up for an Algorithmia account. If you don’t recall where to find it, look in the Credentials section of the My Profile page.
Next, notice the variable ‘algo’. This is where we pass in the path to the algorithm we’re using. Each algorithm’s documentation gives you the appropriate path in the code examples section at the bottom of the algorithm page.
And last, the list comprehension assigned to ‘tweet_list’ builds our data by looping over the result of the algorithm, which we get by passing our input variable to algo.pipe(input) and reading its .result.
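Putting the client, the algorithm path, and the list comprehension together, pull_tweets() might look something like the sketch below. Several things here are assumptions: the algorithm path is a placeholder (copy the real one, with its version, from the algorithm page), the three fields kept from each tweet are chosen for illustration, and the function takes the client and input as parameters so it can be tried without network access. The helper make_client is hypothetical and needs the Algorithmia Python package installed.

```python
def make_client(api_key):
    """Create an Algorithmia client from your API key."""
    # Imported lazily so the rest of this sketch runs without the package.
    import Algorithmia
    return Algorithmia.client(api_key)


# Placeholder path: copy the real one, including its version, from the
# code examples at the bottom of the algorithm page.
ALGO_PATH = "twitter/RetrieveTweetsWithKeyword"


def pull_tweets(client, tweet_input, algo_path=ALGO_PATH):
    """Call the algorithm and keep a few fields from each returned tweet."""
    algo = client.algo(algo_path)
    # Assumes each record is a dict shaped like a Twitter Search API
    # status, with a nested 'user' dict, a 'retweet_count', and a 'text'.
    tweet_list = [
        {
            "user_id": record["user"]["id"],
            "retweet_count": record["retweet_count"],
            "text": record["text"],
        }
        for record in algo.pipe(tweet_input).result
    ]
    return tweet_list
```

With a real key, the call would be along the lines of pull_tweets(make_client("your_algorithmia_api_key"), your_input_dict).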
Now, you simply write your data to a CSV file that is named after your query. Note: if your query is a space-separated string, then the script will join the words of the query with dashes.
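One way to do the file naming and writing is sketched below using Python’s standard csv module. The function names, the ‘data’ directory default, and the column names (user_id, retweet_count, text) are assumptions for illustration; use whatever fields your tweet dictionaries actually contain.

```python
import csv
import os


def query_to_filename(query):
    """Join the words of a space-separated query with dashes."""
    return "-".join(query.split()) + ".csv"


def write_tweets(tweet_list, query, data_dir="data"):
    """Write one row per tweet to <data_dir>/<query-with-dashes>.csv."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, query_to_filename(query))
    # Assumed column names; each tweet dict must have exactly these keys.
    fieldnames = ["user_id", "retweet_count", "text"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(tweet_list)
    return path
```

For example, query_to_filename("Donald Trump OR Trump") gives "Donald-Trump-OR-Trump.csv", which matches the file name shown in the project structure.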