Collecting geo-tagged tweets using the Twitter Streaming API and a database
One research project I’m working on uses Twitter data to predict crime patterns. So, the first thing I need to do is to collect Twitter data. Specifically, since I’m interested in discovering the spatial patterns of crime, only geo-tagged tweets are collected. Depending on the purpose of your own project, you might use the official Twitter REST API if you want to search for specific sets of tweets, or the official Twitter Streaming API if you want to collect tweets in real time. The Streaming API is quite different from the REST API: the REST API is used to pull data from Twitter, while the Streaming API pushes messages to a persistent session. In this blog post I’m going to discuss how to collect Twitter messages using the Twitter Streaming API. In the next post, I’ll talk about using the Twitter REST API to collect tweets.
Tweepy is a Python package which enables users to more easily work with the official Twitter API. It’s sort of like a Python wrapper that bridges the communication between your own program and the Twitter API. Let’s go straight to the code snippets.
The first thing we need to do is to register the client application with Twitter. Log in to Twitter Apps with your Twitter account and create a new application. Once you are done you should have your consumer key, consumer secret, access token and access token secret. Now, we import the packages and define the keys and access tokens.
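A minimal setup might look like the following. The credential strings are placeholders, not real keys, and the tweepy import is guarded so the snippet still loads if the package isn’t installed:

```python
import json  # standard library, used later to decode tweet payloads

# tweepy is a third-party package: pip install tweepy
try:
    import tweepy
except ImportError:
    tweepy = None  # snippet still loads without it

# Placeholder credentials -- replace with the values shown on your own
# Twitter app page (these names are illustrative, not real keys).
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```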
Next, we create a MyStreamListener class. This class will later be used to create a tweepy.Stream object and connect to the Twitter Streaming API. We define the on_connect(), on_data(), and on_error() methods. The parent tweepy.StreamListener class has already defined these methods; we override the defaults to add our own logic.
on_connect() will be invoked once a successful response is received from the server. When the connection is established and raw data is received, the method on_data() will be called. An if condition ensures that only tweets carrying coordinates information are kept. The received tweet object is in JSON format, so we use the json.loads() method to decode the JSON object into a Python object. The collected tweet object has a long list of attributes; we are only interested in some of them, and we print them to the terminal screen.
on_error() will be called when a non-200 status code is returned. HTTP status codes are issued by a server in response to a client’s request; a successful request returns status code 200. A special case when using the Twitter API is rate limiting: Twitter limits the number of requests a user can make within a given time window, and the API sends a 420 status code if we are being rate limited.
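Putting the three callbacks together, a minimal listener might look like this. This is a sketch assuming the tweepy 3.x-era API (which still had tweepy.StreamListener); the field names follow the classic v1.1 tweet object, and the helper name extract_geo_fields is my own:

```python
import json


def extract_geo_fields(raw_data):
    """Decode a raw tweet JSON string and return a few fields of interest
    if the tweet carries point coordinates, otherwise None.
    Field names follow the classic Twitter v1.1 tweet object."""
    tweet = json.loads(raw_data)
    if tweet.get("coordinates") is None:
        return None
    # GeoJSON order is [longitude, latitude]
    lon, lat = tweet["coordinates"]["coordinates"]
    return {
        "id": tweet.get("id_str"),
        "user": tweet.get("user", {}).get("screen_name"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "longitude": lon,
        "latitude": lat,
    }


try:
    import tweepy  # third-party: pip install tweepy (3.x-era API assumed)

    class MyStreamListener(tweepy.StreamListener):
        def on_connect(self):
            print("Connected to the Twitter Streaming API.")

        def on_data(self, raw_data):
            fields = extract_geo_fields(raw_data)
            if fields is not None:   # keep only geo-tagged tweets
                print(fields)
            return True              # keep the stream alive

        def on_error(self, status_code):
            if status_code == 420:   # we are being rate limited
                return False         # disconnect the stream
            return True
except (ImportError, AttributeError):
    pass  # tweepy missing, or a newer 4.x API without StreamListener
```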
Normally, we don’t just want to print out the collected tweets on the terminal screen. We also want to store them for later analysis. Of course, you can choose to store all the collected tweets into a single file. But a more efficient and appropriate choice is to store them into a database.
Let’s first look at how to store collected tweets into MySQL. We need a SQL connector to connect to a MySQL database from Python. I use the MySQLdb package, but you are free to use an alternative. The first thing we need to do is to install MySQL itself. Check this post I wrote before about how to install and set up MySQL on Mac. Then we need to install the MySQLdb package.
Before we import MySQLdb in our Python program, we should create a database and a table. First, add a database called twitter:
mysql> CREATE DATABASE twitter;
mysql> USE twitter;
Then, create a table called twitter_stream_collect, which we will use to store the data.
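A table definition along these lines would work; the exact column set is an assumption on my part, matching the tweet attributes discussed above:

```sql
mysql> CREATE TABLE twitter_stream_collect (
    ->     id BIGINT PRIMARY KEY,     -- tweet id
    ->     username VARCHAR(64),      -- user.screen_name
    ->     created_at VARCHAR(64),
    ->     text VARCHAR(512),
    ->     longitude DOUBLE,
    ->     latitude DOUBLE
    -> );
```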
Now, we define a method that creates a connection to the MySQL database and inserts the collected tweets into the table.
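A sketch of such a method is below. The connection parameters (host, user, password) and the column names are placeholders you would adapt to your own setup:

```python
def mysql_store(tweet_id, username, created_at, text, longitude, latitude):
    """Insert one geo-tagged tweet into the twitter_stream_collect table.
    Connection parameters are placeholders for your own setup."""
    import MySQLdb  # third-party: pip install mysqlclient

    conn = MySQLdb.connect(host="localhost", user="root",
                           passwd="YOUR_PASSWORD", db="twitter",
                           charset="utf8mb4")
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO twitter_stream_collect "
            "(id, username, created_at, text, longitude, latitude) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (tweet_id, username, created_at, text, longitude, latitude),
        )
        conn.commit()  # make the insert permanent
    finally:
        conn.close()
```

Opening and closing a connection per tweet keeps the example simple; for a long-running stream you would normally reuse one connection.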
Another database option is MongoDB. Unlike MySQL, MongoDB is a NoSQL database: it stores data in flexible, JSON-like documents, and you don’t have to define a schema before using the database.
We define a method to create a database, connect to it, and store data into it:
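A minimal version using pymongo might look like this; the database and collection names ('twitter', 'stream') are illustrative:

```python
def mongodb_store(tweet):
    """Insert one decoded tweet (a Python dict) into a MongoDB collection.
    MongoDB creates the database and collection lazily on first insert,
    so no schema needs to be defined up front."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    client = MongoClient("localhost", 27017)
    db = client["twitter"]            # illustrative database name
    db["stream"].insert_one(tweet)    # illustrative collection name
    client.close()
```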
Both storage methods, the MySQL one and mongodb_store(), are invoked inside on_data(). Check my Git repository for the full code.
The final step is to authenticate with our keys and access tokens, instantiate the MyStreamListener class, connect to the Twitter Streaming API, and filter the collected tweets with location-based filtering criteria.
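These final steps can be sketched as follows, again assuming the tweepy 3.x-era API. The credentials and the bounding box are placeholders, and a minimal stand-in listener is included so the snippet is self-contained; the RUN_STREAM flag keeps the example from opening a live connection when it merely loads:

```python
RUN_STREAM = False  # flip to True to actually open the connection

try:
    import tweepy  # third-party: pip install tweepy (3.x-era API assumed)

    class MyStreamListener(tweepy.StreamListener):
        """Minimal stand-in for the listener built earlier in the post."""

        def on_data(self, raw_data):
            print(raw_data)                # real logic stores the tweet
            return True

        def on_error(self, status_code):
            return status_code != 420      # disconnect if rate limited

    # Placeholder credentials from your Twitter app page.
    auth = tweepy.OAuthHandler("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET")
    auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")

    stream = tweepy.Stream(auth=auth, listener=MyStreamListener())

    if RUN_STREAM:
        # Bounding box as [west_lon, south_lat, east_lon, north_lat];
        # this one roughly covers the continental United States.
        stream.filter(locations=[-125.0, 24.0, -66.0, 50.0])
except (ImportError, AttributeError):
    pass  # tweepy missing, or a newer 4.x API without StreamListener
```

Note that filter() blocks and keeps the session open, which is exactly the persistent-connection behavior of the Streaming API described above.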
Now, you can collect Twitter streaming data in real time and store it in a database.