Harvesting Facebook Posts and Comments with Python: Part 1

In our previous examples we learned how to scrape data from public organization/company pages with no authentication. This limited the data we could retrieve from Facebook pages to high-level page information: likes, talking-about counts, etc.

In this series of tutorials we will build on our previous example of scraping simple public data and learn how to pull richer post and comment data. In Part 1 we will learn how to extract post data and print it to our console.

Prerequisites

  • Completion of Simple Python Facebook Scraper Part 1 and 2
  • Python 2.7
  • MySQL 5.6

1. Become a Facebook App Developer

Before we start coding, you need to become a Facebook application developer. Navigate to the link here, and follow the instructions to sign up your Facebook account as an App Developer.

Once you have become a Facebook developer, create a new application. In the upper ribbon select Apps -> Create a New App and fill out the form. All you need to do is enter your app name, choose a category, and press Create App. I am going to call mine Simple Data Pull, but feel free to call it whatever you want.

Create Facebook App


Now that we have created our Facebook App, we will once again select the Apps drop down and select Simple Data Pull, or whatever you named it.

On this view we are interested in grabbing two pieces of information: the App ID and the App Secret. We will use them below to authenticate our Graph API calls. Do not share your App ID and App Secret with others; treat them like your username and password.

App ID and App Secret


2. Simple Authentication

In our Simple Python Facebook Scraper tutorial, we established a graph URL that we manipulated and opened to receive JSON responses containing data about likes, talking-about counts, etc. To collect JSON objects of posts, we have to change that URL so that it is secure and able to pull post data. So let’s first go to our graph_url variable and change the assigned value to “https://graph.facebook.com/”. Though it may seem like a small change, switching to https is required, because post data has to be pulled over SSL.

Don’t worry, the original Graph API calls we used to collect public information will still work.

Now we will create a small function that turns our graph_url into a URL that will pull post data. Above our main function, let’s create a function called create_post_url. Go ahead and copy the following code.
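A minimal sketch of what create_post_url could look like, reconstructed from the description in this section. The "/posts/?key=value&access_token=" query string follows the URL quoted in the comments on this post; the walmart page URL and the placeholder credentials in the example call are assumptions for illustration only.

```python
def create_post_url(graph_url, APP_ID, APP_SECRET):
    # Extra arguments appended to the page URL: the posts endpoint plus an
    # access token built as APP_ID|APP_SECRET (format taken from the URL
    # quoted in the comments on this post).
    post_args = "/posts/?key=value&access_token=" + APP_ID + "|" + APP_SECRET
    # Combine the page URL with the post arguments.
    post_url = graph_url + post_args
    return post_url


# Hypothetical placeholder credentials; substitute your own and keep them private.
example_url = create_post_url("https://graph.facebook.com/walmart", "APP_ID", "APP_SECRET")
print(example_url)
```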

Let’s go through this function line by line. As you can see at the top of the function, we pass in graph_url, which we just changed, along with the APP_ID and APP_SECRET we received from Facebook when we created our app. The post_args line builds the addition our original graph call needs in order to pull post data. Adding the APP_ID and APP_SECRET lets Facebook know we are allowed to retrieve post information.

The next line simply combines the graph_url with the post_args from the previous line and creates the URL with the credentials needed to access the data. Finally, our function returns our post data URL.

3. Harvesting Post Data

Okay, now that we have created the function that will pass us our secure URL for post data, let’s actually put this to use and start collecting some posts. Before we have our script start pulling data, let’s view what we are about to collect in our web browser. Let’s build the URL for the first iteration of the for loop. Simply paste your APP_ID and APP_SECRET into the URL below.

https://graph.facebook.com/walmart/posts/?key=value&access_token=APP_ID|APP_SECRET

Navigate to the page and you should see the following in your browser:

Walmart Post Data


Similar to when we looked at the JSON object in our browser in Simple Python Facebook Scraper Part 1, our results look like a mess of information… If you look closely at the data that is displayed in our browser, you will see that all this mess of information is actually nested under the “data” key in our JSON object. In fact, you can see that there are other nested values within such as “from” and “properties”.

We will get into more detail on how to extract some of these pieces of data in subsequent tutorials. In Part 1 of Harvesting Facebook Posts and Comments with Python, we are going to start simple.

Okay, let’s copy the snippet of code below into our main function, below where we print our page_data to the console.
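A sketch of the snippet that the following paragraphs walk through, assuming the post URL comes from create_post_url. It is wrapped in a hypothetical helper, fetch_fbposts, only so that it can run on its own; in the tutorial these lines sit directly in main. The import fallback is also an assumption, added so the same code runs on Python 2.7 (urllib2) and Python 3.

```python
import json

try:
    from urllib2 import urlopen  # Python 2.7, the version this series targets
except ImportError:
    from urllib.request import urlopen  # Python 3 fallback (assumption)


def fetch_fbposts(post_url):
    # Open the URL and receive the response.
    web_response = urlopen(post_url)
    # Read the response body and decode it into text.
    readable_page = web_response.read().decode("utf-8")
    # Load the text into a JSON object (a Python dict).
    json_postdata = json.loads(readable_page)
    # Pull the list of posts out from under the "data" key.
    json_fbposts = json_postdata["data"]
    return json_fbposts
```

Printing the value returned by fetch_fbposts(post_url) for a valid post URL should show the jumbled list of post data described in this section.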

Let’s go through this snippet of code line by line. Adopting a similar methodology as we did with Simple Python Facebook Scraper: Part 1, we open the URL, and receive the response. Then we use the read method to convert this response to a readable page.

Next, we load the readable page into a JSON object called json_postdata. The contents of this variable represent all of the key-value pairs that we saw above when we opened the page in our browser.

To make our manipulation of this data a little more palatable, we set json_fbposts equal to the data within the “data” key, so that all of the data we saw before is no longer nested under “data”. If you are eager to check that it works, go ahead and save and run the script. You should see a huge jumbled glob of data.

Before We Move On, Let’s Simplify Our Code!

Since we have already repeated this procedure of opening a URL and then converting the response several times in our code, let’s create a function that we can call so that we no longer have to type all of these lines out. Above our main function let’s create a function called render_to_json.

Now let’s replace the four lines of code in our main function that open the URL and convert the response with a call to render_to_json. Our code should now look something like the below. I know, a small gain, but as we progress through this set of tutorials it will save us considerable lines of code… and our own sanity! If you want brownie points, use our new function in place of the four lines of code that render the page-level response to a JSON object as well.
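One way render_to_json might be written, based on the open, read, and load steps described earlier. The variable names inside the function and the Python 3 import fallback are assumptions; the commented lines at the end sketch how the call could replace the longer sequence in main.

```python
import json

try:
    from urllib2 import urlopen  # Python 2.7, the version this series targets
except ImportError:
    from urllib.request import urlopen  # Python 3 fallback (assumption)


def render_to_json(graph_url):
    # Open the URL, read the response, and parse it as JSON in one place.
    web_response = urlopen(graph_url)
    readable_page = web_response.read().decode("utf-8")
    json_data = json.loads(readable_page)
    return json_data


# Inside main, the open/read/load lines then collapse to something like:
#     json_postdata = render_to_json(post_url)
#     json_fbposts = json_postdata["data"]
```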

Okay, now that we have our glob of data successfully printing to the console, let’s pick out some data we are actually interested in. First, let’s comment out the print json_fbposts line. Next, copy the code and paste it into your script, below where we were printing our blob of post data.

Going through our block of code line by line, we first loop through every post on the JSON page. Then, for each post, we try to open up the object and read its values for “id” and “message”. If it fails to print this information, it simply prints a very general error message.
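A sketch of the loop, wrapped in a hypothetical print_posts helper so it can run without calling the Graph API; in the tutorial the loop sits directly in main. The sample posts and the wording of the error message are made up for illustration.

```python
def print_posts(json_fbposts):
    # Loop through every post in the list pulled from the "data" key.
    for post in json_fbposts:
        try:
            # Try to read and print the post's id and message.
            print(post["id"])
            print(post["message"])
        except Exception:
            # Very general error message; the script keeps going.
            print("Error: could not print this post")


# Made-up stand-in for json_fbposts; the second post has no "message" key.
sample_posts = [
    {"id": "1_1", "message": "First post"},
    {"id": "1_2"},
]
print_posts(sample_posts)
```

Note that the second sample post still prints its id before the missing “message” key triggers the error message, and the loop then carries on.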

We have included this try/except because there are several errors that you can receive while pulling data from the Facebook Graph API. If we hit an unforeseen error, we want our script to keep going.

5. What our Code Should Look Like Thus Far

At the close of Part 1 of Harvesting Facebook Posts and Comments with Python, our code should look something like the following:
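The sketch below pulls together only the pieces introduced in this part; the page-level scraping and database code from the earlier tutorials is left out. The list_companies variable, the placeholder credentials, and the Python 3 import fallback are assumptions.

```python
import json

try:
    from urllib2 import urlopen  # Python 2.7, the version this series targets
except ImportError:
    from urllib.request import urlopen  # Python 3 fallback (assumption)


def create_post_url(graph_url, APP_ID, APP_SECRET):
    # Append the posts endpoint and the APP_ID|APP_SECRET access token.
    post_args = "/posts/?key=value&access_token=" + APP_ID + "|" + APP_SECRET
    post_url = graph_url + post_args
    return post_url


def render_to_json(graph_url):
    # Open the URL, read the response, and load it into a JSON object.
    web_response = urlopen(graph_url)
    readable_page = web_response.read().decode("utf-8")
    json_data = json.loads(readable_page)
    return json_data


def main():
    # Hypothetical placeholders; paste in the values from your own app.
    APP_ID = "YOUR_APP_ID"
    APP_SECRET = "YOUR_APP_SECRET"
    graph_url = "https://graph.facebook.com/"
    list_companies = ["walmart"]  # assumed name for the list of pages

    for company in list_companies:
        current_page = graph_url + company
        post_url = create_post_url(current_page, APP_ID, APP_SECRET)
        json_postdata = render_to_json(post_url)
        json_fbposts = json_postdata["data"]
        # print(json_fbposts)  # the raw glob of post data, now commented out

        for post in json_fbposts:
            try:
                print(post["id"])
                print(post["message"])
            except Exception:
                print("Error: could not print this post")
```

Once real credentials are filled in, calling main() runs the harvest for each page in the list.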

Summary and Next Steps:

In this tutorial we built upon our Simple Python Facebook Scraper, and learned how to scrape public post data from Facebook. In our next post in this series, we will learn how to navigate deeper into the post data to collect data points like Likes and Shares, and make updates to our database for this new data.

3 thoughts on “Harvesting Facebook Posts and Comments with Python: Part 1”

  1. Hi Scott. Great article! I try to talk myself round to write a Python-based social media/web mining blog for a while – you just simply did it. =) My two comments: 1) (a small typo) swapped APP_SECRET & APP_ID in this URL: “https://graph.facebook.com/walmart/posts/?key=value&access_token=APP_SECRET|APP_ID” should be “https://graph.facebook.com/walmart/posts/?key=value&access_token=APP_ID|APP_SECRET”; 2) (a kind of discussion-opening comment =)) have you considered MonoDB for your purposes? Best, Jakub

    • scott@simplebeautifuldata.com says:

      Hey Jakub, good catch! I have updated the line :). Also, yeah I think this would be a great use case for mongodb. I have tinkered around with it before, and it would probably be a better option than using relational db for this type of data. These first sets of posts were inspired by an ask to show in a very simple way how to scrape and store data from Facebook. Perhaps I can do some follow up after I am done with this series on incorporating our scraper with MongoDB. Thanks so much for reading!
