Wednesday, December 14, 2016

Building Your First Web Scraper, Part 1

Rubyland has two gems that have occupied the web scraping spotlight for the past few years: Nokogiri and Mechanize. We'll spend an article on each of them before putting both into action with a practical example.

Topics

  • Web Scraping?
  • Permission
  • The Problem
  • Nokogiri
  • Extraction?
  • Pages
  • API
  • Node Navigation

Web Scraping?

There are fancier terms around than web or screen scraping. Web harvesting and web data extraction pretty much tell you right away what’s going on: we can automate the extraction of data from web pages—and it’s not all that complicated, either.

In a way, these tools allow you to imitate and automate human web browsing. You write a program that only extracts the sort of data that is of interest to you. Targeting specific data is almost as easy as using CSS selectors.

A few years ago I subscribed to some online video course that had like a million short videos but no option to download them in bulk. I had to go through every link on my own and do the dreaded ‘save as’ myself. It was sort of human web scraping—something that we often need to do when we lack the knowledge to automate that kind of stuff. The course itself was alright, but I didn’t use their services anymore after that. It was just too tedious. 

Today, I wouldn’t care too much about such mind-melting UX. A scraper that would do the downloading for me would take me only a couple of minutes to throw together. No biggie!

Let me break it down real quick before we start. The whole thing can be condensed into a couple of steps. First, we fetch a web page that has the data we want. Then we search through that page and identify the information we want to extract.

The final step is to target these bits, slice them if necessary, and decide how and where you want to store them. Well-written HTML is often key to making this process easy and enjoyable. For more involved extractions, it can be a pain if you have to deal with poorly structured markup.

What about APIs? Very good question. If you have access to a service with an API, there is often little need to write your own scraper. Scraping is mostly for websites that don’t offer that sort of convenience; without an API, it is often the only way to automate the extraction of information from a site.

You might ask, how does this scraping thing actually work? Without jumping into the deep end, the short answer is: by traversing tree data structures. Nokogiri builds these data structures from the documents you feed it and lets you target bits of interest for extraction. CSS selectors, for example, form a small language for tree traversal, for searching tree data structures, and we can make use of them for data extraction.

There are many approaches and solutions out there to play with. Rubyland has two gems that have occupied the spotlight for a number of years now. Many people still rely on Nokogiri and Mechanize for HTML scraping needs. Both have been tested and proven themselves to be easy to use while being highly capable. We will look at both of them. But before that, I’d like to take a moment to address the problem that we are going to solve at the end of this short introductory series.

Permission

Before you start scraping away, make sure you have the permission of the sites you are trying to access for data extraction. If the site has an API or RSS feed, for example, it might not only be easier to get that desired content, it might also be the legal option of choice. 

Not everybody will appreciate it if you do massive scraping on their sites—understandably so. Get yourself educated on that particular site you are interested in, and don’t get yourself in trouble. Chances are low that you will inflict serious damage, but risking trouble unknowingly is not the way to go.

The Problem

I needed to build a new site for my podcast. The design was not where I wanted it to be, and I hated the way of publishing new posts. Damn WYSIWYGs! A little bit of context: about two years ago, I built the first version of my podcast site. The idea was to play with Sinatra and build something super lightweight. I ran into a couple of unexpected issues since I tailor-made pretty much everything.

Coming from Rails, it was definitely an educational journey that I appreciate, but I quickly regretted not having used a static site that I could have deployed through GitHub via GitHub Pages. Deploying new episodes and maintaining them lacked the simplicity that I was looking for. For a while, I decided that I had bigger fish to fry and focused on producing new podcast material instead.

This past summer I started to get serious and worked on a Middleman site that is hosted via GitHub Pages. For season two of the show, I wanted something fresh: a new, simplified design, Markdown for publishing new episodes, and no fist fights with Heroku—heaven! The thing was that I had 139 episodes lying around that needed to be imported and converted first in order to work with Middleman.

For posts, Middleman uses .markdown files that have so-called frontmatter for data—which basically replaces my database. Doing this transfer by hand is not an option for 139 episodes. That’s what computation is for. I needed to figure out a way to parse the HTML of my old website, scrape the relevant content, and transfer it to the blog posts that I use for publishing new podcast episodes on Middleman.
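For reference, a Middleman episode post might open with a frontmatter block along these lines. The title and date fields are standard Middleman blog fare; the rest is made up for illustration:

```yaml
---
title: "Episode 042: Some Guest"
date: 2016-12-14
subtitle: "A hypothetical subtitle"
---

Show notes in Markdown go below the frontmatter.
```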

Therefore, over the next three articles, I’m going to introduce you to the tools commonly used in Rubyland for such tasks. In the end, we’ll go over my solution to show you something practical as well.

Nokogiri

Even if you are completely new to Ruby/Rails, chances are very good that you have already heard about this little gem. The name is dropped often and sticks with you easily. I'm not sure that many know that nokogiri is Japanese for “saw”. 

It's a fitting name once you understand what the tool does. The creator of this gem is the lovely Tenderlove, Aaron Patterson. Nokogiri converts XML and HTML documents into a data structure—a tree data structure, to be more precise. The tool is fast and offers a nice interface as well. Overall, it’s a very potent library that takes care of a multitude of your HTML scraping needs.

You can use Nokogiri not only for parsing HTML; XML is fair game as well. It gives you the options of both XML Path Language and CSS interfaces to traverse the documents you load. XML Path Language, or XPath for short, is a query language that allows us to select nodes from XML documents.

CSS selectors are most likely more familiar to beginners. As with the styles you write, CSS selectors make it fantastically easy to target specific sections of pages that are of interest for extraction. You just need to let Nokogiri know what you are after when you target a particular destination.
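To make the difference concrete, here is a minimal sketch that targets the same node both ways. The URL and the markup it assumes are stand-ins, not the real site:

```ruby
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open('http://your-podcast-site.com/'))

# Same node, two query languages:
page.at_css('h2.post-title a')                 # CSS selector
page.at_xpath('//h2[@class="post-title"]/a')   # XPath expression
```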

Pages

What we always need to start with is fetching the actual page we are interested in. We specify what kind of Nokogiri document we want to parse—XML or HTML, for example:

some_scraper.rb
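A minimal sketch, with placeholder URLs. One note on open-uri: on current Rubies it is invoked via URI.open, while 2016-era examples used a bare open:

```ruby
require 'nokogiri'
require 'open-uri'

# Fetch an XML source and build a document tree from it.
xml_page = Nokogiri::XML(URI.open('http://example.com/some-feed.xml'))

# Fetch an HTML page and do the same.
html_page = Nokogiri::HTML(URI.open('http://example.com/'))
```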

Nokogiri::XML and Nokogiri::HTML can take IO objects or String objects. What happens above is straightforward: open-uri opens and fetches the designated page, and its structure, its XML or HTML, is then loaded into a new Nokogiri document. XML is not something beginners have to deal with very often.

Therefore, I’d recommend that we focus on HTML parsing for now. Why open-uri? This module from the Ruby Standard Library lets us grab the site without much fuss. Because IO objects are fair game, we can make easy use of open-uri.

API

Let’s put this into practice with a mini example:

at_css

some_podcast_scraper.rb
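A sketch of such a scraper. The URL stands in for my old podcast site, and the h2.post-title selector assumes the episode titles are marked up that way:

```ruby
require 'nokogiri'
require 'open-uri'

url  = 'http://your-podcast-site.com/'
page = Nokogiri::HTML(URI.open(url))

# at_css returns only the first node that matches the selector.
header = page.at_css('h2.post-title')
title  = header.text

puts "Raw header: #{header}"
puts "Title of the latest episode: #{title}"
```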

What we did here represents all the steps that are usually involved with web scraping—just at a micro level. We decide which URL we need and fetch the site, load the page into a new Nokogiri document, and then target a specific section.

Here I only wanted to know the title of the latest episode. Using the at_css method with the CSS selector h2.post-title was all I needed to target the extraction point. With this method we only scrape this singular element, though, and it gives us the whole node, markup and all—which is most of the time not exactly what we need. Therefore we extract only the inner text portion of this node via the text method. For comparison, you can check the output for both the header and the text below.

Output
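Something along these lines would come back; the exact markup naturally depends on the site:

```
Raw header: <h2 class="post-title"><a href="/episodes/139">Some Guest Name</a></h2>
Title of the latest episode: Some Guest Name
```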

Although this example has very limited applications, it possesses all the ingredients, all the steps that you need to understand. I think it’s cool how simple this is. Because it might not be obvious from this example, I would like to point out how powerful this tool can be. Let’s see what else we can do with a Nokogiri script.

Attention!

If you are a beginner and not sure how to target the HTML needed for this, I recommend that you search online to find out how to inspect the contents of websites in your browser. Basically, all major browsers make this process really easy these days. 

On Chrome you just need to right-click on an element in the website and choose the inspect option. This will open a small window at the bottom of your browser which shows you something like an x-ray of the site’s DOM. It has many more options, and I would recommend spending some time on Google to educate yourself. This is time spent wisely!

css

The css method will give us not only a single element of choice but any element that matches the search criteria on the page. Pretty neat and straightforward!

some_scraper.rb
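A sketch, again with a stand-in URL and selector:

```ruby
require 'nokogiri'
require 'open-uri'

url  = 'http://your-podcast-site.com/'
page = Nokogiri::HTML(URI.open(url))

# css returns every node on the page that matches the selector.
headers = page.css('h2.post-title')

headers.each do |header|
  puts "Episode title: #{header.text}"
end
```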

Output
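Illustratively, one line per matching node:

```
Episode title: Some Guest Name
Episode title: Another Guest Name
Episode title: Yet Another Guest Name
...
```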

The only little difference in this example is that I iterate over the raw headers first and then extract the inner text of each with the text method. Nokogiri stops at the end of the page and does not attempt to follow the pagination anywhere on its own.

Let’s say we want to have a bit more information, say the date and the subtitle for each episode. We can simply expand on the example above. It is a good idea anyway to take this step by step. Get a little piece working and add in more complexity along the way.

some_scraper.rb
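A sketch that expands on the previous one. The .post-date and .post-subtitle selectors are assumptions about the markup; adjust them to whatever the page actually uses:

```ruby
require 'nokogiri'
require 'open-uri'

url  = 'http://your-podcast-site.com/'
page = Nokogiri::HTML(URI.open(url))

headers   = page.css('h2.post-title')
dates     = page.css('.post-date')      # hypothetical selector
subtitles = page.css('.post-subtitle')  # hypothetical selector

headers.each   { |header|   puts "Title:    #{header.text}" }
dates.each     { |date|     puts "Date:     #{date.text}" }
subtitles.each { |subtitle| puts "Subtitle: #{subtitle.text}" }
```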

Output
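The output might read something like this:

```
Title:    Some Guest Name
Title:    Another Guest Name
...
Date:     2016-11-18
Date:     2016-11-11
...
Subtitle: A hypothetical subtitle
...
```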

At this point, we already have some data to play with. We can structure or butcher it any way we like. The above should simply show what we have in a readable fashion. Of course we can dig deeper into each of these by using regular expressions with the text method. 

We will look into this in a lot more detail when we get to solving the actual podcast problem. It won’t be a class on regexps, but you will see some more of them in action—no worries, though, not so much as to make your brain bleed.

Attributes

What could be handy at this stage is extracting the href for the individual episodes as well. It couldn’t be any simpler.

some_scraper.rb
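A sketch of the idea; podcast_url holds a stand-in root domain:

```ruby
require 'nokogiri'
require 'open-uri'

podcast_url = 'http://your-podcast-site.com'
page = Nokogiri::HTML(URI.open(podcast_url))

page.css('h2.post-title a').each do |link|
  # [:href] extracts the href attribute from the targeted node.
  episode_link = link[:href]

  # Prepend the root domain to build a complete URL for each episode.
  puts "#{podcast_url}#{episode_link}"
end
```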

The most important bits to pay attention to here are [:href] and podcast_url. If you tack on an attribute name as a symbol, [:href] in this case, you can simply extract that attribute from the targeted node. I abstracted a little further, but you can see more clearly how it works below.

To get a complete and useful URL, I saved the root domain in a variable and constructed the full URL for each episode.

Let’s take a quick look at the output:

Output
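Illustratively:

```
http://your-podcast-site.com/episodes/139
http://your-podcast-site.com/episodes/138
http://your-podcast-site.com/episodes/137
...
```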

Neat, isn’t it? You can do the same to extract the [:class] of a node.

If that node has more than one class, you will get a list of all of them.
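A quick sketch:

```ruby
page.css('h2.post-title').each do |header|
  puts header[:class]
  # => "post-title", or something like "post-title featured"
  #    if the node carries more than one class
end
```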

Node Navigation

  • parent
  • children
  • previous_sibling
  • next_sibling

We are used to dealing with tree structures in CSS or even jQuery. It would be a pain if Nokogiri didn't offer a handy API to move within such trees.

some_scraper.rb
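A sketch using the stand-in page from before:

```ruby
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open('http://your-podcast-site.com/'))

first_title = page.at_css('h2.post-title')

# Climb up one level and print the surrounding node with everything inside it.
puts first_title.parent

# Move sideways to the nodes on the same level of the tree. Mind that the
# whitespace between tags counts as a text node of its own.
puts first_title.previous_sibling
puts first_title.next_sibling

# Descend into the node's own children.
first_title.children.each { |child| puts child }
```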

Output
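The exact output depends on the markup, but .parent, for example, would print the whole enclosing node, children and all:

```
<article class="post">
  <h2 class="post-title"><a href="/episodes/139">Some Guest Name</a></h2>
  <h3 class="post-subtitle">A hypothetical subtitle</h3>
  ...
</article>
```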

As you can see for yourself, this is some pretty powerful stuff—especially when you see what .parent was able to collect in one go. Instead of defining a bunch of nodes by hand, you could collect them wholesale.

You can even chain them for more involved traversals. You can take this as complicated as you like, of course, but I would caution you to keep things simple. It can quickly get a little unwieldy and hard to understand. Remember, "Keep it simple, stupid!"

some_scraper.rb
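A chained sketch with the hypothetical markup from above; note the double hop, since the whitespace between the tags is a sibling of its own:

```ruby
# From the first title, hop over the whitespace text node to the subtitle.
puts page.at_css('h2.post-title').next_sibling.next_sibling

# Chains can also mix traversal with extraction.
puts page.at_css('h2.post-title').next_sibling.next_sibling.text
```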

Output
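With the hypothetical markup from above, something like:

```
<h3 class="post-subtitle">A hypothetical subtitle</h3>
A hypothetical subtitle
```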

Final Thoughts

Nokogiri is not a huge library, but it has a lot to offer. I recommend you play with what you have learned thus far and expand your knowledge through its documentation when you hit a wall. But don’t get yourself into trouble! 

This little intro should get you well on your way to understanding what you can do and how it works. I hope you will explore it a bit more on your own and have some fun with it. As you will find out on your own, it’s a rich tool that keeps on giving.

