In a previous tutorial, I showed you how to use the Requests module to access webpages using Python. The tutorial covered a lot of topics like making GET/POST requests and downloading things like images or PDFs programmatically. The one thing missing from that tutorial was a guide on scraping webpages you accessed using Requests to extract the information that you need.
In this tutorial, you will learn about Beautiful Soup, which is a Python library to extract data from HTML files. The focus in this tutorial will be on learning the basics of the library, and more advanced topics will be covered in the next tutorial. Please note that this tutorial uses Beautiful Soup 4 for all the examples.
Installation
You can install Beautiful Soup 4 using pip
. The package name is beautifulsoup4
. It should work on both Python 2 and Python 3.
$ pip install beautifulsoup4
If you don’t have pip installed on your system, you can directly download the Beautiful Soup 4 source tarball and install it using setup.py
.
$ python setup.py install
BeautifulSoup is originally packaged as Python 2 code. When you install it for use with Python 3, it is automatically updated to Python 3 code. The code won’t be converted unless you install the package. Here are a few common errors that you might notice:
- The “No module named HTMLParser”
ImportError
occurs when you are running the Python 2 version of the code under Python 3. - The “No module named html.parser”
ImportError
occurs when you are running the Python 3 version of the code under Python 2.
Both the errors above can be corrected by uninstalling and reinstalling Beautiful Soup.
Installing a Parser
Before discussing the differences between different parsers that you can use with Beautiful Soup, let’s write the code to create a soup.
from bs4 import BeautifulSoup soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")
The BeautifulSoup
object can accept two arguments. The first argument is the actual markup, and the second argument is the parser that you want to use. The different parsers are: html.parser
, lxml, and html5lib. The lxml
parser has two versions, an HTML parser and an XML parser.
The html.parser
is a built-in parser, and it does not work so well in older versions of Python. You can install the other parsers using the following commands:
$ pip install lxml $ pip install html5lib
The lxml
parser is very fast and can be used to quickly parse given HTML. On the other hand, the html5lib
parser is very slow, but it is also extremely lenient. Here is an example of using each of these parsers:
soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser") print(soup) # <html><p>This is <b>invalid HTML</b></p></html> soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "lxml") print(soup) # <html><body><p>This is <b>invalid HTML</b></p></body></html> soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "xml") print(soup) # <?xml version="1.0" encoding="utf-8"?> # <html><p>This is <b>invalid HTML</b></p></html> soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html5lib") print(soup) # <html><head></head><body><p>This is <b>invalid HTML</b></p></body></html>
The differences outlined by the above example only matter when you are parsing invalid HTML. However, most of the HTML on the web is malformed, and knowing these differences will help you in debugging some parsing errors and deciding which parser you want to use in a project. Generally, the lxml
parser is a very good choice.
Objects in Beautiful Soup
Beautiful Soup parses the given HTML document into a tree of Python objects. There are four main Python objects that you need to know about: Tag
, NavigableString
, BeautifulSoup
, and Comment
.
The Tag
object refers to an actual XML or HTML tag in the document. You can access the name of a tag using tag.name
. You can also set a tag’s name to something else. The name change will be visible in the markup generated by Beautiful Soup.
You can access different attributes like the class and id of a tag using tag['class']
and tag['id']
respectively. You can also access the whole dictionary of attributes using tag.attrs
. You can also add, remove or modify a tag’s attributes. The attributes like an element’s class
which can take multiple values are stored as a list.
The text within a tag is stored as a NavigableString
in Beautiful Soup. It has a few useful methods like replace_with("string")
to replace the text within a tag. You can also convert a NavigableString
to unicode string using unicode()
.
Beautiful Soup also allows you to access the comments in a webpage. These comments are stored as a Comment
object, which is also basically a NavigableString
.
You have already learned about the BeautifulSoup
object in the previous section. It is used to represent the document as a whole. Since it is not an actual object, it does not have any name or attributes.
Getting the Title, Headings, and Links
You can extract the page title and other such data very easily using Beautiful Soup. Let’s scrape the Wikipedia page about Python. First, you will have to get the markup of the page using the following code based on the Requests module tutorial to access webpages.
import requests from bs4 import BeautifulSoup req = requests.get('http://ift.tt/RSMpTo') soup = BeautifulSoup(req.text, "lxml")
Now that you have created the soup, you can get the title of the webpage using the following code:
soup.title # <title>Python (programming language) - Wikipedia</title> soup.title.name # 'title' soup.title.string # 'Python (programming language) - Wikipedia'
You can also scrape the webpage for other information like the main heading or the first paragraph, their classes, or the id
attribute.
soup.h1 # <h1 class="firstHeading" id="firstHeading" lang="en">Python (programming language)</h1> soup.h1.string # 'Python (programming language)' soup.h1['class'] # ['firstHeading'] soup.h1['id'] # 'firstHeading' soup.h1.attrs # {'class': ['firstHeading'], 'id': 'firstHeading', 'lang': 'en'} soup.h1['class'] = 'firstHeading, mainHeading' soup.h1.string.replace_with("Python - Programming Language") del soup.h1['lang'] del soup.h1['id'] soup.h1 # <h1 class="firstHeading, mainHeading">Python - Programming Language</h1>
Similarly, you can iterate through all the links or subheading in a document using the following code:
for sub_heading in soup.find_all('h2'): print(sub_heading.text) # all the sub-headings like Contents, History[edit]...
Navigating the DOM
You can navigate through the DOM tree using regular tag names. Chaining those tag names can help you navigate the tree more deeply. For example, you can get the first link in the first paragraph of the given Wikipedia page by using soup.p.a
. All the links in the first paragraph can be accessed by using soup.p.find_all('a')
.
You can also access all the children of a tag as a list by using tag.contents
. To get the children at a specific index, you can use tag.contents[index]
. You can also iterate over a tag's children by using the .children
attribute.
Both .children
and .contents
are useful only when you want to access the direct or first-level descendants of a tag. To get all the descendants, you can use the .descendants
attribute.
print(soup.p.contents) # [<b>Python</b>, ' is a widely used ',.....the full list] print(soup.p.contents[10]) # <a href="/wiki/Readability" title="Readability">readability</a> for child in soup.p.children: print(child.name) # b # None # a # None # a # None # ... and so on.
You can also access the parent of an element using the .parent
attribute. Similarly, you can access all the ancestors of an element using the .parents
attribute. The parent of the top-level <html>
tag is the BeautifulSoup
Object itself, and its parent is None.
print(soup.p.parent.name) # div for parent in soup.p.parents: print(parent.name) # div # div # div # body # html # [document]
You can access the previous and next sibling of an element using the .previous_sibling
and .next_sibling
attributes.
For two elements to be siblings, they should have the same parent. This means that the first child of an element will not have a previous sibling. Similarly, the last child of the element will not have a next sibling. In actual webpages, the previous and next siblings of an element will most probably be a new line character.
You can also iterate over all the siblings of an element using .previous_siblings
and .next_siblings
.
soup.head.next_sibling # '\n' soup.p.a.next_sibling # ' for ' soup.p.a.previous_sibling # ' is a widely used ' print(soup.p.b.previous_sibling) # None
You can go to the element that comes immediately after the current element using the .next_element
attribute. To access the element that comes immediately before the current element, use the .previous_element
attribute.
Similarly, you can iterate over all the elements that come before and after the current element using .previous_elements
and .next_elements
respectively.
Final Thoughts
After completing this tutorial, you should now have a good understanding of the main differences between different HTML parsers. You should now also be able to navigate through a webpage and extract important data. This can be helpful when you want to analyze all the headings or links on a given website.
In the next part of the series, you will learn how to use the Beautiful Soup library to search and modify the DOM.
No comments:
Post a Comment