Web Development: Scraping Webpages in Python With Beautiful Soup: Search and DOM Modification

Scraping Webpages in Python With Beautiful Soup: Search and DOM Modification

In the last tutorial, you learned the basics of the Beautiful Soup library. Besides navigating the DOM tree, you can also search for elements with a given class or id. You can also modify the DOM tree using this library.

In this tutorial, you will learn about different methods that will help you with the search and modifications. We will be scraping the same Wikipedia page about Python from our last tutorial.

Filters for Searching the Tree

Beautiful Soup has a lot of methods for searching the DOM tree. These methods are very similar and take the same kinds of filters as arguments. Therefore, it makes sense to properly understand the different filters before reading about the methods. I will be using the same find_all() method to explain the difference between different filters.

The simplest filter that you can pass to any search method is a string. Beautiful Soup will then search through the document to find a tag that exactly matches the string.

for heading in soup.find_all('h2'):
    print(heading.text)
    
# Contents
# History[edit]
# Features and philosophy[edit]
# Syntax and semantics[edit]
# Libraries[edit]
# Development environments[edit]
# ... and so on.

You can also pass a regular expression object to the find_all() method. This time, Beautiful Soup will filter the tree by matching all the tags against a given regular expression.

import re

for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())
    
# h1 Python (programming language)
# h2 Contents
# h2 History[edit]
# h2 Features and philosophy[edit]
# h2 Syntax and semantics[edit]
# h3 Indentation[edit]
# h3 Statements and control flow[edit]
# ... an so on.

The code will look for all the tags that begin with "h" and are followed by a digit from 1 to 6. In other words, it will be looking for all the heading tags in the document.

Instead of using regex, you could achieve the same result by passing a list of all the tags that you want Beautiful Soup to match against the document.

for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())

You can also pass True as a parameter to the find_all() method. The code will then return all the tags in the document. The output below means that there are currently 4,339 tags in the Wikipedia page that we are parsing.

len(soup.find_all(True))
# 4339

If you are still not able to find what you are looking for with any of the above filters, you can define your own function that takes an element as its only argument. The function also needs to return True if there is a match and False otherwise. Depending on what you need, you can make the function as complicated as it needs to be to do the job. Here is a very simple example:

def big_lists(tag):
    return len(tag.contents) > 20 and tag.name == 'ul'
    
len(soup.find_all(big_lists))
# 13

The above function is going through the same Wikipedia Python page and looking for unordered lists that have more than 20 children.

Searching the DOM Tree Using Built-In Functions

One of the most popular methods for searching through the DOM is find_all(). It will go through all the tag's descendants and return a list of all the descendants that match your search criteria. This method has the following signature:

find_all(name, attrs, recursive, string, limit, **kwargs)

The name argument is the name of the tag that you want this function to search for while going through the tree. You are free to provide a string, a list, a regular expression, a function, or the value True as a name.

You can also filter the elements in the DOM tree on the basis of different attributes like id, href, etc. You can also get all the elements with a specific attribute regardless of its value using attribute=True. Searching for elements with a specific class is different from searching for regular attributes. Since class is a reserved keyword in Python, you will have to use the class_ keyword argument when looking for elements with a specific class.

import re

len(soup.find_all(id=True))
# 425

len(soup.find_all(class_=True))
# 1734

len(soup.find_all(class_="mw-headline"))
# 20

len(soup.find_all(href=True))
# 1410

len(soup.find_all(href=re.compile("python")))
# 102

You can see that the document has 1,734 tags with a class attribute and 425 tags with an id attribute. If you only need the first few of these results, you can pass a number to the method as the value of limit. Passing this value will instruct Beautiful Soup to stop looking for more elements once it has reached a certain number. Here is an example:

soup.find_all(class_="mw-headline", limit=4)

# <span class="mw-headline" id="History">History</span>
# <span class="mw-headline" id="Features_and_philosophy">Features and philosophy</span>
# <span class="mw-headline" id="Syntax_and_semantics">Syntax and semantics</span>
# <span class="mw-headline" id="Indentation">Indentation</span>

When you use the find_all() method, you are telling Beautiful Soup to go through all the descendants of a given tag to find what you are looking for. Sometimes, you want to look for an element only in the direct children on a tag. This can be achieved by passing recursive=False to the find_all() method.

len(soup.html.find_all("meta"))
# 6

len(soup.html.find_all("meta", recursive=False))
# 0

len(soup.head.find_all("meta", recursive=False))
# 6

If you are interested in finding only one result for a particular search query, you can use the find() method to find it instead of passing limit=1 to find_all(). The only difference between the results returned by these two methods is that find_all() returns a list with only one element and find() just returns the result.

soup.find_all("h2", limit=1)
# [<h2>Contents</h2>]

soup.find("h2")
# <h2>Contents</h2>

The find() and find_all() methods search through all the descendants of a given tag to search for an element. There are ten other very similar methods that you can use to iterate through the DOM tree in different directions.

find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)

find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

The find_parent() and find_parents() methods traverse up the DOM tree to find the given element. The find_next_sibling() and find_next_siblings() methods will iterate over all the siblings of the element that come after the current one. Similarly, the find_previous_sibling() and find_previous_siblings() methods will iterate over all the siblings of the element that come before the current one.

The find_next() and find_all_next() methods will iterate over all the tags and strings that come after the current element. Similarly, the find_previous() and find_all_previous() methods will iterate over all the tags and strings that come before the current element.

You can also search for elements using CSS selectors with the help of the select() method. Here are a few examples:

len(soup.select("p a"))
# 411

len(soup.select("p > a"))
# 291

soup.select("h2:nth-of-type(1)")
# [<h2>Contents</h2>]

len(soup.select("p > a:nth-of-type(2)"))
# 46

len(soup.select("p > a:nth-of-type(10)"))
# 6

len(soup.select("[class*=section]"))
# 80

len(soup.select("[class$=section]"))
# 20

Modifying the Tree

You can not only search through the DOM tree to find an element but also modify it. It is very easy to rename a tag and modify its attributes.

heading_tag = soup.select("h2:nth-of-type(2)")[0]

heading_tag.name = "h3"
print(heading_tag)
# <h3><span class="mw-headline" id="Features_and_philosophy">Feat...

heading_tag['class'] = 'headingChanged'
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline" id="Feat...

heading_tag['id'] = 'newHeadingId'
print(heading_tag)
# <h3 class="headingChanged" id="newHeadingId"><span class="mw....

del heading_tag['id']
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline"...

Continuing from our last example, you can replace a tag's contents with a given string using the .string attribute. If you don't want to replace the contents but add something extra at the end of the tag, you can use the append() method.

Similarly, if you want to insert something inside a tag at a specific location, you can use the insert() method. The first parameter for this method is the position or index at which you want to insert the content, and the second parameter is the content itself. You can remove all the content inside a tag using the clear() method. This will just leave you with the tag itself and its attributes.

heading_tag.string = "Features and Philosophy"
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy</h3>

heading_tag.append(" [Appended This Part].")
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy [Appended This Part].</h3>

print(heading_tag.contents)
# ['Features and Philosophy', ' [Appended This Part].']

heading_tag.insert(1, ' Inserted this part ')
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy Inserted this part  [Appended This Part].</h3>

heading_tag.clear()
print(heading_tag)
# <h3 class="headingChanged"></h3>

At the beginning of this section, you selected a level two heading from the document and changed it to a level three heading. Using the same selector again will now show you the next level two heading that came after the original. This makes sense because the original heading is no longer a level two heading.

The original heading can now be selected using h3:nth-of-type(2). If you completely want to remove an element or tag and all the content inside it from the tree, you can use the decompose() method.

soup.select("h3:nth-of-type(2)")[0]
# <h3 class="headingChanged"></h3>

soup.select("h3:nth-of-type(3)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

soup.select("h3:nth-of-type(2)")[0].decompose()
soup.select("h3:nth-of-type(2)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

Once you've decomposed or removed the original heading, the heading in the third spot takes its place.

If you want to remove a tag and its contents from the tree but don't want to completely destroy the tag, you can use the extract() method. This method will return the tag that it extracted. You will now have two different trees that you can parse. The root of the new tree will be the tag that you just extracted.

heading_tree = soup.select("h3:nth-of-type(2)")[0].extract()

len(heading_tree.contents)
# 2

You can also replace a tag inside the tree with something else of your choice using the replace_with() method. This method will return the tag or string that it replaced. It can be helpful if you want to put the replaced content somewhere else in the document.

soup.h1
# <h1 class="firstHeading">Python (programming language)</h1>

bold_tag = soup.new_tag("b")
bold_tag.string = "Python"

soup.h1.replace_with(bold_tag)

print(soup.h1)
# None
print(soup.b)
# <b>Python</b>

In the above code, the main heading of the document has been replaced with a b tag. The document no longer has an h1 tag, and that is why print(soup.h1) now prints None.

Final Thoughts

After reading the two tutorials in this series, you should now be able to parse different webpages and extract important data from the document. You should also be able to retrieve the original webpage, modify it to suit your own needs, and save the modified version locally.

If you have any questions regarding this tutorial, please let me know in the comments.

Web Development

Wednesday, April 5, 2017

Scraping Webpages in Python With Beautiful Soup: Search and DOM Modification

Filters for Searching the Tree

Searching the DOM Tree Using Built-In Functions

Modifying the Tree

Final Thoughts

No comments:

Post a Comment