Parsing HTML With PHP Using DiDOM

Every now and then developers need to scrape webpages to get some information from a website. For example, let's say you are working on a personal project where you have to get geographical information about the capitals of different countries from Wikipedia. Entering this manually would take a lot of time. However, you can do it very quickly by scraping the Wikipedia page with the help of PHP. You will also be able to automatically parse the HTML to get specific information instead of going through the whole markup manually.

In this tutorial, we will learn about a fast, easy-to-use HTML parser called DiDOM. We will begin with the installation process and then learn how to extract information from elements on a webpage using different kinds of selectors, such as tags and classes.

Installation and Usage

You can easily install DiDOM in your project directory by running the following command:

composer require imangazaliev/didom

Once you have run the above command, you will be able to load HTML from a string, a local file or a webpage. Here is an example:

require_once('vendor/autoload.php');

use DiDom\Document;

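// Load HTML from a string that is already stored in a variable.
$document = new Document($washington_dc_html_string);

// Load HTML from a local file.
$document = new Document('washington_dc.html', true);

// Load HTML from a webpage.
$url = 'https://en.wikipedia.org/wiki/Washington,_D.C.';
$document = new Document($url, true);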

Sometimes the HTML you want to parse has already been loaded and stored in a variable. In such cases, you can simply pass that variable to Document() and DiDOM will prepare the string for parsing.

If the HTML has to be loaded from a file or a URL, you can pass that as the first parameter to Document() and set the second parameter to true.

You can also create a new Document object by using new Document() without any parameters. In this case, you can call the method loadHtml() to load HTML from a string and loadHtmlFile() to load HTML from a file or webpage.
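
As a minimal sketch of this two-step approach, the snippet below creates an empty Document and then loads a string into it. The markup in $html is made up for illustration, and the commented-out loadHtmlFile() call assumes the washington_dc.html file from the earlier example exists.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical inline markup, used only for illustration.
$html = '<html><body><h1 class="title">Washington, D.C.</h1></body></html>';

// Start with an empty document, then load a string into it.
$document = new Document();
$document->loadHtml($html);

echo $document->find('h1')[0]->text(), "\n";

// loadHtmlFile() takes a local path or a URL instead of a string.
// $document->loadHtmlFile('washington_dc.html');
```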

Finding HTML Elements

The first thing that you have to do before getting the HTML or text from an element is find the element itself. The easiest way to do that is to simply use the find() method and pass the CSS selector for your intended element as the first parameter.

You can also pass the XPath for an element as the first parameter of the find() method. However, this requires you to pass Query::TYPE_XPATH as the second parameter, which in turn means importing the class with use DiDom\Query;.

If you only want to use XPath expressions for finding an HTML element, you can simply use the xpath() method instead of passing Query::TYPE_XPATH as the second parameter to find() every time.

If DiDOM can find elements which match the passed CSS selector or XPath expression, it will return an array of instances of DiDom\Element. If no such elements are found, it will return an empty array.

Since these methods return an array, you can directly access the nth matching element by using find()[n-1].
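
The three lookup styles described above can be sketched side by side. The markup is made up for illustration. Note that the Query class has to be imported before Query::TYPE_XPATH can be used.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Query;

// Hypothetical inline markup, used only for illustration.
$html = '<div><p class="intro">First</p><p>Second</p></div>';
$document = new Document($html);

// CSS selector (the default query type).
$by_css = $document->find('p.intro');

// XPath through find(), with the query type as the second parameter.
$by_xpath = $document->find('//p[@class="intro"]', Query::TYPE_XPATH);

// The xpath() shortcut.
$paragraphs = $document->xpath('//p');

// No match gives an empty array, so check the count before indexing.
$missing = $document->find('h3');

echo count($by_css), ' ', count($by_xpath), ' ', count($paragraphs), ' ', count($missing), "\n";

// The second matching paragraph: find()[n-1] with n = 2.
echo $paragraphs[1]->text(), "\n";
```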

An Example

In the following example, we will get the HTML of all the first and second level headings in the Wikipedia article about Washington, D.C.

require_once('vendor/autoload.php');

use DiDom\Document;

$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);

$main_heading = $document->find('h1.firstHeading')[0];
echo $main_heading->html();

$sub_headings = $document->find('h2');

foreach($sub_headings as $sub_heading) {
    if($sub_heading->text() !== 'See also') {
        echo $sub_heading->html();
    } else {
        break;
    }
}

We begin by creating a new Document object by passing the URL of the Wikipedia article about Washington, D.C. After that, we get the main heading element using the find() method and store it inside a variable called $main_heading. We will now be able to call different methods on this element, such as text(), innerHtml() and html().

For the main heading, we just call the html() method, which returns the HTML of the whole heading element. Similarly, we can get the HTML inside a particular element by using the innerHtml() method. Sometimes, you will be more interested in the plain text content of an element instead of its HTML. In such cases, you can simply use the text() method and be done with it.
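
To make the difference between the three methods concrete, here is a small sketch that calls each of them on the same element. The heading markup is made up for illustration.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical inline markup, used only for illustration.
$document = new Document('<h1 class="title">Hello <em>World</em></h1>');
$heading = $document->find('h1')[0];

echo $heading->html(), "\n";      // the whole element, including the <h1> tags
echo $heading->innerHtml(), "\n"; // only the markup inside the <h1>
echo $heading->text(), "\n";      // plain text with the tags stripped
```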

The level two headings divide our Wikipedia page into well-defined sections. However, you might want to get rid of some of those subheadings, like "See also" and "Notes".

One way to do so would be to loop through all the level two headings and check the value returned by the text() method. We break out of the loop if the returned heading text is "See also".

You could directly get to the 4th or 6th level two heading by using $document->find('h2')[3] and $document->find('h2')[5] respectively.

Traversing Up and Down the DOM

Once you have access to a particular element, the library allows you to traverse up and down the DOM tree to access other elements with ease.

You can go to the parent of an HTML element using the parent() method. Similarly, you can get to the next or previous sibling of an element using the nextSibling() and previousSibling() methods.

There are a lot of methods available to get access to the children of a DOM element as well. For instance, you can get to a particular child element using the child(n) method. Similarly, you can get access to the first or last child of a particular element using the firstChild() and lastChild() methods. You can loop over all the children of a particular DOM element using the children() method.

Once you get to a particular element, you will be able to access its HTML etc. using the html(), innerHtml() and text() methods.
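
The traversal methods can be tried out on a small list. This is only a sketch: the markup is made up, and it is written without whitespace between the tags so that every child node is an li element rather than a whitespace text node.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical inline markup with no whitespace between tags.
$document = new Document('<ul id="cities"><li>One</li><li>Two</li><li>Three</li></ul>');
$list = $document->find('ul')[0];

$first  = $list->firstChild();
$last   = $list->lastChild();
$second = $list->child(1); // the index counts from zero

echo $first->text(), ' ', $second->text(), ' ', $last->text(), "\n";

// Move sideways and upwards from a known element.
echo $first->nextSibling()->text(), "\n";
echo $last->previousSibling()->text(), "\n";
echo $second->parent()->attr('id'), "\n";

// Loop over every child of the list.
foreach ($list->children() as $child) {
    echo $child->text(), "\n";
}
```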

In the following example, we start with level two heading elements and keep checking if the next sibling element contains some text. As soon as we find a sibling element with some text, we output it to the browser.

require_once('vendor/autoload.php');

use DiDom\Document;

$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);

$sub_headings = $document->find('h2');

for($i = 1; $i < count($sub_headings); $i++) {
    if($sub_headings[$i]->text() !== 'See also') {
        $next_sibling = $sub_headings[$i]->nextSibling();
        // Skip siblings with no HTML and stop if we run out of siblings.
        while($next_sibling !== null && !$next_sibling->html()) {
            $next_sibling = $next_sibling->nextSibling();
        }

        if($next_sibling !== null) {
            echo $next_sibling->html()."<br>";
        }
    } else {
        break;
    }
}

You can use a similar technique to loop through all the sibling elements and only output the text if it contains a particular string or if the sibling element is a paragraph tag etc. Once you know the basics, finding the right information is easy.
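
For instance, picking the first paragraph that follows each heading could look roughly like the sketch below. The markup and the $first_paragraphs array are made up for illustration. The tag name is read from the underlying DOM node through getNode(), and the walk stops at the next h2 so that a section without a paragraph does not borrow one from the section after it.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical inline markup, used only for illustration.
$html = '<div><h2>History</h2><ul><li>A note</li></ul><p>Text about history.</p>'
      . '<h2>Geography</h2><p>Text about geography.</p></div>';
$document = new Document($html);

$first_paragraphs = [];

foreach ($document->find('h2') as $heading) {
    $sibling = $heading->nextSibling();

    // Walk forward until we reach a <p>, the next <h2>, or the end.
    while ($sibling !== null) {
        $name = $sibling->getNode()->nodeName;
        if ($name === 'p' || $name === 'h2') {
            break;
        }
        $sibling = $sibling->nextSibling();
    }

    // Only keep the result if the walk ended on a paragraph.
    if ($sibling !== null && $sibling->getNode()->nodeName === 'p') {
        $first_paragraphs[$heading->text()] = $sibling->text();
    }
}

print_r($first_paragraphs);
```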

Manipulating Element Attributes

The ability to get or set the attribute value for different elements can prove very useful in certain situations. For example, we can get the value of the src attribute for all the img tags in our Wikipedia article by using $image_elem->attr('src'). In a similar manner, you can get the value of the href attribute for all the a tags in a document.

There are three ways of getting the value of a given attribute for an HTML element. You can use the getAttribute('attrName') method and pass the name of the attribute you are interested in as a parameter. You can also use the attr('attrName') method, which works just like getAttribute(). Finally, the library also allows you to directly get the attribute value using $elem->attrName. This means that you can get the value of the src attribute for an image element directly by using $imageElem->src.

require_once('vendor/autoload.php');

use DiDom\Document;

$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);

$images = $document->find('img');

foreach($images as $image) {
    echo $image->src."<br>";
}

Once you have access to the src attributes, you can write the code to automatically download all the image files. This way you will be able to save a lot of time.
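
A rough sketch of such a download step is shown below. Both helper functions, absolute_image_url() and download_image(), are hypothetical names introduced here and are not part of DiDOM. The sketch assumes that src values may be protocol-relative (starting with //) or relative to the site root, and it does nothing about duplicate file names or rate limiting.

```php
<?php
// Hypothetical helper: turn a src value into an absolute URL.
// Assumes relative paths are relative to the root of $base.
function absolute_image_url(string $src, string $base = 'https://en.wikipedia.org'): string
{
    if (strpos($src, '//') === 0) {
        // Protocol-relative URL such as //upload.wikimedia.org/...
        return 'https:' . $src;
    }
    if (preg_match('#^https?://#i', $src)) {
        // Already absolute.
        return $src;
    }
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

// Hypothetical helper: save one image into $directory.
// Returns true on success and false on any failure.
function download_image(string $url, string $directory): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    $name = basename(is_string($path) ? $path : '');
    if ($name === '') {
        return false;
    }
    $data = @file_get_contents($url);
    if ($data === false) {
        return false;
    }
    return file_put_contents($directory . '/' . $name, $data) !== false;
}

echo absolute_image_url('//upload.wikimedia.org/example.png'), "\n";
```

Inside the foreach loop from the previous example, you could then call download_image(absolute_image_url($image->src), 'images'), assuming a writable images directory exists.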

You can also set the value of a given attribute using three different techniques. First, you can use the setAttribute('attrName', 'attrValue') method. Second, you can use the attr('attrName', 'attrValue') method. Finally, you can set the attribute value for a given element using $elem->attrName = 'attrValue'.
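
Here is a short sketch that reads and writes attributes in all three styles. The img markup and the attribute values are made up for illustration.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical inline markup, used only for illustration.
$document = new Document('<img src="capitol.jpg" alt="Capitol">');
$image = $document->find('img')[0];

// Three ways to read an attribute.
echo $image->getAttribute('src'), "\n";
echo $image->attr('alt'), "\n";
echo $image->src, "\n";

// Three ways to write an attribute.
$image->setAttribute('alt', 'United States Capitol');
$image->attr('title', 'Capitol building');
$image->width = '200';

// The element's HTML now carries the new attribute values.
echo $image->html(), "\n";
```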

Adding, Removing and Replacing Elements

You can also make changes to the loaded HTML document using different methods provided by the library. For example, you can add, replace or remove elements from the DOM tree using the appendChild(), replace() and remove() methods.

The library also allows you to create your own HTML elements in order to append them to the original HTML document. You can create a new Element object by using new Element('tagName', 'tagContent').

Keep in mind that you will get an Uncaught Error: Class 'Element' not found error if your program does not contain the line use DiDom\Element; before instantiating the Element object.

Once you have the element, you can either append it to other elements in the DOM using the appendChild() method or you can use the replace() method to use the newly instantiated element as a replacement for some old HTML element in the document. The following example should help in further clarifying this concept.

require_once('vendor/autoload.php');

use DiDom\Document;
use DiDom\Element;

$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);

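// find() returns an empty array here, so indexing [0] and calling html()
// on it would stop the script with an error. Check the count instead.
echo count($document->find('h2.test-heading'))."\n";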

$test_heading = new Element('h2', 'This is test heading.');
$test_heading->class = 'test-heading';

$document->find('h1')[0]->replace($test_heading);

echo $document->find('h2.test-heading')[0]->html()."\n";

Initially, there is no h2 element in our document with the class test-heading, so find() returns an empty array. Trying to take the first item of that array and call html() on it results in an error.

After verifying that there is no such element, we create a new h2 element and change the value of its class attribute to test-heading.

After that, we replace the first h1 element in the document with our newly created h2 element. Using the find() method on our document again to find the h2 heading with class test-heading will return an element now.
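
The appendChild() and remove() methods work along the same lines. The following sketch uses a small made-up list instead of the Wikipedia page so that the result is easy to follow.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Element;

// Hypothetical inline markup, used only for illustration.
$document = new Document('<ul><li>First</li><li>Second</li></ul>');
$list = $document->find('ul')[0];

// Add a new item at the end of the list.
$list->appendChild(new Element('li', 'Third'));

// Remove the item that is currently first.
$document->find('li')[0]->remove();

// Two items are left: Second and Third.
foreach ($document->find('li') as $item) {
    echo $item->text(), "\n";
}
```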

Final Thoughts

This tutorial covered the basics of the DiDOM HTML parser for PHP. We began with the installation and then learned how to load HTML from a string, file or URL. After that, we discussed how to find a particular element based on its CSS selector or XPath. We also learned how to get the siblings, parent or children of an element. The remaining sections covered how to manipulate the attributes of a particular element and how to add, remove and replace elements in an HTML document.

If there is anything that you would like me to clarify in the tutorial, feel free to let me know in the comments.

Web Development

Wednesday, July 4, 2018