Web Development: Data Extraction at Scale With Zenscrape

Web scraping is a great way of collecting data to take your business to the next level. It allows you to automate the process of extracting useful content from a variety of sources.

Unfortunately, automated web scraping is not always easy. Some websites actively block automated requests, while others build their pages with JavaScript, which leaves simple scrapers with little or no content to read.

In this post, I'll show you how you can use Zenscrape to overcome all these problems and extract data at scale from any website you like, without worrying about getting blocked.

Advantages of Using Zenscrape

I'll begin the discussion by listing some of the amazing features of Zenscrape that help you get the job done and set it apart from other scraping tools.

JavaScript Rendering

Many websites now actively use JavaScript to serve content to visitors. This means that the content that a simple scraper sees when visiting a webpage could be different from the content that users see when they actually visit the website through a browser.

Zenscrape solves this problem by allowing you to use its APIs to render requests in a modern, headless Chrome browser. It supports all popular libraries and frameworks like Vue, Angular, and React, among others.

Extracting Data at Scale

Some projects will require you to scrape webpages at a large scale, and this situation presents its own set of challenges. There's a higher chance of you getting blocked by the website, and it will take a lot longer to get all the data you need by simply making one request at a time.

Zenscrape overcomes these issues by providing you with a huge IP pool and automatic proxy rotation to easily hide your scraping bot. It also gives you the option to make concurrent requests in order to quickly scrape a large set of data.
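On the client side, concurrent requests can be sketched with PHP's curl_multi functions. This is only a minimal sketch under a few assumptions: zenscrape_url() and fetch_concurrently() are hypothetical helper names made up for illustration, the endpoint and the apikey header are the ones used in the examples later in this post, and the number of parallel requests your plan allows is something to confirm in your account.

```php
<?php

// Build the API request URL for one target page.
// Hypothetical helper; the endpoint is the one used later in this post.
function zenscrape_url(string $target): string
{
    return "https://app.zenscrape.com/api/v1/get?" . http_build_query(["url" => $target]);
}

// Fetch several pages in parallel and return the HTML keyed by target URL.
// Hypothetical helper, not part of any official SDK.
function fetch_concurrently(array $targets, string $apiKey): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($targets as $target) {
        $ch = curl_init(zenscrape_url($target));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HTTPHEADER, ["apikey: " . $apiKey]);
        curl_multi_add_handle($multi, $ch);
        $handles[$target] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $target => $ch) {
        $results[$target] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Calling fetch_concurrently(["https://example.com/a", "https://example.com/b"], "YOUR_API_KEY") would return both responses after the slower of the two finishes, rather than after the sum of both.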

Scraping Content Using Zenscrape

We will now learn how to use the Zenscrape API to scrape content from different kinds of websites.

You can get started by creating an account on the website. Zenscrape offers a free plan, so you can simply sign up to follow this tutorial. It will give you access to an API key that you can use to make requests. You can read the detailed documentation to discover how to make requests using the API in a variety of languages and environments like PHP, Python, and Node.js.

The code snippets in the documentation will be prepopulated with your API key once you have successfully registered and logged in.

You can also see other account-related information like usage statistics and your API key on the account dashboard page.

Extract Content From Wikipedia

Zenscrape lets you extract HTML from a webpage that you can then manipulate with a parser of your choice. We will use the PHP-based DiDOM parser for our example here, but you can also use some others mentioned in Zenscrape blog posts.

We will be scraping a Wikipedia page about a lighthouse for our example. Here is our PHP code to extract the HTML using Zenscrape's API.

<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);

$data = [
   "url" => "https://en.wikipedia.org/wiki/White_Shoal_Light,_Michigan",
];

curl_setopt($ch, CURLOPT_URL, "https://app.zenscrape.com/api/v1/get?" . http_build_query($data));
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "apikey: YOUR_API_KEY"  
));

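$html = curl_exec($ch);

// curl_exec() returns false on failure, so stop before trying to parse the response.
if ($html === false) {
    die("Request failed: " . curl_error($ch));
}

curl_close($ch);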

?>

The variable $html at this point contains the markup that Zenscrape extracted from the Wikipedia page. The first few lines of the markup look like this:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>White Shoal Light, Michigan - Wikipedia</title>
... and more ...

We can now pass this HTML to our DOM parser in order to extract information like the main heading, the first paragraph, or the first image from the Wikipedia article.

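<?php

// DiDOM is assumed to be installed with Composer; adjust the autoloader path to your project.
require 'vendor/autoload.php';

use DiDom\Document;

$document = new Document($html);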

$heading = $document->find("h1")[0]->text();
$top_image = $document->find("td.infobox-image img")[0]->src;
$first_paragraph = $document->find("div#mw-content-text p")[0]->text();

echo '<h2>'.$heading.'</h2>';
echo '<img src="'.$top_image.'">';
echo '<p>'.$first_paragraph.'</p>';

?>

Here is the output that I got back, styled using some basic CSS.
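Note that indexing find()[0] fails when a selector matches nothing, which can happen if a page has no infobox image. As a hedged alternative, here is a minimal sketch of the same extraction using PHP's bundled DOM extension instead of DiDOM, with a null check for missing elements. The first_text() helper is hypothetical, and the sample markup is a tiny stand-in, not the real Wikipedia page.

```php
<?php

// Return the trimmed text of the first node matching an XPath query, or null.
// Hypothetical helper written for this example.
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// A tiny stand-in for the markup returned by the API.
$sample = '<html><body><h1>White Shoal Light, Michigan</h1>'
    . '<div id="mw-content-text"><p>First paragraph.</p><p>Second.</p></div>'
    . '</body></html>';

$dom = new DOMDocument();
// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$heading = first_text($xpath, "//h1");
$first_paragraph = first_text($xpath, "//div[@id='mw-content-text']//p");
// The sample has no infobox, so this one comes back as null.
$top_image = first_text($xpath, "//td[contains(@class, 'infobox-image')]//img/@src");

echo $heading ?? "(no heading)", "\n";
echo $first_paragraph ?? "(no paragraph)", "\n";
echo $top_image ?? "(no image)", "\n";
```

With the real response in place of $sample, the same three queries should correspond to the CSS selectors used in the DiDOM snippet, though the exact markup of the page may change over time.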

Extract Localized Content From Websites

The homepage of the popular website Reddit looks different depending on the country from which you are visiting. The website tries to fill it with content that is relevant and popular in your location.

In our example, we will use Zenscrape to get some headlines from Reddit's homepage as seen from the United States and the United Kingdom, but Zenscrape lets you choose a location from over 230 different countries. The amazing part is that you can do all this by simply specifying two parameters in your API requests.

Here is the code that we use to get the HTML for Reddit's UK homepage using Zenscrape's API.

<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);

$data = [
   "url" => "https://www.reddit.com/",
   "premium" => "true",
   "location" => "uk"
];

curl_setopt($ch, CURLOPT_URL, "https://app.zenscrape.com/api/v1/get?" . http_build_query($data));
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "apikey: YOUR_API_KEY"  
));

$html = curl_exec($ch);

curl_close($ch);

?>

As you can see, this isn't much different from the code we used in the previous section. This time, though, we pass two additional query parameters called premium and location. Setting premium to true allows you to use residential proxies. After that, you can use location to specify the country from which you want to visit the URL. I've set it to uk in this example.

The Zenscrape documentation about web scraping provides more details about other such parameters.
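Since only the query parameters change from one request to the next, the cURL boilerplate can be wrapped in a small function. The following is a sketch rather than an official client: zenscrape_query_url() and zenscrape_get() are hypothetical names, and the endpoint and header are simply copied from the snippets in this post.

```php
<?php

// Build the request URL from any set of query parameters.
function zenscrape_query_url(array $params): string
{
    return "https://app.zenscrape.com/api/v1/get?" . http_build_query($params);
}

// Send one GET request and return the response body.
// Hypothetical wrapper written for this post.
function zenscrape_get(array $params, string $apiKey): string
{
    $ch = curl_init(zenscrape_query_url($params));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ["apikey: " . $apiKey]);

    $body = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    // curl_exec() returns false when the transfer itself fails.
    if ($body === false) {
        throw new RuntimeException("Zenscrape request failed: " . $error);
    }
    return $body;
}

// Usage: the Reddit request above becomes a single call.
// $html = zenscrape_get(
//     ["url" => "https://www.reddit.com/", "premium" => "true", "location" => "uk"],
//     "YOUR_API_KEY"
// );
```

Throwing an exception on failure keeps a failed transfer from being passed to the parser as if it were HTML.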

Similar to our previous example, the variable $html stores the extracted HTML that we got back. Now, we can parse and use this HTML in any way we like.

<?php 

// Document is DiDOM's DiDom\Document class.
$document = new Document($html);
$all_headings = $document->find("h3");

echo '<h2>Reddit Front Page (United Kingdom)</h2>';
echo '<ol>';
foreach($all_headings as $heading) {
    $heading_text = $heading->text();

    // Skip short h3 elements and reuse the text we already extracted.
    if(strlen($heading_text) > 15) {
        echo '<li><p>'.$heading_text.'</p></li>';
    }
}
echo '</ol>';

?>

Here, I am using the parsed HTML to display a list of headlines for demonstration purposes.
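Text scraped from another site should generally be escaped before it is echoed into your own page. Here is a minimal sketch of the length filter from the loop above combined with escaping. The filter_headlines() function is a hypothetical helper, and the sample strings are made up to stand in for the text of the scraped h3 elements.

```php
<?php

// Keep only strings longer than $minLength bytes and escape them for HTML output.
// strlen() counts bytes, as in the snippet above.
function filter_headlines(array $texts, int $minLength = 15): array
{
    $kept = [];
    foreach ($texts as $text) {
        $text = trim($text);
        if (strlen($text) > $minLength) {
            $kept[] = htmlspecialchars($text, ENT_QUOTES, "UTF-8");
        }
    }
    return $kept;
}

// Made-up sample strings, not real Reddit headlines.
$sample = ["Popular", "A headline that is long enough <b>to keep</b>", "  Short one  "];

foreach (filter_headlines($sample) as $headline) {
    echo "<li><p>" . $headline . "</p></li>\n";
}
```

Only the second sample string passes the filter, and its tags are printed as text instead of being interpreted as markup.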

Here's what I get when I use the Zenscrape API to scrape Reddit as a visitor from the United States.

Extract Content After JavaScript Rendering

One more problem that Zenscrape solves for you is extracting the HTML that visitors actually see on a website built with libraries and frameworks like Vue, React, or Angular.

I have created a simple CodePen demo to demonstrate this feature. Basic website scrapers will see different content on this page than actual website visitors will, because the content on the webpage is rendered using React.

You will get the following HTML inside the root element when using a simple cURL or file_get_contents() request.

<h1>Nothing to see here!</h1>

Zenscrape, on the other hand, gives you the option to render the request in a modern headless Chrome browser. This means that the HTML you get back using the Zenscrape API is the same HTML that users will see when they visit that webpage.

Here's the code that I used to extract the HTML that is finally shown to the users after running JavaScript.

<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);

$data = [
   "url" => "https://cdpn.io/Shokeen/debug/VwWjogr/VJMxxEJxmRYM",
   "render" => "true",
   "wait_for_css" => "div.joke",
];

curl_setopt($ch, CURLOPT_URL, "https://app.zenscrape.com/api/v1/get?" . http_build_query($data));
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "apikey: YOUR_API_KEY"  
));

$html = curl_exec($ch);
curl_close($ch);

echo $html;

?>

As you can see, all you need to do is pass two parameters, render and wait_for_css. Setting render to true tells Zenscrape to fetch the page in a headless browser so that its JavaScript runs. Setting wait_for_css to a CSS selector tells Zenscrape which element to wait for before it returns the HTML; in this example, that is div.joke.
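A quick way to confirm that rendering worked is to check whether the element you waited for is present in the response. The sketch below uses PHP's bundled DOM extension. The has_class() helper is hypothetical, and both sample strings are made-up stand-ins for the unrendered and rendered responses, with invented joke text.

```php
<?php

// Return true if the HTML contains at least one element with the given class.
// Hypothetical helper; $class is assumed to be a plain class name without quotes.
function has_class(string $html, string $class): bool
{
    if ($html === "") {
        return false; // loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    return $xpath->query($query)->length > 0;
}

// Made-up stand-ins for the two kinds of response.
$unrendered = '<div id="root"><h1>Nothing to see here!</h1></div>';
$rendered   = '<div id="root"><div class="joke"><p>Example joke text.</p></div></div>';

var_dump(has_class($unrendered, "joke")); // bool(false)
var_dump(has_class($rendered, "joke"));   // bool(true)
```

If the check fails on a real response, the request was probably sent without render set to true, or the selector does not match anything on the rendered page.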

The above code snippet allows you to extract HTML that can be parsed to get the following content.

Final Thoughts

Zenscrape solves a lot of web scraping problems for people who want to do data extraction at scale. What makes it great is the fact that it is easy to implement and doesn't require you to spend days or weeks learning about the API.

As you saw in the three examples above, Zenscrape takes care of everything from localization to JavaScript rendering for you by asking for just a few parameters. You only need to write a few lines of code, and everything will be up and running in no time. There's even a request builder that you can use to get the code necessary to make requests using Python, Node.js, PHP, etc.

You can use the Zenscrape API to do a lot of tasks like getting sales leads or tracking the pricing and availability of products on eCommerce platforms. Visit Zenscrape and read about it yourself. There's a free plan that comes with 1,000 credits per month. You can register for a free Zenscrape account in a couple of minutes and test out all the features yourself.

Tuesday, September 7, 2021