Web Scraping with NodeJS: a comprehensive guide [part-1]

This blog is about web scraping with NodeJS. It is divided into a three-part series in which you will learn how to scrape any type of website using real-world examples. It covers strategies and practices that you won't easily find anywhere else, and by the end you will be able to build your own scraper. This post is aimed at anyone interested in learning web scraping with NodeJS, because other tutorials can be a little challenging to follow and you don't always find everything you need in one place.

Requirements

The only requirement is a basic understanding of JavaScript, or at least familiarity with it, as it will be the only programming language we are going to use. I would also recommend reviewing the async/await syntax (introduced in ES2017), as we'll be using it a lot.
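If you need a quick refresher, here is a minimal sketch of the async/await pattern we will rely on throughout the series; fetchGreeting is just a made-up stand-in for any asynchronous operation:

// A minimal async/await refresher: fetchGreeting is a made-up helper
// that resolves after a short delay, purely to illustrate the syntax.
const fetchGreeting = () =>
  new Promise((resolve) => setTimeout(() => resolve("Hello, scraper!"), 500));

(async () => {
  // `await` pauses this async function until the promise resolves.
  const greeting = await fetchGreeting();
  console.log(greeting); // "Hello, scraper!"
})();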

Tools

If you're wondering what sort of tools we will be using: we will use VS Code, a free code editor that also supports NodeJS debugging without the need to install extra, complicated plugins. We'll also use a variety of libraries, but mostly Puppeteer, which is built and maintained by the Google team, as well as NightmareJS.

Tools and Project Setup.

Before we begin creating our scraper, we must first set up our environment by downloading our editor and NodeJS, and complete some basic project setup before we can start writing code. First, go to the official VS Code website, code.visualstudio.com, and download and install the VS Code editor, which is available free of charge. The key reason to use Visual Studio Code is that it is free and comes with a built-in debugger for NodeJS, which makes debugging much easier.

VS code editor

After installing the VS Code editor, we must make sure NodeJS is installed so we can run our NodeJS application on our machine. NodeJS can be downloaded from nodejs.org and is compatible with both macOS and Windows, with a simple installation process. All you need to do is download the package and follow the installation instructions.

NodeJs Installation

To check which version of NodeJS you have installed, go to the terminal/command line and type node -v, which will print the version currently installed on your computer. If this command doesn't work, make sure you restart your computer after installing NodeJS.

Now that you've completed this, open VS Code, open an empty folder in it, then open the integrated terminal and run the npm init -y command to initialize the project. You can see that a new "package.json" file has been created in that folder, which means we can start installing and using libraries right away. Also, inside that folder, create a new file called index.js, which will be our entry file; now we're ready to write some code inside it. This is how your file structure should look.

File structure

Simple IMDB movie scraper.

We'll be building a simple IMDB scraper that parses the data from an IMDB movie page. This is just one of many real-world examples included in this section of the blog to give you a gist of what can be accomplished in a short amount of time, so don't worry about understanding every detail of this example; we'll go into more depth in the upcoming examples.

So we'll use NodeJS to send a direct request to the IMDB website and expect a response that looks exactly like the image below. To begin, navigate to the IMDB website, right-click, and then select View page source.

View page source

Page source

As you can see, this is the exact HTML content, along with JavaScript and CSS, that we will scrape using our NodeJS scraper. Following that, we will use Cheerio JS, a NodeJS library that can handle HTML content and has a syntax nearly identical to jQuery's. It would be preferable if you were already familiar with the jQuery syntax. To begin, we must import both the request-promise library, which is in charge of managing requests to the IMDB website, and the Cheerio JS library, which will be used to parse the HTML contents.

Now, make sure you're in the index.js file that we previously created, and begin importing the actual libraries inside it.

const request = require("request-promise");
const cheerio = require("cheerio");

Next, go to the IMDB website and copy the URL of whatever movie you want, because we're going to send the request to that specific URL. Simply create a variable named URL and paste the copied link into it.

href link

const URL = "https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1";

Because we cannot use await at the top level of index.js unless it is inside a function declared as async, we must now write a simple async function that runs when the NodeJS scraper is fired. Simply create a nameless asynchronous function that executes immediately. Before we write anything inside that async function, we must first install the request-promise and cheerio libraries. To do so, go to the terminal and type the following command.

npm install cheerio request-promise

After installing the packages, it should look something like this.

Package Installation

So, now that we have everything installed and ready to go, we can use the request library. Create a variable called response and simply await the request inside it, passing the URL as its parameter. To test the response, simply console log it; we should be able to see the raw response. To run it, go to the debugging tab and press the run button.
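For reference, here is a minimal sketch of what index.js can look like at this stage, before we add any custom headers; it simply awaits the GET request and logs the raw HTML:

const request = require("request-promise");
const cheerio = require("cheerio");

const URL = "https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1";

(async () => {
  // Send a plain GET request to the movie page and wait for the raw HTML.
  const response = await request(URL);

  // Log the raw response so we can confirm the request worked.
  console.log(response);
})();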

Debugging and running

As you can see, it worked; we got what we needed, which means the script was successful, and we can now begin passing our response to the cheerio library and using it to go through each of the HTML properties and find out exactly what we need.

First, let's get rid of the console log and implement the cheerio library.

let $ = cheerio.load(response);

We simply created a $ (dollar) variable that holds cheerio loaded with the actual IMDB response. Now we can begin writing the scraping logic for the movie title. First, go to the movie page that you want to scrape, right-click on the title, and select Inspect Element.

Inspect element

There we have the div element and, inside it, an h1 element as well as a span element, which contain the title of the movie and its rating. We can select these elements by using jQuery-like selectors, as shown in the code below.

let title = $("section.ipc-page-section > div > div > h1").text();
let rating = $(
  "div.ipc-button__text > div > div:nth-child(2) > div > span"
).text().slice(0, 6);
console.log(`"${title}" movie has an IMDB rating of ${rating}`);

If you select the debug option again, you should see something similar to this.

Debug

So, now that you have enough information to get started with web scraping, let's delve into much more detail.

Why and when should you scrape a website?

So, before you begin creating a scraper for a website, you should ask yourself:

  • "What data do I need from that particular website ?",
  • "Do they have an API for that? ",
  • "Do they provide all the information that I need ?",
  • "Do they have any limitations that will stall your results ?",
  • "Do I have to pay to send the request to their server ?"

So, if you ever find yourself in a situation where you believe you will not benefit from an official API due to the reasons stated above, or if the website does not even have an API, you should consider creating a scraper.

What we did previously is a perfect example: we wrote a straightforward IMDB scraper because IMDB does not have an official API that is accessible to the public, so we relied on scraping the data. Of course, the scraper that we wrote is very basic, but it demonstrated the possibility and power of scraping with NodeJS. As a hot tip, we will revisit the IMDB scraper and write an even more complex version of it later on.

Before we begin, we must understand when it is appropriate to scrape data from a website. Keep in mind that web scraping is not always an ethical solution, nor is it always a black-hat solution; it falls somewhere in the middle. That is to say, web scraping is not illegal, but it can get you in trouble if you violate someone else's website or organizational policies. So, before you plan to scrape a website, look at its terms of service and see whether they say anything about scraping its data; if they do, the site may not want you to do it, and if they don't, the site likely doesn't care whether you scrape it or not. Also, before you scrape, you should ask for permission to do so. Finally, when scraping other people's or companies' websites, respect their data by using official APIs whenever possible, don't spam their website with an excessive number of requests, and if you want to monetize the scraped data, always seek legal advice and make sure what you're doing with it is completely legal.

The most significant issue with scraping

crawler

The most difficult and inconvenient aspect of web scraping is the maintenance and stability of the scraper. These are the issues you may have to deal with when building one. Scrapers can be useful for a variety of things, such as extracting and parsing data. Say you wrote a scraper and it works fine until it doesn't, and you encounter some random error; that is exactly the problem: it can work for one day, one month, or even one year before failing. The main issue is that the website you are scraping can constantly change: its structure can change, its system can change, and even its URLs can change. As a result, you have no control over it, and your scraper may fail at any time. When writing scrapers, the logic and workflow are based on the current website you are attempting to scrape and its structure, so if the website decides to change its entire structure, you may have to change the structure and logic of the scraper as well. If you still want to make it work, you may be wondering how to solve this type of problem. The short answer is that you cannot prevent it, because you have no control over the website you are attempting to scrape; you must simply deal with the problem when it arises. This is why you must learn how to develop a scraper quickly and efficiently, as well as how to debug and fix problems. This type of problem can occur on both a small and a large scale, so you must be prepared at all times.

Request Method with the assistance of the Request Library

request-promise

In this section, we'll go over the request-promise library, what you can do with it, and when it's best to use it. So, what exactly can we do with the request library? We already incorporated it into our earlier project. We use it because it allows us to submit requests to the server in the simplest and fastest way possible. Before we begin, let's look at some examples. When you visit a website, a basic GET request is sent to the server first, and the initial content, the HTML response, comes back. With the request library, you can do the same thing, but instead of using the browser, you write the action in NodeJS and it does everything for you.

Let's take another example: when you want to log in to a specific website and enter your username and password, a POST request is sent to the server with the details of your account for confirmation. This can also be done manually in NodeJS by simulating any request the browser makes to any website; all we have to do is provide the right parameters, as in the sketch below. In the case of the IMDB scraper, we used a GET request to obtain the HTML and parse it.
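To make this concrete, here is a minimal, hypothetical sketch of a form-based login with request-promise. The URL and the field names (username, password) are made up and would need to match whatever the real login form actually submits:

const request = require("request-promise");

(async () => {
  // Hypothetical login endpoint and form fields -- inspect the real site's
  // network tab to find the actual URL and parameter names.
  const response = await request.post({
    uri: "https://example.com/login",
    form: {
      username: "myUser",
      password: "myPassword",
    },
    // Keep the session cookies the server sends back for later requests.
    jar: true,
    // Resolve with the full response object so we can inspect the status code.
    resolveWithFullResponse: true,
    // Don't reject on non-2xx responses, which some login flows return.
    simple: false,
  });

  console.log("Status:", response.statusCode);
})();

Depending on the site, the login may instead expect a JSON body (json: true with a body object) or additional hidden fields, which is exactly the kind of detail you find by watching the network tab.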

Benefits and Drawbacks of Request Library

Since you control every parameter that you send to the server, it can be a little overwhelming at times. Let's use the previously described login process as an example. Depending on the website, the login can consist of a single simple POST request to the server with the username and password, followed by a single response with some cookies or a token; in that case, the request method is ideal. Alternatively, the login flow can consist of multiple requests: some websites automatically send several requests from a simple login form for security reasons, or because of how they were originally built. In that case, you do not want to use the request library. It is, of course, feasible, but it is very time-consuming and can be extremely frustrating, and many things can go wrong, such as missing a simple parameter in the request headers and having the server refuse to accept the request. It all depends on the situation, but it is strongly discouraged to use this library if you have a large number of requests to send. Hence, if the website is more complex and automatically sends AJAX requests with different parameters and tokens, the best method would be to use a headless browser, which we will cover in detail in the upcoming sections.

Therefore, you should use the request library only in simpler cases; when the website has loads of security behind it or is dynamically rendered, you should probably use another method, such as a headless browser.

Scraping with a browser automation approach

scraping approach

In this section, we'll take a deep dive into browser automation and how it can be applied to developing a scraper. But first, let's define browser automation. Browser automation, in our case with the help of NodeJS, essentially means controlling a browser using code. Only certain browser engines support this, which means you can't just automate your regular browser; instead, you'll need a browser that allows you to manage it through code, and we'll look at plenty of examples in the upcoming topics.

Benefits and drawbacks of employing browser automation.

pros and cons

Before we get started, let's quickly go over the benefits and drawbacks of using browser automation. For starters, it's much more beginner-friendly, and it's very easy to understand the steps you need to take because they're the same as when you browse the internet; all you have to do is write the specific code and scripts that your automated browser will follow. In most circumstances, implementing the scraper with an automated browser is much cleaner, and you may end up writing less code than you would with the request approach, although this, of course, depends on the page that needs to be scraped and what you need from it. The first disadvantage of the browser automation approach is that you are essentially relying on the API availability of the individual browser you are automating. Some browsers have limited capabilities, some aren't very stable, and some aren't even updated anymore, which is why you should be cautious and thoroughly study a browser before using it in your scraper. So, before you decide which type of browser you want to automate, consult its documentation, which will usually give detailed information.

When is it appropriate to use browser automation for a scraping project?

To begin, you must understand that there is no right or wrong option. Any website can be scraped using requests, and the other way around. It all depends on how long it will take, how much code you will write, and how successful it will be. A browser-automated scraper will use more bandwidth and resources to load the page content than the request method, because the browser loads every CSS file, every JavaScript file, and every image on the website, whereas the request method only fetches the HTML code of the page itself and does not load external contents like files and libraries. So, if bandwidth and a few milliseconds of delay aren't important to you, browser automation is an excellent option. It makes things a lot easier while also saving you a lot of time.

Browser automation libraries

Before you begin, you must first decide which libraries to use. There are two excellent libraries available: Puppeteer and NightmareJS. There are many more, although most of them are closed or abandoned. Puppeteer is built on the Chrome browser and is also known as a headless version of Chrome. It was created specifically for automation, testing, and testing Chrome extensions, among other things, but in our case we will be using it for scraping. It is developed and maintained by the Google Chrome team and is a fully functional and up-to-date headless browser. NightmareJS, on the other hand, is a driver for the Electron browser. It's a lot of fun to learn and even more fun to use, but it's not particularly suitable for complex scrapers. Compared to Puppeteer, it has a lot of limitations; one of its biggest flaws is that it doesn't allow numerous tabs and links to be open at once. As a result, libraries like this may break your scraper or force you to make compromises when you need those capabilities.

So, before you start scraping, let's go over a few things you might need to know. When you're running and testing the scraper, you can turn on the visible browser to see each action as it happens in real time. This helps you understand and debug when you have a problem or when you're building a new scraper. A competent headless browser will provide practically all the APIs you need, allowing you to automate almost everything a user can do, using nothing but code.
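As a quick taste of what that looks like, here is a minimal Puppeteer sketch, reusing our IMDB URL, that launches a visible browser, opens the page, and reads its title. Treat it as an illustration rather than the final scraper we will build:

const puppeteer = require("puppeteer");

(async () => {
  // headless: false opens a visible browser window so you can watch each step.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the movie page and wait for the initial load to finish.
  await page.goto("https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1");

  // Read the document title straight from the live page.
  const title = await page.title();
  console.log("Page title:", title);

  await browser.close();
})();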

IMDB scraper using a request method

In this segment of the course, we'll delve a little deeper into the IMDB scraper that we constructed in the first session. We'll make it a little more complex as we go, and we'll learn new things along the way. With the request method, we'll learn how to spoof or fake the request headers. So the main question is: "why do we need to spoof them?" It's because we want the scraper to appear to be a browser making the request. Request headers are extra parameters that the browser sends to the server automatically. They usually contain cookie information, such as whether you're logged in or not, and other types of browser information. So, let's get started with how to check these. First, open the browser and right-click to open the developer tools. Then go to the Network tab to see all of the requests that are being made.

Network Tab

We may see a number of requests and their types here. There's the document, as well as images, graphics, style sheets, javascript, and a whole lot more.

Network Tab

Let's take a look at the initial request that's being made, as you can see here. We can see the general information and the response headers, but we need to look at the request headers, which are a little farther down. Now we need to go to the request-promise GitHub page and look at the documentation to see how we can include those in our request.

request-promise

Here => https://github.com/request/request-promise

What we need to look for is a way to add those extra parameters to the request, and if we look closely enough, we'll see the headers option.

header

We'll copy the example and paste it into our VS Code editor. Right now, we only pass the URL as a parameter, and we need to change that into an actual options object: delete the bare URL, build an object that still contains the URL under the uri key, and then paste in the headers option.

const request = require("request-promise");
const cheerio = require("cheerio");
const URL = "https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1";
(async () => {
  const response = await request({
    uri: URL,
    headers: {
      "User-Agent": "Request-Promise",
    },
  });
  let $ = cheerio.load(response);
  // console.log(response);
  let title = $("section.ipc-page-section > div > div > h1").text();
  let rating = $(
    "div.ipc-button__text > div > div:nth-child(2) > div > span"
  ).text().slice(0, 6);
  console.log(`"${title}" movie has an IMDB rating of ${rating}`);
})();

Right now, we have precisely what we had before, with the addition of a User-Agent header set to the value "Request-Promise". Of course, "Request-Promise" isn't a real user agent, but it's something we can simply adjust based on the documentation provided.

Let's go ahead and obtain the real request headers now. Go back to the dev tools and look at the first request, making sure it's the one responsible for the IMDB page and not an image or a JavaScript file. Then, just like before, look at the request headers and copy everything for now.

Network Tab

Copy everything and bring it back to the editor. What we have now are the request headers that the browser sends when we open the IMDB page. What we need to do is convert them all into a JavaScript object and pass it instead of the previous user-agent, formatting and indenting everything properly. Now every header that the browser would send is being sent by us. Finally, there is the cookie header; we don't need it in this situation, so let's erase it, and we're done.

// index.js
const request = require("request-promise");
const cheerio = require("cheerio");
const URL = "https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1";
(async () => {
  const response = await request({
    uri: URL,
    headers: {
      "accept":
       "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "en-IN,en-US;q=0.9,en;q=0.8",
      "cache-control": "no-cache",
      "pragma": "no-cache",
      "sec-ch-ua":
        '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
      "sec-ch-ua-mobile": "?1",
      "sec-ch-ua-platform": "Android",
      "sec-fetch-dest": "document",
      "sec-fetch-mode": "navigate",
      "sec-fetch-site": "same-origin",
      "sec-fetch-user": "?1",
      "sec-gpc": "1",
      "upgrade-insecure-requests": "1",
      "user-agent":
        "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Mobile Safari/537.36",
    },
  });
  let $ = cheerio.load(response);
  // console.log(response);
  let title = $("section.ipc-page-section > div > div > h1").text();
  let rating = $(
    "div.ipc-button__text > div > div:nth-child(2) > div > span"
  ).text().slice(0, 6);

  console.log(`"${title}" movie has an IMDB rating of ${rating}`);
})();

Now we have the user agent, which is specific to the computer you're using to code this. You can simply modify the user agent by looking up fake ones on the internet and pasting one right here; you don't need to use your actual browser info. Finally, let's put this to the test to see if it still works. Go to the debug tab and select debug play.

Debugger

Now, let's hope for the best and head to the debug console, where we can see that it does not work: it does not print the movie's title or rating. So, we'll use what we learned before and set a breakpoint right at the console.log line to see what's going on. Let's run it again; it stops at the console.log line, and we can inspect the variables. The rating is an empty string and the title is also an empty string, which means it didn't find the selectors we were looking for, because the response changed, as you can see, and it is completely nonsensical.

response

So, when we made the request with only the URL, all of the other options used their defaults, but now that we've added our own headers, we've overridden those defaults. We get this garbled response because we forgot to set the gzip option that goes with the accept-encoding header we copied into the request function.
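As a rough sketch of one possible fix: setting gzip: true in the options (and limiting the accept-encoding header to encodings the request library can actually decompress) should bring back a readable HTML response. The other copied browser headers are elided here for brevity:

const request = require("request-promise");
const cheerio = require("cheerio");
const URL = "https://www.imdb.com/title/tt0068646/?ref_=fn_al_tt_1";

(async () => {
  const response = await request({
    uri: URL,
    // Ask the request library to transparently decompress the response body.
    gzip: true,
    headers: {
      // ...keep the other copied browser headers here...
      // Advertise only encodings the request library can decompress
      // (it does not handle brotli, i.e. "br").
      "accept-encoding": "gzip, deflate",
      "user-agent":
        "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Mobile Safari/537.36",
    },
  });

  let $ = cheerio.load(response);
  console.log($("section.ipc-page-section > div > div > h1").text());
})();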
