TIL Web scraping with node

POSTED ON: Aug 14, 2022

TAGS: node webscraping

Web scraping is a fun exercise for any frontend developer. Some great use cases are:

Crawling a social media platform and getting a list of posts.
Visiting a bunch of store product pages and grabbing their prices.
Quickly getting the top news from a collection of news sites.

And you can do all of that with just some basic Node scripts!

How it works

You fetch the data (webpage) from a website.
You traverse the DOM to grab what you're looking for and return it.

Pretty simple, right?
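To make the two steps concrete, here is a minimal sketch. The names `extractTitle`, `scrapeTitle`, and the `fetchImpl` parameter are made up for this illustration, and the regex is only a dependency-free stand-in for real DOM traversal.

```javascript
// Step 2: pull one piece of data out of the HTML string.
// A regex keeps this sketch dependency-free; a DOM parser is sturdier.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Steps 1 and 2 together. `fetchImpl` is injectable so the function can be
// tried without network access; by default it is Node's global fetch (Node 18+).
async function scrapeTitle(url, fetchImpl = globalThis.fetch) {
  const resp = await fetchImpl(url); // step 1: fetch the page
  const html = await resp.text();
  return extractTitle(html);         // step 2: grab what you want
}

// Usage (needs network access):
// scrapeTitle('https://example.com').then(console.log);
```

Keeping the fetch step and the extraction step in separate functions means you can test the extraction against a saved HTML string without hitting the site every time.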

Fetching the data

Fetching is straightforward, just like calling an API.

You can do it in a dozen different ways.
For example:

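// Uses the global fetch that ships with Node 18 and later.
async function fetchEndpoint() {
	const resp = await fetch('https://www.reddit.com/r/programming.json');

	// Parse the response body as JSON and print it.
	console.log(await resp.json());
}

// Log network or parsing errors instead of leaving the rejection unhandled.
fetchEndpoint().catch(console.error);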
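For scraping you usually want the raw HTML rather than JSON, and you want to know when the request failed. A hedged sketch of that, where the function name `fetchHtml`, the `fetchImpl` parameter, and the User-Agent string are all assumptions for illustration:

```javascript
// Fetch a page as text and fail loudly on HTTP errors.
// `fetchImpl` defaults to Node's global fetch (Node 18+); passing your own
// makes the function easy to try without a network connection.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const resp = await fetchImpl(url, {
    // Hypothetical User-Agent; some sites reject requests without one.
    headers: { 'User-Agent': 'til-scraper-example/0.1' },
  });

  // fetch only rejects on network failures, so check the status yourself.
  if (!resp.ok) {
    throw new Error(`Request to ${url} failed with status ${resp.status}`);
  }

  return resp.text();
}

// Usage (needs network access):
// fetchHtml('https://example.com').then((html) => console.log(html.length));
```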

Traversing the DOM

The next step is extracting the part you want.

With regular expressions (which gets super hard on real-world HTML, btw):

const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>Username: (.+)<\/label>/)

console.log(result[1])
// John Doe
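Real pages rarely have a single, attribute-free tag, which is part of why the regex route gets hard. A small sketch of collecting several matches with `matchAll`; the sample markup and variable names are invented for the example:

```javascript
const labelHtml = `
  <label>Username: John Doe</label>
  <label class="wide">Username: Jane Roe</label>
`;

// [^>]* tolerates attributes on the opening tag; (.+?) is non-greedy so each
// match stops at the nearest closing tag. The g flag is required by matchAll.
const labelPattern = /<label[^>]*>Username: (.+?)<\/label>/g;

// matchAll returns an iterator of match arrays; index 1 is the capture group.
const usernames = [...labelHtml.matchAll(labelPattern)].map((m) => m[1]);

console.log(usernames);
// [ 'John Doe', 'Jane Roe' ]
```

Even this version breaks on nested tags or a label split across lines, so treat it as a quick hack rather than a parser.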

But preferably, you'd use something like Cheerio (jQuery-like syntax) or, for something a bit more advanced, jsdom.

jsdom's implementation

// Note: got v12+ is ESM-only, so require('got') needs got v11 or earlier.
const got = require('got');
const { JSDOM } = require('jsdom');

const vgmUrl = 'https://www.vgmusic.com/music/console/nintendo/nes';

got(vgmUrl).then(response => {
  // Parse the HTML string into a DOM you can query like in the browser.
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelector('title').textContent);
}).catch(err => {
  console.error(err);
});

via "Web Scraping with JavaScript and NodeJS" and "Web Scraping and Parsing HTML in Node.js with jsdom"

