Today I Learned - Rocky Kev

TIL Web scraping with Node

Web scraping is a fun exercise for any frontend developer. Some great use cases are:

  1. Crawling a social media platform to get a list of posts.
  2. Visiting a bunch of store product pages and grabbing their prices.
  3. Quickly getting the top news from a collection of news sites.

And you can do all of that with just some basic Node scripts!

How it works

  1. You fetch the data (webpage) from a website.
  2. You traverse the DOM to grab what you're looking for and return it.

Pretty simple, right?
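The two steps above can be sketched end to end like this. It is only a minimal sketch: `extractTitle`, `scrapeTitle`, and the injectable `fetchImpl` parameter are hypothetical names made up for illustration, and the regex is a stand-in for a proper DOM parser.

```javascript
// Step 2: pull the <title> text out of an HTML string.
// A regex is good enough for a quick sketch; a real parser is more robust.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Steps 1 and 2 together: fetch the page, then extract what we want.
// `fetchImpl` is a hypothetical parameter so the function can be tried offline;
// by default it uses the global fetch (Node 18+).
async function scrapeTitle(url, fetchImpl = fetch) {
  const resp = await fetchImpl(url);
  if (!resp.ok) {
    throw new Error(`Request failed with status ${resp.status}`);
  }
  const html = await resp.text();
  return extractTitle(html);
}

// Usage (makes a real network request, so it is left commented out):
// scrapeTitle('https://example.com').then(console.log);
```

Passing the fetch function in as a parameter is a design choice for this sketch, not a requirement. It makes it easy to swap in a fake response while you work on the extraction step.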

Fetching the data

Fetching is straightforward, just like calling an API.

You can do it in a dozen different ways. For example:

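// Requires Node 18+, where fetch is available globally.
async function fetch_endpoint() {
  // Reddit serves a JSON version of a listing when you append .json to the URL.
  const resp = await fetch('https://www.reddit.com/r/programming.json');

  console.log(await resp.json());
}

fetch_endpoint();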
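Once you have the JSON, getting a list of posts is mostly a matter of picking out the fields you care about. Here is a small sketch, assuming the response has the `data.children[].data` shape that Reddit listings usually have. `listPosts` and the sample data are made up for illustration.

```javascript
// Turn a Reddit-style listing into a flat array of { title, url } objects.
// Assumption: posts live under listing.data.children[].data.
function listPosts(listing) {
  const children = listing?.data?.children ?? [];
  return children.map(({ data }) => ({ title: data.title, url: data.url }));
}

// Hypothetical sample that mimics the assumed response shape.
const sampleListing = {
  data: {
    children: [
      { data: { title: 'First post', url: 'https://example.com/1' } },
      { data: { title: 'Second post', url: 'https://example.com/2' } },
    ],
  },
};

console.log(listPosts(sampleListing));
```

The optional chaining means an unexpected response (say, an error object) gives you an empty array instead of a crash.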

Traversing the DOM

The next step is extracting the bit you want.

With regular expressions (super hard, by the way):

const htmlString = '<label>Username: John Doe</label>';
// The capture group (.+) grabs everything between "Username: " and the closing tag.
const result = htmlString.match(/<label>Username: (.+)<\/label>/);

console.log(result[1]);
// John Doe
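One catch with the snippet above: `match` returns `null` when nothing matches, so `result[1]` would throw. Here is a slightly more defensive sketch that uses `matchAll` to collect every label on the page. `extractLabels` and the sample markup are hypothetical.

```javascript
// Collect every "<label>Key: value</label>" pair from an HTML string.
// matchAll yields nothing when there is no match, so there is no null to guard against.
function extractLabels(html) {
  const pattern = /<label>([^:<]+):\s*([^<]*)<\/label>/g;
  const labels = {};
  for (const [, key, value] of html.matchAll(pattern)) {
    labels[key.trim()] = value.trim();
  }
  return labels;
}

const labelHtml =
  '<label>Username: John Doe</label><label>Email: john@example.com</label>';

console.log(extractLabels(labelHtml));
// { Username: 'John Doe', Email: 'john@example.com' }
```

This still breaks as soon as the markup gains attributes or nested tags, which is a big part of why regular expressions are the hard way to do this.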

Preferably, though, you'd use a parser such as Cheerio (jQuery-like syntax) or, for something a bit more advanced, jsdom.

jsdom's implementation


// Note: require('got') works with got v11 and earlier; v12+ is ESM-only.
const got = require('got');
const { JSDOM } = require('jsdom');

const vgmUrl = 'https://www.vgmusic.com/music/console/nintendo/nes';

got(vgmUrl).then(response => {
  // Parse the HTML into a DOM, then query it as you would in the browser.
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelector('title').textContent);
}).catch(err => {
  console.log(err);
});

via "Web Scraping with JavaScript and NodeJS" and "Web Scraping and Parsing HTML in Node.js with jsdom"

