TIL Web scraping with node
TAGS: node webscraping
Web scraping is a fun exercise for any frontend developer. Some great use cases are:
- Crawling a social media platform to get a list of posts.
- Visiting a bunch of store product pages and grabbing their prices.
- Quickly getting the top news from a collection of news sites.
And you can do all of that with just a few basic Node scripts!
How it works #
- You fetch the data (webpage) from a website.
- You traverse the DOM to grab what you're looking for and return it.
Pretty simple, right?
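The two steps above can be sketched as one small function. This is a minimal sketch, not a definitive implementation: `extractTitle`, `scrape`, and the injectable `fetchImpl` parameter are hypothetical names made up for illustration, and it assumes Node 18 or later, where `fetch` is built in.

```javascript
// Step 2: pull the <title> text out of an HTML string.
// A regex is enough for this toy case; real pages are better served by a DOM parser.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Steps 1 and 2 together: fetch the page, then hand the HTML to an extractor.
// fetchImpl is a hypothetical parameter so the function can be tried without a network.
async function scrape(url, extract = extractTitle, fetchImpl = fetch) {
  const resp = await fetchImpl(url);
  if (!resp.ok) {
    throw new Error(`Request failed with status ${resp.status}`);
  }
  return extract(await resp.text());
}
```

Calling it would look something like `scrape('https://example.com').then(console.log)`. Passing the extractor in as an argument keeps the fetching part reusable when you change what you are looking for.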
Fetching the data #
Fetching is straightforward, just like calling an API.
You can do it in a dozen different ways.
For example, with the fetch that is built into Node 18 and later:
// Fetch the subreddit listing as JSON and print it.
async function fetch_endpoint() {
  const resp = await fetch('https://www.reddit.com/r/programming.json');
  console.log(await resp.json());
}
fetch_endpoint();
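The JSON that comes back still needs trimming down to the part you care about. As a rough sketch, assuming the response has the usual Reddit listing shape (`data.children[].data.title`), a hypothetical helper called `extractTitles` could pull out just the post titles. The sample object below is made up for illustration and is not a real API response.

```javascript
// Hypothetical helper: return the post titles from a Reddit-style listing.
// Assumes the shape { data: { children: [ { data: { title } } ] } }.
function extractTitles(listing) {
  const children = listing?.data?.children ?? [];
  return children
    .map((child) => child?.data?.title)
    .filter((title) => typeof title === 'string');
}

// Made-up sample data, not a real API response.
const sampleListing = {
  data: {
    children: [
      { data: { title: 'First post' } },
      { data: { title: 'Second post' } },
    ],
  },
};

console.log(extractTitles(sampleListing));
// [ 'First post', 'Second post' ]
```

With a real response you would pass in the parsed body instead, for example `extractTitles(await resp.json())`.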
Traversing the DOM #
The next step is extracting the part you want.
With regular expressions (super hard, by the way):
const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>Username: (.+)<\/label>/)
console.log(result[1])
// John Doe
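Part of what makes the regex approach hard is that `match` returns `null` as soon as the markup differs slightly, and a page usually has more than one element to read. Here is a hedged sketch using `String.prototype.matchAll`; `extractLabels` is a hypothetical helper and the HTML string is made up.

```javascript
// Hypothetical helper: collect every "<label>Key: value</label>" pair into an object.
function extractLabels(html) {
  // [^<]* keeps each match inside a single label; matchAll requires the g flag.
  const pattern = /<label>\s*([^<:]+):\s*([^<]*)<\/label>/g;
  const found = {};
  for (const [, key, value] of html.matchAll(pattern)) {
    found[key.trim()] = value.trim();
  }
  return found;
}

const html = '<label>Username: John Doe</label><label>Email: john@example.com</label>';
console.log(extractLabels(html));
// { Username: 'John Doe', Email: 'john@example.com' }
```

If nothing matches, the function returns an empty object instead of throwing, which is easier to handle than indexing into a `null` result. It still breaks on nested tags or attributes, which is why a DOM parser is usually the better choice.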
Preferably, though, you'd use something like Cheerio (jQuery-like syntax) or, for something a bit more advanced, jsdom.
An example with jsdom:
// Note: require('got') works with got v11 and earlier; got v12+ is ESM-only.
const got = require('got');
const { JSDOM } = require('jsdom');

const vgmUrl = 'https://www.vgmusic.com/music/console/nintendo/nes';

got(vgmUrl).then(response => {
  // Parse the HTML into a DOM and read the page title.
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelector('title').textContent);
}).catch(err => {
  console.error(err);
});
via "Web Scraping with JavaScript and NodeJS" and "Web Scraping and Parsing HTML in Node.js with jsdom"