TIL filtering DOM elements when you web-scrape

POSTED ON: Aug 22, 2022

I love web-scraping.

One core technique you have to master is stripping HTML from a string.

The two solutions I like:

Solution 1: Regex

This method is kind of like a well-tuned blade. The pattern /<[^>]+>/g matches every tag (a <, then everything up to the next >), and replace() chops those matches out of the string.


const someHTMLString = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";

const myString = someHTMLString.replace(/<[^>]+>/g, '');
console.log(myString); // Will print: A heading.Here we have some text (no space is added between elements)
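One thing to watch with the regex: removing tags outright leaves nothing between neighbouring elements, so the heading and the paragraph text end up glued together. Here is a minimal sketch of a variant that swaps each tag for a space and then tidies the whitespace. The helper name stripTagsWithSpaces is my own, not something from the linked article.

```javascript
// Hypothetical helper: replace each tag with a space, then tidy whitespace.
const stripTagsWithSpaces = (html) =>
  html
    .replace(/<[^>]+>/g, ' ') // every tag becomes a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace into one space
    .trim();                  // drop leading and trailing whitespace

const sample = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";
console.log(stripTagsWithSpaces(sample)); // A heading. Here we have some text
```

The trade-off is that inline tags also turn into spaces, so "<b>bo</b>ld" comes out as "bo ld". Entities such as &amp; are left as they are in both regex versions.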

Solution 2: JS function

This method is a bit 'cleaner': it puts the HTML into a temporary div, then reads the text back out using native DOM methods. It needs an environment where document exists, such as a browser.

I prefer this method.

const stripHtml = (html) => {
  // We create a new div element
  const tempDivElement = document.createElement('div');
  // And set the HTML
  tempDivElement.innerHTML = html;
  // And then get the text property of the element (cross-browser support)
  return tempDivElement.textContent || tempDivElement.innerText || '';
}


const strippedString = stripHtml(someHTMLString);
console.log(strippedString); // Will print: A heading.Here we have some text
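Since a lot of scraping runs outside the browser, where document is not defined, here is a rough sketch that combines both solutions: it parses with DOMParser when that global exists and otherwise falls back to the regex from Solution 1. The name stripHtmlAnywhere is hypothetical, and using DOMParser instead of a temporary div is my own substitution, not the method from the linked article.

```javascript
// Hypothetical helper, not from the original article.
const stripHtmlAnywhere = (html) => {
  if (typeof DOMParser !== 'undefined') {
    // Browser path: parse into a separate document and read its text.
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
  }
  // Fallback where no DOMParser global exists (e.g. plain Node.js):
  // the same regex as Solution 1.
  return html.replace(/<[^>]+>/g, '');
};

const sampleHtml = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";
console.log(stripHtmlAnywhere(sampleHtml)); // A heading.Here we have some text
```

The two paths are not fully equivalent: the DOM path decodes entities such as &amp;, while the regex fallback leaves them untouched.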

Via 3 ways to split and remove HTML from a string
