Today I Learned - Rocky Kev

TIL filtering DOM elements when you web-scrape

I love web-scraping.

One core technique you have to master is stripping HTML from a string.

The two solutions I like:

Solution 1: Regex

This method is kind of like a well-tuned blade. It chops out anything that matches an expression, in this case every tag between < and >.


const someHTMLString = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";

// Replace every <...> tag with an empty string
const myString = someHTMLString.replace(/<[^>]+>/g, '');
console.log(myString); // Will print: A heading.Here we have some text (no space where the tags were)
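
Replacing tags with an empty string joins the text of neighbouring elements with nothing in between ("A heading.Here we have some text"). If you want the pieces separated, one possible tweak is to swap each tag for a space and then tidy up the whitespace. This is only a sketch: stripTags and sample are made-up names for illustration, and it will also split text around inline tags (for example b<i>o</i>ld becomes "b o ld"), so it may not suit every page.

```javascript
// Hypothetical helper (not from the original post): strip tags but keep word boundaries.
const stripTags = (html) =>
  html
    .replace(/<[^>]+>/g, ' ') // swap each tag for a space so adjacent text does not fuse
    .replace(/\s+/g, ' ')     // collapse runs of whitespace into a single space
    .trim();                  // drop the leading and trailing space

const sample = '<div><h1>A heading.</h1><p>Here we have some text</p></div>';
console.log(stripTags(sample)); // A heading. Here we have some text
```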

Solution 2: JS function

This method is a bit 'cleaner': it puts the HTML string into a temporary div, then reads the text back using native DOM properties. It needs a browser (or anything else that provides document).

I prefer this method.

const stripHtml = (html) => {
  // Create a temporary div element (it is never attached to the page)
  const tempDivElement = document.createElement('div');
  // Let the browser parse the HTML string
  tempDivElement.innerHTML = html;
  // Read back only the text (innerText is a fallback for very old browsers)
  return tempDivElement.textContent || tempDivElement.innerText || '';
};

const yourHTMLString = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";
const myString = stripHtml(yourHTMLString);
console.log(myString); // Will print: A heading.Here we have some text
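
The DOM approach relies on document, which does not exist when the scraper runs in plain Node.js. A rough way to cover both cases is to use the browser's DOMParser when it is available and fall back to the regex from Solution 1 otherwise. Treat this as a sketch under that assumption: stripHtmlAnywhere and scraped are hypothetical names, not something from the linked article.

```javascript
// Hypothetical helper (not from the original post).
const stripHtmlAnywhere = (html) => {
  if (typeof DOMParser !== 'undefined') {
    // Browser path: parse the string into a separate document and read its text
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
  }
  // No DOM available (e.g. plain Node.js): fall back to the regex from Solution 1
  return html.replace(/<[^>]+>/g, '');
};

const scraped = '<div><h1>A heading.</h1><p>Here we have some text</p></div>';
console.log(stripHtmlAnywhere(scraped)); // A heading.Here we have some text
```

Both branches return the same string for this simple input. They can differ on messier markup (HTML entities, for instance, are only decoded on the DOM path), so it is worth checking against the pages you actually scrape.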

Via 3 ways to split and remove HTML from a string

