TIL filtering DOM elements when you web-scrape
TAGS: web-scraping javascript
I love web-scraping.
One core technique you have to master is stripping HTML from a string.
The two solutions I like:
Solution 1: Regex
This method is kind of like a well-tuned blade. It chops out every piece that matches a regular expression, here anything between a < and the next >.
const someHTMLString = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";
const myString = someHTMLString.replace(/<[^>]+>/g, '');
console.log(myString); // Prints: A heading.Here we have some text (no space, since the tags are simply removed)
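If you want the text from neighbouring elements to stay separated, one option is to swap each tag for a space and then tidy up the whitespace. This is only a minimal sketch: the helper name stripTagsWithSpacing and the sampleHtml variable are made up for illustration and are not from the original source.

```javascript
// Hypothetical helper (not from the original post): strips tags with a regex,
// replacing each tag with a space so text from adjacent elements stays apart.
const stripTagsWithSpacing = (html) =>
  html
    .replace(/<[^>]+>/g, ' ') // swap every tag for a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace into one space
    .trim();                  // drop leading and trailing whitespace

const sampleHtml = "<div><h1>A heading.</h1><p>Here we have some text</p></div>";
console.log(stripTagsWithSpacing(sampleHtml)); // A heading. Here we have some text
```

The trade-off is that inline tags get a space too, so something like "<b>bo</b>ld" would come out as "bo ld". For scraped headings and paragraphs that is usually fine, but it is worth checking against your own input.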
Solution 2: JS function
This method is a bit 'cleaner': it puts the HTML into a temporary div, then reads the text back out using native DOM methods. It needs a document object, so it works in the browser but not in plain Node.
I prefer this method.
const stripHtml = (html) => {
  // We create a new div element
  const tempDivElement = document.createElement('div');
  // And set the HTML
  tempDivElement.innerHTML = html;
  // And then get the text property of the element (cross-browser support)
  return tempDivElement.textContent || tempDivElement.innerText || '';
};
const myString = stripHtml(yourHTMLString);
console.log(myString); // With the HTML string from Solution 1, prints: A heading.Here we have some text
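Scraping scripts often run outside the browser, where there is no document to create a div with. A rough sketch of one way to handle that is to use the DOM when it exists and fall back to the regex otherwise. The name stripHtmlAnywhere is hypothetical and not part of the original source.

```javascript
// Hypothetical helper (not from the original post): uses the DOM approach
// when a document is available, and the regex approach when it is not.
const stripHtmlAnywhere = (html) => {
  if (typeof document !== 'undefined') {
    // Browser (or any environment that provides a DOM)
    const tempDivElement = document.createElement('div');
    tempDivElement.innerHTML = html;
    return tempDivElement.textContent || '';
  }
  // No DOM available (e.g. plain Node): strip tags with the regex instead
  return html.replace(/<[^>]+>/g, '');
};

console.log(stripHtmlAnywhere("<p>Hi <b>there</b></p>")); // Hi there
```

The two paths are not perfectly equivalent: the DOM path decodes entities such as &amp;amp; into &amp;, while the regex path leaves them as they are. If that matters for your data, pick one path and stick with it.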
Via 3 ways to split and remove HTML from a string