Stripping HTML is the process of removing every HTML tag from a piece of HTML code and keeping only the text content inside the elements.
For example, stripping the HTML tags from the code below produces the text in the block that follows it. Here the h1 and h2 tags are removed completely, leaving only the raw text.
<h1>HTML Stripper</h1>
<h2>You can easily strip all the HTML tags using HTML Stripper</h2>
HTML Stripper
You can easily strip all the HTML tags using HTML Stripper
You can strip HTML tags programmatically with a regular expression, provided the text in the input HTML is safely escaped, i.e. there are no raw < or > characters inside any HTML element.
Here is an example of stripping HTML tags in JavaScript to get only the text content, using the built-in replace method. A similar regular expression can be used in other programming languages as well.
const html = '<h1>HTML Stripper</h1>';
// Replace everything that matches an HTML tag with an empty string, i.e. strip it.
const text = html.replace(/<[^>]*>/g, '');
console.log(text); // HTML Stripper
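To make the escaping assumption concrete, here is a minimal sketch that wraps the same regular expression in a helper and runs it on escaped and unescaped input. The function name stripHtml and the sample strings are made up for illustration; they are not part of any library.

```javascript
// Hypothetical helper: removes anything that looks like an HTML tag.
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Safely escaped input: the "<" in the text is written as the entity &lt;.
const escapedInput = '<p>1 &lt; 2</p>';

// Unescaped input: a raw ">" sits inside an attribute value.
const unescapedInput = '<p title="a > b">text</p>';

console.log(stripHtml(escapedInput));   // 1 &lt; 2
console.log(stripHtml(unescapedInput)); //  b">text
```

In the second call the pattern stops at the first > it finds, so part of the tag leaks into the output. When the input cannot be trusted to be escaped, a real HTML parser is likely the safer choice.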
Sometimes the stripped text contains HTML entities, which represent HTML special characters, also known as reserved characters. An HTML entity begins with an ampersand & and ends with a semicolon ;. For example, &copy; is the HTML entity for the copyright symbol ©.
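As a rough illustration of that shape, the sketch below lists the entities found in a piece of stripped text. The names ENTITY_PATTERN and findEntities are hypothetical. The pattern also accepts the numeric forms (&#169; and &#xA9;), which follow the same ampersand-to-semicolon structure, and it only checks the shape, not whether a named entity actually exists.

```javascript
// Matches named (&copy;), decimal (&#169;), and hexadecimal (&#xA9;) forms.
const ENTITY_PATTERN = /&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;

// Hypothetical helper: returns every entity-shaped substring, or an empty array.
function findEntities(text) {
  return text.match(ENTITY_PATTERN) || [];
}

console.log(findEntities('&copy; 2024 Tom &amp; Jerry')); // [ '&copy;', '&amp;' ]
console.log(findEntities('No entities here'));            // []
```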
To get the original text, you have to decode the HTML entities back to their corresponding characters. Fortunately, most programming languages have a library for this purpose. In JavaScript, you can use the he (short for HTML Entities) library to encode or decode HTML entities like so.
const he = require('he');
const text = 'The Euro (&euro;) is the currency of the EU countries.';
// Decode the HTML entities in the text using the decode method from the he library.
const decodedText = he.decode(text);
console.log(decodedText); // The Euro (€) is the currency of the EU countries.
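The two steps can also be chained. The sketch below strips the tags first and decodes the entities afterwards, using plain JavaScript with no dependencies. Everything in it (stripHtml, decodeBasicEntities, htmlToText, and the NAMED_ENTITIES table) is a hypothetical name chosen for illustration, and the entity table is deliberately tiny; a library such as he covers the full set of named entities.

```javascript
// Deliberately incomplete table of named entities (assumption: only these are needed).
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", copy: '©', euro: '€' };

// Removes anything that looks like an HTML tag.
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Decodes numeric entities and the few named entities listed above.
function decodeBasicEntities(text) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Strip first, then decode.
function htmlToText(html) {
  return decodeBasicEntities(stripHtml(html));
}

console.log(htmlToText('<p>1 &lt; 2 &amp; &#8364;5 &hellip;</p>')); // 1 < 2 & €5 &hellip;
```

The order matters: decoding first would turn &lt; and &gt; into raw < and > characters, which the tag pattern could then mistake for markup and remove.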