Why should I not use regex to parse HTML?

HTML is not a regular language: its elements nest to arbitrary depth, and a regular expression cannot track that nesting. A tag-matching regex also mishandles attributes containing >, CDATA sections, comments, and malformed HTML. This makes regex-based HTML stripping unreliable and, when used for sanitization, a potential security risk (XSS).

What Regex Removes HTML Tags?

The regex /<[^>]*>/g matches and removes HTML tags. However, regex cannot properly parse HTML because HTML is not a regular language. For production use, use DOMParser in JavaScript or BeautifulSoup in Python.

In short: /<[^>]*>/g strips tags by matching everything from a < up to the next > and removing it. It works for simple, well-formed markup but breaks on attributes containing >, comments, script content, and malformed HTML, so use a real parser (DOMParser) for production.
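To see what the pattern actually matches, a minimal Python sketch can list the matches before removing them. The names TAG_RE, matches, and text are illustrative only.

```python
import re

# The tag pattern: a "<", then any run of characters that are not ">", then ">".
TAG_RE = re.compile(r'<[^>]*>')

html = '<p>Hello <b>world</b></p>'

# findall returns every non-overlapping match, left to right.
matches = TAG_RE.findall(html)
print(matches)  # ['<p>', '<b>', '</b>', '</p>']

# Replacing each match with an empty string leaves only the text.
text = TAG_RE.sub('', html)
print(text)  # Hello world
```

Each tag is matched on its own, which is why simple markup like this strips cleanly. The pattern has no notion of quoting or context, so any > inside a tag ends the match early.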

Quick Regex Approach

JavaScript

const html = '<p>Hello <b>world</b></p>';
const text = html.replace(/<[^>]*>/g, '');
// "Hello world"

Python

import re
html = '<p>Hello <b>world</b></p>'
text = re.sub(r'<[^>]*>', '', html)
# 'Hello world'

Better: Use a Parser

JavaScript (DOMParser)

function stripHTML(html) {
  const doc = new DOMParser()
    .parseFromString(html, 'text/html');
  return doc.body.textContent || '';
}
stripHTML('<p>Hello <b>world</b></p>');
// "Hello world"

Python (BeautifulSoup)

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
soup.get_text()
# 'Hello'
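
Where installing BeautifulSoup is not an option, Python's standard-library html.parser module can do a similar job. The sketch below is one possible approach, not how BeautifulSoup works internally. TextExtractor and strip_html are hypothetical names, and dropping the contents of script and style elements is an assumption about what should count as visible text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return ''.join(self._parts)


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_html('<p>Hello <b>world</b></p>'))     # Hello world
print(strip_html('<div data-x="a>b">text</div>'))  # text
```

Because the parser tokenizes quoted attribute values and comments itself, a > inside them does not end the tag early. This extracts text only and should not be treated as a sanitizer.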

Why Regex Fails on HTML

Attributes with >: <div data-x="a>b">
Comments: <!-- a > b -->
Script content: <script>if (a>b)</script>
Security: malformed HTML can bypass regex-based sanitization (XSS)
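
The failure modes listed above can be reproduced with a short Python sketch that runs the tag-stripping regex over one input per case. The sample strings are illustrative assumptions (the comment and unclosed-tag inputs in particular are made up for this demonstration), and strip_tags_regex is a hypothetical helper name.

```python
import re

TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_regex(html: str) -> str:
    """Naive tag stripping: delete everything from "<" to the next ">"."""
    return TAG_RE.sub('', html)


# One illustrative input per failure mode.
samples = {
    'attribute': '<div data-x="a>b">text</div>',
    'comment': '<!-- a > b -->visible',
    'script': '<script>if (a>b) alert(1)</script>visible',
    'unclosed': '<img src=x onerror=alert(1) ',
}

results = {name: strip_tags_regex(html) for name, html in samples.items()}
for name, out in results.items():
    print(f'{name}: {out!r}')

# attribute: 'b">text'                       (half of the attribute leaks out)
# comment: ' b -->visible'                   (the comment tail leaks out)
# script: 'if (a>b) alert(1)visible'         (script source is kept as text)
# unclosed: '<img src=x onerror=alert(1) '   (no ">" so nothing is removed)
```

The last case is the security-relevant one: the regex only removes a tag once it sees a closing >, so an unterminated tag passes through unchanged. If that output is later embedded in a page, a > from the surrounding markup may complete the tag.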

Built by Michael Lip. 100% client-side — no data leaves your browser.