05-21-2006, 11:53 AM   #1
NirTivAal
Using regular expressions to clean HTML in user input

Essentially, I'm trying to take input from a form and clean it up so that any < and > that aren't meant to be part of a tag are changed to &lt; and &gt;, so there's no chance of a browser misinterpreting them as tags.

Code:
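// Escape &, < and >, but leave quotes alone (ENT_NOQUOTES) so attribute values can still match below.
$input = htmlspecialchars($input, ENT_NOQUOTES);
// Convert anything that looks like an opening or closing tag back; $1 is the captured tag body.
$input = preg_replace('@&lt;([a-z0-9]+(\s[a-z]+(="[^"]*")?)*\s*/?|/[a-z0-9]+)&gt;@', '<$1>', $input);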
This is as far as I've gotten: I first convert every < and > to the entity notation, then look for sequences that look like tags and convert them back. The preg_replace matches either an opening tag (the escaped <, the tag name, optional attributes, an optional closing /, then the escaped >) or a closing tag (the escaped <, then /, the tag name, then the escaped >).
It works pretty well in most cases, except when the user types something like <sfajklhsfsjk> or </rhfkj>, which also matches the pattern and gets converted back into a tag.
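
To make the behaviour concrete, here is a small self-contained sketch of the two steps run against a few sample strings. The function name clean_input() and the sample inputs are made up for illustration, and the pattern is a lightly adjusted version of the one above (ENT_NOQUOTES so quoted attributes survive, and $1 as the back-reference), so treat it as an approximation rather than the exact code in use.

```php
<?php
// Hypothetical helper wrapping the two steps; the name is an assumption for this example.
function clean_input(string $input): string
{
    // Step 1: escape &, < and >. ENT_NOQUOTES leaves " untouched.
    $escaped = htmlspecialchars($input, ENT_NOQUOTES);

    // Step 2: un-escape anything shaped like an opening or closing tag.
    $pattern = '@&lt;([a-z0-9]+(\s[a-z]+(="[^"]*")?)*\s*/?|/[a-z0-9]+)&gt;@';
    return preg_replace($pattern, '<$1>', $escaped);
}

// A comparison stays escaped because nothing tag-like follows the "<".
echo clean_input('if a < b and b > c'), "\n";
// A real tag pair is restored.
echo clean_input('<b>bold</b>'), "\n";
// The problem case: nonsense "tags" are restored too.
echo clean_input('<sfajklhsfsjk> or </rhfkj>'), "\n";
```

The last call shows the issue: the pattern only checks the shape of the text, not whether the name is a real HTML element.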

I was thinking I could solve this by another preg_replace that undoes the first one where there is a tag that isn't properly closed/opened. The problem is I'm not sure how to implement it, since I don't know how to do a regular expression that looks for the lack of a pattern.
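
For reference, PCRE (the engine behind preg_replace) has a negative lookahead assertion, (?!...), which succeeds only where the enclosed pattern does not match. The sketch below uses it to re-escape any tag whose name is not on a list of allowed names. Note that this swaps in a different check from the one described above: it tests tag names against a list, not whether tags are properly opened and closed. The allowed list (b, i, u, a, br, p) and the function name escape_unknown_tags() are assumptions made up for the example.

```php
<?php
// Hypothetical allowlist; these tag names are an assumption for the example only.
const ALLOWED_TAGS = 'b|i|u|a|br|p';

// Hypothetical helper: re-escape every tag whose name is NOT in ALLOWED_TAGS.
function escape_unknown_tags(string $html): string
{
    // (?!/?(?:...)\b) is the negative lookahead: it succeeds only when the text
    // right after "<" is not an optional "/" followed by an allowed tag name.
    $pattern = '@<(?!/?(?:' . ALLOWED_TAGS . ')\b)(/?[a-z0-9]+[^<>]*)>@';
    return preg_replace($pattern, '&lt;$1&gt;', $html);
}

echo escape_unknown_tags('<b>ok</b> <sfajklhsfsjk> </rhfkj>'), "\n";
// prints: <b>ok</b> &lt;sfajklhsfsjk&gt; &lt;/rhfkj&gt;
```

This would run as a second pass after the tags have been converted back. It does not handle a > inside a quoted attribute value, so it is only a starting point.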

Is there a way to solve the problem, or just a simpler way of doing everything?