Essentially, I'm trying to take input from a form and clean it up so that any < and > that aren't meant to be part of a tag are changed to &lt; and &gt;, so that there's no chance of a browser misinterpreting them as tags.
Code:
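// ENT_NOQUOTES leaves " untouched, so the attribute part of the pattern below can still match
$input = htmlspecialchars($input, ENT_NOQUOTES);
// Turn escaped sequences that look like tags back into real tags
$input = preg_replace('@&lt;([a-z0-9]+(\s[a-z]+(="[^"]+")?)*/?|/[a-z0-9]+)&gt;@', '<\1>', $input);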
This is as far as I've gotten: I convert every < and > to the entity notation, then look for sequences that look like tags and convert them back. The preg_replace looks for substrings that start with &lt;, followed by the tag name, then optionally attributes and a closing /, followed by &gt;; OR &lt; followed by /, then the tag name, then &gt;.
It works pretty well in most cases, except when the user does something like <sfajklhsfsjk> or </rhfkj>
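To make the behaviour concrete, here is a minimal sketch of the two steps wrapped in a function. The name restore_tags is hypothetical, and the sketch assumes ENT_NOQUOTES and a [^"]+ attribute value so that quoted attributes can match; neither is necessarily what the final code should use.

```php
<?php
// Hypothetical helper wrapping the escape-then-restore steps from the post.
function restore_tags(string $input): string
{
    // Assumption: ENT_NOQUOTES keeps " literal so quoted attributes can match.
    $escaped = htmlspecialchars($input, ENT_NOQUOTES);

    // Anything that has the shape of a tag is converted back, whatever its name.
    return preg_replace(
        '@&lt;([a-z0-9]+(\s[a-z]+(="[^"]+")?)*/?|/[a-z0-9]+)&gt;@',
        '<\1>',
        $escaped
    );
}

// A stray < stays escaped, while <b> and </b> are restored.
echo restore_tags('1 < 2 and <b>bold</b>'), "\n";

// The problem case: made-up tags have the right shape, so they are restored too.
echo restore_tags('<sfajklhsfsjk> or </rhfkj>'), "\n";
```

The second call shows why the pattern alone isn't enough: it only checks the shape of a tag, not whether the name means anything.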
I was thinking I could solve this by another preg_replace that undoes the first one where there is a tag that isn't properly closed/opened. The problem is I'm not sure how to implement it, since I don't know how to do a regular expression that looks for the lack of a pattern.
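One possible direction, sketched here only as an illustration: rather than undoing bad matches in a second pass, restore a tag only when its name is on a list of known tags. This is a different technique from the "undo" idea above. It uses preg_replace_callback, and the ALLOWED_TAGS list and the function name restore_known_tags are hypothetical. It does not check that tags are balanced, and it does not validate attributes.

```php
<?php
// Hypothetical allowlist; these tag names are only examples.
const ALLOWED_TAGS = ['a', 'b', 'i', 'em', 'strong', 'p', 'br'];

function restore_known_tags(string $input): string
{
    // Assumption: ENT_NOQUOTES keeps " literal so quoted attributes can match.
    $escaped = htmlspecialchars($input, ENT_NOQUOTES);

    return preg_replace_callback(
        // Group 1: optional leading /, group 2: tag name, group 3: attributes and optional trailing /
        '@&lt;(/?)([a-z0-9]+)((?:\s[a-z]+(?:="[^"]+")?)*/?)&gt;@',
        function (array $m): string {
            // Restore the tag only if its name is on the allowlist;
            // otherwise return the escaped text exactly as it was matched.
            return in_array($m[2], ALLOWED_TAGS, true)
                ? '<' . $m[1] . $m[2] . $m[3] . '>'
                : $m[0];
        },
        $escaped
    );
}

echo restore_known_tags('<b>hi</b> <sfajklhsfsjk> </rhfkj>'), "\n";
```

With this approach the unknown names never get converted back, so no pattern for "the lack of a match" is needed. Checking that every opening tag has a matching closing tag would still need separate logic.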
Is there a way to solve the problem, or just a simpler way of doing everything?