As a CMS developer, I've found that there are people on this earth who try to do anything with your website EXCEPT what you built it for (reading the content as a normal human being would).
There are several ways to catch those people, and one of them is through their UA (User Agent).
However, there are bots that fake the UA to bypass your security, which I will probably cover in a later topic.
Anyway, through experience I've found out how most "real" browsers behave and what their UAs look like.
So if you want to block any unknown UA, use the following code, which took me 5 days of checking and identifying all the UAs I received on a website.
PHP Code:
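# Stores the identified parts in $data (passed by reference).
# Returns true when an agent name was found, false otherwise.
function check_ua($agent, $os, $extra, &$data)
{
    if (!empty($agent)) {
        $data = array('agent' => $agent, 'os' => $os, 'ext' => $extra);
        return true;
    }
    return false;
}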
function identify_ua($agent)
{
    $data = array();
    # Each rule is array(pattern, agent, os, extra).
    # An integer refers to a capture group of the pattern;
    # a string is used as a literal value.
    $rules = array(
        # Gecko family
        array('#^Mozilla/5.0 \(([a-zA-Z0-9]+); U; (.*[^;])(; [a-zA-Z\-]{2,5})?; rv:[0-9\.]+.*?\) Gecko/[0-9]{8} .* (Firefox).*#', 4, 2, ''),
        array('#^Mozilla/5.0 \(([a-zA-Z0-9]+); U; (.*[^;])(; [a-zA-Z\-]{2,5})?; rv:[0-9\.]+.*?\) Gecko/[0-9]{8} ([a-zA-Z\-]+)/[0-9\.]+.*#', 4, 2, ''),
        array('#^Mozilla/5.0 \(([a-zA-Z0-9]+); U; (.*[^;])(; [a-zA-Z\-]{2,5})?; rv:[0-9\.]+.*?\) Gecko/[0-9]{8}$#', 'Mozilla', 2, ''),
        # Galeon alternate
        array('#^Mozilla/5.0 (Galeon)/[0-9\.]+ \(([a-zA-Z0-9]+); (.*[^;]); U\)#', 1, 3, ''),
        # Konqueror
        array('#^Mozilla/5.0 \(compatible; (Konqueror)/[0-9\.\-rc]+; (i686 )?(Linux|FreeBSD).*#', 1, 3, ''),
        # Lynx
        array('#^(Lynx)/2.[0-9\.]+(rel|dev)[0-9\.]+ libwww-FM/.*#', 1, 'N/A', ''),
        # Safari family
        array('#^Mozilla/5.0 \(Macintosh; U; PPC Mac OS X; [a-zA-Z\-]{2,5}\) AppleWebKit/.*? \(KHTML, like Gecko.*?\) ([a-zA-Z]+)/.*#', 1, 'Mac', ''),
        # w3m
        array('#^(w3m)/[0-9\.]+#', 1, 'N/A', ''),
        # Links
        array('#^(Links) \([0-9].[a-z0-9]+; (.*?);#', 1, 2, ''),
        # Voyager
        array('#^Mozilla/4.0 \(compatible; (Voyager); (AmigaOS).*#', 1, 2, ''),
        # Opera
        array('#^(Opera)/[67].[0-9]{1,2} \((.*?); U\)[\ ]{1,2}\[[a-zA-Z\-]{2,5}\]#', 1, 2, ''), # Opera 6-7
        array('#^Mozilla/[45].0 \(compatible; MSIE [56].0; (.*?)\) (Opera) [567].[0-9]{1,2} \[[a-zA-Z\-]{2,5}\]#', 2, 1, ''), # Opera 6-7 faking IE
        array('#^Mozilla/5.0 \((.*?); U\) (Opera) [67].[0-9]{1,2} \[[a-zA-Z\-]{2,5}\]#', 2, 1, ''), # Opera 6-7 faking Gecko
        array('#^(Opera)/8.[0-9]{1,2} \((.*?); U; [a-zA-Z\-]{2,5}\)#', 1, 2, ''), # Opera 8
        array('#^Mozilla/4.0 \(compatible; MSIE 6.0; (.*?); [a-zA-Z\-]{2,5}\) (Opera) 8.[0-9]{1,2}#', 2, 1, ''), # Opera 8 faking IE
        array('#^Mozilla/5.0 \((.*?); U; [a-zA-Z\-]{2,5}\) (Opera) 8.[0-9]{1,2}#', 2, 1, ''), # Opera 8 faking Gecko
        # IE
        array('#^Mozilla/4.0 \(compatible; MSIE (4.0|5.0|5.5|6.0|7.0)[b1]?(; .*[^;])?; (Windows) [A-Z0-9\ \.]+[;)](.*)?#', 'MSIE', 3, 4),
        array('#^Mozilla/2.0 \(compatible; MSIE (3.0|4.0)[1]?(; .*[^;])?; (Windows) [A-Z0-9\ \.]+[;)](.*)?#', 'MSIE', 3, 4),
    );
    foreach ($rules as $rule) {
        if (!preg_match($rule[0], $agent, $m)) continue;
        $args = array();
        for ($i = 1; $i <= 3; $i++) {
            $ref = $rule[$i];
            # Groups that did not take part in the match become ''
            $args[] = is_int($ref) ? (isset($m[$ref]) ? $m[$ref] : '') : $ref;
        }
        # The first rule that yields a non-empty agent wins
        if (check_ua($args[0], $args[1], $args[2], $data)) break;
    }
    # No browser pattern matched: hand over to identify_bot(),
    # which is not part of this post
    if (!isset($data['agent'])) return identify_bot($agent);
    if ($data['agent'] == 'MSIE') {
        # Detect bot that simulates MSIE
        preg_match('#(Fetch API Request|Microsoft Scheduled Cache Content Download Service|Have a nice day\!|Your Own World|Mozilla/|Medusa)#is', $data['ext'], $regs);
        if (!empty($regs[0])) {
            $data['bot'] = $regs[0];
            unset($data['agent']);
            return $data;
        }
        # Detect browsers built on top of MSIE; the last one listed wins
        preg_match_all('#(iRider|Crazy Browser|NetCaptor|Maxthon|Avant Browser)#s', $data['ext'], $regs);
        if (!empty($regs[0])) {
            $data['agent'] = str_replace(' Browser', '', $regs[0][count($regs[0]) - 1]);
            $data['ext'] = '';
        }
    }
    # Reduce the OS string to a short, known name
    preg_match('#(Win|Mac|Linux|FreeBSD|SunOS|IRIX|BeOS|OS/2|AIX|Amiga)#is', $data['os'], $regs);
    $data['os'] = empty($regs[0]) ? 'Other' : $regs[0];
    if ($data['os'] == 'Win') $data['os'] = 'Windows';
    return $data;
}
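# HTTP_USER_AGENT is not set when the client sends no User-Agent header
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$data = identify_ua($ua);
if (!$data || empty($data['agent'])) {
    die('We are sorry, but unidentified User Agents are not allowed on this website');
}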
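If you want to see what a single pattern actually captures, here is a minimal standalone sketch that runs only the second Gecko pattern from identify_ua() through preg_match() and then applies the same OS clean-up step. The function name sketch_match_gecko() and the sample UA strings are my own assumptions for illustration; they are not part of the code above.

```php
<?php
# Hypothetical helper for illustration only; not part of the original code.
function sketch_match_gecko($ua)
{
    # Second Gecko pattern from identify_ua(), used with plain preg_match()
    $regex = '#^Mozilla/5.0 \(([a-zA-Z0-9]+); U; (.*[^;])(; [a-zA-Z\-]{2,5})?; rv:[0-9\.]+.*?\) Gecko/[0-9]{8} ([a-zA-Z\-]+)/[0-9\.]+.*#';
    if (!preg_match($regex, $ua, $m)) {
        return false; # not a UA this pattern knows
    }
    # Group 4 is the browser name, group 2 the raw OS part
    $agent = $m[4];
    # Same OS normalisation as at the end of identify_ua()
    preg_match('#(Win|Mac|Linux|FreeBSD|SunOS|IRIX|BeOS|OS/2|AIX|Amiga)#is', $m[2], $regs);
    $os = empty($regs[0]) ? 'Other' : $regs[0];
    if ($os == 'Win') $os = 'Windows';
    return array('agent' => $agent, 'os' => $os);
}

# Assumed sample UA strings: a Firefox 1.5 style UA and a script library
$samples = array(
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1',
    'libwww-perl/5.805',
);
foreach ($samples as $ua) {
    var_dump(sketch_match_gecko($ua));
}
```

Anything the pattern does not recognise comes back as false, which is exactly the "unknown UA" case the die() call above is meant to catch.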
Feel free to add comments if you have a UA of a "real" browser that fails this test, and DON'T FORGET to mention which plugin/software you used that modifies the UA.
NOTE: I am also working on bot identification through HTTP_REFERER and IPs/networks to prevent things like referer spamming, harvesting and image grabbing.