Blocking robot spiders


.htaccess

  • simple example
SetEnvIfNoCase User-Agent "^Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^baidu" bad_bot
SetEnvIfNoCase User-Agent "^sogou" bad_bot
SetEnvIfNoCase User-Agent "^soso" bad_bot
<Files *>
Order allow,deny
Allow from all
Deny from env=bad_bot
</Files>
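On Apache 2.4 the Order/Allow/Deny directives are deprecated in favour of mod_authz_core; a minimal equivalent sketch, assuming Apache 2.4 and keeping the SetEnvIfNoCase lines above:

<Files *>
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>
</Files>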
  • another example
#
#=========== Set User Agents ==============#
SetEnvIfNoCase User-Agent "^Baiduspider" block_bot
#
#=========== Set A Requested URI ==========#
SetEnvIfNoCase Request_URI "^/w00tw00t" w00t
#
#=========== Set An IP Address ============#
SetEnvIf Remote_Addr "^38.100.41.107$" block_ip
#
#=========== Set A Range Of IP Addresses =====#
# Kratos-ua (Ripe Network) Blocks 193.104.22.0 - 193.104.22.255
SetEnvIf Remote_Addr "^193\.104\.22\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$" block_ip
#
#=========== Set A Referer ================#
SetEnvIfNoCase Referer "p30tip" block_referer
#
#=========== Block The Set Variables =========#
Order Allow,Deny
Allow from All
Deny from env=block_bot
Deny from env=w00t
Deny from env=block_ip
Deny from env=block_referer
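On Apache 2.4 the same deny list can be written with mod_authz_core, and the IP range collapses to CIDR notation instead of the long regex; a sketch, assuming the SetEnvIf lines above are kept:

<RequireAll>
Require all granted
Require not env block_bot
Require not env w00t
Require not env block_referer
Require not ip 38.100.41.107
Require not ip 193.104.22.0/24
</RequireAll>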

robots.txt

1. If every file on your site may be crawled and indexed by spiders, write the file like this:

User-agent: *
Disallow:

If every file on your site may be indexed by search engines, you can also simply omit this file.

2. A robots.txt that blocks all search engines entirely:

User-agent: *
Disallow: /

To block only a specific search engine:

User-agent: Googlebot
Disallow: /

The following blocks all crawling by Baidu:

User-agent: Baiduspider
Disallow: /
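Per-agent records can be combined, and a crawler follows the most specific record that matches it; for example, this sketch blocks every crawler except Googlebot:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /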

3. To keep particular folders on the site out of search engine indexes:

User-agent: *
Disallow: /admin/
Disallow: /images/
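Major engines such as Google and Bing also honour an Allow directive (now standardized in RFC 9309), which can re-open a single path inside a blocked folder; a sketch, where /images/logo.png is a hypothetical file:

User-agent: *
Disallow: /images/
Allow: /images/logo.png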

4. To stop Google from crawling image files on the site (GIFs in this example):

User-agent: Googlebot
Disallow: /*.gif$
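Note that the * wildcard and the $ end-of-URL anchor are extensions honoured by Google and some other engines, not part of the original robots.txt protocol; the same pattern extends to other image types, for example:

User-agent: Googlebot
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.png$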