First, pull the crawler's requests out of the nginx access log:
$ grep 'Baiduspider' /opt/nginx/logs/access.log | head -n 10
Log entry 1
123.125.71.34 - - [11/Nov/2019:09:28:38 +0800] "GET / HTTP/1.1" 301 184 "-" "Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
Log entry 2
49.89.129.75 - - [12/Nov/2019:01:17:08 +0800] "HEAD / HTTP/1.1" 301 0 "http://orchome.com/" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html\x09"
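Before checking addresses one by one, it can help to reduce the log to the distinct client IPs that claim to be Baiduspider. The following is a minimal sketch; the function name spider_ips is hypothetical, and it assumes the client IP is the first whitespace-separated field of each line, as in the two entries above.

```shell
# Print each distinct client IP whose log line mentions Baiduspider.
# Reads access-log lines from stdin.
# Assumption: the client IP is the first field (default nginx log layout).
spider_ips() {
    grep 'Baiduspider' | awk '{print $1}' | sort -u
}

# Example (path taken from the command above):
#   spider_ips < /opt/nginx/logs/access.log
```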
Next, run nslookup on each IP to check whether it really belongs to Baidu.
Client IP from log entry 1:
$ nslookup 123.125.71.34
Server: 172.16.0.3
Address: 172.16.0.3#53
Non-authoritative answer:
34.71.125.123.in-addr.arpa name = baiduspider-123-125-71-34.crawl.baidu.com.
Authoritative answers can be found from:
125.123.in-addr.arpa nameserver = ns2.bta.net.cn.
125.123.in-addr.arpa nameserver = ns.bta.net.cn.
ns.bta.net.cn internet address = 202.96.0.133
ns2.bta.net.cn internet address = 202.106.196.28
Client IP from log entry 2:
$ nslookup 49.89.129.75
Server: 172.16.0.3
Address: 172.16.0.3#53
** server can't find 75.129.89.49.in-addr.arpa.: NXDOMAIN
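For a genuine Baiduspider, the reverse lookup returns a hostname of the form *.baidu.com or *.baidu.jp. If the result contains no such hostname, as with the NXDOMAIN answer for log entry 2, the spider is fake. The following result is from a genuine crawler: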
name = baiduspider-123-125-71-34.crawl.baidu.com.
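The hostname rule above can be sketched as a small shell check. This is only an illustration: the function names is_baidu_hostname and check_spider_ip are hypothetical, and the parsing assumes nslookup prints a "name = ..." line like the one shown in the output above.

```shell
# Return 0 if the hostname ends in .baidu.com or .baidu.jp, otherwise 1.
is_baidu_hostname() {
    name="${1%.}"   # drop the trailing dot that nslookup prints
    case "$name" in
        *.baidu.com|*.baidu.jp) return 0 ;;
        *) return 1 ;;
    esac
}

# Reverse-resolve an IP with nslookup and report whether it looks genuine.
# Assumption: the answer contains a line of the form "... name = <hostname>."
check_spider_ip() {
    ptr=$(nslookup "$1" 2>/dev/null | awk '/name = /{print $NF; exit}')
    if [ -n "$ptr" ] && is_baidu_hostname "$ptr"; then
        echo "$1 genuine ($ptr)"
    else
        echo "$1 fake"
    fi
}

# Example:
#   check_spider_ip 123.125.71.34
#   check_spider_ip 49.89.129.75
```

The suffix match is anchored at the end of the name, so a hostname such as crawl.baidu.com.example.net is not accepted.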