Scraping Taobao IP Address Data with Python

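This script pulls IP address database records from https://ip.taobao.com. There are plenty of examples of this online, but few come with complete code, so here is the version I wrote.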

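Taobao limits the service to 10 requests per second. Whether it was my machine, Python itself, or my own code, even with the limit lifted I averaged at most about 10 requests per second anyway. The likely cause is that I fetch synchronously with urllib; switching to asynchronous fetching would be straightforward. Screenshots of a run: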

screenshot1

screenshot2

There are three threads.
The worker thread fetches and parses the data:

import logging
import Queue    # Python 2 name; the module is called queue in Python 3

def worker(ratelimit, jobs, results, progress):
    global cancel
    while not cancel:
        try:
            ratelimit.ratecontrol()
            ip = jobs.get(timeout=2)    # Wait up to 2 seconds for a job
            ok, result = fetch(ip)
            if not ok:
                logging.error("Fetch information failed, ip:{}".format(ip))
                progress.put("")    # Notify the progress thread even if the fetch failed
            elif result is not None:
                results.put(" ".join(result))
            jobs.task_done()    # Mark this job as finished
        except Queue.Empty:
            pass
        except Exception:
            logging.exception("Unknown Error!")
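
The worker calls ratelimit.ratecontrol(), which this post does not show. Below is a minimal sketch of what such a limiter might look like. It is written for Python 3 (the post's own code is Python 2), and the class name RateLimit, its constructor parameters, and the slot-reservation approach are all assumptions rather than the code in the repository. It spaces calls evenly so that at most `rate` calls proceed per second across all threads.

```python
import threading
import time


class RateLimit(object):
    """Hypothetical limiter: lets at most `rate` calls through per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate   # minimum spacing between two calls
        self._clock = clock           # injectable so the class can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()     # earliest time the next call may proceed

    def ratecontrol(self):
        """Reserve the next free time slot, then sleep until it arrives."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)        # sleep outside the lock
```

With RateLimit(10) shared by all workers, each call to ratecontrol() is pushed at least 0.1 seconds after the previous one. Sleeping outside the lock lets other threads reserve their own slots in the meantime.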

import json
import logging
import urllib    # Python 2; urlopen lives in urllib.request in Python 3

# Keys copied from the response's 'data' object, in output order
FIELDS = (u'ip', u'country', u'country_id', u'area', u'area_id',
          u'region', u'region_id', u'city', u'city_id',
          u'county', u'county_id', u'isp', u'isp_id')

def fetch(ip):
    url = 'https://ip.taobao.com/service/getIpInfo.php?ip=' + ip
    result = []
    try:
        response = urllib.urlopen(url).read()
        jsondata = json.loads(response)
        if jsondata[u'code'] != 0:
            return 0, result    # The service reported an error
        data = jsondata[u'data']
        for key in FIELDS:
            result.append(data[key].encode('utf-8'))
    except Exception:
        logging.exception("Url open failed:" + url)
        return 0, result
    return 1, result
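
The field extraction in fetch can be tried without touching the network if parsing is separated from the HTTP call. Here is a rough Python 3 sketch of that idea. It assumes the response has the shape the code above reads (a `code` value plus a `data` object holding the 13 keys); parse_ip_info is a hypothetical helper, and the sample payload is made up for illustration rather than real API output.

```python
import json

FIELDS = ("ip", "country", "country_id", "area", "area_id", "region",
          "region_id", "city", "city_id", "county", "county_id",
          "isp", "isp_id")


def parse_ip_info(payload):
    """Return (ok, fields) from a JSON string shaped like the API response."""
    try:
        doc = json.loads(payload)
        if doc["code"] != 0:
            return False, []
        data = doc["data"]
        return True, [str(data[name]) for name in FIELDS]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or a missing key counts as a failed lookup.
        return False, []


# Made-up sample payload: placeholder values, not real API output.
sample_data = dict.fromkeys(FIELDS, "-")
sample_data["ip"] = "192.0.2.1"
sample = json.dumps({"code": 0, "data": sample_data})

ok, fields = parse_ip_info(sample)
print(ok, " ".join(fields))
```

Unlike fetch, this variant returns an empty list on any failure instead of a partially filled one, which is a deliberate simplification.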

The process thread writes the results to the output:

def process(target, results, progress):
    global cancel
    while not cancel:
        try:
            line = results.get(timeout=5)
        except Queue.Empty:
            pass
        else:
            print >>target, line
            progress.put("")
            results.task_done()

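The progproc thread tracks progress. I used the progressbar2 package from PyPI, which writes to stderr by default and so does not interfere with results written to stdout. The ProgressBar class is not thread-safe, though, which is why it gets a thread of its own: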

def progproc(progressbar, count, progress):
    """
    ProgressBar is not thread-safe, so progress is counted through a Queue, as in the
    other two threads, and only this thread draws the progress bar. The bar is printed
    to stderr, which does not conflict with the default result output (stdout).
    """
    idx = 1
    while True:
        try:
            progress.get(timeout=5)
        except Queue.Empty:
            pass
        else:
            progressbar.update(idx)
            idx += 1
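
The pattern here is that many threads only put tokens on a queue, while a single thread owns the object that is not thread-safe. The following self-contained Python 3 sketch illustrates it with a plain counter standing in for ProgressBar. The Counter class, the STOP sentinel, and the report and work functions are illustrative names of my own, not part of progressbar2 or the original script, and the sentinel-based shutdown replaces the timeout polling used above.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling the reporter to exit


class Counter(object):
    """Stand-in for a progress display that is not thread-safe."""

    def __init__(self):
        self.value = 0

    def update(self, value):
        self.value = value


def report(display, progress):
    """Only this thread touches `display`; the others just enqueue tokens."""
    done = 0
    while True:
        token = progress.get()
        if token is STOP:
            break
        done += 1
        display.update(done)


def work(progress, n):
    for _ in range(n):
        progress.put("")  # one token per finished item


progress = queue.Queue()
display = Counter()
reporter = threading.Thread(target=report, args=(display, progress))
reporter.start()

workers = [threading.Thread(target=work, args=(progress, 100))
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

progress.put(STOP)  # queued after every token, so nothing is lost
reporter.join()
print(display.value)
```

Because the sentinel is enqueued only after all workers have joined, the reporter sees all 400 tokens before it exits.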

Usage is simple:

usage: main.py [-h] [-o OUTPUT] iprange

positional arguments:
  iprange               The string of IP range, such as:
                          "192.168.1.0-192.168.1.255" : beginning-end
                          "192.168.1.0/24" : CIDR
                          "192.168.1.*" : wildcard

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        The output destination of result, default is stdout

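Three kinds of IP range expression are supported, which is enough to cover any network segment. The wildcard can appear in any octet, so an argument such as 192.168.*.1 is accepted too.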
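
The three range forms could be expanded roughly as follows. This is a Python 3 sketch built on the standard ipaddress module. expand_iprange is a hypothetical helper and not the parser from the repository, and whether the original includes the network and broadcast addresses for CIDR input is an assumption on my part (this version yields every address in the block).

```python
import ipaddress
import itertools


def expand_iprange(text):
    """Yield IPv4 address strings for 'begin-end', CIDR, or wildcard input."""
    if "-" in text:
        begin, end = (int(ipaddress.IPv4Address(part.strip()))
                      for part in text.split("-", 1))
        for number in range(begin, end + 1):
            yield str(ipaddress.IPv4Address(number))
    elif "/" in text:
        # strict=False tolerates host bits, e.g. 192.168.1.7/24
        for address in ipaddress.IPv4Network(text, strict=False):
            yield str(address)
    else:
        octets = text.split(".")
        if len(octets) != 4:
            raise ValueError("expected four octets: %r" % text)
        # A '*' octet ranges over 0..255; anything else is a fixed value.
        choices = [range(256) if o == "*" else [int(o)] for o in octets]
        for combo in itertools.product(*choices):
            # IPv4Address rejects out-of-range octets such as 300.
            yield str(ipaddress.IPv4Address(".".join(map(str, combo))))
```

For example, list(expand_iprange("192.168.*.1")) starts with 192.168.0.1 and 192.168.1.1 and has 256 entries. Since the function is a generator, a large range can be fed into the jobs queue without building the whole list in memory first.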

The code is up on GitHub:

taobaoip