Updated the crawler code and database on March 15, 2022.

See: https://www.imtrq.com/archives/2946

This is arguably the simplest possible crawler: the site has no anti-scraping measures, and no HTML parsing is needed since the API returns JSON directly.

Query site: https://lianhanghao.com

The data was retrieved on December 21, 2021. It contains only the bank routing number (行号) and the branch name (开户行), 167,719 records in total.

The code uses only requests for fetching and stores the results in MongoDB. Feel free to switch the storage to MySQL, or simply keep everything in memory.
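If you do want to keep everything in memory, the loop body just swaps `collection.insert_many(l)` for a `list.extend`. A minimal sketch, assuming each item of `r.json()['data']['data']` is a dict with at least the `bankname` and `hanghao` keys (the sample records below are made up for illustration):

```python
import pandas as pd

# In-memory variant: accumulate every page into a list instead of MongoDB.
# The sample records are hypothetical; in the real loop you would call
# rows.extend(r.json()['data']['data']) for each page.
rows = []
sample_page = [
    {'bankname': 'Example Bank, Head Office', 'hanghao': '102100099996'},
    {'bankname': 'Example Bank, Some Branch', 'hanghao': '102100000018'},
]
rows.extend(sample_page)

# Build the DataFrame directly from memory; export with
# df.to_excel('bank.xlsx', index=False) as in the main script.
df = pd.DataFrame(rows)[['bankname', 'hanghao']]
```

This skips the database round trip entirely, at the cost of holding all ~168k records in RAM, which is trivial for a dataset this size.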

Please try not to use my user-agent.

import pymongo
import requests
import pandas as pd

# Connect to the local MongoDB instance
client = pymongo.MongoClient('mongodb://127.0.0.1:27017')
database = client['hanglianhao']
collection = database['data']

# Set the url and headers; please try not to use my user-agent
url = 'https://lianhanghao.com/api/bank/lhhTableData'
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# Fetch one page (any size works) just to read the total record count
data = {"current":1,"size":"20","bank_id":"","province_id":"","city_id":"","keywords":""}
r = requests.post(url,headers=headers,data=data)
# 100 records per page; //100 + 2 over-counts by one so range() covers the last partial page
pages = r.json()['data']['total_num']//100+2

# Crawl every page, 100 records at a time
for i in range(1,pages):
    data = {"current":i,"size":"100","bank_id":"","province_id":"","city_id":"","keywords":""}
    r = requests.post(url,headers=headers,data=data)
    l = r.json()['data']['data']
    collection.insert_many(l)
    if i%100==0: print('{} pages done'.format(i))

# Export to Excel (projection limits the query to the two fields we need)
df = pd.DataFrame(list(collection.find(projection=['bankname','hanghao'])))
df[['bankname','hanghao']].to_excel('bank.xlsx',index=False)
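A note on the page math: `total_num//100+2` deliberately over-counts by one page so that `range(1, pages)` still reaches the last partial page. If you prefer an exact count, a small helper does the same job:

```python
import math

def page_count(total_num, page_size=100):
    """Exact number of pages needed to cover total_num records."""
    return math.ceil(total_num / page_size)

# The loop then becomes: for i in range(1, page_count(total_num) + 1)
# e.g. 167719 records at 100 per page -> 1678 pages
```

Either form works; the `+2` version just trades a handful of empty trailing requests at most for one less import.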