2022-03-15 update: refreshed the crawler code and the database dump.
This is about the simplest crawler possible: the site has no anti-scraping measures, and no HTML parsing is needed since the API returns JSON directly.
The data was queried on 2021-12-21 and contains only the bank branch code (hanghao) and the branch name (bankname), 167,719 records in total.
The code uses nothing but requests for fetching and stores the results in MongoDB. You can switch the storage to MySQL, or just keep everything in memory; whatever suits you.
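If you'd rather skip the database entirely, the crawl loop can be kept in memory as a plain list. This is a hedged sketch, not part of the original script: `crawl_all`, `fetch_page`, and `page_size` are names I made up, and `fetch_page` stands in for the `requests.post` call the script makes per page.

```python
def crawl_all(fetch_page, total_num, page_size=100):
    """Collect every record into a plain Python list (no database).

    Hypothetical helper; fetch_page(page) should return the list of
    records on that page. Uses the same loop bound as the script:
    range(1, total_num // page_size + 2).
    """
    rows = []
    for page in range(1, total_num // page_size + 2):
        rows.extend(fetch_page(page))
    return rows

# In the real crawler, fetch_page(page) would roughly be:
#   requests.post(url, headers=headers,
#                 data={**base_payload, "current": page}).json()['data']['data']
```

The final page may come back empty when the total divides evenly by the page size; `rows.extend([])` is harmless, so no special case is needed here.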
Please don't reuse my user-agent string.
import pymongo
import requests
import pandas as pd

# Connect to MongoDB
client = pymongo.MongoClient('mongodb://127.0.0.1:27017')
database = client['hanglianhao']
collection = database['data']

# Set the URL and headers; please don't reuse my user-agent
url = 'https://lianhanghao.com/api/bank/lhhTableData'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# Fetch the total record count to work out the number of pages
data = {"current": 1, "size": "20", "bank_id": "", "province_id": "", "city_id": "", "keywords": ""}
r = requests.post(url, headers=headers, data=data)
pages = r.json()['data']['total_num'] // 100 + 2

# Crawl page by page, 100 records at a time
for i in range(1, pages):
    data = {"current": i, "size": "100", "bank_id": "", "province_id": "", "city_id": "", "keywords": ""}
    r = requests.post(url, headers=headers, data=data)
    l = r.json()['data']['data']
    if l:  # insert_many raises on an empty list, possible on the final page
        collection.insert_many(l)
    if i % 100 == 0:
        print('Finished {} pages'.format(i))

# Export to Excel
df = pd.DataFrame(list(collection.find(projection=['bankname', 'hanghao'])))
df[['bankname', 'hanghao']].to_excel('bank.xlsx', index=False)
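As a sanity check on the loop bound, the number of pages actually needed is a ceiling division. The helper below is my own, not part of the script; it just confirms that `range(1, total_num // 100 + 2)` visits enough pages.

```python
import math

def pages_needed(total_num, page_size=100):
    # Minimum number of pages that cover total_num records
    return math.ceil(total_num / page_size)

# The script iterates range(1, total_num // 100 + 2), i.e. it visits
# total_num // 100 + 1 pages: enough to cover every record, with at
# most one extra empty page when total_num is an exact multiple of 100.
print(pages_needed(167719))  # 1678
```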