My own skills are still nowhere near ready for real projects, but grinding through video tutorials gets tedious and brings little sense of accomplishment, so I threw together a little scraper of my own. Nothing fancy: it just scrapes the faculty adviser information from our business school.
1 Analyzing the page
All of the business school advisers' information lives here: https://nubs.nju.edu.cn/8878/listm.htm
There is a small detour, though: this page is not static, so it can't be parsed with BeautifulSoup or the re module the way I learned before. First we need Chrome to see how the query request is actually made:
Press F12, and in the Network tab you can see that the teacher list is generated by this generalQuery request. Click "Headers" to see the main information about it:
1.1 General
- Request URL: https://nubs.nju.edu.cn/_wp3services/generalQuery?queryObj=teacherHome
- Request Method: POST
- Status Code: 200 OK
- Remote Address: 127.0.0.1:7890
- Referrer Policy: no-referrer-when-downgrade
1.2 Request Headers
- Accept: application/json, text/javascript, */*; q=0.01
- Accept-Encoding: gzip, deflate, br
- Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
- Connection: keep-alive
- Content-Length: 1888
- Content-Type: application/x-www-form-urlencoded; charset=UTF-8
- Cookie: www.nju.edu.cn=67313298; JSESSIONID=B1D647135CD01F9640BEA8F744B93D32
- Host: nubs.nju.edu.cn
- Origin: https://nubs.nju.edu.cn
- Referer: https://nubs.nju.edu.cn/8878/listm.htm
- Sec-Fetch-Mode: cors
- Sec-Fetch-Site: same-origin
- User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36
- X-Requested-With: XMLHttpRequest
1.3 Form Data
- siteId: 295
- pageIndex: 1
- conditions: [{"orConditions":[{"field":"exField2","value":"经济学系","judge":"="},{"field":"exField2","value":"国际经济贸易系","judge":"="},{"field":"exField2","value":"金融与保险学系","judge":"="},{"field":"exField2","value":"产业经济学系","judge":"="},{"field":"exField2","value":"人口研究所","judge":"="},{"field":"exField2","value":"工商管理系","judge":"="},{"field":"exField2","value":"会计学系","judge":"="},{"field":"exField2","value":"营销与电子商务系","judge":"="},{"field":"exField2","value":"人力资源管理学系","judge":"="}]}]
- orders: [{"field":"letter","type":"asc"}]
- returnInfos: [{"field":"title","name":"title"},{"field":"exField1","name":"exField1"},{"field":"exField2","name":"exField2"},{"field":"career","name":"career"},{"field":"exField8","name":"exField8"},{"field":"wapUrl","name":"wapUrl"},{"field":"cnUrl","name":"cnUrl"}]
- articleType: 1
- level: 1
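Note that `conditions` and `orders` are JSON strings embedded in the form data. If you ever wanted to rebuild them yourself rather than copy the captured payload, a sketch like this (using `json.dumps`, with the department list shortened for illustration) would do it:

```python
import json

# A few of the departments from the captured "conditions" payload.
departments = ["经济学系", "国际经济贸易系", "金融与保险学系"]

# Rebuild the conditions field: one orConditions list matching any department.
conditions = json.dumps(
    [{"orConditions": [
        {"field": "exField2", "value": d, "judge": "="} for d in departments
    ]}],
    ensure_ascii=False,  # keep the Chinese department names readable
)

# Rebuild the orders field: sort by the "letter" field, ascending.
orders = json.dumps([{"field": "letter", "type": "asc"}])

print(conditions)
print(orders)
```

Both strings round-trip through `json.loads`, so they can be dropped straight into the POST form alongside `siteId` and `pageIndex`.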
So my read is: the browser POSTs to https://nubs.nju.edu.cn/_wp3services/generalQuery?queryObj=teacherHome with the form data above and gets a JSON response back. That's enough to start writing a simple little scraper.
2 The Python code
```python
import requests
import pandas as pd

url = 'https://nubs.nju.edu.cn/_wp3services/generalQuery?queryObj=teacherHome'
data = {'siteId': 295,
        'pageIndex': 1,
        'articleType': 1,
        'level': 1}
r = requests.post(url, data=data)
df = pd.DataFrame(r.json()['data'])  # build a DataFrame from the "data" field of the JSON response
df.to_excel('/Users/sqwqwqw1/Desktop/未命名文件夹/teacher.xls')  # export to Excel
```
As for the request headers, you can include them or leave them out; at least for our school's site, the request goes through fine without them.
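If the server ever did start rejecting bare requests, one could pass the captured headers along explicitly. This is just a sketch: the header values are copied from the DevTools capture above, and `fetch_page` is a name I made up, not part of the original script:

```python
import requests

URL = 'https://nubs.nju.edu.cn/_wp3services/generalQuery?queryObj=teacherHome'

# Headers copied from the DevTools capture; requests fills in the rest
# (Host, Content-Length, ...) automatically.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/78.0.3904.70 Safari/537.36'),
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://nubs.nju.edu.cn/8878/listm.htm',
}

def fetch_page(page_index):
    """POST the query form for one page and return the parsed JSON."""
    data = {'siteId': 295, 'pageIndex': page_index,
            'articleType': 1, 'level': 1}
    r = requests.post(URL, data=data, headers=HEADERS, timeout=10)
    r.raise_for_status()  # surface HTTP errors instead of failing silently
    return r.json()
```

Bumping `pageIndex` in a loop would be the natural way to collect every page of results, if the first response turns out to be paginated.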
The result looks like this:
Even to a beginner like me, this little program counts as fairly simple.