The CPA exams are finally over, so I have some time to relax and learn a bit of what I actually enjoy!
I still remember that my online Python course had only gotten to Chapter 2; I really have to thank Coursera for not deleting my account.
Let me warm up with the little crawlers I learned back then and see whether I've forgotten everything:
1 Crawl all the article titles and links on my blog and print them:
import requests
from bs4 import BeautifulSoup

url = 'https://www.imtrq.com/page/'
index = 1
for page in range(1, 17):            # archive pages 1-16, same range as the second script
    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.text, 'lxml')
    archives = soup.find_all('h1', 'entry-title')   # each post title is an <h1 class="entry-title">
    for item in archives:
        # item.string is the title text, item.contents[0] is the <a> that wraps it
        print(index, item.string, item.contents[0].get('href'))
        index += 1
Result:
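A quick aside: the same titles and links can also be grabbed with a CSS selector that targets the <a> inside each title directly. This is just a sketch of the idea; the 1-16 page range is assumed from the second script below, not something the original code guarantees.

import requests
from bs4 import BeautifulSoup

# Sketch only: select the <a> inside each h1.entry-title with a CSS selector.
url = 'https://www.imtrq.com/page/'
index = 1
for page in range(1, 17):            # assumed 16 archive pages
    soup = BeautifulSoup(requests.get(url + str(page)).text, 'lxml')
    for link in soup.select('h1.entry-title a'):
        print(index, link.get_text(strip=True), link.get('href'))
        index += 1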
2 Crawl the article thumbnails on my blog and save them locally
import requests
from bs4 import BeautifulSoup

url = 'https://www.imtrq.com/page/'
path = '/Users/XXXXX/Documents/photo/'
index = 1
for page in range(1, 17):
    response = requests.get(url + str(page))
    soup = BeautifulSoup(response.text, 'lxml')
    # WordPress gives each post thumbnail this set of classes
    fimg = soup.find_all('img', 'attachment-post-thumbnail size-post-thumbnail wp-post-image')
    for item in fimg:
        img = requests.get(item.get('src'))
        with open(path + str(index) + '.jpg', 'wb') as file:
            file.write(img.content)   # the with block flushes and closes the file automatically
        index += 1
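For bigger images it can be safer to stream the download and keep whatever extension the URL carries instead of assuming .jpg. A minimal sketch of that variation, using a helper I've called save_image (the name, the timeout and the chunk size are mine, not from the script above):

import os
import requests

def save_image(img_url, dest_dir, index):
    # Sketch only: stream the download and keep the extension from the URL,
    # falling back to .jpg when the URL has none.
    ext = os.path.splitext(img_url.split('?')[0])[1] or '.jpg'
    resp = requests.get(img_url, stream=True, timeout=10)
    resp.raise_for_status()            # bail out on broken image links
    with open(os.path.join(dest_dir, str(index) + ext), 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)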
Comments | 1 comment
傲娇的小基基 (blog author)
Before I read the official BeautifulSoup documentation, I thought the only way to pull the links was with regular expressions, and the code was twice as long, haha.
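For comparison, here is a rough sketch of what that regex-only version of the title/link extraction might look like; the pattern is only a guess at typical WordPress markup and is not taken from the post.

import re
import requests

# Sketch only: pull the title links with a regular expression instead of BeautifulSoup.
# The pattern guesses at the markup of <h1 class="entry-title"> blocks.
html = requests.get('https://www.imtrq.com/page/1').text
pattern = re.compile(r'<h1 class="entry-title">\s*<a href="([^"]+)"[^>]*>(.*?)</a>', re.S)
for href, title in pattern.findall(html):
    print(title, href)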