A Follow-Up on Scraping Housing Prices with Python

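I wrote yesterday's little script on a whim, and it always felt messy to me, so I rewrote it. This time it scrapes new-home listings from Anjuke (安居客), for no other reason than that Lianjia (链家) has rather few listings (Anjuke actually has even fewer; it just lists the sold-out developments as well).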

This time I wrapped the crawler in a class, so it can be reused to scrape different cities.
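As a rough illustration of what reusing it across cities means: the script in this post builds each listing URL from a city code (used as the subdomain) and a page number. Here is a minimal sketch of that URL scheme on its own. `build_page_urls` is a hypothetical helper written only for this example and is not part of the script; `'ha'` is the Huai'an code used at the end of the post.

```python
def build_page_urls(city, pages):
    """Return the listing-page URLs for one city (hypothetical helper)."""
    # Same base URL that the crawler class assembles in its constructor
    base = 'https://' + city + '.fang.anjuke.com/loupan/all/p'
    # Pages are numbered from 1, so page i maps to .../p<i>/
    return [base + str(i) + '/' for i in range(1, pages + 1)]


urls = build_page_urls('ha', 2)
for url in urls:
    print(url)
```

Switching cities then only means passing a different city code, which is what the class constructor takes as its first argument.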

'''
author: Yaodo
1 Scrape Anjuke housing prices and plot them
2 Anjuke counts sold-out developments as new homes, so I don't think its prices are a fair reflection
3 Listings with an average price below 4000 are filtered out of the results (there are really very few of them)
4 Listings priced per unit are also dropped by the filter in item 3
'''

import re

import requests
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

# A "city" class, initialized with a city code and a number of pages
class HousePriceCrawler(object):
    """Scrape new-home listings for one city and plot average price by district."""
    def __init__(self, city, page):
        self.city = city
        self.page = page
        self.raw_url = 'https://' + city + '.fang.anjuke.com/loupan/all/p'

    # Get the list of developments on page i
    def get_house_list(self, i):
        url = self.raw_url + str(i) + '/'
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        results = soup.find_all('div', "key-list imglazyload")
        items = results[0].find_all('div', 'item-mod')
        return items

    # Extract the district from one development
    def get_location(self, item):
        # Raw string avoids the invalid-escape warning for '\['
        district = re.compile(r'\[\xa0(.*?)\xa0')
        location = re.findall(district, str(item.find('span', 'list-map').string))[0]
        return location

    # Get the floor area (some developments give the total area, so this field is not very meaningful)
    def get_area(self, item):
        # Keep only digits, '-', '.' and the square-metre sign
        areapattern = re.compile(r'[0-9\-.㎡]')
        area = ''.join(re.findall(areapattern, str(item.find('a', 'huxing').find_all('span')[-1].string)))
        return area

    # Build the record (a dict) for one development
    def get_one(self, item):
        title = str(item.find('span', 'items-name').string)
        location = self.get_location(item)
        area = self.get_area(item)
        price = int(str(item.find('p', 'price').span.string))
        one = {
            'title': title,
            'location': location,
            'area': area,
            'price': price
        }
        return one

    # Loop over every page and every development, collecting the records in a list
    def Crawler(self):
        self.li = []
        for i in range(1, self.page+1):
            items = self.get_house_list(i)
            for item in items:
                try:
                    one = self.get_one(item)
                    self.li.append(one)
                except (AttributeError, IndexError, ValueError):
                    # Skip developments with a missing field or a non-numeric price
                    continue
            print('Page {} done…'.format(i))

    # Build a DataFrame from the list
    def get_frame(self):
        self.df = pd.DataFrame(self.li)
        # Keep only listings priced above 4000
        self.df = self.df[self.df.price > 4000]

    # Average the prices by district and plot them with seaborn
    def draw_plot(self):
        self.Crawler()
        self.get_frame()
        # Average only the numeric price column; title and area are strings
        price = self.df.groupby('location')[['price']].mean()
        data = price.sort_values('price', ascending=False)
        f, ax = plt.subplots(figsize=(14, 8))
        sns.barplot(x=data.index, y=data['price'], ax=ax)
        plt.show()

# An example instance (Huai'an)
HA = HousePriceCrawler('ha', 7)
HA.draw_plot()
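
To see what the two regular expressions in `get_location` and `get_area` actually pull out, here is a small offline sketch that uses only the standard library. The sample strings are made up for illustration. They only assume that the page puts the district name between non-breaking spaces after an opening bracket, as the pattern in the script implies; the real Anjuke markup may differ. `extract_district` and `extract_area` are hypothetical helper names, not part of the script.

```python
import re

# Same patterns as in the script above, written as raw strings
DISTRICT = re.compile(r'\[\xa0(.*?)\xa0')
AREA_CHARS = re.compile(r'[0-9\-.㎡]')


def extract_district(text):
    """Return the text between '[' + NBSP and the next NBSP, or None."""
    match = DISTRICT.search(text)
    return match.group(1) if match else None


def extract_area(text):
    """Keep only digits, '-', '.' and the square-metre sign."""
    return ''.join(AREA_CHARS.findall(text))


# Hypothetical sample strings, not copied from the site
sample_location = '[\xa0清江浦区\xa0某街道\xa0]\xa0某路'
sample_area = '建筑面积：85-120㎡'

district = extract_district(sample_location)
area = extract_area(sample_area)
print(district, area)  # prints: 清江浦区 85-120㎡
```

Unlike the `[0]` indexing in the script, this version returns `None` when the bracketed district is missing, which is one way to avoid relying on the `try`/`except` in the crawl loop.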