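So that I don't completely forget what I know about web scraping, I wrote a small program for practice. It is actually much simpler than any of the Scrapy programs I have written; the only step forward is probably that it uses the MySQL knowledge I picked up a few days ago.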
import re

import pymysql
import requests
from lxml import etree

headers = {
    "user-agent": "your User-Agent"
}
connection = pymysql.connect(
    host='localhost',
    user='root',
    password='your password',
    database='animes',
    charset='utf8mb4',
)
cursor = connection.cursor()

for year in range(2010, 2020):
    for season in ['winter', 'fall', 'summer', 'spring']:
        url = 'https://myanimelist.net/anime/season/{}/{}'.format(year, season)
        r = requests.get(url, headers=headers)
        response = etree.HTML(r.content)
        seasonal_anime_list = response.xpath('//div[@class="seasonal-anime js-seasonal-anime"]')
        for anime in seasonal_anime_list:
            title = anime.xpath('.//a[@class="link-title"]/text()')[0]
            producer = anime.xpath('.//span[@class="producer"]//text()')[0]
            # Numeric fields fall back to 0 when missing or not parseable.
            try:
                eps = int(re.findall(r'\d+', ''.join(anime.xpath('.//div[@class="eps"]//text()')))[0])
            except (IndexError, ValueError):
                eps = 0
            try:
                score = float(re.sub(r'\s', '', anime.xpath('.//span[@title="Score"]//text()')[0]))
            except (IndexError, ValueError):
                score = 0
            try:
                members = int(re.sub(r'\s|,', '', anime.xpath('.//span[@title="Members"]//text()')[0]))
            except (IndexError, ValueError):
                members = 0
            source = anime.xpath('.//span[@class="source"]//text()')[0]
            genre = ', '.join(re.findall(r'\S+', ''.join(anime.xpath('.//span[@class="genre"]//text()'))))
            anime_json = {
                'title': title,
                'producer': producer,
                'eps': eps,
                'score': score,
                'members': members,
                'source': source,
                'genre': genre,
                'season': season,
                'year': year
            }
            # Parameterized queries let the driver quote values, so titles
            # containing apostrophes need no manual escaping.
            cursor.execute("SELECT COUNT(*) FROM `MyAnimeList` WHERE `title` = %s", (title,))
            if_exist = cursor.fetchone()[0]
            if if_exist == 0:
                columns = ', '.join('`{}`'.format(k) for k in anime_json)
                placeholders = ', '.join(['%s'] * len(anime_json))
                insert_sql = 'INSERT INTO `MyAnimeList` ({}) VALUES ({})'.format(columns, placeholders)
                cursor.execute(insert_sql, list(anime_json.values()))
                connection.commit()

connection.close()
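The three try/except blocks in the script all do the same job: pull a number out of scraped text and fall back to 0 when nothing usable is there. A minimal sketch of how that could be factored out, assuming two hypothetical helpers, `parse_int` and `parse_float`, that are not part of the original script:

```python
import re


def parse_int(text, default=0):
    """Return the first run of digits in `text` as an int (commas ignored)."""
    # Hypothetical helper: thousands separators such as "1,234" are removed first.
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else default


def parse_float(text, default=0.0):
    """Return `text` as a float after stripping whitespace, else `default`."""
    # Hypothetical helper: anything float() rejects yields the default.
    try:
        return float(re.sub(r'\s', '', text))
    except ValueError:
        return default
```

With helpers like these, each field would become a single call on the joined XPath text, for example `parse_int(''.join(anime.xpath('.//div[@class="eps"]//text()')))`.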
Before running this program, you first need to create the database and table in MySQL:
DROP DATABASE IF EXISTS animes;
CREATE DATABASE animes;
USE animes;
DROP TABLE IF EXISTS MyAnimeList;
CREATE TABLE MyAnimeList(
    `id` INT AUTO_INCREMENT,
    `title` TEXT,
    `producer` TEXT,
    `eps` INT,
    `score` FLOAT,
    `members` INT,
    `source` TEXT,
    `genre` TEXT,
    `season` VARCHAR(6),
    `year` VARCHAR(4),
    PRIMARY KEY (`id`)
);
Honestly, I can't really say what the point of doing this is; just treat it as practice.
Oh, and while we're at it, let's take a look at how Americans rate anime.
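One way to look at the scores once the table is filled is to average them per year. This is only a sketch: `average_score_by_year` is a hypothetical function name, it assumes a DB-API cursor such as the one returned by `connection.cursor()` in the script, and it skips rows with a score of 0 because the scraper stores 0 when no score was found.

```python
def average_score_by_year(cursor):
    """Return (year, average score) rows, ignoring unscored entries."""
    # score = 0 marks a missing score in the scraper, so it is filtered out.
    cursor.execute(
        "SELECT `year`, AVG(`score`) FROM `MyAnimeList` "
        "WHERE `score` > 0 GROUP BY `year` ORDER BY `year`"
    )
    return cursor.fetchall()


# Possible usage with the pymysql connection from the script:
# for year, avg in average_score_by_year(connection.cursor()):
#     print(year, round(avg, 2))
```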