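So that I don't completely forget what I know about web scraping, I wrote a small program for practice. It is actually much simpler than any of the Scrapy programs I've written; the only real progress is probably that it uses the MySQL knowledge I picked up a few days ago.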

import re

import pymysql
import requests
from lxml import etree

headers = {
    "user-agent": "your User-Agent"
}

connection = pymysql.connect(
                    host='localhost',
                    user='root',
                    password='your password',
                    db='animes',
                    charset='utf8mb4',
                )
cursor = connection.cursor()

for year in range(2010, 2020):
    # Chronological order, so the duplicate check below keeps the earliest season seen.
    for season in ['winter', 'spring', 'summer', 'fall']:
        url = 'https://myanimelist.net/anime/season/{}/{}'.format(year, season)
        r = requests.get(url, headers=headers)
        response = etree.HTML(r.content)
        seasonal_anime_list = response.xpath('//div[@class="seasonal-anime js-seasonal-anime"]')
        for anime in seasonal_anime_list:
            # Values reach MySQL through placeholders, so quotes need no manual escaping.
            title = anime.xpath('.//a[@class="link-title"]/text()')[0]
            producer = anime.xpath('.//span[@class="producer"]//text()')[0]
            # Each numeric field falls back to 0 when it is missing or not a number.
            try:
                eps = re.findall(r'\d+', ''.join(anime.xpath('.//div[@class="eps"]//text()')))[0]
                eps = int(eps)
            except (IndexError, ValueError):
                eps = 0
            try:
                score = re.sub(r'\s', '', anime.xpath('.//span[@title="Score"]//text()')[0])
                score = float(score)
            except (IndexError, ValueError):
                score = 0
            try:
                members = re.sub(r'\s|,', '', anime.xpath('.//span[@title="Members"]//text()')[0])
                members = int(members)
            except (IndexError, ValueError):
                members = 0
            source = anime.xpath('.//span[@class="source"]//text()')[0]
            genre = ', '.join(re.findall(r'\S+', ''.join(anime.xpath('.//span[@class="genre"]//text()'))))
            anime_json = {
                'title': title,
                'producer': producer,
                'eps': eps,
                'score': score,
                'members': members,
                'source': source,
                'genre': genre,
                'season': season,
                'year': year
            }
            # Insert the row only if no row with the same title is stored yet.
            query_sql = "SELECT count(*) FROM `MyAnimeList` WHERE `title`=%s;"
            cursor.execute(query_sql, (title,))
            if_exist = cursor.fetchone()[0]
            if if_exist == 0:
                columns = ', '.join('`{}`'.format(i) for i in anime_json.keys())
                placeholders = ', '.join(['%s'] * len(anime_json))
                insert_sql = 'INSERT INTO `MyAnimeList`({}) VALUES({})'.format(columns, placeholders)
                cursor.execute(insert_sql, list(anime_json.values()))
                connection.commit()
connection.close()
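
The three try/except blocks in the loop share one pattern: strip whitespace and separators from the scraped text, convert it to a number, and fall back to 0 when that fails. Here is a minimal sketch of that pattern as standalone helpers. `parse_int` and `parse_float` are hypothetical names that the script above does not define, and the sample strings are made up for illustration rather than taken from the site.

```python
import re


def parse_int(text, default=0):
    """Return the first run of digits in text as an int, ignoring commas."""
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else default


def parse_float(text, default=0.0):
    """Return text as a float after removing whitespace, or default if it is not numeric."""
    try:
        return float(re.sub(r'\s', '', text))
    except ValueError:
        return default


if __name__ == '__main__':
    print(parse_int('12 eps'))       # 12
    print(parse_int('1,234,567'))    # 1234567
    print(parse_float(' 8.45 \n'))   # 8.45
    print(parse_float('N/A'))        # 0.0
```

Catching only `ValueError` (instead of a bare `except`) keeps typos and other unrelated errors visible while still handling missing or non-numeric fields.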

Before running this program, you need to create the database and the table in MySQL:

DROP DATABASE IF EXISTS animes;
CREATE DATABASE animes;
USE animes;
DROP TABLE IF EXISTS MyAnimeList;
CREATE TABLE MyAnimeList(
	`id` INT AUTO_INCREMENT,
	`title` TEXT,
	`producer` TEXT,
	`eps` INT,
	`score` FLOAT,
	`members` INT,
	`source` TEXT,
	`genre` TEXT,
	`season` VARCHAR(6),
	`year` VARCHAR(4),
	PRIMARY KEY (`id`)
);
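
Column names cannot be passed as query parameters, only values can. One way to insert a scraped row into this table is therefore to check the dictionary keys against the known columns and leave the values to the driver's `%s` placeholders. The sketch below assumes the column list from the CREATE TABLE statement above; `build_insert` is a hypothetical helper, not something the original script defines.

```python
# Columns from the CREATE TABLE statement above, minus the auto-increment id.
COLUMNS = ('title', 'producer', 'eps', 'score', 'members',
           'source', 'genre', 'season', 'year')


def build_insert(row, table='MyAnimeList'):
    """Return (sql, params) for a parameterized INSERT of row into table."""
    unknown = set(row) - set(COLUMNS)
    if unknown:
        raise ValueError('unknown columns: {}'.format(sorted(unknown)))
    names = ', '.join('`{}`'.format(name) for name in row)
    marks = ', '.join(['%s'] * len(row))
    sql = 'INSERT INTO `{}` ({}) VALUES ({})'.format(table, names, marks)
    return sql, tuple(row.values())


if __name__ == '__main__':
    # The title is a made-up sample containing an apostrophe.
    sql, params = build_insert({'title': "It's a sample title", 'eps': 12})
    print(sql)
    print(params)
```

With pymysql the pair would be passed as `cursor.execute(sql, params)`, which leaves escaping the apostrophe in the title to the driver.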

Honestly, I can't say what the point of doing this is; let's just call it practice.

Oh, and while we're at it, let's take a look at how American viewers rate anime.
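
One possible way to look at the ratings once the table is filled is to sort the rows by score. This is only a sketch under assumptions, not the query actually used for this post: `top_rated` is a hypothetical function, and the `min_members` threshold is an arbitrary choice meant to skip titles that very few users have added.

```python
def top_rated(cursor, limit=10, min_members=10000):
    """Return the highest-scored rows as (title, score, members, year, season) tuples."""
    sql = (
        'SELECT `title`, `score`, `members`, `year`, `season` '
        'FROM `MyAnimeList` '
        'WHERE `members` >= %s '
        'ORDER BY `score` DESC '
        'LIMIT %s'
    )
    # Both values are passed as parameters rather than formatted into the string.
    cursor.execute(sql, (min_members, limit))
    return cursor.fetchall()
```

It would be called with a cursor from an open pymysql connection to the `animes` database, for example `top_rated(connection.cursor(), limit=20)`.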