Python Notes - Scrapy Crawler


https://docs.scrapy.org/en/latest/intro/install.html

Installing Scrapy

python3 -m pip install --upgrade Scrapy

Installing on Windows may fail with the following error:

    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    ----------------------------------------
Command "C:\Users\Funfan\AppData\Local\Programs\Python\Python37\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Funfan\\AppData\\Local\\Temp\\pip-install-ilx4u1g0\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Funfan\AppData\Local\Temp\pip-record-lipvezrx\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Funfan\AppData\Local\Temp\pip-install-ilx4u1g0\Twisted\

In that case, manually install the required libraries from the prebuilt wheels at https://www.lfd.uci.edu/~gohlke/pythonlibs/

The wheel files follow this naming convention: {library name}-{library version}-cp{Python version}-cp{Python version}m-win{architecture}.whl.

For example, my machine runs 64-bit Windows with Python 3.7.0, so I need to download Twisted-18.7.0-cp37-cp37m-win_amd64.whl.

When downloading, make sure to pick the wheels that match your own Python version and Windows architecture.

Once downloaded, install them manually:

python3 -m pip install D:\Download\Twisted-18.7.0-cp37-cp37m-win_amd64.whl
python3 -m pip install D:\Download\pywin32-224-cp37-cp37m-win_amd64.whl
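
With both wheels installed, the original pip command for Scrapy should now complete without trying to build Twisted from source. You can verify the installation afterwards with:

scrapy version
python3 -c "import scrapy, twisted; print(scrapy.__version__, twisted.__version__)"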

A Simple Example

The simplest crawler example involves the following steps:

  1. Create a Scrapy project
  2. Define the Items to extract
  3. Write a spider that crawls the site and extracts the Items
  4. Write an Item Pipeline to store the extracted Items (i.e. the data)

Create the Project

scrapy startproject tb_spider

This creates a tb_spider project in the current directory; a single project can of course contain several spiders.

tb_spider/
    scrapy.cfg            # deploy configuration file

    tb_spider/            # the project's Python module; you import your code from here
        __init__.py

        items.py          # item definitions

        middlewares.py    # spider and downloader middlewares (built-in hooks)

        pipelines.py      # item pipelines (post-processing / storage hooks)

        settings.py       # main runtime settings

        spiders/          # the actual spider code
            __init__.py

I originally wanted to crawl Taobao or JD, but they use heavy anti-crawling measures and a lot of dynamically loaded content, which is very unfriendly to beginners. As a fallback, we will crawl a movie site instead: http://www.minimp4.com/movie

Define the Item

An Item is a container for the scraped data. It is used much like a Python dict, but adds protection against undefined fields caused by typos.

Fields are declared with scrapy.Field():

# -*- coding: utf-8 -*-

import scrapy

class MiniMp4Item(scrapy.Item):
    name = scrapy.Field()
    director = scrapy.Field()
    editors = scrapy.Field()
    actors = scrapy.Field()
    film_type = scrapy.Field()
    region = scrapy.Field()
    language = scrapy.Field()
    release_time = scrapy.Field()
    duration = scrapy.Field()
    alias = scrapy.Field()
    douban_point = scrapy.Field()
    imdb_point = scrapy.Field()
    description = scrapy.Field()
    resources = scrapy.Field()
    comments = scrapy.Field()
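
Because the fields are declared up front, assigning to an undeclared key fails immediately instead of silently creating a new one. A small illustration in a Python shell (the value is made up; it assumes the items.py above is importable as tb_spider.items):

from tb_spider.items import MiniMp4Item

item = MiniMp4Item()
item['name'] = ['Example Movie']   # OK: 'name' is a declared field
print(item['name'])                # ['Example Movie']
item['nmae'] = ['oops']            # raises KeyError: undeclared field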

Write the Spider

Spiders are generated from a template. Syntax: scrapy genspider [-t template] <name> <domain>; the available templates can be listed with scrapy genspider -l.

Create the spider:

cd tb_spider
scrapy genspider minimp4 www.minimp4.com

This generates minimp4.py in the spiders directory:

# -*- coding: utf-8 -*-
import scrapy


class Minimp4Spider(scrapy.Spider):
    name = 'minimp4'
    allowed_domains = ['www.minimp4.com']
    start_urls = ['http://www.minimp4.com/']

    def parse(self, response):
        pass
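
Before filling in parse(), it helps to experiment with selectors interactively in Scrapy's shell. A quick sketch (the URL and XPath are the same ones used in the finished spider below):

scrapy shell "http://www.minimp4.com/movie/?page=1"
>>> response.xpath('//div[@class="meta"]/h1/a/@href').extract()
>>> exit()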

Now adapt the generated spider to the actual site and extract the fields we need:

# -*- coding: utf-8 -*-
import scrapy
from tb_spider import items

class Minimp4Spider(scrapy.Spider):
    # Unique name of the spider
    name = 'minimp4'
    # Restrict crawling to this domain
    allowed_domains = ['www.minimp4.com']
    # Pages to crawl (list comprehension); the listing runs to page 2306,
    # but only the first couple of pages are requested here for testing
    start_urls = ['http://www.minimp4.com/movie/?page={0}'.format(page_num) for page_num in range(2)]

    def parse(self, response):
        # Find the detail-page URLs; xpath() returns a list of matches
        hrefs = response.xpath('//div[@class="meta"]/h1/a/@href').extract()
        for itemUrl in hrefs:
            # Follow each detail page (yield makes this a generator);
            # the callback parses the individual fields once the page is fetched
            yield scrapy.Request(itemUrl, callback=self.parseContent)

    def parseContent(self, response):
        name = response.xpath('//div[@class="movie-meta"]/h1/text()').extract()
        director = response.xpath('//div[@class="movie-meta"]/p/span[text()="导演:"]/following-sibling::*/text()').extract()
        editors = response.xpath('//div[@class="movie-meta"]/p/span[text()="编剧:"]/following-sibling::*/text()').extract()
        actors = response.xpath('//div[@class="movie-meta"]/p/span[text()="主演:"]/following-sibling::*/text()').extract()
        film_type = response.xpath('//div[@class="movie-meta"]/p/span[text()="类型:"]/following-sibling::*/text()').extract()
        region = response.xpath('//div[@class="movie-meta"]/p/span[text()="制片地区:"]/following-sibling::*/text()').extract()
        language = response.xpath('//div[@class="movie-meta"]/p/span[text()="语言:"]/../text()').extract()
        release_time = response.xpath('//div[@class="movie-meta"]/p/span[text()="上映时间:"]/../text()').extract()
        duration = response.xpath('//div[@class="movie-meta"]/p/span[text()="片长:"]/../text()').extract()
        alias = response.xpath('//div[@class="movie-meta"]/p/span[text()="又名:"]/../text()').extract()
        douban_point = response.xpath('//div[@class="movie-meta"]/p/a[starts-with(text(),"豆瓣")]/text()').extract()
        imdb_point = response.xpath('//div[@class="movie-meta"]/p/a[starts-with(text(),"IMDB")]/text()').extract()
        description = response.xpath('//div[@class="movie-introduce"]/p/text()').extract()
        resources = response.xpath('//div[@id="normalDown"]//child::a/@href').extract()
        comments = response.xpath('//div[contains(@class, "comment")]//child::div[@class="reply-content"]/text()').extract()

        miniItem = items.MiniMp4Item()
        miniItem['name'] = name
        miniItem['director'] = director
        miniItem['editors'] = editors
        miniItem['actors'] = actors
        miniItem['film_type'] = film_type
        miniItem['region'] = region
        miniItem['language'] = language
        miniItem['release_time'] = release_time
        miniItem['duration'] = duration
        miniItem['alias'] = alias
        miniItem['douban_point'] = douban_point
        miniItem['imdb_point'] = imdb_point
        miniItem['description'] = description
        miniItem['resources'] = resources
        miniItem['comments'] = comments
        # Yield the populated item to the item pipelines
        yield miniItem
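
Note that extract() always returns a list of strings, so every field stored above is a list. If you prefer single, cleaned-up values, you could normalize them before assigning; a minimal sketch with a hypothetical helper (first() is not part of Scrapy or of the code above):

def first(values, default=''):
    # Return the first non-empty, stripped string from an extract() result list
    for value in values:
        value = value.strip()
        if value:
            return value
    return default

# e.g. inside parseContent:
#     miniItem['name'] = first(name)
#     miniItem['douban_point'] = first(douban_point)

Scrapy selectors also offer extract_first() for the common case of keeping only the first match.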

Store the Data

Start with the simplest option: write each item to a JSON file.

# -*- coding: utf-8 -*-
import json

class MiniMp4SpiderPipeline(object):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Set of seen ids, used for de-duplication
        self.ids_seen = set()
        # Open the output file once, when the pipeline is created
        self.f = open('minimp4movie.json', 'ab')

    def process_item(self, item, spider):
        # De-duplication (left commented out; the items have no id field yet)
        # if item['id'] in self.ids_seen:
        #     raise DropItem("Duplicate item found: %s" % item)
        # else:
        #     self.ids_seen.add(item['id'])
        #     return item

        # Convert the scrapy Item to a dict, then serialize it to a JSON string
        data = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(data.encode('utf-8'))
        return item

    # Override close_spider to close the file when the spider finishes
    # (do not rename this method; Scrapy calls it by this exact name)
    def close_spider(self, spider):
        self.f.close()
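
For quick tests you can also skip the custom pipeline entirely: Scrapy's built-in feed export (the -o option of scrapy crawl) writes the yielded items straight to a file, e.g.

scrapy crawl minimp4 -o movies.json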

Settings (the key entries in settings.py)

To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting.

# Set default request headers (identify as a regular browser)
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'tb_spider.pipelines.MiniMp4SpiderPipeline': 300,
}

The integer assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.
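
For instance, the de-duplication logic that is commented out in the pipeline above could live in its own pipeline and run before the JSON writer by giving it a lower number. A sketch (DuplicatesPipeline is hypothetical and keys on the extracted name, since the items have no id field):

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        # item['name'] is a list of strings as extracted by the spider; join it into a hashable key
        key = ''.join(item.get('name', []))
        if key in self.names_seen:
            raise DropItem("Duplicate item found: %s" % key)
        self.names_seen.add(key)
        return item

# settings.py: lower numbers run first
ITEM_PIPELINES = {
    'tb_spider.pipelines.DuplicatesPipeline': 100,
    'tb_spider.pipelines.MiniMp4SpiderPipeline': 300,
}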

Run

scrapy list            # list all spiders in the project
scrapy crawl minimp4   # run the spider named minimp4