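This book shows how to develop web crawlers in Python. Starting from the basic features of the Python language, it covers every aspect of Python crawler development in detail, drawing on HTTP, HTML, JavaScript, regular expressions, natural language processing, data science, and other fields. The book has 10 chapters covering topics including Python fundamentals, website analysis, web page parsing, file reading and writing in Python, Python and databases, AJAX, simulated login, text and data analysis, website testing, the Scrapy crawler framework, and crawler performance. It is suitable as a course textbook for computer-related majors at higher vocational colleges and as a reference for practitioners in computing-related fields.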
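耿兴隆 (Geng Xinglong) is chief expert at the Autodesk China Certification Examination Center, where he has overall responsibility for drawing up the syllabus of Autodesk's official certification examinations in China, building the question bank, providing technical consulting, and training instructors. Many of the textbooks he has written have become influential, leading works in China, and he holds a prominent position among Chinese authors in the relevant specialist fields.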
Contents
Project 1: Python Fundamentals
Task 1: Overview of Python
1. Introduction to Python
2. Installing Python
3. Installing PyCharm
4. Python syntax conventions
Task 2: Components of Python Commands
1. Basic symbols
2. Constants and variables
3. Data types
4. Functional symbols
Task 3: Program Structure
1. Expression statements
2. Sequential structure
3. Selection structure
4. Loop structure
5. Conditional expressions
6. Program flow control
Project Practice
Practice: Printing the Baidu URL
Project 2: Web Crawler Fundamentals
Task 1: Overview of Web Crawlers
1. Basic principles of web crawlers
2. Web crawler system architecture
3. Crawling strategies
4. Types of web crawlers
5. Open-source web crawler frameworks and projects
Task 2: HTTP
1. How HTTP works
2. The Urllib module library
3. URL definition
4. URL encoding settings
Task 3: The Web Page Request Process
1. Sending the request message
2. Returning the response
3. HTTP messages
Project Practice
Practice 1: Searching for Product URLs
Practice 2: Searching for Food Price URLs
Project 3: Using the Urllib Request Module Library
Task 1: Sending Web Page Requests
1. Basic HTTP requests
2. Network requests with Request
3. Setting request headers
4. Sending requests with Handler methods
5. Setting a proxy IP
6. Authentication
Task 2: Downloading Web Pages
1. Web page structure
2. Writing web page files
3. Downloading web page files
Project Practice
Practice 1: Downloading a Python Learning URL
Practice 2: Downloading a Company Web Page as an HTML File
Project 4: Installing the Urllib3 Request Module Library and Sending Requests
Task 1: Installing the Urllib3 Request Module Library
1. Installing Anaconda
2. Installing the Urllib3 module library
Task 2: Sending Requests
1. Creating a proxy object
2. Request methods
3. Defining request headers
4. Setting a proxy IP
5. Automatic retries
6. Redirects
Project Practice
Practice: Sending a Request to Taobao
Project 5: Using the Requests Request Module Library
Task 1: Web Page Requests
1. Standard HTTP requests
2. Returned response messages
3. JSON data
Task 2: Request Methods
1. Sending GET requests
2. Sending POST requests
3. Other request methods
Task 3: Complex Network Requests
1. Complex request headers
2. Uploading files
3. Cookie verification
4. Session persistence
Task 4: Exception Handling
1. The try-except statement
2. The Urllib exception handling module
3. The Urllib3 exception handling module
4. The Requests exception handling module
Project Practice
Practice: Crawling the URLs of Douban's Most Popular Film Reviews
Project 6: Parsing Web Pages
Task 1: Parsing Web Pages with Regular Expressions
1. Regular expression patterns
2. Using the re module for regular expressions
3. Searching strings
4. Replacing strings
5. Splitting strings
Task 2: Parsing Web Pages with XPath
1. Overview of XPath
2. Parsing web pages with XPath
3. Getting node information
4. Node relationships
5. Finding node information
6. Attribute nodes
7. XPath operators
8. XML node axes
Task 3: Parsing Web Pages with BeautifulSoup
1. Installing BeautifulSoup
2. Creating a BeautifulSoup object
3. Getting node content by attribute
4. Getting nodes by node relationship
5. Finding node content
6. Finding node content with CSS selectors
Project Practice
Practice 1: Getting the Postal Code and Area Code of Shijiazhuang, Hebei Province, from a Lookup Site
Practice 2: Crawling the Titles of Best-Selling Books
Practice 3: Downloading Images of Best-Selling Books
Project 7: The Scrapy Web Crawler Framework
Task 1: Scrapy Web Crawler Framework Fundamentals
1. Basics of the Scrapy web crawler framework
2. Common Scrapy commands
3. Creating a Scrapy project
Task 2: Creating Spider Files from Templates
1. The command for creating crawler files
2. Creating a basic template file
3. Creating a crawl template file
4. Creating a csvfeed template file
5. Creating an xmlfeed template file
Task 3: Scrapy Web Crawler Files
1. The Spider class
2. Configuring the crawler
3. Starting the crawler
4. Extracting data
Project Practice
Practice: Extracting Scenic Spot Names