EPS数据狗 forum

[Python] Python网络数据采集 (Web Scraping with Python)

Posted on 2018-7-16 17:19:03

Web Scraping with Python (Python网络数据采集)
Translator's Preface
Preface
Part I: Building Scrapers
Chapter 1 Your First Web Scraper
1.1 Network Connections
1.2 An Introduction to BeautifulSoup
1.2.1 Installing BeautifulSoup
1.2.2 Running BeautifulSoup
1.2.3 Connecting Reliably
Chapter 2 Advanced HTML Parsing
2.1 You Don't Always Need a Hammer
2.2 Another Serving of BeautifulSoup
2.2.1 BeautifulSoup's find() and findAll()
2.2.2 Other BeautifulSoup Objects
2.2.3 Navigating Trees
2.3 Regular Expressions
2.4 Regular Expressions and BeautifulSoup
2.5 Accessing Attributes
2.6 Lambda Expressions
2.7 Beyond BeautifulSoup
Chapter 3 Starting to Crawl
3.1 Traversing a Single Domain
3.2 Crawling an Entire Site
3.3 Crawling Across the Internet
3.4 Crawling with Scrapy
Chapter 4 Using APIs
4.1 API Overview
4.2 Common API Conventions
4.2.1 Methods
4.2.2 Authentication
4.3 Server Responses
4.4 Echo Nest
4.5 Twitter API
4.5.1 Getting Started
4.5.2 A Few Examples
4.6 Google API
4.6.1 Getting Started
4.6.2 A Few Examples
4.7 Parsing JSON Data
4.8 Back to the Main Topic
4.9 A Bit More About APIs
Chapter 5 Storing Data
5.1 Media Files
5.2 Storing Data to CSV
5.3 MySQL
5.3.1 Installing MySQL
5.3.2 Basic Commands
5.3.3 Integrating with Python
5.3.4 Database Techniques and Best Practices
5.3.5 The "Six Degrees" Game in MySQL
5.4 Email
Chapter 6 Reading Documents
6.1 Document Encoding
6.2 Plain Text
6.3 CSV
6.4 PDF
6.5 Microsoft Word and .docx
Part II: Advanced Scraping
Chapter 7 Cleaning Data
7.1 Cleaning Data in Code
7.2 Cleaning Data After Storage
Chapter 8 Natural Language Processing
8.1 Summarizing Data
8.2 Markov Models
8.3 Natural Language Toolkit
8.3.1 Installation and Setup
8.3.2 Statistical Analysis with NLTK
8.3.3 Part-of-Speech Analysis with NLTK
8.4 Additional Resources
Chapter 9 Crawling Through Forms and Login Windows
9.1 The Python Requests Library
9.2 Submitting a Basic Form
9.3 Radio Buttons, Checkboxes, and Other Inputs
9.4 Submitting Files and Images
9.5 Handling Logins and Cookies
9.6 Other Form Problems
Chapter 10 Scraping JavaScript
10.1 An Introduction to JavaScript
10.2 Ajax and Dynamic HTML
10.3 Handling Redirects
Chapter 11 Image Recognition and Text Processing
11.1 Overview of OCR Libraries
11.1.1 Pillow
11.1.2 Tesseract
11.1.3 NumPy
11.2 Processing Well-Formatted Text
11.3 Reading CAPTCHAs and Training Tesseract
11.4 Retrieving CAPTCHAs and Submitting Solutions
Chapter 12 Avoiding Scraping Traps
12.1 Ethics
12.2 Making a Bot Look Like a Human User
12.2.1 Adjusting Request Headers
12.2.2 Handling Cookies
12.2.3 Timing Is Everything
12.3 Common Form Security Measures
12.3.1 Hidden Input Field Values
12.3.2 Avoiding Honeypots
12.4 Troubleshooting Checklist
Chapter 13 Testing Websites with Scrapers
13.1 An Introduction to Testing
13.2 Python Unit Testing
13.3 Selenium Unit Testing
13.4 Choosing Between Python Unit Testing and Selenium Unit Testing
Chapter 14 Scraping Remotely
14.1 Why Use Remote Servers
14.1.1 Avoiding IP Address Blocking
14.1.2 Portability and Extensibility
14.2 Tor Proxy Servers
14.3 Remote Hosting
14.3.1 Running from a Website Hosting Account
14.3.2 Running from a Cloud Host
14.4 Additional Resources
14.5 Moving Forward
Appendix A Python at a Glance
Appendix B The Internet at a Glance
Appendix C The Legalities and Ethics of Web Scraping
About the Author
About the Cover

Python网络数据采集.pdf

16.7 MB
