Python Notes - Common Libraries

Python libraries mentioned in this article

python3 -m pip install --upgrade pandas
python3 -m pip install --upgrade requests
python3 -m pip install --upgrade pillow
python3 -m pip install --upgrade beautifulsoup4
python3 -m pip install --upgrade lxml

10 Minutes to pandas

http://pandas.pydata.org/

Install (on Windows this needs Run as Administrator):

python3 -m pip install --upgrade pandas

Data processing usually needs the following modules imported:

import pandas as pd             # pandas
import numpy as np              # NumPy, numerical computing library
import matplotlib.pyplot as plt # plotting

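Core data structures

The core of pandas is two data structures: Series and DataFrame.

Name	Dimensions	Description
Series	1D	Labeled array of a single (homogeneous) type; can be thought of as one axis of values
DataFrame	2D	Labeled, size-mutable table whose columns may hold different (heterogeneous) types; mostly used as a table of data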
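The difference between the two structures can be sketched with a tiny example. The labels, column names and values below are made up for illustration and do not come from the rest of these notes.

```python
import pandas as pd

# A Series holds values of one dtype, each with a label.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.dtype)        # a single dtype for the whole Series
print(s['b'])         # look up a value by its label

# A DataFrame is a table whose columns may have different dtypes.
table = pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [1.5, 2.5, 3.5]})
print(table.dtypes)
print(table.shape)    # (rows, columns)

# Each column of a DataFrame is itself a Series.
print(type(table['score']))
```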

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a Series
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
print('=' * 10)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
==========
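The Series above contains one NaN. As a small sketch of how missing values are commonly handled (the choice of 0 as a fill value is only an illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.isna().sum())   # number of missing values
print(s.dropna())       # a copy without the missing entry
print(s.fillna(0))      # a copy with the missing entry replaced
print(s.mean())         # NaN is skipped by default
```

None of these calls modifies s; each returns a new object.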

# Date index
dates = pd.date_range('20181001', periods=6)
print(dates)
print('=' * 10)

# Create a DataFrame from a NumPy array
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
print('=' * 10)

DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
               '2018-10-05', '2018-10-06'],
              dtype='datetime64[ns]', freq='D')
==========
                   A         B         C         D
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170 -2.858133
2018-10-03 -1.466480  0.040343 -0.197412  1.386438
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
2018-10-05  0.999744 -0.115692  0.127083  1.171267
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852
==========

# Create a DataFrame from a Python dict
df2 = pd.DataFrame({ 'A' : 1.,
                   'B' : pd.Timestamp('20181002'),
                   'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D' : np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),
                   'F' : 'foo' })
print(df2)
print('=' * 10)

     A          B    C  D      E    F
0  1.0 2018-10-02  1.0  3   test  foo
1  1.0 2018-10-02  1.0  3  train  foo
2  1.0 2018-10-02  1.0  3   test  foo
3  1.0 2018-10-02  1.0  3  train  foo
==========

The columns of a DataFrame can have different dtypes.

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
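Once the dtypes are known, columns can be selected or converted by type. The following is a minimal sketch on a reduced version of df2 (only three of its columns are rebuilt here, which is an assumption made to keep the example short).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.0,
                    'D': np.array([3] * 4, dtype='int32'),
                    'F': 'foo'})

# Keep only the numeric columns.
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))

# astype returns a converted copy; df2 itself is unchanged.
d_as_float = df2['D'].astype('float64')
print(d_as_float.dtype, df2['D'].dtype)
```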

Basic operations

View the first rows

df.head(2)

	A	B	C	D
2018-10-01	1.074291	1.406615	0.590643	0.096200
2018-10-02	1.417597	1.884438	0.818170	-2.858133

View the last rows

df.tail(3)

	A	B	C	D
2018-10-04	-0.200471	1.146282	-0.324586	0.700680
2018-10-05	0.999744	-0.115692	0.127083	1.171267
2018-10-06	-1.693359	0.100852	-0.989874	-0.712852

print(df.index)
print(df.columns)
print(df.values)

DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
               '2018-10-05', '2018-10-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 1.07429103  1.4066146   0.59064325  0.09620024]
 [ 1.41759707  1.88443823  0.81816967 -2.85813293]
 [-1.46647964  0.04034268 -0.19741215  1.38643847]
 [-0.20047085  1.14628163 -0.32458641  0.70068001]
 [ 0.9997445  -0.1156922   0.12708306  1.17126667]
 [-1.6933592   0.10085231 -0.9898741  -0.71285152]]

Built-in summary statistics

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.021887	0.743806	0.004004	-0.036067
std	1.357830	0.842536	0.657038	1.578833
min	-1.693359	-0.115692	-0.989874	-2.858133
25%	-1.149977	0.055470	-0.292793	-0.510589
50%	0.399637	0.623567	-0.035165	0.398440
75%	1.055654	1.341531	0.474753	1.053620
max	1.417597	1.884438	0.818170	1.386438

Transpose

df.T

	2018-10-01 00:00:00	2018-10-02 00:00:00	2018-10-03 00:00:00	2018-10-04 00:00:00	2018-10-05 00:00:00	2018-10-06 00:00:00
A	1.074291	1.417597	-1.466480	-0.200471	0.999744	-1.693359
B	1.406615	1.884438	0.040343	1.146282	-0.115692	0.100852
C	0.590643	0.818170	-0.197412	-0.324586	0.127083	-0.989874
D	0.096200	-2.858133	1.386438	0.700680	1.171267	-0.712852

Sorting

df.sort_index(axis=1, ascending=False) # sort by column labels, descending

	D	C	B	A
2018-10-01	0.096200	0.590643	1.406615	1.074291
2018-10-02	-2.858133	0.818170	1.884438	1.417597
2018-10-03	1.386438	-0.197412	0.040343	-1.466480
2018-10-04	0.700680	-0.324586	1.146282	-0.200471
2018-10-05	1.171267	0.127083	-0.115692	0.999744
2018-10-06	-0.712852	-0.989874	0.100852	-1.693359

df.sort_index(axis=0, ascending=False) # sort by row labels, descending

	A	B	C	D
2018-10-06	-1.693359	0.100852	-0.989874	-0.712852
2018-10-05	0.999744	-0.115692	0.127083	1.171267
2018-10-04	-0.200471	1.146282	-0.324586	0.700680
2018-10-03	-1.466480	0.040343	-0.197412	1.386438
2018-10-02	1.417597	1.884438	0.818170	-2.858133
2018-10-01	1.074291	1.406615	0.590643	0.096200

Sort by column values

df.sort_values(by='B')

	A	B	C	D
2018-10-05	0.999744	-0.115692	0.127083	1.171267
2018-10-03	-1.466480	0.040343	-0.197412	1.386438
2018-10-06	-1.693359	0.100852	-0.989874	-0.712852
2018-10-04	-0.200471	1.146282	-0.324586	0.700680
2018-10-01	1.074291	1.406615	0.590643	0.096200
2018-10-02	1.417597	1.884438	0.818170	-2.858133
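sort_values also accepts several columns, each with its own direction. A small sketch with made-up data (the column names group and value are hypothetical):

```python
import pandas as pd

items = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                      'value': [2, 4, 1, 3]})

# Sort by group ascending, then by value descending within each group.
ordered = items.sort_values(by=['group', 'value'], ascending=[True, False])
print(ordered)

# The original frame keeps its order; sort_values returns a new object.
print(items)
```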

Selecting data

df['A'] # select a column

2018-10-01    1.074291
2018-10-02    1.417597
2018-10-03   -1.466480
2018-10-04   -0.200471
2018-10-05    0.999744
2018-10-06   -1.693359
Freq: D, Name: A, dtype: float64

df[2:4] # select rows by position slice

	A	B	C	D
2018-10-03	-1.466480	0.040343	-0.197412	1.386438
2018-10-04	-0.200471	1.146282	-0.324586	0.700680
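Besides plain indexing, rows and columns can be selected together by label with loc or by position with iloc. This sketch uses a frame with the same index and column names as df, but filled with np.arange values (an assumption, so the result is reproducible).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20181001', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based: both ends of a label slice are included.
by_label = frame.loc['2018-10-03':'2018-10-04', ['A', 'B']]

# Position-based: the end position is excluded, like Python slices.
by_position = frame.iloc[2:4, 0:2]

print(by_label)
print(by_label.equals(by_position))

# A single scalar, by label and by position.
print(frame.at[dates[0], 'A'], frame.iat[0, 0])
```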

Filtering data (comparable to filtering in Excel)

df[df.A > 0] # filter on column A, keeping rows where A > 0

	A	B	C	D
2018-10-01	1.074291	1.406615	0.590643	0.096200
2018-10-02	1.417597	1.884438	0.818170	-2.858133
2018-10-05	0.999744	-0.115692	0.127083	1.171267

df[df > 0] # keep values > 0 (all other cells become NaN)

	A	B	C	D
2018-10-01	1.074291	1.406615	0.590643	0.096200
2018-10-02	1.417597	1.884438	0.818170	NaN
2018-10-03	NaN	0.040343	NaN	1.386438
2018-10-04	NaN	1.146282	NaN	0.700680
2018-10-05	0.999744	NaN	0.127083	1.171267
2018-10-06	NaN	0.100852	NaN	NaN
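Row filters can be combined. A minimal sketch with made-up values: conditions are joined with & (and), | (or) and ~ (not), and each condition needs its own parentheses.

```python
import pandas as pd

data = pd.DataFrame({'A': [1.0, -2.0, 3.0, -4.0],
                     'B': [-1.0, 2.0, 3.0, -4.0]})

both_positive = data[(data['A'] > 0) & (data['B'] > 0)]
either_positive = data[(data['A'] > 0) | (data['B'] > 0)]
a_not_positive = data[~(data['A'] > 0)]

# query() expresses the same filter as a string.
same_as_both = data.query('A > 0 and B > 0')

print(both_positive)
print(either_positive)
print(a_not_positive)
```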

Filtering with isin()

df3 = df.copy()
df3['E'] = ['one', 'one','two','three','four','three']
df3

	A	B	C	D	E
2018-10-01	1.074291	1.406615	0.590643	0.096200	one
2018-10-02	1.417597	1.884438	0.818170	-2.858133	one
2018-10-03	-1.466480	0.040343	-0.197412	1.386438	two
2018-10-04	-0.200471	1.146282	-0.324586	0.700680	three
2018-10-05	0.999744	-0.115692	0.127083	1.171267	four
2018-10-06	-1.693359	0.100852	-0.989874	-0.712852	three

df3[df3['E'].isin(['two','four'])]

	A	B	C	D	E
2018-10-03	-1.466480	0.040343	-0.197412	1.386438	two
2018-10-05	0.999744	-0.115692	0.127083	1.171267	four

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x25a6b74eb70>


df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x25a6da42e80>


Saving and loading

df.to_csv('foo.csv')

pd.read_csv('foo.csv')

	Unnamed: 0	A	B	C	D
0	2018-10-01	-1.220849	0.347703	-1.294972	-0.127536
1	2018-10-02	0.965405	0.129895	-0.891966	-0.623168
2	2018-10-03	-0.716768	2.250310	1.579512	0.048968
3	2018-10-04	0.962764	-0.748119	0.382946	0.288096
4	2018-10-05	-0.798654	0.237288	-0.974988	0.055704
5	2018-10-06	-0.072705	0.690992	-0.407430	-0.715135
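In the output above the date index comes back as an ordinary column named "Unnamed: 0". One way to restore it, sketched here with an in-memory buffer instead of foo.csv (an assumption, so that no file is written), is to pass index_col and parse_dates to read_csv.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20181001', periods=3)
original = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=list('AB'))

# Write the CSV into a text buffer, then rewind it for reading.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored.index)
print(list(restored.columns))
```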

Web scraping

urllib

https://docs.python.org/3/library/urllib.html

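urllib contains four modules for working with URLs:

urllib.request sends requests and retrieves the responses
urllib.error contains the exceptions raised by urllib.request
urllib.parse parses and manipulates URLs
urllib.robotparser parses a site's robots.txt file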
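Two of these modules can be tried without any network access. The sketch below uses urllib.parse on a sample URL and feeds urllib.robotparser a hand-written robots.txt; the rules and the example.com URLs are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components and decode the query string.
parts = urlparse('https://httpbin.org/get?key1=value1&key2=value2')
print(parts.scheme, parts.netloc, parts.path)
print(parse_qs(parts.query))

# Build a query string from a dict and resolve a path against a base URL.
query = urlencode({'key1': 'value1', 'key2': 'value2'})
resolved = urljoin('https://httpbin.org/docs/', '/get')
print(query, resolved)

# Check URLs against robots.txt rules supplied as lines of text.
robots = RobotFileParser()
robots.parse(['User-agent: *', 'Disallow: /private/'])
print(robots.can_fetch('*', 'https://example.com/private/page'))
print(robots.can_fetch('*', 'https://example.com/public'))
```

For a live site, RobotFileParser.set_url followed by read fetches the real robots.txt instead of the hand-written lines.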

Requests

http://docs.python-requests.org/en/master/

Requests is a friendlier HTTP library for Python 3 and supports more advanced features than urllib:

Keep-Alive & Connection Pooling
International Domains and URLs
Sessions with Cookie Persistence
Browser-style SSL Verification
Automatic Content Decoding
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
HTTP(S) Proxy Support
Multipart File Uploads
Streaming Downloads
Connection Timeouts
Chunked Requests
.netrc Support

Installation

python3 -m pip install --upgrade requests

Simple example

import requests

# GET
r = requests.get('https://api.github.com/events')
# GET with query parameters
r = requests.get('https://httpbin.org/get', params={'key1': 'value1', 'key2': 'value2'})
# PUT (requests.post takes the same arguments for POST)
r = requests.put('https://httpbin.org/put', data={'key': 'value'})
# DELETE
r = requests.delete('https://httpbin.org/delete')
# HEAD
r = requests.head('https://httpbin.org/get')
# OPTIONS
r = requests.options('https://httpbin.org/get')
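To see what params and data turn into without sending anything, a request can be built and prepared locally. This is only a sketch for inspection; the calls above send the request directly.

```python
import requests

# Build the requests without sending them (no network access needed).
get_req = requests.Request('GET', 'https://httpbin.org/get',
                           params={'key1': 'value1', 'key2': 'value2'}).prepare()
put_req = requests.Request('PUT', 'https://httpbin.org/put',
                           data={'key': 'value'}).prepare()

print(get_req.url)                      # params end up in the query string
print(put_req.body)                     # data is form-encoded into the body
print(put_req.headers['Content-Type'])
```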

Quick Start

Getting the response

import requests

# GET
r = requests.get('https://api.github.com/events')
# print('Get response: {0}'.format(r.text))    # text
# print('Get response: {0}'.format(r.content)) # binary content
# print('Get response in json: {0}'.format(r.json())) # parsed JSON
print('Get response encoding: {0}'.format(r.encoding))
print('Get response status code: {0}'.format(r.status_code))
print('Get response headers: {0}'.format(r.headers))

Get response encoding: utf-8
Get response status code: 200
Get response headers: {'Server': 'GitHub.com', 'Date': 'Sat, 06 Oct 2018 07:57:56 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '200 OK', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '55', 'X-RateLimit-Reset': '1538815630', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept', 'ETag': 'W/"e791f13d15e5fa81cdb229a2aeb70031"', 'Last-Modified': 'Sat, 06 Oct 2018 07:52:56 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/events?page=2>; rel="next", <https://api.github.com/events?page=10>; rel="last"', 'Access-Control-Expose-Headers': 'ETag, Link, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'X-Runtime-rack': '0.364552', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': '9371:15CB:2D17FCA:64372DA:5BB86B03'}

Getting an image / binary response

from PIL import Image
from io import BytesIO

r = requests.get('https://www.baidu.com/img/bd_logo1.png')
i = Image.open(BytesIO(r.content))
i


Getting the raw response (stream)

r = requests.get('https://api.github.com/events', stream=True)
print(repr(r.raw))
print(r.raw.read(10))

# Write the stream to a file
# with open(filename, 'wb') as fd:
#     for chunk in r.iter_content(chunk_size):
#         fd.write(chunk)

<urllib3.response.HTTPResponse object at 0x0000021DB45A01D0>
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

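Uploading files (Multipart)

It is strongly recommended to open the file in binary mode. Requests may try to provide the Content-Length header for you, and when it does, the value is set to the number of bytes in the file. If the file is opened in text mode, errors may occur.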

url = 'http://httpbin.org/post'
# Open in binary mode and close the file once the upload finishes
with open('foo.csv', 'rb') as f:
    r = requests.post(url, files={'file': f})
r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {\n    "file": ",A,B,C,D\\r\\n2018-10-01,-1.220848947969879,0.34770279424784534,-1.2949716740282913,-0.12753569660447445\\r\\n2018-10-02,0.9654054950657867,0.1298946116737466,-0.8919662017673563,-0.6231678397510875\\r\\n2018-10-03,-0.7167684655756086,2.2503095310053123,1.579511611490373,0.04896807966119773\\r\\n2018-10-04,0.962763642192827,-0.7481190393285372,0.3829458578953143,0.2880958636907875\\r\\n2018-10-05,-0.7986543786669759,0.23728813860318618,-0.9749883474151525,0.055703777273328835\\r\\n2018-10-06,-0.07270533901475876,0.6909922491284924,-0.4074295590583697,-0.7151348951915282\\r\\n"\n  }, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Connection": "close", \n    "Content-Length": "697", \n    "Content-Type": "multipart/form-data; boundary=6bc67f9be73d22638faebce2a26c8116", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.19.1"\n  }, \n  "json": null, \n  "origin": "223.104.213.123", \n  "url": "http://httpbin.org/post"\n}\n'
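The binary-mode advice above comes down to characters and bytes not being the same count. A minimal sketch, using a temporary file and a made-up line of text that contains one non-ASCII character:

```python
import os
import tempfile

text = 'price: ￥248\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    # Text mode yields decoded characters.
    with open(path, 'r', encoding='utf-8', newline='') as f:
        char_count = len(f.read())

    # Binary mode yields the bytes that are actually stored and sent.
    with open(path, 'rb') as f:
        byte_count = len(f.read())

    size_on_disk = os.path.getsize(path)

print(char_count, byte_count, size_on_disk)
```

Only the binary read matches the file size, which is the number a Content-Length header has to carry.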

Beautiful Soup

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installation

python3 -m pip install --upgrade beautifulsoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml:

python3 -m pip install --upgrade lxml

Simple example

# coding:utf-8
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
import re
import json

skuid = '7437789'
resp = requests.get('https://item.jd.com/{0}.html'.format(skuid), allow_redirects=False, timeout=5)
print('Get JD merchandise returns status: {0}'.format(resp.status_code))

soup = BeautifulSoup(resp.text,'lxml')
merchandiseName = soup.find('div',attrs={'class':'sku-name'}).text.strip()
merchandiseImage = soup.find('img', attrs={'id':'spec-img'})['data-origin']

resp_price = requests.get('https://p.3.cn/prices/mgets?skuIds=J_{0}'.format(skuid), allow_redirects=False, timeout=5)
resp_json = json.loads(resp_price.text)

print(merchandiseName)
print(merchandiseImage)

for price_info in resp_json:
    print('Origin price: ￥{0}, Current Price: ￥{1}'.format(price_info['op'], price_info['p']))

r = requests.get('https:{0}'.format(merchandiseImage))
i = Image.open(BytesIO(r.content))
i

Get JD merchandise returns status: 200
欧橡驱鼠器电子猫老鼠干扰器 500㎡有效面积SD-003 企业定制款100个起订
//img14.360buyimg.com/n1/jfs/t19189/185/1994210951/283155/8e4d44ca/5ae1a511Necf9d582.jpg
Origin price: ￥248.00, Current Price: ￥248.00
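The script above depends on the live page keeping its structure. The same find calls can be tried offline on an inline HTML string; the markup below is a made-up stand-in that only mimics the two elements used above, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '''
<html><body>
  <div class="sku-name"> Example product </div>
  <img id="spec-img" data-origin="//img.example.com/item.jpg">
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if it is installed

name = soup.find('div', attrs={'class': 'sku-name'}).text.strip()
image = soup.find('img', attrs={'id': 'spec-img'})['data-origin']
print(name)
print(image)

# find returns None when nothing matches, so check before using the result.
missing = soup.find('div', attrs={'class': 'no-such-class'})
print(missing is None)
```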


Last edited: 2018/10/06 18:34