Python Notes - Common Libraries


Python libraries covered in this article

python3 -m pip install --upgrade pandas
python3 -m pip install --upgrade requests
python3 -m pip install --upgrade pillow
python3 -m pip install --upgrade beautifulsoup4
python3 -m pip install --upgrade lxml

Pandas in 10 Minutes

http://pandas.pydata.org/

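Install (on Windows you may need to Run as Administrator):

python3 -m pip install --upgrade pandas

Data processing usually starts by importing the following modules:

import pandas as pd              # pandas: data analysis
import numpy as np               # NumPy: numerical arrays
import matplotlib.pyplot as plt  # Matplotlib: plotting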

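Core data structures

The two core data structures in pandas are Series and DataFrame.

Name       Dimensions  Description
Series     1           Labeled array of a single (homogeneous) type; can be thought of as one axis of values
DataFrame  2           Labeled, size-mutable table whose columns may hold different (heterogeneous) types; mostly used as a table of data
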
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a Series
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
print('=' * 10)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
==========
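
The Series above contains one np.nan. As a minimal sketch of how missing values are usually inspected and handled (the variable names missing, dropped, and filled are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# NaN is a float, so the integer inputs are upcast to float64.
print(s.dtype)

# Boolean mask that marks the missing entries.
missing = s.isna()
print(missing.sum())

# Drop the missing entries; the remaining rows keep their original labels.
dropped = s.dropna()
print(list(dropped.index))

# Or replace the missing entries with a fixed value.
filled = s.fillna(0)
print(filled.tolist())
```

Neither dropna() nor fillna() changes s itself; both return a new Series.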
# Date index
dates = pd.date_range('20181001', periods=6)
print(dates)
print('=' * 10)

# Create a DataFrame from a NumPy array
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
print('=' * 10)
DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
               '2018-10-05', '2018-10-06'],
              dtype='datetime64[ns]', freq='D')
==========
                   A         B         C         D
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170 -2.858133
2018-10-03 -1.466480  0.040343 -0.197412  1.386438
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
2018-10-05  0.999744 -0.115692  0.127083  1.171267
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852
==========
# Create a DataFrame from a Python dict
df2 = pd.DataFrame({ 'A' : 1.,
                   'B' : pd.Timestamp('20181002'),
                   'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D' : np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),
                   'F' : 'foo' })
print(df2)
print('=' * 10)
     A          B    C  D      E    F
0  1.0 2018-10-02  1.0  3   test  foo
1  1.0 2018-10-02  1.0  3  train  foo
2  1.0 2018-10-02  1.0  3   test  foo
3  1.0 2018-10-02  1.0  3  train  foo
==========

The columns of a DataFrame can have different dtypes.

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
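
Because each column carries its own dtype, columns can be selected or converted by type. A small sketch using a reduced version of df2 (the names numeric and d64 are illustrative):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.,
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'F': 'foo',
})

# Keep only the numeric columns (float64, float32, int32 here).
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))

# Convert a single column to another dtype; the result is a new Series.
d64 = df2['D'].astype('int64')
print(d64.dtype)
```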

Basic operations

View the first rows

df.head(2)
                   A         B         C         D
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170 -2.858133

View the last rows

df.tail(3)
                   A         B         C         D
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
2018-10-05  0.999744 -0.115692  0.127083  1.171267
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852

View the index, the columns, and the underlying NumPy values

print(df.index)
print(df.columns)
print(df.values)
DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
               '2018-10-05', '2018-10-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 1.07429103  1.4066146   0.59064325  0.09620024]
 [ 1.41759707  1.88443823  0.81816967 -2.85813293]
 [-1.46647964  0.04034268 -0.19741215  1.38643847]
 [-0.20047085  1.14628163 -0.32458641  0.70068001]
 [ 0.9997445  -0.1156922   0.12708306  1.17126667]
 [-1.6933592   0.10085231 -0.9898741  -0.71285152]]

Built-in summary statistics

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.021887  0.743806  0.004004 -0.036067
std    1.357830  0.842536  0.657038  1.578833
min   -1.693359 -0.115692 -0.989874 -2.858133
25%   -1.149977  0.055470 -0.292793 -0.510589
50%    0.399637  0.623567 -0.035165  0.398440
75%    1.055654  1.341531  0.474753  1.053620
max    1.417597  1.884438  0.818170  1.386438
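
To make the rows of describe() less opaque, the sketch below compares them with the corresponding Series methods on a tiny hand-made frame (small and stats are illustrative names, and the data is made up for the example):

```python
import pandas as pd

small = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})
stats = small.describe()

# Each row of the summary corresponds to a Series method.
print(stats.loc['mean', 'A'], small['A'].mean())
print(stats.loc['std', 'A'], small['A'].std())           # sample standard deviation (ddof=1)
print(stats.loc['50%', 'A'], small['A'].median())
print(stats.loc['25%', 'A'], small['A'].quantile(0.25))  # linear interpolation by default
```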

Transpose

df.T
   2018-10-01 00:00:00  2018-10-02 00:00:00  2018-10-03 00:00:00  2018-10-04 00:00:00  2018-10-05 00:00:00  2018-10-06 00:00:00
A             1.074291             1.417597            -1.466480            -0.200471             0.999744            -1.693359
B             1.406615             1.884438             0.040343             1.146282            -0.115692             0.100852
C             0.590643             0.818170            -0.197412            -0.324586             0.127083            -0.989874
D             0.096200            -2.858133             1.386438             0.700680             1.171267            -0.712852

Sorting

df.sort_index(axis=1, ascending=False)  # sort by column labels, descending
                   D         C         B         A
2018-10-01  0.096200  0.590643  1.406615  1.074291
2018-10-02 -2.858133  0.818170  1.884438  1.417597
2018-10-03  1.386438 -0.197412  0.040343 -1.466480
2018-10-04  0.700680 -0.324586  1.146282 -0.200471
2018-10-05  1.171267  0.127083 -0.115692  0.999744
2018-10-06 -0.712852 -0.989874  0.100852 -1.693359

df.sort_index(axis=0, ascending=False)  # sort by row labels, descending
                   A         B         C         D
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852
2018-10-05  0.999744 -0.115692  0.127083  1.171267
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
2018-10-03 -1.466480  0.040343 -0.197412  1.386438
2018-10-02  1.417597  1.884438  0.818170 -2.858133
2018-10-01  1.074291  1.406615  0.590643  0.096200

Sort by column values

df.sort_values(by='B')
                   A         B         C         D
2018-10-05  0.999744 -0.115692  0.127083  1.171267
2018-10-03 -1.466480  0.040343 -0.197412  1.386438
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170 -2.858133
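
sort_values() also accepts a sort direction and more than one key. A short sketch on made-up data (the frame scores and its columns are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b'],
    'value': [2, 4, 1, 3],
})

# Descending order on a single column.
desc = scores.sort_values(by='value', ascending=False)
print(desc['value'].tolist())

# Two keys: group ascending, then value descending within each group.
multi = scores.sort_values(by=['group', 'value'], ascending=[True, False])
print(multi['value'].tolist())
```

Both calls return a new DataFrame and leave scores unchanged.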

Selecting data

df['A']  # select a column
2018-10-01    1.074291
2018-10-02    1.417597
2018-10-03   -1.466480
2018-10-04   -0.200471
2018-10-05    0.999744
2018-10-06   -1.693359
Freq: D, Name: A, dtype: float64

df[2:4]  # select rows with a positional slice
                   A         B         C         D
2018-10-03 -1.466480  0.040343 -0.197412  1.386438
2018-10-04 -0.200471  1.146282 -0.324586  0.700680
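
Besides the bracket shortcuts, pandas offers the explicit indexers loc (by label) and iloc (by position). The sketch below uses a frame with the same index and columns as df but with integer values, so the results are easy to check by eye:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20181001', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based selection: both endpoints of a label slice are included.
by_label = df.loc['2018-10-02':'2018-10-03', ['A', 'B']]
print(by_label)

# Position-based selection: the end position is excluded, as in Python slicing.
by_position = df.iloc[2:4, 0:2]
print(by_position)

# A single value, addressed by row label and column label.
value = df.loc[dates[0], 'A']
print(value)
```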

Filtering data (comparable to filters in Excel)

df[df.A > 0]  # keep the rows where column A is greater than 0
                   A         B         C         D
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170 -2.858133
2018-10-05  0.999744 -0.115692  0.127083  1.171267

df[df > 0]  # keep values greater than 0; everything else becomes NaN
                   A         B         C         D
2018-10-01  1.074291  1.406615  0.590643  0.096200
2018-10-02  1.417597  1.884438  0.818170       NaN
2018-10-03       NaN  0.040343       NaN  1.386438
2018-10-04       NaN  1.146282       NaN  0.700680
2018-10-05  0.999744       NaN  0.127083  1.171267
2018-10-06       NaN  0.100852       NaN       NaN
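
Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0, -4.0], 'B': [-1.0, -1.0, 5.0, 6.0]})

# Rows where both columns are positive.
both = df[(df['A'] > 0) & (df['B'] > 0)]
print(list(both.index))

# Rows where at least one column is positive.
either = df[(df['A'] > 0) | (df['B'] > 0)]
print(list(either.index))
```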

Filtering with isin()

df3 = df.copy()
df3['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df3
                   A         B         C         D      E
2018-10-01  1.074291  1.406615  0.590643  0.096200    one
2018-10-02  1.417597  1.884438  0.818170 -2.858133    one
2018-10-03 -1.466480  0.040343 -0.197412  1.386438    two
2018-10-04 -0.200471  1.146282 -0.324586  0.700680  three
2018-10-05  0.999744 -0.115692  0.127083  1.171267   four
2018-10-06 -1.693359  0.100852 -0.989874 -0.712852  three

df3[df3['E'].isin(['two', 'four'])]
                   A         B         C         D     E
2018-10-03 -1.466480  0.040343 -0.197412  1.386438   two
2018-10-05  0.999744 -0.115692  0.127083  1.171267  four
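
The opposite filter, keeping the rows whose value is not in the list, can be written by negating the mask with ~. A minimal sketch (the frame labels is illustrative and holds only column E):

```python
import pandas as pd

labels = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

# ~ inverts the boolean mask produced by isin().
excluded = labels[~labels['E'].isin(['two', 'four'])]
print(excluded['E'].tolist())
```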

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x25a6b74eb70>

[Figure: line plot of the cumulative-sum time series ts]

df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x25a6da42e80>

[Figure: line plot of columns A, B, C and D of df]

Saving and loading

df.to_csv('foo.csv')
pd.read_csv('foo.csv')
   Unnamed: 0         A         B         C         D
0  2018-10-01 -1.220849  0.347703 -1.294972 -0.127536
1  2018-10-02  0.965405  0.129895 -0.891966 -0.623168
2  2018-10-03 -0.716768  2.250310  1.579512  0.048968
3  2018-10-04  0.962764 -0.748119  0.382946  0.288096
4  2018-10-05 -0.798654  0.237288 -0.974988  0.055704
5  2018-10-06 -0.072705  0.690992 -0.407430 -0.715135
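
In the output above the date index comes back as an ordinary column named "Unnamed: 0", because to_csv() writes the index as a first column without a header. One way to restore it on read is sketched below. An in-memory buffer stands in for foo.csv so the example leaves no file behind, and the small integer frame is made up for the example:

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20181001', periods=3)
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=['A', 'B'])

# Write the CSV into memory instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index; parse_dates=True parses it as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
print(type(restored.index).__name__)
```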

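Web scraping

urllib

https://docs.python.org/3/library/urllib.html

urllib bundles four modules for working with URLs: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.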
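
As a small illustration that needs no network access, the sketch below uses urllib.parse to build a query string and to split a URL back into its parts. The URL is the httpbin address used later in this article:

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Build a query string from a dict.
query = urlencode({'key1': 'value1', 'key2': 'value2'})
url = 'https://httpbin.org/get?' + query
print(url)

# Split the URL into its components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Parse the query string back into a dict of lists.
params = parse_qs(parts.query)
print(params)
```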

Requests

http://docs.python-requests.org/en/master/

A friendlier HTTP library for Python 3. Compared with urllib, it supports more advanced features through a simpler API.

Installation

python3 -m pip install --upgrade requests

Simple examples

import requests

# GET
r = requests.get('https://api.github.com/events')
# GET with query parameters
r = requests.get('https://httpbin.org/get', params={'key1': 'value1', 'key2': 'value2'})
# PUT (requests.post works the same way for POST)
r = requests.put('https://httpbin.org/put', data={'key': 'value'})
# DELETE
r = requests.delete('https://httpbin.org/delete')
# HEAD
r = requests.head('https://httpbin.org/get')
# OPTIONS
r = requests.options('https://httpbin.org/get')
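
To see how params ends up in the URL without sending anything, a request can be built and prepared by hand. This is only a sketch for inspection; in normal use requests.get() does the same preparation internally:

```python
import requests

# Build the request object without sending it.
request = requests.Request(
    'GET',
    'https://httpbin.org/get',
    params={'key1': 'value1', 'key2': 'value2'},
)
prepared = request.prepare()

# The params dict has been encoded into the query string.
print(prepared.method, prepared.url)
```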

Quick Start

Reading the response

import requests

# GET
r = requests.get('https://api.github.com/events')
# print('Get response: {0}'.format(r.text))     # text
# print('Get response: {0}'.format(r.content))  # binary content
# print('Get response in json: {0}'.format(r.json()))  # JSON
print('Get response encoding: {0}'.format(r.encoding))
print('Get response status code: {0}'.format(r.status_code))
print('Get response headers: {0}'.format(r.headers))

Get response encoding: utf-8
Get response status code: 200
Get response headers: {'Server': 'GitHub.com', 'Date': 'Sat, 06 Oct 2018 07:57:56 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '200 OK', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '55', 'X-RateLimit-Reset': '1538815630', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept', 'ETag': 'W/"e791f13d15e5fa81cdb229a2aeb70031"', 'Last-Modified': 'Sat, 06 Oct 2018 07:52:56 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/events?page=2>; rel="next", <https://api.github.com/events?page=10>; rel="last"', 'Access-Control-Expose-Headers': 'ETag, Link, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'X-Runtime-rack': '0.364552', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': '9371:15CB:2D17FCA:64372DA:5BB86B03'}

Fetching an image / binary response

import requests
from PIL import Image
from io import BytesIO

r = requests.get('https://www.baidu.com/img/bd_logo1.png')
i = Image.open(BytesIO(r.content))
i

[Figure: the downloaded logo image]

Raw response (stream)

r = requests.get('https://api.github.com/events', stream=True)
print(repr(r.raw))
print(r.raw.read(10))

# Write the stream to a file
# with open(filename, 'wb') as fd:
#     for chunk in r.iter_content(chunk_size):
#         fd.write(chunk)
<urllib3.response.HTTPResponse object at 0x0000021DB45A01D0>
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
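
The commented-out loop above can be wrapped in a small helper. The sketch below is hypothetical: save_stream is not part of Requests, and FakeResponse is a stand-in that only mimics iter_content so the example runs offline. With a real response obtained using stream=True, the helper would be called the same way:

```python
import os
import tempfile


def save_stream(response, path, chunk_size=8192):
    """Write a streamed body to path chunk by chunk; return the bytes written."""
    written = 0
    with open(path, 'wb') as fd:
        for chunk in response.iter_content(chunk_size):
            fd.write(chunk)
            written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in that exposes only iter_content, for offline use."""

    def __init__(self, payload):
        self._payload = payload

    def iter_content(self, chunk_size):
        for start in range(0, len(self._payload), chunk_size):
            yield self._payload[start:start + chunk_size]


path = os.path.join(tempfile.mkdtemp(), 'body.bin')
size = save_stream(FakeResponse(b'0123456789'), path, chunk_size=4)
print(size)
```

Writing in chunks keeps memory use bounded by chunk_size rather than by the size of the download.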

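Uploading files (multipart)

Opening the file in binary mode is strongly recommended. Requests may try to supply the Content-Length header for you, and when it does, the value is set to the number of bytes in the file. If the file is opened in text mode, errors may occur.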
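
The reason bytes and text can disagree is that a character may take more than one byte once encoded. A tiny sketch with a made-up CSV line containing one non-ASCII character:

```python
text = 'naïve,1\n'
encoded = text.encode('utf-8')

# Length as seen in text mode: number of characters.
print(len(text))

# Length as sent on the wire: number of bytes ('ï' takes two bytes in UTF-8).
print(len(encoded))
```

A Content-Length computed from the character count would be too small for this line, which is why the byte count from a binary-mode file is the safe choice.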

url = 'http://httpbin.org/post'
# Open in binary mode; the with block closes the file once the upload is done.
with open('foo.csv', 'rb') as f:
    r = requests.post(url, files={'file': f})
r.text
'{\n  "args": {}, \n  "data": "", \n  "files": {\n    "file": ",A,B,C,D\\r\\n2018-10-01,-1.220848947969879,0.34770279424784534,-1.2949716740282913,-0.12753569660447445\\r\\n2018-10-02,0.9654054950657867,0.1298946116737466,-0.8919662017673563,-0.6231678397510875\\r\\n2018-10-03,-0.7167684655756086,2.2503095310053123,1.579511611490373,0.04896807966119773\\r\\n2018-10-04,0.962763642192827,-0.7481190393285372,0.3829458578953143,0.2880958636907875\\r\\n2018-10-05,-0.7986543786669759,0.23728813860318618,-0.9749883474151525,0.055703777273328835\\r\\n2018-10-06,-0.07270533901475876,0.6909922491284924,-0.4074295590583697,-0.7151348951915282\\r\\n"\n  }, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Connection": "close", \n    "Content-Length": "697", \n    "Content-Type": "multipart/form-data; boundary=6bc67f9be73d22638faebce2a26c8116", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.19.1"\n  }, \n  "json": null, \n  "origin": "223.104.213.123", \n  "url": "http://httpbin.org/post"\n}\n'

Beautiful Soup

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installation

python3 -m pip install --upgrade beautifulsoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.

python3 -m pip install --upgrade lxml

Simple example

# coding:utf-8
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
import re
import json

skuid = '7437789'
resp = requests.get('https://item.jd.com/{0}.html'.format(skuid), allow_redirects=False, timeout=5)
print('Get JD merchandise returns status: {0}'.format(resp.status_code))

soup = BeautifulSoup(resp.text,'lxml')
merchandiseName = soup.find('div',attrs={'class':'sku-name'}).text.strip()
merchandiseImage = soup.find('img', attrs={'id':'spec-img'})['data-origin']

resp_price = requests.get('https://p.3.cn/prices/mgets?skuIds=J_{0}'.format(skuid), allow_redirects=False, timeout=5)
resp_json = json.loads(resp_price.text)

print(merchandiseName)
print(merchandiseImage)

for price_info in resp_json:
    print('Origin price: ¥{0}, Current Price: ¥{1}'.format(price_info['op'], price_info['p']))

r = requests.get('https:{0}'.format(merchandiseImage))
i = Image.open(BytesIO(r.content))
i
Get JD merchandise returns status: 200
欧橡驱鼠器电子猫老鼠干扰器 500㎡有效面积SD-003 企业定制款100个起订
//img14.360buyimg.com/n1/jfs/t19189/185/1994210951/283155/8e4d44ca/5ae1a511Necf9d582.jpg
Origin price: ¥248.00, Current Price: ¥248.00

[Figure: the downloaded product image]
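
The same find() calls can be tried offline on an inline HTML string, which is handy when the live page changes or is unreachable. In the sketch below the HTML snippet, the product name, and the image host are all made up; only the tag names, the class, and the attribute names mirror the example above. It uses the standard-library parser so lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment shaped like the elements queried above.
html = '''
<html><body>
  <div class="sku-name">  Example product  </div>
  <img id="spec-img" data-origin="//img.example.com/n1/sample.jpg">
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Text of the first <div class="sku-name">, with surrounding whitespace removed.
name = soup.find('div', attrs={'class': 'sku-name'}).text.strip()

# Attribute lookup on a tag works like a dict.
image = soup.find('img', attrs={'id': 'spec-img'})['data-origin']

# The attribute holds a protocol-relative URL, so a scheme is prepended.
image_url = 'https:{0}'.format(image)

print(name)
print(image_url)
```

Note that find() returns None when nothing matches, so code that runs against a real page may want to check the result before accessing .text or an attribute.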