Source: the official pandas documentation, "10 minutes to pandas".
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
1 Creating objects
1.1 Series
A Series is a one-dimensional labeled array: it holds a single column of data together with its index. Create one by passing a list:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
1.2 DataFrame
A DataFrame is a two-dimensional table. Here we create one by passing a NumPy array, with a date range as the row index and 'ABCD' as the column labels:
dates = pd.date_range('20191214', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
A DataFrame can also be created by passing a dict; each value is converted into a Series:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
print(df2.dtypes)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
So we can think of a DataFrame as a collection of Series columns glued together.
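A minimal sketch of this view, using toy data: selecting a single column gives back a Series that shares the DataFrame's index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])

# Selecting one column returns a Series that shares the DataFrame's index
col = df['A']
print(type(col))                   # <class 'pandas.core.series.Series'>
print(col.index.equals(df.index))  # True
```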
2 Viewing data
Use head and tail to view the first and last rows, index and columns to inspect the row and column labels, and values to show the contents without labels:
print(df.head(), '\n')
print(df.tail(3), '\n')
print(df.index, '\n')
print(df.columns, '\n')
print(df.values, '\n')
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
A B C D
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
DatetimeIndex(['2019-12-14', '2019-12-15', '2019-12-16', '2019-12-17',
'2019-12-18', '2019-12-19'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.61656893 0.05140077 1.12140918 0.30726721]
[ 0.48399521 -1.25127827 -1.77202211 1.2826497 ]
[ 0.36688539 -1.61791122 0.43830175 1.12413452]
[ 0.71737246 0.23322127 1.83916745 -1.53579412]
[ 1.75385779 0.26611012 -0.33393125 0.281345 ]
[ 1.04860622 0.98477115 -0.27910636 0.59228493]]
A NumPy array has a single dtype for the entire array, while a DataFrame has one dtype per column. So when df2 is converted to NumPy, every column is upcast to the common dtype object:
print(df.to_numpy())
df2.to_numpy()
[[ 0.61656893 0.05140077 1.12140918 0.30726721]
[ 0.48399521 -1.25127827 -1.77202211 1.2826497 ]
[ 0.36688539 -1.61791122 0.43830175 1.12413452]
[ 0.71737246 0.23322127 1.83916745 -1.53579412]
[ 1.75385779 0.26611012 -0.33393125 0.281345 ]
[ 1.04860622 0.98477115 -0.27910636 0.59228493]]
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
describe shows a quick statistical summary of each column:
print(df.describe())
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.831214 -0.222281 0.168970 0.341981
std 0.508670 0.998449 1.264150 1.008979
min 0.366885 -1.617911 -1.772022 -1.535794
25% 0.517139 -0.925609 -0.320225 0.287826
50% 0.666971 0.142311 0.079598 0.449776
75% 0.965798 0.257888 0.950632 0.991172
max 1.753858 0.984771 1.839167 1.282650
T transposes the DataFrame:
print(df.T)
2019-12-14 2019-12-15 2019-12-16 2019-12-17 2019-12-18 2019-12-19
A 0.616569 0.483995 0.366885 0.717372 1.753858 1.048606
B 0.051401 -1.251278 -1.617911 0.233221 0.266110 0.984771
C 1.121409 -1.772022 0.438302 1.839167 -0.333931 -0.279106
D 0.307267 1.282650 1.124135 -1.535794 0.281345 0.592285
sort_index sorts along an axis; axis=1 selects the second axis, i.e. sorting by the column labels:
print(df.sort_index(axis=1, ascending=False))
D C B A
2019-12-14 0.307267 1.121409 0.051401 0.616569
2019-12-15 1.282650 -1.772022 -1.251278 0.483995
2019-12-16 1.124135 0.438302 -1.617911 0.366885
2019-12-17 -1.535794 1.839167 0.233221 0.717372
2019-12-18 0.281345 -0.333931 0.266110 1.753858
2019-12-19 0.592285 -0.279106 0.984771 1.048606
sort_values sorts by values; by='B' sorts on column B:
print(df.sort_values(by='B'))
A B C D
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
3 Selection
3.1 Getting rows and columns
Like NumPy arrays, a DataFrame supports selection with []: a single label selects a column, and a slice (by position or by label) selects rows:
print(df['A'], '\n')
print(df[0:3], '\n')
print(df['20191216':'20191219'], '\n')
2019-12-14 0.616569
2019-12-15 0.483995
2019-12-16 0.366885
2019-12-17 0.717372
2019-12-18 1.753858
2019-12-19 1.048606
Freq: D, Name: A, dtype: float64
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
A B C D
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
loc selects by label, and at gives fast scalar access by label:
print(df.loc[dates[0]], '\n')
print(df.loc[:, ['A', 'B']], '\n')
print(df.loc['20191216':'20191218', ['A', 'B']], '\n')
print(df.at[dates[0], 'A'], '\n')
A 0.616569
B 0.051401
C 1.121409
D 0.307267
Name: 2019-12-14 00:00:00, dtype: float64
A B
2019-12-14 0.616569 0.051401
2019-12-15 0.483995 -1.251278
2019-12-16 0.366885 -1.617911
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
2019-12-19 1.048606 0.984771
A B
2019-12-16 0.366885 -1.617911
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
0.6165689271991402
iloc selects by integer position, and iat gives fast scalar access by position:
print(df.iloc[3], '\n')
print(df.iloc[3:5, 0:2], '\n')
print(df.iloc[[1, 2, 4], [0, 2]], '\n')
print(df.iloc[1:3, :], '\n')
print(df.iat[1, 1], '\n')
A 0.717372
B 0.233221
C 1.839167
D -1.535794
Name: 2019-12-17 00:00:00, dtype: float64
A B
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
A C
2019-12-15 0.483995 -1.772022
2019-12-16 0.366885 0.438302
2019-12-18 1.753858 -0.333931
A B C D
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
-1.2512782747247186
Boolean conditions can be used to filter rows or individual values, and isin filters on membership:
print(df[df.A > 0], '\n')
print(df[df > 0], '\n')
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2, '\n')
print(df2[df2['E'].isin(['two', 'four'])])
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 NaN NaN 1.282650
2019-12-16 0.366885 NaN 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 NaN
2019-12-18 1.753858 0.266110 NaN 0.281345
2019-12-19 1.048606 0.984771 NaN 0.592285
A B C D E
2019-12-14 0.616569 0.051401 1.121409 0.307267 one
2019-12-15 0.483995 -1.251278 -1.772022 1.282650 one
2019-12-16 0.366885 -1.617911 0.438302 1.124135 two
2019-12-17 0.717372 0.233221 1.839167 -1.535794 three
2019-12-18 1.753858 0.266110 -0.333931 0.281345 four
2019-12-19 1.048606 0.984771 -0.279106 0.592285 three
A B C D E
2019-12-16 0.366885 -1.617911 0.438302 1.124135 two
2019-12-18 1.753858 0.266110 -0.333931 0.281345 four
3.2 Setting rows and columns
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191215', periods=6))
print(s1, '\n')
df['F'] = s1
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
print(df)
df2 = df.copy()
df2[df2 > 0] = -df2
print(df2)
2019-12-15 1
2019-12-16 2
2019-12-17 3
2019-12-18 4
2019-12-19 5
2019-12-20 6
Freq: D, dtype: int64
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0
2019-12-18 1.753858 0.266110 -0.333931 5 4.0
2019-12-19 1.048606 0.984771 -0.279106 5 5.0
A B C D F
2019-12-14 0.000000 0.000000 -1.121409 -5 NaN
2019-12-15 -0.483995 -1.251278 -1.772022 -5 -1.0
2019-12-16 -0.366885 -1.617911 -0.438302 -5 -2.0
2019-12-17 -0.717372 -0.233221 -1.839167 -5 -3.0
2019-12-18 -1.753858 -0.266110 -0.333931 -5 -4.0
2019-12-19 -1.048606 -0.984771 -0.279106 -5 -5.0
4 Missing data
pandas uses np.nan to represent missing values; by default it is excluded from computations.
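A small demonstration of this default (a sketch with toy data): reductions skip NaN unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Reductions skip NaN by default (skipna=True)
print(s.sum())   # 4.0
print(s.mean())  # 2.0

# Opt back in to NaN propagation explicitly
print(s.sum(skipna=False))  # nan
```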
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1, '\n')
print(df1.dropna(how='any'), '\n')
print(df1.fillna(value=5), '\n')
print(pd.isna(df1))
A B C D F E
2019-12-14 0.000000 0.000000 1.121409 5 NaN 1.0
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0 NaN
2019-12-17 0.717372 0.233221 1.839167 5 3.0 NaN
A B C D F E
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
A B C D F E
2019-12-14 0.000000 0.000000 1.121409 5 5.0 1.0
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0 5.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0 5.0
A B C D F E
2019-12-14 False False False False True False
2019-12-15 False False False False False False
2019-12-16 False False False False False True
2019-12-17 False False False False False True
5 Operations
5.1 Statistics
print(df.mean(), '\n')
print(df.mean(1), '\n')
print(pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2), '\n')
A 0.728453
B -0.230848
C 0.168970
D 5.000000
F 3.000000
dtype: float64
2019-12-14 1.530352
2019-12-15 0.692139
2019-12-16 1.237455
2019-12-17 2.157952
2019-12-18 2.137207
2019-12-19 2.350854
Freq: D, dtype: float64
2019-12-14 NaN
2019-12-15 NaN
2019-12-16 1.0
2019-12-17 3.0
2019-12-18 5.0
2019-12-19 NaN
Freq: D, dtype: float64
5.2 Apply
print(df, '\n')
print(df.apply(np.cumsum), '\n')
print(df.apply(lambda x: x.max() - x.min()))
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0
2019-12-18 1.753858 0.266110 -0.333931 5 4.0
2019-12-19 1.048606 0.984771 -0.279106 5 5.0
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -0.650613 10 1.0
2019-12-16 0.850881 -2.869189 -0.212311 15 3.0
2019-12-17 1.568253 -2.635968 1.626856 20 6.0
2019-12-18 3.322111 -2.369858 1.292925 25 10.0
2019-12-19 4.370717 -1.385087 1.013819 30 15.0
A 1.753858
B 2.602682
C 3.611190
D 0.000000
F 4.000000
dtype: float64
5.3 Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
print(s, '\n')
print(s.value_counts())
0 4
1 3
2 3
3 6
4 4
5 5
6 2
7 2
8 3
9 0
dtype: int32
3 3
4 2
2 2
6 1
5 1
0 1
dtype: int64
5.4 String methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print(s.str.lower(), '\n')
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
6 Merging
6.1 Concat
df = pd.DataFrame(np.random.randn(10, 4))
print(df, '\n')
pieces = [df[:3], df[3:7], df[7:]]
print(pd.concat(pieces))
0 1 2 3
0 1.434680 -0.776149 -1.218836 2.397118
1 -1.632868 -0.806080 0.634619 -0.760958
2 1.028005 -0.020240 0.696301 0.090692
3 -1.516807 0.952234 0.758215 -0.272409
4 0.540413 0.258468 -0.563238 -1.380825
5 0.021616 0.097505 -1.052843 0.272892
6 -0.957512 0.088345 -0.972309 -1.224522
7 0.211805 1.674883 -0.919839 -0.159892
8 1.651509 0.908129 -0.416815 -0.629014
9 -0.616352 0.160981 1.059860 -0.521867
0 1 2 3
0 1.434680 -0.776149 -1.218836 2.397118
1 -1.632868 -0.806080 0.634619 -0.760958
2 1.028005 -0.020240 0.696301 0.090692
3 -1.516807 0.952234 0.758215 -0.272409
4 0.540413 0.258468 -0.563238 -1.380825
5 0.021616 0.097505 -1.052843 0.272892
6 -0.957512 0.088345 -0.972309 -1.224522
7 0.211805 1.674883 -0.919839 -0.159892
8 1.651509 0.908129 -0.416815 -0.629014
9 -0.616352 0.160981 1.059860 -0.521867
6.2 Join
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left, '\n')
print(right, '\n')
print(pd.merge(left, right, on='key'))
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
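Because both frames repeat the key 'foo', merge produces the 2x2 cartesian product shown above. With unique keys, each row pairs with exactly one partner (a minimal sketch with hypothetical data):

```python
import pandas as pd

# Hypothetical frames where each key appears once on each side
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

merged = pd.merge(left, right, on='key')
print(merged)       # 2 rows: one per key, not a cartesian product
print(len(merged))  # 2
```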
6.3 Append
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print(df, '\n')
s = df.iloc[3]
# Note: DataFrame.append was deprecated and removed in pandas 2.0;
# pd.concat([df, s.to_frame().T], ignore_index=True) is the modern equivalent.
print(df.append(s, ignore_index=True))
A B C D
0 0.481732 1.270100 0.518214 0.028305
1 0.004151 0.602318 1.274942 -2.683102
2 0.130687 -0.402626 0.701901 -1.182579
3 -1.047072 0.656017 0.468616 -0.694842
4 -0.097726 1.669621 0.901135 1.337648
5 0.493724 -0.865273 -1.390465 1.143148
6 -0.691091 0.460376 0.540177 -0.205331
7 -0.634427 2.625610 0.067031 -2.041728
A B C D
0 0.481732 1.270100 0.518214 0.028305
1 0.004151 0.602318 1.274942 -2.683102
2 0.130687 -0.402626 0.701901 -1.182579
3 -1.047072 0.656017 0.468616 -0.694842
4 -0.097726 1.669621 0.901135 1.337648
5 0.493724 -0.865273 -1.390465 1.143148
6 -0.691091 0.460376 0.540177 -0.205331
7 -0.634427 2.625610 0.067031 -2.041728
8 -1.047072 0.656017 0.468616 -0.694842
7 Grouping
Similar to GROUP BY in SQL, grouping involves the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
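The three steps above can also be sketched by hand (a minimal illustration with toy data), which matches what groupby does internally:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})

# 1. split: partition the rows by the value of column 'A'
groups = dict(tuple(df.groupby('A')))
# 2. apply: aggregate each piece independently
sums = {key: grp['C'].sum() for key, grp in groups.items()}
# 3. combine: assemble the per-group results
result = pd.Series(sums)
print(result)

# Same outcome as the built-in one-liner
print(result.sort_index().equals(df.groupby('A')['C'].sum().sort_index()))  # True
```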
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df, '\n')
print(df.groupby('A').sum(), '\n')
print(df.groupby(['A', 'B']).sum())
A B C D
0 foo one 0.182951 -0.346016
1 bar one 1.906573 -0.386902
2 foo two 0.102443 -1.087078
3 bar three 1.023196 0.467237
4 foo two -0.940578 -0.414947
5 bar two -1.075379 -0.844441
6 foo one 0.514689 0.038393
7 foo three -2.136912 -1.001738
C D
A
bar 1.854390 -0.764105
foo -2.277406 -2.811385
C D
A B
bar one 1.906573 -0.386902
three 1.023196 0.467237
two -1.075379 -0.844441
foo one 0.697640 -0.307622
three -2.136912 -1.001738
two -0.838134 -1.502025
8 Reshaping
8.1 Stack
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
print(tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print(df2, '\n')
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
A B
first second
bar one 0.220077 1.659987
two -1.504953 0.350697
baz one 1.631244 0.637820
two -0.083256 0.370784
stack compresses the column level into the index; unstack pivots a level of the index back into columns (by default the innermost level):
stacked = df2.stack()
print(stacked, '\n')
print(stacked.unstack(), '\n')
print(stacked.unstack(0), '\n')
print(stacked.unstack(1), '\n')
first second
bar one A 0.220077
B 1.659987
two A -1.504953
B 0.350697
baz one A 1.631244
B 0.637820
two A -0.083256
B 0.370784
dtype: float64
A B
first second
bar one 0.220077 1.659987
two -1.504953 0.350697
baz one 1.631244 0.637820
two -0.083256 0.370784
first bar baz
second
one A 0.220077 1.631244
B 1.659987 0.637820
two A -1.504953 -0.083256
B 0.350697 0.370784
second one two
first
bar A 0.220077 -1.504953
B 1.659987 0.350697
baz A 1.631244 -0.083256
B 0.637820 0.370784
8.2 Pivot tables
pivot_table builds a new table, using the contents of the specified columns as the row index (index) and column labels (columns) of the result:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
print(df, '\n')
print(pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']), '\n')
A B C D E
0 one A foo -0.647116 -0.383797
1 one B foo -1.009740 -0.149929
2 two C foo -0.000333 0.233550
3 three A bar 0.308752 -0.204858
4 one B bar 0.157966 -0.066707
5 one C bar 1.028976 -1.697499
6 two A foo -0.170791 -0.161074
7 three B foo 1.222346 -0.465335
8 one C foo -0.024742 -1.984697
9 one A bar -1.253050 0.675345
10 two B bar -0.124737 -0.169243
11 three C bar 0.501580 1.049995
C bar foo
A B
one A -1.253050 -0.647116
B 0.157966 -1.009740
C 1.028976 -0.024742
three A 0.308752 NaN
B NaN 1.222346
C 0.501580 NaN
two A NaN -0.170791
B -0.124737 NaN
C NaN -0.000333
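pivot_table aggregates with the mean by default, so on toy data it matches a groupby followed by unstack (a sketch; the frame below is hypothetical, not the one above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'bar'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

pivoted = pd.pivot_table(df, values='D', index='A', columns='C')
# groupby over the same keys, then unstack the inner level into columns
via_groupby = df.groupby(['A', 'C'])['D'].mean().unstack()
print(pivoted)
print(pivoted.equals(via_groupby))  # True
```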
9 Time series
resample converts a time series to a different frequency; here second-level data is summed into one-minute bins:
rng = pd.date_range('1/1/2020', periods=120, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(ts, '\n')
print(ts.resample('1Min').sum())
2020-01-01 00:00:00 349
2020-01-01 00:00:01 141
2020-01-01 00:00:02 467
2020-01-01 00:00:03 152
2020-01-01 00:00:04 491
...
2020-01-01 00:01:55 29
2020-01-01 00:01:56 383
2020-01-01 00:01:57 310
2020-01-01 00:01:58 204
2020-01-01 00:01:59 115
Freq: S, Length: 120, dtype: int32
2020-01-01 00:00:00 15490
2020-01-01 00:01:00 13744
Freq: T, dtype: int32
tz_localize attaches a time zone to a naive time series:
rng = pd.date_range('3/6/2020 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
print(ts, '\n')
ts_utc = ts.tz_localize('UTC')
print(ts_utc, '\n')
2020-03-06 -0.264762
2020-03-07 1.358651
2020-03-08 -0.316810
2020-03-09 -0.134225
2020-03-10 -0.960168
Freq: D, dtype: float64
2020-03-06 00:00:00+00:00 -0.264762
2020-03-07 00:00:00+00:00 1.358651
2020-03-08 00:00:00+00:00 -0.316810
2020-03-09 00:00:00+00:00 -0.134225
2020-03-10 00:00:00+00:00 -0.960168
Freq: D, dtype: float64
tz_convert converts a time-zone-aware series to another time zone:
print(ts_utc.tz_convert('US/Eastern'), '\n')
2020-03-05 19:00:00-05:00 -0.264762
2020-03-06 19:00:00-05:00 1.358651
2020-03-07 19:00:00-05:00 -0.316810
2020-03-08 20:00:00-04:00 -0.134225
2020-03-09 20:00:00-04:00 -0.960168
Freq: D, dtype: float64
to_period and to_timestamp convert between timestamp and period representations:
rng = pd.date_range('1/1/2020', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts, '\n')
ps = ts.to_period()
print(ps, '\n')
print(ps.to_timestamp(), '\n')
2020-01-31 -0.325968
2020-02-29 -0.989123
2020-03-31 -0.323522
2020-04-30 0.491792
2020-05-31 -0.161315
Freq: M, dtype: float64
2020-01 -0.325968
2020-02 -0.989123
2020-03 -0.323522
2020-04 0.491792
2020-05 -0.161315
Freq: M, dtype: float64
2020-01-01 -0.325968
2020-02-01 -0.989123
2020-03-01 -0.323522
2020-04-01 0.491792
2020-05-01 -0.161315
Freq: MS, dtype: float64
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
print(ts[:6], '\n')
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
print(ts.head(), '\n')
1990Q1 -0.694847
1990Q2 1.123105
1990Q3 0.395293
1990Q4 -0.517952
1991Q1 0.689373
1991Q2 0.664599
Freq: Q-NOV, dtype: float64
1990-03-01 09:00 -0.694847
1990-06-01 09:00 1.123105
1990-09-01 09:00 0.395293
1990-12-01 09:00 -0.517952
1991-03-01 09:00 0.689373
Freq: H, dtype: float64
10 Categoricals
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
print(df["grade"], '\n')
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# Note: assigning to .cat.categories was deprecated and later removed;
# rename_categories is the supported way to relabel categories.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
print(df["grade"], '\n')
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print(df["grade"], '\n')
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting respects the declared order of the categories, not lexical order, and grouping also shows empty categories:
print(df.sort_values(by="grade"), '\n')
print(df.groupby("grade").size())
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
11 Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2020', periods=1000))
ts = ts.cumsum()
ts.plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')
12 Getting data in and out
12.1 CSV
df.to_csv('foo.csv')
print(pd.read_csv('foo.csv'), '\n')
Unnamed: 0 A B C D
0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462
1 2020-01-02 2.434732 0.361986 1.523058 -0.428913
2 2020-01-03 4.390606 0.680106 0.511064 -0.722390
3 2020-01-04 4.721387 0.118680 0.446910 -0.788997
4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760
.. ... ... ... ... ...
995 2022-09-22 9.467967 9.121625 57.829448 31.631881
996 2022-09-23 8.552319 10.117774 56.513708 30.713588
997 2022-09-24 10.064727 9.115677 56.434259 32.638695
998 2022-09-25 9.974457 7.912468 56.121923 32.281864
999 2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 5 columns]
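The Unnamed: 0 column above is the saved index read back as ordinary data. A sketch of restoring it with index_col (the small frame here is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a date index, written and read back
df_small = pd.DataFrame({'A': [1, 2]}, index=pd.date_range('2020-01-01', periods=2))
df_small.to_csv('foo.csv')

# index_col=0 restores the saved index instead of producing 'Unnamed: 0';
# parse_dates turns it back into a DatetimeIndex
restored = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
print(restored)
print(list(restored.columns))  # ['A']
```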
12.2 HDF5
df.to_hdf('foo.h5', 'df')
print(pd.read_hdf('foo.h5', 'df'), '\n')
A B C D
2020-01-01 0.674980 0.021207 -0.984289 -0.261462
2020-01-02 2.434732 0.361986 1.523058 -0.428913
2020-01-03 4.390606 0.680106 0.511064 -0.722390
2020-01-04 4.721387 0.118680 0.446910 -0.788997
2020-01-05 5.966886 -1.940727 1.120733 -0.169760
... ... ... ... ...
2022-09-22 9.467967 9.121625 57.829448 31.631881
2022-09-23 8.552319 10.117774 56.513708 30.713588
2022-09-24 10.064727 9.115677 56.434259 32.638695
2022-09-25 9.974457 7.912468 56.121923 32.281864
2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 4 columns]
12.3 Excel
df.to_excel('foo.xlsx', sheet_name='Sheet1')
print(pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']), '\n')
Unnamed: 0 A B C D
0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462
1 2020-01-02 2.434732 0.361986 1.523058 -0.428913
2 2020-01-03 4.390606 0.680106 0.511064 -0.722390
3 2020-01-04 4.721387 0.118680 0.446910 -0.788997
4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760
.. ... ... ... ... ...
995 2022-09-22 9.467967 9.121625 57.829448 31.631881
996 2022-09-23 8.552319 10.117774 56.513708 30.713588
997 2022-09-24 10.064727 9.115677 56.434259 32.638695
998 2022-09-25 9.974457 7.912468 56.121923 32.281864
999 2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 5 columns]
13 Gotchas
Using a Series in a boolean context raises an error, because its truth value is ambiguous:
if pd.Series([False, True, False]):
    print("I was true")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-210-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
2 print("I was true")
D:\Applications\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
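A minimal sketch of the explicit alternatives the error message suggests:

```python
import pandas as pd

s = pd.Series([False, True, False])

# Reduce the Series to a single boolean explicitly
if s.any():        # is at least one element True?
    print("at least one True")
if not s.all():    # are all elements True?
    print("not all True")
print(s.empty)     # False: the Series is non-empty
```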