十分钟入门Pandas

原文:pandas官方文档 10 minutes to pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

1 创建对象

1.1 Series

  • Series是一个列向量,它仅包含一列数据及其索引,通过向Series中传递一个列表来创建:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print (s)
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

1.2 DataFrame

  • DataFrame是一个矩阵,通过传递一个Numpy数组,以日期作为行索引,以’ABCD’作为列索引来创建:
dates = pd.date_range('20191214', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print (df)
                   A         B         C         D
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-16  0.366885 -1.617911  0.438302  1.124135
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345
2019-12-19  1.048606  0.984771 -0.279106  0.592285
  • 通过传递一个字典来创建DataFrame,每个value都会被转化成一个 Series:
df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
print (df2)
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
#DataFrame每一列的数据类型都不同:
print (df2.dtypes)
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
  • 所以我们可以认为,DataFrame是由一列一列的Series拼成的

2 查看数据

headtail查看开始和结尾几行的数据,用indexcolumns查看行索引和列索引,用vlaues显示不带索引的内容:

print (df.head(), '\n')
print (df.tail(3), '\n')
print (df.index, '\n')
print (df.columns, '\n')
print (df.values, '\n')
                   A         B         C         D
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-16  0.366885 -1.617911  0.438302  1.124135
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345 

                   A         B         C         D
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345
2019-12-19  1.048606  0.984771 -0.279106  0.592285 

DatetimeIndex(['2019-12-14', '2019-12-15', '2019-12-16', '2019-12-17',
               '2019-12-18', '2019-12-19'],
              dtype='datetime64[ns]', freq='D') 

Index(['A', 'B', 'C', 'D'], dtype='object') 

[[ 0.61656893  0.05140077  1.12140918  0.30726721]
 [ 0.48399521 -1.25127827 -1.77202211  1.2826497 ]
 [ 0.36688539 -1.61791122  0.43830175  1.12413452]
 [ 0.71737246  0.23322127  1.83916745 -1.53579412]
 [ 1.75385779  0.26611012 -0.33393125  0.281345  ]
 [ 1.04860622  0.98477115 -0.27910636  0.59228493]] 

  • Numpy arrays整个数组只有一个数据类型,而DataFrame每一列一个数据类型。所以在将df2转化为numpy的时候,每列的数据类型统一为object:
print (df.to_numpy())
df2.to_numpy()
[[ 0.61656893  0.05140077  1.12140918  0.30726721]
 [ 0.48399521 -1.25127827 -1.77202211  1.2826497 ]
 [ 0.36688539 -1.61791122  0.43830175  1.12413452]
 [ 0.71737246  0.23322127  1.83916745 -1.53579412]
 [ 1.75385779  0.26611012 -0.33393125  0.281345  ]
 [ 1.04860622  0.98477115 -0.27910636  0.59228493]]

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
  • <kbdDescribe显示每一列数据的统计概要:
print(df.describe())
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.831214 -0.222281  0.168970  0.341981
std    0.508670  0.998449  1.264150  1.008979
min    0.366885 -1.617911 -1.772022 -1.535794
25%    0.517139 -0.925609 -0.320225  0.287826
50%    0.666971  0.142311  0.079598  0.449776
75%    0.965798  0.257888  0.950632  0.991172
max    1.753858  0.984771  1.839167  1.282650
  • T表示矩阵的转置:
print (df.T)
   2019-12-14  2019-12-15  2019-12-16  2019-12-17  2019-12-18  2019-12-19
A    0.616569    0.483995    0.366885    0.717372    1.753858    1.048606
B    0.051401   -1.251278   -1.617911    0.233221    0.266110    0.984771
C    1.121409   -1.772022    0.438302    1.839167   -0.333931   -0.279106
D    0.307267    1.282650    1.124135   -1.535794    0.281345    0.592285
  • sort_index按照轴排序,axis=1表示第二个维度,即按照列索引排序:
print (df.sort_index(axis=1, ascending=False))
                   D         C         B         A
2019-12-14  0.307267  1.121409  0.051401  0.616569
2019-12-15  1.282650 -1.772022 -1.251278  0.483995
2019-12-16  1.124135  0.438302 -1.617911  0.366885
2019-12-17 -1.535794  1.839167  0.233221  0.717372
2019-12-18  0.281345 -0.333931  0.266110  1.753858
2019-12-19  0.592285 -0.279106  0.984771  1.048606
  • sort_values按照值排序,by='B’表示对第二列排序:
print (df.sort_values(by='B'))
                   A         B         C         D
2019-12-16  0.366885 -1.617911  0.438302  1.124135
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345
2019-12-19  1.048606  0.984771 -0.279106  0.592285

3 选择

3.1 获取行、列

  • 与numpy类似,DataFrame也可以用[]进行选择:
print (df['A'], '\n')
print (df[0:3], '\n') #行切片
print (df['20191216':'20191219'], '\n')
2019-12-14    0.616569
2019-12-15    0.483995
2019-12-16    0.366885
2019-12-17    0.717372
2019-12-18    1.753858
2019-12-19    1.048606
Freq: D, Name: A, dtype: float64 

                   A         B         C         D
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-16  0.366885 -1.617911  0.438302  1.124135 

                   A         B         C         D
2019-12-16  0.366885 -1.617911  0.438302  1.124135
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345
2019-12-19  1.048606  0.984771 -0.279106  0.592285 

  • 通过label进行选择
print (df.loc[dates[0]], '\n') 
print (df.loc[:, ['A', 'B']], '\n')
print (df.loc['20191216':'20191218', ['A', 'B']], '\n')
print (df.at[dates[0], 'A'], '\n') #选择单个数据时用at,速度更快
A    0.616569
B    0.051401
C    1.121409
D    0.307267
Name: 2019-12-14 00:00:00, dtype: float64 

                   A         B
2019-12-14  0.616569  0.051401
2019-12-15  0.483995 -1.251278
2019-12-16  0.366885 -1.617911
2019-12-17  0.717372  0.233221
2019-12-18  1.753858  0.266110
2019-12-19  1.048606  0.984771 

                   A         B
2019-12-16  0.366885 -1.617911
2019-12-17  0.717372  0.233221
2019-12-18  1.753858  0.266110 

0.6165689271991402 

  • 通过整数下标进行选择
print (df.iloc[3], '\n')
print (df.iloc[3:5, 0:2], '\n')
print (df.iloc[[1,2,4], [0,2]], '\n')
print (df.iloc[1:3, :], '\n')
print (df.iat[1,1], '\n')
A    0.717372
B    0.233221
C    1.839167
D   -1.535794
Name: 2019-12-17 00:00:00, dtype: float64 

                   A         B
2019-12-17  0.717372  0.233221
2019-12-18  1.753858  0.266110 

                   A         C
2019-12-15  0.483995 -1.772022
2019-12-16  0.366885  0.438302
2019-12-18  1.753858 -0.333931 

                   A         B         C         D
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-16  0.366885 -1.617911  0.438302  1.124135 

-1.2512782747247186 

  • 通过布尔值下标进行选择
print (df[df.A > 0], '\n')    #可以用df.A来表示
print (df[df > 0], '\n')
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print (df2, '\n')
print (df2[df2['E'].isin(['two', 'four'])]) #用isin来过滤
                   A         B         C         D
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-15  0.483995 -1.251278 -1.772022  1.282650
2019-12-16  0.366885 -1.617911  0.438302  1.124135
2019-12-17  0.717372  0.233221  1.839167 -1.535794
2019-12-18  1.753858  0.266110 -0.333931  0.281345
2019-12-19  1.048606  0.984771 -0.279106  0.592285 

                   A         B         C         D
2019-12-14  0.616569  0.051401  1.121409  0.307267
2019-12-15  0.483995       NaN       NaN  1.282650
2019-12-16  0.366885       NaN  0.438302  1.124135
2019-12-17  0.717372  0.233221  1.839167       NaN
2019-12-18  1.753858  0.266110       NaN  0.281345
2019-12-19  1.048606  0.984771       NaN  0.592285 

                   A         B         C         D      E
2019-12-14  0.616569  0.051401  1.121409  0.307267    one
2019-12-15  0.483995 -1.251278 -1.772022  1.282650    one
2019-12-16  0.366885 -1.617911  0.438302  1.124135    two
2019-12-17  0.717372  0.233221  1.839167 -1.535794  three
2019-12-18  1.753858  0.266110 -0.333931  0.281345   four
2019-12-19  1.048606  0.984771 -0.279106  0.592285  three 

                   A         B         C         D     E
2019-12-16  0.366885 -1.617911  0.438302  1.124135   two
2019-12-18  1.753858  0.266110 -0.333931  0.281345  four

3.2 设置行、列

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191215', periods=6))
print (s1, '\n')

df['F'] = s1
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
print (df)

df2 = df.copy()
df2[df2 > 0] = -df2
print (df2)
2019-12-15    1
2019-12-16    2
2019-12-17    3
2019-12-18    4
2019-12-19    5
2019-12-20    6
Freq: D, dtype: int64 

                   A         B         C  D    F
2019-12-14  0.000000  0.000000  1.121409  5  NaN
2019-12-15  0.483995 -1.251278 -1.772022  5  1.0
2019-12-16  0.366885 -1.617911  0.438302  5  2.0
2019-12-17  0.717372  0.233221  1.839167  5  3.0
2019-12-18  1.753858  0.266110 -0.333931  5  4.0
2019-12-19  1.048606  0.984771 -0.279106  5  5.0
                   A         B         C  D    F
2019-12-14  0.000000  0.000000 -1.121409 -5  NaN
2019-12-15 -0.483995 -1.251278 -1.772022 -5 -1.0
2019-12-16 -0.366885 -1.617911 -0.438302 -5 -2.0
2019-12-17 -0.717372 -0.233221 -1.839167 -5 -3.0
2019-12-18 -1.753858 -0.266110 -0.333931 -5 -4.0
2019-12-19 -1.048606 -0.984771 -0.279106 -5 -5.0

4 缺失值

  • pandas用np.nan表示缺失值,通常它不会被计算。
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print (df1, '\n')
print (df1.dropna(how='any'), '\n') #删除所有存在缺失值的行
print (df1.fillna(value=5), '\n') #将所有缺失值用5填充
print (pd.isna(df1)) #获取所有位置是否是缺失值的boolean值
                   A         B         C  D    F    E
2019-12-14  0.000000  0.000000  1.121409  5  NaN  1.0
2019-12-15  0.483995 -1.251278 -1.772022  5  1.0  1.0
2019-12-16  0.366885 -1.617911  0.438302  5  2.0  NaN
2019-12-17  0.717372  0.233221  1.839167  5  3.0  NaN 

                   A         B         C  D    F    E
2019-12-15  0.483995 -1.251278 -1.772022  5  1.0  1.0 

                   A         B         C  D    F    E
2019-12-14  0.000000  0.000000  1.121409  5  5.0  1.0
2019-12-15  0.483995 -1.251278 -1.772022  5  1.0  1.0
2019-12-16  0.366885 -1.617911  0.438302  5  2.0  5.0
2019-12-17  0.717372  0.233221  1.839167  5  3.0  5.0 

                A      B      C      D      F      E
2019-12-14  False  False  False  False   True  False
2019-12-15  False  False  False  False  False  False
2019-12-16  False  False  False  False  False   True
2019-12-17  False  False  False  False  False   True

4 操作

4.1 统计

  • 通常,操作都会把 NaN 排除在外
print (df.mean(), '\n')
print (df.mean(1),'\n') #对列求均值
print (pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2), '\n') #将所有数据下移两位
A    0.728453
B   -0.230848
C    0.168970
D    5.000000
F    3.000000
dtype: float64 

2019-12-14    1.530352
2019-12-15    0.692139
2019-12-16    1.237455
2019-12-17    2.157952
2019-12-18    2.137207
2019-12-19    2.350854
Freq: D, dtype: float64 

2019-12-14    NaN
2019-12-15    NaN
2019-12-16    1.0
2019-12-17    3.0
2019-12-18    5.0
2019-12-19    NaN
Freq: D, dtype: float64 

4.2 Apply

  • 对数据apply一个函数
print (df, '\n')
print (df.apply(np.cumsum), '\n') #cumsum表示每行逐次累加
print (df.apply(lambda x: x.max()-x.min())) #每一列的最大值减去最小值
                   A         B         C  D    F
2019-12-14  0.000000  0.000000  1.121409  5  NaN
2019-12-15  0.483995 -1.251278 -1.772022  5  1.0
2019-12-16  0.366885 -1.617911  0.438302  5  2.0
2019-12-17  0.717372  0.233221  1.839167  5  3.0
2019-12-18  1.753858  0.266110 -0.333931  5  4.0
2019-12-19  1.048606  0.984771 -0.279106  5  5.0 

                   A         B         C   D     F
2019-12-14  0.000000  0.000000  1.121409   5   NaN
2019-12-15  0.483995 -1.251278 -0.650613  10   1.0
2019-12-16  0.850881 -2.869189 -0.212311  15   3.0
2019-12-17  1.568253 -2.635968  1.626856  20   6.0
2019-12-18  3.322111 -2.369858  1.292925  25  10.0
2019-12-19  4.370717 -1.385087  1.013819  30  15.0 

A    1.753858
B    2.602682
C    3.611190
D    0.000000
F    4.000000
dtype: float64

4.3 直方图

s = pd.Series(np.random.randint(0, 7, size=10))
print (s, '\n')
print (s.value_counts())
0    4
1    3
2    3
3    6
4    4
5    5
6    2
7    2
8    3
9    0
dtype: int32 

3    3
4    2
2    2
6    1
5    1
0    1
dtype: int64

4.4 字符串函数

  • Series在str属性中自带了很多字符串函数:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print (s.str.lower(), '\n')
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object 

5 合并

5.1 Concat

  • 简单地按行拼接
df = pd.DataFrame(np.random.randn(10, 4))
print (df, '\n')

#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

print (pd.concat(pieces))
          0         1         2         3
0  1.434680 -0.776149 -1.218836  2.397118
1 -1.632868 -0.806080  0.634619 -0.760958
2  1.028005 -0.020240  0.696301  0.090692
3 -1.516807  0.952234  0.758215 -0.272409
4  0.540413  0.258468 -0.563238 -1.380825
5  0.021616  0.097505 -1.052843  0.272892
6 -0.957512  0.088345 -0.972309 -1.224522
7  0.211805  1.674883 -0.919839 -0.159892
8  1.651509  0.908129 -0.416815 -0.629014
9 -0.616352  0.160981  1.059860 -0.521867 

          0         1         2         3
0  1.434680 -0.776149 -1.218836  2.397118
1 -1.632868 -0.806080  0.634619 -0.760958
2  1.028005 -0.020240  0.696301  0.090692
3 -1.516807  0.952234  0.758215 -0.272409
4  0.540413  0.258468 -0.563238 -1.380825
5  0.021616  0.097505 -1.052843  0.272892
6 -0.957512  0.088345 -0.972309 -1.224522
7  0.211805  1.674883 -0.919839 -0.159892
8  1.651509  0.908129 -0.416815 -0.629014
9 -0.616352  0.160981  1.059860 -0.521867

5.2 Join

  • SQL样式的合并方法
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print (left, '\n')
print (right, '\n')
print (pd.merge(left, right, on='key'))
   key  lval
0  foo     1
1  foo     2 

   key  rval
0  foo     4
1  foo     5 

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

5.3 Append

  • 合并行为DataFrame
df = pd.DataFrame(np.random.randn(8,4), columns=['A', 'B', 'C', 'D'])
print (df, '\n')
s =df.iloc[3]
print (df.append(s, ignore_index=True))
          A         B         C         D
0  0.481732  1.270100  0.518214  0.028305
1  0.004151  0.602318  1.274942 -2.683102
2  0.130687 -0.402626  0.701901 -1.182579
3 -1.047072  0.656017  0.468616 -0.694842
4 -0.097726  1.669621  0.901135  1.337648
5  0.493724 -0.865273 -1.390465  1.143148
6 -0.691091  0.460376  0.540177 -0.205331
7 -0.634427  2.625610  0.067031 -2.041728 

          A         B         C         D
0  0.481732  1.270100  0.518214  0.028305
1  0.004151  0.602318  1.274942 -2.683102
2  0.130687 -0.402626  0.701901 -1.182579
3 -1.047072  0.656017  0.468616 -0.694842
4 -0.097726  1.669621  0.901135  1.337648
5  0.493724 -0.865273 -1.390465  1.143148
6 -0.691091  0.460376  0.540177 -0.205331
7 -0.634427  2.625610  0.067031 -2.041728
8 -1.047072  0.656017  0.468616 -0.694842

6 Grouping

与SQL中的group by类似,包含以下步骤:

  • 根据规则将数据分组
  • 将函数独立的应用到每个分组
  • 将结果合并
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
print (df, '\n')
print (df.groupby('A').sum(), '\n')
print (df.groupby(['A', 'B']).sum())
     A      B         C         D
0  foo    one  0.182951 -0.346016
1  bar    one  1.906573 -0.386902
2  foo    two  0.102443 -1.087078
3  bar  three  1.023196  0.467237
4  foo    two -0.940578 -0.414947
5  bar    two -1.075379 -0.844441
6  foo    one  0.514689  0.038393
7  foo  three -2.136912 -1.001738 

            C         D
A                      
bar  1.854390 -0.764105
foo -2.277406 -2.811385 

                  C         D
A   B                        
bar one    1.906573 -0.386902
    three  1.023196  0.467237
    two   -1.075379 -0.844441
foo one    0.697640 -0.307622
    three -2.136912 -1.001738
    two   -0.838134 -1.502025

7 Reshaping

7.1 Stack层叠

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
print (tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print (df2, '\n')
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
                     A         B
first second                    
bar   one     0.220077  1.659987
      two    -1.504953  0.350697
baz   one     1.631244  0.637820
      two    -0.083256  0.370784 

stacked = df2.stack()
print (stacked, '\n')

print(stacked.unstack(), '\n')
print(stacked.unstack(0), '\n')
print(stacked.unstack(1), '\n')
first  second   
bar    one     A    0.220077
               B    1.659987
       two     A   -1.504953
               B    0.350697
baz    one     A    1.631244
               B    0.637820
       two     A   -0.083256
               B    0.370784
dtype: float64 

                     A         B
first second                    
bar   one     0.220077  1.659987
      two    -1.504953  0.350697
baz   one     1.631244  0.637820
      two    -0.083256  0.370784 

first          bar       baz
second                      
one    A  0.220077  1.631244
       B  1.659987  0.637820
two    A -1.504953 -0.083256
       B  0.350697  0.370784 

second        one       two
first                      
bar   A  0.220077 -1.504953
      B  1.659987  0.350697
baz   A  1.631244 -0.083256
      B  0.637820  0.370784 

7.2 Pivot tables

  • pivot是把指定索引所对应的内容,作为新表的行(index)、列(columns)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)})
print (df, '\n')

print (pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']), '\n')
        A  B    C         D         E
0     one  A  foo -0.647116 -0.383797
1     one  B  foo -1.009740 -0.149929
2     two  C  foo -0.000333  0.233550
3   three  A  bar  0.308752 -0.204858
4     one  B  bar  0.157966 -0.066707
5     one  C  bar  1.028976 -1.697499
6     two  A  foo -0.170791 -0.161074
7   three  B  foo  1.222346 -0.465335
8     one  C  foo -0.024742 -1.984697
9     one  A  bar -1.253050  0.675345
10    two  B  bar -0.124737 -0.169243
11  three  C  bar  0.501580  1.049995 

C             bar       foo
A     B                    
one   A -1.253050 -0.647116
      B  0.157966 -1.009740
      C  1.028976 -0.024742
three A  0.308752       NaN
      B       NaN  1.222346
      C  0.501580       NaN
two   A       NaN -0.170791
      B -0.124737       NaN
      C       NaN -0.000333 

8 Time series

rng = pd.date_range('1/1/2020', periods=120, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print (ts, '\n')
print (ts.resample('1Min').sum()) #resapmle重采样,每隔一分钟为index,对值进行求和
2020-01-01 00:00:00    349
2020-01-01 00:00:01    141
2020-01-01 00:00:02    467
2020-01-01 00:00:03    152
2020-01-01 00:00:04    491
                      ... 
2020-01-01 00:01:55     29
2020-01-01 00:01:56    383
2020-01-01 00:01:57    310
2020-01-01 00:01:58    204
2020-01-01 00:01:59    115
Freq: S, Length: 120, dtype: int32 

2020-01-01 00:00:00    15490
2020-01-01 00:01:00    13744
Freq: T, dtype: int32
# 时区表示
rng = pd.date_range('3/6/2020 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
print (ts, '\n')
ts_utc = ts.tz_localize('UTC')
print (ts_utc, '\n')
2020-03-06   -0.264762
2020-03-07    1.358651
2020-03-08   -0.316810
2020-03-09   -0.134225
2020-03-10   -0.960168
Freq: D, dtype: float64 

2020-03-06 00:00:00+00:00   -0.264762
2020-03-07 00:00:00+00:00    1.358651
2020-03-08 00:00:00+00:00   -0.316810
2020-03-09 00:00:00+00:00   -0.134225
2020-03-10 00:00:00+00:00   -0.960168
Freq: D, dtype: float64 

#转换为另一个时区
print (ts_utc.tz_convert('US/Eastern'), '\n')
2020-03-05 19:00:00-05:00   -0.264762
2020-03-06 19:00:00-05:00    1.358651
2020-03-07 19:00:00-05:00   -0.316810
2020-03-08 20:00:00-04:00   -0.134225
2020-03-09 20:00:00-04:00   -0.960168
Freq: D, dtype: float64 

#在时间跨度的表示之间进行转换
rng = pd.date_range('1/1/2020', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print (ts, '\n')
ps = ts.to_period()
print (ps, '\n')
print (ps.to_timestamp(),'\n')
2020-01-31   -0.325968
2020-02-29   -0.989123
2020-03-31   -0.323522
2020-04-30    0.491792
2020-05-31   -0.161315
Freq: M, dtype: float64 

2020-01   -0.325968
2020-02   -0.989123
2020-03   -0.323522
2020-04    0.491792
2020-05   -0.161315
Freq: M, dtype: float64 

2020-01-01   -0.325968
2020-02-01   -0.989123
2020-03-01   -0.323522
2020-04-01    0.491792
2020-05-01   -0.161315
Freq: MS, dtype: float64 

#在period和timestamp之间进行转换可以使用一些方便的算术函数:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
print (ts[:6], '\n')
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
print (ts.head(), '\n')
1990Q1   -0.694847
1990Q2    1.123105
1990Q3    0.395293
1990Q4   -0.517952
1991Q1    0.689373
1991Q2    0.664599
Freq: Q-NOV, dtype: float64 

1990-03-01 09:00   -0.694847
1990-06-01 09:00    1.123105
1990-09-01 09:00    0.395293
1990-12-01 09:00   -0.517952
1991-03-01 09:00    0.689373
Freq: H, dtype: float64 

9 Categoricals

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

#将raw_grade转化为categorical的数据类型
df["grade"] = df["raw_grade"].astype("category")
print (df["grade"], '\n')
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e] 

df["grade"].cat.categories = ["very good", "good", "very bad"]
print (df["grade"], '\n')

#重新排序类别,同时添加缺少的类别;新类别按照旧类别的顺序匹配
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print (df["grade"], '\n')
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad] 

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good] 

#排序是按类别中的顺序进行的,而不是词汇顺序。
print (df.sort_values(by="grade"), '\n')

print (df.groupby("grade").size())
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good 

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

10 Plotting

ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
ts = ts.cumsum()
ts.plot()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-lGmpFdQv-1576467757700)(output_66_1.png)]

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best') #best表示自动分配最佳位置

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-h0f9J79v-1576467757701)(output_67_2.png)]

11 读取、写入数据

11.1 CSV

#写入
df.to_csv('foo.csv')

#读取
print (pd.read_csv('foo.csv'), '\n')
     Unnamed: 0          A          B          C          D
0    2020-01-01   0.674980   0.021207  -0.984289  -0.261462
1    2020-01-02   2.434732   0.361986   1.523058  -0.428913
2    2020-01-03   4.390606   0.680106   0.511064  -0.722390
3    2020-01-04   4.721387   0.118680   0.446910  -0.788997
4    2020-01-05   5.966886  -1.940727   1.120733  -0.169760
..          ...        ...        ...        ...        ...
995  2022-09-22   9.467967   9.121625  57.829448  31.631881
996  2022-09-23   8.552319  10.117774  56.513708  30.713588
997  2022-09-24  10.064727   9.115677  56.434259  32.638695
998  2022-09-25   9.974457   7.912468  56.121923  32.281864
999  2022-09-26   8.935891   8.412897  58.160465  32.012264

[1000 rows x 5 columns] 

11.2 HDF5

#写入
df.to_hdf('foo.h5', 'df')

#读取
print (pd.read_hdf('foo.h5', 'df'), '\n')
                    A          B          C          D
2020-01-01   0.674980   0.021207  -0.984289  -0.261462
2020-01-02   2.434732   0.361986   1.523058  -0.428913
2020-01-03   4.390606   0.680106   0.511064  -0.722390
2020-01-04   4.721387   0.118680   0.446910  -0.788997
2020-01-05   5.966886  -1.940727   1.120733  -0.169760
...               ...        ...        ...        ...
2022-09-22   9.467967   9.121625  57.829448  31.631881
2022-09-23   8.552319  10.117774  56.513708  30.713588
2022-09-24  10.064727   9.115677  56.434259  32.638695
2022-09-25   9.974457   7.912468  56.121923  32.281864
2022-09-26   8.935891   8.412897  58.160465  32.012264

[1000 rows x 4 columns] 

11.3 Excel

#写入
df.to_excel('foo.xlsx', sheet_name='Sheet1')

#读取
print (pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']), '\n')
    Unnamed: 0          A          B          C          D
0   2020-01-01   0.674980   0.021207  -0.984289  -0.261462
1   2020-01-02   2.434732   0.361986   1.523058  -0.428913
2   2020-01-03   4.390606   0.680106   0.511064  -0.722390
3   2020-01-04   4.721387   0.118680   0.446910  -0.788997
4   2020-01-05   5.966886  -1.940727   1.120733  -0.169760
..         ...        ...        ...        ...        ...
995 2022-09-22   9.467967   9.121625  57.829448  31.631881
996 2022-09-23   8.552319  10.117774  56.513708  30.713588
997 2022-09-24  10.064727   9.115677  56.434259  32.638695
998 2022-09-25   9.974457   7.912468  56.121923  32.281864
999 2022-09-26   8.935891   8.412897  58.160465  32.012264

[1000 rows x 5 columns] 

12 注意

  • 不能直接把返回值当作布尔值:
if pd.Series([False, True, False]):
print("I was true")
---------------------------------------------------------------------------
   
ValueError                                Traceback (most recent call last)

<ipython-input-210-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")
   
D:\Applications\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1553             "The truth value of a {0} is ambiguous. "
   1554             "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555                 self.__class__.__name__
   1556             )
   1557         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
文章作者: Alston
文章链接: https://lizitong67.github.io/2020/02/21/%E5%8D%81%E5%88%86%E9%92%9F%E5%85%A5%E9%97%A8Pandas/
版权声明: 本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 Alston's blog