Method          Description
count           Number of non-NA values
describe        Summary statistics for each column
min, max        Minimum and maximum values
argmin, argmax  Integer positions of the minimum and maximum values
idxmin, idxmax  Index labels of the minimum and maximum values
quantile        Sample quantile
sum, mean       Column sum and column mean
median          Median
mad             Mean absolute deviation from the mean
var, std        Variance and standard deviation
skew            Skewness (third moment)
kurt            Kurtosis (fourth moment)
cumsum          Cumulative sum
cummin, cummax  Cumulative minimum and cumulative maximum
cumprod         Cumulative product
diff            First-order difference
pct_change      Percentage change
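
A minimal sketch of a few of these methods on a small, made-up DataFrame (column names and values are only for illustration):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [40, 30, 20, 10]})
    print(df.sum())       # column sums
    print(df.mean())      # column means
    print(df.idxmax())    # index label of each column's maximum
    print(df.cumsum())    # running total down each column
    print(df.describe())  # count, mean, std, min, quartiles, max
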
    df[df.isnull()]   # isnull() flags NaN/None values and returns a boolean object
    df[df.notnull()]

    # dropna(): filter out missing data
    # df3.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
    df.dropna()                  # drop every row that contains a NaN
    df.dropna(axis=1, thresh=3)  # drop columns that have fewer than 3 non-NaN values
    df.dropna(how='all')         # drop rows whose values are all NaN
    # for a Series, df.dropna() and data[data.notnull()] give the same result

    # fillna(): fill in missing data
    # method='ffill' fills forward with the previous value, method='bfill' with the next value
    # axis=0 fills down the rows (within each column), axis=1 fills across the columns (within each row)
    df3.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
    df.fillna({1: 0, 2: 0.5})   # fill NaN in column 1 with 0 and in column 2 with 0.5
    df.fillna(method='ffill')   # fill each NaN with the previous value in its column
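
A small, self-contained sketch of dropna() and fillna() in action (the frame below is made up for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                       'b': [np.nan, np.nan, 6.0]})
    print(df.dropna())                # keep only rows without any NaN
    print(df.dropna(how='all'))       # drop only rows that are entirely NaN
    print(df.fillna(0))               # replace every NaN with 0
    print(df.fillna(method='ffill'))  # fill each NaN with the previous value in its column
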
    print(frame.drop(['a']))
    print(frame.drop(['Ohio'], axis=1))

By default, drop removes rows; to remove columns you need axis=1.

With the drop method, the following three expressions are equivalent:

    DF = DF.drop('column_name', axis=1)
    DF.drop('column_name', axis=1, inplace=True)
    DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)

Note: methods that would otherwise return a modified copy of the original object usually offer an optional inplace parameter. If you set it to True (the default is False), the original object is replaced directly. In other words, with inplace=True the data bound to the original name is changed in place (as in expressions 2 and 3 above);

with inplace=False the original object is left untouched, and you have to assign the returned result to a new name or rebind the original name to it (as in expression 1).
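
A short sketch contrasting the two behaviours (the frame and column names are assumed for illustration):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

    out = df.drop('c', axis=1)          # inplace=False (default): df itself is unchanged
    print(df.columns.tolist())          # ['a', 'b', 'c']

    df.drop('c', axis=1, inplace=True)  # inplace=True: df is modified directly
    print(df.columns.tolist())          # ['a', 'b']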

    df['Name'] = df['Name'].astype(np.datetime64)

The DataFrame.astype() method converts an entire DataFrame, or a single column, to another data type; both Python and NumPy data types are supported.
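
A minimal sketch of astype() (the column names and target dtypes are assumptions for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'id': ['1', '2', '3'], 'score': [1.5, 2.5, 3.5]})
    df['id'] = df['id'].astype(np.int64)  # convert a single column from str to int64
    df = df.astype('float64')             # or convert the whole DataFrame at once
    print(df.dtypes)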

df.duplicated() returns a boolean Series; duplicated rows are marked True.

df.drop_duplicates() removes the duplicated rows, i.e. the rows marked True.

Parameters

  • subset : column label or sequence of labels, optional
    Only consider these columns when identifying duplicates; by default all columns are used.
  • keep : {'first', 'last', False}, default 'first'
    Which occurrence to keep: 'first' keeps the first occurrence, 'last' keeps the last, and False drops all duplicates.
  • inplace : boolean, default False
    Whether to modify the data in place or return a copy.
    # check for duplicate rows
    data.duplicated()
    # remove duplicate rows
    data.drop_duplicates()
    # check a specific column for duplicates, then remove the duplicate rows
    data.drop_duplicates(['key1'])

    df = DataFrame({'color':['white','white','red','red','white'],
                    'value':[2,1,3,3,2]})
    display(df, df.duplicated(), df.drop_duplicates())
    # Output:
    # color value
    # 0 white 2
    # 1 white 1
    # 2 red 3
    # 3 red 3
    # 4 white 2
    # 0 False
    # 1 False
    # 2 False
    # 3 True
    # 4 True
    # dtype: bool
    # color value
    # 0 white 2
    # 1 white 1
    # 2 red 3
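
A short sketch of the subset and keep parameters (the frame is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'color': ['white', 'white', 'red'],
                       'value': [2, 1, 3]})
    print(df.drop_duplicates(subset=['color']))               # keep the first row of each color
    print(df.drop_duplicates(subset=['color'], keep='last'))  # keep the last occurrence instead
    print(df.drop_duplicates(subset=['color'], keep=False))   # drop every duplicated color entirely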

1 replace(): substitute values, replace({old: new})

    df = DataFrame({'item':['ball','mug','pen'],
                    'color':['white','rosso','verde'],
                    'price':[5.56,4.20,1.30]})
    newcolors = {'rosso':'red','verde':'green'}
    display(df, df.replace(newcolors))
    # Output:
    # color item price
    # 0 white ball 5.56
    # 1 rosso mug 4.20
    # 2 verde pen 1.30
    # color item price
    # 0 white ball 5.56
    # 1 red mug 4.20
    # 2 green pen 1.30

replace() is also often used to replace NaN values:

    df2 = DataFrame({'math':[100,139,np.nan],'English':[146,None,119]}, index=['张三','李四','Tom'])
    newvalues = {np.nan:100}
    display(df2, df2.replace(newvalues))
    # Output:
    # English math
    # 张三 146.0 100.0
    # 李四 NaN 139.0
    # Tom 119.0 NaN
    # English math
    # 张三 146.0 100.0
    # 李四 100.0 139.0
    # Tom 119.0 100.0

2 The map() function: create a new column

Built-in form: map(function, iterable); the Series method takes map(function) or map({key: value}).

The function passed to map should return a single value for each element, not an iterable.

    df3 = DataFrame({'color':['red','green','blue'],'project':['math','english','chemistry']})
    price = {'red':5.56,'green':3.14,'chemistry':2.79}
    df3['price'] = df3['color'].map(price)
    display(df3)
    # Output:
    # color project price
    # 0 red math 5.56
    # 1 green english 3.14
    # 2 blue chemistry NaN

    df3 = DataFrame({'zs':[129,130,34],'ls':[136,98,8]}, index=['张三','李四','倩倩'])
    display(df3)
    display(df3['zs'].map({129:'你好',130:'非常好',34:'不错'}))
    display(df3['zs'].map({129:120}))

    def mapscore(score):
        if score < 90:
            return 'failed'
        elif score > 120:
            return 'excellent'
        else:
            return 'pass'

    df3['status'] = df3['zs'].map(mapscore)
    df3
    # Output:
    # zs ls
    # 张三 129 136
    # 李四 130 98
    # 倩倩 34 8
    # 张三 你好
    # 李四 非常好
    # 倩倩 不错
    # Name: zs, dtype: object
    # 张三 120.0
    # 李四 NaN
    # 倩倩 NaN
    # Name: zs, dtype: float64
    # ls zs status
    # 张三 136 129 excellent
    # 李四 98 130 excellent
    # 倩倩 8 34 failed

3 The rename() function: relabel the index, rename({old: new})

    df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
    new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
    display(df4, df4.rename(new_index))
    # Output:
    # color value
    # 0 white 2
    # 1 gray 0
    # 2 purple 9
    # 3 blue 2
    # 4 green 0
    # color value
    # first white 2
    # two gray 0
    # three purple 9
    # four blue 2
    # five green 0
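
rename() can relabel columns as well as the index; a minimal sketch (column names are assumed):

    import pandas as pd

    df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]})
    print(df.rename(columns={'c1': 'price', 'c2': 'weight'}))  # relabel columns instead of the index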

1 Use describe() to view descriptive statistics for each column.

    df = DataFrame(np.random.randint(10, size=10))
    display(df.describe())
    # Output:
    # 0
    # count 10.000000
    # mean 5.900000
    # std 2.685351
    # min 1.000000
    # 25% 6.000000
    # 50% 7.000000
    # 75% 7.750000
    # max 8.000000

2 Use std() to get the standard deviation of each column of a DataFrame.

    df.std()
    # Output:
    # 0 3.306559
    # dtype: float64

3 Filter the DataFrame based on each column's standard deviation.
With any(axis=1), the condition is checked across the columns of each row, and a row is selected if any of its values meets the condition.

    display(df[(df > df.std()*3).any(axis=1)])
    df.drop(df[(np.abs(df) > (3*df.std())).any(axis=1)].index, inplace=True)
    display(df, df.shape)
    # Output:
    # 0 1
    # 2 7 9
    # 6 8 8
    # 9 8 1
    # 0 1
    # 0 5 0
    # 1 3 3
    # 3 3 5
    # 4 2 4
    # 5 7 6
    # 7 1 6
    # 8 7 7
    # (7, 2)

Use take() to reorder rows.
Combined with np.random.permutation(), it gives a random shuffle of the rows.

    df5 = DataFrame(np.arange(25).reshape(5,5))
    new_order = np.random.permutation(5)
    display(new_order)
    display(df5, df5.take(new_order))
    # Output:
    # array([4, 2, 3, 1, 0])
    # 0 1 2 3 4
    # 0 0 1 2 3 4
    # 1 5 6 7 8 9
    # 2 10 11 12 13 14
    # 3 15 16 17 18 19
    # 4 20 21 22 23 24
    # 0 1 2 3 4
    # 4 20 21 22 23 24
    # 2 10 11 12 13 14
    # 3 15 16 17 18 19
    # 1 5 6 7 8 9
    # 0 0 1 2 3 4

The groupby() function

    import pandas as pd
    df = pd.DataFrame([{'col1':'a', 'col2':1, 'col3':'aa'}, {'col1':'b', 'col2':2, 'col3':'bb'}, {'col1':'c', 'col2':3, 'col3':'cc'}, {'col1':'a', 'col2':44, 'col3':'aa'}])
    display(df)
    # group by col1 and sum col2
    display(df.groupby(by='col1').agg({'col2':sum}).reset_index())
    # group by col1 and take the max and min of col2
    display(df.groupby(by='col1').agg({'col2':['max', 'min']}).reset_index())
    # group by col1 and col3 and sum col2
    display(df.groupby(by=['col1', 'col3']).agg({'col2':sum}).reset_index())
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from datetime import datetime

    '''
    Grouping with groupby
    '''
    df = pd.DataFrame({'key1':['a','a','b','b','a'],
                       'key2':['one','two','one','two','one'],
                       'data1':np.arange(5),
                       'data2':np.arange(5)})
    print(df)
    # key1 key2 data1 data2
    # 0 a one 0 0
    # 1 a two 1 1
    # 2 b one 2 2
    # 3 b two 3 3
    # 4 a one 4 4

    '''
    Computing on the groups
    '''
    # group by key1 and compute the mean of data1
    grouped = df['data1'].groupby(df['key1'])
    print(grouped.mean())
    # a 1.666667
    # b 2.500000

    # group by key1 and key2 and compute the mean of data1
    groupedmean = df['data1'].groupby([df['key1'], df['key2']]).mean()
    print(groupedmean)
    # key1 key2
    # a one 2
    # two 1
    # b one 2
    # two 3

    # unstack: pivot the inner index level into columns
    print(groupedmean.unstack())
    # key2 one two
    # key1
    # a 2 1
    # b 2 3

    df['key1']  # selecting a single column returns a Series
    # the groupby key can also be a Series or an array
    states = np.array(['Oh','Ca','Ca','Oh','Oh'])
    years = np.array([2005,2005,2006,2005,2006])
    print(df['data1'].groupby([states, years]).mean())
    # Ca 2005 1.0
    # 2006 2.0
    # Oh 2005 1.5
    # 2006 4.0

    # column names can be passed directly as group keys; non-numeric columns are excluded from the result automatically
    print(df.groupby('key1').mean())
    # data1 data2
    # key1
    # a 1.666667 1.666667
    # b 2.500000 2.500000

    # add key2 to the grouping
    print(df.groupby(['key1','key2']).mean())
    # data1 data2
    # key1 key2
    # a one 2 2
    # two 1 1
    # b one 2 2
    # two 3 3

    # size() returns a Series with the number of rows in each group
    print(df.groupby(['key1','key2']).size())
    # key1 key2
    # a one 2
    # two 1
    # b one 1
    # two 1
    '''
    Iterating over the groups
    '''
    # iterate over the a and b groups
    for name, group in df.groupby('key1'):
        print(name)
        print(group)
    # a
    # key1 key2 data1 data2
    # 0 a one 0 0
    # 1 a two 1 1
    # 4 a one 4 4
    # b
    # key1 key2 data1 data2
    # 2 b one 2 2
    # 3 b two 3 3

    # grouping by multiple keys: the group label is a tuple of key values
    for (k1, k2), group in df.groupby(['key1','key2']):
        print((k1, k2))
        print(group)
    # ('a', 'one')
    # key1 key2 data1 data2
    # 0 a one 0 0
    # 4 a one 4 4
    # ('a', 'two')
    # key1 key2 data1 data2
    # 1 a two 1 1
    # ('b', 'one')
    # key1 key2 data1 data2
    # 2 b one 2 2
    # ('b', 'two')
    # key1 key2 data1 data2
    # 3 b two 3 3
    '''
    Selecting a column or a subset of columns returns a grouped Series/DataFrame
    '''
    # indexing a GroupBy object with a column name (or list of names) selects those columns for aggregation
    print(df.groupby(df['key1'])['data1'])  # data1 grouped by key1
    print(df.groupby(['key1'])[['data1','data2']].mean())  # data1 and data2 grouped by key1
    # data1 data2
    # key1
    # a 1.666667 1.666667
    # b 2.500000 2.500000
    print(df.groupby(['key1','key2'])['data1'].mean())
    # key1 key2
    # a one 2
    # two 1
    # b one 2
    # two 3

    '''
    Grouping with functions
    '''
    # to group by the length of the labels, just pass the len function
    # (`people` is assumed to be a DataFrame defined earlier whose labels are short strings)
    print(people.groupby(len, axis=1).sum())
    # 2 3
    # a 30.0 20.0
    # b 23.0 21.0
    # c 26.0 22.0
    # d 42.0 23.0
    # e 46.0 24.0
    # functions can also be mixed with arrays, dicts, lists and Series
    key_list = ['one','one','one','two','two']
    print(people.groupby([len, key_list], axis=1).min())
    # 2 3
    # one two two
    # a 0.0 15.0 20.0
    # b 1.0 16.0 21.0
    # c 2.0 17.0 22.0
    # d 3.0 18.0 23.0
    # e 4.0 19.0 24.0

    '''
    Grouping by index level
    '''
    columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]], names=['cty','tenor'])
    hier_df = pd.DataFrame(np.random.randn(4,5), columns=columns)
    print(hier_df)
    # cty US JP
    # tenor 1 3 5 1 3
    # 0 -1.507729 2.112678 0.841736 -0.158109 -0.645219
    # 1 0.355262 0.765209 -0.287648 1.134998 -0.440188
    # 2 1.049813 0.763482 -0.362013 -0.428725 -0.355601
    # 3 -0.868420 -1.213398 -0.386798 0.137273 0.678293
    # group by a level of the column index
    print(hier_df.groupby(level='cty', axis=1).count())
    # cty JP US
    # 0 2 3
    # 1 2 3
    # 2 2 3
    # 3 2 3

1 Use pd.merge() to attach the results of an aggregation to every row of df.

    d1 = {'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
          'color':['white','white','red','green','white','white','red','green'],
          'weight':np.random.randint(10,size = 8),
          'price':np.random.randint(10,size = 8)}
    df = DataFrame(d1)
    sums = df.groupby('color').sum().add_prefix('total_')
    items = df.groupby('item')[['price','weight']].sum()
    means = items['price']/items['weight']
    means = DataFrame(means, columns=['means_price'])
    df2 = pd.merge(df, sums, left_on='color', right_index=True)
    df3 = pd.merge(df2, means, left_on='item', right_index=True)
    display(df2, df3)
    # Output:
    # color item price weight
    # 0 white luobo 9 2
    # 1 white baicai 5 9
    # 2 red lajiao 5 8
    # 3 green donggua 1 1
    # 4 white luobo 7 4
    # 5 white baicai 8 0
    # 6 red lajiao 6 8
    # 7 green donggua 4 3
    # total_price total_weight
    # color
    # green 5 4
    # red 11 16
    # white 29 15
    # color item price weight total_price total_weight
    # 0 white luobo 9 2 29 15
    # 1 white baicai 5 9 29 15
    # 4 white luobo 7 4 29 15
    # 5 white baicai 8 0 29 15
    # 2 red lajiao 5 8 11 16
    # 6 red lajiao 6 8 11 16
    # 3 green donggua 1 1 5 4
    # 7 green donggua 4 3 5 4

2 The same result can be achieved with transform and apply.

Using transform

    d1 = {'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
          'color':['white','white','red','green','white','white','red','green'],
          'weight':np.random.randint(10,size = 8),
          'price':np.random.randint(10,size = 8)}
    df = DataFrame(d1)
    sum1 = df.groupby('color')[['price','weight']].sum().add_prefix('total_')
    sums2 = df.groupby('color')[['price','weight']].transform(lambda x: x.sum()).add_prefix('total_')
    sums3 = df.groupby('color')[['price','weight']].transform(sum).add_prefix('total_')
    display(df, sum1, sums2, sums3)
    # Output:
    # color item price weight
    # 0 white luobo 7 7
    # 1 white baicai 7 7
    # 2 red lajiao 2 7
    # 3 green donggua 6 6
    # 4 white luobo 1 2
    # 5 white baicai 3 6
    # 6 red lajiao 7 0
    # 7 green donggua 0 2
    # total_price total_weight
    # color
    # green 6 8
    # red 9 7
    # white 18 22
    # total_price total_weight
    # 0 18 22
    # 1 18 22
    # 2 9 7
    # 3 6 8
    # 4 18 22
    # 5 18 22
    # 6 9 7
    # 7 6 8
    # total_price total_weight
    # 0 18 22
    # 1 18 22
    # 2 9 7
    # 3 6 8
    # 4 18 22
    # 5 18 22
    # 6 9 7
    # 7 6 8

Using apply

    def sum_price(x):
        return x.sum()

    sums3 = df.groupby('color')[['price','weight']].apply(lambda x: x.sum()).add_prefix('total_')
    sums4 = df.groupby('color')[['price','weight']].apply(sum_price).add_prefix('total_')
    display(df, sums3, sums4)
    # Output:
    # color item price weight
    # 0 white luobo 4 4
    # 1 white baicai 0 3
    # 2 red lajiao 0 4
    # 3 green donggua 7 5
    # 4 white luobo 3 1
    # 5 white baicai 3 3
    # 6 red lajiao 0 6
    # 7 green donggua 0 7
    # total_price total_weight
    # color
    # green 7 12
    # red 0 10
    # white 10 11
    # total_price total_weight
    # color
    # green 7 12
    # red 0 10
    # white 10 11
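
A minimal sketch (data made up) of the practical difference: transform returns a result aligned with the original rows, while apply (like agg) collapses each group to a single row:

    import pandas as pd

    df = pd.DataFrame({'color': ['white', 'white', 'red'],
                       'price': [1, 2, 3]})
    g = df.groupby('color')['price']
    print(g.transform('sum'))          # one value per original row: 3, 3, 3
    print(g.apply(lambda x: x.sum()))  # one value per group: red -> 3, white -> 3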

Copyright notice: this is an original article by why957, released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://www.cnblogs.com/why957/p/9303780.html