Python Pandas Aggregation Functions

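1 year ago (2024-04-27)
In the section "Python Pandas Window Functions" we focused on window functions. Window functions can be combined with aggregation functions. An aggregation function is an operation that computes the sum, maximum, minimum, or mean of a group of data. This section focuses on how to apply aggregation functions.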

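Applying aggregation functions

First, let's create a DataFrame object and a rolling window over it, so that aggregation functions can then be applied.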

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Window size of 3; min_periods=1 means one observation is enough to produce a result
r = df.rolling(window=3, min_periods=1)
print(r)

Output:

                   A         B         C         D
2020-12-14  0.941621  1.205489  0.473771 -0.348169
2020-12-15 -0.276954  0.076387  0.104194  1.537357
2020-12-16  0.582515  0.481999 -0.652332 -1.893678
2020-12-17 -0.286432  0.923514  0.285255 -0.739378
2020-12-18  2.063422 -0.465873 -0.946809  1.590234
Rolling [window=3,min_periods=1,center=False,axis=0]
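Because the frame above is filled with random numbers, the effect of `window` and `min_periods` is hard to read off the output. The following is a minimal sketch on a small hand-made Series (the values and the names `s`, `partial`, and `strict` are illustrative assumptions, not part of the original example). It contrasts `min_periods=1` with the default for an integer window, where a result appears only once the window is full.

```python
import pandas as pd

# Illustrative data: 1..5 on five consecutive days
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range("2020-12-14", periods=5))

# min_periods=1: a partial window at the start still yields a value
partial = s.rolling(window=3, min_periods=1).sum()

# min_periods omitted: for an integer window it defaults to the window size,
# so the first two positions are NaN
strict = s.rolling(window=3).sum()

print(partial.tolist())  # [1.0, 3.0, 6.0, 9.0, 12.0]
print(strict.tolist())   # [nan, nan, 6.0, 9.0, 12.0]
```

Each value is the sum of the current row and up to two preceding rows, which is the same computation the sections below apply to every column of `df`.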

1) Aggregating the whole DataFrame

You can pass an aggregation function to the rolling object created from the DataFrame, for example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Window size of 3; min_periods=1 means one observation is enough to produce a result
r = df.rolling(window=3, min_periods=1)

# Use aggregate() to compute the rolling sum of every column
print(r.aggregate(np.sum))

Output:

                   A         B         C         D
2020-12-14  0.133713  0.746781  0.499385  0.589799
2020-12-15 -0.777572  0.531269  0.600577 -0.393623
2020-12-16  0.408115 -0.874079  0.584320  0.507580
2020-12-17 -1.033055 -1.185399 -0.546567  2.094643
2020-12-18  0.469394 -1.110549 -0.856245  0.260827

                   A         B         C         D
2020-12-14  0.133713  0.746781  0.499385  0.589799
2020-12-15 -0.643859  1.278050  1.099962  0.196176
2020-12-16 -0.235744  0.403971  1.684281  0.703756
2020-12-17 -1.402513 -1.528209  0.638330  2.208601
2020-12-18 -0.155546 -3.170027 -0.818492  2.863051
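The function does not have to be a NumPy callable. As a hedged sketch, the block below uses deterministic data (an assumption made only so the numbers can be checked by hand) and passes the function by its string name. It also compares the result with the `agg` alias and with the dedicated `sum()` method of the rolling object.

```python
import numpy as np
import pandas as pd

# Deterministic illustrative data: column A holds 0, 4, 8, 12, 16
df = pd.DataFrame(np.arange(20, dtype=float).reshape(5, 4),
                  index=pd.date_range("2020-12-14", periods=5),
                  columns=["A", "B", "C", "D"])
r = df.rolling(window=3, min_periods=1)

by_name = r.aggregate("sum")  # function given by its string name
by_alias = r.agg("sum")       # agg is a shorter alias of aggregate
by_method = r.sum()           # dedicated rolling method

print(by_name["A"].tolist())  # [0.0, 4.0, 12.0, 24.0, 36.0]
print(np.allclose(by_name, by_alias) and np.allclose(by_name, by_method))
```

All three spellings describe the same rolling sum, so the choice between them is mostly a matter of readability.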

2) Aggregating a single column

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])

# Window size of 3; min_periods=1 means one observation is enough to produce a result
r = df.rolling(window=3, min_periods=1)

# Aggregate column A only; the result is a Series
print(r['A'].aggregate(np.sum))

Output:

2020-12-14    1.051501
2020-12-15    1.354574
2020-12-16    0.896335
2020-12-17    0.508470
2020-12-18    2.333732
Freq: D, Name: A, dtype: float64

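3) Aggregating multiple columns

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])

# Window size of 3; min_periods=1 means one observation is enough to produce a result
r = df.rolling(window=3, min_periods=1)

# Aggregate columns A and B; a list of column names selects several columns
print(r[['A', 'B']].aggregate(np.sum))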

Output:

                   A         B
2020-12-14  0.639867 -0.229990
2020-12-15  0.352028  0.257918
2020-12-16  0.637845  2.643628
2020-12-17  0.432715  2.428604
2020-12-18 -1.575766  0.969600

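4) Applying multiple functions to a single column

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])

# Window size of 3; min_periods=1 means one observation is enough to produce a result
r = df.rolling(window=3, min_periods=1)

# Apply both sum and mean to column A; each function becomes one output column
print(r['A'].aggregate([np.sum, np.mean]))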

Output:

                 sum      mean
2020-12-14 -0.469643 -0.469643
2020-12-15 -0.626856 -0.313428
2020-12-16 -1.820226 -0.606742
2020-12-17 -2.007323 -0.669108
2020-12-18 -0.595736 -0.198579

5) Applying multiple functions to multiple columns

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])
r = df.rolling(window=3, min_periods=1)

# Apply both sum and mean to columns A and B
print(r[['A', 'B']].aggregate([np.sum, np.mean]))

Output:

                   A                   B
                 sum      mean       sum      mean
2020-12-14 -1.428882 -1.428882 -0.417241 -0.417241
2020-12-15 -1.315151 -0.657576 -1.580616 -0.790308
2020-12-16 -2.093907 -0.697969 -2.260181 -0.753394
2020-12-17 -1.324490 -0.441497 -1.578467 -0.526156
2020-12-18 -2.400948 -0.800316 -0.452740 -0.150913
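The two header rows in this output are a column MultiIndex: the first level is the source column and the second level is the function name. The sketch below shows one way to pick results back out of such a frame. The data and the names `result`, `a_stats`, and `b_mean` are illustrative assumptions.

```python
import pandas as pd

# Deterministic illustrative data so the values can be checked by hand
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "B": [10.0, 20.0, 30.0, 40.0, 50.0]},
                  index=pd.date_range("2020-12-14", periods=5))

result = df.rolling(window=3, min_periods=1)[["A", "B"]].agg(["sum", "mean"])

# Column labels are (source column, function name) pairs
print(result.columns.tolist())

a_stats = result["A"]           # DataFrame holding the sum and mean of A
b_mean = result[("B", "mean")]  # Series holding only the rolling mean of B
print(b_mean.tolist())
```

Selecting by the first level returns all statistics of one column, while a full `(column, function)` tuple returns a single Series.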

6) Applying different functions to different columns

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 4),
                  index=pd.date_range('12/14/2020', periods=3),
                  columns=['A', 'B', 'C', 'D'])
r = df.rolling(window=3, min_periods=1)

# A dict maps each column name to its function: sum for A, mean for B
print(r.aggregate({'A': np.sum, 'B': np.mean}))

Output:

                   A         B
2020-12-14  0.503535 -1.301423
2020-12-15  0.170056 -0.550289
2020-12-16 -0.086081 -0.140532
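To make the dict form easier to verify, here is a small sketch with hand-picked values (the data, the extra column `C`, and the name `mixed` are assumptions for illustration). It uses string function names and shows that only the columns named in the dict appear in the result.

```python
import pandas as pd

# Illustrative data; column C is deliberately left out of the dict below
df = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                   "B": [10.0, 20.0, 60.0],
                   "C": [0.0, 0.0, 0.0]},
                  index=pd.date_range("2020-12-14", periods=3))
r = df.rolling(window=3, min_periods=1)

# Rolling sum for A, rolling mean for B
mixed = r.agg({"A": "sum", "B": "mean"})

print(list(mixed.columns))  # ['A', 'B']
print(mixed["A"].tolist())  # [1.0, 3.0, 6.0]
print(mixed["B"].tolist())  # [10.0, 15.0, 30.0]
```

Column A accumulates 1, 1+2, 1+2+3, while column B averages 10, then (10+20)/2, then (10+20+60)/3.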