Python - Pandas

These notes are based on reference [1].

Importing pandas

import pandas as pd

Miscellaneous

  1. read_csv() assumes a header row by default: the first line is used for the column names and data is read from the second line onward. If the file is pure data with no header row, read it with read_csv(header=None) (see the sketch after this list).

  2. When bad lines exist in the file:

    ERROR : pandas.errors.ParserError: Error tokenizing data. C error: Expected 1024 fields in line 237, saw 1491
    data = pd.read_csv('file1.csv', error_bad_lines=False)  # in pandas >= 1.3 use on_bad_lines='skip' instead
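A minimal sketch of the header=None case from item 1, assuming a hypothetical headerless file data.csv:

import pandas as pd

# data.csv is a hypothetical file that contains only data rows, no header line
df = pd.read_csv('data.csv', header=None)            # columns are auto-numbered 0, 1, 2, ...
df = pd.read_csv('data.csv', header=None,
                 names=['col_a', 'col_b', 'col_c'])  # optionally supply column names yourself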

Data Structures

Pandas handles the following three data structures: Series, DataFrame and Panel.

A Series is a one-dimensional array structure with homogeneous data; its size is immutable, but its data is mutable.

A DataFrame is a two-dimensional array with heterogeneous data; both its size and its data are mutable.

A Panel is a three-dimensional data structure with heterogeneous data; both its size and its data are mutable.

Series

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively called the index.

#A Pandas Series can be created with the following constructor
pandas.Series( data, index, dtype, copy)
# data - the data, in various forms such as ndarray, list, constants
# index - index values; they must be unique and hashable, with the same length as the data. Defaults to np.arange(n) if no index is passed.
# dtype - the data type. If None, the dtype is inferred.
# copy - copy the data; defaults to False

Creating a Series

Creating an empty Series

import pandas as pd
s = pd.Series()
print(s)#Series([], dtype: float64)

Creating a Series from an ndarray

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s) # the index starts at 0
#0 a
#1 b
#2 c
#3 d
#dtype: object
s = pd.Series(data,index=[100,101,102,103])
print(s) # custom index
#100 a
#101 b
#102 c
#103 d
#dtype: object

Creating a Series from a dict

# A dict can be passed as input. If no index is specified, the dictionary keys are taken (in sorted order in older pandas; newer versions preserve insertion order) to construct the index. If an index is passed, the values in the data corresponding to the labels in the index are pulled out.
import pandas as pd
import numpy as np
data = {'a':0,'b':1,'c':2}
s = pd.Series(data)
print(s)
# a 0
# b 1
# c 2
# dtype: int64
s = pd.Series(data,index=['b','def1','a','def2','c'])
print(s) # the index order is preserved; missing elements are filled with NaN (Not a Number)
# b 1.0
# def1 NaN
# a 0.0
# def2 NaN
# c 2.0
# dtype: float64

Creating a Series from a scalar

# If the data is a scalar value, an index must be provided. The value is repeated to match the length of the index.
import pandas as pd
import numpy as np
s = pd.Series(5,index=[0,1,2,3])
print(s)
# 0 5
# 1 5
# 2 5
# 3 5
# dtype: int64

Accessing a Series

Access by position (index)

import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
print(s[0]) # prints 1
print(s[:3]) # retrieve the first three elements of the series
# a 1
# b 2
# c 3
print(s[-3:]) # retrieve the last three elements
# c 3
# d 4
# e 5

Retrieval by label

import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
print(s['b']) # prints 2
# retrieve multiple elements using a list of index label values
print(s[['a','c','e']])
# a 1
# c 3
# e 5

DataFrame

A DataFrame is a two-dimensional data structure: data is arranged in a tabular fashion in rows and columns.

#A pandas DataFrame can be created with the following constructor
pandas.DataFrame( data, index, columns, dtype, copy)
# data - the data, in various forms such as ndarray, series, map, lists, dict, constants or another DataFrame
# index - row labels; defaults to np.arange(n) if no index is passed
# columns - column labels; defaults to np.arange(n) if no columns are passed
# dtype - the data type of each column
# copy - copy the data; defaults to False

Creating a DataFrame

import pandas as pd
df = pd.DataFrame()

Creating from lists

A DataFrame can be created from a single list or a list of lists.

import pandas as pd
import numpy as np
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
# 0
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
data = [['Alex',10],['Bob',12],['Dan',13]]
df = pd.DataFrame(data,columns=['Name','Age']) # passing dtype=float would cast the Age column to float
print(df)
# Name Age
# 0 Alex 10
# 1 Bob 12
# 2 Dan 13

Creating from a dict of ndarrays / lists

All of the ndarrays must have the same length. If an index is passed, the length of the index must equal the length of the arrays.

If no index is passed, the index defaults to range(n), where n is the array length.

import pandas as pd
import numpy as np
data = {'Name':['Tom','Jack','Steve','Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df) # note the row labels 0,1,2,3: they are the default index assigned using range(n)
# Age Name
# 0 28 Tom
# 1 34 Jack
# 2 29 Steve
# 3 42 Ricky
df = pd.DataFrame(data,index=['rank1','rank2','rank3','rank4'])
print(df) # the index parameter assigns a label to each row
# Age Name
# rank1 28 Tom
# rank2 34 Jack
# rank3 29 Steve
# rank4 42 Ricky

Creating from a list of dicts

A list of dicts can be passed as input data to create a DataFrame; the dict keys are taken as the column names by default.

import pandas as pd
import numpy as np
data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]
df = pd.DataFrame(data)
print(df)
# a b c
# 0 1 2 NaN
# 1 5 10 20.0
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
# a b
# first 1 2
# second 5 10
print(df2)
# a b1
# first 1 NaN
# second 5 NaN

Creating from a dict of Series

A dict of Series can be passed to form a DataFrame. The resulting index is the union of the indexes of all the passed Series.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
# one two
# a 1.0 1
# b 2.0 2
# c 3.0 3
# d NaN 4

Accessing a DataFrame

Column selection

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df ['one'])

Column addition

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by passing new series
print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
# Adding a new column by passing as Series:
# one two three
# a 1.0 1 10.0
# b 2.0 2 20.0
# c 3.0 3 30.0
# d NaN 4 NaN
print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)
# Adding a new column using the existing columns in DataFrame:
# one two three four
# a 1.0 1 10.0 11.0
# b 2.0 2 20.0 22.0
# c 3.0 3 30.0 33.0
# d NaN 4 NaN NaN

Column deletion

Columns can be deleted or popped.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)
# Our dataframe is:
# one three two
# a 1.0 10.0 1
# b 2.0 20.0 2
# c 3.0 30.0 3
# d NaN NaN 4
# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)
# Deleting the first column using DEL function:
# three two
# a 10.0 1
# b 20.0 2
# c 30.0 3
# d NaN 4
# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)
# Deleting another column using POP function:
# three
# a 10.0
# b 20.0
# c 30.0
# d NaN

Row selection

# rows can be selected by passing a row label to loc
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
# one 2.0
# two 2.0
# Name: b, dtype: float64
# rows can be selected by passing an integer position to iloc
print(df.iloc[2])
# one 3.0
# two 3.0
# Name: c, dtype: float64

Row slicing

Multiple rows can be selected using the : operator.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
# one two
# c 3.0 3
# d NaN 4

Appending rows

New rows can be added to a DataFrame with the append() function. (Note: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; pd.concat() is the recommended replacement.)

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
print(df)
# a b
# 0 1 2
# 1 3 4
# 0 5 6
# 1 7 8

Dropping rows

Rows can be dropped from a DataFrame using their index label. If labels are duplicated, multiple rows are dropped.

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
# Drop rows with label 0
df = df.drop(0)
print(df)
# a b
# 1 3 4
# 1 7 8

Panel

(Panel, the 3-D container, was deprecated in pandas 0.20 and removed in 0.25; 3-D data is now usually represented with a MultiIndex DataFrame or the xarray package, as sketched below.)
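A minimal sketch, not part of the original notes and using made-up item labels, of holding what a Panel would have held in a MultiIndex DataFrame:

import pandas as pd
import numpy as np

# two "items", each a 3x2 block of data -- roughly what Panel(items, major_axis, minor_axis) held
index = pd.MultiIndex.from_product([['item1', 'item2'], range(3)], names=['item', 'major'])
df = pd.DataFrame(np.random.randn(6, 2), index=index, columns=['minor1', 'minor2'])
print(df.loc['item1'])  # selecting one "item" gives an ordinary 2-D DataFrame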

Pandas Basic Functionality

Series basic functionality


import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))

axes

print("This axes are:")
print(s.axes)
# This axes are:
# [RangeIndex(start=0, stop=4, step=1)]

empty

Returns a Boolean value indicating whether the object is empty; True means the object is empty.

print("Is the Object empty?")
print(s.empty)
# Is the Object empty?
# False

ndim

Returns the number of dimensions of the object. By definition, a Series is a 1-D data structure.

print ("The dimensions of the object:")
print(s.ndim)
# The dimensions of the object:
# 1

size

Returns the size (length) of the Series.

print ("The size of the object:")
print (s.size)
# The size of the object:
# 2

values

print ("The actual data series is:")
print (s.values)
# The actual data series is:
# [-0.15974783 0.75598442 -0.05964617 -0.91200952]

head-tail

head() returns the first n rows (observing the index values). The default number of elements to display is five, but a custom number can be passed.

tail() returns the last n rows (observing the index values). The default number of elements to display is five, but a custom number can be passed (a tail() sketch follows the head() example below).

print ("The original series is:")
print (s)
# The original series is:
# 0 0.209476
# 1 0.136436
# 2 -1.387964
# 3 -0.102560
print ("The first two rows of the data series:")
print (s.head(2))
# The first two rows of the data series:
# 0 0.209476
# 1 0.136436
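The text above also mentions tail(); a minimal sketch on the same series (the values shown simply repeat the random output printed above):

print ("The last two rows of the data series:")
print (s.tail(2))
# The last two rows of the data series:
# 2 -1.387964
# 3 -0.102560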

DataFrame basic functionality


Pandas Descriptive Statistics


  • Functions like sum() and cumsum() work with both numeric and character (string) data elements without raising errors. Character aggregation is rarely used in practice, but these functions do not raise exceptions.
  • Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data.

Create a DataFrame

import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
# Age Name Rating
# 0 25 Tom 4.23
# 1 26 James 3.24
# 2 25 Ricky 3.98
# 3 23 Vin 2.56
# 4 30 Steve 3.20
# 5 29 Minsu 4.60
# 6 23 Jack 3.80
# 7 34 Lee 3.78
# 8 40 David 2.98
# 9 30 Gasper 4.80
# 10 51 Betina 4.10
# 11 46 Andres 3.65

sum()

Returns the sum of the values for the requested axis. By default the axis is the index (axis=0), i.e., the sum is taken over each column.

print(df.sum())
# Age 382
# Name TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
# Rating 44.92
# dtype: object

With axis=1, the sum is taken over each row:

print(df.sum(1))
# 0 29.23
# 1 29.24
# 2 28.98
# 3 25.56
# 4 33.20
# 5 33.60
# 6 26.80
# 7 37.78
# 8 42.98
# 9 34.80
# 10 55.10
# 11 49.65

mean()

Used in the same way as sum(): by default it computes the mean of each column; mean(1) computes the mean of each row.

std()

Returns the Bessel-corrected (sample) standard deviation of the numeric columns (a short sketch of mean() and std() follows).
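Neither mean() nor std() is demonstrated in the notes; a minimal sketch on the same df (the numbers follow from the Age/Rating data shown above):

print(df['Age'].mean())                            # mean of a single column
# 31.833333...
print(df[['Age','Rating']].mean(axis=1).head(3))   # row-wise mean of the numeric columns
# 0 14.615
# 1 14.620
# 2 14.490
# dtype: float64
print(df['Rating'].std())                          # Bessel-corrected (sample) standard deviation
# 0.661628 (rounded)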

Summarizing data

The describe() function computes a summary of statistics for the DataFrame columns.

print(df.describe())
# Age Rating
# count 12.000000 12.000000
# mean 31.833333 3.743333
# std 9.232682 0.661628
# min 23.000000 2.560000
# 25% 25.000000 3.230000
# 50% 29.500000 3.790000
# 75% 35.500000 4.132500
# max 51.000000 4.800000

The include parameter tells describe() which kinds of columns to consider in the summary. It takes a list of values; by default only numeric columns are summarized.

  • object - summarize string columns
  • number - summarize numeric columns
  • all - summarize all columns together (this one should not be passed as a list value)
print(df.describe(include=['object']))
# Name
# count 12
# unique 12
# top James
# freq 1

Pandas Function Application

There are three ways to apply your own functions or another library's functions to Pandas objects. The appropriate method depends on whether the function expects to operate on the whole DataFrame, row- or column-wise, or element-wise.

  • pipe() : table-wise function application
  • apply() : row- or column-wise function application
  • applymap() : element-wise function application

pipe(): table-wise function application

Custom operations can be performed by passing a function and the appropriate number of arguments as pipe arguments; the operation is then performed on the whole DataFrame. For example, add the value 2 to every element of the DataFrame:

import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder,2)
print (df)
# col1 col2 col3
# 0 0.298525 -0.950629 0.164956
# 1 0.518124 0.952160 0.882564
# 2 1.215512 2.330336 -1.078768
# 3 -0.672469 0.139257 0.871575
# 4 -1.038358 1.132721 -0.705976
print(df.pipe(adder,2)) # add 2 to every element
# col1 col2 col3
# 0 2.298525 1.049371 2.164956
# 1 2.518124 2.952160 2.882564
# 2 3.215512 4.330336 0.921232
# 3 1.327531 2.139257 2.871575
# 4 0.961642 3.132721 1.294024

apply(): row- or column-wise function application

Arbitrary functions can be applied along an axis of a DataFrame or Panel using apply(), which, like the descriptive statistics methods, takes an optional axis argument. By default the operation is performed column by column, taking each column as an array-like; with axis=1 it is performed row by row.

import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print (df)
# col1 col2 col3
# 0 0.080134 -0.848024 -0.801573
# 1 -1.464748 -0.257665 -0.799735
# 2 0.260537 -0.944930 -0.119062
# 3 2.061183 -0.904605 0.099470
# 4 0.114762 -0.927484 0.561150
print(df.apply(np.mean)) #print(df.apply(np.mean,axis=1))
# col1 0.210374
# col2 -0.776542
# col3 -0.211950
import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print (df)
# col1 col2 col3
# 0 0.869596 0.284294 1.302871
# 1 0.761678 1.430786 -0.708223
# 2 -0.020315 0.643858 1.274853
# 3 0.124935 0.165798 1.003215
# 4 -0.201225 1.139868 1.455647
print(df.apply(lambda x:x.max()-x.min())) # lambda x: ... defines an anonymous function
# col1 1.070822
# col2 1.264987
# col3 2.163870
print (df) # apply does not modify df's data
# col1 col2 col3
# 0 0.869596 0.284294 1.302871
# 1 0.761678 1.430786 -0.708223
# 2 -0.020315 0.643858 1.274853
# 3 0.124935 0.165798 1.003215
# 4 -0.201225 1.139868 1.455647

applymap(): element-wise function application

Not all functions can be vectorized (i.e., accept a NumPy array and return another array or a single value), so applymap() on a DataFrame and, analogously, map() on a Series accept any Python function that takes a single value and returns a single value.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# My custom function
print(df)
# col1 col2 col3
# 0 -0.645335 -0.014233 -0.228133
# 1 0.343590 1.726889 -0.303263
# 2 -0.179465 0.859529 0.736120
# 3 -0.459130 -0.487665 -0.078175
# 4 0.256705 1.299887 0.151205
print(df['col1'].map(lambda x:x*100)) # operating on a single column (a Series), so map is used
# 0 -64.533532
# 1 34.359000
# 2 -17.946528
# 3 -45.913013
# 4 25.670518
print(df.applymap(lambda x:x*100))
# col1 col2 col3
# 0 -69.502940 -16.032075 -55.610105
# 1 28.380491 23.122472 102.260899
# 2 -71.143064 -12.581149 145.453437
# 3 145.381364 -128.505845 -58.202885
# 4 10.271497 -26.350430 56.805339

Pandas Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as -

  • Reordering the existing data to match a new set of labels.
  • Inserting missing-value (NA) markers in label locations where no data existed for that label.
import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
'x': np.linspace(0,stop=N-1,num=N),
'y': np.random.rand(N),
'C': np.random.choice(['Low','Medium','High'],N).tolist(),
'D': np.random.normal(100, 10, size=(N)).tolist()
})
print(df)
# A C D x y
# 0 2016-01-01 High 106.422798 0.0 0.670789
# 1 2016-01-02 Medium 90.168862 1.0 0.868026
# 2 2016-01-03 High 96.854027 2.0 0.974715
# 3 2016-01-04 High 98.622810 3.0 0.944312
# 4 2016-01-05 Medium 105.046802 4.0 0.922507
# 5 2016-01-06 Low 115.478319 5.0 0.091310
# 6 2016-01-07 Low 91.814775 6.0 0.397314
# 7 2016-01-08 High 95.030925 7.0 0.545544
# 8 2016-01-09 Medium 102.620742 8.0 0.214454
# 9 2016-01-10 Medium 102.370108 9.0 0.933056
# 10 2016-01-11 Medium 91.638636 10.0 0.464718
# 11 2016-01-12 Medium 83.425189 11.0 0.267500
# 12 2016-01-13 Medium 113.570416 12.0 0.899810
# 13 2016-01-14 Low 129.166525 13.0 0.818797
# 14 2016-01-15 Medium 105.074786 14.0 0.786602
# 15 2016-01-16 High 102.818441 15.0 0.451534
# 16 2016-01-17 Medium 95.567230 16.0 0.739450
# 17 2016-01-18 High 112.581924 17.0 0.550433
# 18 2016-01-19 High 114.526018 18.0 0.993903
# 19 2016-01-20 Low 111.224149 19.0 0.328954
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
print (df_reindexed)
# A C B
# 0 2016-01-01 High NaN
# 2 2016-01-03 High NaN
# 5 2016-01-06 Low NaN

Reindexing to align with other objects

You may sometimes wish to take an object and reindex its axes to be labeled the same as another object.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
print(df1)
# col1 col2 col3
# 0 1.690225 -0.237625 0.682889
# 1 0.529209 0.531100 -1.973715
# 2 1.363789 -0.070481 1.149173
# 3 0.963786 0.378348 0.543242
# 4 -1.181315 0.453184 1.335767
# 5 -1.260020 -0.042963 0.210953
# 6 -0.218576 -1.195482 -1.600476
# 7 0.347295 0.677118 -2.225262
# 8 1.349135 0.672671 1.647481
# 9 2.080021 1.322867 1.502295
print(df2)
# col1 col2 col3
# 0 -0.500110 1.379281 -0.323381
# 1 -0.498182 0.365029 0.161133
# 2 -0.598007 -0.010500 -1.304982
# 3 0.701804 -0.097031 -0.770933
# 4 1.406598 -0.765083 0.557169
# 5 0.171932 0.518117 -0.263640
# 6 0.792754 0.424461 0.602631
df1 = df1.reindex_like(df2)
print (df1)
# col1 col2 col3
# 0 1.690225 -0.237625 0.682889
# 1 0.529209 0.531100 -1.973715
# 2 1.363789 -0.070481 1.149173
# 3 0.963786 0.378348 0.543242
# 4 -1.181315 0.453184 1.335767
# 5 -1.260020 -0.042963 0.210953
# 6 -0.218576 -1.195482 -1.600476

Note - here the df1 DataFrame is altered and reindexed like df2. The column names must match; otherwise NaN is added for the entire column label.

Filling while reindexing

reindex() takes an optional method parameter, a filling method, with the following values:

  • pad/ffill - fill values forward
  • bfill/backfill - fill values backward
  • nearest - fill from the nearest index value
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print (df2.reindex_like(df1))
# col1 col2 col3
# 0 1.598405 1.451217 1.383048
# 1 -0.695038 -1.105388 -0.774673
# 2 NaN NaN NaN
# 3 NaN NaN NaN
# 4 NaN NaN NaN
# 5 NaN NaN NaN
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))
# Data Frame with Forward Fill:
# col1 col2 col3
# 0 1.598405 1.451217 1.383048
# 1 -0.695038 -1.105388 -0.774673
# 2 -0.695038 -1.105388 -0.774673
# 3 -0.695038 -1.105388 -0.774673
# 4 -0.695038 -1.105388 -0.774673
# 5 -0.695038 -1.105388 -0.774673

Note - the last four rows are filled.

Limit on filling while reindexing

The limit parameter provides additional control over filling while reindexing: it specifies the maximum count of consecutive matches to fill.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print (df2.reindex_like(df1))
# col1 col2 col3
# 0 0.424747 -0.060944 -1.374431
# 1 0.897544 0.922728 -0.951170
# 2 NaN NaN NaN
# 3 NaN NaN NaN
# 4 NaN NaN NaN
# 5 NaN NaN NaN
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limiting to 1:")
print (df2.reindex_like(df1,method='ffill',limit=1))
# Data Frame with Forward Fill limiting to 1:
# col1 col2 col3
# 0 0.424747 -0.060944 -1.374431
# 1 0.897544 0.922728 -0.951170
# 2 0.897544 0.922728 -0.951170
# 3 NaN NaN NaN
# 4 NaN NaN NaN
# 5 NaN NaN NaN

Note - only row 2 is filled from the preceding row; the remaining rows are left as-is.

Renaming

The rename() method allows relabeling an axis based on some mapping (a dict or a Series) or an arbitrary function.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)
# col1 col2 col3
# 0 -0.162633 0.487238 0.376467
# 1 -0.370538 -1.132116 0.320140
# 2 -1.611416 1.934140 0.223959
# 3 0.220631 0.303568 -1.636330
# 4 1.412771 -0.428163 -1.034903
# 5 -0.804933 -1.000358 0.886100
print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))
# After renaming the rows and columns:
# c1 c2 col3
# apple -0.162633 0.487238 0.376467
# banana -0.370538 -1.132116 0.320140
# durian -1.611416 1.934140 0.223959
# 3 0.220631 0.303568 -1.636330
# 4 1.412771 -0.428163 -1.034903
# 5 -0.804933 -1.000358 0.886100

Pandas Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the object.

In short, basic iteration (for i in object) produces - (a short Series sketch follows this list)

  • Series - the values
  • DataFrame - the column labels
  • Panel - the item labels
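A minimal sketch of the Series case, which is not otherwise demonstrated (the DataFrame case is shown below):

import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
for value in s:   # basic iteration over a Series yields the values, not the labels
    print(value)
# 10
# 20
# 30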

DataFrame iteration

To iterate over the rows of a DataFrame, the following functions can be used -

  • iteritems() - iterate over (key, value) pairs
  • iterrows() - iterate over the rows as (index, Series) pairs
  • itertuples() - iterate over the rows as namedtuples

Note - do not try to modify any object while iterating over it. Iteration is meant for reading; the iterator returns a copy of the original object (a view), so changes will not be reflected in the original object.

import pandas as pd
import numpy as np
N=5
df = pd.DataFrame({
'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
'x': np.linspace(0,stop=N-1,num=N),
'y': np.random.rand(N),
'C': np.random.choice(['Low','Medium','High'],N).tolist(),
'D': np.random.normal(100, 10, size=(N)).tolist()
})
print(df)
# A C D x y
# 0 2016-01-01 Medium 68.074794 0.0 0.155689
# 1 2016-01-02 Low 102.783422 1.0 0.499028
# 2 2016-01-03 Low 99.767849 2.0 0.072982
# 3 2016-01-04 Low 121.020377 3.0 0.912389
# 4 2016-01-05 Medium 108.772650 4.0 0.624642
for col in df:
    print (col)
# A
# C
# D
# x
# y

iteritems()

Iterates over each column as a key-value pair, with the column label as the key and the column values as a Series object. (Note: iteritems() was deprecated in favor of items() in recent pandas versions.)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print(df)
# col1 col2 col3
# 0 -0.087990 -0.049726 0.572913
# 1 0.859409 0.695247 -1.382812
# 2 0.275735 -0.528393 -0.814816
# 3 0.020638 -0.448942 -1.757958
for key,value in df.iteritems():
    print (key,"---",value)
# col1 --- 0 -0.087990
# 1 0.859409
# 2 0.275735
# 3 0.020638
# Name: col1, dtype: float64
# col2 --- 0 -0.049726
# 1 0.695247
# 2 -0.528393
# 3 -0.448942
# Name: col2, dtype: float64
# col3 --- 0 0.572913
# 1 -1.382812
# 2 -0.814816
# 3 -1.757958
# Name: col3, dtype: float64

iterrows()

iterrows() returns an iterator yielding each index value along with a Series containing the data of that row.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
print(df)
# col1 col2 col3
# 0 0.674005 0.042299 0.211592
# 1 1.080131 1.020547 -1.596309
# 2 -0.717016 -0.323693 0.697079
# 3 0.158146 0.936253 -0.103222
for row_index,row in df.iterrows():
    print (row_index,"---",row)
# 0 --- col1 0.674005
# col2 0.042299
# col3 0.211592
# Name: 0, dtype: float64
# 1 --- col1 1.080131
# col2 1.020547
# col3 -1.596309
# Name: 1, dtype: float64
# 2 --- col1 -0.717016
# col2 -0.323693
# col3 0.697079
# Name: 2, dtype: float64
# 3 --- col1 0.158146
# col2 0.936253
# col3 -0.103222
# Name: 3, dtype: float64

itertuples()

itertuples() returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row's index value, and the remaining values are the row values.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
print(df)
# col1 col2 col3
# 0 -3.525917 0.557613 -0.740162
# 1 -0.672476 -0.932059 0.638660
# 2 -0.138301 0.783534 -0.699881
# 3 -1.003695 -0.485473 -0.651522
for row in df.itertuples():
    print (row)
# Pandas(Index=0, col1=-3.525917244517192, col2=0.5576131753319682, col3=-0.7401618735482571)
# Pandas(Index=1, col1=-0.6724761731790204, col2=-0.9320585818750085, col3=0.6386597998390104)
# Pandas(Index=2, col1=-0.13830132314512975, col2=0.7835337994714413, col3=-0.6998806104325761)
# Pandas(Index=3, col1=-1.003694634971743, col2=-0.4854728772237874, col3=-0.651522414077405)

Pandas Sorting

Pandas supports two kinds of sorting -

  • by label
  • by actual value

Sorting by label

Using the sort_index() method, a DataFrame can be sorted by passing the axis argument and the sort order. By default, row labels are sorted in ascending order.

Passing axis=1 instead of the default axis=0 sorts the column labels (a short axis=1 sketch follows the example below).

import pandas as pd
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)
# col2 col1
# 1 1.013623 0.889024
# 4 1.058798 -0.076676
# 6 -0.277161 0.665921
# 2 0.019196 0.636835
# 3 0.776214 -0.178358
# 5 0.112524 3.321190
# 9 -1.172484 -1.609542
# 8 0.081354 0.878184
# 0 -1.336118 -0.085982
# 7 1.398648 0.750015
sorted_df=unsorted_df.sort_index() # sorted_df = unsorted_df.sort_index(ascending=False) sorts in descending order
print (sorted_df)
# col2 col1
# 0 -1.336118 -0.085982
# 1 1.013623 0.889024
# 2 0.019196 0.636835
# 3 0.776214 -0.178358
# 4 1.058798 -0.076676
# 5 0.112524 3.321190
# 6 -0.277161 0.665921
# 7 1.398648 0.750015
# 8 0.081354 0.878184
# 9 -1.172484 -1.609542
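The axis=1 (column-label) case mentioned above is not demonstrated; a minimal sketch on the same unsorted_df:

print (unsorted_df.sort_index(axis=1))  # sorts the column labels; row order is unchanged
# col1 col2
# 1 0.889024 1.013623
# 4 -0.076676 1.058798
# ...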

Sorting by value

Like index sorting, sort_values() is the method for sorting by values. It accepts a by argument naming the column of the DataFrame whose values are to be used for sorting.

import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
print(unsorted_df)
# col1 col2
# 0 2 1
# 1 1 3
# 2 1 2
# 3 1 4
sorted_df = unsorted_df.sort_values(by='col1')
print (sorted_df) # col1 is sorted; the corresponding col2 values and the row index move along with it
# col1 col2
# 1 1 3
# 2 1 2
# 3 1 4
# 0 2 1

Multiple columns can be passed to the by parameter:

sorted_df = unsorted_df.sort_values(by=['col1','col2'])
print (sorted_df)
# col1 col2
# 2 1 2
# 1 1 3
# 3 1 4
# 0 2 1

col1 is sorted first, making that column ordered, while col2 moves along with it to 3, 2, 4, 1. Then, within rows that share the same col1 value, col2 is sorted, i.e. 3, 2, 4 becomes 2, 3, 4, and col1 remains ordered.

Pandas Indexing and Selecting Data


.loc()

.loc() supports several access methods, such as -

  • a single scalar label
  • a list of labels
  • a slice object
  • a boolean array

loc takes two single/list/range operators separated by ','. The first indicates the row(s) and the second the column(s).

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# a 1.411390 0.307114 -0.988168 -1.757327
# b 1.723277 -0.494447 1.372176 0.486670
# c 1.511763 0.132625 -1.242458 0.457793
# d 0.709051 0.361528 0.366901 -0.565920
# e -0.030279 1.160672 0.275871 -1.063416
# f -0.265881 -0.087183 0.313044 1.348987
# g 1.049137 -1.490819 1.085902 0.857283
# h -2.894211 0.086263 0.474841 -0.876164
#select all rows for a specific column
print (df.loc[:,'A'])
# a 1.411390
# b 1.723277
# c 1.511763
# d 0.709051
# e -0.030279
# f -0.265881
# g 1.049137
# h -2.894211
# Select all rows for multiple columns, say list[]
print (df.loc[:,['A','C']])
# A C
# a 1.411390 -0.988168
# b 1.723277 1.372176
# c 1.511763 -1.242458
# d 0.709051 0.366901
# e -0.030279 0.275871
# f -0.265881 0.313044
# g 1.049137 1.085902
# h -2.894211 0.474841
# Select few rows for multiple columns, say list[]
print (df.loc[['a','b','f','h'],['A','C']])
# A C
# a 1.411390 -0.988168
# b 1.723277 1.372176
# f -0.265881 0.313044
# h -2.894211 0.474841
# Select range of rows for all columns
print (df.loc['a':'c'])
# A B C D
# a 1.411390 0.307114 -0.988168 -1.757327
# b 1.723277 -0.494447 1.372176 0.486670
# c 1.511763 0.132625 -1.242458 0.457793
# for getting values with a boolean array
print (df.loc['a']>0) # select row 'a' and test whether each value in that row is greater than 0
# A True
# B True
# C False
# D False

.iloc()

Pandas provides various methods for purely integer-based indexing. Like Python and NumPy, positions are 0-based.

The various access methods are as follows -

  • an integer
  • a list of integers
  • a range of values
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 0 -1.097794 -0.343903 -0.545227 0.693677
# 1 0.748052 -0.800368 0.391323 0.079908
# 2 1.395456 -0.622070 0.431188 0.321310
# 3 -0.133916 0.723562 -2.708785 1.397269
# 4 0.998216 0.229914 1.551281 -0.279701
# 5 -0.747833 -0.557234 -0.309676 -0.222850
# 6 1.034332 0.240854 0.730528 -0.825282
# 7 -0.095764 -0.899946 -0.616187 2.121193
# Integer slicing: the first four rows
print (df.iloc[:4])
# A B C D
# 0 -1.097794 -0.343903 -0.545227 0.693677
# 1 0.748052 -0.800368 0.391323 0.079908
# 2 1.395456 -0.622070 0.431188 0.321310
# 3 -0.133916 0.723562 -2.708785 1.397269
print (df.iloc[1:5, 2:4])
# C D
# 1 0.391323 0.079908
# 2 0.431188 0.321310
# 3 -2.708785 1.397269
# 4 1.551281 -0.279701
# Slicing through list of values
print (df.iloc[[1, 3, 5], [1, 3]])
# B D
# 1 -0.800368 0.079908
# 3 0.723562 1.397269
# 5 -0.557234 -0.222850
print (df.iloc[1:3, :])
# A B C D
# 1 0.748052 -0.800368 0.391323 0.079908
# 2 1.395456 -0.622070 0.431188 0.321310
print (df.iloc[:,1:3])
# B C
# 0 -0.343903 -0.545227
# 1 -0.800368 0.391323
# 2 -0.622070 0.431188
# 3 0.723562 -2.708785
# 4 0.229914 1.551281
# 5 -0.557234 -0.309676
# 6 0.240854 0.730528
# 7 -0.899946 -0.616187

.ix()

In addition to pure label-based and integer-based selection, Pandas provided a hybrid method for selecting and subsetting objects, the .ix() operator. (Note: .ix was deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead, as sketched after the example.)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 0 1.992586 -1.018359 -0.726185 -0.602579
# 1 0.112121 0.634097 -1.000867 -0.196700
# 2 -1.727799 0.033016 -0.250457 -0.009763
# 3 -0.253342 0.512558 -0.284954 -0.775973
print (df.ix[:2]) # the first three rows (label slice, inclusive)
# A B C D
# 0 1.992586 -1.018359 -0.726185 -0.602579
# 1 0.112121 0.634097 -1.000867 -0.196700
# 2 -1.727799 0.033016 -0.250457 -0.009763
print (df.ix[:,'A']) # column 'A'
# 0 1.992586
# 1 0.112121
# 2 -1.727799
# 3 -0.253342
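Since .ix no longer exists in current pandas, a minimal sketch of the equivalent calls on the same df using .loc/.iloc:

print (df.loc[:2])      # label-based: rows with labels 0..2 inclusive, same as df.ix[:2] here
print (df.iloc[:3])     # position-based: the first three rows
print (df.loc[:, 'A'])  # column 'A', same as df.ix[:, 'A']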

Multi-axis indexing

Using the basic indexing operator []


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 0 -0.118619 -0.180957 1.119985 0.786177
# 1 1.948019 2.303557 -1.179559 -0.068304
# 2 2.138096 1.280755 2.486576 -0.089437
# 3 -0.998829 0.371025 0.644868 0.166721
print(df['A'])
# 0 -0.118619
# 1 1.948019
# 2 2.138096
# 3 -0.998829
print (df[['A','B']])
# A B
# 0 -0.118619 -0.180957
# 1 1.948019 2.303557
# 2 2.138096 1.280755
# 3 -0.998829 0.371025
print (df[1:2])
# A B C D
# 1 1.948019 2.303557 -1.179559 -0.068304

A column can also be selected with the attribute operator (.):

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 0 3.115677 -0.164456 0.404479 -0.612033
# 1 -0.820360 -1.395105 0.189704 -0.449740
# 2 1.354399 1.575434 -1.080091 -0.219215
# 3 -0.435956 -0.751293 0.828191 1.198294
print(df.A)
# 0 3.115677
# 1 -0.820360
# 2 1.354399
# 3 -0.435956

Statistical Functions

The pct_change() function

Series, DataFrames and Panels all have the pct_change() function. It compares every element with its prior element and computes the percentage change.

By default, pct_change() operates on columns; to apply it row-wise, use axis=1 (a short sketch follows the example below).

import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])# [x,y]:(y-x)/x
print (s.pct_change())
# 0 NaN
# 1 1.000000
# 2 0.500000
# 3 0.333333
# 4 0.250000
# 5 -0.200000
df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change())
# 0 1
# 0 NaN NaN
# 1 -1.475974 -0.719708
# 2 -0.533308 -2.573370
# 3 -0.119521 -2.599591
# 4 14.990426 -1.470710
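The axis=1 case mentioned above is not shown; a minimal sketch on a small made-up frame:

df2 = pd.DataFrame([[10, 20, 40]], columns=['a', 'b', 'c'])
print (df2.pct_change(axis=1))  # each value compared with the one to its left
# a b c
# 0 NaN 1.0 1.0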

cov(): covariance

Covariance is applied to Series data. A Series object has a cov method to compute the covariance between Series objects; NA values are automatically excluded.

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))# -0.6094391964528769

When applied to a DataFrame, the covariance method computes the covariance (cov) values between all pairs of columns.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].cov(frame['b']))
# -0.05532965605044696
print (frame.cov()) # covariance matrix
# a b c d e
# a 0.605689 -0.055330 -0.449482 0.110269 -0.093332
# b -0.055330 0.910118 -0.330857 0.207272 -0.415138
# c -0.449482 -0.330857 0.881411 -0.347100 0.251697
# d 0.110269 0.207272 -0.347100 0.437272 0.021039
# e -0.093332 -0.415138 0.251697 0.021039 0.561143

Note - the cov value between columns a and b in the first statement is the same as the corresponding value returned by cov on the DataFrame.

Correlation

Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods for computing the correlation: pearson (the default), spearman and kendall (a short sketch of the non-default methods follows the example).

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].corr(frame['b']))
# -0.10411289414403013
print (frame.corr())
# a b c d e
# a 1.000000 -0.104113 -0.136458 0.732508 0.372643
# b -0.104113 1.000000 0.098084 -0.074208 0.275227
# c -0.136458 0.098084 1.000000 0.280941 0.592759
# d 0.732508 -0.074208 0.280941 1.000000 0.572732
# e 0.372643 0.275227 0.592759 0.572732 1.000000

If any non-numeric column is present in the DataFrame, it is automatically excluded.
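A minimal sketch of the non-default correlation methods on the same frame (output omitted; the values depend on the random data):

print (frame['a'].corr(frame['b'], method='spearman'))  # rank correlation
print (frame.corr(method='kendall'))                    # Kendall's tau for every pair of columns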

Data Ranking

Data ranking produces a rank for every element in an array of elements (it ranks, rather than simply comparing magnitudes). In the case of ties, the average rank is assigned (other tie-breaking methods are sketched after the example).

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=list('abcde'))
print(s)
# a 0.198131
# b 0.544257
# c -0.253626
# d 0.163365
# e 0.105286
s['d'] = s['b'] # so there's a tie
print (s.rank()) # ranks are assigned by value; two values are tied, so each gets (rank 4 + rank 5)/2 = 4.5
# a 3.0
# b 4.5
# c 1.0
# d 4.5
# e 2.0
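rank() supports several tie-breaking methods besides the default 'average'; a minimal sketch on the same series:

print (s.rank(method='min'))    # tied values both get the lower rank (4.0 here)
print (s.rank(method='first'))  # ties broken by order of appearance
print (s.rank(ascending=False)) # rank from largest to smallest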


Pandas Window Functions

For working with numerical data, Pandas provides several variants: rolling, expanding and exponentially weighted moving-window statistics, including sums, means, medians, variances, covariances, correlations and so on.

.rolling()

This function can be applied to Series data. Specify the window=n argument and apply an appropriate statistical function on top of it.

The window size specifies how many rows take part in each computation.

import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[8,9,10,11],[12,13,14,15]],
index = pd.date_range('1/1/2020', periods=4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 2020-01-01 1 2 3 4
# 2020-01-02 4 5 6 7
# 2020-01-03 8 9 10 11
# 2020-01-04 12 13 14 15
print (df.rolling(window=2).mean())
# A B C D
# 2020-01-01 NaN NaN NaN NaN
# 2020-01-02 2.5 3.5 4.5 5.5
# 2020-01-03 6.0 7.0 8.0 9.0
# 2020-01-04 10.0 11.0 12.0 13.0

.expanding()

This function can be applied to Series data. Specify the min_periods=n argument and apply an appropriate statistical function on top of it.

print (df.expanding(min_periods=2).mean())
# A B C D
# 2020-01-01 NaN NaN NaN NaN
# 2020-01-02 2.500000 3.500000 4.500000 5.500000 (mean of rows 1 and 2)
# 2020-01-03 4.333333 5.333333 6.333333 7.333333 (mean of rows 1, 2 and 3)
# 2020-01-04 6.250000 7.250000 8.250000 9.250000 (mean of all 4 rows)

.ewm()

ewm() can be applied to Series data. Specify the com, span or halflife argument and apply an appropriate statistical function on top of it. It assigns the weights exponentially.

print (df.ewm(com=0.5).mean())
# A B C D
# 2020-01-01 1.000000 2.000000 3.000000 4.000000
# 2020-01-02 3.250000 4.250000 5.250000 6.250000
# 2020-01-03 6.538462 7.538462 8.538462 9.538462
# 2020-01-04 10.225000 11.225000 12.225000 13.225000

Pandas Aggregation

Once rolling, expanding and ewm objects have been created, several methods are available for performing aggregations on the data.

Applying an aggregation on the whole data frame

import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[8,9,10,11],[12,13,14,15]],
index = pd.date_range('1/1/2020', periods=4), columns = ['A', 'B', 'C', 'D'])
print(df)
# A B C D
# 2020-01-01 1 2 3 4
# 2020-01-02 4 5 6 7
# 2020-01-03 8 9 10 11
# 2020-01-04 12 13 14 15
r = df.rolling(window=2,min_periods=1)
print (r) #Rolling [window=2,min_periods=1,center=False,axis=0]
print(r.aggregate(np.sum))
# A B C D
# 2020-01-01 1.0 2.0 3.0 4.0
# 2020-01-02 5.0 7.0 9.0 11.0
# 2020-01-03 12.0 14.0 16.0 18.0
# 2020-01-04 20.0 22.0 24.0 26.0

Applying an aggregation on a single column of the data frame

print (r['A'].aggregate(np.sum))
# 2020-01-01 1.0
# 2020-01-02 5.0
# 2020-01-03 12.0
# 2020-01-04 20.0

Applying an aggregation on multiple columns of a DataFrame

print (r[['A','B']].aggregate(np.sum))
# A B
# 2020-01-01 1.0 2.0
# 2020-01-02 5.0 7.0
# 2020-01-03 12.0 14.0
# 2020-01-04 20.0 22.0

Applying multiple functions on a single column of a DataFrame

print (r['A'].aggregate([np.sum,np.mean]))
# sum mean
# 2020-01-01 1.0 1.0
# 2020-01-02 5.0 2.5
# 2020-01-03 12.0 6.0
# 2020-01-04 20.0 10.0

Applying multiple functions on multiple columns of a DataFrame

print (r[['A','B']].aggregate([np.sum,np.mean]))
# A B
# sum mean sum mean
# 2020-01-01 1.0 1.0 2.0 2.0
# 2020-01-02 5.0 2.5 7.0 3.5
# 2020-01-03 12.0 6.0 14.0 7.0
# 2020-01-04 20.0 10.0 22.0 11.0

Applying different functions to different columns of a DataFrame

print (r.aggregate({'A' : np.sum,'B' : np.mean}))
# A B
# 2020-01-01 1.0 2.0
# 2020-01-02 5.0 3.5
# 2020-01-03 12.0 7.0
# 2020-01-04 20.0 11.0

Pandas Missing Data

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
print(df)
# one two three
# a -1.188673 0.575815 -0.720743
# c -0.044184 -0.581790 -1.967263
# e 0.969510 0.183313 -0.311744
# f 0.358851 0.212901 0.849545
# h 0.302852 -1.235476 -0.113741
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
# one two three
# a -1.188673 0.575815 -0.720743
# b NaN NaN NaN
# c -0.044184 -0.581790 -1.967263
# d NaN NaN NaN
# e 0.969510 0.183313 -0.311744
# f 0.358851 0.212901 0.849545
# g NaN NaN NaN
# h 0.302852 -1.235476 -0.113741

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

Checking for missing values

To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
print(df)
# one two three
# a -0.116164 -0.209310 -0.695126
# c -0.638017 0.303101 0.614645
# e 0.839940 -0.693997 0.213362
# f -0.405377 0.633845 0.196080
# h -0.448376 1.459465 0.051836
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
# one two three
# a -0.116164 -0.209310 -0.695126
# b NaN NaN NaN
# c -0.638017 0.303101 0.614645
# d NaN NaN NaN
# e 0.839940 -0.693997 0.213362
# f -0.405377 0.633845 0.196080
# g NaN NaN NaN
# h -0.448376 1.459465 0.051836
print (df['one'].isnull())
# a False
# b True
# c False
# d True
# e False
# f False
# g True
# h False
print (df['one'].notnull())
# a True
# b False
# c True
# d False
# e True
# f True
# g False
# h True

Calculations with missing data

  • When summing data, NA is treated as zero.
  • If the data are all NA, the result is zero.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=['a', 'c',
'h'],columns=['one', 'two', 'three'])
print(df)
# one two three
# a 1 2 3
# c 4 5 6
# h 7 8 9
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
# one two three
# a 1.0 2.0 3.0
# b NaN NaN NaN
# c 4.0 5.0 6.0
# d NaN NaN NaN
# e NaN NaN NaN
# f NaN NaN NaN
# g NaN NaN NaN
# h 7.0 8.0 9.0
print (df['one'].sum()) #12
df = pd.DataFrame(index=[0,1,2],columns=['one','two'])
print (df['one'].sum()) #0

Cleaning / filling missing data

Pandas provides various methods for cleaning missing values. The fillna() function can "fill in" NA values with non-null data in a couple of ways.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
print(df)
# one two three
# a -0.158275 -1.280046 -0.670825
# c 0.826673 0.510478 -1.862733
# e 1.237025 -1.371099 0.104532
df = df.reindex(['a', 'b', 'c'])
print (df)
# one two three
# a -0.158275 -1.280046 -0.670825
# b NaN NaN NaN
# c 0.826673 0.510478 -1.862733
print (df.fillna(0))
# one two three
# a -0.158275 -1.280046 -0.670825
# b 0.000000 0.000000 0.000000
# c 0.826673 0.510478 -1.862733
The available method values are -

  • pad/ffill - fill values forward
  • bfill/backfill - fill values backward
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
# one two three
# a -0.634633 -1.146325 0.684612
# b NaN NaN NaN
# c 0.378554 0.018536 0.472808
print (df.fillna(method='pad'))
# one two three
# a -0.634633 -1.146325 0.684612
# b -0.634633 -1.146325 0.684612
# c 0.378554 0.018536 0.472808
print (df.fillna(method='backfill'))
# one two three
# a -0.634633 -1.146325 0.684612
# b 0.378554 0.018536 0.472808
# c 0.378554 0.018536 0.472808

Dropping missing values

To simply exclude the missing values, use the dropna function with the axis argument. By default axis=0, i.e., along rows, which means that if any value within a row is NA then the whole row is excluded.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
# one two three
# a -0.147621 -3.030745 -0.633145
# b NaN NaN NaN
# c -1.595044 -0.100556 1.246761
# d NaN NaN NaN
# e -0.990907 -0.416591 0.763210
# f 0.908294 1.061932 -0.889028
# g NaN NaN NaN
# h 1.558081 -0.399321 -0.272056
print (df.dropna())
# one two three
# a -0.147621 -3.030745 -0.633145
# c -1.595044 -0.100556 1.246761
# e -0.990907 -0.416591 0.763210
# f 0.908294 1.061932 -0.889028
# h 1.558081 -0.399321 -0.272056
print (df.dropna(axis=1))
# Empty DataFrame
# Columns: []
# Index: [a, b, c, d, e, f, g, h]

Replacing values

Often a generic value must be replaced with some specific value. This can be achieved by applying the replace method.

Replacing NA with a scalar value is equivalent behavior to the fillna() function.

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[1000,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df)
# one two
# 0 1000 1000
# 1 20 0
# 2 30 30
# 3 40 40
# 4 50 50
# 5 2000 60
print (df.replace({1000:10,2000:60}))
# one two
# 0 10 10
# 1 20 0
# 2 30 30
# 3 40 40
# 4 50 50
# 5 60 60

Pandas GroupBy


import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print (df)
# Points Rank Team Year
# 0 876 1 Riders 2014
# 1 789 2 Riders 2015
# 2 863 2 Devils 2014
# 3 673 3 Devils 2015
# 4 741 3 Kings 2014
# 5 812 4 kings 2015
# 6 756 1 Kings 2016
# 7 788 1 Kings 2017
# 8 694 2 Riders 2016
# 9 701 4 Royals 2014
# 10 804 1 Royals 2015
# 11 690 2 Riders 2017

Splitting data into groups

A Pandas object can be split on any of its keys. There are multiple ways to split an object, such as -

  • obj.groupby('key')
  • obj.groupby(['key1','key2'])
  • obj.groupby(key, axis=1)

Grouping by a single column

print (df.groupby('Team')) #<pandas.core.groupby.DataFrameGroupBy object at 0x110c0d588>
print (df.groupby('Team').groups) # view the groups
# {
# 'Devils': Int64Index([2, 3], dtype='int64'),
# 'Kings': Int64Index([4, 6, 7], dtype='int64'),
# 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
# 'Royals': Int64Index([9, 10], dtype='int64'),
# 'kings': Int64Index([5], dtype='int64')
# }

Grouping by multiple columns

print (df.groupby(['Team','Year']).groups)
# {
# ('Devils', 2014): Int64Index([2], dtype='int64'),
# ('Devils', 2015): Int64Index([3], dtype='int64'),
# ('Kings', 2014): Int64Index([4], dtype='int64'),
# ('Kings', 2016): Int64Index([6], dtype='int64'),
# ('Kings', 2017): Int64Index([7], dtype='int64'),
# ('Riders', 2014): Int64Index([0], dtype='int64'),
# ('Riders', 2015): Int64Index([1], dtype='int64'),
# ('Riders', 2016): Int64Index([8], dtype='int64'),
# ('Riders', 2017): Int64Index([11], dtype='int64'),
# ('Royals', 2014): Int64Index([9], dtype='int64'),
# ('Royals', 2015): Int64Index([10], dtype='int64'),
# ('kings', 2015): Int64Index([5], dtype='int64')
# }

Iterating through the groups

grouped = df.groupby('Year')
for name,group in grouped:
    print (name)
    print (group)
# 2014
# Points Rank Team Year
# 0 876 1 Riders 2014
# 2 863 2 Devils 2014
# 4 741 3 Kings 2014
# 9 701 4 Royals 2014
# 2015
# Points Rank Team Year
# 1 789 2 Riders 2015
# 3 673 3 Devils 2015
# 5 812 4 kings 2015
# 10 804 1 Royals 2015
# 2016
# Points Rank Team Year
# 6 756 1 Kings 2016
# 8 694 2 Riders 2016
# 2017
# Points Rank Team Year
# 7 788 1 Kings 2017
# 11 690 2 Riders 2017

Selecting a group

Using the get_group() method, a single group can be selected.

grouped = df.groupby('Year')
print (grouped.get_group(2014))
# Points Rank Team Year
# 0 876 1 Riders 2014
# 2 863 2 Devils 2014
# 4 741 3 Kings 2014
# 9 701 4 Royals 2014

Aggregations

An aggregation function returns a single aggregated value for each group. Once the group-by object is created, several aggregation operations can be performed on the grouped data. A common one is aggregation via agg, or its equivalent aggregate method.

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)
# Points Rank Team Year
# 0 876 1 Riders 2014
# 1 789 2 Riders 2015
# 2 863 2 Devils 2014
# 3 673 3 Devils 2015
# 4 741 3 Kings 2014
# 5 812 4 kings 2015
# 6 756 1 Kings 2016
# 7 788 1 Kings 2017
# 8 694 2 Riders 2016
# 9 701 4 Royals 2014
# 10 804 1 Royals 2015
# 11 690 2 Riders 2017
grouped = df.groupby('Year')
print (grouped['Points'].agg(np.mean))
# Year
# 2014 795.25
# 2015 769.50
# 2016 725.00
# 2017 739.00

Another way to see the size of each group is to apply the size() function:

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print (grouped.agg(np.size))
# Points Rank Year
# Team
# Devils 2 2 2
# Kings 3 3 3
# Riders 4 4 4
# Royals 2 2 2
# kings 1 1 1

Applying multiple aggregation functions at once: with a grouped Series, you can also pass a list or dict of functions to aggregate with, producing a DataFrame as output.

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
agg = grouped['Points'].agg([np.sum, np.mean, np.std])
print (agg)
# sum mean std
# Team
# Devils 1536 768.000000 134.350288
# Kings 2285 761.666667 24.006943
# Riders 3049 762.250000 88.567771
# Royals 1505 752.500000 72.831998
# kings 812 812.000000 NaN

Transformation: a transformation on a group or column returns an object that is indexed the same size as the one being grouped. Thus, the transform should return a result that is the same size as that of the group chunk.

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)
# Points Rank Team Year
# 0 876 1 Riders 2014
# 1 789 2 Riders 2015
# 2 863 2 Devils 2014
# 3 673 3 Devils 2015
# 4 741 3 Kings 2014
# 5 812 4 kings 2015
# 6 756 1 Kings 2016
# 7 788 1 Kings 2017
# 8 694 2 Riders 2016
# 9 701 4 Royals 2014
# 10 804 1 Royals 2015
# 11 690 2 Riders 2017
grouped = df.groupby('Team')
score = lambda x: x+10
print (grouped.transform(score))
# Points Rank Year
# 0 886 11 2024
# 1 799 12 2025
# 2 873 12 2024
# 3 683 13 2025
# 4 751 13 2024
# 5 822 14 2025
# 6 766 11 2026
# 7 798 11 2027
# 8 704 12 2026
# 9 711 14 2024
# 10 814 11 2025
# 11 700 12 2027

Filtration: filtration filters the data on a defined criterion and returns a subset of the data. The filter() function is used to filter the data.

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
filter = df.groupby('Team').filter(lambda x: len(x) >= 3)
print (filter) # with this filter, only teams that appear three or more times in the data are returned
# Points Rank Team Year
# 0 876 1 Riders 2014
# 1 789 2 Riders 2015
# 4 741 3 Kings 2014
# 6 756 1 Kings 2016
# 7 788 1 Kings 2017
# 8 694 2 Riders 2016
# 11 690 2 Riders 2017

Pandas Merging / Joining

Pandas has full-featured, high-performance in-memory join operations very similar to relational databases like SQL. Pandas provides a single merge() function as the entry point for all standard database join operations between DataFrame objects.

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)

The following arguments can be used here -

  • left - a DataFrame object.
  • right - another DataFrame object.
  • on - column (name) to join on; it must exist in both the left and right DataFrame objects.
  • left_on - columns from the left DataFrame to use as keys; can be column names or arrays with length equal to the length of the DataFrame.
  • right_on - columns from the right DataFrame to use as keys; can be column names or arrays with length equal to the length of the DataFrame.
  • left_index - if True, use the index (row labels) of the left DataFrame as its join key(s). For a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
  • right_index - same usage as left_index, but for the right DataFrame.
  • how - one of left, right, outer or inner; defaults to inner. Each method is described below.
  • sort - sort the resulting DataFrame by the join keys in lexicographical order. Defaults to True; setting it to False substantially improves performance in many cases.
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print (left)
print("========================================")
print (right)
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
========================================
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5

The on argument

Merging two data frames on a single key:

import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left,right,on='id')
print(rs)
# Name_x id subject_id_x Name_y subject_id_y
# 0 Alex 1 sub1 Billy sub2
# 1 Amy 2 sub2 Brian sub4
# 2 Allen 3 sub4 Bran sub3
# 3 Alice 4 sub6 Bryce sub6
# 4 Ayoung 5 sub5 Betty sub5

Merging two data frames on multiple keys:

rs = pd.merge(left,right,on=['id','subject_id'])
print(rs)
# Name_x id subject_id Name_y
# 0 Alice 4 sub6 Bryce
# 1 Ayoung 5 sub5 Betty

The how argument


rs = pd.merge(left, right, on='subject_id', how='left')
print (rs)
# Name_x id_x subject_id Name_y id_y
# 0 Alex 1 sub1 NaN NaN
# 1 Amy 2 sub2 Billy 1.0
# 2 Allen 3 sub4 Brian 2.0
# 3 Alice 4 sub6 Bryce 4.0
# 4 Ayoung 5 sub5 Betty 5.0
rs = pd.merge(left, right, on='subject_id', how='right')
print (rs)
# Name_x id_x subject_id Name_y id_y
# 0 Amy 2.0 sub2 Billy 1
# 1 Allen 3.0 sub4 Brian 2
# 2 Alice 4.0 sub6 Bryce 4
# 3 Ayoung 5.0 sub5 Betty 5
# 4 NaN NaN sub3 Bran 3
rs = pd.merge(left, right, how='outer', on='subject_id')
print (rs)
# Name_x id_x subject_id Name_y id_y
# 0 Alex 1.0 sub1 NaN NaN
# 1 Amy 2.0 sub2 Billy 1.0
# 2 Allen 3.0 sub4 Brian 2.0
# 3 Alice 4.0 sub6 Bryce 4.0
# 4 Ayoung 5.0 sub5 Betty 5.0
# 5 NaN NaN sub3 Bran 3.0
rs = pd.merge(left, right, on='subject_id', how='inner')
print (rs)
# Name_x id_x subject_id Name_y id_y
# 0 Amy 2 sub2 Billy 1
# 1 Allen 3 sub4 Brian 2
# 2 Alice 4 sub6 Bryce 4
# 3 Ayoung 5 sub5 Betty 5

Pandas Concatenation

Pandas provides various facilities for easily combining Series, DataFrame and Panel objects together.

pd.concat(objs,axis=0,join='outer',join_axes=None,
ignore_index=False)
  • objs - a sequence or mapping of Series, DataFrame or Panel objects.
  • axis - {0, 1, ...}, default 0. The axis to concatenate along.
  • join - {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis: outer for union, inner for intersection.
  • ignore_index - boolean, default False. If True, do not use the index values on the concatenation axis; the resulting axis is labeled 0, ..., n-1.
  • join_axes - a list of Index objects. Specific indexes to use for the other (n-1) axes instead of performing inner/outer set logic.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one)
# Marks_scored Name subject_id
# 1 98 Alex sub1
# 2 90 Amy sub2
# 3 87 Allen sub4
# 4 69 Alice sub6
# 5 78 Ayoung sub5
print(two)
# Marks_scored Name subject_id
# 1 89 Billy sub2
# 2 80 Brian sub4
# 3 79 Bran sub3
# 4 97 Bryce sub6
# 5 88 Betty sub5
rs = pd.concat([one,two])
print(rs)
# Marks_scored Name subject_id
# 1 98 Alex sub1
# 2 90 Amy sub2
# 3 87 Allen sub4
# 4 69 Alice sub6
# 5 78 Ayoung sub5
# 1 89 Billy sub2
# 2 80 Brian sub4
# 3 79 Bran sub3
# 4 97 Bryce sub6
# 5 88 Betty sub5

The indexes in the result are duplicated: each index value repeats. If the resulting object should follow its own (new) index, set ignore_index to True.

rs = pd.concat([one,two],ignore_index=True)
print(rs)
# Marks_scored Name subject_id
# 0 98 Alex sub1
# 1 90 Amy sub2
# 2 87 Allen sub4
# 3 69 Alice sub6
# 4 78 Ayoung sub5
# 5 89 Billy sub2
# 6 80 Brian sub4
# 7 79 Bran sub3
# 8 97 Bryce sub6
# 9 88 Betty sub5

If the two objects need to be combined along axis=1, the new columns are appended:

rs = pd.concat([one,two],axis=1)
print(rs)
# Marks_scored Name subject_id Marks_scored Name subject_id
# 1 98 Alex sub1 89 Billy sub2
# 2 90 Amy sub2 80 Brian sub4
# 3 87 Allen sub4 79 Bran sub3
# 4 69 Alice sub6 97 Bryce sub6
# 5 78 Ayoung sub5 88 Betty sub5

A useful shortcut for concatenation is the append method on Series and DataFrame instances, which actually predates concat(). It concatenates along axis=0, i.e., the index. (As noted earlier, append() was deprecated and later removed in favor of pd.concat().)

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
rs = one.append(two)
print(rs)

The append() function can also take multiple objects:

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
rs = one.append([two,one,two])
print(rs)

Time Series

Pandas provides a robust toolkit for working with time-series data, especially in the financial domain. When working with time-series data we frequently encounter the following -

  • generating a sequence of times
  • converting a time series to a different frequency (a short resample sketch follows this list)
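Frequency conversion is not demonstrated below; a minimal sketch, using a made-up daily series, of converting it to a monthly frequency with resample():

import pandas as pd
import numpy as np

idx = pd.date_range('2018-01-01', periods=60, freq='D')  # hypothetical daily series
s = pd.Series(np.arange(60), index=idx)
print(s.resample('M').mean())  # downsample: mean of each calendar month
# 2018-01-31 15.0
# 2018-02-28 45.0
# Freq: M, dtype: float64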

Getting the current time

pd.Timestamp.now() (formerly pd.datetime.now()) returns the current date and time.

import pandas as pd
print(pd.Timestamp.now())  # pd.datetime.now() in older pandas; pd.datetime was removed in pandas 1.0

Creating a Timestamp

Timestamped data is the most basic type of time-series data: it associates values with points in time. For Pandas objects this means using a point in time.

import pandas as pd
time = pd.Timestamp('2018-10-01')
print(time)
# 2018-10-01 00:00:00

Creating a range of times

import pandas as pd
time = pd.date_range("12:00", "23:59", freq="30min").time
print(time)
# [datetime.time(12, 0) datetime.time(12, 30) datetime.time(13, 0)
# datetime.time(13, 30) datetime.time(14, 0) datetime.time(14, 30)
# datetime.time(15, 0) datetime.time(15, 30) datetime.time(16, 0)
# datetime.time(16, 30) datetime.time(17, 0) datetime.time(17, 30)
# datetime.time(18, 0) datetime.time(18, 30) datetime.time(19, 0)
# datetime.time(19, 30) datetime.time(20, 0) datetime.time(20, 30)
# datetime.time(21, 0) datetime.time(21, 30) datetime.time(22, 0)
# datetime.time(22, 30) datetime.time(23, 0) datetime.time(23, 30)]

Changing the time frequency

import pandas as pd
time = pd.date_range("12:00", "23:59", freq="H").time
print(time)
# [datetime.time(12, 0) datetime.time(13, 0) datetime.time(14, 0)
# datetime.time(15, 0) datetime.time(16, 0) datetime.time(17, 0)
# datetime.time(18, 0) datetime.time(19, 0) datetime.time(20, 0)
# datetime.time(21, 0) datetime.time(22, 0) datetime.time(23, 0)]

Pandas Categorical Data

Creating category objects

By specifying the dtype as "category" when creating the pandas object:

import pandas as pd
s = pd.Series(["a","b","c","a"])
print(s)
# 0 a
# 1 b
# 2 c
# 3 a
# dtype: object
s = pd.Series(["a","b","c","a"], dtype="category")
print (s)
# 0 a
# 1 b
# 2 c
# 3 a
# dtype: category
# Categories (3, object): [a, b, c]

Four elements were passed to the Series object, but there are only three categories; observe the same in the output Categories.

Using the standard pandas Categorical constructor, we can create a category object:
pandas.Categorical(values, categories, ordered)
import pandas as pd
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print (cat)
# [a, b, c, a, b, c]
# Categories (3, object): [a, b, c]
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'])
print (cat)
# [a, b, c, a, b, c, NaN]
# Categories (3, object): [c, b, a]
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'],ordered=True)
print (cat)
# [a, b, c, a, b, c, NaN]
# Categories (3, object): [c < b < a]
  • The second argument specifies the categories; any value not present in the categories is treated as NaN.
  • Logically, ordered means that a is greater than b and b is greater than c.

Describe()

import pandas as pd
import numpy as np
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})
print(df)
# cat s
# 0 a a
# 1 c c
# 2 c c
# 3 NaN NaN
print (df.describe())
# cat s
# count 3 3
# unique 2 2
# top c c
# freq 2 2
print (df["cat"].describe())
# count 3
# unique 2
# top c
# freq 2
# Name: cat, dtype: object

Pandas Visualization

Line plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('2018/12/18',
periods=10), columns=list('ABCD'))
print(df)
# A B C D
# 2018-12-18 0.141680 0.512593 -1.162793 0.780302
# 2018-12-19 -0.692091 -0.315033 -2.442913 1.066298
# 2018-12-20 1.025117 1.543544 1.111169 -0.143340
# 2018-12-21 0.156190 0.312793 -1.776588 -0.191348
# 2018-12-22 0.075410 0.628234 -0.611224 -0.191468
# 2018-12-23 -0.864251 -0.357473 -0.144735 -0.261345
# 2018-12-24 0.162799 -0.869081 -0.377572 0.333409
# 2018-12-25 -0.186183 -2.437047 0.441362 0.859709
# 2018-12-26 0.332621 -1.022226 0.170011 -0.197164
# 2018-12-27 -0.409928 1.305865 -0.435077 -0.816023
df.plot()
plt.show()

(figure: line plot of columns A-D over the date index)

If the index consists of dates, gcf().autofmt_xdate() is called to format the x-axis, as shown in the figure above.

One column can be plotted against another using the x and y keywords, as sketched below.
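
A minimal sketch of plotting one column against another with the x and y keywords (column names assumed to match the DataFrame above):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
# column 'A' on the x-axis, column 'B' on the y-axis
df.plot(x='A', y='B')
plt.show()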

Bar plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
plt.show()

(figure: bar chart of columns a-d)

To produce a stacked bar plot, pass stacked=True:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar(stacked=True)
plt.show()

(figure: stacked bar chart)

To get horizontal bar plots, use the barh() method:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.barh(stacked=True)
plt.show()

(figure: horizontal stacked bar chart)

Histogram

Histograms can be drawn with the plot.hist() method; the number of bins can be specified:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c':
np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.plot.hist(bins=20)
plt.show()

(figure: overlaid histograms of columns a, b and c with 20 bins)

To draw a separate histogram for each column:

df.hist(bins=20)
plt.show()

(figure: one histogram subplot per column)


Plotting on Series and DataFrame is simply a wrapper around matplotlib's plot() method. Besides the default line plot, several other plot styles are available; they are selected through the kind keyword argument of plot() and include the following (see the sketch after the list):

  • bar or barh for bar plots
  • hist for histograms
  • box for box plots
  • area for area plots
  • scatter for scatter plots
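
A minimal sketch of the kind keyword; this is equivalent to calling df.plot.bar() as in the bar plot section above:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
# same result as df.plot.bar()
df.plot(kind='bar')
plt.show()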

Box plot

Box plots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
plt.show()

(figure: box plot of columns A-E)

This box plot represents five trials of ten observations of a uniform random variable on [0, 1).

Scatter plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')
plt.show()

(figure: scatter plot of column a against column b)

Pie chart

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)
plt.show()

(figure: pie chart of column x)

IO Tools

The pandas I/O API is a set of top-level reader functions, such as pd.read_csv(), that return pandas objects.

The two workhorse functions for reading text (flat) files are read_csv() and read_table(). They share the same parsing code and intelligently convert tabular data into a DataFrame:

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None)
########################################################
pandas.read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer',
names=None, index_col=None, usecols=None)


read_csv()

read_csv() reads data from a CSV file and creates a DataFrame. The examples below read a local file temp.csv (contents not shown) whose columns include S.No and Salary:

import pandas as pd
df=pd.read_csv("temp.csv")
print (df)


Custom index

A column of the CSV file can be used as the index by passing it to index_col:

import pandas as pd
df=pd.read_csv("temp.csv",index_col=['S.No'])
print (df)


Type conversion

The dtype argument forces a column to a specific type:

import pandas as pd
import numpy as np
df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print (df.dtypes)

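Besides dtype, read_csv() also accepts a converters argument: a per-column function applied while parsing. A minimal sketch, assuming the same temp.csv with a Salary column:

import pandas as pd

# apply a custom parsing function to the Salary column instead of a plain dtype cast
df = pd.read_csv("temp.csv", converters={'Salary': lambda v: float(v) / 1000.0})
print(df.dtypes)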

Specifying column names

The names argument supplies the column header names:

import pandas as pd
import numpy as np
df=pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
print (df)


Notice that the custom names are used in addition to the header already present in the file, which now appears as a data row; use the header argument to remove it.

If the header is in a row other than the first, pass that row number to header; the rows before it are skipped (see the StringIO sketch below for a non-zero header):

import pandas as pd
import numpy as np
df=pd.read_csv("temp.csv",names=['a','b','c','d','e'],header=0)
print (df)

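A self-contained sketch of a header that is not on the first row, using an in-memory file so it runs without temp.csv (the preamble lines and data are made up):

import pandas as pd
from io import StringIO

raw = StringIO(
    "generated by export tool\n"
    "2018-06-15\n"
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
)
# the real header is on row 2 (0-based); the two preamble lines above it are skipped
df = pd.read_csv(raw, header=2)
print(df)
#    a  b  c
# 0  1  2  3
# 1  4  5  6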

skiprows

skiprows skips the specified number of rows (or the given row numbers) at the start of the file:

import pandas as pd
import numpy as np
df=pd.read_csv("temp.csv", skiprows=2)
print (df)


Pandas Sparse Data

Sparse objects are "compressed" by omitting data that matches a particular value (NaN / missing by default, though any value can be chosen). A special SparseIndex object tracks where the data has been "sparsified". This makes more sense with an example. All of the standard pandas data structures provide a to_sparse method:

import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print (sts)
# 0 -0.554670
# 1 1.738188
# 2 NaN
# 3 NaN
# 4 NaN
# 5 NaN
# 6 NaN
# 7 NaN
# 8 -1.363170
# 9 -0.142478

Sparse objects exist for reasons of memory efficiency.
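
Note that to_sparse() was deprecated in pandas 0.25 and removed in 1.0; recent versions represent sparse data with the Sparse[dtype] extension dtype and the .sparse accessor instead. A rough equivalent of the example above on current pandas:

import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan

# convert to the sparse extension dtype; NaN is the default fill value for floats
sts = ts.astype("Sparse[float64]")
print(sts.dtype)           # Sparse[float64, nan]
print(sts.sparse.density)  # 0.4 -- four of the ten values are actually stored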

Now suppose you have a large, mostly-NA DataFrame and execute the following code:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4))
df.loc[:9998] = np.nan
sdf = df.to_sparse()
print (sdf.density)
# 0.0001

Any sparse object can be converted back to the standard dense form by calling to_dense:

import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts)
# 0 -2.222412
# 1 -0.074302
# 2 NaN
# 3 NaN
# 4 NaN
# 5 NaN
# 6 NaN
# 7 NaN
# 8 -0.289743
# 9 0.256266
# dtype: float64
# BlockIndex
# Block locations: array([0, 8], dtype=int32)
# Block lengths: array([2, 2], dtype=int32)
print (sts.to_dense())
# 0 -2.222412
# 1 -0.074302
# 2 NaN
# 3 NaN
# 4 NaN
# 5 NaN
# 6 NaN
# 7 NaN
# 8 -0.289743
# 9 0.256266
# dtype: float64

Sparse dtypes

Sparse data should have the same dtype as its dense representation. Currently the float64, int64 and bool dtypes are supported. Depending on the original dtype, the default fill_value differs:

  • float64: np.nan
  • int64: 0
  • bool: False
import pandas as pd
import numpy as np
s = pd.Series([1, np.nan, np.nan])
# to_sparse() returns a new sparse object; the original Series is unchanged
sts = s.to_sparse()
print(sts)
# 0    1.0
# 1    NaN
# 2    NaN
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)
print(sts.fill_value)
# nan

Boolean comparisons

Comparison operators such as == and != return a boolean Series:

import pandas as pd
s = pd.Series(range(5))
print(s == 4)  # True only where the value equals 4
# 0 False
# 1 False
# 2 False
# 3 False
# 4 True
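
The resulting boolean Series can be used as a mask to filter the original data, for example:

import pandas as pd

s = pd.Series(range(5))
# keep only the elements where the condition holds
print(s[s != 4])
# 0    0
# 1    1
# 2    2
# 3    3
# dtype: int64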

isin()

isin() returns a boolean Series showing whether each element of the Series is exactly contained in the passed sequence of values:

import pandas as pd
s = pd.Series(list('abc'))
s = s.isin(['a', 'c', 'e'])
print (s)
# 0 True
# 1 False
# 2 True
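
isin() also exists on DataFrame, and a Series mask can be inverted with ~ to select the complement; a small sketch:

import pandas as pd

df = pd.DataFrame({'ids': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
# element-wise membership test across the whole frame
print(df.isin(['a', 'c', 2]))

s = pd.Series(list('abc'))
print(s[~s.isin(['a', 'c'])])
# 1    b
# dtype: object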