
When building a complete panel dataset for analysis, you often need to combine several files. When those files overlap, naive concatenation leaves duplicate rows behind. Starting from pandas' duplicate-handling functions, this article works through a practical way to deduplicate overlapping files.
pandas offers two main functions for handling duplicates:
duplicated(subset: 'Hashable | Sequence[Hashable] | None' = None,
           keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first') -> 'Series'
(method of pandas.core.frame.DataFrame)
Return boolean Series denoting duplicate rows. Considering certain columns is optional.
Returns: Series
Boolean Series flagging each duplicated row.
See also:
Index.duplicated : Equivalent method on index.
Series.duplicated : Equivalent method on Series.
Series.drop_duplicates : Remove duplicate values from Series.
DataFrame.drop_duplicates : Remove duplicate values from DataFrame.
Consider a dataset of ramen ratings:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, within each set of duplicated rows the first occurrence is marked False and all others True.
df.duplicated()
0    False   # first occurrence of the duplicate pair: marked False
1     True   # second occurrence: marked True
2    False
3    False
4    False
dtype: bool
Setting keep='last' marks the last occurrence in each duplicate set False and all others True.
df.duplicated(keep='last')
0     True   # first occurrence of the duplicate pair: marked True
1    False   # last occurrence: marked False
2    False
3    False
4    False
dtype: bool
With keep=False, every duplicated row is marked True:
df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool
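One common use of keep=False is to filter a frame down to rows that appear exactly once, by negating the mask. A minimal sketch with the same ramen data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False flags every member of a duplicate group, so negating the
# mask keeps only rows that occur exactly once (rows 2, 3, 4 here).
unique_rows = df[~df.duplicated(keep=False)]
print(unique_rows)
```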
To check for duplicates on specific column(s) only, use the subset parameter.
df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
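subset also accepts several columns at once; a row then counts as a duplicate only when the whole combination has been seen before. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# A row is flagged only if its (brand, style) pair already appeared:
# row 1 repeats (Yum Yum, cup) and row 4 repeats (Indomie, pack).
mask = df.duplicated(subset=['brand', 'style'])
print(mask.tolist())  # [False, True, False, False, True]
```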
Explanation of the drop_duplicates function:
drop_duplicates(subset: 'Hashable | Sequence[Hashable] | None' = None,
                keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first',
                inplace: 'bool' = False,
                ignore_index: 'bool' = False) -> 'DataFrame | None'
(method of pandas.core.frame.DataFrame)
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes, are ignored.
Returns: DataFrame or None
DataFrame with duplicates removed or None if ``inplace=True``.
See also:
DataFrame.value_counts : Count unique combinations of columns.
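Note the return type above: with inplace=True the frame is modified in place and the call returns None, so you cannot chain on the result. A quick sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2]})

# inplace=True mutates df directly and returns None;
# chaining another method onto `result` would raise an AttributeError.
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2 rows remain after the duplicate is dropped
```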
The examples reuse the df above. By default, the function removes rows that duplicate an earlier row across all columns:
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To deduplicate on specific column(s), use the subset parameter:
df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
To keep the last occurrence of each duplicate instead of the first, use the keep parameter:
df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
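The signature above also lists ignore_index; setting it to True relabels the surviving rows 0..n-1 instead of keeping their original labels. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Without ignore_index the survivors would keep labels 1, 2, 4;
# with ignore_index=True they are renumbered 0, 1, 2.
deduped = df.drop_duplicates(subset=['brand', 'style'], keep='last',
                             ignore_index=True)
print(list(deduped.index))  # [0, 1, 2]
```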
Application to panel data
As shown below, df1 holds the stale data and df2 the updated data. The rows for 2022-05-06, 05-09 and 05-10 differ between df1 and df2, so after concatenation the updated values should win: df1's rows for those three dates must be removed.
import pandas as pd
df1
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594
4 2022/5/9 2.613143
5 2022/5/10 2.664768
df2
date val
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
After concatenating, set subset to 'date' and keep to 'last'; finally, set 'date' as the index to tidy up the result.
df3 = pd.concat([df1, df2])  # DataFrame.append was deprecated and removed in pandas 2.0
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594 #
4 2022/5/9 2.613143 #
5 2022/5/10 2.664768 #
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# The rows marked # above are the ones removed once drop_duplicates is applied below.
df3 = df3.drop_duplicates(subset='date', keep='last')
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# Setting 'date' as the index makes the result tidier
df3.set_index('date')
val
date
2022/4/28 2.490852
2022/4/29 2.594890
2022/5/5 2.637793
2022/5/6 2.600081
2022/5/9 2.615622
2022/5/10 2.667454
2022/5/11 2.708073
2022/5/12 2.710368
2022/5/13 2.734073
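The whole workflow above can be written as one chained pipeline (values reproduced from the tables shown earlier):

```python
import pandas as pd

# Stale data (df1) and updated data (df2), as in the article.
df1 = pd.DataFrame({
    'date': ['2022/4/28', '2022/4/29', '2022/5/5', '2022/5/6', '2022/5/9', '2022/5/10'],
    'val': [2.490852, 2.594890, 2.637793, 2.598594, 2.613143, 2.664768],
})
df2 = pd.DataFrame({
    'date': ['2022/5/6', '2022/5/9', '2022/5/10', '2022/5/11', '2022/5/12', '2022/5/13'],
    'val': [2.600081, 2.615622, 2.667454, 2.708073, 2.710368, 2.734073],
})

# Concatenate, keep the newer copy of each overlapping date, index by date.
df3 = (
    pd.concat([df1, df2], ignore_index=True)
      .drop_duplicates(subset='date', keep='last')
      .set_index('date')
)
print(len(df3))                    # 9 unique dates
print(df3.loc['2022/5/6', 'val'])  # 2.600081 -- taken from df2, not df1
```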