
When building a complete panel dataset for analysis, you often need to combine several files. When those files overlap, naive concatenation leaves duplicate rows behind. Starting from pandas' duplicate-handling functions, this article works through a practical way to deduplicate overlapping files.
pandas offers two main functions for handling duplicates:
duplicated(subset: 'Hashable | Sequence[Hashable] | None' = None,
           keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first') -> 'Series'
(method of pandas.core.frame.DataFrame)
Return boolean Series denoting duplicate rows. Considering certain columns is optional.
Returns: Series
Boolean Series flagging each duplicated row.
See also:
Index.duplicated : Equivalent method on index.
Series.duplicated : Equivalent method on Series.
Series.drop_duplicates : Remove duplicate values from Series.
DataFrame.drop_duplicates : Remove duplicate values from DataFrame.
Consider a dataset of ramen ratings:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, within each set of duplicated rows the first occurrence is marked False and all others True.
df.duplicated()
0    False   # first occurrence of the duplicate pair: marked False
1     True   # second occurrence: marked True
2    False
3    False
4    False
dtype: bool
Setting keep='last' marks the last occurrence in each duplicate set False and all others True.
df.duplicated(keep='last')
0     True   # first occurrence of the duplicate pair: marked True
1    False   # last occurrence: marked False
2    False
3    False
4    False
dtype: bool
With keep=False, every duplicated row is marked True:
df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool
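One common use of keep=False is to filter a frame down to rows that appear exactly once, by negating the mask. A minimal sketch with the same ramen data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False flags every member of a duplicate group, so negating the
# mask keeps only rows that occur exactly once (rows 2, 3, 4 here).
unique_rows = df[~df.duplicated(keep=False)]
print(unique_rows)
```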
To check for duplicates on specific column(s) only, use the subset parameter.
df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
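subset also accepts several columns at once; a row then counts as a duplicate only when the whole combination has been seen before. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# A row is flagged only if its (brand, style) pair already appeared:
# row 1 repeats (Yum Yum, cup) and row 4 repeats (Indomie, pack).
mask = df.duplicated(subset=['brand', 'style'])
print(mask.tolist())  # [False, True, False, False, True]
```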
Explanation of the drop_duplicates function:
drop_duplicates(subset: 'Hashable | Sequence[Hashable] | None' = None,
                keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first',
                inplace: 'bool' = False,
                ignore_index: 'bool' = False) -> 'DataFrame | None'
(method of pandas.core.frame.DataFrame)
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes, are ignored.
Returns: DataFrame or None
DataFrame with duplicates removed or None if ``inplace=True``.
See also:
DataFrame.value_counts : Count unique combinations of columns.
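Note the return type above: with inplace=True the frame is modified in place and the call returns None, so you cannot chain on the result. A quick sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2]})

# inplace=True mutates df directly and returns None;
# chaining another method onto `result` would raise an AttributeError.
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2 rows remain after the duplicate is dropped
```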
The examples reuse the df above. By default, the function removes rows that duplicate an earlier row across all columns:
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To deduplicate on specific column(s), use the subset parameter:
df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
To keep the last occurrence of each duplicate instead of the first, use the keep parameter:
df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
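The signature above also lists ignore_index; setting it to True relabels the surviving rows 0..n-1 instead of keeping their original labels. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Without ignore_index the survivors would keep labels 1, 2, 4;
# with ignore_index=True they are renumbered 0, 1, 2.
deduped = df.drop_duplicates(subset=['brand', 'style'], keep='last',
                             ignore_index=True)
print(list(deduped.index))  # [0, 1, 2]
```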
Application to panel data
As shown below, df1 holds the stale data and df2 the updated data. The rows for 2022-05-06, 05-09 and 05-10 differ between df1 and df2, so after concatenation the updated values should win: df1's rows for those three dates must be removed.
import pandas as pd
df1
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594
4 2022/5/9 2.613143
5 2022/5/10 2.664768
df2
date val
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
After concatenating, set subset to 'date' and keep to 'last'; finally, set 'date' as the index to tidy up the result.
df3 = pd.concat([df1, df2])  # DataFrame.append was deprecated and removed in pandas 2.0
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594 #
4 2022/5/9 2.613143 #
5 2022/5/10 2.664768 #
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# The rows marked # above are the ones removed once drop_duplicates is applied below.
df3 = df3.drop_duplicates(subset='date', keep='last')
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# Setting 'date' as the index makes the result tidier
df3.set_index('date')
val
date
2022/4/28 2.490852
2022/4/29 2.594890
2022/5/5 2.637793
2022/5/6 2.600081
2022/5/9 2.615622
2022/5/10 2.667454
2022/5/11 2.708073
2022/5/12 2.710368
2022/5/13 2.734073
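The whole workflow above can be written as one chained pipeline (values reproduced from the tables shown earlier):

```python
import pandas as pd

# Stale data (df1) and updated data (df2), as in the article.
df1 = pd.DataFrame({
    'date': ['2022/4/28', '2022/4/29', '2022/5/5', '2022/5/6', '2022/5/9', '2022/5/10'],
    'val': [2.490852, 2.594890, 2.637793, 2.598594, 2.613143, 2.664768],
})
df2 = pd.DataFrame({
    'date': ['2022/5/6', '2022/5/9', '2022/5/10', '2022/5/11', '2022/5/12', '2022/5/13'],
    'val': [2.600081, 2.615622, 2.667454, 2.708073, 2.710368, 2.734073],
})

# Concatenate, keep the newer copy of each overlapping date, index by date.
df3 = (
    pd.concat([df1, df2], ignore_index=True)
      .drop_duplicates(subset='date', keep='last')
      .set_index('date')
)
print(len(df3))                    # 9 unique dates
print(df3.loc['2022/5/6', 'val'])  # 2.600081 -- taken from df2, not df1
```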