Start by adding a DatTim column (of type datetime64), built from the Date and Time columns:
df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)
Then, assuming that ID is an “ordinary” column (not the index), you should:
- call groupby on the DatTim column with a 5-minute frequency,
- to each group apply drop_duplicates, with subset including only the ID column,
- and finally drop DatTim from the index.
Expressing the above instruction in Python:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
.apply(lambda grp: grp.drop_duplicates(subset="ID"))\
.reset_index(level=0, drop=True)
If you print(df2), you will get:
ID Date Time DatTim
0 12 2012-1-1 00:01:00 2012-01-01 00:01:00
1 13 2012-1-1 00:01:30 2012-01-01 00:01:30
3 12 2012-1-1 00:05:10 2012-01-01 00:05:10
4 12 2012-1-1 00:10:00 2012-01-01 00:10:00
To “clean up”, you can drop the DatTim column (note that drop returns a new DataFrame, so assign the result if you want to keep it):
df2 = df2.drop(columns='DatTim')
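To make this reproducible end to end, here is a minimal sketch with made-up sample data; the ID/Date/Time values are assumptions chosen to match the output shown above:

```python
import pandas as pd

# Hypothetical input: duplicate ID 12 appears twice in the first 5-minute window
df = pd.DataFrame({
    'ID':   [12, 13, 12, 12, 12],
    'Date': ['2012-1-1'] * 5,
    'Time': ['00:01:00', '00:01:30', '00:02:00', '00:05:10', '00:10:00'],
})

# Build the DatTim column from Date and Time
df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)

# Group into 5-minute bins, drop duplicate IDs inside each bin,
# then drop the DatTim bin level from the index
df2 = (df.groupby(pd.Grouper(key='DatTim', freq='5min'))
         .apply(lambda grp: grp.drop_duplicates(subset='ID'))
         .reset_index(level=0, drop=True))

# Clean up the helper column
df2 = df2.drop(columns='DatTim')
```

Row 2 (the second ID 12 inside the 00:00–00:05 bin) is dropped, so df2 keeps the original index labels 0, 1, 3, 4.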
Edit
If ID is the index, a slight change is required:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
.apply(lambda grp: grp[~grp.index.duplicated(keep='first')])\
.reset_index(level=0, drop=True)
And then the printed df2 is:
Date Time DatTim
ID
12 2012-1-1 00:01:00 2012-01-01 00:01:00
13 2012-1-1 00:01:30 2012-01-01 00:01:30
12 2012-1-1 00:05:10 2012-01-01 00:05:10
12 2012-1-1 00:10:00 2012-01-01 00:10:00
Of course, in this case too you can drop the DatTim column afterwards.
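Analogously, a minimal sketch for the index case, again with made-up sample data (the same assumed values as before, but with ID as the index):

```python
import pandas as pd

# Hypothetical input with ID as the index instead of a column
df = pd.DataFrame({
    'Date': ['2012-1-1'] * 5,
    'Time': ['00:01:00', '00:01:30', '00:02:00', '00:05:10', '00:10:00'],
}, index=pd.Index([12, 13, 12, 12, 12], name='ID'))

df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)

# Within each 5-minute bin, keep only the first row per index label,
# using Index.duplicated instead of drop_duplicates
df2 = (df.groupby(pd.Grouper(key='DatTim', freq='5min'))
         .apply(lambda grp: grp[~grp.index.duplicated(keep='first')])
         .reset_index(level=0, drop=True))
```

The second ID 12 inside the first bin is filtered out, leaving the index labels 12, 13, 12, 12.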