[Solved] Removing duplicates every 5 minutes [closed]


Start by adding a DatTim column (of datetime type), built from the
Date and Time columns:

df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)
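
For a self-contained check of this step, here is a minimal sketch. The sample rows are an assumption: four of them are taken from the output printed later in this answer, while the row at index 2 (ID 12 at 00:04:00) is hypothetical, since the original question data is not reproduced here.

```python
import pandas as pd

# Hypothetical sample data; the row at index 2 is an assumed duplicate
# of ID 12 inside the first 5-minute window.
df = pd.DataFrame({
    "ID": [12, 13, 12, 12, 12],
    "Date": ["2012-1-1"] * 5,
    "Time": ["00:01:00", "00:01:30", "00:04:00", "00:05:10", "00:10:00"],
})

# Concatenate the two string columns and parse the result as datetimes.
df["DatTim"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# The new column should have a datetime dtype, not object.
print(df.dtypes)
```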

Then, assuming that ID is an “ordinary” column (not the index),
proceed as follows:

  • Group by the DatTim column with a 5-minute frequency.
  • To each group, apply drop_duplicates with subset containing only the ID column.
  • Finally, drop the DatTim level that groupby adds to the index.

In Python, these steps are:

df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp.drop_duplicates(subset="ID"))\
    .reset_index(level=0, drop=True)

If you print(df2), you will get:

   ID      Date      Time              DatTim
0  12  2012-1-1  00:01:00 2012-01-01 00:01:00
1  13  2012-1-1  00:01:30 2012-01-01 00:01:30
3  12  2012-1-1  00:05:10 2012-01-01 00:05:10
4  12  2012-1-1  00:10:00 2012-01-01 00:10:00

To “clean up”, you can drop the DatTim column. Note that drop returns
a new DataFrame, so assign the result:

df2 = df2.drop('DatTim', axis=1)
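
As an alternative sketch that avoids groupby(...).apply altogether, the timestamps can be floored to 5-minute bins and duplicates dropped on the (ID, bin) pair. This is a swapped-in technique, not the method above. To my understanding it selects the same rows, because both approaches align their 5-minute windows to midnight by default. The sample rows are an assumption (the row at index 2 is hypothetical), and the name `window` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample data; the row at index 2 is an assumed duplicate.
df = pd.DataFrame({
    "ID": [12, 13, 12, 12, 12],
    "Date": ["2012-1-1"] * 5,
    "Time": ["00:01:00", "00:01:30", "00:04:00", "00:05:10", "00:10:00"],
})
df["DatTim"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# Start of the 5-minute window each row falls into.
window = df["DatTim"].dt.floor("5min")

# Mark rows whose (ID, window) pair has already appeared earlier.
is_dup = pd.DataFrame({"ID": df["ID"], "window": window}).duplicated(keep="first")

# Keep first occurrences only and drop the helper datetime column.
df2 = df[~is_dup].drop(columns="DatTim")
print(df2)
```

With the assumed rows, the surviving index labels are 0, 1, 3 and 4, matching the output printed above.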

Edit

If ID is the index, a slight change is required:

df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp[~grp.index.duplicated(keep='first')])\
    .reset_index(level=0, drop=True)

And then the printed df2 is:

        Date      Time              DatTim
ID                                        
12  2012-1-1  00:01:00 2012-01-01 00:01:00
13  2012-1-1  00:01:30 2012-01-01 00:01:30
12  2012-1-1  00:05:10 2012-01-01 00:05:10
12  2012-1-1  00:10:00 2012-01-01 00:10:00

Of course, you can drop the DatTim column in this case as well.
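
The same floor-based idea can be sketched for the case where ID is the index. Again, this is an alternative to the groupby version rather than the method above, and the sample rows (in particular the third one) are assumed for illustration. Here the index values and the 5-minute window are combined into a MultiIndex so that duplicated can be evaluated on the pair.

```python
import pandas as pd

# Hypothetical sample data with ID as the index; the third row is an
# assumed duplicate of ID 12 inside the first 5-minute window.
df = pd.DataFrame({
    "ID": [12, 13, 12, 12, 12],
    "Date": ["2012-1-1"] * 5,
    "Time": ["00:01:00", "00:01:30", "00:04:00", "00:05:10", "00:10:00"],
}).set_index("ID")
df["DatTim"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# Start of the 5-minute window each row falls into.
window = df["DatTim"].dt.floor("5min")

# Pair each index value with its window and flag repeated pairs.
keys = pd.MultiIndex.from_arrays([df.index, window])
is_dup = keys.duplicated(keep="first")

# Keep first occurrences only and drop the helper datetime column.
df2 = df[~is_dup].drop(columns="DatTim")
print(df2)
```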
