You can sort both DataFrames row-wise across columns c_x and c_y, so that each pair is the same regardless of order. For the movies, use DataFrame.pivot, count the non-missing values with DataFrame.count, and join the result to df1:
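import numpy as np
# sort each row's pair so (a, b) and (b, a) become the same key
df2[['c_x','c_y']] = np.sort(df2[['c_x','c_y']], axis=1)
# number the movies within each pair: 1, 2, ...
df2['g'] = df2.groupby(['c_x','c_y']).cumcount().add(1)
df2 = df2.pivot(index=['c_x','c_y'], columns="g", values="movie").add_prefix('movie')
# count the non-missing movies per pair
df2['number'] = df2.count(axis=1)
print(df2)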
g       movie1 movie2  number
c_x c_y
bob dan      c      f       2
    uni      a      f       2
kim kim      a    NaN       1
    lee      a      b       2
Then sort the pair columns in df1 the same way and join on them:
df1[['c_x','c_y']] = np.sort(df1[['c_x','c_y']], axis=1)
df = df1.join(df2, on=['c_x','c_y'])
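
For a self-contained run of the steps above, here is a minimal sketch. The sample df1 and df2 are made up for illustration (the question's own data is not reproduced here), and the names wide and result are only used in this example. As far as I know, passing a list to index in pivot needs pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this example.
df1 = pd.DataFrame({
    "c_x": ["dan", "bob", "kim", "lee", "amy"],
    "c_y": ["bob", "uni", "kim", "kim", "zed"],
})
df2 = pd.DataFrame({
    "c_x": ["bob", "dan", "uni", "bob", "kim", "lee", "kim"],
    "c_y": ["dan", "bob", "bob", "uni", "kim", "kim", "lee"],
    "movie": ["c", "f", "a", "f", "a", "a", "b"],
})

cols = ["c_x", "c_y"]

# Sort each row's pair so (dan, bob) and (bob, dan) share one key.
df2[cols] = np.sort(df2[cols].to_numpy(), axis=1)

# Number the movies within each pair, then spread them into columns.
df2["g"] = df2.groupby(cols).cumcount().add(1)
wide = df2.pivot(index=cols, columns="g", values="movie").add_prefix("movie")

# Count the non-missing movies per pair.
wide["number"] = wide.count(axis=1)

# Normalize df1 the same way and look up each pair in the wide table.
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
result = df1.join(wide, on=cols)
print(result)
```

Pairs from df1 that never appear in df2 (here amy/zed) get NaN in the movie columns and in number, since join does a left join by default.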