How to find the difference between 1st row and nth row of a dataframe based on a condition using Spark Windowing

Accepted Answer

Shown here is a PySpark solution.

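You can use conditional aggregation with max(when(...)) to get the difference between each row's row number and that of the 'PD' row for the same id. Then wrap that difference in a when(...) to null out the negative values, which belong to rows that occur after the 'PD' row. Note that max picks the last 'PD' row when an id has several; swap in min to measure against the first one.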

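# necessary imports
from pyspark.sql import Window
from pyspark.sql import functions as F

# number the rows of each id in service-date order
w1 = Window.partitionBy(df.id).orderBy(df.svc_dt)
df = df.withColumn('rnum', F.row_number().over(w1))
# row number of the 'PD' row within the id, minus the current row number
w2 = Window.partitionBy(df.id)
res = df.withColumn('diff_pd_rank', F.max(F.when(df.clm_typ == 'PD', df.rnum)).over(w2) - df.rnum)
# keep non-negative differences; rows after the 'PD' row become null
res = res.withColumn('days_to_next_pd_encounter', F.when(res.diff_pd_rank >= 0, res.diff_pd_rank))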
res.show()
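
To see what the two windows compute without starting a Spark session, here is a rough plain-Python emulation of the same steps. It is only a sketch: the sample rows, the 'RV' claim type, and the helper name rank_diff_to_pd are made up for illustration, and dates are assumed to be ISO strings so that they sort chronologically.

```python
from itertools import groupby
from operator import itemgetter


def rank_diff_to_pd(rows):
    """Emulate the window logic on (id, svc_dt, clm_typ) tuples (hypothetical helper)."""
    out = []
    # partitionBy(id).orderBy(svc_dt): sort by id, then by service date
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _id, group in groupby(ordered, key=itemgetter(0)):
        # row_number() within the partition, starting at 1
        numbered = list(enumerate(group, start=1))
        # max(when(clm_typ == 'PD', rnum)) over the whole partition;
        # None stands in for SQL null when the id has no 'PD' row
        pd_rnums = [rnum for rnum, row in numbered if row[2] == 'PD']
        pd_rnum = max(pd_rnums) if pd_rnums else None
        for rnum, row in numbered:
            diff = None if pd_rnum is None else pd_rnum - rnum
            # when(diff >= 0, diff): negative or null differences become None
            kept = diff if diff is not None and diff >= 0 else None
            out.append((*row, rnum, diff, kept))
    return out


# Made-up sample data: id 1 has one 'PD' row, id 2 has none
rows = [
    (1, '2020-01-01', 'RV'),
    (1, '2020-01-05', 'RV'),
    (1, '2020-01-09', 'PD'),
    (1, '2020-01-12', 'RV'),
    (2, '2020-02-01', 'RV'),
    (2, '2020-02-03', 'RV'),
]

result = rank_diff_to_pd(rows)
for r in result:
    print(r)
```

For id 1 the last column comes out as 2, 1, 0 and then None for the row after the 'PD' row; for id 2, which has no 'PD' row, both the difference and the last column are None on every row.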