[Solved] Groupby fill missing values in dataframe based on average of previous values available and next value available

November 6, 2022 by Kirat

Perhaps this is helpful –

Load the test data

df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+
      * |class|score|
      * +-----+-----+
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |A    |35   |
      * |A    |null |
      * |A    |null |
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |B    |78   |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |56   |
      * |B    |null |
      * +-----+-----+
      *
      * root
      * |-- class: string (nullable = true)
      * |-- score: integer (nullable = true)
      */

Impute Null values from score columns(check new_score column)


    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
      .withColumn("next", first("score", ignoreNulls = true).over(w2))
      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)

    /**
      * +-----+-----+---------+
      * |class|score|new_score|
      * +-----+-----+---------+
      * |A    |null |46.0     |
      * |A    |46   |46.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |35   |35.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |46   |46.0     |
      * |A    |null |46.0     |
      * |A    |null |46.0     |
      * |B    |78   |78.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |56   |56.0     |
      * |B    |null |56.0     |
      * +-----+-----+---------+
      */

8

solved Groupby fill missing values in dataframe based on average of previous values available and next value available