Perhaps this is helpful –
Load the test data
df2.show(false)
df2.printSchema()
/**
* +-----+-----+
* |class|score|
* +-----+-----+
* |A |null |
* |A |46 |
* |A |null |
* |A |null |
* |A |35 |
* |A |null |
* |A |null |
* |A |null |
* |A |46 |
* |A |null |
* |A |null |
* |B |78 |
* |B |null |
* |B |null |
* |B |null |
* |B |null |
* |B |null |
* |B |56 |
* |B |null |
* +-----+-----+
*
* root
* |-- class: string (nullable = true)
* |-- score: integer (nullable = true)
*/
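For reference, the answer does not show how df2 is built. A minimal sketch that rebuilds a DataFrame matching the output and schema above could look like the following (the session setup, app name, and the rows value name are assumptions; in spark-shell a SparkSession named spark already exists):

import org.apache.spark.sql.SparkSession

// Only needed outside spark-shell; app name and master are placeholders
val spark = SparkSession.builder().appName("impute-example").master("local[*]").getOrCreate()
import spark.implicits._

// Option[Int] yields a nullable integer column, matching the printed schema
val rows: Seq[(String, Option[Int])] = Seq(
  ("A", None), ("A", Some(46)), ("A", None), ("A", None), ("A", Some(35)), ("A", None),
  ("A", None), ("A", None), ("A", Some(46)), ("A", None), ("A", None),
  ("B", Some(78)), ("B", None), ("B", None), ("B", None), ("B", None), ("B", None),
  ("B", Some(56)), ("B", None)
)
val df2 = rows.toDF("class", "score")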
Impute null values in the score column (check the new_score column)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// frame from the start of the class partition up to the current row
val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
// frame from the current row to the end of the class partition
val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
// previous = last non-null score at or before the row; next = first non-null score at or after it
df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
  .withColumn("next", first("score", ignoreNulls = true).over(w2))
  .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
  .drop("next", "previous")
  .show(false)
/**
* +-----+-----+---------+
* |class|score|new_score|
* +-----+-----+---------+
* |A |null |46.0 |
* |A |46 |46.0 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |35 |35.0 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |46 |46.0 |
* |A |null |46.0 |
* |A |null |46.0 |
* |B |78 |78.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |56 |56.0 |
* |B |null |56.0 |
* +-----+-----+---------+
*/
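How it works: within each class, previous holds the most recent non-null score at or before the current row and next holds the nearest non-null score at or after it, so new_score is their average. Because both frames include the current row, rows that already have a score keep their original value (previous = next = score). The coalesce calls let boundary rows fall back to whichever side has a value, which is why the leading nulls in class A become 46.0 and the trailing nulls in class B become 56.0. Note that neither window specifies an orderBy, so the frames follow the DataFrame's existing row order within each class.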
This solves the problem: group by class and fill missing values in the DataFrame with the average of the previous available value and the next available value.