[Solved] Spark 2.3: subtract dataframes but preserve duplicate values (Scala)


Turns out it’s easier to do df1.except(df2) and then join the result back to df1. except performs a set difference and deduplicates its output, so the inner join restores every original occurrence of the surviving rows, duplicates included.

Full code:

import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Set difference: distinct rows of df1 that do not appear in df2.
    val except = df1.except(df2)

    // Build a null-safe equality condition (<=>) across every column,
    // so rows containing nulls still match their counterpart in `except`.
    val columns = df1.columns
    val headExpr: Column = df1(columns.head) <=> except(columns.head)
    val joinExpression = columns.tail.foldLeft(headExpr) { (expr, name) =>
        expr && (df1(name) <=> except(name))
    }

    // Joining back to df1 restores every original occurrence
    // (including duplicates) of the rows that survived the except.
    val join = df1.join(except, joinExpression, "inner")

    join.select(df1("*"))
}
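
Quick sanity check (a minimal sketch; the sample data is mine, not from the original code, and assumes a spark session with implicits in scope):

import spark.implicits._

val df1 = Seq((1, "a"), (1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val df2 = Seq((2, "b")).toDF("id", "value")

exceptAllCustom(df1, df2).show()
// +---+-----+
// | id|value|
// +---+-----+
// |  1|    a|   <- duplicate preserved
// |  1|    a|
// |  3|    c|
// +---+-----+
// (row order may differ)

Note this is not a true multiset subtraction: if a row appears in df2, every copy of it is dropped from df1, not just one. For what it’s worth, Spark 2.4 later added a built-in Dataset.exceptAll with multiset semantics, but on 2.3 this workaround does the job.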
