[Solved] Given the file path, find the file extension using Scala

You could achieve this as follows: import java.nio.file.Paths val path = "/home/gmc/exists.csv" val fileName = Paths.get(path).getFileName // Convert the path string to a Path object and get the "base name" from that path. val extension = fileName.toString.split("\\.").last // Split the "base name" on a . and take the last element, which is the extension. … Read more
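For reference, a minimal self-contained sketch of the approach in the excerpt; the object name and the guard for dot-less names are additions of this sketch, and the path is just an example:

```scala
import java.nio.file.Paths

object ExtensionDemo extends App {
  val path = "/home/gmc/exists.csv"

  // Convert the path string to a Path and keep only the base name ("exists.csv").
  val fileName = Paths.get(path).getFileName.toString

  // Split on "." and take the last piece; fall back to "" when the name
  // contains no dot at all, otherwise split would return the whole name.
  val extension =
    if (fileName.contains(".")) fileName.split("\\.").last else ""

  println(extension) // prints "csv"
}
```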

[Solved] Get the highest price with the smaller ID when two IDs have the same highest price in Scala

Try this. scala> val df = Seq((4, 30),(2,50),(3,10),(5,30),(1,50),(6,25)).toDF("id","price") df: org.apache.spark.sql.DataFrame = [id: int, price: int] scala> df.show +---+-----+ | id|price| +---+-----+ | 4| 30| | 2| 50| | 3| 10| | 5| 30| | 1| 50| | 6| 25| +---+-----+ scala> df.sort(desc("price"), asc("id")).show +---+-----+ | id|price| +---+-----+ | 1| 50| | 2| 50| | 4| … Read more
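The same idea as a standalone sketch rather than a spark-shell session; the SparkSession setup, object name, and the final limit(1) step (to keep only the winning row) are assumptions added here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc, desc}

object HighestPriceDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("highest-price").getOrCreate()
  import spark.implicits._

  val df = Seq((4, 30), (2, 50), (3, 10), (5, 30), (1, 50), (6, 25)).toDF("id", "price")

  // Sort by price descending, breaking ties with the smaller id first,
  // then keep only the top row: id 1 with price 50.
  df.sort(desc("price"), asc("id")).limit(1).show()
  // +---+-----+
  // | id|price|
  // +---+-----+
  // |  1|   50|
  // +---+-----+
}
```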

[Solved] Finding the average value in Spark Scala gives a blank result

I would suggest you use the sqlContext API with the schema you have defined: val df = sqlContext.read .format("com.databricks.spark.csv") .option("delimiter", "\\t") .schema(schema) .load("path to your text file") where the schema is val schema = StructType(Seq( StructField("ID", IntegerType, true), StructField("col1", DoubleType, true), StructField("col2", IntegerType, true), StructField("col3", DoubleType, true), StructField("col4", DoubleType, true), StructField("col5", DoubleType, true), StructField("col6", DoubleType, … Read more
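A sketch of the same approach on Spark 2.x and later, where the built-in csv source replaces the com.databricks.spark.csv package; the file path is a placeholder, the schema is shortened to three example columns, and the closing avg aggregation is an addition to show why explicit numeric types matter:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types._

object AverageDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("avg-demo").getOrCreate()

  // Declare column types up front so numeric columns are not read as strings.
  val schema = StructType(Seq(
    StructField("ID",   IntegerType, true),
    StructField("col1", DoubleType,  true),
    StructField("col2", IntegerType, true)
  ))

  val df = spark.read
    .option("delimiter", "\t")   // tab-separated input
    .schema(schema)              // use the explicit schema instead of inferring
    .csv("path/to/your/text/file")

  // With proper numeric types, the average comes back as a number
  // instead of a blank/null result.
  df.agg(avg("col1")).show()
}
```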

[Solved] Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Turns out it’s easier to do df1.except(df2) and then join the results with df1 to get all the duplicates. Full code: def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = { val except = df1.except(df2) val columns = df1.columns val colExpr: Column = df1(columns.head) <=> except(columns.head) val joinExpression = columns.tail.foldLeft(colExpr) { (colExpr, p) => colExpr && df1(p) … Read more
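A hedged sketch reconstructing the idea described above, since the full code is truncated here: take df1.except(df2), then join it back to df1 with a null-safe equality over every column so duplicate rows in df1 survive. The choice of a left-semi join is an assumption of this sketch, not necessarily the answer's exact code:

```scala
import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
  val except  = df1.except(df2)   // distinct rows of df1 that do not appear in df2
  val columns = df1.columns

  // Fold a null-safe (<=>) equality over all columns into one join condition.
  val joinExpression: Column =
    columns.tail.foldLeft(df1(columns.head) <=> except(columns.head)) {
      (expr, c) => expr && (df1(c) <=> except(c))
    }

  // A left-semi join keeps only df1's columns and preserves its duplicate rows.
  df1.join(except, joinExpression, "leftsemi")
}
```

On Spark 2.4 and later, the built-in df1.exceptAll(df2) does this directly, so the workaround is only needed on 2.3.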