[Solved] How to add columns using Scala

val grouped = df.groupBy($"id").count
val res = df.join(grouped, Seq("id")).withColumnRenamed("count", "repeatedcount")

groupBy gives the count for each id; joining that result back to the original dataframe attaches the count to every row with the same id.
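
A self-contained sketch of the same groupBy-and-join approach, with a tiny made-up dataframe standing in for df and a local master used only for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repeated-count").master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical data: each id appears one or more times
val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "value")

// count rows per id, then join the counts back onto the original rows
val grouped = df.groupBy($"id").count()
val res = df.join(grouped, Seq("id")).withColumnRenamed("count", "repeatedcount")

// every row of df now carries the number of times its id occurs
res.show()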

[Solved] Finding average value in spark scala gives blank result

I would suggest using the sqlContext API with the schema you have defined:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\\t")
  .schema(schema)
  .load("path to your text file")

where the schema is

val schema = StructType(Seq(
  StructField("ID", IntegerType, true),
  StructField("col1", DoubleType, true),
  StructField("col2", IntegerType, true),
  StructField("col3", DoubleType, true),
  StructField("col4", DoubleType, true),
  StructField("col5", DoubleType, true),
  StructField("col6", DoubleType, … Read more
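
On Spark 2.x and later the built-in CSV reader on SparkSession covers the same ground; a minimal sketch with a shortened schema, a placeholder path, and a local master, ending with the avg() call the question is after:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("avg-example").master("local[*]").getOrCreate()

// an explicit schema keeps the numeric columns as DoubleType/IntegerType,
// so avg() returns real averages instead of blanks
val schema = StructType(Seq(
  StructField("ID", IntegerType, true),
  StructField("col1", DoubleType, true),
  StructField("col2", IntegerType, true)
))

val df = spark.read
  .option("delimiter", "\t")       // tab-separated input
  .schema(schema)
  .csv("path/to/your/text/file")   // placeholder path

df.select(avg("col1")).show()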

[Solved] How to save bulk document in Cloudant using Java or Spark – Java?

Here is code that performs a bulk upload:

CloudantClient client = ClientBuilder.account("account")
        .username("username")
        .password("password")
        .disableSSLAuthentication()
        .build();
Database db = client.database("databaseName", true);

List<JSONObject> arrayJson = new ArrayList<JSONObject>();
arrayJson.add(new JSONObject("{data:hello}"));
arrayJson.add(new JSONObject("{data:hello1}"));
arrayJson.add(new JSONObject("{data:hello2}"));
db.bulk(arrayJson);

[Solved] Infinite loop when replacing concrete value by parameter name

What I think is happening here is that Spark serializes functions to send them over the wire, and because your function (the one you're passing to map) calls the accessor param_user_minimal_rating_count of object odbscan, the entire object odbscan needs to be serialized and sent along with it. Deserializing and then using that deserialized … Read more
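
Since the excerpt is cut off, here is a sketch of the usual workaround for this kind of closure-serialization problem: copy the field into a local val so the closure captures only that value rather than the enclosing object. The names odbscan and param_user_minimal_rating_count are taken from the question; everything else is hypothetical.

import org.apache.spark.rdd.RDD

// stand-in for the object from the question; only the field name is taken from it
object odbscan {
  val param_user_minimal_rating_count = 2
}

// copy the needed value into a local val before the transformation,
// so the closure captures that value instead of the whole odbscan object
def filterUsers(userCounts: RDD[(String, Int)]): RDD[(String, Int)] = {
  val minCount = odbscan.param_user_minimal_rating_count // local copy, captured by value
  userCounts.filter { case (_, c) => c >= minCount }     // closure no longer references odbscan
}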

[Solved] Creation of JavaRDD object has failed

You just have to instantiate JavaSparkContext:

SparkConf conf = new SparkConf();
conf.setAppName("YOUR APP");
// other config, e.g. conf.setMaster("YOUR MASTER");
JavaSparkContext ctx = new JavaSparkContext(conf);
// and then
JavaRDD<Row> testRDD = ctx.parallelize(list);

[Solved] Filter words from list python

This is because your listeexclure is just a string. If you want to search for a string inside another string, you can do the following. Let's assume you have a list like:

lst = ['a', 'ab', 'abc', 'bac']
filter(lambda k: 'ab' in k, lst)
# Result will be ['ab', 'abc']

So you can apply the same method in … Read more

[Solved] How to find the difference between 1st row and nth row of a dataframe based on a condition using Spark Windowing

Shown here is a PySpark solution. You can use conditional aggregation with max(when...) to get the necessary difference of ranks with the first 'PD' row. After getting the difference, use a when... to null out rows with negative ranks, as they all occur after the first 'PD' row.

# necessary imports
w1 = Window.partitionBy(df.id).orderBy(df.svc_dt)
df … Read more
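
The accepted answer is PySpark; the same window-plus-conditional-aggregation idea sketched in Scala. The columns id, svc_dt, status and the 'PD' value are taken from the question; the data and the exact form of the difference are illustrative and may differ in detail from the accepted answer.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("first-pd-diff").master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical data: (id, svc_dt, status)
val df = Seq(
  (1, "2020-01-01", "XX"),
  (1, "2020-01-05", "PD"),
  (1, "2020-01-09", "XX"),
  (2, "2020-02-01", "PD")
).toDF("id", "svc_dt", "status")

val byId     = Window.partitionBy($"id")
val byIdDate = Window.partitionBy($"id").orderBy($"svc_dt")

val result = df
  .withColumn("rnk", row_number().over(byIdDate))
  // rank of the first 'PD' row within each id (conditional aggregation over the partition)
  .withColumn("first_pd_rnk", min(when($"status" === "PD", $"rnk")).over(byId))
  // rank difference relative to the first 'PD' row; earlier rows stay null
  .withColumn("diff_from_first_pd", when($"rnk" >= $"first_pd_rnk", $"rnk" - $"first_pd_rnk"))

result.show()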

[Solved] How to get the specified output without combineByKey and aggregateByKey in spark RDD

Here is a standard approach. Point to note: you need to be working with an RDD; I think that is the bottleneck. Here you go:

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val sample = keysWithValuesList.map(_.split("=")).map(p => (p(0), p(1)))
val sample2 = sc.parallelize(sample.map(x => (x._1, 1)))
val sample3 = sample2.reduceByKey(_ + _)
sample3.collect()
val sample4 = sc.parallelize(sample.map(x => (x._1, … Read more
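
The excerpt stops mid-statement; a complete, self-contained version of the same reduceByKey/groupByKey idea (not necessarily identical to the rest of the accepted answer) might look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-combineByKey").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val pairs = sc.parallelize(keysWithValuesList.map(_.split("=")).map(p => (p(0), p(1))))

// count of values per key, using reduceByKey instead of combineByKey/aggregateByKey
val countsPerKey = pairs.mapValues(_ => 1).reduceByKey(_ + _)

// distinct values per key, again without combineByKey/aggregateByKey
val valuesPerKey = pairs.distinct().groupByKey().mapValues(_.toList)

countsPerKey.collect().foreach(println)   // e.g. (foo,5), (bar,3)
valuesPerKey.collect().foreach(println)   // e.g. (foo,List(A, B)), (bar,List(C, D))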

[Solved] Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Turns out it's easier to do df1.except(df2) and then join the results with df1 to get all the duplicates. Full code:

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
  val except = df1.except(df2)
  val columns = df1.columns
  val colExpr: Column = df1(columns.head) <=> except(columns.head)
  val joinExpression = columns.tail.foldLeft(colExpr) { (colExpr, p) =>
    colExpr && df1(p) … Read more
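
The excerpt is truncated inside the foldLeft. One way to complete the function, sketched here rather than quoted from the accepted answer, is to finish the null-safe join expression over every column and then left-semi join df1 against the except result, which keeps the duplicate rows of df1:

import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
  val except = df1.except(df2)
  val columns = df1.columns
  // null-safe equality across all columns of df1 and the except() result
  val joinExpression: Column = columns.tail.foldLeft(df1(columns.head) <=> except(columns.head)) {
    (expr, c) => expr && (df1(c) <=> except(c))
  }
  // left_semi keeps every matching row of df1, so duplicates are preserved
  df1.join(except, joinExpression, "left_semi")
}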

[Solved] Removing Characters from python Output

I think your only problem is that you have to reformat your result before saving it to the file, i.e. something like:

result.map(lambda x: x[0] + ',' + str(x[1])).saveAsTextFile("hdfs://localhost:9000/Test1")

[Solved] toDF is not working in spark scala ide, but works perfectly in spark-shell [duplicate]

To use toDF(), you must enable implicit conversions:

import spark.implicits._

In spark-shell this import is done by default, which is why the code works there. The :imports command can be used to see which imports are already present in your shell:

scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, … Read more
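
Outside the shell, the same import becomes available once a SparkSession has been created, because the implicits are members of that instance. A minimal sketch of a standalone application (object name, data, and the local master are just for illustration):

import org.apache.spark.sql.SparkSession

object ToDfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("toDF-example")
      .master("local[*]")   // for a local run; set the master via spark-submit on a cluster
      .getOrCreate()

    // the implicits live on the SparkSession instance,
    // so the import must come after the session is created
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.show()

    spark.stop()
  }
}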