[Solved] Working with Dates in Spark

Accepted Answer

Start by creating a quick RDD in the format of the CSV file you describe:

val list = sc.parallelize(List(("1","Timothy","04/02/2015","100","TV"), ("1","Timothy","04/03/2015","10","Book"), ("1","Timothy","04/03/2015","20","Book"), ("1","Timothy","04/05/2015","10","Book"),("2","Ursula","04/02/2015","100","TV")))
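If the rows come from an actual file rather than a hand-built list, one way to get the same 5-field tuples might look like the sketch below. The helper name `parseLine` and the column order (id, name, date, amount, category) are assumptions taken from the sample tuples above, and it assumes there are no quoted fields or embedded commas.

```scala
object SalesCsv {
  // Hypothetical helper: splits one comma-separated line into the
  // (id, name, date, amount, category) tuple shape used in the sample RDD.
  // Assumes plain CSV with no quoting and no commas inside fields.
  def parseLine(line: String): Option[(String, String, String, String, String)] =
    line.split(",", -1).map(_.trim) match {
      case Array(id, name, date, amount, category) =>
        Some((id, name, date, amount, category))
      case _ =>
        None // wrong number of fields: treat the row as malformed and skip it
    }
}
```

With Spark this could be plugged in as sc.textFile("sales.csv").flatMap(line => SalesCsv.parseLine(line)), where "sales.csv" is a placeholder path.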

And then running

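import java.time.LocalDate
import java.time.format.DateTimeFormatter

// LocalDate.of takes (year, month, dayOfMonth): the range is 2015-01-04 to 2015-04-05.
val startDate = LocalDate.of(2015, 1, 4)
val endDate = LocalDate.of(2015, 4, 5)

val result = list
    // Keep rows dated strictly between startDate and endDate (both bounds are exclusive).
    .filter { case (_, _, date, _, _) =>
        val localDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("MM/dd/yyyy"))
        localDate.isAfter(startDate) && localDate.isBefore(endDate)
    }
    // Key by (id, category) and carry (amount, count) so the average can be derived later.
    .map { case (id, _, _, amount, category) => ((id, category), (amount.toDouble, 1)) }
    // Sum the amounts and the counts per (id, category).
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
    // Re-key by id and compute the average sale per category.
    .map { case ((id, category), (total, sales)) => (id, List((category, total, total / sales))) }
    // Concatenate the per-category entries for each salesperson.
    .reduceByKey(_ ++ _)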

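and printing it with result.collect().foreach(println) will give you (the order of rows and list entries may vary)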

(1,List((Book,30.0,15.0), (TV,100.0,100.0)))
(2,List((TV,100.0,100.0)))

in the format of (SalesPersonId, [(ProductCategory, TotalSaleAmount, AvgSaleAmount)]). Is that what you are looking for?
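Note that the 04/05/2015 sale is missing from the Book total because isAfter and isBefore are both exclusive, so a sale dated exactly on endDate is dropped. A minimal sketch of the difference, in plain Scala without Spark; the object and method names here are made up for illustration:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateRangeCheck {
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

  // Exclusive on both ends, the same test as in the filter above.
  def inRangeExclusive(date: String, start: LocalDate, end: LocalDate): Boolean = {
    val d = LocalDate.parse(date, fmt)
    d.isAfter(start) && d.isBefore(end)
  }

  // Inclusive on both ends: negate the opposite comparison instead.
  def inRangeInclusive(date: String, start: LocalDate, end: LocalDate): Boolean = {
    val d = LocalDate.parse(date, fmt)
    !d.isBefore(start) && !d.isAfter(end)
  }

  def main(args: Array[String]): Unit = {
    val start = LocalDate.of(2015, 1, 4)
    val end = LocalDate.of(2015, 4, 5)
    println(inRangeExclusive("04/05/2015", start, end)) // false: the end date itself is excluded
    println(inRangeInclusive("04/05/2015", start, end)) // true: the end date is kept
  }
}
```

If the boundary dates should count, swapping the inclusive condition into the filter is enough; the rest of the pipeline stays the same.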