Spark Training Público

Spark Training

Nilesh Patel
Curso por Nilesh Patel, atualizado more than 1 year ago Colaboradores

Descrição

Spark Training

Informações do módulo

Sem etiquetas
Setup in windows :  Ref links :  https://stackoverflow.com/questions/37305001/winutils-spark-windows-installation https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html   set HADOOP_HOME=C:\spark\winutils set PATH=%HADOOP_HOME%\bin;%PATH% Run "spark-shell" in cmd window     Create .bat file :   @Echo off rem cmd /k "cd C:\spark\spark\bin" cmd /k "C:\spark\spark\bin\spark-shell"
Mostrar menos
Sem etiquetas
Word Count:   Example 1 (RDD) val txtfile=sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/wordcount.txt") val txtflatmap= txtfile.flatMap(x=>x.split(" ")) txtflatmap.first val txtmap=txtflatmap.map(x=>(x,1)) txtmap.first val output=txtmap.reduceByKey(_+_) output.foreach(println)   Single line : val output = txtfile.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey(_+_) output.foreach(println)     Example 2  (RDD) val txtfile=sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/wordcount.txt") val txtflatmap= txtfile.flatMap(x=>x.split(" ")) txtflatmap.first val txtmap=txtflatmap.map(x=>(x,1)) txtmap.first val output = txtmap.reduceByKey((x,txtmap)=>(x+txtmap)).foreach(println)   Example 3 (for Array[(string ,Int) ] type data  val str = "Hell saying hello is good, keep say hello"  val x =str.split(" ").map(x=> (x, 1))  val q = sc.parallelize(x)  q.reduceByKey(_+_).foreach(println)
Mostrar menos
Sem etiquetas
Sem etiquetas
Filter data //create person.txt Name, Age Vinayak, 35 Nilesh, 37 Raju, 30 Karthik, 28 Shreshta,1 Siddhish, 2     val personfile = sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/person.txt") val first = personfile.first val person = personfile.filter(x=>x!=first) val personmap = person.map(x=>{   val splitted = x.split(",")   val name = splitted(0).trim   val age = splitted(1).trim.toInt   (name,age) }) personmap.foreach(println) // ========= Method 1 : Scala val output = personmap.filter(x=>x._2>30) output.foreach(println) // ========= Method 2 : DataFrame //option 1 : val persondf=personmap.toDF persondf.show val output=persondf.filter($"_2">30) output.show //option 2 : val persondf=personmap.toDF("name","age") persondf.show val output=persondf.filter("age>30") output.show // ========= Method 3 : Spark sql //(didn’t worked in wondows) persondf.registerTempTable("person") val sqlcontext = org.apache.spark.sql.SQLContext // ========= Method 4 : sqlContext val persondf=personmap.toDF("name","age") persondf.registerTempTable("Person") val sqlcontext = persondf.sqlContext val output = sqlcontext.sql("select * from person where age>30").show //OR val persondf=personmap.toDF persondf.registerTempTable("Person2") val sqlcontext = persondf.sqlContext val output = sqlcontext.sql("select * from person2 where _2>30").show
Mostrar menos
Sem etiquetas
// JOINS =====================================   Using RDD //emp table   id,name,age 1,Vinayak, 35 2,Nilesh, 37 3,Raju, 30 4,Karthik, 28 5,Shreshta,1 6,Siddhish, 2   val emp = sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/emp.txt") val first = emp.first val empfile = emp.filter(x=>x!=first) val empmap=empfile.map(x=>{   val splitted=x.split(",")   val id=splitted(0).trim.toInt   val name=splitted(1).trim   val age=splitted(2).trim.toInt   (id,(name,age)) }) empmap.foreach(println) val output = empmap.filter(x=>x._2._2>30)   //payroll table id,dept,sal 1,dev, 3500 2,dev, 3700 3,qa, 3000 4,qa, 2800 5,prod,1000 6,prod, 2000   val payroll = sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/payroll.txt") val first = payroll.first val payrollfile = payroll.filter(x=>x!=first) val payrollmap=payrollfile.map(x=>{   val splitted=x.split(",")   val id=splitted(0).trim.toInt   val dept=splitted(1).trim   val sal=splitted(2).trim.toInt   (id,(dept,sal)) }) payrollmap.foreach(println) val output = payrollmap.filter(x=>x._2._2>2000) (1,(dev,3500)) (2,(dev,3700)) (4,(qa,2800)) (3,(qa,3000)) val joined=empmap.join(payrollmap) joined.foreach(println) (1,((Vinayak,35),(dev,3500))) (3,((Raju,30),(qa,3000))) (5,((Shreshta,1),(prod,1000))) (4,((Karthik,28),(qa,2800))) (6,((Siddhish,2),(prod,2000))) (2,((Nilesh,37),(dev,3700))) age > 30 and sal > 2000 val output=joined.filter(x=>x._2._1._2>30 && x._2._2._2>2000)   Using Dataframe    id,name,age 1,Vinayak, 35 2,Nilesh, 37 3,Raju, 30 4,Karthik, 28 5,Shreshta,1 6,Siddhish, 2   val emp = sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/emp.txt") val first = emp.first val empfile = emp.filter(x=>x!=first) val empmap=empfile.map(x=>{   val splitted=x.split(",")   val id=splitted(0).trim.toInt   val name=splitted(1).trim   val age=splitted(2).trim.toInt   (id,name,age) }) empmap.foreach(println) val empdf = empmap.toDF("id","name","age")   //payroll table id,dept,sal 1,dev, 3500 2,dev, 3700 3,qa, 3000 4,qa, 2800 5,prod,1000 6,prod, 2000 val payroll = sc.textFile("C:/Users/Redirection/pateln7/Downloads/SPARK/payroll.txt") val first = payroll.first val payrollfile = payroll.filter(x=>x!=first) val payrollmap=payrollfile.map(x=>{   val splitted=x.split(",")   val id=splitted(0).trim.toInt   val dept=splitted(1).trim   val sal=splitted(2).trim.toInt   (id,dept,sal) }) payrollmap.foreach(println) val payrolldf = payrollmap.toDF("id","dept","sal") scala> val dfjoined = empdf.join(payrolldf, empdf("id")===payrolldf("id")) dfjoined: org.apache.spark.sql.DataFrame = [id: int, name: string ... 4 more fields] scala> dfjoined.show +---+--------+---+---+----+----+ | id|    name|age| id|dept| sal| +---+--------+---+---+----+----+ |  1| Vinayak| 35|  1| dev|3500| |  6|Siddhish|  2|  6|prod|2000| |  3|    Raju| 30|  3|  qa|3000| |  5|Shreshta|  1|  5|prod|1000| |  4| Karthik| 28|  4|  qa|2800| |  2|  Nilesh| 37|  2| dev|3700| +---+--------+---+---+----+----+   Using sqlcontext    val dfjoined = empdf.join(payrolldf, empdf("id")===payrolldf("id"))  dfjoined.registerTempTable("joinedemp")  val sqlContext = dfjoined.sqlContext  val output = sqlContext.sql("select * from joinedemp where age>30 and sal> 3500").show
Mostrar menos
Sem etiquetas