xianny / titanic-scala


coursera dataframes and datasets outline #1

Open danlaudk opened 7 years ago

danlaudk commented 7 years ago

What might be useful: http://stackoverflow.com/questions/41427191/dataframe-into-dense-vector-spark

It uses concepts from the Coursera videos below.

Notes

Instantiate a SparkSession via SparkSession.builder() etc., as shown here: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html. If using any of the SQL types, import org.apache.spark.sql.types._
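A minimal sketch of that builder pattern (the app name and local master are placeholders, not from this repo):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    // Build (or reuse) a session; appName/master are illustrative.
    val spark = SparkSession
      .builder()
      .appName("titanic-scala")
      .master("local[*]") // local mode for experimenting; omit on a cluster
      .getOrCreate()

    import spark.implicits._ // enables the $"col" syntax and Dataset encoders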

Two ways to refer to columns in SQL:

1. With the `$` syntax (requires the implicits import):

        import spark.implicits._
        df.filter($"age" > 18)

2. By indexing the DataFrame itself:

        df.filter(df("age") > 18)

Agg functions return a column named after the function. For example, df.groupBy($"somecol", $"anothercol").agg(count($"authorID"))

returns a new DataFrame with the two groupBy columns plus a third column, count(authorID).
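The same aggregation, sketched end to end (somecol, anothercol, authorID are the column names from the note above; the rename via .as is optional):

    import org.apache.spark.sql.functions.count

    // df is assumed to have columns somecol, anothercol, authorID
    val counts = df
      .groupBy($"somecol", $"anothercol")
      .agg(count($"authorID")) // result column is named "count(authorID)"

    // give the aggregate column a friendlier name if the default is awkward:
    val renamed = df
      .groupBy($"somecol", $"anothercol")
      .agg(count($"authorID").as("numAuthors"))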

@xianny about 80-90min of video

To Watch

Recommended to start (8 min): https://www.coursera.org/learn/scala-spark-big-data/lecture/MORSy/cluster-topology-matters

Intro (first 4:30): https://www.coursera.org/learn/scala-spark-big-data/lecture/NlNqx/spark-sql

Shuffling (first 9.5 min): https://www.coursera.org/learn/scala-spark-big-data/lecture/bT1YR/shuffling-what-it-is-and-why-its-important

Partitioning (minutes 7:30–11): https://www.coursera.org/learn/scala-spark-big-data/lecture/Vkhm0/partitioning

Optimizing with partitioners (all 11 min): https://www.coursera.org/learn/scala-spark-big-data/lecture/LQT67/optimizing-with-partitioners

DataFrames 2, optimizations (minutes 20–30): https://www.coursera.org/learn/scala-spark-big-data/lecture/fwdAz/dataframes-2

Datasets (probably all 40 min): https://www.coursera.org/learn/scala-spark-big-data/lecture/yrfPh/datasets

An example of advanced manipulations, done via each of SQL, structs, and Datasets: http://stackoverflow.com/questions/33878370/spark-dataframe-select-the-first-row-of-each-group?rq=1
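One common way to do that kind of first-row-per-group selection is a window function; here is a sketch with hypothetical column names (category and value are illustrative, not from this repo):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Assumed schema: one row per (category, value); keep the top row per category.
    val w = Window.partitionBy($"category").orderBy($"value".desc)

    val firstPerGroup = df
      .withColumn("rn", row_number().over(w)) // rn = 1 for the highest value in its category
      .filter($"rn" === 1)
      .drop("rn")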

xianny commented 7 years ago

Bah, it's all paywalled! Thanks for the links though; I can look those specific things up.


xianny commented 7 years ago

Ah, just saw your Coursera login - thanks!!