
Spark

[Appendix] scala-spark-big-data with Coursera

Scala

 - Basics of Functional Programming

 - Parallelism


Spark

 - Not a machine learning or data science course!

 - distributed data parallelism in Spark

 - familiar functional abstractions, like functional lists, over large clusters.

 - Context: analyzing large data sets.



Why Scala? Why Spark?

Normally: R / Python / MATLAB, etc.

But!

What if your data set gets too large to fit into memory?

And there is also a massive shift in industry toward data-oriented decision making!

By using Scala

It's easier to scale your small problem to a large one with Spark, whose API is almost 1-to-1 with Scala's collections.
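A minimal sketch of that "almost 1-to-1" claim: the same pipeline written against Scala's standard collections, with a hypothetical Spark equivalent in comments (the `sc` SparkContext and the exact data are assumptions for illustration).

```scala
// The pipeline below uses only Scala's standard collections.
val words = List("big", "data", "with", "scala", "and", "spark")

val total = words
  .map(_.length)   // transform each element
  .filter(_ > 3)   // keep only the longer words
  .sum             // combine into a single result

// Hypothetical Spark equivalent (assumes a SparkContext named `sc`):
//   sc.parallelize(words).map(_.length).filter(_ > 3).sum()
//   // transformations are lazy and distributed; sum() is an action

println(total)  // 4 + 4 + 5 + 5 = 18
```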

Spark is...

More expressive. More composable operations are possible than in MapReduce.

Performant. Not only in running time, but also in developer productivity. Interactive!

Also, good for data science.
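As a sketch of the expressiveness claim, here is word count written as one chain of composable operations; in classic Hadoop MapReduce the same logic needs a Mapper class, a Reducer class, and job configuration. The input lines are made up for illustration.

```scala
// Word count as a single chain of collection operations.
val lines = List("spark is fast", "scala is fun", "spark is expressive")

val counts: Map[String, Int] = lines
  .flatMap(_.split(" "))                 // lines -> words
  .groupBy(identity)                     // word  -> all its occurrences
  .map { case (w, ws) => (w, ws.size) }  // word  -> count

// The Spark RDD version composes the same way:
//   sc.parallelize(lines).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

println(counts("spark"))  // 2
```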




And as for why you should learn this, see the link below... haha

Subtitle: a typical Coursera-style Spark advertisement...

http://stackoverflow.com/insights/survey/2016#technology-top-paying-tech




Latency

Data-Parallel Programming

> Data parallelism in a distributed setting.
> Distributed collections abstraction from Apache Spark as an implementation of this paradigm.
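A toy illustration of data-parallel evaluation: split the data into partitions, apply the same operation to each partition independently, then combine the partial results. Spark does this across worker nodes; here the "nodes" are just sequential folds over slices.

```scala
val data = (1 to 100).toList

val partitions = data.grouped(25).toList   // pretend: 4 worker nodes
val partials   = partitions.map(_.sum)     // each node reduces its own slice
val total      = partials.sum              // the driver combines the partials

assert(total == data.sum)                  // same answer as a sequential sum
println(total)  // 5050
```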

Important concern...

> Partial Failure : crash failures of a subset of the machines involved in a distributed computation.
> Latency: certain operations have a much higher latency than other operations due to network communication.

> Latency cannot be masked completely! It will be an important aspect that also impacts the programming model.



(Figure: latency comparison.) Loading from memory is faster than loading from SSD, though that's not the key point for Spark; what the figure explains is that the most important factor for latency in Spark is network I/O.
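For a sense of scale, here are the widely quoted rough orders of magnitude (approximate and hardware-dependent; the exact values are assumptions here, not from the course figure):

```scala
val mainMemoryRefSec = 100e-9   // ~100 ns
val ssdRandomReadSec = 150e-6   // ~150 us  (~1,000x memory)
val diskSeekSec      = 10e-3    // ~10 ms   (~100,000x memory)
val dcRoundTripSec   = 500e-6   // ~0.5 ms network round trip, same datacenter
val wanRoundTripSec  = 150e-3   // ~150 ms round trip across continents

// Disk and especially the network dominate memory by several orders of
// magnitude, which is why Spark keeps data in memory and minimizes shuffles.
println(diskSeekSec / mainMemoryRefSec)  // ~1e5
```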


Spark vs Hadoop MR

MR shuffles its data and writes intermediate data to disk.
Remember!
Reading/writing to disk is ~100x slower than in-memory, but Spark doesn't do that:
Spark keeps all data immutable and in-memory.
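A sketch (not real Spark code) of why in-memory reuse matters for iterative jobs: "disk" reads are simulated by counting how often an expensive source is re-evaluated, and caching stands in for what `rdd.cache()` does in Spark.

```scala
var diskReads = 0
def readFromDisk(): List[Int] = { diskReads += 1; (1 to 1000).toList }

// Hadoop-MR style: every iteration goes back to stable storage.
for (_ <- 1 to 3) readFromDisk().sum
assert(diskReads == 3)

// Spark style: materialize once, then iterate in memory.
val cached = readFromDisk()   // analogous to rdd.cache()
for (_ <- 1 to 3) cached.sum
assert(diskReads == 4)        // only one extra read for all three iterations

println(diskReads)  // 4
```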



Coursera course link

https://www.coursera.org/learn/scala-spark-big-data

Topics covered in the course

 - The data-parallel paradigm using Spark
 - Spark's programming model
 - Distributing computation
 - How to improve performance: how to avoid recomputation and shuffles in Spark
 - Relational operations with DataFrames and Datasets