Scala
- Basics of Functional Programming
- Parallelism
Spark
- Not a machine learning or data science course!
- distributed data parallelism in Spark
- familiar functional abstractions, like functional lists, applied over large clusters.
- Context: analyzing large data sets.
Why Scala? Why Spark?
Normally: R / Python / MATLAB, etc.
But!
What if your data set gets too large to fit into memory?
There is also a massive shift in industry toward data-oriented decision making!
By using Scala,
it's easier to scale a small problem up to a large one with Spark, whose API maps almost 1-to-1 onto Scala's collections.
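A minimal sketch of that parity, assuming a local Spark setup; the object name, app name, and tiny input list are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectionsVsRdd {
  def main(args: Array[String]): Unit = {
    // Local mode just for illustration; on a real cluster the master URL differs.
    val sc = new SparkContext(
      new SparkConf().setAppName("collections-vs-rdd").setMaster("local[*]"))

    val nums = List(1, 2, 3, 4, 5)

    // Plain Scala collection: runs on a single JVM.
    val localSum = nums.map(_ * 2).filter(_ > 4).sum

    // Spark RDD: the same operators, but the data and the work are
    // partitioned across the nodes of a cluster.
    val distributedSum = sc.parallelize(nums).map(_ * 2).filter(_ > 4).sum()

    println(s"local = $localSum, distributed = $distributedSum")
    sc.stop()
  }
}
```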
Spark is...
More expressive: more operations compose than in MapReduce (see the sketch below).
Performant: not only in running time, but also in developer productivity. Interactive!
Also, good for data science.
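For example, a word count where several transformations compose into a single pipeline and the result is cached for interactive reuse. This is only a sketch (the input lines and app name are made up), not code from the course:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count-sketch").setMaster("local[*]"))

    // Tiny made-up input; in practice this would come from sc.textFile(...).
    val lines = sc.parallelize(Seq(
      "spark is more expressive than mapreduce",
      "spark keeps intermediate results in memory"))

    // Several transformations compose into one pipeline; classic MapReduce
    // would force this into one or more rigid map+reduce jobs.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .filter { case (_, n) => n > 1 }

    counts.cache()                      // keep results in memory for interactive reuse
    counts.collect().foreach(println)   // first action materialises the pipeline
    println(counts.count())             // second action reuses the cached data

    sc.stop()
  }
}
```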
And for why you should learn this, see the link below.. lol
Subtitle: the typical Coursera Spark advertisement...
http://stackoverflow.com/insights/survey/2016#technology-top-paying-tech
Latency
Data-Parallel Programming
Important concern...
> Latency cannot be masked completely! It is an important aspect that also impacts the programming model.
Loading from memory is faster than loading from SSD, but of course that's not the crucial point for Spark.. haha. The figure illustrates that the biggest contributor to latency in Spark is network I/O.
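As a rough sketch of how that latency concern shapes the code you write: persisting a filtered data set in memory so that later queries avoid paying the disk/network cost again. The input path and filter strings here are purely hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LatencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("latency-sketch").setMaster("local[*]"))

    // Hypothetical input path; on a real cluster this read goes over the
    // network (e.g. from HDFS), which is the expensive part.
    val logs = sc.textFile("hdfs:///path/to/logs")

    val errors = logs.filter(_.contains("ERROR"))

    // Keep the filtered data in memory so later queries do not pay the
    // disk/network latency again.
    errors.persist(StorageLevel.MEMORY_ONLY)

    println(errors.count())                                // first action reads from storage
    println(errors.filter(_.contains("timeout")).count())  // reuses the in-memory data

    sc.stop()
  }
}
```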
Spark vs Hadoop MR
ㅁ Coursera course link
https://www.coursera.org/learn/scala-spark-big-data
Topics covered in the course