Week 1: Introduction - Course Introduction
Introduction to Apache Spark for Machine Learning on Big Data
What is Big Data?
Course Syllabus
Setup of the grading and exercise environment
Week 1: Introduction - Understanding how Apache Spark works
Data storage solutions
Parallel data processing strategies of Apache Spark
Exercise 1 - working with RDDs
Functional programming basics
Exercise 2 - functional programming basics with RDDs
Resilient Distributed Datasets and DataFrames - Apache Spark SQL
Exercise 3 - working with DataFrames
Programming Language Options for Apache Spark (optional)
Week 2: Scaling Math for Statistics on Apache Spark - Experience parallel programming on Apache Spark
Averages
Standard deviation
Skewness
Kurtosis
Covariance, covariance matrices, correlation
Exercise 1 - statistics and transformations using DataFrames
Week 2: Scaling Math for Statistics on Apache Spark - Data Visualization of Big Data
Plotting with Apache Spark and Python's matplotlib
Exercise on Plotting
Dimensionality reduction
PCA
Exercise on PCA
Week 3: Introduction to Apache SparkML - Introduction to Apache SparkML
How ML Pipelines work
Introduction to SparkML
Extract - Transform - Load
Exercise 1: Modifying an Apache SparkML Feature Engineering Pipeline
Week 3: Introduction to Apache SparkML - Unsupervised Learning with Apache SparkML
Introduction to Clustering: k-Means
Using K-Means in Apache SparkML
Exercise 2 - Working with Clustering and Apache SparkML
Week 4: Supervised and Unsupervised learning with SparkML - Supervised Learning with Apache SparkML
Linear Regression
LinearRegression with Apache SparkML
Logistic Regression
LogisticRegression with Apache SparkML
Exercise 1 - Improving Classification performance
Week 4: Supervised and Unsupervised learning with SparkML - Course Project
Course Project