Introduction
Why is data ingestion important?
What you should know
Using the exercise files
Using the CoderPad quizzes
1. Data Ingestion Overview
Overview of data scientists' work
Where does data come from?
Different types of data
The data pipeline (ETL)
Final destination (data lake)
2. Reading Files
Working in CSV
Working in XML
Working in Parquet, Avro, and ORC
Unstructured text
JSON
Solution: CSV to JSON
3. Calling APIs
Working with JSON
Making HTTP calls
Processing event-based data
Solution: Location from IP
4. Web Scraping
Try to find an API
Working with Beautiful Soup
Working with Scrapy
Working with Selenium
Other considerations
Solution: Get stock information from HTML
5. Schema
What are schemas?
Working with ontologies
What should be in a schema
Schema changes
Schema validations
6. Working with Databases
Types of databases
Hosted and cost of ops
Working with relational databases
Working with key-value databases
Working with document databases
Working with graph databases
Solution: ETL
7. Troubleshooting Data
Data is never 100% okay
Causes of errors
Filling missing values
Finding outliers (manual)
Finding outliers (ML)
Solution: Clean rides dataset
8. Data KPIs and Process
Design your data
KPIs
What to monitor?
Ex_Files_Data_Ingestion_Python.zip (2.3 MB)
Glossary_DataIngestionPython.zip (1.0 MB)