Introduction
Why is data ingestion important?
What you should know
Using the exercise files
Using the Coderpad quizzes
1. Data Ingestion Overview
Overview of data scientists' work
Where does data come from?
Different types of data
The data pipeline (ETL)
Final destination (data lake)
2. Reading Files
JSON
Solution: CSV to JSON
Working in CSV
Working in XML
Working in Parquet, Avro, and ORC
Unstructured text
3. Calling APIs
Solution: Location from IP
Working with JSON
Making HTTP calls
Processing event-based data
4. Web Scraping
Solution: Get stock information from HTML
Try to find an API
Working with Beautiful Soup
Working with Scrapy
Working with Selenium
Other considerations
5. Schema
Schema validations
What are schemas?
Working with ontologies
What should be in a schema
Schema changes
6. Working with Databases
Working with graph databases
Solution: ETL
Types of databases
Hosted and cost of ops
Working with relational databases
Working with key-value databases
Working with document databases
7. Troubleshooting Data
Solution: Clean rides dataset
Data is never 100% okay
Causes of errors
Filling missing values
Finding outliers (manual)
Finding outliers (ML)
8. Data KPIs and Process
Design your data
KPIs
What to monitor?
Ex_Files_Data_Ingestion_Python.zip
(2.3 MB)
Glossary_DataIngestionPython.zip
(1.0 MB)