Chapter 1. Python for Data Science
• In-Class Discussion
• Python Data Science-Centric Libraries
• NumPy
• NumPy Arrays
• Select NumPy Operations
• SciPy
• pandas
• Creating a pandas DataFrame
• Fetching and Sorting Data
• Scikit-learn
• Matplotlib
• Seaborn
• Python Dev Tools and REPLs
• IPython
• Jupyter
• Jupyter Operation Modes
• Jupyter Common Commands
• Anaconda
Chapter 2. Defining Data Science
• What is Data Science?
• Data Science, Machine Learning, AI?
• The Data-Related Roles
• The Data Science Ecosystem
• Tools of the Trade
• Who is a Data Scientist?
• Data Scientists at Work
• Examples of Data Science Projects
• An Example of a Data Product
• Applied Data Science at Google
• Data Science Gotchas
Chapter 3. Data Processing Phases
• Typical Data Processing Pipeline
• Data Discovery Phase
• Data Harvesting Phase
• Data Priming Phase
• Exploratory Data Analysis
• Model Planning Phase
• Model Building Phase
• Communicating the Results
• Production Roll-out
• Data Logistics and Data Governance
• Data Processing Workflow Engines
• Apache Airflow
• Data Lineage and Provenance
• Apache NiFi
Chapter 4. Descriptive Statistics Computing Features in Python
• Descriptive Statistics
• Non-uniformity of a Probability Distribution
• Using NumPy for Calculating Descriptive Statistics Measures
• Finding Min and Max in NumPy
• Using pandas for Calculating Descriptive Statistics Measures
• Correlation
• Regression and Correlation
• Covariance
• Getting Pairwise Correlation and Covariance Measures
• Finding Min and Max in pandas DataFrame
Chapter 5. Repairing and Normalizing Data
• Repairing and Normalizing Data
• Dealing with the Missing Data
• Sample Data Set
• Getting Info on Null Data
• Dropping a Column
• Interpolating Missing Data in pandas
• Replacing the Missing Values with the Mean Value
• Scaling (Normalizing) the Data
• Data Preprocessing with scikit-learn
• Scaling with the scale() Function
• The MinMaxScaler Object
Chapter 6. Data Visualization in Python
• Data Visualization
• Data Visualization in Python
• Matplotlib
• Getting Started with matplotlib
• The matplotlib.pyplot.plot() Function
• The matplotlib.pyplot.bar() Function
• The matplotlib.pyplot.pie() Function
• Subplots
• Using the matplotlib.gridspec.GridSpec Object
• The matplotlib.pyplot.subplot() Function
• Figures
• Saving Figures to a File
• Seaborn
• Getting Started with seaborn
• Histograms and KDE
• Plotting Bivariate Distributions
• Scatter plots in seaborn
• Pair plots in seaborn
• Heatmaps
• Ggplot
Chapter 7. Data Science and ML Algorithms in scikit-learn
• In-Class Discussion
• Types of Machine Learning
• Terminology: Features and Observations
• Representing Observations
• Terminology: Labels
• Terminology: Continuous and Categorical Features
• Continuous Features
• Categorical Features
• Common Distance Metrics
• The Euclidean Distance
• What is a Model
• Supervised vs Unsupervised Machine Learning
• Supervised Machine Learning Algorithms
• Unsupervised Machine Learning Algorithms
• Choosing the Right Algorithm
• The scikit-learn Package
• scikit-learn Estimators, Models, and Predictors
• Model Evaluation
• The Error Rate
• Confusion Matrix
• The Binary Classification Confusion Matrix
• Multi-class Classification Confusion Matrix Example
• ROC Curve
• The AUC Metric
• Feature Engineering
• Scaling of the Features
• Feature Blending (Creating Synthetic Features)
• The 'One-Hot' Encoding Scheme
• Example of 'One-Hot' Encoding Scheme
• Bias-Variance (Underfitting vs Overfitting) Trade-off
• The Modeling Error Factors
• One Way to Visualize Bias and Variance
• Underfitting vs Overfitting Visualization
• Balancing Off the Bias-Variance Ratio
• Regularization in scikit-learn
• Regularization, Take Two
• Dimensionality Reduction
• PCA and isomap
• The Advantages of Dimensionality Reduction
• The LIBSVM Format
• Life-cycles of Machine Learning Development
• Data Splitting into Training and Test Datasets
• ML Model Tuning Visually
• Data Splitting in scikit-learn
• Cross-Validation Technique
• Classification (Supervised ML) Examples
• Classifying with k-Nearest Neighbors
• k-Nearest Neighbors Algorithm
• Regression Analysis
• Regression vs Correlation
• Regression vs Classification
• Simple Linear Regression Model
• Linear Regression Illustration
• Least-Squares Method (LSM)
• Gradient Descent Optimization
• Multiple Regression Analysis
• Evaluating Regression Model Accuracy
• The R² Model Score
• The MSE Model Score
• Logistic Regression (Logit)
• Interpreting Logistic Regression Results
• Decision Trees
• Decision Tree Terminology
• Properties of Decision Trees
• Decision Tree Classification in the Context of Information Theory
• The Simplified Decision Tree Algorithm
• Using Decision Trees
• Random Forests
• Support Vector Machines (SVMs)
• Naive Bayes Classifier (SL)
• Naive Bayesian Probabilistic Model in a Nutshell
• Bayes Formula
• Classification of Documents with Naive Bayes
• Unsupervised Learning Type: Clustering
• k-Means Clustering (UL)
• k-Means Clustering in a Nutshell
• k-Means Characteristics
• Global vs Local Minimum Explained
• XGBoost
• Gradient Boosting
• A Better Algorithm or More Data?
Lab Exercises
Lab 1 - Using Jupyter Notebook
Lab 2 - Understanding Python
Lab 3 - Understanding NumPy
Lab 4 - Understanding pandas
Lab 5 - Repairing and Normalizing Data
Lab 6 - Data Visualization in Python
Lab 7 - Data Visualization in Python Project
Lab 8 - Data Splitting
Lab 9 - k-Nearest Neighbors Algorithm
Lab 10 - The Random Forest Algorithm
Lab 11 - The k-Means Algorithm
Lab 12 - Building Regression Models with the XGBoost Library