Processing big data in real time is challenging because of the demands of scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that help you manage this data avalanche. In this course, you’ll learn practical techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems.

The course begins with an introduction to data manipulation in Python using pandas. You’ll then get familiar with statistical analysis and plotting techniques. Through multiple hands-on activities, you’ll learn to analyze data that is distributed across several computers by using Dask. As you progress, you’ll study how to aggregate data for plots when the entire dataset does not fit in memory. You’ll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The course also covers Spark and its interaction with other tools.

By the end of this course, you’ll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.
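
To give a flavour of aggregating data that does not fit in memory, here is a minimal sketch. It is not taken from the course material: it uses only the Python standard library rather than pandas, Dask, or Spark, and the function name `streaming_group_mean`, the column names, and the sample values are hypothetical. It reads one record at a time and keeps only running totals per group.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(rows, key, value):
    """Return {group: mean} while holding only per-group totals in memory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:  # rows is consumed one record at a time
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample data; in practice this would be a large file on disk.
sample = io.StringIO(
    "city,temp\n"
    "Dublin,10\n"
    "Cork,12\n"
    "Dublin,14\n"
    "Cork,16\n"
)
means = streaming_group_mean(csv.DictReader(sample), key="city", value="temp")
print(means)
```

Because only the running totals are stored, memory use grows with the number of groups, not the number of rows. The resulting small dictionary is the kind of pre-aggregated summary that can then be passed to a plotting tool.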

LEARNING OUTCOMES

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics from data stored on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Big Data Analysis with Python

Course Code: GTDBDAP

Duration: 2 Days

Course Fee: POA

Accreditation: N/A

Target Audience

  • Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help in understanding various concepts explained in this course.

Course Outline

Lesson 1: The Python Data Science Stack

  • Python Libraries and Packages
  • Using Pandas
  • Data Type Conversion
  • Aggregation and Grouping
  • Exporting Data from Pandas
  • Visualization with Pandas

Lesson 2: Statistical Visualizations

  • Types of Graphs and When to Use Them
  • Components of a Graph
  • Which Tool Should Be Used?
  • Types of Graphs
  • Pandas DataFrames and Grouped Data
  • Changing Plot Design: Modifying Graph Components
  • Exporting Graphs

Lesson 3: Working with Big Data Frameworks

  • Hadoop
  • Spark
  • Writing Parquet Files
  • Handling Unstructured Data

Lesson 4: Diving Deeper with Spark

  • Getting Started with Spark DataFrames
  • Writing Output from Spark DataFrames
  • Exploring Spark DataFrames
  • Data Manipulation with Spark DataFrames
  • Graphs in Spark

Lesson 5: Handling Missing Values and Correlation Analysis

  • Setting up the Jupyter Notebook
  • Missing Values
  • Handling Missing Values in Spark DataFrames
  • Correlation

Lesson 6: Exploratory Data Analysis

  • Defining a Business Problem
  • Translating a Business Problem into Measurable Metrics and Exploratory Data Analysis (EDA)
  • Structured Approach to the Data Science Project Life Cycle

Lesson 7: Reproducibility in Big Data Analysis

  • Reproducibility with Jupyter Notebooks
  • Gathering Data in a Reproducible Way
  • Code Practices and Standards
  • Avoiding Repetition

Lesson 8: Creating a Full Analysis Report

  • Reading Data in Spark from Different Data Sources
  • SQL Operations on a Spark DataFrame
  • Generating Statistical Measurements
Ways to Attend
  • Attend a public course, if there is one available. Please check our schedule, or register your interest in joining a course in your area.
  • Private onsite team training is also available; please contact us to discuss. We can customise this course to suit your business requirements.

Private Team Training is available for this course

We deliver this course either on or off-site in various regions around the world, and can customise your delivery to suit your exact business needs. Talk to us about how we can fine-tune a course to suit your team's current skillset and ultimate learning objectives.


Technical ICT learning & mentoring services

Private Team Training

Our instructors are specialist consultants with extensive real-world experience and expertise, which allows them to design and deliver client-focused courses for your organisation.


About GuruTeam

GuruTeam is a high-level ICT Learning, Mentoring and Consultancy services company. We specialise in delivering instructor-led on and off-site training in Blockchain, Linux, Cloud, Big Data, DevOps, Kubernetes, Agile, Software & Web Development technologies.
