Learning PySpark by examples

By: Dimitar Trajanov

Learning PySpark by examples can be an effective way to understand how PySpark works in practice and to gain hands-on experience with the tool. Here are a few reasons why learning PySpark by examples can be beneficial:

  1. Demonstrates real-world applications: Examples show how PySpark is used to solve real-world data processing challenges. This can help you understand how to apply PySpark to your own data analysis problems.
  2. Provides practical experience: Working through examples can help you gain practical experience with PySpark. By experimenting with PySpark code, you can better understand how to use PySpark functions, manipulate data, and build data pipelines.
  3. Improves problem-solving skills: Examples can also help you develop your problem-solving skills. As you work through examples, you will encounter errors and challenges that you will need to solve. This can help you develop your critical thinking skills and learn how to debug PySpark code.
  4. Makes learning engaging: Examples make learning interactive and hands-on, which can be more enjoyable than working only through documentation or tutorials.

In this repository, you can find PySpark examples that range from the basics up to advanced machine learning algorithms, including those built with MLlib. Together they give a broad overview of PySpark's capabilities, from simple data processing tasks to more complex machine learning use cases. Whether you are just starting out with PySpark or already experienced with the tool, you can find examples that will help you build your skills and deepen your understanding of how to use PySpark to solve real-world data problems.

PySpark official documentation

To help users get started with PySpark, the official documentation provides a comprehensive guide that covers all aspects of the framework, from installation to advanced data processing and machine learning algorithms.

The official PySpark documentation is maintained by the developers themselves, ensuring that it is always up-to-date and accurate with the latest changes in the platform.

The following guides are highly recommended for anyone learning or working with PySpark:

  • PySpark Getting Started: This guide summarizes the basic steps required to set up and get started with PySpark.
  • RDD Programming Guide: This guide provides an overview of Spark basics, including RDDs (the core but older API), accumulators, and broadcast variables.
  • Spark SQL, Datasets, and DataFrames: This guide is focused on processing structured data with relational queries using a newer API than RDDs.
  • MLlib: This guide provides detailed information on how to apply machine learning algorithms in PySpark.
  • Spark Python API: The PySpark API reference, with detailed information on its modules, classes, and methods.

PySpark by examples

RDD Basics

RDD Advanced

PySpark DataFrames

ML models implementation

Spark MLlib examples

SparkNLP Examples

Python ML-related examples