Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, presented mostly as Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in an Ubuntu virtual machine or container.
- Hadoop_Setting_up_a_Single_Node_Cluster.ipynb Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- Hadoop_single_node_cluster_setup_Python.ipynb Set up a single-node Hadoop cluster on Google Colab using Python
- Hadoop_minicluster.ipynb Deploy a test Hadoop cluster with a single command and no configuration.
- Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb Set up a single-node Spark server on Google Colab and estimate “π” with a Monte Carlo method
- Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate “π” with a Monte Carlo method, and run another Java ML example.
- Run_Spark_on_Google_Colab.ipynb Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version
- Spark_Standalone_Architecture_on_Google_Colab.ipynb Explore the Spark architecture through the immersive experience of deploying a standalone setup.
- MapReduce_Primer_HelloWorld.ipynb A MapReduce Primer with “Hello, World!”
- MapReduce_Primer_HelloWorld_bash.ipynb A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code
- mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and Bash to create the mapper and reducer
- simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
- mrjob_wordcount.ipynb A simple MapReduce job with mrjob
- Hadoop_spilling.ipynb Hadoop spilling explained
- PySpark_On_Google_Colab.ipynb Explore the inner workings of PySpark on Google Colab
- PySpark_miscellanea.ipynb Tips, tricks, and insights related to PySpark.
- demoSparkSQLPython.ipynb Basic PySpark demo
- ngrams_with_pyspark.ipynb Basic example of n-grams extraction with PySpark
- generate_data_with_Faker.ipynb Data Generation and Aggregation with Python's Faker Library and PySpark
- Encoding+dataframe+columns.ipynb DataFrame Column Encoding with PySpark and Parquet Format
- Apache_Sedona_with_PySpark.ipynb Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab
- GutenbergBooks.ipynb Explore and download books from the Gutenberg books collection.
- TestDFSio.ipynb Demo of TestDFSIO for benchmarking Hadoop clusters
- Unicode.ipynb Exploring Unicode categories
- polynomial_regression.ipynb Worked-out example of polynomial regression with NumPy and Matplotlib
- downloadSpark.ipynb How to download and verify the Spark distribution
- docker_for_beginners.md Docker for beginners: an introduction to the world of containers
- Terraform for beginners.md Getting started with Terraform
- Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management
- online_resources.md Online resources for learning Big Data
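To give a rough idea of what the MapReduce word-count notebooks listed above walk through, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases. It is not taken from any of the notebooks and does not use Hadoop; the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input, not data from the notebooks.
    print(word_count(["hello world", "hello Hadoop"]))
```

In a real Hadoop job the shuffle is performed by the framework between the mapper and reducer processes; here it is simulated in memory so the data flow is visible in one file.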
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is named action_log.txt (see also: Google Colab vs. GitHub Ubuntu Runner).
The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.
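As an illustration of what one CI step for a notebook might look like, the sketch below runs a notebook headlessly with the `jupyter nbconvert` command-line tool and reports whether it executed without errors. This is an assumption about how such a check could be written, not a description of this repository's actual workflow; the helper names `nbconvert_command` and `run_notebook` are hypothetical.

```python
import subprocess


def nbconvert_command(notebook, output):
    """Build the command that executes a notebook and saves the result."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",   # keep the output in notebook format
        "--execute",          # run all cells before converting
        notebook,
        "--output", output,
    ]


def run_notebook(notebook, output):
    """Return True if the notebook executed without errors.

    Requires Jupyter and nbconvert to be installed on the machine.
    """
    result = subprocess.run(nbconvert_command(notebook, output))
    return result.returncode == 0
```

A caller could invoke `run_notebook("mrjob_wordcount.ipynb", "executed.ipynb")` and append the notebook name to a log file when it returns `True`.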