Data.Engineers.Lunch

Resources from weekly Zoom lunches revolving around data engineering and data engineering-related topics. Hosted by Anant Corporation.

Join Data Engineer's Lunch Weekly at 12 PM EST Every Monday

Watch Data Engineer's Lunches Live and Subscribe to Our YouTube Channel to Keep Up to Date

If you would like to be a guest speaker, you can reach us at solutions@anant.us. If you would like to sponsor Data Engineer's Lunch, please reach us at the email listed.

Check out the Data Engineer's Lunch playlist on YouTube

Table of Contents

| Number | Jump To Topic | YouTube | SlideShare |
|---|---|---|---|
| 1 | Data Engineering Roadmap | YouTube | SlideShare |
| 2 | Common ETL Frameworks | YouTube | SlideShare |
| 3 | Scripting Shell Automation for Data Engineering | YouTube | SlideShare |
| 4 | Airflow for Data Engineering | YouTube | SlideShare |
| 5 | What is a Data Lake | YouTube | SlideShare |
| 6 | Common Data Formats Used In Data Engineering | YouTube | SlideShare |
| 7 | SQL Databases | YouTube | SlideShare |
| 8 | SQL Databases Part 2 | YouTube | SlideShare |
| 9 | Open Source & Cloud Data Catalog | YouTube | SlideShare |
| 10 | NoSQL Databases: Part 1 | YouTube | SlideShare |
| 11 | Apache Spark Companion Technologies MLFlow | YouTube | SlideShare |
| 12 | Introduction to sed for Data Engineering | YouTube | SlideShare |
| 13 | Introduction to Airflow | YouTube | SlideShare |
| 14 | NoSQL Databases: Part 2 CAP Theorem | YouTube | SlideShare |
| 15 | Introduction to Jenkins | YouTube | SlideShare |
| 16 | Introduction to awk for Data Engineering | YouTube | SlideShare |
| 17 | NoSQL Databases: Part 3 Data Store Types | YouTube | SlideShare |
| 18 | Luigi for Scheduling | YouTube | SlideShare |
| 19 | Introduction to jq for Data Engineering | YouTube | SlideShare |
| 20 | DataOps vs. DevOps | YouTube | SlideShare |
| 21 | Python ETL Tools | YouTube | SlideShare |
| 22 | Prometheus | YouTube | SlideShare |
| 23 | Thanos/Cortex | YouTube | SlideShare |
| 24 | Pandas for Data Engineering | YouTube | SlideShare |
| 25 | Airflow and Spark | YouTube | SlideShare |
| 26 | Akka Actors for Data Processing | YouTube | SlideShare |
| 27 | Data Processing with Containers: Docker & Kubernetes Tools for Data Engineering | YouTube | SlideShare |
| 28 | Petl for Data Engineering | YouTube | SlideShare |
| 29 | Introduction to Apache Nifi | YouTube | SlideShare |
| 30 | Databand | YouTube | SlideShare |
| 31 | Migrating from PostgreSQL to Cassandra | YouTube | SlideShare |
| 32 | Converting JSON to CSV | YouTube | SlideShare |
| 33 | Using Spark, Cassandra, and Elasticsearch for Data Processing | YouTube | SlideShare |
| 34 | DBeaver | YouTube | SlideShare |
| 35 | Introduction to Snowflake | YouTube | SlideShare |
| 36 | Amundsen/DSE + Airflow | YouTube | SlideShare |
| 37 | Pipedream: Serverless Integration and Compute Platform | YouTube | SlideShare |
| 39 | Dapr Cloud | YouTube | SlideShare |
| 40 | Streaming Real Time vs Batch for ETL | YouTube | SlideShare |
| 41 | PygramETL | YouTube | SlideShare |
| 42 | Introduction to Databricks | YouTube | SlideShare |
| 43 | Bodo.ai - Karthik Narayanan | YouTube | |
| 44 | Prefect | YouTube | SlideShare |
| 45 | Apache Livy | YouTube | SlideShare |
| 46 | Node.js and API calls | YouTube | SlideShare |
| 47 | Airflow on Kubernetes | YouTube | SlideShare |
| 48 | Veezoo - João Pedro Monteiro | YouTube | |
| 49 | Meltano for Data Engineering | YouTube | SlideShare |
| 50 | Airbyte for Data Engineering | YouTube | SlideShare |
| 51 | Comparison of Managed Airflow Options | YouTube | |
| 52 | JupyterHub/JupyterLab on Kubernetes | YouTube | SlideShare |
| 53 | 2021 in Review | YouTube | |
| 54 | dbt and Spark | YouTube | SlideShare |
| 55 | Get Started in Data Engineering | YouTube | SlideShare |
| 56 | Spring Cloud Data Flow with Cassandra | YouTube | SlideShare |
| 57 | StreamSets for Data Engineering | YouTube | SlideShare |
| 58 | InfinyOn | YouTube | |
| 59 | Spark Tasks and Distribution | YouTube | SlideShare |
| 60 | Series - Developing Enterprise Consciousness | YouTube | SlideShare |
| 61 | Kubevirt | YouTube | SlideShare |
| 63 | Building a Cryptocurrency Data Catalogue | YouTube | SlideShare |
| 64 | Processing Real-time Crypto Transactions | YouTube | |
| 65 | JanusGraph on Jupyter - Using Notebooks with Graph | YouTube | |
| 66 | Airflow and Presto | YouTube | SlideShare |
| 67 | Machine Learning - Feature Selection | YouTube | SlideShare |
| 68 | DevOps Fundamentals | YouTube | SlideShare |
| 69 | Great Expectations for Data Engineering | | SlideShare |
| 70 | Apache Iceberg | YouTube | SlideShare |
| 71 | Tools for Cloud Data Engineering | YouTube | |
| 72 | Introduction to Apache Pinot | YouTube | |
| 74 | Table Format Comparison | YouTube | |
| 75 | Real-time change data capture processing and ingest into OLTP and OLAP databases | YouTube | |
| 76 | Airflow and Google Dataproc | YouTube | |
| 77 | Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Databases | YouTube | |
| 78 | Visualize Data from Cassandra in Superset | YouTube | |
| 79 | The Second 90% of Data Engineering Projects | YouTube | |
| 80 | Apache Spark Resource Managers | YouTube | |
| 81 | Reverse ETL Tools for Modern Data Platforms | YouTube | |
| 82 | Automating Apache Cassandra Operations with Apache Airflow | YouTube | |
| 83 | Strategies for Migration to Apache Iceberg - Alex Merced Dremio | YouTube | |
| 84 | Interesting and Exciting Things from AWS re:Invent 2022 | YouTube | |
| 85 | Designing a Modern Data Stack | YouTube | |
| 86 | Building Real-Time Applications at Scale: A Case Study in Cyclist Crash Detection | YouTube | |
| 87 | ChatGPT for Data Engineering | YouTube | |
| 89 | Machine Learning Orchestration with Airflow | YouTube | |
| 90 | Migrating SQL Data with Arcion | YouTube | |
| 91 | Deploying Google-managed Instance Groups with Terraform | YouTube | |
| 92 | GCP Managed Instance Groups with Terraform Pt. 2 | YouTube | |
| 93 | LLM / AI Engineering for Software & Data Engineers | YouTube | |
| 94 | Upgrading Postgres for On-Prem IoT | YouTube | |
| 95 | Python Parallel Processing Frameworks | YouTube | |

  • We cover the data engineering roadmap and the general path, which includes various technologies for programming, scripting/automation, databases, data processing, scheduling, clouds, and infrastructure. We also discuss different guides and resources.

  • We discuss common ETL frameworks and different tools and frameworks for different languages including Python, Java, Scala, .NET, and Node.

  • We discuss a multitude of tools you can use to do scripting and shell automation for data engineering along with different shells, cron, and various command-line tools with resources and examples.

  • Guest speaker Will Angel covers the topic of using Airflow for data engineering. Airflow is a scheduling tool for managing data pipelines.

  • We discuss what data lakes are, why we need them, how we get data in and out, and different implementations of data lakes.

  • We discuss common data formats used in data engineering including text/file and binary formats.

  • We discuss relational concepts including the history of RDBMS, the general need for SQL databases, rules of design, and normalization. We also discuss popular SQL databases, and their advantages and disadvantages.

  • We continue our discussion of relational concepts, popular SQL databases, and advantages and disadvantages. We also discuss Cloud Databases and database tools compatible with SQL databases.


  • We discuss NoSQL datastores, specifically, different types of key-value stores.

  • We cover MLFlow, a tool by Databricks for managing and cataloging machine learning workflows.

  • We will introduce sed, a stream editor, for data engineering. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
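As a rough illustration of what "basic text transformations on an input stream" means, here is a minimal Python sketch of a sed-style substitution. It is not sed itself and not taken from the talk: the `substitute` helper and the sample data are hypothetical, and Python's `re` syntax differs from sed's regular expressions in places.

```python
import io
import re
from typing import Iterable, Iterator


def substitute(lines: Iterable[str], pattern: str, replacement: str) -> Iterator[str]:
    """Yield each line with the first match of `pattern` replaced."""
    compiled = re.compile(pattern)
    for line in lines:
        # count=1 replaces only the first match on each line
        yield compiled.sub(replacement, line, count=1)


# Hypothetical input stream; in practice this could be sys.stdin or an open file.
stream = io.StringIO("id;name\n1;alice\n2;bob\n")
cleaned = list(substitute(stream, r";", ","))
```

Because the helper is a generator, lines are transformed one at a time as they arrive, which is the property that makes stream editing practical on large files and pipelines.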

  • We will cover some resources for getting started with Airflow, a Python-based scheduling tool that can connect to a number of different data management tools. Will Angel recently gave an overview of Airflow in Data Engineer's Lunch #4. This session will help beginners learn to use Airflow.

  • We cover the fundamental difference between relational and most non-relational databases: ACID vs. BASE.
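To make the ACID side of that comparison concrete, the sketch below shows atomicity only, using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are all hypothetical and not from the talk. An eventually consistent (BASE) store would not give this all-or-nothing guarantee across multiple writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft, so the credit was rolled back too.
        return False


ok = transfer(conn, "a", "b", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves both balances untouched even though the credit to `b` ran before the constraint violation.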

  • We will cover the use of Jenkins as a scheduling tool, give a general overview of Jenkins' capabilities, and compare it with Airflow as a scheduling tool.

  • We will introduce and demonstrate awk, a program that you can use to select particular records in a file and perform operations upon them.
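The select-records-then-operate pattern that awk is built around can be sketched in Python as follows. This is an analogy rather than awk code, and the `select_and_sum` helper and the sample log are hypothetical.

```python
import io
from typing import Iterable, List, Tuple


def select_and_sum(
    lines: Iterable[str], field_index: int, min_value: float, sep: str = ","
) -> Tuple[List[List[str]], float]:
    """Return records whose numeric field is >= min_value, plus that field's total."""
    selected = []
    total = 0.0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        try:
            value = float(fields[field_index])
        except (IndexError, ValueError):
            continue  # skip headers and malformed records
        if value >= min_value:
            selected.append(fields)
            total += value
    return selected, total


# Hypothetical input; in practice this could be an open log or CSV file.
log = io.StringIO("host,bytes\nweb1,1200\nweb2,80\nweb3,4300\n")
records, total_bytes = select_and_sum(log, field_index=1, min_value=1000)
```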

  • We discussed the four different types of data stores that underlie NoSQL databases.

  • We discussed Luigi as a scheduling platform alongside our previous discussions of Jenkins and Airflow. Luigi is a Python package that helps you build complex pipelines of batch jobs.

  • We introduce jq and how we can use it for data engineering. jq is a command-line tool like sed for JSON data and can be used to slice, filter, map, and transform structured data.
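For readers new to the idea, the filter, map, and slice operations mentioned above look roughly like this when written against Python's standard `json` module. This is a sketch of the operations, not jq syntax, and the sample records are hypothetical.

```python
import json

# Hypothetical JSON payload, e.g. the body of an API response.
raw = (
    '[{"user": "ana", "events": 3, "active": true},'
    ' {"user": "raj", "events": 0, "active": false},'
    ' {"user": "li", "events": 7, "active": true}]'
)

records = json.loads(raw)

# Filter: keep only active users.
active = [r for r in records if r["active"]]

# Map / transform: reshape each record to the fields we care about.
summary = [{"user": r["user"], "events": r["events"]} for r in active]

# Slice: take the first result only.
first = summary[:1]
```

jq expresses the same pipeline in a single command-line filter, which is why it is convenient inside shell scripts.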

  • We discuss the definitions of and differences between DataOps (Data Operations) and DevOps (Development Operations).

  • We discuss, compare, and contrast a number of ETL tools for Python.

  • Guest speaker Will Angel covers the topic of using Prometheus for data engineering. Prometheus is a monitoring system & time series database.


  • We continue our discussion of Python ETL tools with a more in-depth look at Pandas.


  • We discuss how to use Akka Actors for concurrent data processing operations.


  • We continue our discussion of Python ETL tools with a more in-depth look at Petl.

  • We introduce Apache Nifi and discuss how we can use it for data engineering.

  • In Data Engineer’s Lunch #30 we discuss the differences between the open-source and paid versions of Databand and have Databand CEO Josh Benamram walk us through a demo of the paid version.

  • In Data Engineer's Lunch #31, we will discuss the process and reasons for migrating your database from SQL (PostgreSQL) to NoSQL (Cassandra).

  • In Data Engineer's Lunch #32, we will discuss different ways to convert JSON files into CSV files.
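One such way, shown here as a minimal sketch using only the Python standard library, handles a flat JSON array of objects. The `json_to_csv` helper and the sample data are hypothetical, and nested JSON would need flattening first.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)

    # Collect the union of keys, preserving first-seen order, for the header row.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a record are written as empty cells
    return buffer.getvalue()


sample = '[{"id": 1, "name": "alice"}, {"id": 2, "name": "bob", "city": "Austin"}]'
csv_text = json_to_csv(sample)
```

Building the header from the union of keys means records with optional fields still line up under the right columns.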

  • In Data Engineer's Lunch #33, we will discuss how you can use Spark and Spark jobs to load data from a CSV file, then save and load the data into Cassandra and Elasticsearch.

  • In Data Engineer's Lunch #34: DBeaver, we will be discussing what DBeaver is and how it can be used in data engineering.

  • In Data Engineer's Lunch #35: Introduction to Snowflake, we will introduce Snowflake and discuss how it can be used for Data Engineering.

  • In Data Engineer's Lunch #36, we will discuss data discovery with Amundsen.

  • In Data Engineer's Lunch #37, we will discuss Pipedream, a serverless integration and compute platform that is free for individual developers to use.

  • In Data Engineer's Lunch #39: Dapr Cloud, we will discuss how to use Dapr to build a cloud application.

  • In Data Engineer's Lunch #40: Streaming Real Time vs Batch for ETL, we will be discussing use cases for using real time stream processing or processing in batches.

  • In Data Engineer's Lunch #41, we will discuss pygrametl as part of our discussion of Python ETL tools.

  • In Data Engineer's Lunch #42, we will introduce Databricks and how it can be used for data engineering.

  • In Data Engineer's Lunch #43, Karthik Narayanan, Principal Solutions Architect at Bodo.ai, will be demonstrating what Bodo.ai is and its capabilities.

  • In Data Engineer's Lunch #44, we will discuss Prefect and how it compares to Airflow when scheduling tasks.

  • In Data Engineer's Lunch #45, we will discuss the use of Apache Livy, which creates a REST API for interacting with Spark.

  • In Data Engineer's Lunch #46, we discuss the architecture of Node.js and use it to initiate and harvest some data from an API call.

  • In Data Engineer's Lunch #47, we will use Kubernetes to deploy Airflow.

  • In Data Engineer's Lunch #48, João Pedro Monteiro (JP), co-founder and CTO of Veezoo, will be introducing Veezoo and showing how natural language interfaces are the key to enabling data democratization at companies.

  • In Data Engineer's Lunch #49, we will be introducing Meltano and how it can be used for ELT in data engineering.

  • In Data Engineer's Lunch #50, we will introduce Airbyte and discuss how it can be used for data engineering.

  • In Data Engineer's Lunch #51: Comparison of Managed Airflow Options, guest speaker Andres Namm will be comparing AWS Airflow, GCP Airflow, Astronomer vs. self-managed Airflow.

  • In Data Engineer's Lunch #52, we will deploy JupyterHub/JupyterLab on Kubernetes.

  • In Data Engineer's Lunch #53, we discussed some of our most popular webinars from 2021 and received feedback from the audience about what they would like to see in 2022.

  • In Data Engineer's Lunch #54, we will discuss the data build tool, a tool for managing data transformations with config files rather than code. We will be connecting it to Apache Spark and using it to perform transformations.

  • In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.

  • In Data Engineer's Lunch #56, we will be going over how to integrate Spring Cloud Data Flow with Cassandra.

  • In Data Engineer's Lunch #57, we will discuss StreamSets and how it can be used for data engineering.

  • In Data Engineer’s Lunch #58, Sehyo Chang, founder and CTO of InfinyOn, will give an introduction to Fluvio OSS and the InfinyOn Cloud data streaming platform.

  • In Data Engineer's Lunch #59, we will discuss the way that Spark splits up and distributes work between nodes. We will look at some example code and use the Spark UI to view how the work was distributed between nodes.

  • In Data Engineer's Lunch #60, CEO of Anant, Rahul Singh, will discuss modern data processing and pipeline approaches, with a high-level overview of the different types, frameworks, and workflows in data processing and pipeline design. Join to learn about modern data engineering patterns and practices for global data platforms.

  • In Data Engineer's Lunch #61, Stefan Nikolovski will discuss Kubevirt.

  • In Data Engineer’s Lunch #63, Travis Collins, founder of the open source project DataPM, will present DataPM and show how to get access to cryptocurrency and blockchain data. This is part 1 of a series with Decodable on processing real-time crypto transactions fed by DataPM.

  • In Data Engineer’s Lunch #64, Eric Sammer, CEO of Decodable, will discuss their cloud-based streaming SQL engine and how to mine insights from data in real-time. This is part 2 of a series with DataPM on processing real-time crypto transactions fed by DataPM.

  • In Data Engineer's Lunch #65, Ryan Quey will discuss the Graph Notebook tool put out by the AWS team on JanusGraph.

  • In Data Engineer's Lunch #66, Arpan Patel will discuss how to connect Airflow and Presto.

  • In Data Engineer's Lunch #67, Obioma Anomnachi will discuss the process of feature selection as part of a Machine Learning process. Feature selection describes the process of picking particular, relevant data features out of a wider data set, to be used to perform model training.
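As one simple, concrete instance of feature selection, the sketch below drops features whose variance falls below a threshold, since a near-constant column carries little signal for training. The talk description does not say which methods it covers, so this variance filter is an illustrative choice, and the `select_by_variance` helper, feature names, and data are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def select_by_variance(
    rows: Sequence[Sequence[float]], feature_names: Sequence[str], threshold: float
) -> List[str]:
    """Return the names of features whose population variance exceeds `threshold`."""
    kept = []
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        if pvariance(column) > threshold:
            kept.append(name)
    return kept


# Hypothetical data set: one informative column, one constant, one nearly constant.
rows = [
    (1.0, 0.0, 10.0),
    (2.0, 0.0, 10.1),
    (3.0, 0.0, 9.9),
    (4.0, 0.0, 10.0),
]
kept = select_by_variance(rows, ["age", "constant_flag", "sensor_drift"], threshold=0.1)
```

A variance filter ignores the target variable entirely, so in practice it is usually a first pass before methods that score features against the label.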

  • In Data Engineer’s Lunch #68, Will Angel, Technical Product Manager at Caribou Financial, will provide an introduction to DevOps practices and tooling including testing, deployment automation, logging, monitoring, and DevOps principles. Additionally, we will discuss some of the ways that DevOps for data engineering is different from conventional application development.

[Data Engineer's Lunch #69: Great Expectations for Data Engineering]

  • In Data Engineer's Lunch #69, Arpan Patel will discuss Great Expectations and how it can be used for data engineering. This will be part one of a series on Great Expectations and will primarily focus on introducing Great Expectations. Future talks will feature tools like Spark and Airflow in conjunction with Great Expectations!

  • In Data Engineer's Lunch #70, Alex Merced, Developer Advocate at Dremio, will explain the architectural details of where the Hive table format falls short and how the Iceberg table format resolves those shortcomings, as well as the benefits that stem from Iceberg’s approach.

  • In Data Engineer’s Lunch #71, CEO of Anant, Rahul Singh, will discuss tools for cloud data engineering!

  • In Data Engineer’s Lunch #72, CEO of Anant, Rahul Singh, will give an overview of the up-and-coming Apache Pinot project that spun out of LinkedIn and is now being supported by StarTree as an enterprise offering. This is the first in a series of talks and workshops on why Pinot is important to the future of real-time data.

  • In Data Engineer's Lunch #74, Alex Merced, Developer Advocate for Dremio, will discuss the three major data lake table formats – Apache Iceberg, Apache Hudi, and Delta Lake – covering how they work, their features, and their limitations so you can make an informed decision when architecting your data lakehouse.

  • In Data Engineer's Lunch #75, Eric Sammer, CEO of Decodable, will discuss real-time change data capture, processing, and ingest into OLTP and OLAP databases!

  • In Data Engineer's Lunch #76, Arpan Patel will cover how to connect Airflow and Dataproc with a demo using an Airflow DAG to create a Dataproc cluster, submit an Apache Spark job to Dataproc, and destroy the Dataproc cluster upon completion.

  • This talk covers why ODBC & JDBC don’t cut it in today’s data world and the problems solved by Arrow, Arrow Flight, and Arrow Flight SQL. Alex will go through how each of these building blocks works as well as an overview of universal ODBC & JDBC drivers built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

  • In this lunch, Ryan will walk through how to visualize data from Cassandra in Superset (by means of Presto). Along the way, he shares some observations about his experience and potential use cases that may be interesting to you.

  • You build an ELT pipeline to get data from some source, load it into your data lake, and transform it into a usefully modeled dataset for analysts and business users to consume; another data engineering job well done. Except you now have a new set of data artifacts, access patterns, documentation (hopefully), and security permissions to manage. This talk will provide an overview of Data Governance, which is the art of anticipating, preventing, and mitigating all the risks, costs, and headaches that come with every new data source throughout the data lifecycle.

  • In Data Engineer's Lunch #80, Obioma Anomnachi will compare and contrast the different resource managers available for Apache Spark. We will cover local, standalone, YARN, and Kubernetes resource managers and discuss how each one allows the user different levels of control over how resources given to Spark are distributed to Spark applications.

  • During this lunch, we’ll review some of the open-source reverse ETL tools to uncover how to send data back to SaaS systems.

  • During this lunch, we’ll discuss going beyond cron jobs to manage ETL, Data Hygiene, and Data Import/Export.

  • In this talk, Dremio Developer Advocate Alex Merced discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over how to migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg; the pros and cons of an in-place or shadow migration; and migrating between Apache Iceberg catalogs (Hive/Glue -- Arctic/Nessie).

  • In this lunch, Nicholas will deliver a breakdown and description of AWS re:Invent 2022 and some of the cool announcements and learning that occurred there.

  • What are the design considerations that go into architecting a modern data warehouse? This lunch will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.

  • As the demand for real-time data processing continues to grow, so too do the challenges associated with building production-ready applications that can handle large volumes of data and handle it quickly. In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes. Using telemetry data collected from a fitness app, we’ll demonstrate how we used a combination of Apache Kafka and Python-based microservices running on Kubernetes to build a pipeline for processing and analyzing this data in real-time. We'll also discuss how we used machine learning techniques to build a model for detecting collisions and how we implemented notifications to alert family members of a crash. Our ultimate goal is to help you navigate the challenges that come with building data-intensive, real-time applications that use ML models. By showcasing a real-world example, we aim to provide practical solutions and insights that you can apply to your own projects.

  • Learn how to use ChatGPT for basic data engineering tasks such as data modeling, ETL, data cleanup, code conversion, and data science.

  • In Data Engineer's Lunch #89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines, from retrieving raw data to serving ML predictions to end-users, entirely in Airflow.

  • If you're looking to migrate SQL data for your organization, you won't want to miss this informative talk on Arcion. Designed to simplify the data migration process, Arcion offers a seamless solution that enables you to move your SQL data quickly and efficiently. During this talk, you'll discover the many benefits of using Arcion for data migration, including its intuitive interface and powerful automation features. You'll also learn how to leverage Arcion's robust capabilities to streamline your SQL data migration process, regardless of the size or complexity of your data sets.

  • In this lunch, we'll show you how to deploy a Managed Instance Group using Terraform. We'll explain the different methods and demonstrate configurations, lessons learned, and a simple example for deploying a Managed Instance Group with Terraform in GCP.

  • In the second in a series on Google Managed Instance Groups, Anant Architect Nicholas Brackley details methods for implementing auto-healing and updating the image used by the group.

  • During this lunch, we'll cover what you need to know to get started in LLM / GPT engineering as a Software and/or Data Engineer. This lunch covers the fundamentals of LLM, some patterns, and how to get started.

  • Join Will Angel for a talk about his journey upgrading an on-prem IoT system's Postgres database and some of the techniques used to improve local database performance for analytical systems.

  • In Data Engineer's Lunch #95, Obioma Anomnachi will discuss parallel computing for Python programmers and the various pathways available to Python developers who wish to execute their code in parallel. You will learn about the benefits of parallel processing, how it can improve the performance of your code, and the different tools and frameworks that can be used to achieve it.
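As a small taste of the topic, here is a minimal sketch that uses the standard library's `concurrent.futures` to run a function over several inputs concurrently. The talk description does not list which frameworks it covers, so this is an illustrative starting point, and the `word_count` function and sample documents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def word_count(text: str) -> int:
    """Stand-in for a per-item task such as parsing a file or calling an API."""
    return len(text.split())


# Hypothetical workload: one small document per task.
documents: List[str] = [
    "extract transform load",
    "data engineering lunch",
    "parallel python",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() schedules the calls across worker threads and returns results in input order.
    counts = list(pool.map(word_count, documents))
```

Threads suit I/O-bound tasks. For CPU-bound work, `ProcessPoolExecutor` from the same module exposes the same `map` interface while running tasks in separate processes.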

  • In this lunch, we will introduce the concept of real-time analytics, why it is important, how analytics has evolved, and how companies such as LinkedIn, Stripe, Uber, and more are using real-time analytics with Apache Pinot to grow their audience and improve usability. We will then explain what Apache Pinot is, followed by a demo and Q&A.

  • In this lunch we will discuss the use of the Hadoop Distributed File System for data engineering applications.

  • A comprehensive exploration of the intricacies of Data Lake Table Formats and their impact on business analytics. Data lake table formats are a critical component of modern data analytics. By the end of this presentation, you will better understand data lake table formats and how they can be used to improve business analytics.