Skip to content

Commit

Permalink
DLDG
Browse files Browse the repository at this point in the history
  • Loading branch information
rtyler committed Nov 16, 2024
1 parent 95dd7fc commit 4d10d7d
Show file tree
Hide file tree
Showing 2 changed files with 173 additions and 0 deletions.
172 changes: 172 additions & 0 deletions _posts/2024-11-15-deltalake-the-definitive-guide.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,172 @@
---
layout: post
title: "From the beginning, delta-rs to Delta Lake: The Definitive Guide"
tags:
- databricks
- deltalake
- buoyantdata
---

<a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200"/></a>

Nothing quite feels like "I made it!" like being _published_. Which is why I am
thrilled to share that [Delta Lake: The Definitive
Guide](https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942)
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the [Delta Lake](https://delta.io) ecosystem was about as unplanned as my
authorship of part of this wonderful book.


The [delta-rs](https://github.com/delta-io/delta-rs) project started in 2020 and I wish that I could say it is because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
[Scribd](https://tech.scribd.com) where we were migrating to AWS, Databricks,
and a new architecture _anyways_ so why not challenge the orthodoxy? My
colleague [QP Hou](https://about.houqp.me/) and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love [Apache
Spark](https://spark.spache.org) for some of the things we needed to do.

I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:

----

Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source native client
interface for Delta Lake.

For framing this conversation and scope of the native interface, I categorize
our compute workloads into three groups:

1. **Big offline data processing**, requiring a cluster of compute resources where Spark makes a big dent.
1. **Lightweight/small offline data processing**, workloads needing "fractional
compute" resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we've discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn't make sense).
1. **Boundary data-processing**, where the task might involve a little bit of
production "online" data and a little bit of warehouse "offline" data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.

I don't want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.

My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of _initially_ addressing
our '2' and '3' use-cases.

[..]

Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easy extend that within the Databricks community to support
Golang, Python, Ruby, and Node clients.

The key benefits I imagine for us all:

* Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D

* It's a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)

* Scribd is able to "immediately" take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.

The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent [parquet crate](https://crates.io/crates/parquet) in Rust. That
said, there may be some writer improvements we'd want/need to push upstream to
Apache Arrow to make this successful.

----


On second thought, almost all of this has come true! What a brilliant sage! (plz clap)

Like many advancements, there's a right time, a right place, and a right group
of people. Unfortunately Databricks didn't join the party until a later on but
were a strong supporter of our initial work, providing guidance and helping to
make [Delta Lake](https://delta.io) an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible with [Neville](https://github.com/nevi-me) helped make
[arrow-rs](https://github.com/apache/arrow-rs) a much better [Apache
Parquet](https://parquet.apache.org) writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
[Christian Williams](https://github.com/xianwill) built out
[kafka-delta-ingest](https://github.com/delta-io/kafka-delta-ingest) with
[mykhailo Osypov](https://github.com/mosyp) and helped prove that: **Rust is
way more efficient for data ingestion workloads.**. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.

Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. THe community grew, Databricks grew, and at
some point some folks started working on a book.

As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.

I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed "Chapter 8: TBD".

What's supposed to be Chapter 8? "_We're not sure yet._"

My friend [Kohsuke](https://kohsuke.org) once marveled at how I was able to
acquire things for the [Jenkins](https://jenkins.io) by the simple act of
asking for them. There's some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say "no"
helps a lot.

"So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!"

[Denny](https://dennyglee.com/) needed another chapter and I asked if I could
write about building native data applications in Rust and Python.

----


[Scribd](https://tech.scribd.com) is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the _other_ side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O'Reilly are talented and professional.

Going through copy-editing I was fielding review comments on the consistency of
tense, the subject of sentences, discussions about what is a proper noun and
how to consistently apply terms through _hundreds of pages_ of content.


Over the years I have tried and failed to explain what I do to family members.
For people that don't work in tech "working on the computer" all looks largely
the same, especially for older generations. Having your work, your name _in
print_ has an intangible "wow" factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.


If you're data inclined, I recommend picking up a copy, Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and also I contributed a
bit too! :)




1 change: 1 addition & 0 deletions images/post-images/2024-deltalake/book-cover.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 4d10d7d

Please sign in to comment.