<html>
<head>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-HQ9863ZP35"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); }
gtag('js', new Date());
gtag('config', 'G-HQ9863ZP35');
</script>
<title>Capillaries</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Capillaries - distributed, supervised data processing">
<meta name="keywords" content="distributed, data, processing">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
<link rel="stylesheet" href="css/bootstrap.min.css">
<meta property="og:title" content="capillaries">
<meta property="og:type" content="website">
<meta property="og:url" content="http://capillaries.io/">
<meta property="og:image" content="http://capillaries.io/i/logo.svg">
<meta property="og:description" content="Capillaries - distributed, supervised data processing">
<style>
body {
border-top: 4px solid #666;
}
h1 {
font-weight: bold;
float: left;
}
h2 {
border-bottom: 1px solid #ccc;
padding-bottom: .2em;
margin-top:60px;
}
div.footer {
border-top: 1px solid #ccc;
padding: 1em 0 .5em;
margin-top: 3em;
}
img.logo {
margin-right: 1em;
margin-top: 20px;
}
</style>
</head>
<body>
<div class="container">
<div class="row" style="margin-top: 10px; margin-bottom: 20px;">
<div class="col-lg-6">
<img src="i/logo.svg" alt="Capillaries logo" class="pull-left logo" style="width:40px">
<h1>Capillaries</h1>
</div>
<div class="col-lg-3">
<ul class="nav nav-pills pull-right" style="margin-top: 20px">
<li><a href="./blog/index.html"><img src="i/web-content-icon.svg"
style="width: 20px;margin-bottom: 2px;margin-right: 6px;" alt="blog">Blog</a></li>
</ul>
</div>
<div class="col-lg-3">
<ul class="nav nav-pills pull-right" style="margin-top: 20px">
<li><a href="https://github.com/capillariesio/Capillaries"><img src="i/github-mark.svg"
style="width: 20px;margin-bottom: 3px;margin-right: 6px;" alt="github">Source code and
docs</a></li>
</ul>
</div>
</div>
<h3>Distributed data processing platform<br/>focused on delivering enriched, customer-ready, production-quality data
within SLA time limits</h3>
<div class="row">
<div class="col-lg-12">
<object data="i/capi-animation.svg" style="width:100%;margin-top:100px;margin-bottom:100px;"></object>
</div>
</div>
<h2>What are the use cases where Capillaries excels?</h2>
<div class="row">
<div class="col-lg-1">
<img src="i/speed-test-icon.svg" alt="predict" style="width:60px;margin-top:20px;">
</div>
<div class="col-lg-11">
<h3>Predictable performance</h3>
Capillaries prioritizes predictable performance, which is crucial for SLA management.
Ideally, after a few test runs, data engineers should be able to give reasonably accurate
predictions
about data transformation completion times for:
<ul>
<li>larger datasets of the same nature</li>
<li>bigger deployments from the same cloud provider</li>
</ul>
</div>
</div>
<div class="row">
<div class="col-lg-1">
<img src="i/configuration-icon.svg" alt="configuration" style="width:60px;margin-top:20px;">
</div>
<div class="col-lg-11">
<h3>Configuration management consistency</h3>
1. The data processing DAG and relational algebra operations are part of the <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#script">Capillaries script</a> and are specified declaratively as JSON configuration.<br/>
2. <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#go-expressions">Go expressions</a> used in field transformations are just one-liners, leaving little room for error.<br/>
3. While row-level <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#py_calc-processor">Python formulas</a> can be very complex, they can easily be covered with <a href="https://github.com/capillariesio/capillaries/blob/main/test/data/cfg/portfolio_quicktest/py/portfolio_calc_test.py">unit tests</a>.<br/>
</div>
</div>
<div class="row">
<div class="col-lg-1">
<img src="i/piggy-bank-icon.svg" alt="savings" style="width:60px;margin-top:20px;">
</div>
<div class="col-lg-11">
<h3>Conservative cloud resource use</h3>
Capillaries shines when data processing is both calculation-heavy and data-heavy, and must be
performed periodically (daily, weekly, quarterly).
It can run on private or public VM or container infrastructure, which can be allocated and
provisioned within minutes and disposed of immediately after all transformations are complete.
</div>
</div>
<div class="row">
<div class="col-lg-1">
<img src="i/pass-icon.svg" alt="validation" style="width:60px;margin-top:20px;">
</div>
<div class="col-lg-11">
<h3>Operator interaction</h3>
Capillaries allows operators to validate data at selected processing steps and decide whether to
proceed.
</div>
</div>
<h2>Technical highlights</h2>
<div class="row">
<div class="col-lg-5">
<h3><img src="i/slideshow-line-icon.svg" alt="parallel processing" style="width:30px;margin-right:20px;">Parallel processing</h3>
1. Executes multiple data processing tasks (<a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#script-node">DAG nodes</a>) simultaneously.<br/>
2. Splits large data volumes into smaller batches for parallel processing.
</div>
<div class="col-lg-3">
<h3><img src="i/shield-checkmark-line-icon.svg" alt="fault tolerance" style="width:23px;margin-right:20px;">Fault tolerance</h3>
Designed to withstand temporary database connectivity issues and <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#daemon">worker</a> node failures.
</div>
<div class="col-lg-4">
<h3><img src="i/reload-sync-icon.svg" alt="incremental computing" style="width:23px;margin-right:20px;">Incremental computing</h3>
Allows splitting the entire data pipeline into separate <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#run">runs</a> that can be initiated
independently and re-run if needed.
</div>
</div>
<h2>Q &amp; A</h2>
<div class="row">
<div class="col-lg-12">
<h3>Is Capillaries ETL or ELT?</h3>
<p>
Capillaries is much more about the "T" than the "E" or "L":
</p>
<ul>
<li>simple transformations and filtering can be performed as the data is being loaded, whereas
complex transformations are performed after the data is loaded</li>
<li>the data is intended to be stored only until all transformations are complete and the result
files are produced</li>
</ul>
<p>
Capillaries is probably best described as "etlT".
</p>
<h3>Is Capillaries "low-code" or "no-code"?</h3>
<p>Capillaries is definitely "some-code" because data transformation rules may include <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#go-expressions">Go expressions</a>
and/or complex <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#py_calc-processor">Python formulas</a>. The "code" part applies only to the business logic, while the "orchestration" part does
not require coding at
all.
</p>
<h3>Why should I prefer Capillaries over my custom data pipelines?</h3>
<p>
Capillaries handles orchestration, <a href="https://capillaries.io/blog/2024-08-10-scaleup-scaleout/index.html">scalability</a>, and intermediate data storage, so you can focus
solely on the transformation logic.
</p>
<h3>Why should I prefer Capillaries over other distributed processing systems?</h3>
<ul>
<li>it's free and <a href="https://github.com/capillariesio/capillaries">open-source</a></li>
<li>it can be quickly <a href="https://github.com/capillariesio/capillaries/blob/main/doc/what.md#capillaries-components">deployed</a> on private or public VM or container infrastructure and disposed of
when no longer needed</li>
<li>it's better than no-code systems because it allows you to perform complex Python calculations at
the row level</li>
<li>it's better than code-heavy systems because it doesn’t require deep knowledge of any programming
language</li>
<li>with intermediate data stored in <a href="https://github.com/capillariesio/capillaries/blob/main/doc/transcript_portfolio.md">Cassandra tables</a>, all data processing steps are extremely
transparent, making
troubleshooting easier</li>
</ul>
<h3>What do I need to run Capillaries?</h3>
To set up a <a href="https://github.com/capillariesio/capillaries/blob/main/doc/what.md#capillaries-components">Capillaries environment</a>, you need to provide:
<ul>
<li>a Cassandra cluster</li>
<li>a few VMs/containers running <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#daemon">Capillaries workers</a></li>
<li>a VM/container running Capillaries <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#webapi">Webapi</a> and <a href="https://capillaries.io/blog/2023-02-20-ui/index.html">UI</a></li>
<li>a VM/container running RabbitMQ server</li>
<li>monitoring and logging infrastructure (optional, but recommended)</li>
</ul>
To run data processing for a specific dataset, you need to provide the following files, served from an NFS
drive, HTTP(S) server, or S3 bucket:
<ul>
<li>a <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#script">Capillaries script</a> containing the DAG and <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#go-expressions">Go field transformation expressions</a></li>
<li>Python files with <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#py_calc-processor">formulas</a> used for row-level data transformations (optional)</li>
<li>input data in CSV or Parquet files</li>
</ul>
You also need a browser to use the <a href="https://capillaries.io/blog/2023-02-20-ui/index.html">Capillaries UI</a>, or a REST API client to call <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#webapi">Capillaries Webapi</a> directly. After a <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#run">Capillaries run</a> is complete, you get a set of files (NFS or S3) containing the transformed data.
<h3>Do I need to know SQL or a similar query language to define Capillaries transforms?</h3>
No. Capillaries implements some transformations that use relational algebra concepts like <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#lookup">lookups</a>,
grouping, and <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#tag_and_denormalize-processor">denormalization</a>, but users specify these transformations declaratively in the <a href="https://github.com/capillariesio/capillaries/blob/main/doc/glossary.md#script">script</a> file.
</div>
</div>
<div class="row">
<div class="col-lg-12">
<a href="https://github.com/capillariesio/Capillaries">
<h3><img src="i/github-mark.svg" style="width: 30px;margin-bottom: 6px;margin-right: 8px;"
alt="github">Source code and docs</h3>
</a>
</div>
</div>
<div class="row">
<div class="col-lg-12">
<a href="./blog/index.html">
<h3><img src="i/web-content-icon.svg" style="width: 30px;margin-bottom: 4px;margin-right: 8px;"
alt="blog">Blog</h3>
</a>
</div>
</div>
<div class="footer">
<p>©
<script type="text/javascript">document.write("2022-" + new Date().getFullYear());</script>
Capillaries.io
</p>
</div>
</div>
</body>
</html>