Commit f27d6a9

fix: outdated readme and envd (#42)
Signed-off-by: usamoi <usamoi@outlook.com>
1 parent be11548 commit f27d6a9

3 files changed

+172
-67
lines changed

README.md

+131-44
Original file line number | Diff line number | Diff line change
@@ -16,84 +16,130 @@ pgvecto.rs is a Postgres extension that provides vector similarity search functi
1616
- 🦀 **Rewrite in Rust**: Rewriting in Rust offers benefits such as improved memory safety, better performance, and reduced **maintenance costs** over time.
1717
- 🙋 **Community**: People love Rust. We are happy to help you with any questions you may have. You can join our [Discord](https://discord.gg/KqswhpVgdU) to get in touch with us.
1818

19-
## Installation from Source
19+
## Installation
2020

2121
<details>
22-
<summary>Build from Source</summary>
22+
<summary>Build from source</summary>
2323

2424
### Install Rust and base dependency
25+
2526
```sh
26-
apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang
27+
sudo apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang git
2728
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
2829
```
2930

30-
### Install pgrx (tensorchord's fork)
31+
### Clone the Repository
32+
33+
```sh
34+
git clone https://github.com/tensorchord/pgvecto.rs.git
35+
cd pgvecto.rs
36+
```
37+
38+
### Install PostgreSQL and pgrx
39+
3140
```sh
41+
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
42+
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
43+
sudo apt-get update
44+
sudo apt-get -y install libpq-dev postgresql-15 postgresql-server-dev-15
3245
cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $(cat Cargo.toml | grep "pgrx =" | awk -F'rev = "' '{print $2}' | cut -d'"' -f1)
33-
cargo pgrx init
46+
cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config
3447
```
3548

36-
### Build the extension and config postgres
49+
### Install pgvecto.rs
50+
3751
```sh
3852
cargo pgrx install --release
39-
psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors"'
4053
```
54+
4155
You need to restart your PostgreSQL server for the changes to take effect, for example with `systemctl restart postgresql.service`.
56+
4257
</details>
4358

59+
<details>
60+
<summary>Install from release</summary>
61+
62+
Download the deb package in the release page, and type `sudo apt install vectors-pg15-*.deb` to install the deb package.
63+
64+
</details>
65+
66+
Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`.
67+
68+
```sh
69+
psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"'
70+
```
71+
72+
You need to restart the PostgreSQL cluster.
73+
74+
```
75+
sudo systemctl restart postgresql.service
76+
```
4477

45-
## Install the extension in postgres
78+
Connect to the database and enable the extension.
4679

4780
```sql
48-
-- install the extension
4981
DROP EXTENSION IF EXISTS vectors;
5082
CREATE EXTENSION vectors;
51-
-- check the extension related functions
52-
\df+
5383
```
5484

55-
## Get started with pgvecto.rs
85+
## Get started
86+
87+
pgvecto.rs allows columns of a table to be defined as vectors.
88+
89+
The data type `vector(n)` denotes an n-dimensional vector. The `n` within the brackets signifies the dimensions of the vector. For instance, `vector(1000)` would represent a vector with 1000 dimensions, so you could create a table like this.
90+
91+
```sql
92+
-- create table with a vector column
93+
94+
CREATE TABLE items (
95+
id bigserial PRIMARY KEY,
96+
embedding vector(3) NOT NULL
97+
);
98+
```
99+
100+
You can then populate the table with vector data as follows.
101+
102+
```sql
103+
-- insert values
104+
105+
INSERT INTO items (embedding)
106+
VALUES ('[1,2,3]'), ('[4,5,6]');
107+
```
56108

57-
We support three operators to calculate the distance between two vectors:
109+
We support three operators to calculate the distance between two vectors.
58110

59-
- `<->`: square Euclidean distance
60-
- `<#>`: negative dot product distance
61-
- `<=>`: negative square cosine distance
111+
- `<->`: squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
112+
- `<#>`: negative dot product distance, defined as $- \Sigma x_iy_i$.
113+
- `<=>`: negative squared cosine distance, defined as $- \frac{(\Sigma x_iy_i)^2}{\Sigma x_i^2 \Sigma y_i^2}$.
62114

63115
```sql
64116
-- call the distance function through operators
65117

66-
-- square Euclidean distance
118+
-- squared Euclidean distance
67119
SELECT '[1, 2, 3]' <-> '[3, 2, 1]';
68-
-- dot product distance
120+
-- negative dot product distance
69121
SELECT '[1, 2, 3]' <#> '[3, 2, 1]';
70-
-- cosine distance
122+
-- negative square cosine distance
71123
SELECT '[1, 2, 3]' <=> '[3, 2, 1]';
72124
```
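
As a worked check of the three definitions, the values below are computed by hand for the vectors used in the queries above, $x = [1, 2, 3]$ and $y = [3, 2, 1]$. This assumes the operators implement exactly the formulas listed; the precise floating-point formatting returned by PostgreSQL may differ.

$$
\begin{aligned}
\Sigma (x_i - y_i)^2 &= (1-3)^2 + (2-2)^2 + (3-1)^2 = 8 \\
-\Sigma x_i y_i &= -(1 \cdot 3 + 2 \cdot 2 + 3 \cdot 1) = -10 \\
-\frac{(\Sigma x_i y_i)^2}{\Sigma x_i^2 \, \Sigma y_i^2} &= -\frac{10^2}{14 \cdot 14} = -\frac{25}{49} \approx -0.51
\end{aligned}
$$

In all three cases a smaller value means a closer match, which is why the dot product and cosine variants are negated: an ascending `ORDER BY` then returns the nearest neighbors first.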
73125

74-
Note that, "square Euclidean distance" is defined as $ \Sigma (x_i - y_i) ^ 2 $, "negative dot product distance" is defined as $ - \Sigma x_iy_i $, and "negative square cosine distance" is defined as $ - \frac{(\Sigma x_iy_i)^2}{\Sigma x_i^2 \Sigma y_i^2} $, so that you can use `ORDER BY` to perform a KNN search directly without a `DESC` keyword.
75-
76-
### Create a table
77-
78-
You could use the `CREATE TABLE` statement to create a table with a vector column.
126+
You can search for a vector simply like this.
79127

80128
```sql
81-
-- create table
82-
CREATE TABLE items (id bigserial PRIMARY KEY, emb vector(3));
83-
-- insert values
84-
INSERT INTO items (emb) VALUES ('[1,2,3]'), ('[4,5,6]');
85129
-- query the similar embeddings
86-
SELECT * FROM items ORDER BY emb <-> '[3,2,1]' LIMIT 5;
130+
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
87131
-- query the neighbors within a certain distance
88-
SELECT * FROM items WHERE emb <-> '[3,2,1]' < 5;
132+
SELECT * FROM items WHERE embedding <-> '[3,2,1]' < 5;
89133
```
90134

91-
### Create an index
135+
### Indexing
92136

93-
You can create an index, using HNSW algorithm and square Euclidean distance with the following SQL.
137+
You can create an index, using squared Euclidean distance with the following SQL.
94138

95139
```sql
96-
CREATE INDEX ON train USING vectors (emb l2_ops)
140+
-- Using HNSW algorithm.
141+
142+
CREATE INDEX ON items USING vectors (embedding l2_ops)
97143
WITH (options = $$
98144
capacity = 2097152
99145
size_ram = 4294967296
@@ -103,12 +149,10 @@ storage = "ram"
103149
m = 32
104150
ef = 256
105151
$$);
106-
```
107152

108-
Or using IVFFlat algorithm.
153+
--- Or using IVFFlat algorithm.
109154

110-
```sql
111-
CREATE INDEX ON train USING vectors (emb l2_ops)
155+
CREATE INDEX ON items USING vectors (embedding l2_ops)
112156
WITH (options = $$
113157
capacity = 2097152
114158
size_ram = 2147483648
@@ -120,22 +164,56 @@ nprobe = 10
120164
$$);
121165
```
122166

123-
The index must be built on a vector column. Failure to match the actual vector dimension with the dimension type modifier may result in an unsuccessful index building.
167+
Now you can perform a KNN search with the following SQL simply.
124168

125-
The operator class determines the type of distance measurement to be used. At present, `l2_ops`, `dot_ops`, and `cosine_ops` are supported.
169+
```sql
170+
SELECT *, embedding <-> '[0, 0, 0]' AS score
171+
FROM items
172+
ORDER BY embedding <-> '[0, 0, 0]' LIMIT 10;
173+
```
126174

127-
You can specify the indexing and the vectors to be stored in the disk by setting `storage_vectors = "disk"`, and `storage = "disk"`. On this condition, `size_disk` must be specified.
175+
Please note, vector indexes are not loaded by default when PostgreSQL restarts. To load or unload the index, you can use `vectors_load` and `vectors_unload`.
128176

129-
Now you can perform a KNN search with the following SQL simply.
177+
```sql
178+
--- get the index name
179+
\d items
130180

131-
```SQL
132-
SELECT *, emb <-> '[0, 0, 0, 0]' AS score FROM items ORDER BY embedding <-> '[0, 0, 0, 0]' LIMIT 10;
181+
-- load the index
182+
SELECT vectors_load('items_embedding_idx'::regclass);
133183
```
134184

135185
We are planning to support more index types ([issue here](https://github.com/tensorchord/pgvecto.rs/issues/17)).
136186

137187
Welcome to contribute if you are also interested!
138188

189+
## Reference
190+
191+
### `vector` type
192+
193+
`vector` and `vector(n)` are both legal data types, where `n` denotes the number of dimensions of a vector.
194+
195+
The current implementation ignores dimensions of a vector, i.e., the behavior is the same as for vectors of unspecified dimensions.
196+
197+
There is only one exception: indexes cannot be created on columns without specified dimensions.
198+
199+
### Indexing
200+
201+
We utilize TOML syntax to express the index's configuration. Here's what each key in the configuration signifies:
202+
203+
| Key | Type | Description |
204+
| ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
205+
| capacity | integer | The index's capacity. The value should be greater than the number of rows in your table. |
206+
| size_ram | integer | (Optional) The maximum amount of memory the persistent part of the index can occupy. |
207+
| size_disk | integer | (Optional) The maximum size of the disk-backed memory-mapped file that the persistent part of the index can occupy. |
208+
| storage_vectors | string | `ram` ensures that the vectors always stay in memory while `disk` suggests otherwise. |
209+
| algorithm.ivf | table | If this table is set, the IVF algorithm will be used for the index. |
210+
| algorithm.ivf.storage | string | (Optional) `ram` ensures that the persistent part of the algorithm always stays in memory while `disk` suggests otherwise. |
211+
| algorithm.ivf.nlist | integer | (Optional) Number of cluster units. |
212+
| algorithm.ivf.nprobe | integer | (Optional) Number of units to query. |
213+
| algorithm.hnsw | table | If this table is set, the HNSW algorithm will be used for the index. |
214+
| algorithm.hnsw.storage | string | (Optional) `ram` ensures that the persistent part of the algorithm always stays in memory while `disk` suggests otherwise. |
215+
| algorithm.hnsw.m | integer | (Optional) Maximum degree of the node. |
216+
| algorithm.hnsw.ef | integer | (Optional) Search scope in building. |
139217

140218
## Why not a specialty vector database?
141219

@@ -148,7 +226,16 @@ Why not just use Postgres to do the vector similarity search? This is the reason
148226
UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
149227

150228
-- Create an index on the embedding column
151-
CREATE INDEX ON documents USING vectors (embedding l2_ops) WITH (algorithm = "HNSW");
229+
CREATE INDEX ON documents USING vectors (embedding l2_ops)
230+
WITH (options = $$
231+
capacity = 2097152
232+
size_ram = 4294967296
233+
storage_vectors = "ram"
234+
[algorithm.hnsw]
235+
storage = "ram"
236+
m = 32
237+
ef = 256
238+
$$);
152239

153240
-- Query the similar embeddings
154241
SELECT * FROM documents ORDER BY embedding <-> ai_embedding_vector('hello world') LIMIT 5;

build.envd

+7-23
@@ -1,24 +1,8 @@
1-
# syntax=v1
2-
3-
envdlib = include("https://github.com/tensorchord/envdlib")
4-
51
def build():
6-
base(dev=True)
7-
install.apt_packages(name=[
8-
"clang",
9-
"libreadline-dev",
10-
"zlib1g-dev",
11-
"flex",
12-
"bison",
13-
"libxslt-dev",
14-
"libssl-dev",
15-
"libxml2-utils",
16-
"xsltproc",
17-
"ccache",
18-
"pkg-config",
19-
])
20-
envdlib.rust()
21-
run(commands=[
22-
"cargo install cargo-pgrx --version 0.10.0-beta.1",
23-
"cargo pgrx init",
24-
])
2+
config.repo(url="https://github.com/tensorchord/pgvecto.rs")
3+
base(os="ubuntu20.04", language="python3")
4+
shell("zsh")
5+
io.copy("./envd.sh", "/tmp/build/envd.sh")
6+
io.copy("./rust-toolchain.toml", "/tmp/build/rust-toolchain.toml")
7+
io.copy("./Cargo.toml", "/tmp/build/Cargo.toml")
8+
run(commands=["cd /tmp/build", "sudo -u envd ./envd.sh"])

envd.sh

+34
@@ -0,0 +1,34 @@
1+
#!/usr/bin/bash
2+
3+
sudo apt-get update
4+
sudo apt-get install -y lsb-release
5+
sudo apt-get install -y gnupg
6+
echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" | sudo tee -a /etc/apt/sources.list.d/pgdg.list
7+
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
8+
sudo apt-get update
9+
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC sudo -E apt-get install tzdata
10+
sudo apt-get install -y build-essential
11+
sudo apt-get install -y libpq-dev
12+
sudo apt-get install -y libssl-dev
13+
sudo apt-get install -y pkg-config
14+
sudo apt-get install -y gcc
15+
sudo apt-get install -y libreadline-dev
16+
sudo apt-get install -y flex
17+
sudo apt-get install -y bison
18+
sudo apt-get install -y libxml2-dev
19+
sudo apt-get install -y libxslt-dev
20+
sudo apt-get install -y libxml2-utils
21+
sudo apt-get install -y xsltproc
22+
sudo apt-get install -y zlib1g-dev
23+
sudo apt-get install -y ccache
24+
sudo apt-get install -y clang
25+
sudo apt-get install -y git
26+
sudo apt-get install -y postgresql-15
27+
sudo apt-get install -y postgresql-server-dev-15
28+
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
29+
source "$HOME/.cargo/env"
30+
rev=$(cat Cargo.toml | grep "pgrx =" | awk -F 'rev = "' '{print $2}' | cut -d'"' -f1)
31+
cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $rev
32+
cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config
33+
sudo chmod 777 /usr/share/postgresql/15/extension/
34+
sudo chmod 777 /usr/lib/postgresql/15/lib/
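
The `rev=$(...)` pipeline in `envd.sh` (and the equivalent one-liner in the README) pins `cargo-pgrx` to the same revision that `Cargo.toml` declares for the `pgrx` dependency. The sketch below runs the same `grep | awk | cut` chain against a single sample line to show what it extracts. The sample line and its `0123abc` revision are hypothetical, since the real `Cargo.toml` entry is not part of this diff, and `extract_rev` is a helper name introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical Cargo.toml dependency line; the real revision is not shown in this diff.
sample='pgrx = { git = "https://github.com/tensorchord/pgrx.git", rev = "0123abc" }'

# Same pipeline as envd.sh, but reading from stdin instead of Cargo.toml.
extract_rev() {
  grep "pgrx =" |                 # keep only the pgrx dependency line
    awk -F 'rev = "' '{print $2}' |  # take everything after: rev = "
    cut -d'"' -f1                  # stop at the closing quote
}

rev=$(printf '%s\n' "$sample" | extract_rev)
echo "$rev"
```

Note that the pipeline relies on the exact spacing `pgrx =` and `rev = "` in `Cargo.toml`; if the dependency line were formatted differently, `rev` would come back empty and the `cargo install` command would fail on the missing `--rev` argument.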
