-
-
Notifications
You must be signed in to change notification settings - Fork 17.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Prostgres Vector Store] Add PGVector Driver option + Fix null character issue w/ TypeORM Driver #3367
[Prostgres Vector Store] Add PGVector Driver option + Fix null character issue w/ TypeORM Driver #3367
Conversation
const embeddingString = `[${query.join(',')}]` | ||
let _filter = '{}' | ||
let notExists = '' | ||
if (filter && typeof filter === 'object') { | ||
if (filter.$notexists) { | ||
notExists = `OR NOT (metadata ? '${filter.$notexists}')` | ||
delete filter.$notexists | ||
} | ||
_filter = JSON.stringify(filter) | ||
} | ||
|
||
const queryString = ` | ||
SELECT *, embedding <=> $1 as "_distance" | ||
FROM ${tableName} | ||
WHERE metadata @> $2 | ||
${notExists} | ||
ORDER BY "_distance" ASC | ||
LIMIT $3;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm doubtful about this logic... (i pasted it as is from legacy code)
The WHERE clause sounds like: Where metadata
includes filter
OR metadata
keys not includes chatflowIdKey
which seems to make the first part optional when second part is truthy, and this is clearly not the expected behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PGVector driver, i removed the chatflow part from filters and added a AND (<compareExistingValue> OR <keyNotExist>)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original intention is to filter these conditions:
- metadata contains the JSON structure (filter)
OR - metadata does not contains the key
flowise_chatId
Say the following scenario:
1.) Owner of the chatflow upserted data to the postgres vector store through API, ex: [{pageContent: 'abc', metadata: {}}]
2.) User A interacts with the chatflow and upload some files. This will be upserted on the fly, resulting in [{pageContent: 'this is userA', metadata: { flowise_chatId: userA-session }}]
3.) User B interacts with the chatflow and upload some files. This will be upserted on the fly, resulting in [{pageContent: 'this is userB', metadata: { flowise_chatId: userB-session }}]
Now, when user A is asking questions, we want to only get 2 sources:
[
{pageContent: 'abc', metadata: {}}
{pageContent: 'this is userA', metadata: { flowise_chatId: userA-session }}
]
We do not want to get source with metadata of flowise_chatId: userB-session
Hence this results in the following filters:
metadata contains that specific chatId OR metadata that doesn't have the flowise_chatId
key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tanks for your response @HenryHengZJ 🙏
We're ok about the expected behavior 👍
Now imagine a 4th scenario:
For whatever reason, i add a metadata filter on the vector store node, ex : [{pageContent: 'this is userA', metadata: { flowise_chatId: userA-session, foo: 'bar' }}]
With the actual or clause, the or clause will be:
WHERE metadata @> '{ flowise_chatId: \'userA-session\', foo: \'bar\' }'
OR NOT (metadata ? 'flowise_chatId')
This will return:
- Chatflow uploaded documents filtered with
foo=bar
- Any documents without
flowise_chatId
, not filtered withfoo=bar
If we want the right behavior, the where clause should be:
WHERE metadata @> '{ foo: \'bar\' }'
AND (metadata @> '{ flowise_chatId: \'userA-session\' }' OR NOT (metadata ? 'flowise_chatId'))
This will return:
- Chatflow uploaded documents filtered with
foo=bar
- Any documents filtered with
foo=bar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as an alternative, we can also do:
WHERE metadata @> '{ foo: \'bar\' }'
OR metadata @> '{ flowise_chatId: \'userA-session\' }'
This will return:
- Any document with
foo=bar
- Chatflow uploaded documents, not filtered with
foo=bar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think its 3 clauses:
- Any documents with metadata contains
foo: bar
- or any documents with metadata contains
foo: bar
, ANDflowise_chatId: <chatid>
- or any documents with metadata contains
foo: bar
, AND withoutflowise_chatId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we don't care about foo: bar
if there is a flowise_chatId
set, because if it's been uploaded via the chatflow, it's always part of the chatflow context (and filters will not be added to the document's metadata).
Basically, the expectations are:
- Return any document uploaded via the chatflow
- Return filtered documents from the vector store
- Don't return the documents uploaded via other chatflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part:
AND NOT (metadata ? 'flowise_chatId')
is just present to ensure that we don't return documents from other chatflows if there is no filter set:
WHERE (metadata @> '{}' AND NOT (metadata ? 'flowise_chatId'))
OR metadata @> '{ flowise_chatId: \'userA-session\' }'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so this will results if either of these two conditions is true:
- The metadata contains
foo': 'bar
and does not haveflowise_chatId
. - The metadata contains
flowise_chatId': 'userA-session
, regardless of whetherfoo': 'bar
is present or not.
Rewind back the scenario, and see if this fits:
1.) Owner upserted documents without metadata: pageContent: 'this is doc 1', metadata: {}
2.) Owner upserted documents WITH metadata: pageContent: 'this is doc 2', metadata: { source: doc2 }
Owner publish the chatflow to public
3.) User A uploaded a file from the chat window, results in pageContent: 'this is user A', metadata: { flowise_chatId: userA }
4.) User B uploaded a file from the chat window, results in pageContent: 'this is user B', metadata: { flowise_chatId: userB }
Now 4 docs are sitting in vector store:
1.) pageContent: 'this is doc 1', metadata: {}
2.) pageContent: 'this is doc 2', metadata: { source: doc2 }
3.) pageContent: 'this is user A', metadata: { flowise_chatId: userA }
4.) pageContent: 'this is user B', metadata: { flowise_chatId: userB }
Scenario 1: Owner doesnt set any filter on metadata. User A when chatting, should only get documents (1) (2) and (3)
Scenario 2: Owner pre set filter on metadata. User A when chatting, should only get documents (2) and (3)
Looks like the new clause you have there satisfy these 2 scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 0a39892
return this._postgresConnectionOptions | ||
} | ||
|
||
async getArgs(): Promise<PGVectorStoreArgs> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for pgvector
, can we allow the option for choosing distance strategy? #2198
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably, i don't know about it but it's a good learning opportunity ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HenryHengZJ done in 967f0bb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i did it for both PGVector an TypeORM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! will test it out
@HenryHengZJ Last commit (4e1f4c4) allows to set default Postgres credentials via env variables (for both vector store and record manager) and makes credentials optional if |
thanks for that! I understand you prefer to read the value from an env file, but given that we have too many components, doing this will introduce too much env variables, and we wanted to limit env variables to only settings of the flowise apps, not for each and every components. having said that, you can achieve the same using Variables: https://docs.flowiseai.com/using-flowise/variables#using-variables For example, if I wanted to read Pinecone Index value, I can create a variable as such: |
The reason why i'm doing this, is because of the Also, i wan't to be able to deploy my stack all in once without having to manually setup credentials on each stage and put identifiers in client apps, or it will be an headache to deploy. For this part, having the capacity to setup credentials as default for each category would do the job, but it it doesn't solve the deletion issue. |
} | ||
|
||
protected async adaptInstance(instance: PGVectorStore, metadataFilters?: any): Promise<PGVectorStore> { | ||
const { [FLOWISE_CHATID]: chatId, ...pgMetadataFilter } = metadataFilters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this error when metadatFilters
is undefined
const { [FLOWISE_CHATID]: chatId, ...pgMetadataFilter } = metadataFilters || {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a7de414
I'm working on some issues of my project ATM, but i'll re-test all this stuff asap
instance.client = await instance.pool.connect() | ||
} | ||
|
||
await instance.client.connect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this error:
Error inserting: Client has already been connected. You cannot reuse a client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm sorry, i implemented that because i got Too many connections
error while using for a long period, but didn't test it since. I'll fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 3527383
got it, in that case I'd suggest to add a README to the |
LGTM 👍 The expected behavior is that env vars act as default values and makes credentials optional, but it can be override by credentials if set. |
Done in 675641b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for late reply, thank you so much @JJK801 ! all looks good
my pleasure |
Fix #3362
Hi Flowise,
Here is a PR that improves PG Vector Store by:
driver
config (which can betypeorm
by default, orpgvector
)contentColumnName
(pgvector
only), so that some can migrate easily from other toolstypeorm
driverlet me know if you need some more details or changes.