
Vespa sample applications - Long-Context ColBERT

This semantic search application demonstrates Long-Context ColBERT (multi-token vector representation) with extended context windows for long-document retrieval.

It uses the colbert-embedder together with tensor expressions that implement two extended forms of ColBERT late interaction for long-context retrieval.

See Announcing Vespa Long-Context ColBERT for details on this application.
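As a rough illustration of the two scoring variants, the sketch below computes ColBERT MaxSim over a document split into several context windows. It is plain Python and not the tensor expressions used in the application. It assumes that the context-level variant scores each window independently and keeps the best window, while the cross-context variant lets each query token match its best document token in any window. The function names are hypothetical.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def max_sim(query: List[Vector], doc_tokens: List[Vector]) -> float:
    """ColBERT MaxSim: for each query token, take the best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query)

def context_level_max_sim(query: List[Vector], contexts: List[List[Vector]]) -> float:
    """Assumed context-level variant: score each window alone, keep the best."""
    return max(max_sim(query, ctx) for ctx in contexts)

def cross_context_max_sim(query: List[Vector], contexts: List[List[Vector]]) -> float:
    """Assumed cross-context variant: each query token may match in any window."""
    all_tokens = [tok for ctx in contexts for tok in ctx]
    return max_sim(query, all_tokens)

# Two query tokens; each one only matches a token in a different window.
query = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[[1.0, 0.0]], [[0.0, 1.0]]]
context_score = context_level_max_sim(query, contexts)
cross_score = cross_context_max_sim(query, contexts)
```

Under these assumptions the cross-context score is never lower than the context-level score, because each query token chooses from a superset of the tokens available in any single window.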

Requires at least Vespa 8.311.28

To try this application

Follow Vespa getting started through the vespa deploy step, cloning colbert-long instead of album-recommendation.

Feed documents (this includes embed inference in Vespa):

vespa feed ext/sample-docs.jsonl

Example query using BM25:

vespa query 'yql=select * from doc where userQuery()' \
 'ranking=bm25' 'hits=1' \
 'query=What is the frequency of Radio KP?'

Example queries using ColBERT, one for each of the two ranking profiles:

vespa query 'yql=select * from doc where userQuery()' \
 'ranking=colbert-max-sim-context-level' 'hits=1' \
 'query=What is the frequency of Radio KP?' \
 'input.query(qt)=embed(colbert, @query)'

vespa query 'yql=select * from doc where userQuery()' \
 'ranking=colbert-max-sim-cross-context' 'hits=1' \
 'query=What is the frequency of Radio KP?' \
 'input.query(qt)=embed(colbert, @query)'
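The same queries can also be sent over HTTP. The following sketch only builds the request URL for the Vespa query API from the parameters used in the CLI examples. The helper name `build_query_url` and the `http://localhost:8080` endpoint are assumptions for a local deployment and should be adjusted to your own setup.

```python
from urllib.parse import urlencode

def build_query_url(endpoint: str, query: str, ranking: str, hits: int = 1) -> str:
    """Build a /search/ URL mirroring the `vespa query` examples.

    `endpoint` is an assumed base URL such as http://localhost:8080.
    """
    params = {
        "yql": "select * from doc where userQuery()",
        "query": query,
        "ranking": ranking,
        "hits": hits,
    }
    # The ColBERT ranking profiles additionally need the query tensor,
    # produced by the embedder from the query text.
    if ranking.startswith("colbert"):
        params["input.query(qt)"] = "embed(colbert, @query)"
    return endpoint.rstrip("/") + "/search/?" + urlencode(params)

bm25_url = build_query_url(
    "http://localhost:8080", "What is the frequency of Radio KP?", "bm25"
)
colbert_url = build_query_url(
    "http://localhost:8080",
    "What is the frequency of Radio KP?",
    "colbert-max-sim-context-level",
)
```

The resulting URL can be fetched with any HTTP client; a Vespa Cloud endpoint additionally needs the data-plane key and certificate.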

Evaluate the effectiveness on long-document retrieval using the MLDR dataset

Install external dependencies:

pip3 install datasets langchain

Run the conversion script, which downloads the English split of the MLDR dataset and generates three files. This takes a few minutes, depending on bandwidth.

The feed file is written to /tmp/vespa_feed_file_en.json:

python3 scripts/convert.py
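For orientation, the sketch below shows what a single Vespa put operation for a long document might look like once the text has been split into context windows. It is not the contents of scripts/convert.py. The `text` field name, the `doc` namespace, the word-based splitter (a stand-in for a langchain text splitter) and the 200-word window size are all assumptions; only the `doc` document type is taken from the queries above.

```python
import json
from typing import List

def split_into_contexts(text: str, max_words: int = 200) -> List[str]:
    """Naive word-based splitter: consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def to_feed_operation(doc_id: str, text: str, max_words: int = 200) -> str:
    """Serialize one document as a single-line Vespa put operation (JSONL)."""
    operation = {
        # Document id format: id:<namespace>:<document type>::<user id>.
        # The namespace 'doc' is a hypothetical choice.
        "put": f"id:doc:doc::{doc_id}",
        # 'text' is a hypothetical field holding one string per context window.
        "fields": {"text": split_into_contexts(text, max_words)},
    }
    return json.dumps(operation)

line = to_feed_operation("example-1", "one two three", max_words=2)
```

One such line per document, written to a file, is a format that `vespa feed` accepts.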

Index the dataset. If you are running on CPU, or feeding longer documents, increase the feed operation timeout. With the default timeout, operations that cannot complete in time are retried without ever succeeding.

vespa feed /tmp/vespa_feed_file_en.json --timeout 600 --connections 1 

Run the queries (replace the endpoint and the mTLS key and certificate paths with your own):

python3 evaluate.py --endpoint https://b5af15f0.e2b4d78d.z.vespa-app.cloud/search/ \
  --ranking colbert-max-sim-context-level --dataset ext/test_queries.tsv --rank_count 10 \
  --key $HOME/.vespa/samples.long-colbert.default/data-plane-private-key.pem \
  --cert $HOME/.vespa/samples.long-colbert.default/data-plane-public-cert.pem

Then evaluate effectiveness, for example with trec_eval. The step above creates a .run file named after the ranking argument.

trec_eval -m ndcg_cut.10 ext/test_en_qrels.tsv colbert-max-sim-context-level.run
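To make the reported metric concrete, here is a minimal sketch of nDCG@10 for a single query. It assumes the common definition with linear gain (the relevance grade itself) and a log2 rank discount. The function name `ndcg_at_k` is hypothetical, and trec_eval remains the reference implementation for reported numbers.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, given the ranked ids and graded relevance judgments."""
    def dcg(gains: List[int]) -> float:
        # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on.
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    # Gains of the returned documents; unjudged documents count as 0.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Ideal ranking: all relevant documents sorted by decreasing grade.
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

qrels = {"d1": 1}
perfect = ndcg_at_k(["d1", "d2"], qrels)
second_place = ndcg_at_k(["d2", "d1"], qrels)
```

The per-query values are averaged over all queries to obtain the figure reported for a ranking profile.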

Terminate

Remove the container after use (only relevant for our automatic testing of this sample app):

docker rm -f vespa