- Python >=3.10 and <=3.11
This is an implementation of the indexing pipeline, which by default stores indexes locally.
- From a command prompt, create and activate a virtual environment, and install the dependencies using requirements.txt.
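  For example, on Windows (a minimal sketch; adjust the requirements file path to your checkout):
  ```
  python -m venv .venv
  .venv\Scripts\activate
  pip install -r requirements.txt
  ```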
- Manually create the folders below:
  - For file system (see the mkdir sketch after this list):
    ```
    C:\Temp\unittest\infy_dpp_processor\STORAGE
    C:\Temp\unittest\infy_dpp_processor\STORAGE\data\input
    C:\Temp\unittest\infy_dpp_processor\STORAGE\data\config
    ```
    OR
  - For cloud storage:
    Make `input` and `config` folders inside the `data` folder relative to your cloud storage path, i.e. `DPP_STORAGE_ROOT_URI` in the script.
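  For the file-system option, the folders can be created from a Windows command prompt (a sketch; `mkdir` creates the intermediate directories when command extensions are enabled, which is the default):
  ```
  mkdir C:\Temp\unittest\infy_dpp_processor\STORAGE\data\input
  mkdir C:\Temp\unittest\infy_dpp_processor\STORAGE\data\config
  ```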
- Keep the input files and config files in the correct folders (check the script for config file names).
- Based on where you are running the script from, use/modify the config file:
  - Local system (a copy sketch follows this list):
    Take config files from `\config\dev\testing\`
    OR
  - Container image in VM:
    Refer to config files from `\config\dev\`
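  For the local-system case, a copy sketch (`<repo_root>` is a placeholder for your checkout path; the destination is the config folder created earlier):
  ```
  xcopy <repo_root>\config\dev\testing\*.* C:\Temp\unittest\infy_dpp_processor\STORAGE\data\config /Y
  ```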
- In the .env files, provide values against `DPP_STORAGE_ACCESS_KEY=` and `DPP_STORAGE_SECRET_KEY=`.
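  For reference, a minimal `.env` sketch (the values below are placeholders for your storage credentials):
  ```
  DPP_STORAGE_ACCESS_KEY=<your-access-key>
  DPP_STORAGE_SECRET_KEY=<your-secret-key>
  ```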
- If a centralized vector DB is being used to store indexes, then:
  - `infy_db_service` is expected to be running or deployed.
  - Modify the indexing pipeline input config file to enable only `infy_db_service` under `vectordb` and `sparseindex` of the `DbIndexer` processor config, and provide the `db_service_url`.
  - The URLs below are supposed to be added in the config against `db_service_url` (replace the hostname with the hostname where `infy_db_service` is deployed):
    - http://<hostname>:8005/api/v1/sparsedb/saverecords
    - http://<hostname>:8005/api/v1/vectordb/saverecords
- Provide the `index_name` and enable the index under the `DbIndexer` processor config, as shown below:
"DbIndexer": { "embedding": {}, "index": { "enabled": true, "index_name": "", "index_id": "" }, "storage": { "vectordb": { "faiss": {}, "infy_db_service": { "enabled": true, "configuration": { "db_service_url": "http://localhost:8005/api/v1/vectordb/saverecords", "model_name": "all-MiniLM-L6-v2", "collections": [ { "collection_name": "documents", "collection_secret_key": "", "chunk_type": "" } ] } } }, "sparseindex": { "bm25s": {}, "infy_db_service": { "enabled": true, "configuration": { "db_service_url": "http://localhost:8005/api/v1/sparsedb/saverecords", "method_name": "bm25s", "collections": [ { "collection_name": "documents", "collection_secret_key": "", "chunk_type": "" } ] } } } } }
- The indexing pipeline creates an `index_id`.
- Run the provided scripts for testing the indexing pipeline, e.g. `test_indexing_script_local_to_file_sys.ps1`.
  NOTE: While running the indexing script, ignore the "list index out of range" error from the Content Extractor processor for now.
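  For example, from a PowerShell prompt in the folder containing the script (assuming your execution policy permits running local scripts):
  ```
  .\test_indexing_script_local_to_file_sys.ps1
  ```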
- Before building the package, add values for `DPP_STORAGE_ACCESS_KEY` and `DPP_STORAGE_SECRET_KEY` in the `.env.tf` file.
- Run `BuildPackage.bat`.
- The package will be available at `apps\infy_dpp_processor\target`.
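  A typical build-and-verify sequence from the repository root (a sketch; only `BuildPackage.bat` comes from the steps above, the `dir` check is illustrative):
  ```
  BuildPackage.bat
  dir apps\infy_dpp_processor\target
  ```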
- Copy the folders below to the machine where you have access to create a docker image:
  - `apps\infy_dpp_processor\target`
  - `MyProgramFiles` (refer to `docs/notebook/src/use_cases/dpp/installation.ipynb`)

  The folder structure should look as below:
  ```
  <folder_root_path>
      /dpp_processor_app
      /Dockerfile
      /MyProgramFiles
  ```
- Create and activate a virtual environment, and install the packages.
- Create the docker image:
  ```
  docker build -t <ImageURI> .
  ```
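  If the image is built on one machine and deployed from a registry, it can be pushed with the standard docker command (`<ImageURI>` is the same tag used in the build step):
  ```
  docker push <ImageURI>
  ```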
- Create the `MyProgramFiles` folder (refer to `docs/notebook/src/use_cases/dpp/installation.ipynb`).
- Copy the package from `apps\infy_dpp_processor\target` to the target server machine where you want to deploy.
- Create and activate a virtual environment:
  ```
  python -m venv .venv
  source ./.venv/bin/activate
  ```
- Upgrade pip:
  ```
  pip install --upgrade pip
  ```
- Install the required dependencies:
  ```
  pip install -r requirements.txt
  ```