Skip to content

Commit

Permalink
Add bingo-elastic python asyncio support (#691)
Browse files Browse the repository at this point in the history
  • Loading branch information
khyurri authored Mar 14, 2022
1 parent 081db1b commit a426f2b
Show file tree
Hide file tree
Showing 10 changed files with 670 additions and 148 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/indigo-ci.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -643,7 +643,7 @@ jobs:
run: pip install ${GITHUB_WORKSPACE}/dist/epam.indigo-*manylinux1_x86_64.whl
working-directory: bingo/bingo-elastic/python
- name: Install dev dependencies
run: pip install -r requirements_dev.txt
run: pip install ".[dev]"
working-directory: bingo/bingo-elastic/python
- name: Setup elasticsearch
run: docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "indices.query.bool.max_clause_count=4096" docker.elastic.co/elasticsearch/elasticsearch:7.16.2
Expand All @@ -652,7 +652,7 @@ jobs:
- name: Build wheel
working-directory: bingo/bingo-elastic/python
run: |
pip install .
pip install ".[async]"
python setup.py bdist_wheel
cp dist/*.whl ${GITHUB_WORKSPACE}/dist/
- name: Run pylint
Expand Down
2 changes: 2 additions & 0 deletions bingo/bingo-elastic/python/.pylintrc
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,5 @@ disable=
R0903, # too-few-public-methods
E1136, # unsubscriptable-object
R0801, # duplicate-code
W0621, # redefined-outer-name
W0613, # unused-argument
109 changes: 94 additions & 15 deletions bingo/bingo-elastic/python/README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## Bingo API for using with Elasticsearch
## Bingo API for using with Elasticsearch

**IN DEVELOPMENT**

Expand All @@ -25,6 +25,15 @@ Install dependency using pip
pip install bingo-elastic
```

Install async version

```
pip install bingo-elastic[async]
```

bingo-elastic async version supports all the same methods to index and search
molecules as sync. To use async version, just instantiate `AsyncElasticRepository`


#### Elastisearch installation

Expand All @@ -41,21 +50,51 @@ Something simple could be done as following:
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "indices.query.bool.max_clause_count=4096" docker.elastic.co/elasticsearch/elasticsearch:7.15.1
```

### Usage
### Usage

#### Create ElasticRepository

Sync

```
repository = ElasticRepository(host="127.0.0.1", port=9200)
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
```

Other customisations like SSL, custom number of shards/replicas, refresh interval, and many more are supported
Async

```
repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
...
repository.close()
```

Async version also supports async context manager to auto close connections:

```
async with AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200) as rep:
...
```

Using with FastAPI:

```
app = FastAPI()
rep = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
# This gets called once the app is shutting down.
@app.on_event("shutdown")
async def app_shutdown():
await rep.close()
```


Other customisations like SSL, custom number of shards/replicas, refresh interval, and many more are supported by `ElasticRepository` and `AsyncElasticRepository`

#### Read Indigo records from file

IndigoRecord can be created from IndigoObject.
Full usage example:

Full usage example:
```
from bingo_elastic.model.record import IndigoRecord
from indigo import Indigo
Expand All @@ -75,8 +114,8 @@ cml = helpers.iterate_cml("compounds.cml")
smi = helpers.iterate_smiles("compounds.smi")
```

Also function `helpers.iterate_file(file: Path)` is available. This function
selects correct iterate function by file extension. The `file` argument must
Also function `helpers.iterate_file(file: Path)` is available. This function
selects correct iterate function by file extension. The `file` argument must
be `pathlib.Path` instance

```
Expand All @@ -89,30 +128,58 @@ sdf = helpers.iterate_file(Path("compounds.sdf"))

#### Index records into Elasticsearch

Full usage example:
Full usage example sync:

```
from bingo_elastic.model import helpers
from bingo_elastic.elastic import, ElasticRepository IndexName
from pathlib import Path
repository = ElasticRepository(host="127.0.0.1", port=9200)
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
sdf = helpers.iterate_file(Path("compounds.sdf"))
repository.index_records(sdf);
repository.index_records(sdf)
```

Full usage example async:

_Async indexing and search requires event loop created_

```
import asyncio
from bingo_elastic.model import helpers
from bingo_elastic.elastic import AsyncElasticRepository, IndexName
from pathlib import Path
async def index_compounds():
repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
sdf = helpers.iterate_file(Path("compounds.sdf"))
await repository.index_records(sdf)
asyncio.run(index_compounds)
```

*CAVEAT*: Elasticsearch doesn't have strict notion of commit, so records might appear in the index later on
Read more about it here - https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-refresh-interval-setting

For indexing one record the the method `ElasticRepository.index_record` can be used
For indexing one record the the method `ElasticRepository.index_record` can be used

#### Retrieve similar records from Elasticsearch

Sync:
```
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = repository.filter(similarity=alg, limit=20)
```

Async:
```
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = await repository.filter(similarity=alg, limit=20)
```

In this case we requested top-20 most similar molecules compared to `target` based on Tanimoto similarity metric

Supported similarity algorithms:
Expand All @@ -122,26 +189,38 @@ Supported similarity algorithms:

#### Find exact records from Elasticsearch

Sync:
```
exact_records = repository.filter(exact=target, limit=20)
```

In this case we requested top-20 candidate molecules with exact same fingerprint to `target`.
`target` should be an instance of `IndigoRecord` class.

Async:
```
exact_records = await repository.filter(exact=target, limit=20)
```

In this case we requested top-20 candidate molecules with exact same fingerprint to `target`.
`target` should be an instance of `IndigoRecord` class.


#### Subsctructure match of the records from Elasticsearch

Sync:
```
submatch_records = repository.filter(substructure=target)
```

Async:
```
submatch_records = await repository.filter(substructure=target)
```

In this case we requested top-10 candidate molecules with exact same fingerprint to `target`.

#### Custom fields for molecule records

_Async protocol exact same, just don't forget to `await`_

Indexing records with custom fields

```
Expand Down
Loading

0 comments on commit a426f2b

Please sign in to comment.