drop metadata requirement (#1712)
* drop metadata requirement

* fix linting

* Update docs for new knowledge

* more linting

* more linting

* make save_documents private

* update docs to the new way we use knowledge and include clearing memory
bhancockio authored Dec 5, 2024
1 parent 55456a2 commit 4405136
Showing 11 changed files with 63 additions and 78 deletions.
67 changes: 25 additions & 42 deletions docs/concepts/knowledge.mdx
@@ -48,7 +48,6 @@ from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSourc
content = "Users name is John. He is 30 years old and lives in San Francisco."
string_source = StringKnowledgeSource(
content=content,
metadata={"preference": "personal"}
)

# Create an LLM with a temperature of 0 to ensure deterministic outputs
@@ -74,28 +73,14 @@ crew = Crew(
tasks=[task],
verbose=True,
process=Process.sequential,
knowledge={
"sources": [string_source],
"metadata": {"preference": "personal"}
}, # Enable knowledge by adding the sources here. You can also add more sources to the sources list.
knowledge_sources=[string_source], # Enable knowledge by adding the sources here. You can also add more sources to the sources list.
)

result = crew.kickoff(inputs={"question": "What city does John live in and how old is he?"})
```

## Knowledge Configuration

### Metadata and Filtering

Knowledge sources support metadata for better organization and filtering. Metadata is used to filter the knowledge sources when querying the knowledge store.

```python Code
knowledge_source = StringKnowledgeSource(
content="Users name is John. He is 30 years old and lives in San Francisco.",
metadata={"preference": "personal"} # Metadata is used to filter the knowledge sources
)
```

### Chunking Configuration

Control how content is split for processing by setting the chunk size and overlap.
@@ -116,21 +101,28 @@ You can also configure the embedder for the knowledge store. This is useful if y
...
string_source = StringKnowledgeSource(
content="Users name is John. He is 30 years old and lives in San Francisco.",
metadata={"preference": "personal"}
)
crew = Crew(
...
knowledge={
"sources": [string_source],
"metadata": {"preference": "personal"},
"embedder_config": {
"provider": "openai", # Default embedder provider; can be "ollama", "gemini", e.t.c.
"config": {"model": "text-embedding-3-small"} # Default embedder model; can be "mxbai-embed-large", "nomic-embed-tex", e.t.c.
},
knowledge_sources=[string_source],
embedder={
"provider": "openai",
"config": {"model": "text-embedding-3-small"},
},
)
```

## Clearing Knowledge

If you need to clear the knowledge stored in CrewAI, you can use the `crewai reset-memories` command with the `--knowledge` option.

```bash Command
crewai reset-memories --knowledge
```

This is useful when you've updated your knowledge sources and want to ensure that the agents are using the most recent information.


## Custom Knowledge Sources

CrewAI allows you to create custom knowledge sources for any type of data by extending the `BaseKnowledgeSource` class. Let's create a practical example that fetches and processes space news articles.
@@ -174,12 +166,12 @@ class SpaceNewsKnowledgeSource(BaseKnowledgeSource):
formatted = "Space News Articles:\n\n"
for article in articles:
formatted += f"""
Title: {article['title']}
Published: {article['published_at']}
Summary: {article['summary']}
News Site: {article['news_site']}
URL: {article['url']}
-------------------"""
Title: {article['title']}
Published: {article['published_at']}
Summary: {article['summary']}
News Site: {article['news_site']}
URL: {article['url']}
-------------------"""
return formatted

def add(self) -> None:
@@ -189,17 +181,12 @@ URL: {article['url']}
chunks = self._chunk_text(text)
self.chunks.extend(chunks)

self.save_documents(metadata={
"source": "space_news_api",
"timestamp": datetime.now().isoformat(),
"article_count": self.limit
})
self._save_documents()

# Create knowledge source
recent_news = SpaceNewsKnowledgeSource(
api_endpoint="https://api.spaceflightnewsapi.net/v4/articles",
limit=10,
metadata={"category": "recent_news", "source": "spaceflight_news"}
)

# Create specialized agent
@@ -265,7 +252,7 @@ The latest developments in space exploration, based on recent space news article
- Implements three key methods:
- `load_content()`: Fetches articles from the API
- `_format_articles()`: Structures the articles into readable text
- `add()`: Processes and stores the content with metadata
- `add()`: Processes and stores the content

2. **Agent Configuration**:
- Specialized role as a Space News Analyst
@@ -299,31 +286,27 @@ You can customize the API query by modifying the endpoint URL:
recent_news = SpaceNewsKnowledgeSource(
api_endpoint="https://api.spaceflightnewsapi.net/v4/articles",
limit=20, # Increase the number of articles
metadata={"category": "recent_news"}
)

# Add search parameters
recent_news = SpaceNewsKnowledgeSource(
api_endpoint="https://api.spaceflightnewsapi.net/v4/articles?search=NASA", # Search for NASA news
limit=10,
metadata={"category": "nasa_news"}
)
```

## Best Practices

<AccordionGroup>
<Accordion title="Content Organization">
- Use descriptive metadata for better filtering
- Keep chunk sizes appropriate for your content type
- Consider content overlap for context preservation
- Organize related information into separate knowledge sources
</Accordion>

<Accordion title="Performance Tips">
- Use metadata filtering to narrow search scope
- Adjust chunk sizes based on content complexity
- Configure appropriate embedding models
- Consider using local embedding providers for faster processing
</Accordion>
</AccordionGroup>
</AccordionGroup>
9 changes: 2 additions & 7 deletions src/crewai/knowledge/knowledge.py
@@ -1,11 +1,10 @@
import os
from typing import Any, Dict, List, Optional

from typing import List, Optional, Dict, Any
from pydantic import BaseModel, ConfigDict, Field

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage
from crewai.utilities.constants import DEFAULT_SCORE_THRESHOLD

os.environ["TOKENIZERS_PARALLELISM"] = "false" # removes logging from fastembed

@@ -46,9 +45,7 @@ def __init__(
source.storage = self.storage
source.add()

def query(
self, query: List[str], limit: int = 3, preference: Optional[str] = None
) -> List[Dict[str, Any]]:
def query(self, query: List[str], limit: int = 3) -> List[Dict[str, Any]]:
"""
Query across all knowledge sources to find the most relevant information.
Returns the top_k most relevant chunks.
@@ -57,8 +54,6 @@ def query(
results = self.storage.search(
query,
limit,
filter={"preference": preference} if preference else None,
score_threshold=DEFAULT_SCORE_THRESHOLD,
)
return results

9 changes: 4 additions & 5 deletions src/crewai/knowledge/source/base_file_knowledge_source.py
@@ -1,13 +1,13 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union, List, Dict, Any
from typing import Dict, List, Union

from pydantic import Field

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from crewai.utilities.logger import Logger
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage
from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
from crewai.utilities.logger import Logger


class BaseFileKnowledgeSource(BaseKnowledgeSource, ABC):
@@ -49,10 +49,9 @@ def validate_paths(self):
color="red",
)

def save_documents(self, metadata: Dict[str, Any]):
def _save_documents(self):
"""Save the documents to the storage."""
chunk_metadatas = [metadata.copy() for _ in self.chunks]
self.storage.save(self.chunks, chunk_metadatas)
self.storage.save(self.chunks)

def convert_to_path(self, path: Union[Path, str]) -> Path:
"""Convert a path to a Path object."""
8 changes: 4 additions & 4 deletions src/crewai/knowledge/source/base_knowledge_source.py
@@ -1,5 +1,5 @@
from abc import ABC, abstractmethod
from typing import List, Dict, Any, Optional
from typing import Any, Dict, List, Optional

import numpy as np
from pydantic import BaseModel, ConfigDict, Field
@@ -17,7 +17,7 @@ class BaseKnowledgeSource(BaseModel, ABC):

model_config = ConfigDict(arbitrary_types_allowed=True)
storage: KnowledgeStorage = Field(default_factory=KnowledgeStorage)
metadata: Dict[str, Any] = Field(default_factory=dict)
metadata: Dict[str, Any] = Field(default_factory=dict) # Currently unused
collection_name: Optional[str] = Field(default=None)

@abstractmethod
@@ -41,9 +41,9 @@ def _chunk_text(self, text: str) -> List[str]:
for i in range(0, len(text), self.chunk_size - self.chunk_overlap)
]

def save_documents(self, metadata: Dict[str, Any]):
def _save_documents(self):
"""
Save the documents to the storage.
This method should be called after the chunks and embeddings are generated.
"""
self.storage.save(self.chunks, metadata)
self.storage.save(self.chunks)
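
The `_chunk_text` hunk above shows only the loop header, which steps through the text in strides of `chunk_size - chunk_overlap`. A minimal standalone sketch of that sliding-window idea follows. The function name `chunk_text`, the default sizes, the slice expression, and the guard against a non-positive stride are assumptions made for illustration; they are not taken from the commit.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size chunks whose starts are chunk_size - chunk_overlap apart.

    Hypothetical helper: the defaults and the slice below are assumptions,
    only the stride matches the loop shown in the diff.
    """
    step = chunk_size - chunk_overlap
    if step <= 0:
        # Assumed guard: range() would raise or never advance with a non-positive step.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each chunk repeats the last chunk_overlap characters of the previous one.
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(repr(chunk))
```

With a stride smaller than the chunk size, neighbouring chunks share `chunk_overlap` characters, which is what preserves context across chunk boundaries. The final chunk may be shorter than `chunk_size`.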
4 changes: 2 additions & 2 deletions src/crewai/knowledge/source/csv_knowledge_source.py
@@ -1,6 +1,6 @@
import csv
from typing import Dict, List
from pathlib import Path
from typing import Dict, List

from crewai.knowledge.source.base_file_knowledge_source import BaseFileKnowledgeSource

@@ -30,7 +30,7 @@ def add(self) -> None:
)
new_chunks = self._chunk_text(content_str)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
5 changes: 3 additions & 2 deletions src/crewai/knowledge/source/excel_knowledge_source.py
@@ -1,5 +1,6 @@
from typing import Dict, List
from pathlib import Path
from typing import Dict, List

from crewai.knowledge.source.base_file_knowledge_source import BaseFileKnowledgeSource


@@ -44,7 +45,7 @@ def add(self) -> None:

new_chunks = self._chunk_text(content_str)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
4 changes: 2 additions & 2 deletions src/crewai/knowledge/source/json_knowledge_source.py
@@ -1,6 +1,6 @@
import json
from typing import Any, Dict, List
from pathlib import Path
from typing import Any, Dict, List

from crewai.knowledge.source.base_file_knowledge_source import BaseFileKnowledgeSource

@@ -42,7 +42,7 @@ def add(self) -> None:
)
new_chunks = self._chunk_text(content_str)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
4 changes: 2 additions & 2 deletions src/crewai/knowledge/source/pdf_knowledge_source.py
@@ -1,5 +1,5 @@
from typing import List, Dict
from pathlib import Path
from typing import Dict, List

from crewai.knowledge.source.base_file_knowledge_source import BaseFileKnowledgeSource

@@ -43,7 +43,7 @@ def add(self) -> None:
for _, text in self.content.items():
new_chunks = self._chunk_text(text)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
2 changes: 1 addition & 1 deletion src/crewai/knowledge/source/string_knowledge_source.py
@@ -24,7 +24,7 @@ def add(self) -> None:
"""Add string content to the knowledge source, chunk it, compute embeddings, and save them."""
new_chunks = self._chunk_text(self.content)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
4 changes: 2 additions & 2 deletions src/crewai/knowledge/source/text_file_knowledge_source.py
@@ -1,5 +1,5 @@
from typing import Dict, List
from pathlib import Path
from typing import Dict, List

from crewai.knowledge.source.base_file_knowledge_source import BaseFileKnowledgeSource

@@ -24,7 +24,7 @@ def add(self) -> None:
for _, text in self.content.items():
new_chunks = self._chunk_text(text)
self.chunks.extend(new_chunks)
self.save_documents(metadata=self.metadata)
self._save_documents()

def _chunk_text(self, text: str) -> List[str]:
"""Utility method to split text into chunks."""
25 changes: 16 additions & 9 deletions src/crewai/knowledge/storage/knowledge_storage.py
@@ -1,18 +1,20 @@
import contextlib
import hashlib
import io
import logging
import chromadb
import os
from typing import Any, Dict, List, Optional, Union, cast

import chromadb
import chromadb.errors
from crewai.utilities.paths import db_storage_path
from typing import Optional, List, Dict, Any, Union
from crewai.utilities import EmbeddingConfigurator
from crewai.knowledge.storage.base_knowledge_storage import BaseKnowledgeStorage
import hashlib
from chromadb.config import Settings
from chromadb.api import ClientAPI
from chromadb.api.types import OneOrMany
from chromadb.config import Settings

from crewai.knowledge.storage.base_knowledge_storage import BaseKnowledgeStorage
from crewai.utilities import EmbeddingConfigurator
from crewai.utilities.logger import Logger
from crewai.utilities.paths import db_storage_path


@contextlib.contextmanager
@@ -116,11 +118,16 @@ def reset(self):
def save(
self,
documents: List[str],
metadata: Union[Dict[str, Any], List[Dict[str, Any]]],
metadata: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
):
if self.collection:
try:
metadatas = [metadata] if isinstance(metadata, dict) else metadata
if metadata is None:
metadatas: Optional[OneOrMany[chromadb.Metadata]] = None
elif isinstance(metadata, list):
metadatas = [cast(chromadb.Metadata, m) for m in metadata]
else:
metadatas = cast(chromadb.Metadata, metadata)

ids = [
hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in documents
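
The `save` hunk above makes `metadata` optional and derives document IDs from a SHA-256 hash of each document's text. The sketch below isolates those two steps without depending on `chromadb`. The helper names `normalize_metadata` and `document_ids` and the `Metadata` alias are hypothetical, and plain dict copies stand in for the `cast(chromadb.Metadata, ...)` calls in the diff.

```python
import hashlib
from typing import Any, Dict, List, Optional, Union

# Assumed stand-in for chromadb.Metadata, so the sketch runs without chromadb.
Metadata = Dict[str, Any]


def normalize_metadata(
    metadata: Optional[Union[Metadata, List[Metadata]]] = None,
) -> Optional[Union[Metadata, List[Metadata]]]:
    """Mirror the three branches in the hunk: None, a list of dicts, or a single dict."""
    if metadata is None:
        return None  # the new default: no metadata is attached
    if isinstance(metadata, list):
        return [dict(m) for m in metadata]  # one metadata dict per document
    return dict(metadata)  # a single dict is passed through as one value


def document_ids(documents: List[str]) -> List[str]:
    """Derive a deterministic ID for each document from its UTF-8 bytes."""
    return [hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in documents]


if __name__ == "__main__":
    docs = ["John lives in San Francisco.", "John lives in San Francisco."]
    ids = document_ids(docs)
    print(ids[0] == ids[1])  # identical text yields identical IDs
    print(normalize_metadata())  # None when the caller omits metadata
```

Because the IDs are content hashes, two chunks with identical text map to the same ID. How the underlying collection treats repeated IDs is not visible in this hunk.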