Admin user documentation
To use the Ailixir application, you first need to specify the input sources to scrape data from; a `config.json` file will be generated from them. The data acquisition pipeline is then started by executing the orchestrator component, which manages this process. Ensure that the specified data sources are accessible (free to use), as they significantly impact the quality of the scraped data and the AI's overall performance.
Before using the main commands in the following section, update the scraping targets in `src/backend/Config/main.py`, for example like this:
```python
if __name__ == '__main__':
    c = Config()
    c.add_target(
        YouTubeTarget(url='https://www.youtube.com/@NutritionFactsOrg')
    )
    c.write_to_json()
```
Here is the list of available data scraping targets with example usage:
- Recipes from allrecipes
- Articles from arXiv
- Podcasts from the Peter Attia Podcast
- Articles from PubMed
- YouTube channels (data is retrieved from all videos on the channel)
- Nutrition-related blog posts from NutritionFacts
```python
# Without any parameters, it scrapes allrecipes.com
AllRecipesTarget()

# The first argument specifies which keywords to look for
# and the second one limits the number of results per set of keywords
ArchiveTarget(keywords=['nutrition', 'health'], max_results=1)

# The first parameter is the base URL for scraping (don't change it)
# and the second one limits the number of podcasts to retrieve
PodcastTarget(
    url='https://peterattiamd.com/podcast/archive/', num_podcasts=1
)

# The first argument specifies which keywords to look for
# and the second one limits the number of results per set of keywords
PubMedTarget(keywords=['food', 'exercise'], max_results=1)

# The first parameter specifies the channel to scrape data from
# (all videos of this channel will be scraped)
YouTubeTarget(url='https://www.youtube.com/@NutritionFactsOrg')

# The first argument specifies the base URL (don't change it)
# and the second one limits the number of pages to scrape
NutritionTarget(url='https://nutritionfacts.org/blog/', max_pages=1)
```
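To scrape several sources in one run, you can register multiple targets before writing the config. The following is a minimal sketch that only reuses the classes and parameters shown above, and assumes `add_target` can simply be called once per target, as its name suggests; adjust keywords and limits to your needs:

```python
if __name__ == '__main__':
    c = Config()
    # Register several targets; each one will be written into the generated config.
    c.add_target(ArchiveTarget(keywords=['nutrition', 'health'], max_results=1))
    c.add_target(PubMedTarget(keywords=['food', 'exercise'], max_results=1))
    c.add_target(YouTubeTarget(url='https://www.youtube.com/@NutritionFactsOrg'))
    # Write the config.json consumed by the orchestrator.
    c.write_to_json()
```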
To start the data acquisition pipeline, execute the following commands:
```bash
# Build the data/config.json file and auxiliary folder structure
pdm build-config

# Scrape all targets specified in src/backend/Config/config.py
pdm run-orchestrator
```
The data will be stored in the `data/` directory. For example, scraped YouTube data can be found under `data/youtube/raw/`, and the unique ids of scraped YouTube videos are stored in `data/youtube/index.json`.
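As a quick sanity check after a YouTube run, you can inspect the index file. The snippet below is only a sketch: it assumes `data/youtube/index.json` holds a JSON collection of video ids (the exact layout is project-specific) and simply counts the entries.

```python
import json
from pathlib import Path

# data/youtube/index.json stores the unique ids of already-scraped videos;
# whether it is a list or a mapping is project-specific, so we only count entries.
index = json.loads(Path('data/youtube/index.json').read_text())
print(f'{len(index)} YouTube videos indexed')
```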
API keys and other sensitive information cannot be stored in the public code repository. When running the application for the first time, you will be prompted to provide these values. The environment setup script ensures they are stored in a `.env` file. The data acquisition pipeline only requires the `YOUTUBE_DATA_API_V3` key. We obtained these keys from our industry partner.
```
YOUTUBE_DATA_API_V3=""
GOOGLE_GEMINI_API=""
OPEN_AI_API=""
ASTRA_DB_API_ENDPOINT=""
ASTRA_DB_TOKEN=""
ASTRA_DB_NAMESPACE=""
ASTRA_DB_COLLECTION=""
FIREBASE_API_KEY=""
FIREBASE_AUTH_DOMAIN=""
FIREBASE_PROJECT_ID=""
FIREBASE_STORAGE_BUCKET=""
FIREBASE_MESSAGING_SENDER_ID=""
FIREBASE_APP_ID=""
FIREBASE_MEASUREMENT_ID=""
GOOGLE_AUTH_CLIENT_ID=""
```
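For reference, this is roughly how a script can pick these values up at runtime. The sketch below assumes the `python-dotenv` package is available; the actual Ailixir setup and pipeline code may load the `.env` file differently.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Read key/value pairs from the .env file in the project root into os.environ.
load_dotenv()

# The data acquisition pipeline only needs the YouTube key.
youtube_key = os.environ.get('YOUTUBE_DATA_API_V3')
if not youtube_key:
    raise RuntimeError('YOUTUBE_DATA_API_V3 is missing from the .env file')
```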
To perform specific scraping tasks, execute the corresponding commands:
```bash
# Scrape AllRecipes
pdm scrape-allrecipes

# Scrape Podcast
pdm scrape-podcast

# Scrape PubMed
pdm scrape-pubmed

# Scrape YouTube
pdm scrape-youtube

# Scrape Archive
pdm scrape-archive

# Scrape NutritionFacts
pdm scrape-nutritionfacts
```
These commands execute the code found in `main.py` for each respective scraper (e.g. `src/backend/Scrapers/Archive/main.py`). You can customise and try out code for the different scrapers there.
To deploy various instances of the app as different types of agents, you can edit a Google Document and one Python script to make the new background information available to our LLM. To do so, first download new Google Drive fetching credentials from the Google Console by clicking on the download icon.

After downloading, rename the file to `credentials.json` and place it in the root of the project.

Now let's get to the Google document itself. When the following link is opened for the first time, a developer/owner has to approve editing of the Initialisation Document. The document should then look like this.
After writing the basic prompt and some additional context, the new information can be uploaded to our LLM using the following command in the root of our project. The result is saved in `data/google_docs_content.txt`.
```bash
pdm google-docs
```
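To check that the fetch worked, you can print the beginning of the saved file; a trivial sketch:

```python
from pathlib import Path

# pdm google-docs saves the fetched document content here (see above).
content = Path('data/google_docs_content.txt').read_text()
print(content[:500])  # show the first few hundred characters
```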
If an entrepreneur wants to change the Google Docs link, they first need to create a new document and then update the Python script `function/get_google_docs.py`, as seen in the following snippet.