
Admin user documentation


User documentation - Admin / entrepreneur user

Usage

To use the Ailixir application, you first need to specify the input sources to scrape data from; a config.json file will be generated from them. Then start the data acquisition pipeline by executing the orchestrator component, which manages this process. Ensure that the specified data sources are accessible (free to use), as they significantly impact the quality of the AI's data and its overall performance.

Configuration

Before using the main commands in the following section, update the scraping targets in src/backend/Config/main.py, for example like this:

if __name__ == '__main__':
    c = Config()
    c.add_target(YouTubeTarget(
        url='https://www.youtube.com/@NutritionFactsOrg'
    ))
    c.write_to_json()

Data Sources

Here is the list of available data scraping targets with example usage:

# Without any parameters, it scrapes from allrecipes.com
AllRecipesTarget()
# The first argument specifies which keywords to look for
# and the second one limits the number of results for a set of keywords
ArchiveTarget(keywords=['nutrition', 'health'], max_results=1)
# The first parameter is the base URL for scraping (don't change it)
# and the second one limits the number of podcasts to be retrieved
PodcastTarget(
    url='https://peterattiamd.com/podcast/archive/', num_podcasts=1
)
# The first argument specifies which keywords to look for
# and the second one limits the number of results for a set of keywords
PubMedTarget(keywords=['food', 'exercise'], max_results=1)
# The first parameter specifies the channel to scrape data from
# (all videos of this channel will be scraped)
YouTubeTarget(url='https://www.youtube.com/@NutritionFactsOrg')
# The first argument specifies the base URL (don't change it)
# and the second one limits the number of pages to scrape
NutritionTarget(url='https://nutritionfacts.org/blog/', max_pages=1)
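Several sources can be combined in a single configuration. The following is a sketch built from the constructors above (which targets and parameters you combine is up to you):

if __name__ == '__main__':
    c = Config()
    # Mix and match any of the targets listed above
    c.add_target(YouTubeTarget(url='https://www.youtube.com/@NutritionFactsOrg'))
    c.add_target(PubMedTarget(keywords=['food', 'exercise'], max_results=1))
    c.add_target(ArchiveTarget(keywords=['nutrition', 'health'], max_results=1))
    c.write_to_json()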

Running the Data Acquisition Pipeline

To start the data acquisition pipeline, execute the following commands:

# Build data/config.json file and auxiliary folder structure
pdm build-config
# Scrape all targets specified in src/backend/Config/config.py
pdm run-orchestrator

The data will be stored in the data/ directory. For example, scraped YouTube data can be found under data/youtube/raw/, and the unique IDs of already-scraped YouTube videos are stored in data/youtube/index.json.
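For illustration, after scraping YouTube the layout looks roughly as follows (other scrapers presumably get analogous subfolders under data/; the exact names may differ):

data/
├── config.json      # generated by `pdm build-config`
└── youtube/
    ├── index.json   # unique IDs of already-scraped videos
    └── raw/         # raw scraped data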

Environment Variables

API keys and other sensitive information cannot be stored in the public code repository. When running the application for the first time, the user will be prompted to provide these values. The environment setup script ensures these values are stored in a .env file. The data acquisition pipeline only requires the YOUTUBE_DATA_API_V3 key. We obtained these keys from our industry partner.

YOUTUBE_DATA_API_V3=""
GOOGLE_GEMINI_API=""
OPEN_AI_API=""
ASTRA_DB_API_ENDPOINT=""
ASTRA_DB_TOKEN=""
ASTRA_DB_NAMESPACE=""
ASTRA_DB_COLLECTION=""
FIREBASE_API_KEY=""
FIREBASE_AUTH_DOMAIN=""
FIREBASE_PROJECT_ID=""
FIREBASE_STORAGE_BUCKET=""
FIREBASE_MESSAGING_SENDER_ID=""
FIREBASE_APP_ID=""
FIREBASE_MEASUREMENT_ID=""
GOOGLE_AUTH_CLIENT_ID=""
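If you need to access these values from your own Python code, a minimal sketch looks like this (assuming the python-dotenv package; the setup script itself may load the values differently):

# Read the keys from the .env file in the project root (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()
youtube_key = os.getenv('YOUTUBE_DATA_API_V3')
if not youtube_key:
    raise RuntimeError('YOUTUBE_DATA_API_V3 is missing; re-run the environment setup.')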

Additional Commands

Scraping

To perform specific scraping tasks, execute the corresponding commands:

# Scrape AllRecipes
pdm scrape-allrecipes
# Scrape Podcast
pdm scrape-podcast
# Scrape PubMed
pdm scrape-pubmed
# Scrape YouTube
pdm scrape-youtube
# Scrape Archive
pdm scrape-archive
# Scrape NutritionFacts
pdm scrape-nutritionfacts

These commands will execute the code found in main.py for each respective scraper (e.g. src/backend/Scrapers/Archive/main.py). You can customise and try out code for the different scrapers there.

Initialization Prompt (GDocs)

To deploy various instances of the app as different types of agents, we can edit a Google Document and one Python script to supply new background information to our LLM. To do so, we first need to download new Google Drive fetching credentials from the Google Console by clicking on the download icon.

[Screenshot: the download icon for the Google Drive credentials in the Google Console]

After downloading, rename the file to credentials.json and place it in the root of the project.

[Screenshot: credentials.json placed in the project root]

Now let's get to the Google document itself. When the document link is opened for the first time, a developer/owner has to approve editing of the Initialisation Document. The document should then look like this:

[Screenshot: the Initialisation Document in Google Docs]

After writing the basic prompt and some additional context, the new info can be uploaded to our LLM using the following command in the root of our project. The result is saved in data/google_docs_content.txt.

pdm google-docs
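Under the hood, the command fetches the document text via the Google Docs API and writes it to data/google_docs_content.txt. The actual function/get_google_docs.py may differ; the following is only a rough sketch of the flow, assuming the google-api-python-client and google-auth-oauthlib packages:

# Rough sketch only -- the real script may authenticate and parse differently
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/documents.readonly']
DOCUMENT_ID = 'your-google-doc-id'  # hypothetical placeholder

# Authenticate with the credentials.json downloaded from the Google Console
flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)

# Fetch the document and flatten its paragraphs to plain text
service = build('docs', 'v1', credentials=creds)
doc = service.documents().get(documentId=DOCUMENT_ID).execute()
text = ''
for element in doc.get('body', {}).get('content', []):
    for run in element.get('paragraph', {}).get('elements', []):
        text += run.get('textRun', {}).get('content', '')

with open('data/google_docs_content.txt', 'w') as f:
    f.write(text)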

Changing GDocs Link

If an entrepreneur wants to change the Google Docs link, they first need to create a new document and then update the Python script function/get_google_docs.py, as seen in the following snippet.

[Screenshot: the updated document link in function/get_google_docs.py]
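In essence, the change amounts to replacing the document ID/URL constant in the script with the one from the new document, for example (the variable name below is hypothetical; use whatever the script actually defines):

# In function/get_google_docs.py -- point the script at the new document
DOCUMENT_ID = 'the-id-of-your-new-google-doc'  # hypothetical name, adjust to the script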