Welcome to our organization's data catalog. Below is a list of datasets that have been collected through our Fair Forward programme.
You can access the live website here: Data Catalog!
Note: This data catalog is currently a prototype and is not yet fully developed. It is intended to showcase the concept and functionality, but may undergo significant changes in the future.
Want to add your dataset or AI use case to this catalog? Great!
- Access the Source: The data for this catalog lives in this Google Sheet.
- Add Your Project: Add a new row to the sheet and fill in the details for your project. Please follow the format of existing entries and use the second row as a guide for the expected content in each column.
- Update the Website: Once you've added your information to the Google Sheet, the website needs to be rebuilt to include it. Please contact one of the repository maintainers or follow the "Update via GitHub Actions" steps below (if you have write access) to trigger an update.
The content of this catalog is primarily sourced from the Google Sheet mentioned above. Changes made there need to be reflected on the website.
There are two main ways to update the website:
- Local Update (for Developers):
  - Ensure you have the prerequisites installed (see Development section).
  - Run the main build script locally:
    python build_from_google_sheets.py
  - This single script fetches the latest data, applies necessary processing (like fuzzy column matching), saves the intermediate docs/data_catalog.xlsx, creates/updates project markdown files in docs/public/projects/, generates the final docs/index.html, and saves a daily backup CSV to data_sources/google_sheets_backup/.
  - Optionally, run python download_placeholder_images.py if new projects need placeholder images (requires Pexels API key setup).
  - Commit and push all changed files (including index.html, .xlsx, backups, and any new/modified files in docs/public/projects/) to the main branch.
- Update via GitHub Actions (for Non-Developers with Repo Access):
  - This method allows updating the website directly from GitHub without running code locally.
  - Go to the repository's Actions tab on GitHub.
  - In the left sidebar, click on the "Manually Update Website from Google Sheets" workflow.
  - Above the list of workflow runs, click the "Run workflow" dropdown button.
  - Ensure the "Branch: main" is selected.
  - Click the green "Run workflow" button.
  - The workflow will perform the same steps as running build_from_google_sheets.py locally, fetching the latest data, rebuilding the website, and automatically committing the changes to the main branch. The live website will be updated shortly after the workflow completes successfully. (Note: This action does not currently run the placeholder image download script.)
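For maintainers who prefer a script over the web UI, the manual trigger described above can in principle also be fired through GitHub's REST endpoint for workflow dispatch events. The following is a minimal sketch, not part of this repository: the function name `build_dispatch_request` and the owner/repo values are hypothetical, and it assumes the workflow file declares a `workflow_dispatch` trigger (which the "Run workflow" button implies) and that you hold a token allowed to run Actions on the repository.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, token, ref="main"):
    """Build (but do not send) a workflow_dispatch request for one workflow.

    owner, repo and token are placeholders supplied by the caller; the
    workflow file name comes from .github/workflows/ in this repository.
    """
    url = (
        f"{GITHUB_API}/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The request body names the branch the workflow should run on.
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Usage (assumed values, shown as comments so nothing is sent by accident):
# req = build_dispatch_request(
#     "your-org", "your-repo", "update_from_google_sheets.yml", token="...")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Separating request construction from sending keeps the sketch easy to inspect before anything touches the live repository.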
This repository contains the code for generating a static website data catalog.
- Python 3.7+ (due to the thefuzz dependency)
- Required Python packages are listed in requirements.txt.
- Node.js and npm (Optional: only needed for the React frontend development)
Install Python packages with:
pip install -r requirements.txt
The primary script handles fetching data and building the site:
python build_from_google_sheets.py
This script:
- Fetches data from the Google Spreadsheet.
- Performs fuzzy matching on column headers.
- Saves the processed data to docs/data_catalog.xlsx.
- Creates/updates markdown documentation files in docs/public/projects/*/docs/.
- Creates/updates project image folders in docs/public/projects/*/images/.
- Saves a daily raw backup to data_sources/google_sheets_backup/.
- Runs generate_catalog.py internally to build docs/index.html.
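To illustrate what fuzzy matching on column headers means in practice, here is a small sketch that maps slightly misspelled sheet headers onto expected column names. It is not the repository's implementation: the build script depends on thefuzz, whereas this example uses the standard library's `difflib` as a stand-in so it runs without extra packages. The column names in `EXPECTED_COLUMNS`, the function name `match_headers`, and the 0.6 cutoff are all assumptions for illustration.

```python
import difflib

# Hypothetical canonical column names; the real sheet's headers may differ.
EXPECTED_COLUMNS = ["Project Name", "Domain", "Data Type", "Region"]


def match_headers(raw_headers, expected=EXPECTED_COLUMNS, cutoff=0.6):
    """Map each raw header to the closest expected column name, or None."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {name.lower(): name for name in expected}
    mapping = {}
    for header in raw_headers:
        candidates = difflib.get_close_matches(
            header.strip().lower(), list(lowered), n=1, cutoff=cutoff
        )
        mapping[header] = lowered[candidates[0]] if candidates else None
    return mapping
```

Headers that match nothing above the cutoff map to `None`, so a caller can report them instead of silently guessing.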
If you only want to regenerate the HTML from the existing docs/data_catalog.xlsx without fetching from Google Sheets, you can run:
python generate_catalog.py
To download placeholder images for projects lacking them:
python download_placeholder_images.py
This requires a Pexels API key. Set it up via:
- A .env file in the root directory: PEXELS_API_KEY=your_api_key_here
- Command-line argument: --api-key YOUR_API_KEY
(Note: The download_placeholder_images.py script uses requirements.txt for its dependencies.)
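The two ways of supplying the key can be combined in one lookup. The sketch below shows one plausible resolution order (command-line argument first, then an already exported environment variable, then the .env file); that order and the helper names `read_env_file` and `resolve_api_key` are assumptions, and the actual script may resolve the key differently.

```python
import argparse
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(argv=None, env_file=".env"):
    """Return the Pexels key from --api-key, the environment, or .env."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=None)
    args = parser.parse_args(argv)
    return (
        args.api_key
        or os.environ.get("PEXELS_API_KEY")
        or read_env_file(env_file).get("PEXELS_API_KEY")
    )
```

If none of the three sources provides a key, `resolve_api_key` returns `None`, which a caller can turn into a clear error message.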
The frontend/ directory contains an experimental React frontend. It uses data from frontend/data.json, generated by generate_catalog.py.
To run:
cd frontend
npm install
npm run dev
- Create a Google Cloud service account.
- Download its JSON key file.
- Save it as data_sources/google_sheets_api/service_account_JN.json (or match the path used in scripts/workflows).
- Ensure this file is listed in .gitignore and never commit it.
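Before running the build, it can help to confirm that the downloaded key file is in place and looks like a service account key. The following sketch is an optional sanity check, not something the repository ships: the function name `check_service_account_key` is hypothetical, and the check only inspects a few fields that Google service account key files normally contain.

```python
import json
from pathlib import Path

# Fields normally present in a Google service account JSON key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"missing file: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_FIELDS - data.keys())
    ]
    if data.get("type") != "service_account":
        problems.append('field "type" is not "service_account"')
    return problems
```

Calling it with data_sources/google_sheets_api/service_account_JN.json and printing any returned problems gives a quick check without contacting Google.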
Needed for download_placeholder_images.py. Set via .env file or --api-key argument.
- build_from_google_sheets.py: Main script for fetching data from Google Sheets and orchestrating the website build.
- generate_catalog.py: Generates the HTML catalog (docs/index.html) from docs/data_catalog.xlsx.
- download_placeholder_images.py: Downloads placeholder images for projects.
- backup_google_sheet.py: Standalone script for monthly raw data backup (used by GitHub Action).
- requirements.txt: Consolidated Python dependencies for the build process.
- docs/: Directory containing the generated website and assets.
  - data_catalog.xlsx: Intermediate Excel file processed from Google Sheet.
  - index.html: Generated HTML file for the website (served by GitHub Pages).
  - enhanced_side_panel.js and enhanced_side_panel.css: Side panel functionality.
  - public/projects/: Project-specific markdown files and images.
- frontend/: (Experimental) React frontend.
- data_sources/: Credentials and backups.
  - google_sheets_api/: Credentials file (ignored by git).
  - google_sheets_backup/: Daily and monthly raw CSV backups (tracked by git).
- .github/workflows/: GitHub Actions workflow files.
- .github/workflows/update_from_google_sheets.yml: Manually triggered workflow to run build_from_google_sheets.py and commit results to main.
- .github/workflows/monthly_backup.yml: Automatically triggered workflow (1st of month) to run backup_google_sheet.py and commit the monthly raw CSV backup.
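The daily and monthly raw CSV backups listed above could be produced by a helper along the following lines. This is only a sketch: the file naming scheme (`daily_YYYY-MM-DD.csv`, `monthly_YYYY-MM.csv`) and the function names `backup_path` and `write_backup` are assumptions, and the files actually written to data_sources/google_sheets_backup/ may be named differently.

```python
import csv
from datetime import date
from pathlib import Path


def backup_path(backup_dir, day, monthly=False):
    """Build a dated file path (hypothetical naming scheme)."""
    stamp = day.strftime("%Y-%m") if monthly else day.isoformat()
    prefix = "monthly" if monthly else "daily"
    return Path(backup_dir) / f"{prefix}_{stamp}.csv"


def write_backup(rows, backup_dir, day=None, monthly=False):
    """Write raw sheet rows (lists of cell values) to a dated CSV file."""
    day = day or date.today()
    target = backup_path(backup_dir, day, monthly)
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return target
```

Because the date is part of the file name, running the helper twice on the same day overwrites that day's file rather than creating duplicates.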
The data catalog includes the following features:
- Modern UI with a clean, responsive layout
- Filtering by domain, data type, and region
- Search functionality for finding specific datasets or use cases
- Visual distinction between domains and data types
- Detailed information for each dataset and use case
- Links to external resources for datasets and use cases
- Enhanced side panel for displaying detailed project information
- Automatic placeholder images for projects without custom images
- React frontend for a more interactive user experience
If you would like to contribute code changes to this project, please follow these steps:
- Fork the repository
- Create a new branch for your feature (git checkout -b feature/YourFeature)
- Make your changes
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin feature/YourFeature)
- Create a new Pull Request
Please ensure that your code follows the existing style and includes appropriate documentation.
This project is licensed under the terms of the LICENSE file included in the repository.