AniSearchModel leverages Sentence-BERT (SBERT) models to generate embeddings for anime and manga synopses, enabling the calculation of semantic similarities between descriptions. This project facilitates the preprocessing, merging, and analysis of various anime and manga datasets to identify the most similar synopses.
AniSearchModel performs the following operations:
- Data Loading and Preprocessing: Loads multiple anime and manga datasets, cleans synopses, consolidates titles, and removes duplicates.
- Data Merging: Merges datasets based on common identifiers to create unified anime and manga datasets.
- Embedding Generation: Utilizes SBERT models to generate embeddings for synopses, facilitating semantic similarity calculations.
- Similarity Analysis: Calculates cosine similarities between embeddings to identify the most similar synopses or descriptions (see the sketch after this list).
- API Integration: Provides a Flask-based API to interact with the model and retrieve similarity results.
- Testing: Implements a comprehensive test suite using `pytest` to ensure the reliability and correctness of all components.
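For illustration, here is a minimal sketch of the embedding-and-similarity approach described above, using the `sentence-transformers` library directly; the model name and synopses are placeholders, not values taken from the project:

```python
from sentence_transformers import SentenceTransformer, util

# Load an SBERT model (any model listed in models.txt should work here).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v1")

# Two placeholder synopses to compare.
synopses = [
    "A young pilot is drawn into a war between giant robots and humanity.",
    "A rookie mecha pilot must defend Earth from invading machines.",
]

# Encode both synopses into dense embeddings.
embeddings = model.encode(synopses, convert_to_tensor=True)

# Cosine similarity between the two embeddings (closer to 1.0 means more similar).
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.4f}")
```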
The following datasets are used:

- MyAnimeList Dataset (`Anime.csv`): Kaggle
- Anime Dataset 2023 (`anime-dataset-2023.csv`): Kaggle
- Anime Database 2022 (`Anime-2022.csv`): Kaggle
- Anime Dataset (`animes.csv`): Kaggle
- Anime DataSet (`anime4500.csv`): Kaggle
- Anime Data (`anime_data.csv`): Kaggle
- Anime2 (`anime2.csv`): Kaggle
- MAL Anime (`mal_anime.csv`): Kaggle
- Anime 270: Hugging Face
- Wykonos Anime: Hugging Face
- MyAnimeList Manga Dataset (`Manga.csv`): Kaggle
- MyAnimeList Jikan Database (`jikan.csv`): Kaggle
- Manga, Manhwa and Manhua Dataset (`data.csv`): Kaggle
To install AniSearchModel:

- Clone the repository:

  ```bash
  git clone https://github.com/RLAlpha49/AniSearchModel.git
  cd AniSearchModel
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Linux/Mac
  venv\Scripts\activate     # On Windows
  ```
- Ensure `setuptools` is installed:

  Before running the setup script, make sure `setuptools` is installed in your virtual environment. It is typically included with Python, but you can update it with:

  ```bash
  pip install --upgrade setuptools
  ```
- Install the package and dependencies:

  Use the `setup.py` script to install the package along with its dependencies. This also handles the installation of PyTorch with CUDA support (a sketch of the underlying pattern follows these steps):

  ```bash
  python setup.py install
  ```

  This command will:

  - Install all required Python packages listed in `install_requires`.
  - Execute the `PostInstallCommand` to install PyTorch with CUDA support.
- Verify the installation:

  After installation, you can verify that PyTorch is using CUDA by running:

  ```bash
  python -c "import torch; print(torch.cuda.is_available())"
  ```

  This should print `True` if CUDA is available and correctly configured.
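The `PostInstallCommand` itself is not shown in this README; the following is a hypothetical sketch of the standard `setuptools` pattern it likely follows. The PyTorch index URL and the exact install arguments are assumptions:

```python
# setup.py (sketch, not the project's actual file)
import subprocess
import sys

from setuptools import setup
from setuptools.command.install import install


class PostInstallCommand(install):
    """Run the normal install, then pull in PyTorch with CUDA support."""

    def run(self):
        install.run(self)
        # Hypothetical: install a CUDA-enabled PyTorch build after the base install.
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "torch", "--index-url", "https://download.pytorch.org/whl/cu121",
        ])


setup(
    name="AniSearchModel",
    cmdclass={"install": PostInstallCommand},
    # install_requires=[...] and other metadata omitted.
)
```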
The repository already contains the merged datasets, but if you want to merge additional datasets, edit the `merge_datasets.py` file and run:

```bash
python merge_datasets.py --type anime
python merge_datasets.py --type manga
```
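As an illustration of what a merging step can look like (this is not the actual `merge_datasets.py` logic), datasets sharing a common identifier can be combined with pandas. The column names below are assumptions:

```python
import pandas as pd

# Hypothetical column names; the real datasets may differ.
base = pd.read_csv("data/anime/anime-dataset-2023.csv")   # assumed: "anime_id", "Synopsis"
extra = pd.read_csv("data/anime/Anime-2022.csv")          # assumed: "anime_id", "synopsis"

# Merge on the shared identifier, keeping every row from the base dataset.
merged = base.merge(extra, on="anime_id", how="left", suffixes=("", "_2022"))

# Drop duplicate rows introduced by overlapping sources.
merged = merged.drop_duplicates(subset="anime_id")

merged.to_csv("models/merged_anime_dataset.csv", index=False)
```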
To generate SBERT embeddings for the anime and manga datasets, use the provided script:

```bash
python sbert.py --model <model_name> --type <dataset_type>
```

Replace `<model_name>` with the desired SBERT model, e.g., `all-mpnet-base-v1`, and `<dataset_type>` with `anime` or `manga`.
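Conceptually, the script encodes each synopsis in the merged dataset and saves the result as a NumPy array. A simplified sketch of that flow follows; the synopsis column name is an assumption:

```python
import os

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model_name = "sentence-transformers/all-mpnet-base-v1"
model = SentenceTransformer(model_name)

# Load the merged dataset and encode one synopsis column.
df = pd.read_csv("models/merged_anime_dataset.csv")
texts = df["Synopsis"].fillna("").tolist()  # assumed column name
embeddings = model.encode(texts, show_progress_bar=True)

# Save the embeddings under the model-specific directory used by the project.
out_dir = os.path.join("models", "anime", model_name.split("/")[-1])
os.makedirs(out_dir, exist_ok=True)
np.save(os.path.join(out_dir, "embeddings_synopsis.npy"), embeddings)
```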
You can use the provided scripts to generate embeddings for all models listed in `models.txt`.

The `generate_models.sh` script is available for Linux users. To run it, follow these steps:
- Make the script executable:

  ```bash
  chmod +x generate_models.sh
  ```

- Run the script:

  ```bash
  ./scripts/generate_models.sh
  ```

- Optionally, specify a starting model:

  ```bash
  ./scripts/generate_models.sh sentence-transformers/all-MiniLM-L6-v1
  ```
The `generate_models.bat` script is available for Windows users:

- Open Command Prompt and navigate to the directory containing the script.
- Run the script:

  ```bat
  scripts\generate_models.bat
  ```

- Optionally, specify a starting model:

  ```bat
  scripts\generate_models.bat sentence-transformers/all-MiniLM-L6-v1
  ```
The `generate_models.ps1` script is available for PowerShell users:

- Open PowerShell and navigate to the directory containing the script.
- Run the script:

  ```powershell
  .\scripts\generate_models.ps1
  ```

- Optionally, specify a starting model:

  ```powershell
  .\scripts\generate_models.ps1 -StartModel "sentence-transformers/all-MiniLM-L6-v1"
  ```
Notes:

- The starting model parameter is optional. If not provided, the scripts process all models from the beginning of the list.
- For PowerShell, you may need to adjust the execution policy to allow script execution by running `Set-ExecutionPolicy RemoteSigned` in an elevated PowerShell session.
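If you prefer not to rely on the platform-specific scripts, the same loop can be expressed in a few lines of Python. This is a cross-platform sketch, not one of the provided scripts:

```python
import subprocess

# Read the model list, skipping blank lines.
with open("models.txt", encoding="utf-8") as f:
    models = [line.strip() for line in f if line.strip()]

# Generate embeddings for every model and both dataset types.
for model in models:
    for dataset_type in ("anime", "manga"):
        subprocess.run(
            ["python", "sbert.py", "--model", model, "--type", dataset_type],
            check=True,
        )
```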
To ensure the reliability and correctness of the project, a comprehensive suite of tests has been implemented using `pytest`. The tests cover various components of the project, including:
- `tests/test_model.py`:
  - Purpose: Tests the functionality of model loading, similarity calculations, and evaluation result saving.
  - Key Functions Tested:
    - `test_anime_model`: Verifies that the anime model loads correctly, calculates similarities, and saves evaluation results as expected.
    - `test_manga_model`: Similar to `test_anime_model`, but for the manga dataset.
- `tests/test_merge_datasets.py`:
  - Purpose: Validates the data preprocessing and merging functions, ensuring that names are correctly processed, synopses are cleaned, titles are consolidated, and duplicates are removed or handled appropriately.
  - Key Functions Tested:
    - `test_preprocess_name`: Ensures that names are preprocessed correctly by converting them to lowercase and stripping whitespace.
    - `test_clean_synopsis`: Checks that unwanted phrases are removed from synopses.
    - `test_consolidate_titles`: Verifies that multiple title columns are consolidated into a single `title` column.
    - `test_remove_duplicate_infos`: Confirms that duplicate synopses are handled correctly.
    - `test_add_additional_info`: Tests the addition of extra synopsis information to the merged DataFrame.
- `tests/test_sbert.py`:
  - Purpose: Checks the SBERT embedding generation process, verifying that embeddings are correctly created and saved for both anime and manga datasets.
  - Key Functions Tested:
    - `run_sbert_command_and_verify`: Runs the SBERT command-line script and verifies that embeddings and evaluation results are generated as expected.
    - Parameterized tests for the different dataset types (`anime`, `manga`) and their corresponding expected embedding files.
- `tests/test_api.py`:
  - Purpose: Tests the Flask API endpoints, ensuring that the `/anisearchmodel/manga` endpoint behaves as expected with valid inputs, handles missing fields gracefully, and correctly responds to internal server errors.
  - Key Functions Tested:
    - `test_get_manga_similarities_success`: Verifies successful retrieval of similarities with valid inputs.
    - `test_get_manga_similarities_missing_model`: Checks the API's response when the model name is missing.
    - `test_get_manga_similarities_missing_description`: Ensures appropriate handling when the description is missing.
    - Tests for internal server errors by simulating exceptions during processing.
- `tests/conftest.py`:
  - Purpose: Configures `pytest` options and fixtures, including command-line options for specifying the model name during tests.
  - Key Features:
    - Adds a command-line option `--model-name` to specify the model used in tests.
    - Provides a fixture `model_name` that retrieves the model name from the command-line options (see the sketch after this list).
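The option-plus-fixture combination follows the standard `pytest` pattern; the following is a minimal sketch of what `tests/conftest.py` likely contains, with the default model name being an assumption:

```python
# tests/conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Register the custom command-line option used by the test suite.
    parser.addoption(
        "--model-name",
        action="store",
        default="sentence-transformers/all-mpnet-base-v1",  # assumed default
        help="SBERT model to use in tests",
    )


@pytest.fixture
def model_name(request):
    # Expose the option's value to any test that requests this fixture.
    return request.config.getoption("--model-name")
```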
To run all the tests, navigate to the project's root directory and execute:

```bash
pytest
```

You can also run specific tests or test modules. For example, to run only the API tests:

```bash
pytest tests/test_api.py
```

To run tests against a specific model, use:

```bash
pytest tests/test_sbert.py --model-name <model_name>
```

Replace `<model_name>` with the name of the model you want to test. The `--model-name` option can be used when running all tests or specific tests.
To run the Flask application, use the `run_server.py` script. The script automatically detects the operating system and uses the appropriate server, and lets you choose whether to use CUDA or CPU for processing:

- On Linux, it uses Gunicorn.
- On Windows, it uses Waitress.

Run the script with:

```bash
python src/run_server.py [cuda|cpu]
```

Replace `[cuda|cpu]` with your desired device. If no device is specified, it defaults to `cpu`.

The application will be accessible at `http://0.0.0.0:5000/anisearchmodel`.
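Once the server is running, you can query it over HTTP. Judging by the test descriptions above, the manga endpoint accepts a model name and a description; the exact field names and response shape below are assumptions, so adjust them to match `src/api.py`:

```python
import requests

# Hypothetical payload; field names are inferred from the test descriptions.
payload = {
    "model": "sentence-transformers/all-mpnet-base-v1",
    "description": "A lone swordsman wanders a post-apocalyptic wasteland.",
}

response = requests.post(
    "http://localhost:5000/anisearchmodel/manga",
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # expected: the most similar manga entries with scores
```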
The project structure below also lists files and directories that are generated by the project and are not part of the source code.
```text
AniSearchModel
├── .github
│ └── workflows
│ ├── codeql.yml
│ └── ruff.yml
├── data
│ ├── anime
│ │ ├── Anime_data.csv
│ │ ├── Anime-2022.csv
│ │ ├── anime-dataset-2023.csv
│ │ ├── anime.csv
│ │ ├── Anime2.csv
│ │ ├── anime4500.csv
│ │ ├── animes.csv
│ │ └── mal_anime.csv
│ └── manga
│ ├── data.csv
│ ├── jikan.csv
│ └── manga.csv
├── logs
│ └── <filename>.log.<#>
├── models
│ ├── anime
│ │ └── <model_name>
│ │ ├── embeddings_Synopsis_anime_270_Dataset.npy
│ │ ├── embeddings_Synopsis_Anime_data_Dataset.npy
│ │ ├── embeddings_Synopsis_anime_dataset_2023.npy
│ │ ├── embeddings_Synopsis_Anime-2022_Dataset.npy
│ │ ├── embeddings_Synopsis_anime2_Dataset.npy
│ │ ├── embeddings_Synopsis_anime4500_Dataset.npy
│ │ ├── embeddings_Synopsis_animes_dataset.npy
│ │ ├── embeddings_Synopsis_mal_anime_Dataset.npy
│ │ ├── embeddings_Synopsis_wykonos_Dataset.npy
│ │ └── embeddings_synopsis.npy
│ ├── manga
│ │ └── <model_name>
│ │ ├── embeddings_Synopsis_data_Dataset.npy
│ │ ├── embeddings_Synopsis_jikan_Dataset.npy
│ │ └── embeddings_synopsis.npy
│ ├── evaluation_results_anime.json
│ ├── evaluation_results_manga.json
│ ├── evaluation_results.json
│ ├── merged_anime_dataset.csv
│ └── merged_manga_dataset.csv
├── scripts
│ ├── generate_models.bat
│ ├── generate_models.ps1
│ └── generate_models.sh
├── src
│ ├── __init__.py
│ ├── api.py
│ ├── common.py
│ ├── merge_datasets.py
│ ├── run_server.py
│ ├── sbert.py
│ └── test.py
├── tests
│ ├── __init__.py
│ ├── conftest.py
│ ├── test_api.py
│ ├── test_merge_datasets.py
│ ├── test_model.py
│ └── test_sbert.py
├── .gitignore
├── architecture.txt
├── datasets.txt
├── LICENSE
├── models.txt
├── pytest.ini
├── README.md
├── requirements.txt
└── setup.py
```
The project requires:

- Python 3.6+
- Python Packages:
  - pandas
  - numpy
  - torch
  - transformers
  - sentence-transformers
  - tqdm
  - datasets
  - flask
  - flask-limiter
  - waitress
  - gunicorn
  - pytest
  - pytest-order
Install all dependencies using:

```bash
python setup.py install
```
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.