
ingest.py - versioning #56

Closed
RHinDFIR opened this issue May 15, 2023 · 12 comments

@RHinDFIR

RHinDFIR commented May 15, 2023

I'm suddenly running into an issue with ingest.py: instead of processing as it should, the script fails with this error:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY-main$ python ingest.py
Traceback (most recent call last):
File "/mnt/h/LLM/CASALIOY-main/ingest.py", line 24, in
from load_env import chunk_overlap, chunk_size, documents_directory, get_embedding_model, persist_directory
File "/mnt/h/LLM/CASALIOY-main/load_env.py", line 15, in
use_mlock = os.environ.get("USE_MLOCK").lower() == "true"
AttributeError: 'NoneType' object has no attribute 'lower'
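
For anyone hitting the same trace: os.environ.get returns None when the variable is unset, and None has no .lower(). A defensive read avoids the crash by supplying a default (a minimal sketch, not the project's actual load_env.py):

import os

# Fall back to "false" when USE_MLOCK is missing from the environment/.env,
# instead of calling .lower() on None.
use_mlock = os.environ.get("USE_MLOCK", "false").lower() == "true"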

I am running CASALIOY through WSL on Ubuntu 22.04.2 LTS.

I was able to successfully run the ingestion script this morning against a 5 MB PDF, and the results were pretty good. I updated my repo to the latest version and am now getting this error, despite rebuilding the venv and running through the installation instructions to be on the safe side.

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY-main$ ls -R
.:
Dockerfile __pycache__ gui.py meta.json pyproject.toml startLLM.py
LICENSE convert.py ingest.py models source_documents tokenizer.model
README.md example.env load_env.py poetry.lock source_documents_old

./__pycache__:
load_env.cpython-310.pyc

./models:
PUT_YOUR_MODELS_HERE ggjt-v1-vic7b-uncensored-q4_0.bin ggml-model-q4_0.bin

./source_documents:
regex.txt

./source_documents_old:
sample.csv shor.pdf state_of_the_union.txt subfolder

./source_documents_old/subfolder:
Constantinople.docx 'LLAMA Leveraging Object-Oriented Programming for Designing a Logging Framework-compressed.pdf'
Easy_recipes.epub 'Muscle Spasms Charley Horse MedlinePlus.html'

SNIP

@su77ungr
Owner

What version of main are you running? We changed the runners to modules inside ./casalioy.

So your environment should at least be listing those. Also, you are likely missing /casalioy/ask_libgen.py.

We had an issue with a PR, so I had to revert some earlier changes. Aside from the GUI, this version of main should be stable.
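
Concretely, with the new layout the scripts live under ./casalioy and are invoked from the repo root as (same path that appears later in this thread):

python casalioy/ingest.py

rather than the old python ingest.py.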

@su77ungr su77ungr changed the title ingest.py - AttributeError: 'NoneType' object has no attribute 'lower' ingest.py - versioning May 15, 2023
@RHinDFIR
Author

What version of main are you running? We changed the runners to modules inside ./casalioy.

So your environment should at least be listing those. Also, you are likely missing /casalioy/ask_libgen.py.

We had an issue with a PR, so I had to revert some earlier changes. Aside from the GUI, this version of main should be stable.

(referencing commit f9cc180, parent e972eac)

I've cloned the most recent main commit and I'm running through the setup now. I'll let you know how it goes.

@RHinDFIR
Author

Side note: the .env example in the README.md doesn't reflect the suggested models to download and use:

[screenshot: the .env example from the README.md]

Nothing huge, just thought I should point it out.

@su77ungr
Owner

Oh, feel free to PR such things and I'll commit ASAP.

Are you running fine again?

@RHinDFIR
Author

Will do! I'm just heading home and will take a look at it once I get back.

"python -m pip install --force streamlit sentence_transformers" is taking quite some time to run, so I'll hopefully have some good news once I am home 👍🏻

@hippalectryon-0
Contributor

Jumping in late: you were just missing USE_MLOCK=true/false in your .env file.
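
That is, one added line in the .env file (either value works; true enables mlock):

USE_MLOCK=false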

@RHinDFIR
Author

Giving this another shot this morning. The installation completed, and I modified my .env to the following before attempting to run ./casalioy/ingest.py:

# Generic
MODEL_N_CTX=1024
TEXT_EMBEDDINGS_MODEL=models/ggjt-v1-vic7b-uncensored-q4_0.bin
TEXT_EMBEDDINGS_MODEL_TYPE=LlamaCpp  # LlamaCpp or HF
USE_MLOCK=true

# Ingestion
PERSIST_DIRECTORY=db
DOCUMENTS_DIRECTORY=source_documents
INGEST_CHUNK_SIZE=500
INGEST_CHUNK_OVERLAP=50

# Generation
MODEL_TYPE=LlamaCpp # GPT4All or LlamaCpp
MODEL_PATH=models/ggml-vic7b-q5_1.bin
MODEL_TEMP=0.8
MODEL_STOP=[STOP]
CHAIN_TYPE=stuff

I was then hit with this error:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY$ python casalioy/ingest.py
llama.cpp: loading model from models/ggjt-v1-vic7b-uncensored-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model
Traceback (most recent call last):
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/langchain/embeddings/llamacpp.py", line 78, in validate_environment
    values["client"] = Llama(
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 161, in __init__
    assert self.ctx is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 148, in <module>
    main(sources_directory, cleandb)
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 142, in main
    ingester.ingest_from_directory(sources_directory, chunk_size, chunk_overlap)
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 115, in ingest_from_directory
    encode_fun = get_embedding_model()[1]
  File "/mnt/h/LLM/CASALIOY/casalioy/load_env.py", line 44, in get_embedding_model
    model = LlamaCppEmbeddings(model_path=text_embeddings_model, n_ctx=model_n_ctx)
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1102, in pydantic.main.validate_model
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/langchain/embeddings/llamacpp.py", line 98, in validate_environment
    raise NameError(f"Could not load Llama model from path: {model_path}")
NameError: Could not load Llama model from path: models/ggjt-v1-vic7b-uncensored-q4_0.bin
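
For context, the load failure is llama.cpp rejecting the old on-disk format, as the "ggjt v1 (pre #1405)" line above shows. A quick way to check which container format a .bin file uses is to read its leading magic/version fields; the magic values below are taken from llama.cpp of that era, and this checker is only a sketch, not part of CASALIOY:

import struct
import sys

# Magic numbers of early llama.cpp model containers (little-endian uint32).
MAGICS = {0x67676D6C: "ggml (unversioned)", 0x67676D66: "ggmf", 0x67676A74: "ggjt"}

with open(sys.argv[1], "rb") as f:
    magic = struct.unpack("<I", f.read(4))[0]
    name = MAGICS.get(magic, f"unknown (0x{magic:08x})")
    if name in ("ggmf", "ggjt"):
        # Versioned containers store a uint32 version right after the magic;
        # ggjt v1 is the pre-#1405 layout rejected in the log above.
        version = struct.unpack("<I", f.read(4))[0]
        name = f"{name} v{version}"
    print(name)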

I cloned all-MiniLM-L6-v2 into my root CASALIOY directory, matched my .env file to the one in the README.md, and ended up with this:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY$ python casalioy/ingest.py
Traceback (most recent call last):
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 148, in <module>
    main(sources_directory, cleandb)
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 142, in main
    ingester.ingest_from_directory(sources_directory, chunk_size, chunk_overlap)
  File "/mnt/h/LLM/CASALIOY/casalioy/ingest.py", line 115, in ingest_from_directory
    encode_fun = get_embedding_model()[1]
  File "/mnt/h/LLM/CASALIOY/casalioy/load_env.py", line 41, in get_embedding_model
    model = HuggingFaceEmbeddings(model_name=text_embeddings_model)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/langchain/embeddings/huggingface.py", line 54, in __init__
    self.client = sentence_transformers.SentenceTransformer(
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 95, in __init__
    modules = self._load_sbert_model(model_path)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 840, in _load_sbert_model
    module = module_class.load(os.path.join(model_path, module_config['path']))
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 137, in load
    return Transformer(model_name_or_path=input_path, **config)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 29, in __init__
    self._load_model(model_name_or_path, config, cache_dir)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 49, in _load_model
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2542, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/mnt/h/LLM/CASALIOY/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 451, in load_state_dict
    raise OSError(
OSError: You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run `git lfs install` followed by `git lfs pull` in the folder you cloned.
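
The "invalid load key, 'v'" above is the giveaway: when a Hugging Face repo is cloned without git-lfs, the large weight files are left as small text pointer files whose first line starts with "version https://git-lfs.github.com/spec/v1", so torch.load reads the ASCII 'v' as a bogus pickle key. A quick check (a sketch; the path is illustrative):

# Detect a git-lfs pointer file sitting where real weights should be.
def is_lfs_pointer(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(7) == b"version"

print(is_lfs_pointer("all-MiniLM-L6-v2/pytorch_model.bin"))  # True -> `git lfs pull` needed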

Am I missing something super simple here? I ran the git commands suggested in the last error and was hit with:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY$ git lfs install
fatal: 'lfs' appears to be a git command, but we were not
able to execute it. Maybe git-lfs is broken?

@RHinDFIR
Author

RHinDFIR commented May 16, 2023

Ok, I think I may have sorted it.

I addressed the following issue by referring to this thread:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY$ git lfs install
fatal: 'lfs' appears to be a git command, but we were not
able to execute it. Maybe git-lfs is broken?

I installed git-lfs and then ran "git lfs pull" in the "all-MiniLM-L6-v2" repo I pulled from Hugging Face. Running ingest.py now gives me:

(casalioy-py3.10) user@DESKTOP-MPA3RT3:/mnt/h/LLM/CASALIOY$ python casalioy/ingest.py
Scanning files
regex.txt
Processing 1211 chunks
Creating a new collection, size=384
Saving 1000 chunks
   0.0% [>      
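
For anyone following along, the fix amounted to (standard git-lfs commands; the package name is from the Ubuntu repositories):

sudo apt-get install git-lfs
git lfs install
git lfs pull

with the last two run inside the cloned all-MiniLM-L6-v2 directory.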

The README.md file definitely needs updating, and I'll see if I can get to it later this afternoon.

@hippalectryon-0
Contributor

hippalectryon-0 commented May 16, 2023

I'm not sure I follow; isn't that a git-lfs issue? (Which isn't used in the README, nor required.)

PS: the part of the README about downloading models will be gone when #61 is merged

Edit: I just reread your original problem: you didn't actually mention using git lfs anywhere in the first place, so maybe it's just the error message that misled you. There's no need to use it.

@RHinDFIR
Author

RHinDFIR commented May 16, 2023

I'm not sure I follow; isn't that a git-lfs issue? (Which isn't used in the README, nor required.)

PS: the part of the README about downloading models will be gone when #61 is merged

Beat me to it, I saw your edits as I was typing out my response about the models in the README.

@hippalectryon-0
Contributor

hippalectryon-0 commented May 16, 2023

Note for people who have the same issue: the actual problem you had comes from this line:

error loading model: this format is no longer supported (see ggerganov/llama.cpp#1305)

This is because you're using (as in the README) the old q4 format instead of q5. We'll adjust the README.

@RHinDFIR
Author

RHinDFIR commented May 16, 2023

I stumbled on the same thread and re-edited my comment before closing this issue.

Everything is running smoothly and ingestion is now super fast. Keep up the awesome work!
