Large Language Models

Mika Hämäläinen edited this page Nov 3, 2024 · 11 revisions

UralicNLP provides a unified interface to a range of large language models (LLMs), covering both hosted APIs and local Hugging Face models. Each backend needs its own Python package (listed in the sections below), but once that is installed, every model is used through the same get_llm(), prompt() and embed() calls. You can therefore switch between LLMs and pick the one that best suits your needs without changing the rest of your code.

In short, once you have an llm object (created as shown in the sections below), you can do:

llm.prompt("my fancy prompt")
>>"Result from LLM"
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]
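
Because every backend is used through the same prompt() call, code written against one model can be run against several. The sketch below illustrates that idea with offline stand-ins: prompt_all() and EchoLLM are hypothetical names made up for this example and are not part of UralicNLP. With real backends, the dictionary values would be the llm objects described in the sections below.

```python
def prompt_all(llms, prompt):
    """Send the same prompt to every backend and collect the replies by name."""
    return {name: llm.prompt(prompt) for name, llm in llms.items()}


class EchoLLM:
    """Hypothetical offline stand-in that exposes prompt() like an llm object."""

    def __init__(self, tag):
        self.tag = tag

    def prompt(self, text):
        # A real backend would call a model here; this stub only echoes the text.
        return f"[{self.tag}] {text}"


# Assumption: real llm objects can be dropped in here instead of the stubs.
llms = {"a": EchoLLM("a"), "b": EchoLLM("b")}
replies = prompt_all(llms, "What is Skolt Sami?")
for name, reply in replies.items():
    print(name, reply)
```

Keeping the backends in a dictionary means that adding or removing a model is a one-line change.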

ChatGPT

To use ChatGPT, you need to pip install openai.

from uralicNLP.llm import get_llm
llm = get_llm("chatgpt", "YOUR API KEY", model="gpt-4o")
llm.prompt("What is Skolt Sami?")
>>"Skolt Sami is a Uralic, Sami language spoken by the Skolt Sami people, a small indigenous Finno-Ugric ethnic group ..."

You can use ChatGPT to get embeddings like this:

from uralicNLP.llm import get_llm
llm = get_llm("chatgpt", "YOUR API KEY", model="text-embedding-3-small")
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]

Gemini

To use Gemini, you need to pip install google-generativeai. Get the API key from AI Studio.

from uralicNLP.llm import get_llm
llm = get_llm("gemini", "YOUR API KEY", model="gemini-1.5-flash")
llm.prompt("What is Erzya?")
>>"Erzya is a **Finno-Ugric language** and the **cultural identity** of the **Erzya people**, one of the two main groups within the Mordvinic people. ..."

You can use Gemini to get embeddings like this:

from uralicNLP.llm import get_llm
llm = get_llm("gemini", "YOUR API KEY", model="models/text-embedding-004")
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]

Optionally, you can provide task_type to get_llm(). The default value is task_type="retrieval_document".
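
The default task type suggests a retrieval setup, where documents and queries are embedded separately and compared. The sketch below ranks documents by cosine similarity to a query. It rests on several assumptions: that embed() returns a flat sequence of numbers, as the outputs above suggest; that a second llm object can be created with task_type="retrieval_query" (a task type the Gemini API defines, but not confirmed here for UralicNLP); and ToyEmbedder, cosine() and rank_documents() are hypothetical names used only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query, documents, query_llm, doc_llm):
    """Return the documents sorted from most to least similar to the query."""
    q = query_llm.embed(query)
    scored = [(cosine(q, doc_llm.embed(d)), d) for d in documents]
    return [d for _, d in sorted(scored, key=lambda pair: pair[0], reverse=True)]


class ToyEmbedder:
    """Hypothetical offline stand-in: counts the letters a, b and c."""

    def embed(self, text):
        return [float(text.lower().count(c)) for c in "abc"]


# Assumption: with Gemini, doc_llm and query_llm would be two get_llm() objects,
# one with task_type="retrieval_document" and one with task_type="retrieval_query".
doc_llm = query_llm = ToyEmbedder()
ranking = rank_documents("aaa", ["bbb", "aab"], query_llm, doc_llm)
print(ranking)
```

In practice the document embeddings would be computed once and stored, since every embed() call on a hosted model costs an API request.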

Mistral

To use Mistral, you need to pip install mistralai.

from uralicNLP.llm import get_llm
llm = get_llm("mistral", "YOUR API KEY", model="mistral-small-latest")
llm.prompt("What is Komi-Zyrian?")
>>"Komi-Zyrian, often simply referred to as Komi, is a Uralic language spoken in the Komi Republic and some other regions of Russia. ..."

You can use Mistral to get embeddings like this:

from uralicNLP.llm import get_llm
llm = get_llm("mistral", "YOUR API KEY", model="mistral-embed")
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]

Claude

To use Claude, you need to pip install anthropic. Note that Claude does not support embeddings.

from uralicNLP.llm import get_llm
llm = get_llm("claude", "YOUR API KEY", model="claude-3-5-sonnet-latest")
llm.prompt("What is Tundra Nenets?")
>>"Tundra Nenets is an indigenous Samoyedic language spoken by the Nenets people in northern Russia, primarily in the Yamalo-Nenets ..."

Voyage AI

To use Voyage AI, you need to pip install voyageai. Note that Voyage AI does not support prompting.

from uralicNLP.llm import get_llm
llm = get_llm("voyage", "YOUR API KEY", model="voyage-3")
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]
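
Since Claude only supports prompting and Voyage AI only supports embeddings, one way to get both behind a single object is a thin wrapper that delegates each call to the backend that supports it. This is a minimal sketch: CombinedLLM and the two stub classes are hypothetical and not part of UralicNLP. With real backends, the two constructor arguments would be the objects returned by get_llm("claude", ...) and get_llm("voyage", ...).

```python
class CombinedLLM:
    """Hypothetical wrapper: prompt() goes to one backend, embed() to another."""

    def __init__(self, prompt_llm, embed_llm):
        self._prompt_llm = prompt_llm
        self._embed_llm = embed_llm

    def prompt(self, text):
        return self._prompt_llm.prompt(text)

    def embed(self, text):
        return self._embed_llm.embed(text)


class PromptOnlyStub:
    """Offline stand-in for a backend that can only prompt."""

    def prompt(self, text):
        return "reply to: " + text


class EmbedOnlyStub:
    """Offline stand-in for a backend that can only embed."""

    def embed(self, text):
        # Toy vector: one number per word, equal to the word length.
        return [float(len(word)) for word in text.split()]


llm = CombinedLLM(PromptOnlyStub(), EmbedOnlyStub())
print(llm.prompt("What is Tundra Nenets?"))
print(llm.embed("My great sentence"))
```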

Local models

To use models from Hugging Face, you need to pip install transformers.

from uralicNLP.llm import get_llm
llm = get_llm("microsoft/Phi-3.5-mini-instruct", max_length=20)
llm.prompt("What is Livonian?")
>>"What is Livonian? Livonian is an extinct Finnic language that was histor"

You can also get embeddings like so:

from uralicNLP.llm import get_llm
llm = get_llm("microsoft/Phi-3.5-mini-instruct")
llm.embed("My great sentence")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]

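It is possible to pass device to get_llm(). By default, the value is device=-1, which uses the CPU. If the value is forwarded to Hugging Face transformers unchanged (worth verifying for your version), device indices are zero-based: device=0 selects the first CUDA device and device=1 the second.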

Embeddings for an endangered language

Every LLM object also has a method called embed_endangered(). It takes text in an endangered language, the ISO code of that language, and the ISO code of a larger language that is used as the target for dictionary-based translation.

from uralicNLP.llm import get_llm
llm = get_llm("microsoft/Phi-3.5-mini-instruct")
llm.embed_endangered("Näʹde täävtõõđi âʹtte peeʹlljid pärnnses täävtõõđi.", "sms", "fin")
>>[-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377...]

The above example is for Skolt Sami (sms) using Finnish (fin) translations. The method works with any LLM that supports the embed() method.
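
To make the idea of dictionary-based translation before embedding concrete, here is a conceptual sketch. It is not UralicNLP's implementation, whose internals are not described here: it assumes a simple word-by-word lookup, and translate_with_dictionary(), embed_via_dictionary(), the toy dictionary entries and the toy embedding function are all invented placeholders.

```python
def translate_with_dictionary(text, dictionary):
    """Replace each word found in the dictionary; keep unknown words unchanged.

    Lookup is case-insensitive and ignores surrounding punctuation, so
    punctuation attached to a translated word is dropped.
    """
    translated = []
    for word in text.split():
        key = word.lower().strip(".,!?")
        translated.append(dictionary.get(key, word))
    return " ".join(translated)


def embed_via_dictionary(text, dictionary, embed_fn):
    """Translate word by word, then embed the translated text."""
    return embed_fn(translate_with_dictionary(text, dictionary))


# Placeholder entries, NOT real dictionary data for any language pair.
toy_dictionary = {"alpha": "yksi", "beta": "kaksi"}


def toy_embed(text):
    # Stand-in for llm.embed(): one number per word, equal to the word length.
    return [float(len(word)) for word in text.split()]


translated = translate_with_dictionary("Alpha gamma beta.", toy_dictionary)
vector = embed_via_dictionary("Alpha gamma beta.", toy_dictionary, toy_embed)
print(translated)
print(vector)
```

The motivation for such an approach is that an LLM trained mostly on larger languages is likely to produce more meaningful embeddings for text in a language it has seen a lot of.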