Some notes:
- Scriptorium version: Velasco v4.X (from the "Big Overhaul Update" on 27 Mar, 2019 until the 2nd Overhaul)
- Recognizable because, among other things, Readers are called Scribes and are stored in a big dictionary called the Scriptorium
- Overhaul 2 version: starting with Velasco v5.0
If you have a Velasco clone or fork from the Scriptorium version, you should follow these steps:
- First of all, update all your chat files to the CARD=v4 format. You can do this by making a script that imports the Archivist and then loads and saves all files.
- Then, pull the update.
- To convert files to the new unescaped UTF-16 encoding (previously the default, escaped UTF-8, was used), edit the `get_reader(...)` function in the Archivist so it uses `load_reader_old(...)` instead of `load_reader(...)`.
- Make a script that imports the Archivist and calls the `update(...)` function (it loads and saves all files); a sketch of such a script is included below.
- Revert the `get_reader(...)` edit.
And voilà! You're up to date. Unless you want to switch to the mongodb
branch (WIP).
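In case it helps, here is a minimal sketch of the conversion script mentioned in the steps above. The module name and the Archivist constructor argument are assumptions about the project layout; only the `update(...)` call itself comes from the steps.

```python
# Minimal sketch of the conversion script, assuming the Archivist class lives
# in archivist.py and takes the chat-file directory as its argument; adjust
# both to the actual code in your clone.
from archivist import Archivist

archivist = Archivist("chatlogs/")  # hypothetical constructor argument
archivist.update()                  # loads and saves every chat file
```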
This bot uses Markov chains of 3 words for message generation. For every 3 consecutive words read, it stores the 3rd one as a word that follows the first 2 combined. This way, whenever it is generating a new sentence, it always picks at random one of the stored words that follow the last 2 words of the message generated so far.
The actual messages aren't stored. After a message is processed and all its words have been assigned to lists under combinations of 2 words, the message is discarded, and only the dictionary with the lists of "following words" is stored. The words said in a chat may be visible, but from a certain point onwards it's impossible to accurately recreate the exact messages said in the chat.
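As an illustration of that idea (not Velasco's actual code), a trigram vocabulary and generator could be sketched like this; all the names here are made up for the example:

```python
import random
from collections import defaultdict

# (word1, word2) -> list of words seen following that pair
vocabulary = defaultdict(list)

def learn(message):
    words = message.split()
    for a, b, c in zip(words, words[1:], words[2:]):
        vocabulary[(a, b)].append(c)  # only this dictionary is kept, not the message

def generate(seed, max_words=50):
    sentence = list(seed)  # seed is a (word1, word2) pair to start from
    while tuple(sentence[-2:]) in vocabulary and len(sentence) < max_words:
        sentence.append(random.choice(vocabulary[tuple(sentence[-2:])]))
    return " ".join(sentence)
```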
Saving happens sometimes when a configuration value is changed, and whenever the bot sends a message. If the bot crashes, all the words processed from messages since the last one from Velascobot will be lost. For high `period` values this could be a considerable amount, but for small ones it is negligible. Still, the bot is not expected to crash often.
The memory of a `Speaker` is a small cache of the `C` most recently modified `Readers` (where `C` is set through a flag; the default is `20`). A modified `Reader` is one where the metadata was changed through a command, or a new message has been read. When modifying a new `Reader` would go over the memory limit, the oldest modified `Reader` is pushed out and saved into its file.
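This is essentially a small LRU cache keyed by chat. A rough sketch of the behaviour, with illustrative names rather than the actual `Speaker` code:

```python
from collections import OrderedDict

class ReaderMemory:
    """Keeps the C most recently modified Readers; older ones are saved and dropped."""

    def __init__(self, capacity=20, save_reader=None):
        self.capacity = capacity        # the "C" flag, default 20
        self.readers = OrderedDict()    # chat id -> Reader, oldest modified first
        self.save_reader = save_reader  # e.g. an Archivist saving function

    def touch(self, chat_id, reader):
        # Called whenever a Reader's metadata changes or it reads a new message.
        self.readers[chat_id] = reader
        self.readers.move_to_end(chat_id)
        if len(self.readers) > self.capacity:
            old_id, old_reader = self.readers.popitem(last=False)
            if self.save_reader:
                self.save_reader(old_id, old_reader)  # pushed out and saved to its file
```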
When a message is read, it gets stored in a temporary cache. It is only processed into the vocabulary `Generator` when the `Reader` is asked to generate a new message, or whenever the `Reader` gets saved into a file. This allows the bot to reply to other recent messages, and not just the last one, when the periodic message is a reply.
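A sketch of that deferred processing, again with illustrative names (`learn(...)` stands in for whatever the `Generator` uses to process a message):

```python
class Reader:
    def __init__(self, metadata, generator):
        self.metadata = metadata
        self.generator = generator
        self.pending = []            # temporary cache of unprocessed messages

    def read(self, text):
        self.pending.append(text)    # stored, but not learned yet

    def commit(self):
        # Called right before generating a new message or saving to a file,
        # so several recent messages are still available to reply to.
        for text in self.pending:
            self.generator.learn(text)
        self.pending.clear()
```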
- `Generator` is the object class that holds a vocabulary dictionary and can generate new messages.
- `Metadata` is the object class that holds one chat's configuration flags and other miscellaneous information.
  - Sometimes the file where the metadata is saved is called a `card`.
- `Reader` is an object class that holds a `Metadata` instance and a `Generator` instance, and is associated with a specific chat.
- `Archivist` is the object class that handles persistence: loading and saving files.
- `Speaker` is the object class that handles all (or most of) the functions for the commands that Velasco has.
  - It holds a limited set of `Readers` that it loads and saves through some `Archivist` functions (borrowed during `Speaker` initialization).
- `velasco.py` is the main file, in charge of starting up the Telegram bot itself.
After managing to get Velasco back to being somewhat usable, I've already stated in the News channel that I will focus on rewriting the code in a different language. Thus, I will add no improvements to the Python version from that point onwards. If you're interested in picking this project up and continuing development in Python, here are a few suggestions:
- The `speaker.py` file is too big. It would be useful to separate it into 2 files: one with the surface command handling, and another one that does all the speech handling (doing the checks for the `restricted` and `silenced` flags, the `period`, the random chances, ...).
- For a while now, Telegram has allowed downloading a full chat history as a compressed file. Being able to send that compressed file to the bot, have it make sure that it really is a Telegram chat history file, and then unpack and load it into the chat's `Generator` would be cool (a sketch follows this list).
- The most active chats have files that are too massive to keep in the process' memory. I will probably add a local MongoDB database to solve that, but it will be a simple local one. Expanding it could be a good idea.
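For the chat-history suggestion, a possible starting point is sketched below. It assumes the uploaded file is a zip of Telegram's JSON export (which contains a `result.json` with a `messages` list), and `generator.learn(...)` is a stand-in for feeding text into the chat's `Generator`:

```python
import json
import zipfile

def load_history(zip_path, generator):
    with zipfile.ZipFile(zip_path) as archive:
        # The JSON export keeps the whole history in a result.json file,
        # possibly nested inside an export folder.
        name = next(n for n in archive.namelist() if n.endswith("result.json"))
        with archive.open(name) as f:
            history = json.load(f)
    for message in history.get("messages", []):
        text = message.get("text")
        if isinstance(text, str) and text:  # skip media-only / formatted entries
            generator.learn(text)
```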