-
-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Improving the function for calculating difficulty #352
Comments
While that would be great, all my attempts to integrate R into D have failed, and my attempts to change the formula completely and use a formula based on B-W (like in SM-18) have also failed.
That depends on the material though. If you are learning something that is very hard, why wouldn't most values of D be close to 10?
Honestly, I'm not convinced that reversion to a fixed value is a good idea. I've experimented with reversing to a value that changes, making so that instead of all cards reversing to the same "mean" value, each card would have it's own "mean", but that also didn't work very well.
That's an interesting idea.
Since optimizable grades didn't work, I'm willing to bet this won't work either, since it's basically the same idea. Honestly, I'm pretty tired of trying to improve D, there are other things I want to work on, like the matrices. |
I guess a good way to take A surprise function I've come up with is The way I think this new function should be used is as a new weight in the weighted average that the In order to use the surprise function as a weight it should be normalize Finally, we can connect the three posibilities this way: I have chosen 2 as the midpoint grade of the surprise function as it is usually not seen as a success nor as a lapse. |
This comment was marked as resolved.
This comment was marked as resolved.
I tested it using this file (I made a copy): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/fsrs4anki_optimizer_beta.ipynb I'm getting more and more perplexed as to why nothing we do to modify the formula for D improves RMSE. I could make an entire list of ideas related to D, all of which failed. |
Maybe we can change our reasoning thread to another way. Why not use a similar method like fitting initial stability to determine the initial difficulty. Is it possible to find out the initial difficulty via the current data and formula? If we make progress in fitting the initial difficulty, we are more likely to find out the formula to update the initial difficulty. |
Initial D doesn't matter as much as initial S. I would really like to see a hybrid model, with all FSRS formulas, except that D is given by an LSTM. Though I'm not sure if it's possible in pytorch, you would have to optimize the parameters of FSRS and of LSTM simultaneously. |
I have simplified the use of the surprise function and added some more weights. The surprise function is now just a companion to I got a decrease in RMSE of 2% with my dataset. Specially a decrease of 5% with a deck that I stopped studying for some time and then returned, so the effect of different retrievabilities is bigger. I'm not sure if the improvements are big enough. Actually, they should not be big for a dataset in which most reviews are done in time, since there is no variability in retrievability. This change becomes relevant in datasets where there have been a period long enough without studying in order to provide low retrievabilities. This is the code I used if you want to test it with your datasets: def step(self, X: Tensor, state: Tensor) -> Tensor:
'''
:param X: shape[batch_size, 2], X[:,0] is elapsed time, X[:,1] is rating
:param state: shape[batch_size, 2], state[:,0] is stability, state[:,1] is difficulty
:return state:
'''
if torch.equal(state, torch.zeros_like(state)):
keys = torch.tensor([1, 2, 3, 4])
keys = keys.view(1, -1).expand(X[:,1].long().size(0), -1)
index = (X[:,1].long().unsqueeze(1) == keys).nonzero(as_tuple=True)
# first learn, init memory states
new_s = torch.ones_like(state[:,0])
new_s[index[0]] = self.w[index[1]]
new_d = self.w[4] - self.w[5] * (X[:,1] - 3)
new_d = new_d.clamp(1, 10)
else:
r = power_forgetting_curve(X[:,0], state[:,0])
a = torch.exp( - self.w[18] * (r - self.w[19]) * (X[:,1] - self.w[20]))
new_d = state[:,1] - self.w[6] * a * (X[:,1] - self.w[17])
new_d = self.mean_reversion(self.w[4], new_d)
new_d = new_d.clamp(1, 10)
condition = X[:,1] > 1
new_s = torch.where(condition, self.stability_after_success(state, new_d, r, X[:,1]), self.stability_after_failure(state, new_d, r))
new_s = new_s.clamp(0.1, 36500)
return torch.stack([new_s, new_d], dim=1) |
Not exactly because three different algorithms are involved here, i.e., Anki SM-2, FSRS v3 and FSRS v4. The reviews were done according to the algorithm in use at that time. Now, all the reviews are analysed using FSRS v4. So, the R at the time of review would vary.
Can you pull the latest commits from the main branch on open-spaced-repetition/fsrs-optimizer? It would make the comparisons fairer. |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
I don't know how to test it the "proper" way, so I tested it my way again: I made another copy of https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/fsrs4anki_optimizer_beta.ipynb and implemented the code there. @L-M-Sherlock I am once again asking you to make a hybrid LSTM-FSRS version where all formulas (such as S for lapses and S for successful reviews) are the same, except that the value of D is given by a neural network (assuming it's possible to optimize such a hybrid in pytorch). This isn't just to satisfy my curiosity; it will answer an important question: "Can we change how the value of D is calculated to improve the accuracy of FSRS without completely changing the formulas themselves?". If RMSE is the same as before, that would indicate that rather than adding more factors and terms to the already existing formula for D, we should completely redefine D. In other words, if even a neural network cannot provide a value of D (between 1 and 10) that improves RMSE, then our attempts won't either. |
Why should the neural network be LSTM? I don't know what the input and output are for that LSTM. |
LSTMs are specifically designed for time series. Input: interval lengths and grades, output: a number between 1 and 10, which you then use in formulas for S. |
Do you mean that LSTM should output a serie of difficulty for a sequence of intervals and grades. Then, the sequence of difficulty is used as the input of FSRS? |
Yes. The LSTM should replace |
I tested @hydrogs' version with my collection. @Expertium, the proper way to test it is to insert the following line of code at the same place as described by Sherlock here: open-spaced-repetition/fsrs-optimizer#16 (comment) %pip install git+https://github.com/hydrogs/fsrs-optimizer@feat/difficulty For my collection, this change increased the RMSE. With @hydrogs' version
Original optimizer:
|
@L-M-Sherlock just a reminder to implement my idea with LSTM. Input: grades and interval lengths, output: a number between 1 and 10. It should replace the formula for |
This comment was marked as outdated.
This comment was marked as outdated.
I still think that we should try what I suggested here: #352 (comment), #352 (comment) |
|
I just want to echo what @filipenanclarez said above, with some more of my own thoughts, A lot of my decks have explicit siblings, which are almost certainly affecting the stability of each other when reviewed, so I imagine there's always going to be some loss one can't reduce due to this alone. Also, many of my cards have what I like to call "conceptual siblings". These being cards which are not explicit siblings, but which, nevertheless, have relationships to other cards, and their review has a graded effect on the stability of other "conceptual siblings". For instance, a deck full of vocabulary, but also example sentences, would have many of these relationships. Perhaps I should separate these out from the vocab, but really, they are affecting my retention of their "conceptual siblings" no matter what deck I place them in. @L-M-Sherlock mentioned, in a reddit thread I brought this up in, that one might use graph neural nets to begin to broach this problem (eventually), and that seemed possibly tractable to me. But first, lets try to think it through with good ol' graphs, no neurons required. Imagine you create a DAG where every pair of explicit sibling has two directed edges between them with some edge weights, you could then optimize the usual FSRS weights as normal, but also add some logic that makes it so that reviewing a card with an edge on the DAG to the current card improves the current card's stability (and possibly lowers difficulty) by some amount parameterized by the weight of the edge. This would allow "explicit sibling" review effects to be learned per note and cards. So far this doesn't seem too completely intractable or prone to the curse of dimensionality (famous last words). But I do imagine the "per card" basis might make it more likely to overfit, as training weights per card negates the benefits of the train vs test set (and would be a breaking change to the number of weights, lol). Perhaps it would be required to have an "explicit sibling" weight be learned per deck rather than per card relationship, though I can imagine scenarios where different types of sibling in one deck do not all convey the same stability benefits. However, next, we might think about also learning the edges and weights representing the directed-memory-strengthening-relationships for the "conceptual siblings". We may even be able to keep the mathematical logic the same as with the explicit siblings, but instead just attempt to add edges and weights that might help the loss. This is, of course easier said than done, and seems like a mathematically unusual problem that might require novel techniques to solve, especially if done in tandem with optimizing the normal FSRS parameters. Trying to use stochastic optimization here would be very slow, as there are n*(n-1)/2 possible edges for a deck of n cards. I'm attempting to learn more about current methods in "Latent Graph Inference", although I think current techniques might require mapping this problem from a plain ol' graph problem into a graph neural net problem, which, doesn't seem entirely impossible now, as graph neural nets seem particularly suited to this. Perhaps it involves thinking of each node in the graph as the card and card's history, and every historical day that passes is a time step for the graph, and every review during that historical day "flows" stability/difficulty information along the GNN (with each node also having a weight 1 edge to itself), I'll have to continue to read more to see if this conception makes any sense. Here's a particularly dense paper I'm working my way through: https://arxiv.org/pdf/2211.16199.pdf All that to say, I think siblings play a big part in any SRS inaccuracy. As well as the fact that you can never track all offline "retention events" (spontaneous recall in conversation, teaching someone about what you learned, foreign language practice, target language media consumption, etc.), which all seem like they necessarily will introduce a lot of inaccuracies that can't be mathematically modeled. However, siblings do seem like an interesting problem worth attacking eventually, especially if it requires using or creating some novel and fun techniques. :) |
I think there is actually a neural network that outputs a "similarity score" based on the card's text. @L-M-Sherlock I can't find it, but I think I saw it when browsing your profile on github. EDIT: https://github.com/thiswillbeyourgithub/AnnA_Anki_neuronal_Appendix |
I think that attempting to consider the effects of reviewing related cards is beyond practical use. People would inevitably interact with material they study outside Anki which will affect memory states, e.g. reading a book in a foreign language should change stability of cards with words that were encountered in the reading. I assume that unless someone is deliberately studying something very abstract which has absolutely no practical use in their life, the effects of reviewing sibling cards would be negligible compared to those of applying their skills in practice. These effects are impossible to account for, and instead, the best that could be done is to translate user's subjective perception of how well they know the answer, which comes in form of review grades, to changes in difficulty and stability. |
Of course if something happened outside of Anki, it's impossible to account for. But if something happened within Anki - such as reviewing a conceptually similar card - it should be possible to use the information about that event in scheduling. |
This is awesome! I didn't include in my spiel, but did have the thought that maybe some clever similarity metric might be a good place to start for the graph's edge weights, and then perhaps some optimization of the weights for magnitude of the effect, based off the time series data could be useful.
I would agree that the effect of real word practice is likely large and unknowable from the optimizer's point of view (but can likely be broadly optimized as deck ease), but the effect of conceptual neighbors is still very important, and I believe a major "cache" of loss that can be "harvested". Imagine you have a deck that has some accidental duplicates. For instance, I have a large deck about sign language that includes slight variations of specific signs. Some are so close to each other that reviewing one is basically like reviewing the other (perhaps only a slight change in facial expression, but still a difference worth noting). These are surely stored in my brain as the same memory, and reviewing one card is, surely, nearly identical to reviewing the other (with respect to the memory's stability). I would think an optimal system would be able to tease this relationship out, and essentially treat reviewing this one as reviewing the other. That is perhaps a situation where the weight between the nodes of the graph is 1, (like perhaps the self-edge also has a weight of 1 to represent a review's effect on itself) whereas there are situations where the effect of reviewing a conceptual neighbor is not as marked. Like maybe a certain example sentence uses a specific sign that is similar to a vocab card, but not quite, but it sometimes makes you think of the sign, and stabilizes your memory a bit. This would be a situation where the edge weight is quite low, but nonzero. This type of all neighbor/conceptual sibling type of optimization seems a lot more prone to the curse of dimensionality, especially if one was trying to optimize individual edge weights. But, I would imagine if you ran the aforementioned "text similarity neural net" once, and then used those edge weights with a few changes to the stability/difficulty mechanism to account for it, and optimized some parameters in that mechanism globally, you would see some improvement in loss for a lot of decks. And if this were the case, I'd think that would be a good sign that optimizing individual edge weights might be beneficial as well. One might just need to have the train and test set be temporally sampled rather than by cards (if this isn't already the case). (i.e. test set is the most recent time slice of the dataset, whereas training set ends before then), as all cards need to be represented in the training set so their edge weights can be established. |
So out of curiosity, I decided to combine the three most promising (on paper) ideas:
This introduces 11 new parameters: 4 for each grade, 4 for each value of initial D, and 3 for the surprise function. This has added more flexibility than ever, as well as incorporated R into D. Note that I'm still using |
One more thing that, I think, is worth trying is not excluding same-day reviews but only adjusting difficulty without changing stability. The logic is that if the user pressed "Hard" (or "Again") two or more times when seeing the card on the same day, this card must be more difficult than a card such that the user pressed "Hard" only once. |
This is true for sure. Cards that in learning that I have pressed 10 times in the same day again instead of once. Are way harder. I noticed that fsrs needs a lot of time to catch up to those. |
Hi, I'm the author of AnnA and was brought by @brubsby I'm a medical student very into datascience. I created AnnA to reduce the amount of reviews I had to do. But along the way I experimented with other use of my vectorization engine : for tagging the K nearest neighbors of each cards, for tagging the cards by the cluster semantics, to create semantic plots etc. I was not really aware of the potential of FSRS until @brubsby talked me into it and I'm now very interested! I probably won't have time to dig into the whole thread here but am happy to answer any question you might have and can add features if that helps. I don't have too much but I figured you might be interested in some details about AnnA. Quick summary of how AnnA works : it's a python script that gathers cards either from ankiconnect or from ankipandas (as ankiconnect is a serious speed bottleneck), use a field_mapping file to know which field of which notetype to take into account, apply a bunch of text processing (basically html stripping, I also supported at some point stemming, tokenizing, probably a few other things), then vectorize each card. The result is a 2D 'distance matrix' for the selected deck. Vectorization details: I supported several engines but mainly use either tokenization + TFIDF (universal, but not as semantic as I'd like it to be, also support ponderation of the fields) or sbert embeddings (multilingual models have low text size so I did a sort of rolling window + pool) Perf: t's not actually that intensive to compute the distance matrixexcept for very large decks but then you could use faster function to only find the n closest cards, and don't actually have to rerun the script very often. In one year from now I'll have finished my exam and plan to make AnnA very easy to use for everyone, (as well as many other of my hidden FOSS Ai Anki projects :) ) but won't have time for now. I know the code is a bit terrible at times but I had to make a tradeoff as I'm sure you can understand. Cheers ! |
@L-M-Sherlock I know this is fairly low on your priority list, but we could enhance "Delay siblings" feature with this. |
From my simple tests it looks like how currently D is used in stability_after_success and stability_after_failure promote the optimizer to shift most cards for most users to D=10.
wants to be as close to 1 as possible and
wants to be as close to max to achieve the smallest number with pow I tried a simple change to rescale D to 0.0-1.0 and this frees the difficulty it seems but the metrics are a bit worse across the board. |
The current formula for D in FSRS is just a heuristic one. We have already tried a lot of approaches to improve the calculation of difficulty. But all of them failed.
While developing a better function for D, we should keep in mind the following:
Again
is the first rating, it should not have as much impact on the D as the lapses have. Reason: You might have created the card several weeks ago and you have almost no remaining memory of it when you review it for the first time. The extreme case would be where the card was created by someone else, and the first review is the first time you encounter the information. So, anAgain
rating here doesn't mean that the card is difficult.The text was updated successfully, but these errors were encountered: