How to keep queue consistence #208
-
I've been thinking of an alternative design. I have an idea but it's only a draft. I have to review whether it makes sense or not. These are the key points:
And this could be a different option (pending to validate):
Some of the
More notes:
-
@josecelano Interesting. I really like your perspective of seeing a git commit as a snapshot: it is a much more representative mental model than the one I previously had. If we consider commits as snapshots that can be retaken, that reference none, one or many parents, or that can even be abstractly replaced entirely, then we should be able to take any snapshot and determine the state of the queue for that snapshot. A snapshot could have either a consistent or an inconsistent git-queue status. It is very possible that something goes wrong somewhere and the queue(s) associated with a particular snapshot end up in an invalid state. We should detect any inconsistencies. It begins to look like we have two fundamental options with git-queues:
We need to consider the tradeoffs between these two approaches. If some of the queue data is stored in a different location from the git branch, we then need to consider how to link these locations (bidirectional linking?), and what happens when they become out of sync for any reason...
-
I've found out that some projects are using custom git references to store simple information. inside the
Another alternative to store data with Git is annotated tags.
-
I've also found this interesting video: Using Git as a NoSql Database by Kenneth Truyers. They implemented this NoSQL database using Git. I think they only implemented a "lock" mechanism per branch using C#.
-
I've been trying to define a new design for the queue based on storing jobs as files instead of empty commits. I hope it is useful, at least to discuss the pros and cons of this option.

Proposal using files to store job current state

NOTES

The storage structure:

.
└── .queue
    ├── 53b3ec1a908f8504e353f90ac43d680c7798199b
    │   ├── job-1.json
    │   ├── job-2.json
    │   ├── job-3.json
    │   └── queue.json
    └── .gitkeep

The queue file (queue.json):

{
  "id": "53b3ec1a908f8504e353f90ac43d680c7798199b",
  "name": "update_artwork"
}

The job file (for example job-1.json):

{
  "id": "c7172d00-7dc4-403a-b438-835facbc3b62",
  "index": "1",
  "state": "pending",
  "parent_queue_commit": "3a032b0992d7786b00a8822bbcbf192326160cf9",
  "queue-id": "53b3ec1a908f8504e353f90ac43d680c7798199b",
  "payload": "payload"
}

The commit body for job commits (commits created by the worker to perform the job) could contain links to the queue and job:
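A minimal sketch of what such a commit body could look like, run in a throwaway repository. The trailer names `Queue-Id` and `Job-Id` and the commit subject are hypothetical, chosen only for illustration; the ids are the ones from the JSON examples above.

```shell
# Throwaway repository so the example does not touch real data.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "worker"
git config user.email "worker@example.com"

queue_id=53b3ec1a908f8504e353f90ac43d680c7798199b
job_id=c7172d00-7dc4-403a-b438-835facbc3b62

# Each -m becomes a paragraph: the first is the subject, the second is
# the body holding the (hypothetical) links to the queue and the job.
git commit -q --allow-empty \
  -m "update artwork submodule" \
  -m "Queue-Id: $queue_id
Job-Id: $job_id"

# Read the links back from the commit body.
body=$(git log -1 --format=%b)
echo "$body"
```

With links like these, a tool could start from any job commit and find the queue directory and job file it belongs to.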
-
More comments on multi-job feature

We are implementing a new feature to allow more than one pending job. Although it's a generic nice-to-have feature, I think we can't use it in our use case: updating a consumer repo that consumes a library repo as a git submodule. We are currently using the git-queue in this website project. The process works like this:
Right now there are some rules:
Conclusion: Even if we modify the queue to allow more than one pending job, we should not add more than one pending job in that case, because we do not know if the intermediary steps (snapshots) would make things faster.
-
Projects using git to store issues:
-
@josecelano Very interesting, good find! I knew about https://pagure.io/pagure, which keeps all the comments in git repos; however, that is centralized. git-bug's internal model document, https://github.com/MichaelMure/git-bug/blob/master/doc/model.md, is very interesting! :) They also reference their use of a conflict-free replicated data type (CRDT). That in itself is very interesting. :)
-
Hi @da2ce7 @yeraydavidrodriguez

I've been trying to summarize what I've learnt about using Git objects and references to store your app data.

Using Git as a key-value database

Where to store things?

With Git you can store data in two different ways:

Git objects

Git internally uses a key-value database with only 4 types of objects: blobs, trees, commits and annotated tags. Each object is stored in the database, and the way to reference an object is by its sha1 (a checksum of the content). For example, you can insert a new object in the database with:

cd /tmp
mkdir test
cd test
git init
echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4

The last line is the sha1 of the new object. You can get the object with:

git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4

This database is immutable. You can only add new content.

References

Git has a directory, .git/refs, for references.

If you want to get the content from the previous example but do not want to use the sha1, you could create a reference like this:
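A minimal sketch of that step, assuming the reference is created with `git update-ref`. The reference name `refs/my-objects/object-1` is the one used in the rest of this post; the temporary directory is only there to keep the example self-contained.

```shell
# Throwaway repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store the blob and keep its hash.
sha=$(echo 'test content' | git hash-object -w --stdin)

# Point a custom reference at the blob. Updating the reference later to
# another hash is what makes this part of the "database" mutable.
git update-ref refs/my-objects/object-1 "$sha"

# The reference now resolves to the blob.
git cat-file -p refs/my-objects/object-1
```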
The reference is a pointer or alias for the object hash you have inserted in the database before. And now you can use the reference to get the content:

$ git cat-file -p refs/my-objects/object-1
test content

This database is mutable. You can change references to point to different Git objects. You can create new references in any fork of the repo and push them to any remote repo. There are some special references handled by Git:

$ tree -al .git/refs/
.git/refs/
├── heads
├── remotes
├── my-objects
│   └── object-1
└── tags

Branches are only references pointing to the latest commit object hash in a sequence of commits. When you run a

How to store things?

Git provides only those two basic low-level databases to store things: objects and references. You can use them in different ways. We have seen at least two different models:

Store only the latest state

You can store your data inside blob objects. When you want to update your object, you store a new blob object. You can use a reference to retrieve the blob object. You also have to prevent the Git garbage collector from removing the object: if the blob object is not referenced anywhere, it could be deleted.

Pros

Cons

Store state change history (commits)

The previous solution allows you to store objects like a key-value database. But you can take full advantage of Git by using the other objects available. When you update a blob object you could create a commit. This model would be the same as using an orphan branch for each object.

cd /tmp
mkdir test
cd test
git init
git checkout --orphan my-objects-object-1
echo "test content" > object-1.txt
git add object-1.txt
git commit -m "add object-1"

You can get the object with:

git checkout my-objects-object-1 && cat object-1.txt

$ tree -al .git/refs/
.git/refs/
├── heads
│   └── my-objects-object-1
└── tags

Pros

Cons

How to solve race conditions

We have seen two possible models to use Git to store your data as a key-value database. But does this DB implementation offer you a way to handle race conditions? At some point, you are going to have two processes reading the same object and trying to update it at the same time. One of them is going to update from a stale value. We can create an example where we have a “table” with counters. We insert the object with the counter starting at 0.

cd /tmp
mkdir my-counters
cd my-counters/
git init
git checkout -b my-counters-counter-1
echo "0" > counter-1.txt
git add -A
git commit -m "initialize counter-1"

Independent processes could check out the repo and increment the counter. After cloning the repo you may have a stale version of the data, because other processes could have cloned the repo and updated the counter in the meantime.

Optimistic Concurrency Control

With normal databases we usually have two options to fix that problem: you can either lock the record when you want to modify it (pessimistic locking), or always try to modify it and make the update fail if the record has changed (optimistic locking). Git only allows us to use the “optimistic approach”. When you try to “push” your object version by updating the reference in the origin repo, you will get an error if the reference (branch) was already changed.

Increment the counter in process 1:

cd /tmp
git clone /tmp/my-counters my-counters-process-1
cd /tmp/my-counters-process-1/
git checkout my-counters-counter-1
echo "1" > counter-1.txt
git add counter-1.txt
git commit -m "increment counter to 1"

You can do the same with a clone for process 2. Then you can push the changes from process 1. If you try to push the counter to the origin repo, you could get this error:

git push origin
…
! [remote rejected] counter-1 -> counter-1 (branch is currently checked out)
error: failed to push some refs to '/tmp/my-counters'

You only need to check out a different branch in the origin repo. After pushing from the process 1 clone, if you try to push from the second one you will see the Git error:

On branch counter-1
Your branch is ahead of 'origin/counter-1' by 1 commit.
(use "git push" to publish your local commits)
nothing to commit, working tree clean
$ git push
To /tmp/my-counters
! [rejected] counter-1 -> counter-1 (fetch first)
error: failed to push some refs to '/tmp/my-counters'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

You cannot update the reference because the previous commits created by process 1 would be lost. Git does not allow you to update the object if someone else has already changed it. So basically, if you use Git to store your objects, the only mechanism to avoid race conditions is optimistic locking. This way of using Git is like having an SQL table where you have version numbers for record updates by default. That means every time you read a record you get the version number. When you update the record you check that the version number is still the same. The only difference compared with the normal SQL optimistic lock approach is that, when the update is rejected, the new version of the object is still stored locally, but the reference (pointer) in the origin repo is not updated to point to it. So readers will continue retrieving the previous version.

How to design your unit of work

Since you can only control concurrency at the object level (references), you have to make sure that you design your aggregates in a way that each aggregate is a different reference (or branch if you use the second model). What does that mean? In the same example, you can have strict domain rules between counters. For example:
In those cases, you would have to define an object like a "pool of counters" and store all the values in the same object (branch). I recommend reading these articles to know the trade-offs of aggregate design.

How to store the Git Queue

We have not yet defined the final requirements for Git Queue 2.0. Some open questions are:
If we do not have any invariant between jobs, we could store each job in a different orphan branch. That way we can reduce conflicts and get better performance. On the contrary, if we have some invariants between jobs, then we have to store the whole queue in the same branch, so that we do not end up with inconsistencies. We could decide the storage method depending on the invariants declared in the queue configuration at runtime. For example, the user could declare the queue as a queue whose “jobs do not have any dependency”.

Atomic push

Job consumers (workers) have to process the jobs. In our case (update Git submodule), the result is a set of commits you want to merge into the target branch. We want to update the queue and merge the “job commits” atomically; otherwise, you can have inconsistencies between the job done and the queue state. If you push the job commits but you cannot push the job update, the worker will try to process the same job again. On the contrary, you could push the job update to the queue branch but get an error pushing the job commits. The queue update and the new commits have to be pushed atomically to their branches. Fortunately, Git has a push option “--atomic” that does exactly that: “Either all refs are updated, or on error, no refs are updated”. So you could do something like:
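A minimal sketch of such an atomic push, using throwaway repositories. The branch names `target` and `queue` and the commit messages are hypothetical, and the local bare repository only stands in for the real remote.

```shell
# A bare repository plays the role of the remote, a second one is the worker.
base=$(mktemp -d)
git init -q --bare "$base/origin.git"
git init -q "$base/worker"
cd "$base/worker"
git config user.name "worker"
git config user.email "worker@example.com"
git remote add origin "$base/origin.git"

# Result of the job on the (hypothetical) target branch.
git commit -q --allow-empty -m "job commits: update submodule"
git branch -M target

# State change of the job on the (hypothetical) queue branch.
git checkout -q --orphan queue
git commit -q --allow-empty -m "job 1 finished"

# Both refs are updated together; if one update is rejected, neither is applied.
git push -q --atomic origin target queue
```

If another worker has already moved either branch on the remote, the whole push is rejected, so the queue state and the job commits cannot drift apart.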
More info about Atomic pushes:
Projects using Git objects and references to store data
Talks

Articles
-
@da2ce7 has defined some cases we have to solve when queue messages are moved from one branch to another.
I would add some more notes about this topic:
work-allocator, which creates the new jobs, and a worker, which processes the jobs and marks them as finished.
📝🈺: update_artwork: job.id.1 job.ref.ddc123312232baa3acb2342562fbc4535ccc234
▶️ : update_artwork
▶️ : update_artwork
📝
📝⏸: update_artwork
📝✅: update_artwork: job.id.2: job.ref.232baa3acb2342562fbc4535ccc234ddc123312
source code commits ...
📝👔: update_artwork: job.id.2: job.ref.232baa3acb2342562fbc4535ccc234ddc123312
📝🈺: update_artwork: job.id.2: job.ref.232baa3acb2342562fbc4535ccc234ddc123312
📝✅: update_artwork: job.id.1: job.ref.1e31b549c630f806961a291b4e3d4a1471f37490
source code commits ...
📝👔: update_artwork: job.id.1: job.ref.1e31b549c630f806961a291b4e3d4a1471f37490
📝🈺: update_artwork: job.id.1: job.ref.1e31b549c630f806961a291b4e3d4a1471f37490
📝
We wanted to track not only code changes but also the reason why that code was changed. Why? Because some changes can be complex and they might be done automatically by other apps (GitHub bots, workflows, etcetera).
work-allocator and worker call some commands in the GitHub action. The side effect is that some messages are recorded in the commit message of an empty commit. The queue can know the current state of a job by reading all the events (messages) related to that job.
This would be a simple and good solution if:
git commit --amend, git rebase, ...
In general, the problem is that we cannot guarantee the queue consistency, because the commits belonging to a queue can be changed while a merge, amend, rebase, or other Git operation is executed.
In the end, it seems we need an app on top of Git that checks the queue integrity after any Git command. It should fix or warn when there is a queue error, for example, a different queue in a feature branch has not been "deallocated/stopped/finished".
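As a rough sketch of what such a check could look like, the script below lists jobs that were created but never finished in the current branch. It assumes that `📝🈺` marks a created job and `📝✅` a finished one, as the log above suggests, and that every message carries a `job.id.N` token; the function name `unfinished_jobs` and the simulated history are hypothetical.

```shell
# Throwaway repository with a simulated queue history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "checker"
git config user.email "checker@example.com"

# Job 1 is created and finished; job 2 is only created.
git commit -q --allow-empty -m "📝🈺: update_artwork: job.id.1"
git commit -q --allow-empty -m "📝✅: update_artwork: job.id.1"
git commit -q --allow-empty -m "📝🈺: update_artwork: job.id.2"

# Print the ids of jobs that have a "created" commit but no "finished" commit.
unfinished_jobs() {
  subjects=$(git log --format=%s)
  echo "$subjects" | grep -F '📝🈺:' | grep -o 'job\.id\.[0-9]*' | sed 's/job\.id\.//' |
  while read -r id; do
    echo "$subjects" | grep -F '📝✅:' | grep -Eq "job\.id\.${id}([^0-9]|\$)" || echo "$id"
  done
}

unfinished_jobs
```

A real checker would also have to handle rewritten or duplicated queue commits after a rebase or merge, which this sketch does not attempt.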
Some common problems could be:
Questions
to preserve the identity of a change after executing Git commands. We could consider our jobs a given change. It's only an automatic change.