Thoughts on Architecture and Performance #33
Replies: 2 comments 3 replies
-
First, congrats on the shot! Sorry you're not feeling well :/ Please don't feel obligated to dig into this while you're feeling shitty. You're very kind for putting this level of thought and work in while not feeling well. So, back to the post: beautifully documented. With regard to the current CPU utilization and disk IO, I've had a few folks run this against their systems starting from the root directory, and it is not efficient whatsoever. It's currently pretty taxing because it doesn't break up the load into separate workers at all.
Love this. That idea also opens up the opportunity to do just a portion of the work and collect the directory tree for further actions at a later date. This was also my original hope for the approach to scanning systems, but when I started Pillager I had a very fragile grasp of using Go's concurrency model for worker fan-in/fan-out. I'm still not great at it, but at least it makes more sense 😂
This is a brilliant idea. If I'm interpreting you correctly, that would take the current thought of a "rule system" and expand it into a system that can pivot and take actions based on the rules found. This could easily get very complex, like you mentioned, so I think it's worth taking a good bit of care here. Starting the design by writing up a few interfaces and defining their methods might be a good way to think through the implementation in a "pseudo-code"-esque way.
The whole architecture makes a lot of sense. To me, the most logical path is to implement each of these pieces in a discrete package, if possible. That will keep the code easy to digest and very flexible for expansion. There is a lot of potential for it to turn into a bit of a rat's nest if all the pieces get too tightly coupled, but I am not too worried about it. I think we should be able to get a good separation of duties between the different stages and then let the cmd/pillager package be the consuming package that chains together the execution of these pieces.
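To make the interface-first idea a bit more concrete, here is a minimal sketch of what a rule that can "pivot" might look like. Everything in it is an assumption for illustration: the `Hound` interface, the `Finding` type, the `Run` function, and the two example hounds are hypothetical names, not Pillager's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical result type; Pillager's real types may differ.
type Finding struct {
	Path string
	Rule string
}

// Hound is a hypothetical interface for one analysis stage.
// Inspect returns findings plus the names of any follow-up hounds to run.
type Hound interface {
	Name() string
	Inspect(path string, content []byte) (findings []Finding, next []string)
}

// extensionHound is a cheap first pass that hands likely key files to "ssh_key".
type extensionHound struct{}

func (extensionHound) Name() string { return "extension" }
func (extensionHound) Inspect(path string, _ []byte) ([]Finding, []string) {
	if strings.HasSuffix(path, ".pem") {
		return nil, []string{"ssh_key"}
	}
	return nil, nil
}

// sshKeyHound looks for a private key marker in the file contents.
type sshKeyHound struct{}

func (sshKeyHound) Name() string { return "ssh_key" }
func (sshKeyHound) Inspect(path string, content []byte) ([]Finding, []string) {
	if strings.Contains(string(content), "PRIVATE KEY-----") {
		return []Finding{{Path: path, Rule: "ssh_key"}}, nil
	}
	return nil, nil
}

// Run starts at the named hound and follows its follow-up requests. The seen
// set ensures each hound runs at most once per file, so chains cannot loop.
func Run(registry map[string]Hound, start, path string, content []byte) []Finding {
	var out []Finding
	seen := map[string]bool{}
	queue := []string{start}
	for len(queue) > 0 {
		name := queue[0]
		queue = queue[1:]
		h, ok := registry[name]
		if !ok || seen[name] {
			continue
		}
		seen[name] = true
		findings, next := h.Inspect(path, content)
		out = append(out, findings...)
		queue = append(queue, next...)
	}
	return out
}

func main() {
	registry := map[string]Hound{}
	for _, h := range []Hound{extensionHound{}, sshKeyHound{}} {
		registry[h.Name()] = h
	}
	content := []byte("-----BEGIN OPENSSH PRIVATE KEY-----\n...")
	fmt.Println(Run(registry, "extension", "id_rsa.pem", content))
}
```

Keeping the chaining logic in `Run` rather than inside each hound is one way to keep the pieces loosely coupled: a hound only names what should run next and never calls another hound directly.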
-
Hi @brittonhayes,
I had some time today to think about Pillager's architecture and control flow. I have mocked up a small diagram to illustrate what I am thinking. In full disclosure, I received my vaccine yesterday and have been running a fever for the past 24 hours. If any of this is a little incoherent, it may be my fevered mind XD
To provide a little explanation of what you are looking at here:
Tier 1: This is the first and most asynchronous section. By using `godirwalk`'s unsorted, non-deterministic tree traversal method, the dispatcher can deploy hundreds of tree-crawling subroutines as it spiders. These routines are responsible for the high-level categorization of files and the first decision about what `Hound Routine` will be dispatched for a particular file. The disk IO here is minimized slightly by the work of `godirwalk`, but this stage will be dominated by non-sequential disk seeks.
Tier 2: This layer exists solely to reduce disk IO. Since the `godirwalk` dispatcher is only seeking to files, not reading them, it can dispatch and traverse the file system much faster than the hound routines can read and analyze the files. By sending results to an intermediate accumulator/semaphore, expensive IO can be limited, preventing deadlocks and other OS issues. Also, by feeding into separate semaphores, the hound routines' memory and CPU usage can be controlled on a much more detailed level.
Tier 3: This is by far the largest and most computationally expensive section. In my current vision, hound routines are segmented modules designed to analyze a file in a particular way. For example, the `SSH_KEY` hound routine would be responsible for parsing and collecting any suspected key files. As such, these modules can be stacked horizontally and easily expanded on as the project grows. Another property of hound routines is that they can operate "acyclically" (see below). Basically, this means that hound routines can trigger other hound routines if they detect certain parameters. While this adds another layer of complexity, it provides a radically more powerful solution in cases where the original dispatcher incorrectly categorizes a file, or where the file is sufficiently complex that a multi-stage analysis would be preferable.
Tier 4: The results accumulator doesn't warrant a huge amount of discussion and does exactly what it says. Since Go is great at this, it will be trivial to implement compared to other languages 🙏
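For reference, a results accumulator along these lines could be sketched as a standard channel fan-in. The `Result` type, the `accumulate` function, and the `AWS_KEY` hound name are hypothetical, and sorting by path is only there to make the output order stable.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Result is a hypothetical record emitted by a hound routine.
type Result struct {
	Hound string
	Path  string
}

// accumulate fans in results from several producers into one slice. Only the
// calling goroutine appends to the slice, so no mutex is needed.
func accumulate(producers ...func(chan<- Result)) []Result {
	results := make(chan Result)
	var wg sync.WaitGroup
	for _, p := range producers {
		wg.Add(1)
		go func(p func(chan<- Result)) {
			defer wg.Done()
			p(results)
		}(p)
	}
	// Close the channel once every producer has finished so the loop below ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	var all []Result
	for r := range results {
		all = append(all, r)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Path < all[j].Path })
	return all
}

func main() {
	ssh := func(out chan<- Result) {
		out <- Result{Hound: "SSH_KEY", Path: "/home/a/.ssh/id_rsa"}
	}
	aws := func(out chan<- Result) {
		out <- Result{Hound: "AWS_KEY", Path: "/home/a/.aws/credentials"}
	}
	for _, r := range accumulate(ssh, aws) {
		fmt.Println(r.Hound, r.Path)
	}
}
```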
Tier 5: Final results presentation and ultimate delivery appear to be something you have already solved in a really satisfactory way. We obviously want to make the results easy to read and easily ingestible by another program. Given all the options you have already, I think we are in a pretty solid position.