Main issue for performance investigations #42
I ran Pose2SLAMG2O on the INTEL dataset from https://lucacarlone.mit.edu/datasets/. Here's a flamegraph: https://storage.googleapis.com/swift-tensorflow-misc-files/flamegraphs/2020-05-07-SwiftFusion-Pose2G2O.svg

Discussion

It's spending a lot of time doing small matrix operations in TensorFlow. The TensorFlow overhead is large relative to the small operations, so most of the time is probably spent on TensorFlow overhead. Therefore, switching from

How to make the flame graph (Linux)

Prerequisites:
Am I right in concluding
Yes, that's correct! And there's an easy optimization to completely eliminate that call.
with
Thanks Marc for the benchmarks :) I think, judging from the flamegraph, the TF engine seems to be doing a lot of work for the
The TF engine has a lot of overhead. I don't know specifically what's happening in the transpose, but in general it's something like this:
I wouldn't be surprised if this transpose is spending less than 1% of the time doing the actual work and the rest of the time in TensorFlow overhead.
I switched things from

I still need to clean things up and document them before it's ready for a PR, but here's a draft commit: 3e36e82

Here's the new flamegraph: https://storage.googleapis.com/swift-tensorflow-misc-files/flamegraphs/2020-05-09-SwiftFusion-Pose2G2O.svg
Well, that certainly has the right directionality :-)
I just did a profile of Pose3SLAM; it seems that most of the time is spent in the initialization of Tensors (
This makes sense, |
Slowness from heap allocations and dictionary lookup

I can confirm that it is important to minimize heap allocations and dictionary lookups in our inner loop. Evidence:
Therefore, we should discuss how to structure the APIs to achieve this.

Dictionary lookup

Dictionary lookups that we have now are:
We could eliminate this by assigning a consecutive Int to each Key that is added to a NonlinearFactorGraph, and then use that Int everywhere instead of the Key. Users can still construct factor graphs using arbitrary Keys and they can still refer to their values using Keys. We just do the mapping to consecutive Ints under the hood.

Heap allocations

Heap allocations that currently exist are (in order of badness):
We could eliminate these heap allocations by implementing some sort of "flattened container" that stores everything as a single array of Doubles under the hood. e.g.

This doesn't have to be exposed through the user-facing APIs at all! These "flattened containers" can have APIs that make them look like totally normal collections of normal structs. Just under the hood, they store things efficiently.

I think these are pretty important to explore soon, because they may require some changes to the APIs, and we don't want to get locked into unperformant APIs.

Thoughts?
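To make both ideas concrete, here's a minimal Swift sketch of what I have in mind (all names here are hypothetical, not the actual SwiftFusion API): intern arbitrary user Keys into consecutive Ints, and back the values with one contiguous [Double] so there are no per-element heap allocations.

```swift
// Hypothetical sketch, not the actual SwiftFusion API.

// Fix 1: intern arbitrary user Keys into consecutive Ints, so inner
// loops index arrays instead of hashing into a Dictionary.
struct KeyInterner<Key: Hashable> {
  private var indexOf: [Key: Int] = [:]
  private(set) var keys: [Key] = []

  /// Returns the dense Int for `key`, assigning the next consecutive
  /// index the first time the key is seen.
  mutating func intern(_ key: Key) -> Int {
    if let index = indexOf[key] { return index }
    let index = keys.count
    indexOf[key] = index
    keys.append(key)
    return index
  }
}

// Fix 2: a "flattened container" whose elements all live in one
// contiguous [Double]; no per-element heap allocation.
struct FlattenedVectors {
  private var storage: [Double] = []
  let dimension: Int  // dimension of each stored vector

  init(dimension: Int) { self.dimension = dimension }

  var count: Int { storage.count / dimension }

  mutating func append(_ vector: [Double]) {
    precondition(vector.count == dimension)
    storage.append(contentsOf: vector)
  }

  // Looks like a normal collection from the outside.
  subscript(i: Int) -> ArraySlice<Double> {
    storage[(i * dimension)..<((i + 1) * dimension)]
  }
}

var interner = KeyInterner<String>()
var values = FlattenedVectors(dimension: 2)
let i0 = interner.intern("x0")  // 0
let i1 = interner.intern("x1")  // 1
values.append([1, 2])           // value for "x0"
values.append([3, 4])           // value for "x1"
assert(Array(values[i1]) == [3, 4])  // no Dictionary lookup in the hot path
assert(interner.intern("x0") == i0)  // indices are stable and consecutive
```

The interning happens once, at graph construction time, so the optimizer's inner loop only ever sees dense Int indices.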
Methinks our data-oriented design discussion might have a large influence on this. Some relevant slides that I have not parsed in any depth :-)
Yeah, this is, I think, very related to data-oriented design. I haven't read any of the related work you mentioned during the meeting yet. I'll go read them now and also those slides. |
Yeah, esp. http://optlang.org/ should be insightful. We are threading two graphs here: autodiff graph and sparse problem graph. |
optlang is very cool. I'm just a few pages into the paper, and I'm already seeing some interesting ideas that could make the SwiftFusion API very different, in a good way. Idea: The only "fixed" thing is the generic concept of a nonlinear least squares problem. There are multiple concrete representations of the problem. Depending on your concrete problem and depending on what algorithms you want to run on the problem, different concrete representations are optimal. For instance, should the factor graph (the vertices and edges) be materialized in memory? If so, which graph representation should we use in memory? Should the data associated with the vertices (e.g. the BetweenFactor distance) be stored inside or outside the graph datastructure? There is no right answer. It depends on the concrete problem. E.g.:
We want an API with the flexibility to choose different data representations, and then run algorithms that are compatible with those representations. e.g. The linearization algorithm does not care how the graph is represented. It can operate generically on any graph datastructure that supports a query for vertex neighbors, like an adjacency list or the "neighboring pixels" function from the optical flow problem. This is very compatible with Swift because Swift encourages "protocol oriented programming"! (I highly recommend these talks: Protocol Oriented Programming, Embracing Algorithms). Protocol oriented programming is quite similar to data oriented programming. You write types representing your data, you write protocols representing the allowed operations on your data, and you write generic algorithms that can operate on any data that supports the required operations.
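A tiny sketch of that pattern in Swift (all names hypothetical): the algorithm asks only for the operations it needs, so it works equally well on an adjacency list, a pixel grid, or anything else that conforms.

```swift
// An algorithm only requires a neighbor query, not a particular
// graph representation. (Hypothetical protocol, not SwiftFusion API.)
protocol NeighborQueryable {
  associatedtype Vertex: Hashable
  func neighbors(of vertex: Vertex) -> [Vertex]
}

// One concrete representation: an explicit adjacency list.
struct AdjacencyList: NeighborQueryable {
  var edges: [Int: [Int]]
  func neighbors(of vertex: Int) -> [Int] { edges[vertex] ?? [] }
}

// A generic algorithm that doesn't care how the graph is stored:
// sums the degrees of the given vertices.
func degreeSum<G: NeighborQueryable>(_ graph: G, vertices: [G.Vertex]) -> Int {
  vertices.reduce(0) { $0 + graph.neighbors(of: $1).count }
}

let graph = AdjacencyList(edges: [0: [1, 2], 1: [0], 2: [0]])
assert(degreeSum(graph, vertices: [0, 1, 2]) == 4)
```

A "neighboring pixels" grid for the optical flow case would conform to the same protocol without ever materializing edges in memory.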
I agree with that (actually that is exactly what Opt and similar "optimization DSLs" are doing: decoupling the abstract representation of data from its in-memory representation :)
These talks are great! (BTW I also love 80x24 terminals, haha)
Yeah, I also think we can implement a container-like thing that offers the same API but uses a raw storage backend (like a memory pool allocator in C++). P.S. I also did a profile on the G2O for INTEL and the result aligned with yours :-\
Excellent, I’m loving this discussion, and I will watch the talks. I think we may be able to undo some of the mistakes that were made in GTSAM, because we did not know better at the time. The one thing we need to absolutely keep is the factor graph abstraction. This is strictly more general than least-squares solving. For example, we can solve a Sudoku by making the factors constraints. Or do inference with Bayes networks, by making the factors probabilities depending on discrete variables. Something GTSAM cannot do is mixing discrete and continuous variables. In principle, a factor graph can have both, and factors could be continuous (think least squares or a robust error metric), discrete (such as a constraint, or a probability), or mixed. Really:

Factor: anything -> bool or float

where float is a least-squares error or a negative log(probability) (least squares is -log Gaussian).

Optimization: feasible (all bool factors are True) and sum of floats is minimized
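That "bool or float" signature can be sketched directly in Swift (hypothetical types, just to illustrate the idea): a factor's error is either a feasibility bool (a constraint) or a nonnegative float (a negative log-probability / least-squares error), and optimization means all constraints satisfied with the sum of floats minimized.

```swift
// Hypothetical sketch of "Factor: anything -> bool or float".
enum FactorError {
  case feasibility(Bool)  // discrete constraint, e.g. Sudoku rules
  case error(Double)      // e.g. -log Gaussian for least squares
}

// The two quantities the optimizer cares about for an assignment of
// values to variables: feasibility and total float error.
struct ObjectiveValue {
  var isFeasible = true
  var totalError = 0.0

  mutating func accumulate(_ e: FactorError) {
    switch e {
    case .feasibility(let ok): isFeasible = isFeasible && ok
    case .error(let v): totalError += v
    }
  }
}

var value = ObjectiveValue()
value.accumulate(.feasibility(true))  // a satisfied constraint
value.accumulate(.error(0.5))         // a continuous factor's error
value.accumulate(.error(1.0))
assert(value.isFeasible && value.totalError == 1.5)
```

A mixed graph is then just a collection of factors whose errors land in either case of the enum.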
Interesting, thanks for these examples, I will think about them. Based on this, I'd argue that the general abstraction we should start with is a "Factor Objective" which is a function from "Variables" to "Factor Errors". Perhaps the objective also has a notion of "total error" and "feasibility". What's not included in the "Factor Objective" abstraction:
So "Factor Objective" is way too abstract to run any useful optimization algorithms on it. (I guess if you have a random generator for variables values, you can try a bunch of random values and see which one gives you the best result.) Therefore, we need slightly more concrete abstractions. (In Swift, making abstractions slightly more concrete is "protocol refinement"). One important refinement would be the Concrete implementations of
Another important refinement is a

Concrete graph types like

The power of abstracting things this way is that you're not constrained to use a particular representation of data when you invoke an algorithm. For example, maybe there's some solver that extracts a dense linear least squares subproblem from a factor graph and then runs some LLS solver on that dense problem (dunno if there is such an algorithm, it's just an example). If the LLS solver requires
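Here's a rough Swift sketch of the refinement idea (all names hypothetical, not a concrete proposal): the base protocol only knows about total error, and a refinement adds differentiability so that gradient-based optimizers can require exactly what they need.

```swift
// Hypothetical base protocol: too abstract for most optimizers,
// but enough for e.g. random search.
protocol FactorObjective {
  associatedtype Variables
  func error(at variables: Variables) -> Double
}

// Protocol refinement: objectives a gradient-based optimizer can use.
// Gauss-Newton-style solvers would be generic over this, while a
// random-search "optimizer" only needs the base FactorObjective.
protocol DifferentiableFactorObjective: FactorObjective {
  /// Directional derivative of the error at `variables`.
  func errorDerivative(at variables: Variables, direction: Variables) -> Double
}

// A trivial concrete objective: squared distance from a target.
struct SquaredDistance: DifferentiableFactorObjective {
  var target: Double
  func error(at x: Double) -> Double { (x - target) * (x - target) }
  func errorDerivative(at x: Double, direction: Double) -> Double {
    2 * (x - target) * direction
  }
}

let objective = SquaredDistance(target: 3)
assert(objective.error(at: 5) == 4)
assert(objective.errorDerivative(at: 5, direction: 1) == 4)
```

An algorithm declares the most abstract protocol it can work with, and any concrete problem type that conforms gets it for free.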
My favorite way of designing something like this is iteratively building up the abstract protocols and concrete structs together. Today we have some concrete structs already! I'm about to make a PR suggesting a slight abstraction over them. |
I like FactorObjective as a protocol, if not its name :-) I think "Factor" would be nice! Similarly, why not just make the other protocol "FactorGraph". One of the requirements would be that it has an "error", which we could make float, reserving infinity for infeasible configurations. This has the danger of being religious/subtle, but I don't love "Objective", as it subtly shifts the emphasis away from the factor graph concept, in a way that I think hurts the cause. I understand how after seeing the OPTLang talk everything will seem cool and least-squarey :-) I think LinearLeastSquaresObjective moves us away even more, now having zero graph concept. But, truly, to me the graph is everything. The stencils in optlang are all about a concise way of specifying a graph. The actual objective is almost irrelevant to the structure of the computation. Two Frankisms (aka Crusty grumblings) that are relevant:
Objective pushes us to matrix-land by association. |
I'll add my two cents here: I agree with Frank that one of the reasons why Opt is fast is that it gets rid of the sparse matrix representation of LS problems (though not completely, as the notion of Jacobian is still there) and replaces it with implicit representations; the Jacobian can be computed on GPU ad hoc, saving a lot of memory bandwidth and ops. I consider the factor graph scheme to be something that effectively encloses the concept of an objective function (in a LS sense), but with much more potential. Currently we are using an immediate-mode evaluation of factors (on linearization), and the graph structure can only be exploited at the sparse Jacobian level. However, with a minimal amount of hinting, we can effectively extract the common shared structure of sub-graphs, and build optimizations on top of that. One of the major roadblocks for "matrix-free optimization" in the past has been that it is very hard to represent free-form Jacobians in a concise manner without too much boilerplate. S4TF does exactly that, so I think we are pretty lucky here :)
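The matrix-free idea can be sketched in a few lines of Swift (hypothetical types, just to make the idea concrete): instead of materializing a sparse Jacobian, a linearized factor exposes only the maps v -> J*v and w -> Jᵀ*w, which is all an iterative solver like CG needs.

```swift
// Hypothetical implicit Jacobian: no matrix entries are ever stored,
// only closures that apply J and its adjoint Jᵀ.
struct MatrixFreeJacobian {
  let rows: Int, cols: Int
  let forward: ([Double]) -> [Double]   // v -> J * v
  let adjoint: ([Double]) -> [Double]   // w -> Jᵀ * w
}

// Example: the Jacobian of f(x) = [x0 + x1, 2*x1], i.e.
// J = [[1, 1], [0, 2]], represented implicitly.
let jacobian = MatrixFreeJacobian(
  rows: 2, cols: 2,
  forward: { v in [v[0] + v[1], 2 * v[1]] },
  adjoint: { w in [w[0], w[0] + 2 * w[1]] }
)

assert(jacobian.forward([1, 1]) == [2, 2])  // J * [1, 1]
assert(jacobian.adjoint([1, 0]) == [1, 1])  // Jᵀ * [1, 0]
```

With autodiff, the forward and adjoint closures come for free from the factor's error function, which is exactly the boilerplate reduction S4TF gives us.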
Also, an interesting read would be https://stanford.edu/~boyd/papers/pdf/abs_ops.pdf Boyd is one of the biggest names in optimization :-\ |
This conversation has become very interesting. Would you like to have a call some time tomorrow so that we can discuss a bit more efficiently? I'm free any time after 2pm Eastern time. I'll also keep responding here on GitHub because I can't wait to keep talking :)
I agree that "objective" has some unnecessary connotations; let's remove it. I agree that
Yes and no. There are two things:
I'll boldly claim that by decoupling

For example, suppose I want to run CGLS on a matrix-free graph. I can't do that today with
This is very data-oriented. It's also very graph oriented -- the user specifies exactly what graph datastructure best represents their problem. Ohhhhhh, I think after I wrote this example I finally realize what you mean about
Yes, and I think that Overall, I'm trying to argue for the view that we should provide a library of generic functions (optimizers, graph algorithms) and a library of graph datastructures. Then, the user with a concrete problem defines their own type representing the data in their own problem, using the most appropriate datastructures from our library, and finally invokes some of our generic algorithms on their datastructure.
This is cool. I've started reading it. I notice that the "Forward-Adjoint Oracle" is very similar to
Awesome work! My bet would be on LM being different, as we discussed. |
@dellaert Sure thing, I'll try benching the inner loop. BTW, yes, we are using only one thread in both configurations. |
Cool graphs! Do you have instructions for generating them so that I can quickly see how perf optimizations show up on these graphs? |
Also I'm curious: What linear solver are the GTSAM runs using? |
@marcrasi It's the sequential Cholesky solver (a direct solver). I pushed all the results and the scripts to generate these graphs to https://github.com/ProfFan/expmap-benchmark; however, you will also need the prebuilt wheels of GTSAM in different configurations from https://github.com/borglab/gtsam-manylinux-build/branches (the CI artifacts)
Ah, that also explains some things! Can you do PCG with Jacobi also? How do they compare? Again it will be important to choose the exact same convergence criteria on both sides of the "aisle", to make sure we're not timing the convergence criterion but rather the solve. |
For perfectly comparable results, you'll need to do PCG with DummyPreconditioner because I haven't gotten to implementing the Jacobi preconditioning in Swift yet. (I intend to do that after simplifying the matrix boilerplate, because the preconditioning involves some linear algebra that will be much easier when there is an easy way to manipulate matrices.)
I wanted to but haven't had time; I'll probably do it next week
Need some new figs:
I'll post information about performance investigations here.