Data Repository #343
Comments
Ok, my thoughts:

**Performance Bottlenecks and Database I/O**

In systems that process large volumes of data, such as bets in a prediction market, the primary bottleneck is often database I/O rather than CPU usage. For example, if we return an entire model (say, a million bets) just to calculate a simple statistic in Golang, we put a significant and unnecessary load on I/O. Instead, having the database perform a GROUP BY and return only the aggregated result (e.g., a count) dramatically reduces the data being transferred. The database, which is written in C and highly optimized for such operations, handles the counting or grouping, reducing overhead on both I/O and CPU in the Golang application. By opinionatedly designing our system to use the database for what it's good at, we minimize inefficient Golang operations such as iterating over large datasets in for loops, enhancing overall performance.

The overall objective of the project is to allow individuals or small organizations to run prediction markets, perhaps ideally with up to 1000 users on an around $100/year or so hosting budget, or to get as close to that as we can. @j4qfrost brought up the notion of injecting C math libraries as a way to deal with bottlenecks. What I'm proposing is more of a "data engineering" approach: write certain types of queries for particular types of calculations in a way that we know will reduce the load on the database. Whereas injecting C math libraries would improve CPU performance in the application, it wouldn't address the I/O concerns. There is an I/O creep that will happen if we don't control for it at some point.

**ORM and Raw SQL Queries**

While GORM provides a convenient abstraction, GROUP BY seems to be not well supported by the ORM's methods: the grouping method only allows a single string input, when we actually need to group by two fields to get a count of the total number of first-time bets across all markets. In cases like this, executing raw SQL is the only way to perform the operation. Exposing ourselves to raw SQL is undesirable, but we could make exceptions where GORM lacks library functionality, while also having a policy of never allowing Raw() writes and minimizing the number of raw queries: "use the library except in these circumstances."

**Why Not Couple Queries With Models**

My thought is that we need a way of conceptually keeping our queries, and how we perform them, separate from the models themselves, both to highlight that queries can be costly and to restrict the number of queries we put in place. Having a Raw() method may seem to go against this, but as mentioned in the section above, it is in there for exceptions, not as a rule. The intention is not to fully restrict according to a library, but rather to restrict to a set of functions that we pay attention to and maintain, to get the app going in a performant direction from the start.

**Maintainability**

At this point I don't think there are a huge number of ways we interact with the database. If anything, having a repo model will control the number of new types of queries that we write and ways we solve problems. While it may slow down certain buildouts and features, we could hypothetically allow regular GORM usage during the buildout of a new feature, while requiring that a repo model be put in place after the new feature is built and completed.
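To make the repository idea concrete, here is one possible sketch in Go. The names (`BetsRepository`, `FirstBetCount`, the `bets` schema and its columns) are illustrative assumptions, not the project's actual code. The GORM-backed implementation with its sanctioned Raw() read is shown only as a comment, and an in-memory fake stands in for it so the sketch runs without a database; the fake mimics what the SQL aggregate would return.

```go
package main

import (
	"fmt"
	"sort"
)

// FirstBetCount is the aggregated row the repository returns instead of
// a full slice of Bet models, keeping the I/O to a handful of rows.
// (Type and field names are hypothetical.)
type FirstBetCount struct {
	MarketID uint
	Count    int64
}

// BetsRepository is the narrow, maintained set of query functions the
// comment argues for. Callers never touch GORM or raw SQL directly.
type BetsRepository interface {
	// FirstTimeBetCounts groups in the database and returns only the
	// aggregate, not the underlying bets.
	FirstTimeBetCounts() ([]FirstBetCount, error)
}

// A GORM-backed implementation would wrap a *gorm.DB. Because the
// two-column grouping is awkward through the ORM's methods, it would be
// one of the sanctioned Raw() *read* exceptions, roughly:
//
//	db.Raw(`SELECT market_id, COUNT(*) AS count
//	        FROM (SELECT DISTINCT market_id, user_id FROM bets) first_bets
//	        GROUP BY market_id`).Scan(&rows)

// fakeBetsRepository is an in-memory stand-in so this sketch is
// self-contained; it computes the same result the SQL above would.
type fakeBetsRepository struct {
	bets []struct{ MarketID, UserID uint }
}

func (r *fakeBetsRepository) FirstTimeBetCounts() ([]FirstBetCount, error) {
	seen := map[[2]uint]bool{}   // (market, user) pairs already counted
	counts := map[uint]int64{}   // first-time bets per market
	for _, b := range r.bets {
		key := [2]uint{b.MarketID, b.UserID}
		if !seen[key] {
			seen[key] = true
			counts[b.MarketID]++
		}
	}
	out := make([]FirstBetCount, 0, len(counts))
	for m, c := range counts {
		out = append(out, FirstBetCount{MarketID: m, Count: c})
	}
	// Sort for deterministic output; SQL would use ORDER BY.
	sort.Slice(out, func(i, j int) bool { return out[i].MarketID < out[j].MarketID })
	return out, nil
}

func main() {
	repo := &fakeBetsRepository{bets: []struct{ MarketID, UserID uint }{
		{1, 10}, {1, 10}, {1, 11}, {2, 10}, // user 10 bets twice on market 1
	}}
	rows, _ := repo.FirstTimeBetCounts()
	for _, r := range rows {
		fmt.Printf("market %d: %d first-time bets\n", r.MarketID, r.Count)
	}
	// prints:
	// market 1: 2 first-time bets
	// market 2: 1 first-time bets
}
```

Because callers depend only on the `BetsRepository` interface, swapping the fake for a GORM implementation (or auditing which methods hide a Raw() call) stays localized to the repo layer, which is the maintainability argument above.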
The above is attempting to address only the question of "should we do this" ... related to my current MR / statsreporting. Everything else, in terms of reducing the code size and all of the other comments you both made, I intend to fix per your suggestions. I'm just trying to offer an explanation as to why a data repo, vs. coupling functions with the models.
I set up this MR to help discuss the topic above.
Citations:
From: #299 (review)