-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for User Stats Pipeline to Handle Both Scenarios (New and No New Data) #1006
base: master
Are you sure you want to change the base?
Fix for User Stats Pipeline to Handle Both Scenarios (New and No New Data) #1006
Conversation
@TeachMeTW just to clarify, this PR does not incorporate the fix to the two use cases - that was fixed earlier.
|
7e613fb
to
19118ca
Compare
@shankari I thought incorporating the fix meant fixing the unit test, hence the two new test cases. I have split the two unrelated changes into two commits and cleaned up the commit history as requested. |
This is correct, so I am happy to merge it. |
Well, the fix is already done, but I guess you can argue that if it is not tested, it is not done 😄 |
Aha! After I pushed c77c5c5 related to using the input key, the tests started failing with this error
@TeachMeTW if a test fails, you cannot arbitrarily modify the tests so that it passes. That breaks the entire point of having automated tests, which is to catch regressions. We need to investigate and indentify why we are getting different results for confirmed and composite trips. @TeachMeTW do you have an ETA for the investigation to be complete?? |
@shankari I am looking into it at the moment |
@shankari That is because we WERE using
During your refactor, you modified the statement to use
|
@shankari By changing the internal |
@shankari Based on my investigation: Why I believe 8 is the Correct Value:
Proposed Solution:
Next Steps:I will update the test to align with the |
@TeachMeTW the data model is such that we should have a 1:1 mapping between confirmed and composite trips.
This is not expected and needs more explanation
I am not sure what this means. **This mismatch is not expected. **. You can look at the code that creates composite objects - it reads the confirmed trips and creates composite trips from them. For the record, I don't need the summary here: #1006 (comment) the earlier comments are sufficient |
I will look into it. |
@shankari So as we
Does that mean confirmed trips is the ground truth in terms of length/amount? |
I added debugging and logging with my changes and got to this conclusion:
|
Changes:
to
Removed esda.CONFIRMED_UNTRACKED_KEY to ensure that only Confirmed Trips are processed for creating Composite Trips.
Introduced a new count to track the number of Confirmed Trips per user.
Added an assertion to verify that the current composite trip count is zero before updating, preventing duplication and ensuring alignment. Questions@shankari I just had a question in regards to the 1:1 mapping? If we are adding to it, how is it keeping a 1:1 ratio? I am not sure if ommiting the untracked_key is the good way to go about it. Let me know your thoughts before I push this commit. Currently the test works with 5 as the expected trip. The debug statement was above. |
@TeachMeTW first, identify the problem, then think about potential fixes, and then make the changes. Aha! This is the problem. Good job identifying it.
One potential fix is indeed to change the Do you have thoughts for an alternate fix? I can think of at least one other option. Please think through all the options, and list their pros and cons first. Please do not make changes without thinking through several pros and cons |
One option is also just to take the larger number, but change the label to "trips and untracked time", but I am concerned that will confuse the admins. |
@shankari I will take a look at these options and explore the pros and cons as suggested; will see if I can uncover a pssobile fix |
Here are some options, let me know your thoughts. This is based on how I understood the issue at hand: Add a metadata tag
Pros:
Cons:
Maintain Separate Counts and Labels
Pros:
Cons:
Change Labeling to "Trips and Untracked Time"
Pros:
Cons:
Thoughts:
|
|
@shankari I believe from reading the code, we already have a dict for I would think just to do add to it by doing `composite_trip_entry['metadata']['origin_type'] = tracked or untracked. But the thing is we already have confirmed_untracked so is there an unconfirmed untracked or unconfirmed tracked? I am just a bit confused on how composite_trips and the other keys are determined. I will look into it. Now with that, we can modify the db.count_document() to filter for metadata.origin_type ie: metadata.key: esda.COMPOSITE_TRIP_KEY So:
I believe my iteration above would suffice by adding to the composite trip entry so that line
We could use the count_documents and filter with the new metadata.origin so itd be like:
Would you say this explanation and train of thought suffices -- not to be always asking for reassurances that is. |
@shankari I also wanted to clarify by what you mean by "But the app also uses composite trips (which is why we want to keep it warm), so changing this data structure will break untracked time." How are we keeping it warm and how will changing the structure break things? |
Well thinking about it more -- it begs the question what things are effected with this change. Furthermore, why wasn't this 1:1 issue caught before? Does this mean it was broken already? If so why didn't the tests catch it? |
Since I saw that we have origin_keys already, I was thinking modifying filtering to:
We could filter by confirmed this way? But this seems too good to be true. How do you suggest me testing this @shankari -- what depends on the composite objects? I know as you said the phone does and I don't want to affect that and whatnot |
We set |
@shankari Rereading the code, I don't seem to understand by what this means:
if composite is created from BOTH confirmed trips AND cleaned untracked trips hence the From my understanding, if both confirmed trips and untracked trips are being combined into composites, shouldn’t the number of composite trips exceed the number of confirmed trips? Or am I missing something in how the relationship is established? @JGreenlee the git blame refers to 9d41f1d where include untracked time in composite trip creation. |
@shankari |
Since
I can explain in more detail if you want to get on a call later |
I can also provide context for
|
@TeachMeTW here's a concrete example working more independently. If you look at this PR, you have asked me 4 questions in ~ 12 hours! Neither Jack nor I can respond at that rate and still get our own work done. The goal is for you to slow down, read the code, explore the data and figure it out yourself. It is more efficient for you to ask us when you have a question, but it is less efficient for us. You have been added to the internal issue where we are discussing scalability changes. Learning how to read existing code is a critical skill. |
… instead of hardcoding analysis/confirmed_trip" This reverts commit c77c5c5. Reverting because this broke the tests. #1006 (comment) This is because the composite trips include both actual trips and untracked time. #1006 (comment) We don't want to store untracked time in the user stats or show it to users, so we can't push this to a real production system, where people check the admin dashboard, just yet. The fix is simple, but will take @TeachMeTW some time to figure out. So let's revert the change for now.
19118ca
to
a0e51c0
Compare
@shankari After clearing my thoughts, rereading and building off from my theories and approaches from yesterday, it seems that Option 1 -- the metadata was ALREADY in the codebase so it was a simple fix once I clarified with @JGreenlee with what composite trips are and why and how they are used in conjunction with the app. My notes: Untracked:
Keeping composite warm:
Filter by origin_key Total trips = confirmed trips Thus, the solution should just be filtering by origin which I suspected yesterday, and I did so by adding an extra query: |
Now the tests passes and we keep a 1:1 ratio WITHOUT changing the existing structure. Furthermore, I edited the existing commits to ensure no commit muddle/churn |
Yes. this was my point with
Again, the goal here is not to write a ton of code. The goal is to understand the existing system, and to make changes that are consistent with it.
I'm glad you were able to get clarity from Jack on this, and I appreciate Jack taking the time to help you, but I don't want you to ask Jack 5 questions in 12 hours either - that won't let him do any work. |
4c359d0
to
1df808d
Compare
with open("emission/tests/data/real_examples/shankari_2015-aug-27") as fp: | ||
additional_entries = json.load(fp, object_hook=esj.wrapped_object_hook) | ||
|
||
# Store in the DB | ||
for entry in additional_entries: | ||
# Adjust the entry's user_id to match the existing user's UUID if needed | ||
entry['user_id'] = self.testUUID | ||
edb.get_timeseries_db().insert_one(entry) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not just setupRealExampleWithEntries
?
please look at existing code and use it instead of starting over and re-implementing
You are working with an existing codebase
You need to learn how to read, modify and apply existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shankari You're absolutely right, and I appreciate the feedback. I knew about setupRealExampleWithEntries
and setupRealExample
, but I autopiloted into a naive solution instead of stopping and thinking if this was already in the codebase using it. That was a mistake on my part, and I’ll refactor my code to leverage the existing function and slow down on jumping the gun and trying to implement things on the fly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TeachMeTW note also that, with the review cycle, jumping the gun actually results in slower completion and more work for everybody. If you had spent the additional 10 minutes to look through the existing codebase, I could have merged this during my review.
Since you didn't, I had to put in review comments, you had to make more changes, and I have to now look at it again today when I have time to review. So instead of the code being "done" yesterday, it will be done tonight, with an additional time cost for me.
"Measure twice, cut once" https://en.wiktionary.org/wiki/measure_twice_and_cut_once
"Read codebase and check first, submit once"
…: (i) when there is new data and (ii) when there is no new data. Changed the test to track ts on both cases and simplified.
1df808d
to
9255c5d
Compare
Summary
This PR addresses a fix to ensure that the user stats pipeline behaves as expected in two specific scenarios:
While implementing the fix, I identified the need for unit tests to ensure these behaviors are consistently verified.
Changes Made
Unit Tests
The logic for these tests can be based on the example methods provided:
testGetAndStoreUserStatsSecondRunNoNewData
: Verifies no changes to stats when re-run without new data.testGetAndStoreUserStatsNewData
: Verifies correct updates when new data is added.See #1005