Discussion: make MDS queries more predictable #268
Comments
@johnpena +1 to this idea!
Writing up some of the proposals from the webinar:
@lionel-panhaleux @hunterowens @ambarpansari please let me know if I missed anything important. I'll work on coming up with a proposal that takes all of these ideas into consideration. Thanks all for the brainstorm!
Unless this is a burning issue for anyone, I'd suggest we move this out to 0.4.0.
Our customers rely on timely data to understand where scooters are in their city "right now". For example, there are a few use cases around where scooters are parked. Having query parameters be on hour boundaries would mean we would have to wait too long to get some events for these use cases to be meaningful. We would want to retain the ability to query for recent time periods and would want these to be low latency, but could tolerate higher latencies for older data. We do not need millisecond resolution as currently specified.
What I want to suggest is that for queries that deal with data older than the current day (or maybe week, depending on use cases), the provider API might round the timestamps for the queries to the nearest hour. This would be the default behavior for your typical provider. We could then introduce an additional parameter for the client to specify that they want to turn off this behavior and have their timestamps used as provided. This would preserve the integrity of 'live' data (the use case @billdirks is talking about), and allow providers to opt in to caching older data. It would also be (for the most part) non-breaking or minimally breaking.
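A minimal sketch of that rounding behavior, assuming a one-day live window and a hypothetical `precise` opt-out parameter (neither the name nor the cutoff is part of the spec):

```python
from datetime import datetime, timedelta, timezone

LIVE_WINDOW = timedelta(days=1)  # assumed cutoff; could be a week instead

def normalize_query_time(ts: datetime, precise: bool = False) -> datetime:
    """Round timestamps older than the live window down to the hour.

    `ts` is assumed to be UTC-aware; `precise` is the hypothetical
    opt-out flag described above.
    """
    now = datetime.now(timezone.utc)
    if precise or now - ts < LIVE_WINDOW:
        return ts  # recent data: honor the timestamp as provided
    # older data: snap to the hour so responses can be precomputed and cached
    return ts.replace(minute=0, second=0, microsecond=0)
```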
Depending on the size of the datasets involved here, the simplest thing to do may be an alternate query parameter that specifies the desired day of data to retrieve. One thing I've seen used effectively for problems like this is the ordinal date, an unambiguous way to refer to a specific day via a single integer. An additional thing worth thinking about here is timezones: a city's data feed is hopefully in a specific timezone, and if that city queries for a specific day of data, the data should be cut on timezone boundaries. Timezone math is notoriously complicated to reason about, so an API that avoids having to do it for every request may be in the spec's interest. Ordinal dates (or any other non-time-specific day format) would let timezone conversion be handled entirely by provider implementations, which should make their behavior more consistent.
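For illustration, a sketch of how a provider could resolve an ordinal date server-side, using Python's `toordinal` (the UTC boundaries are just an example; a city feed would use its own timezone):

```python
from datetime import date, datetime, time, timezone

# A single integer unambiguously names a day; the request needs no timezone math.
day = date(2019, 4, 4).toordinal()  # 737153

# The provider resolves the integer to concrete boundaries exactly once,
# in whatever timezone its city feed is defined in (UTC here for brevity).
start = datetime.combine(date.fromordinal(day), time.min, tzinfo=timezone.utc)
end = datetime.combine(date.fromordinal(day + 1), time.min, tzinfo=timezone.utc)
```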
How does this square with the new 24 hour telemetry delay? Real time info does not seem predictable or reliable with the current design, since it is impossible to know if you are seeing a complete picture until the 24 hour threshold is hit. |
@dyakovlev the problem with introducing an alternate query parameter is that clients would need to switch to using this parameter, without an added benefit to them. I'm skeptical that old clients would make the switch. And ultimately this doesn't leave providers with a way to optimize existing queries. |
@morganherlocker Are you referring to the telemetry delay in the agency specification? My use case revolves around querying the provider API endpoints (i.e. pulling the data) vs. waiting for a push from the provider to an agency endpoint.
Moving this to 0.4.0.
Correct, my understanding is that the delay in the agency specification is to account for inevitable latency in data processing, which means the data presumably would not exist in the provider store either. Without this allowance, the agency delay does not seem to have a purpose besides addressing the ~200-2000ms it takes to send the data to the agency. Given the wording in agency, I have been assuming anything on the historical endpoints (trips and status_changes) will be incomplete unless further than 24h in the past. |
Now that we're really kicking off into 0.4.0 I'd love to see us make some progress on this topic. We're definitely experiencing the pain of trying to deliver large amounts of data, consistently, across different agency requirements/capabilities, so would love to see pagination for historical (and real time) data codified in a way that best gives agencies the combination of real-time and historical access they need, and allows providers to scale across many cities.
What if we do something like
My understanding of the issues thus far:
My $0.02 is that provider at its core is a bulk historical data fetching API with agency fulfilling the need for real-time use cases. If there's consensus around that notion, I wonder if a breaking change for 0.4.0 is warranted to align the provider API closer to its core use case since as it stands today:
As a thought experiment, I find it helpful to imagine provider being built on top of a static file serving architecture using UTC ordinal dates.

Circling back to the existing API, a possible breaking change proposal for 0.4.0 would be to drop support for arbitrary time ranges. Until 0.3.x is fully deprecated, existing clients can be upgraded to the new "fast path" without any code change by altering their requests to align to single UTC day boundaries.

I understand this is a big change... but I feel we're at a point where we can leverage the collective experience to tweak the API in a way that makes it easier to implement and maintain for the long haul. Could provider consumers weigh in with specific use cases/concerns? Could provider implementors comment on what would work given the traffic they're seeing today?
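To make the thought experiment concrete, here is a rough sketch of the fast/slow routing such a provider might do; the file paths, endpoint names, and alignment check are all illustrative assumptions, not a proposal for the spec:

```python
from datetime import datetime, timezone

def route_trips_query(start: datetime, end: datetime) -> str:
    """Day-aligned UTC queries hit pre-rendered static files;
    anything else falls back to a live (slow-path) query."""
    day_aligned = (
        start.tzinfo == timezone.utc
        and start.hour == start.minute == start.second == 0
        and (end - start).total_seconds() == 24 * 60 * 60
    )
    if day_aligned:
        # fast path: a file rendered once, shortly after the UTC day closes
        return f"/static/trips/{start.date().toordinal()}.json"
    # slow path: arbitrary ranges, supported until 0.3.x is fully deprecated
    return "/query/trips"
```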
As a consumer, my experience is very different from @ascherkus:
It seems like the provider spec has been adopted for this purpose by more than just us. Also, I feel that there are some issues that may prevent the agency spec from becoming as widely adopted as the provider spec:
I believe there are two separate issues here.
Separate historical and live status_changes endpoints would solve this, as long as it's clearly specified how much processing time is allowed for each.
Thanks for sharing your feedback @billdirks! It seems like we have a classic tradeoff between latency and aggregation/scalability of a system :) I'm assuming this is a similar use case to the one you mentioned on April 4 -- could you elaborate on what minimum latency (e.g., 5 minutes, 15 minutes, 30 minutes, an hour, a day) is tolerable for this use case?

My goal is to tease out the different use cases for different consumers to see whether it's possible to determine a pre-defined level of aggregation/latency that works both for consumers and for query predictability. Failing that, as @hyperknot mentions, it does seem like there are two major separate use cases (historical vs. live) that might be better served with separate solutions, rather than attempting to shoehorn both into a single solution. The previous meeting notes [1,2] suggest there may need to be a distinction between historical and live as well. At a minimum, something like @johnpena's suggested parameter may be needed.

[1] https://github.com/CityOfLosAngeles/mobility-data-specification/wiki/Web-conference-notes,-2019.03.28
Of course less latency is better for my use case, but I can deal with minutes. 5 minutes seems long though, and 10 is definitely too long. @johnpena's suggestion of an extra parameter would work for me. In terms of latency of the endpoint itself, I care more about response speed for recent time points than for historical data. There are some other tickets about timeliness of data and a live endpoint; I'd invite us to have discussion on those points there.
Our clients are utilizing MDS data for both historical and near real-time purposes. The Provider API is simpler to get started with in terms of infrastructure and engineering effort, so we're utilizing it for both at the moment. For near real-time, I agree with @billdirks (5 minutes seems long, 10 is definitely too long). The main thing I'm looking for is clear expectations about the timeliness of the data available in the Provider API, as discussed by @hyperknot, so I can decide when and how often to ping it, ensure I'm not missing anything, and clearly communicate the limitations to our clients.

Not to detract from the main goal of this ticket, but I also wanted to tag additional tickets I've seen with related conversation on these tradeoffs: #307 (when are status_changes added), #341 (when are status_changes removed), #282 (privacy and the possibility of forcing a minimum 24 hour delay on telemetry data).
Can we make the Provider API a little stricter without breaking the real-time tooling case? We could allow the Provider API to specify a "supported" time window. Let's say 1 hour. Then we require all requests to be on UTC hour boundaries, so for Aug 1 12pm - 1pm UTC there would be exactly one valid request.

If it's the current hour, providers would serve the realtime data. If the data is old, it could be cached. Think of it as a logfile where you stream to the active one and then "rotate" logs on a regular basis.
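A minimal sketch of that rule, assuming the 1-hour window from above; `query_live_store` and `serve_cached_hours` are hypothetical helpers standing in for the "active" and "rotated" logfiles:

```python
from datetime import datetime, timezone

FLOOR = dict(minute=0, second=0, microsecond=0)

def query_live_store(start, end): ...    # hypothetical realtime lookup
def serve_cached_hours(start, end): ...  # hypothetical flat-file/cache lookup

def serve_status_changes(start: datetime, end: datetime):
    # reject anything not on a UTC hour boundary
    if start != start.replace(**FLOOR) or end != end.replace(**FLOOR):
        raise ValueError("start_time and end_time must fall on UTC hour boundaries")
    current_hour = datetime.now(timezone.utc).replace(**FLOOR)
    if end > current_hour:
        return query_live_store(start, end)  # the "active" logfile: realtime data
    return serve_cached_hours(start, end)    # "rotated" logs: safe to cache
```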
Can you keep everything the same as it is now (e.g., still support arbitrary start/end times), but add a new pair of API parameters for bulk access? You would then return the pre-cached data for the hour/day/week that the requested time falls within.

The provider would generate flat files for every hour into the past, every day into the past, every week, and every month for each city. That file would be served when the API call is made, instead of hitting the database. One provider kinda does this now in their online dashboard to export data in bulk - you pick a day and get a month's worth of data as a file that the day falls within.

Note: I would recommend that this new pair of parameters not work in conjunction with other query parameters that filter. The point is to get all bulk historic data for a chunk of time.
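A sketch of how such bulk parameters could map straight onto pre-generated flat files; `bulk_period` and `bulk_value` are invented stand-ins for the truncated parameter pair above:

```python
# Bulk requests deliberately ignore all other filtering parameters:
# they always return the whole chunk of time as one pre-generated file.
BULK_PATHS = {
    "hour":  "bulk/{city}/hourly/{value}.json",
    "day":   "bulk/{city}/daily/{value}.json",
    "week":  "bulk/{city}/weekly/{value}.json",
    "month": "bulk/{city}/monthly/{value}.json",
}

def bulk_file(city: str, bulk_period: str, bulk_value: str) -> str:
    """Resolve a bulk request to a flat file instead of hitting the database."""
    return BULK_PATHS[bulk_period].format(city=city, value=bulk_value)
```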
Some additional comments from the discussion on the MDS call to add to this:

In terms of understanding consumer use cases, I'm wondering how many consumers use the Provider API as essentially a data interchange format, ingesting it on a regular basis into some other pipeline, versus an on-demand API that gets called in varied situations in response to actually interacting with the user-facing application?

With the log rotation model, there remains the open question of when the data can be considered "complete" (i.e. there's an expectation that no additional elements will be added or removed).
For us it's mostly a data interchange format. We're 100% happy to be asked to limit our queries to hourly in the archived time period in order to get max performance of data exchange.

-Fletcher (Populus)
Exactly the use case here in Santa Monica. Also happy to limit our queries to pre-defined windows.
Same here, for data interchange format. I think
Will draft a pull request next week to continue the conversation around support for fixed time intervals. By scoping the Provider API to data interchange, we should be able to build a more reliable solution that continues to work for both historical and real-time use cases. Some things to consider in advance:
I think we should keep it as simple as possible; the specs are already overcomplicated, and the majority of providers cannot even implement the current specs. Here are my recommendations.
The only problem I cannot see being solved is the registration of lost vehicles. If it takes 3 days to register such a lost vehicle, I don't know how we should handle it. Should historical data be updated again, after 7 days for example, to include the lost vehicles? If so, we might need to add a "lost_vehicles_processed" true/false boolean as well.

Also, as a side effect, we could get rid of pagination on these endpoints, something which is again very chaotically implemented across providers. We could simply download preprocessed hourly files, making life easy both for providers and for clients.

Zsolt (Populus)
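One possible shape for such a preprocessed hourly file, written as a Python literal; only the `lost_vehicles_processed` flag comes from the suggestion above, the other fields are illustrative:

```python
hourly_file = {
    "start_time": "2019-08-01T12:00:00Z",
    "end_time": "2019-08-01T13:00:00Z",
    # False on first publication; the file would be re-rendered with True
    # once the (assumed) 7-day lost-vehicle window has passed
    "lost_vehicles_processed": False,
    "status_changes": [],  # the hour's events, rendered once, no pagination
}
```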
Proposal here! #354 |
I've been maintaining the MDS provider API at Lime since earlier this year. We've seen the latency across our API endpoints creep up on us as more agencies have adopted MDS and as more trips have been taken and added to our trips and status changes datasets.
We'd like to decrease latency as much as possible, but we've had some issues doing this because the datasets returned by the API are difficult to precompute and cache. In particular, being able to query across arbitrary start and end times, with down-to-the-second query values, means that we can't reliably cache an entire trip or status change dataset ahead of time. Users can make a query for a minute's or a month's worth of data, and we have to generate the results on the fly.
I'd like to brainstorm ways we can make MDS queries more predictable. In particular, if we could present a way for a user to ask for a specific day or hour of data, it would allow us to resolve the query ahead of time and return the result to them much faster.
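For illustration, a sketch of why fixed buckets would help here: an hour-aligned query maps to one stable, precomputable cache key, while a down-to-the-second range does not (the key format is an assumption):

```python
from datetime import datetime
from typing import Optional

def cache_key(start: datetime, end: datetime) -> Optional[str]:
    """Return a precomputable key for hour-aligned queries, else None."""
    hour_aligned = start.minute == start.second == start.microsecond == 0
    if hour_aligned and (end - start).total_seconds() == 3600:
        return f"status_changes:{start:%Y-%m-%dT%H}"  # renderable ahead of time
    return None  # arbitrary ranges must be generated on the fly
```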