RFC 81: Approach for next iteration of Content API #81
This looks like really interesting work and this RFC is incredibly thorough 👍. Two things immediately spring to mind:
These are a bit rambly, I apologise. I fully expect I'm not in possession of all the background context, so please do point me elsewhere if some background reading would clue me in. |
This is a great write up, and it's really great that this work is happening 🕺 I'll probably have a read through this in more detail and comment again later on, but for now I just wanted to echo @h-lame's reservations about the routes and unpublishing. On both points I don't have a strong opinion about what we should do, I just think we should be very careful as these are big decisions affecting the usability of the API, and will be very difficult to change once the thing is live.
The endpoints
One way to mitigate some of the issues around URL structures might be to encourage the use of HATEOAS (links) instead of manually constructing URLs. For example: if the main entry point to the API is
The benefit of doing this is that users don't have to fully understand the document/edition publishing model to start doing something with the API, and they don't need to worry about things like locales if they don't need to. If you haven't already, I'd recommend building (or asking someone else to build) some kind of prototype that makes use of the different versions at some point before putting it live, to learn more about what it's like to use the API for historical content. I think we should learn as much as we can about this before it goes live, because it'll be hard to change the endpoints afterwards, and this is one of the motivating use cases that isn't already existing functionality on GOV.UK.
The data model
I think the RFC could be clearer about what the core content-item-y-thing is, if that makes any sense... I'm kind of assuming that what users care about is a
On the other hand, some of the proposed endpoints hint at the publishing model, where the "thing" is identified by its
Do we need the API to be both
Unpublishing
Unpublishing worries me in general because it's already very complicated. On the search team we've had a difficult time integrating rummager with the publishing api because we didn't have an accurate mental model of all the different scenarios, and there are inconsistencies (for example: withdrawn documents are essentially published, and retain their document type, whereas all other unpublishings become something else). If we're changing how we deal with unpublished content on the frontend, should we be changing anything on the backend to make unpublishing simpler across the whole platform? Will we be able to deprecate "redirect" and "gone" as valid schemas for content items in the publishing api? |
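As a rough illustration of the link-following idea in the comment above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the URLs, the `links` key and the relation names (`content`, `previous_edition`) are hypothetical and do not come from the RFC. The only point is that the client follows hrefs it is handed rather than building URLs itself.

```python
# Hypothetical in-memory stand-in for the API: URL -> parsed JSON body.
# None of these URLs or link names come from the RFC; they are illustrative.
FAKE_API = {
    "/api": {"links": {"content": "/api/content/vat-rates"}},
    "/api/content/vat-rates": {
        "title": "VAT rates",
        "links": {"previous_edition": "/api/editions/4"},
    },
    "/api/editions/4": {"title": "VAT rates (older)", "links": {}},
}


def fetch(url):
    """Stand-in for an HTTP GET that returns parsed JSON."""
    return FAKE_API[url]


def follow(start_url, *relations):
    """Start at start_url and follow each named link relation in turn."""
    resource = fetch(start_url)
    for relation in relations:
        resource = fetch(resource["links"][relation])
    return resource


older = follow("/api", "content", "previous_edition")
print(older["title"])
```

A client written this way never needs to know how edition URLs are structured, so the URL scheme could change later without breaking it.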
Hey @h-lame and @MatMoore, thanks for reading and for the comments 👍
Unpublishing
I don't think it's particularly clear from the RFC, but this proposal does not actually suggest much of a change in how unpublishings are handled at the content store level. So I think of this as not being yet another way to handle unpublishing but more of a refinement of how content store currently handles them. Currently the content store doesn't have knowledge of unpublishing; it stores them as varieties of content items.
This wouldn't actually present any changes to how frontend apps use unpublishings.
It would be good in time to deprecate them - it's on the Publishing API wishlist I believe (haven't got access on this device). But I would want to avoid coupling changes elsewhere to this where possible to avoid scope creep.
I think it's a useful distinction but I'm not sure I explained it very well. It's something that only applies in a historical context. If we consider two documents A and B which both have been unpublished with type gone - so the resource at their paths is a RetiredGone. You can still access all of their content historically from the API - eg Editions endpoint
An edition would still always be in the context of a document - it would just be which way the association pointed. eg. when you access an Edition, you have an embedded Document by default. Whereas when you access Document you may have live Edition embedded by default. I'd dare say we (internally) are more familiar with the concept of iterating through editions than documents since they're more data rich and closer to Publishing API endpoints of
I'd expect them to both return the same entity, however in the links for the items both would specify the canonical path as being
Endpoints / HATEOAS
We do intend to make liberal use of links between content. Although we have intended to not mix them in with the expanded links and have a more generic part of an entity definition that defines the hypertext links between resources. We did find that once you dig into the links they're surprisingly contextual, which makes things like "previous_version" harder than you think. E.g. does previous_version mean the previous major version or the previous minor version? Or you could be interested in what was previously at a path. These become a lot easier in the context of the entities.
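To illustrate why "previous_version" is ambiguous, here is a small Python sketch. The edition history and the `major`/`minor` labels are hypothetical data made up for this example; it only shows that the two readings of "previous" can point at different editions.

```python
# Hypothetical edition history, oldest first: (version, update_type).
HISTORY = [
    (1, "major"),
    (2, "minor"),
    (3, "major"),
    (4, "minor"),
    (5, "minor"),
]


def previous_version(version):
    """The edition immediately before this one, whatever its update type."""
    earlier = [v for v, _ in HISTORY if v < version]
    return max(earlier) if earlier else None


def previous_major_version(version):
    """The most recent earlier edition that was a major update."""
    earlier = [v for v, kind in HISTORY if v < version and kind == "major"]
    return max(earlier) if earlier else None


print(previous_version(5))        # 4
print(previous_major_version(5))  # 3
```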
Yup, that's the plan. This is a pre-prototype check in. 👍
Data model
Sure, that might not be that clear without having more of a Publishing API understanding. Edition is the concept most synonymous with a content item, and Document is used to group Editions together.
I think that depends on how you want to mine the information. You're right in the context of "I know about this URL, I want the data behind it", but that doesn't give you a means to see what pages are new or similar, which you could do using the other endpoints.
I think of the difference between these as |
Oh! I think I get it now. When we unpublish something at the moment due to legal issues it ultimately presents a 410 to the user (maybe with a reason) and we're done because we only expose the current version of something. However, the new thing presents the entire history of the document and so we need to be able to retrospectively 410 versions. Got it, makes sense. Thanks for clarification! |
Are they though? I'd argue that changing the locale of an edition makes a new edition of the document so it should have a new id. And we only get a new version number when we publish a new edition of the document, so again, it should have a new id. Perhaps I'm thinking too much about the current document&edition models in whitehall and publishing-api - are these editions somehow different? |
@h-lame you're right. I'm thinking of it as more of a once-in-a-blue-moon scenario, where you have to fix something published with the wrong locale or similar, than something a user can do |
Thanks for this great write-up Kevin. Most of it makes good sense. As you know, I'm a big fan of separating out the routing information, as this will make it much simpler to kill the router-api. And making the unpublishing/gone representation simpler is definitely a big win. However I do think that like Mat I am finding it slightly hard to follow the dual focus of the API on both paths and IDs. I'm not convinced that content IDs are really relevant for an API; they seem to me to be an internal implementation detail of no interest externally. I don't really understand your reply that this allows you to see "what pages are new or similar". And I'm also not sure that items of content change path often enough, or that external users would be sufficiently interested in tracking that path change, to justify making that concern a first-class citizen. (Especially as changing the path of an existing piece of content automatically creates a redirect anyway, so anyone who is interested could follow that.) |
Since Kevin is on holiday at the moment I just thought I would try and answer the questions. I think the reason behind the apparent dual focus of the API is simply that we didn’t want to second guess what users might need from the API. If we only provide base path as a means of accessing content, there is a problem of accessing historical content if the base path changed during the lifetime of a document. Applications can use the content ID and locale as a more stable point of reference to content as combined they make up a primary identifier in our system, rather than relying on a base path which, although human friendly, is implicitly ambiguous when content moves. Although I agree content ID is currently an internal implementation detail, I don’t think there is any reason not to turn it into a more defined feature of the API to support certain use cases both within and outside of GOV.UK. While developing this RFC we were also thinking about the work we had done last quarter on how we might provide access to historical content on GOV.UK, and what we would want an API to look like if we were trying to implement that. We felt that being able to access content through non-changing IDs was a requirement for this, which is why we’ve designed the API in this way. |
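A small Python sketch of the point above about content ID plus locale being a more stable reference than base path. The `Edition` record and the sample data are hypothetical, not the RFC's schema; the sketch only shows how a path-based lookup loses history once a document moves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edition:
    content_id: str
    locale: str
    version: int
    base_path: str


# Hypothetical history: the document moved from /old-guide to /new-guide.
EDITIONS = [
    Edition("123", "en", 1, "/old-guide"),
    Edition("123", "en", 2, "/old-guide"),
    Edition("123", "en", 3, "/new-guide"),
]


def history_by_document(content_id, locale):
    """All editions of one document, regardless of where they lived."""
    return [e for e in EDITIONS if (e.content_id, e.locale) == (content_id, locale)]


def history_by_base_path(base_path):
    """Only the editions that were published at this exact path."""
    return [e for e in EDITIONS if e.base_path == base_path]


print(len(history_by_document("123", "en")))    # 3
print(len(history_by_base_path("/new-guide")))  # 1
```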
@thomasleese I guess a way around that while still using base paths as the identifier would be to return anything that has ever existed at that base path, which may combine editions from multiple content ids (but the user could still distinguish between them by date). It sounds like we don't really know (at least, I don't know) whether something like this would make things simpler or harder for users at the moment. That was part of the reason I asked about testing/prototyping, as I think the endpoints could be streamlined based on what you learn from that. For example, if a really common task requires multiple requests - then we should restructure it so you don't have to do that. |
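The alternative suggested above, returning anything that has ever existed at a base path, could look roughly like this in Python. The record layout and sample values are hypothetical; the sketch just shows results from several content IDs merged and ordered by date so a user can tell them apart.

```python
from datetime import date

# Hypothetical records: (base_path, content_id, published_on).
RECORDS = [
    ("/guidance/widgets", "aaa", date(2015, 3, 1)),
    ("/guidance/widgets", "bbb", date(2017, 6, 12)),
    ("/guidance/gadgets", "ccc", date(2016, 1, 5)),
    ("/guidance/widgets", "aaa", date(2016, 9, 30)),
]


def everything_at(base_path):
    """Every record ever published at base_path, oldest first."""
    matches = [r for r in RECORDS if r[0] == base_path]
    return sorted(matches, key=lambda r: r[2])


for _, content_id, published_on in everything_at("/guidance/widgets"):
    print(published_on.isoformat(), content_id)
```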
Thanks for the feedback @danielroseman and thanks for keeping the conversation going in my absence @thomasleese.
The intention is not to have a dual focus; it's more that data associations can lead to the same resource. e.g: The purpose
So to just summarise what some similar endpoints may mean and why they are different: Assuming we have a document of content_id: 123, locale en, with an edition of id: 5 and base_path:
An example might better illustrate this: Imagine you care about publications and want to know which ones are new
which would retrieve a list of editions
So this reflects more that the index endpoints provide new interfaces to enquire as to what data has changed on GOV.UK
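As a sketch of the "which publications are new" query described above, here is a Python stand-in for an editions index. The field names (`document_type`, `published_at`) and the filter parameters are assumptions for illustration, not the RFC's actual endpoint or query syntax.

```python
from datetime import datetime

# Hypothetical edition records; field names are illustrative only.
EDITIONS = [
    {"id": 1, "document_type": "publication", "published_at": datetime(2017, 8, 1)},
    {"id": 2, "document_type": "news_story", "published_at": datetime(2017, 9, 2)},
    {"id": 3, "document_type": "publication", "published_at": datetime(2017, 9, 5)},
]


def editions_index(document_type, published_after):
    """Stand-in for an index endpoint filtered by type and publish date."""
    return [
        e
        for e in EDITIONS
        if e["document_type"] == document_type
        and e["published_at"] > published_after
    ]


new_publications = editions_index("publication", datetime(2017, 9, 1))
print([e["id"] for e in new_publications])  # [3]
```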
This is something you could do by using say
One of the things we've tried to do with this is abstract locations out of the resource, so that you don't have to access the whole resource if you're iterating through, rather than creating something of a god endpoint which returns pretty much everything. Another thing to consider is that resources are associated with potentially multiple paths, and for someone outside GOV.UK it's quite a confusing concept to understand how these paths work. Ideally we want to provide a means to look up just by path rather than base_path so people don't have to understand the concept too deeply, but this can be a difficult thing to iterate through.
Yeah, we'll be prototyping connecting this up for apps like government-frontend to make sure they can still do things in a single request. But for the most part these endpoints are to enable data access rather than use cases, so we'd learn when someone chose to use them. We'd likely use them to prototype some basic applications for comparing historical content to evaluate them. |
This seems to be causing some confusion so hopefully this copy will help address this.
I've just added a new section to address some of the confusion regarding preference of content_id vs path. Hopefully this helps, let me know if it actually makes things worse. I'm also adding a deadline to this of next Tuesday: 19th September 2017. |
Three points from me.
Other than that - looks good! |
Thanks @edent
Sure, can make this explicit (assuming I can find a suitable spot)
That's understandable. This is HTML pages, and I imagine this RFC does rely on that prior understanding; will amend.
That's an excellent shout on the 451; I hadn't realised there was a status code specifically for legal cases. We were planning to use 410 for that scenario, which actually made life more complicated as we'd have two types of 410 resources. Switching to 451 for the legal scenario can make this simpler. Thanks. This does intend to make use of 30x redirects; I'll have a scan to see if that can be made clearer. Edit: Ah ha, I've just noticed that you, @edent, were pivotal in the introduction of the 451 status code. Great job 👍 |
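A minimal Python sketch of how the statuses discussed in this thread might map onto HTTP codes. The state names are hypothetical labels, and 301 is an assumption standing in for the unspecified "30x"; treating withdrawn as 200 follows the earlier remark that withdrawn documents are essentially published.

```python
from http import HTTPStatus

# Hypothetical state names mapped to HTTP statuses.
# 301 is an assumed choice; the thread only says "30x redirects".
STATUS_FOR = {
    "published": HTTPStatus.OK,
    "withdrawn": HTTPStatus.OK,
    "redirect": HTTPStatus.MOVED_PERMANENTLY,
    "gone": HTTPStatus.GONE,
    "gone_legal": HTTPStatus.UNAVAILABLE_FOR_LEGAL_REASONS,
}


def status_for(state):
    """Return the numeric HTTP status code for a content state."""
    return int(STATUS_FOR[state])


for state in STATUS_FOR:
    print(state, status_for(state))
```

Keeping the legal case on its own code means a client can tell the two kinds of removal apart from the status line alone.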
As an aside, is there any plan to track pdfs and other attachments as content items in the future? |
We currently have HTML attachments represented in content-store. I'm not aware of plans for integration of assets as entities in Publishing API and thus Content Store but @danielroseman may have more of an insight on this. |
Following feedback this includes a documentation plan.
Following feedback this includes some information on the status of non HTML files.
Includes some copy introducing a revised approach to Gone types considered from the context of the 451 HTTP status code. This has been done with minimal edits to the document since the deadline for the RFC is very soon and the audience for this RFC is likely already acquainted with the main copy.
This RFC explains the approach we (the API for Content team) have taken towards defining the next iteration of the Content API. The Content API is the set of public endpoints served from the Content Store on www.gov.uk/api/content.
As this is to be both a public and widely used internal API, we felt it was prudent to put this to an RFC while it's still malleable, to get wider feedback and awareness of things we haven't considered.
Apologies this is so long - I didn't quite realise how much longer it was than most of the other RFCs until it'd been a while. So maybe find a comfy chair before settling down to read through this.
Also I'm on holiday from 15th August until 23rd August, so I may be a bit tardy responding to comments, but others from API for Content will respond and I'll look forward to reading them when I'm back.
Thanks for reading 👍