
RFC 81: Approach for next iteration of Content API #81

Merged — 5 commits merged into master, Sep 21, 2017

Conversation

kevindew
Member

@kevindew kevindew commented Aug 15, 2017

This RFC explains the approach we (the API for Content team) have taken towards defining the next iteration of the Content API. The Content API being the public endpoints that are served from the Content Store on www.gov.uk/api/content.

As this is to be both a public and widely used internal API we felt it was prudent to put this to an RFC while it's still malleable to get wider feedback and awareness of things we haven't considered.

Apologies this is so long - I didn't quite realise how much longer it was than most of the other RFCs until I'd been writing for a while. So maybe find a comfy chair before settling down to read through this.

Also I'm on holiday from 15th August until 23rd August so may myself be a bit tardy responding to comments, but others from API for Content will respond and I'll look forward to catching up when I'm back.

Thanks for reading 👍

@h-lame
Contributor

h-lame commented Aug 16, 2017

This looks like really interesting work and this RFC is incredibly thorough 👍. Two things immediately spring to mind:

  1. I'm somewhat wary of the "avoiding unpublishing terminology" and "here's some new terminology" around Gone. I think we might already have 2 different ways of talking about this sort of functionality and adding a third seems confusing. For example: I think in whitehall we can unpublish with or without reasons, but in other publishers we can withdraw? I admit I might already be confused about this stuff so could very much be wrong here. I wonder if we should attack this problem head on (separate to this RFC) and do the hard work to standardise across the apps and API stack before we introduce a new set of terminology. There's also a tiny part of me that wonders if we need the distinction that RetiredGone and RevokedGone give us? And how we might expose that in publishing apps (I think whitehall has a "legal problems" option for unpublishing, but not sure others do).

  2. I'm a little unsure about the /edition endpoints. Even taking all of the "we're not sure what the use-cases will be so don't want to engineer to limit them" stuff into account, I'm still not clear if we want to expose an edition outside of the context of the document it belongs to. (This might be the old rails deeply-nested routes vs. shallow routes argument which I'm rehashing and the community seems to have settled that on shallow routes being the way to go). In my head I imagine the id of an edition to be a compound of the document content id and the version number (maybe locale? - but I'm not sure we publish locales separately at the moment), so I'm not sure I see the benefit of /editions/{compound_id} vs. /document/{content_id}/{locale}/editions/version/{version} (other than brevity). How would we get an edition id anyway to point us to /editions/{id}? Would the longform /document/.... url be a redirect to the shortform /editions/{id} endpoint, or do we expect the data returned from these to be different somehow?

These are a bit rambly, I apologise. I fully expect I'm not in possession of all the background context, so please do point me elsewhere if some background reading would clue me in.

@MatMoore
Contributor

MatMoore commented Aug 16, 2017

This is a great write up, and it's really great that this work is happening 🕺

I'll probably have a read through this in more detail and comment again later on, but for now I just wanted to echo @h-lame's reservations about the routes and unpublishing. On both points I don't have a strong opinion about what we should do, I just think we should be very careful as these are big decisions affecting the usability of the API, and will be very difficult to change once the thing is live.

The endpoints

One way to mitigate some of the issues around URL structures might be to encourage the use of HATEOAS (links) instead of manually constructing URLs.

For example: if the main entry point to the API is /api/content/some-base-path and that returned you API urls that refer to all the editions that have ever existed at that base path, then you can very easily code something that reads those links and renders a browsable version history. You could do similar things for locales too, and present them alongside document/edition links. Something like:

"links": {
  "taxons": [...],
  "organisations": [...],
  "previous_versions": [...],
  "locales": [{"locale": "es", "content-id": "abc", "href": "/api/content/whatever/abc/es"}]
}

The benefit of doing this is that users don't have to fully understand the document/edition publishing model to start doing something with the API, and they don't need to worry about things like locales if they don't need to.
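To illustrate the benefit: a minimal sketch of a client that reads the links from a response rather than constructing URLs by hand. The response shape mirrors the example above and is an assumption, not the real Content API contract.

```python
# Hypothetical sketch: follow HATEOAS links instead of building URLs
# manually. The "links"/"locales" shape matches the example in this
# comment and is assumed, not a confirmed interface.

def locale_hrefs(response):
    """Map each locale in an API response's links to its href."""
    return {
        link["locale"]: link["href"]
        for link in response.get("links", {}).get("locales", [])
    }

example = {
    "links": {
        "locales": [
            {"locale": "es", "content-id": "abc",
             "href": "/api/content/whatever/abc/es"},
        ]
    }
}

print(locale_hrefs(example))  # {'es': '/api/content/whatever/abc/es'}
```

A client written this way keeps working if the server changes its URL structure, since it never builds URLs itself.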

If you haven't already, I'd recommend building (or asking someone else to build) some kind of prototype that makes use of the different versions at some point before putting it live, to learn more about what it's like to use the API for historical content. I think we should learn as much as we can about this before it goes live, because it'll be hard to change the endpoints afterwards, and this is one of the motivating use cases that isn't already existing functionality on GOV.UK.

The data model

I think the RFC could be clearer about what the core content-item-y-thing is, if that makes any sense...

I'm kind of assuming that what users care about is a base_path, because the API is providing information about "what has ever existed at this URL". But I may be assuming this just because that's how the content store works: for every URL on GOV.UK you can stick "/api/content" in front of it and get some JSON.

On the other hand, some of the proposed endpoints hint at the publishing model, where the "thing" is identified by its content_id (not the base path). By exposing content ids in our endpoints we're encouraging another way of thinking about content, where two separate documents published at the same URL are considered separate things, but the same document published at different URLs would be considered the same thing.

Do we need the API to be both base_path oriented and content_id oriented?

Unpublishing

Unpublishing worries me in general because it's already very complicated. On the search team we've had a difficult time integrating rummager with the publishing api because we didn't have an accurate mental model of all the different scenarios, and there are inconsistencies (for example: withdrawn documents are essentially published, and retain their document type, whereas all other unpublishings become something else).

If we're changing how we deal with unpublished content on the frontend, should we be changing anything on the backend to make unpublishing simpler across the whole platform? Will we be able to deprecate "redirect" and "gone" as valid schemas for content items in the publishing api?

@kevindew
Member Author

Hey @h-lame and @MatMoore thanks for reading and for the comments 👍

Unpublishing

I don't think it's particularly clear from the RFC, but this proposal does not actually suggest much of a change in how they are handled at the content store level. So I think of this not as yet another way to handle unpublishing but more as a refinement of how the content store currently handles them. Currently the content store doesn't have knowledge of unpublishings; it stores them as varieties of content items.

If we're changing how we deal with unpublished content on the frontend, should we be changing anything on the backend to make unpublishing simpler across the whole platform?

This wouldn't actually present any changes to how frontend apps use unpublishings.

Will we be able to deprecate "redirect" and "gone" as valid schemas for content items in the publishing api?

It would be good in time to deprecate them - it's on the Publishing API wishlist I believe (haven't got access on this device). But I would want to avoid coupling changes elsewhere to this where possible to avoid scope creep.

There's also a tiny part of me that wonders if we need the distinction that RetiredGone and RevokedGone give us?

I think it's a useful distinction but I'm not sure I explained it very well. It's something that only applies in a historical context. If we consider two documents A and B which both have been unpublished with type gone - so the resource at their paths is a RetiredGone. You can still access all of their content historically from the API - e.g. /edition/a1 returns 200 and /edition/b1 returns 200. However if you later discover that document B contains legally sensitive information that we are no longer able to store in our API then you would want /edition/b1 to return a 410 with information as to why it was revoked. Did that help?
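To make that distinction concrete, here's a tiny illustrative sketch of how the two Gone types would map to historical responses. The type names come from the RFC; the function itself is hypothetical, not part of the proposal.

```python
# Illustrative only: a RetiredGone edition stays readable through the
# history endpoints, while a RevokedGone edition is withdrawn even
# historically (e.g. for legal reasons) and returns 410.

def historical_status(gone_type):
    """Status code a historical edition lookup would return."""
    if gone_type == "retired":
        return 200  # gone from its path, but its history is still readable
    if gone_type == "revoked":
        return 410  # must not be served at all, even historically
    raise ValueError(f"unknown gone type: {gone_type}")
```

So in the A/B example above, document A's editions would behave like "retired" and document B's like "revoked".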

Editions endpoint

I'm still not clear if we want to expose an edition outside of the context of the document it belongs to

An edition would still always be in the context of a document - it would just be which way the association pointed.

eg. when you access an Edition, you have an embedded Document by default. Whereas when you access Document you may have live Edition embedded by default. I'd dare say we (internally) are more familiar with the concept of iterating through editions than documents since they're more data rich and closer to Publishing API endpoints of GET /v2/content and GET /v2/editions

Would the longform /document/.... url be a redirect to the shortform /editions/{id} endpoint, or do we expect the data returned from these to be different somehow?

I'd expect them to both return the same entity, however in the links for the items both would specify the canonical path as being /edition/{id}. I think brevity is useful compared to /document/{content_id}/{locale}/editions/version/{version} but I also think version and locale are potentially transient, whereas the edition id would be absolute.

Endpoints / HATEOAS

We do intend to make liberal use of links between content, though we intend not to mix them in with the expanded links, and instead to have a more generic part of an entity definition that defines the hypertext links between resources.

We did find that once you dig into the links they're surprisingly contextual, which makes things like "previous_version" harder than you'd think. E.g. is previous_version the previous major version or the previous minor version? Or you could be interested in what was previously at a path. These become a lot easier in the context of the entities.

If you haven't already, I'd recommend building (or asking someone else to build) some kind of prototype that makes use of the different versions at some point before putting it live, to learn more about what it's like to use the API for historical content.

Yup, that's the plan. This is a pre-prototype check in. 👍

Data model

I think the RFC could be clearer about what the core content-item-y-thing is, if that makes any sense...

Sure, that might not be that clear without more of a Publishing API understanding. Edition is the concept most synonymous with a content item, and Document is used to group Editions together.

I'm kind of assuming that what users care about is a base_path, because the API is providing information about "what has ever existed at this URL". But I may just be assuming this just because that's how the content store works: for every URL on GOV.UK you can stick "/api/content" in front of it and get some JSON.

I think that depends on how you want to mine the information. You're right in the context of "I know about this URL, I want the data behind it", but that doesn't provide you a means to see which pages are new, or similar queries, which could be answered by following the other endpoints.

Do we need the API to be both base_path oriented and content_id oriented?

I think of the difference between these as base_path is something transient and content_id is absolute. So you want to use the former when you care about the resource at that path, whereas content_id is much more about that particular piece of content.

@h-lame
Contributor

h-lame commented Aug 17, 2017

There's also a tiny part of me that wonders if we need the distinction that RetiredGone and RevokedGone give us?

I think it's a useful distinction but I'm not sure I explained it very well. It's something that only applies in a historical context. If we consider two documents A and B which both have been unpublished with type gone - so the resource at their paths is a RetiredGone. You can still access all of their content historically from the API - e.g. /edition/a1 returns 200 and /edition/b1 returns 200. However if you later discover that document B contains legally sensitive information that we are no longer able to store in our API then you would want /edition/b1 to return a 410 with information as to why it was revoked. Did that help?

Oh! I think I get it now. When we unpublish something at the moment due to legal issues it ultimately presents a 410 to the user (maybe with a reason) and we're done because we only expose the current version of something. However, the new thing presents the entire history of the document and so we need to be able to retrospectively 410 versions. Got it, makes sense. Thanks for clarification!

@h-lame
Contributor

h-lame commented Aug 17, 2017

but I also think version and locale are potentially transient, whereas the edition id would be absolute.

Are they though? I'd argue that changing the locale of an edition makes a new edition of the document so it should have a new id. And we only get a new version number when we publish a new edition of the document, so again, it should have a new id. Perhaps I'm thinking too much about the current document&edition models in whitehall and publishing-api - are these editions somehow different?

@kevindew
Member Author

@h-lame you're right. I'm thinking of it as more of a once-in-a-blue-moon scenario where you have to fix something published with the wrong locale or similar, rather than something a user can do.

@danielroseman

Thanks for this great write-up Kevin.

Most of it makes good sense. As you know, I'm a big fan of separating out the routing information, as this will make it much simpler to kill the router-api. And making the unpublishing/gone representation simpler is definitely a big win.

However I do think that like Mat I am finding it slightly hard to follow the dual focus of the API on both paths and IDs. I'm not convinced that content IDs are really relevant for an API; they seem to me to be an internal implementation detail of no interest externally. I don't really understand your reply that this allows you to see "what pages are new or similar". And I'm also not sure that items of content change path often enough, or that external users would be sufficiently interested in tracking that path change, to justify making that concern a first-class citizen. (Especially as changing the path of an existing piece of content automatically creates a redirect anyway, so anyone who is interested could follow that.)

@thomasleese
Contributor

Since Kevin is on holiday at the moment I just thought I would try and answer the questions.

I think the reason behind the apparent dual focus of the API is simply that we didn’t want to second guess what users might need from the API. If we only provide base path as a means of accessing content, there is a problem of accessing historical content if the base path changed during the lifetime of a document. Applications can use the content ID and locale as a more stable point of reference to content as combined they make up a primary identifier in our system, rather than relying on a base path which, although human friendly, is implicitly ambiguous when content moves. Although I agree content ID is currently an internal implementation detail, I don’t think there is any reason not to turn it into a more defined feature of the API to support certain use cases both within and outside of GOV.UK.

While developing this RFC we were also thinking about the work we had done last quarter on how we might provide access to historical content on GOV.UK, and what we would want an API to look like if we were trying to implement that. We felt that being able to access content through non-changing IDs was a requirement for this, which is why we’ve designed the API in this way.

@MatMoore
Contributor

@thomasleese I guess a way around that while still using base paths as the identifier would be to return anything that has ever existed at that base path, which may combine editions from multiple content ids (but the user could still distinguish between them by date).

It sounds like we don't really know (at least, I don't know) whether something like this would make things simpler or harder for users at the moment. That was part of the reason I asked about testing/prototyping, as I think the endpoints could be streamlined based on what you learn from that. For example, if a really common task requires multiple requests - then we should restructure it so you don't have to do that.

@kevindew
Member Author

Thanks for the feedback @danielroseman and thanks for keeping the conversation going in my absence @thomasleese.

However I do think that like Mat I am finding it slightly hard to follow the dual focus of the API on both paths and IDs. I'm not convinced that content IDs are really relevant for an API; they seem to me to be an internal implementation detail of no interest externally.

The intention is not to have a dual focus; it's more that data associations can lead to the same resource, e.g. location -> edition -> document, or going from document -> edition -> location.

The purpose content_id has in the API is the ability to access a document entity which represents a collection of editions. So a scenario of usage could be that you look up an Edition at a path and then use the document association (content_id and locale) to look up all the editions of that document to see how it has evolved over time. I think we need a means to look up that document, and content_id seems the natural means for it.

So to just summarise what some similar endpoints may mean and why they are different:

Assuming we have a document of content_id: 123, locale en, with an edition of id: 5 and base_path: /test

  • /resource/test - Would return edition: 5 (with document embedded), would return something different if replaced/updated
  • /edition/5 - Method to always return edition: 5, no matter whether live or at a different path.
  • /documents/123/en - Returns the document associated with edition: 5
  • /documents/123/en/editions - Would be a paginated list of all editions (and thus include edition 5)
  • /documents/123/en/editions/live - Would return edition: 5, not really intended as a different route to the edition, but it’s a logical data association.
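The lookup flow implied by those endpoints can be sketched roughly as follows. The endpoint shapes and field names are taken from the examples in this thread and may not match the final API; the base URL is also an assumption.

```python
# Rough sketch of the traversal described above: start from a path,
# then use the document association (content_id, locale) to list every
# edition. Endpoint shapes are assumed from the thread's examples.

BASE = "https://www.gov.uk/api"  # assumed base URL

def edition_url_for_path(path):
    """URL for whatever edition currently lives at a path."""
    return f"{BASE}/resource{path}"

def editions_url(content_id, locale):
    """URL for the paginated list of all editions of a document."""
    return f"{BASE}/documents/{content_id}/{locale}/editions"

# Using the example above (content_id 123, locale "en", path /test):
print(edition_url_for_path("/test"))
# https://www.gov.uk/api/resource/test
print(editions_url("123", "en"))
# https://www.gov.uk/api/documents/123/en/editions
```

So a client would fetch the first URL, read content_id and locale from the embedded document, then fetch the second to walk the document's history.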

I don't really understand your reply that this allows you to see "what pages are new or similar".

An example might better illustrate this:

Imagine you care about publications and want to know which ones are new

/editions?document_type=publication&published_at_after=2018-04-24

which would retrieve a list of editions

[
  { id: 4 ... },
  { id: 5 ... }
]

So this reflects more that the index endpoints provide new interfaces to enquire as to what data has changed on GOV.UK.
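Building that kind of query can be sketched in a couple of lines. The query parameters match the example above; they are proposed, not a confirmed interface.

```python
# Illustrative sketch of constructing the (proposed) editions index
# query for new publications; parameter names match the example above.
from urllib.parse import urlencode

def new_publications_url(since):
    """Query URL for publications published after a given date."""
    params = {"document_type": "publication", "published_at_after": since}
    return "/editions?" + urlencode(params)

print(new_publications_url("2018-04-24"))
# /editions?document_type=publication&published_at_after=2018-04-24
```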

I guess a way around that while still using base paths as the identifier would be to return anything that has ever existed at that base path, which may combine editions from multiple content ids (but the user could still distinguish between them by date).

This is something you could do by using /locations?base_path=/my-path, for instance, which would return a list of location objects showing you the history of that base_path.

One of the things we’ve tried to do with this is abstract out locations from the resource in this so that you don’t have to access the whole resource if you’re iterating through. Rather than say creating something of a god endpoint which returns pretty much everything.

Another thing to consider is that resources are associated with potentially multiple paths and for someone outside GOV.UK it’s quite a confusing concept to understand how these paths work. Ideally we want to provide means to look up just by path rather than base_path so people don’t have to understand the concept too deeply, but this can be a difficult thing to iterate through.

It sounds like we don't really know (at least, I don't know) whether something like this would make things simpler or harder for users at the moment.

Yeah, we’ll be prototyping connecting this up for apps like government-frontend to make sure they can still get what they need in a single request. But for the most part these endpoints are to enable data access rather than specific use cases, so we'd learn when someone chose to use them. We'd likely use them to prototype some basic applications for comparing historical content to evaluate them.

This seems to be causing some confusion so hopefully this copy will help address this.
@kevindew
Member Author

kevindew commented Sep 11, 2017

I've just added a new section to address some of the confusion regarding preference of content_id vs path. Hopefully this helps, let me know if it actually makes things worse.

I'm also adding a deadline to this of next Tuesday: 19th September 2017.

@edent

edent commented Sep 13, 2017

Three points from me.

  1. We are planning on standardising on OpenAPI v3 (previously Swagger). It would be helpful if this form of documentation was mentioned explicitly.
  2. I'm confused as to whether this only gets HTML documents - or whether it can get previously published ODF, PDF, ZIPs etc.
  3. Any plans to use HTTP status codes? For example the 30X range and 451 may be suitable in some contexts.

Other than that - looks good!

@kevindew
Member Author

kevindew commented Sep 15, 2017

Thanks @edent

  1. We are planning on standardising on OpenAPI v3 (previously Swagger). It would be helpful if this form of documentation was mentioned explicitly.

Sure, can make this explicit (assuming I can find a suitable spot).

  2. I'm confused as to whether this only gets HTML documents - or whether it can get previously published ODF, PDF, ZIPs etc.

That's understandable: this is HTML pages only, and I imagine this RFC does rely on that prior understanding; will amend.

  3. Any plans to use HTTP status codes? For example the 30X range and 451 may be suitable in some contexts.

That's an excellent shout on 451, I hadn't realised there was one specifically for legal cases. We were planning to use 410 for that scenario, which actually made life more complicated as we have two types of 410 resources. Switching to 451 for the legal scenario can make this simpler. Thanks.

This does intend to make use of 30x redirects, I'll have a scan to see if that can be made clearer.

Edit: Ah ha I've just noticed that you, @edent, were pivotal in the introduction of the 451 status code. Great job 👍
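The status-code handling discussed here could look something like this on the client side. This is a hedged sketch of the eventual behaviour (410 for retired content, 451 for legally revoked content, 30x for moves), not a confirmed contract.

```python
# Illustrative client-side interpretation of the status codes discussed
# above: follow 30x redirects, treat 410 as retired content, and 451
# (RFC 7725, "Unavailable For Legal Reasons") as legally revoked content.

def classify(status):
    """Map an HTTP status code to the content states discussed above."""
    if 300 <= status < 400:
        return "redirect"  # content moved; follow the Location header
    if status == 410:
        return "retired"   # gone, but not for legal reasons
    if status == 451:
        return "revoked"   # removed for legal reasons
    if status == 200:
        return "ok"
    return "other"

print(classify(301))  # redirect
print(classify(451))  # revoked
```

Using 451 for the legal case means a client no longer has to distinguish two different kinds of 410 body.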

@MatMoore
Contributor

As an aside, is there any plan to track pdfs and other attachments as content items in the future?

@kevindew
Member Author

As an aside, is there any plan to track pdfs and other attachments as content items in the future?

We currently have HTML attachments represented in content-store. I'm not aware of plans for integration of assets as entities in Publishing API and thus Content Store but @danielroseman may have more of an insight on this.

Following feedback this includes a documentation plan.
Following feedback this includes some information on the status of non-HTML files.
@kevindew kevindew force-pushed the content-api-approach branch from 4de4022 to 62eb669 on September 18, 2017 13:27
Includes some copy introducing a revised approach to Gone types
considered from the context of the 451 HTTP status code.

This has been done with minimal edits to the document since the deadline
for the RFC is very soon and the audience for this RFC is likely already
acquainted with the main copy.
@kevindew kevindew force-pushed the content-api-approach branch from ed61c26 to 3156c39 on September 18, 2017 13:53
@boffbowsh boffbowsh merged commit dce83c6 into master Sep 21, 2017
@boffbowsh boffbowsh deleted the content-api-approach branch September 21, 2017 18:42