Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support new Aggregation DataStore #804

Closed
aklish opened this issue May 7, 2019 · 11 comments
Closed

Support new Aggregation DataStore #804

aklish opened this issue May 7, 2019 · 11 comments
Labels
RequestForComment For proposals and ideas in which input is sought. These issues are typically not yet team-accepted.
Milestone

Comments

@aklish
Copy link
Member

aklish commented May 7, 2019

Overview

Elide would benefit from a new Aggregation DataStore that can support:

  • Metric aggregation (SUM, COUNT, etc.) of model attributes
  • Group By clauses over model attributes and relationships
  • Having clauses which filter on aggregated attributes.

Models managed by this store would have the following restrictions:

  • The entity models would be read only.
  • Direct relationships to other entities would always be toOne relationships to enforce a star (or snowflake) schema.

One goal of such a DataStore is to add seamless integration between Elide and Navi.

Example Model

Here is an example model:

@Entity
@Include
public class PlayerStats {
  @Id
  private String id;

  @OneToOne
  private Player player;

  @OneToOne
  private Country country;

  private Date sessionDate;

  @MetricAggregation(default = Max.class)
  private long highScore;

  @MetricAggregation(default = Sum.class)

  @MetricAggregation(default = Sum.class)
  private long gemsPurchased;
}

Example API Call

JSON-API lacks some of the flexibility of GraphQL (attributes with parameters). While most features in this RFC will work with JSON-API, some aspects will only work with GraphQL.

GraphQL

Request

{
  playerStats(filter: "highScore>300") {{
    edges {
      node {
        highScore
        country
          edges {
            node {
              name
            }
          }
        }
      }
    }
  }
}

Response

{
  "playerStats": {
    "edges": [
      { 
        "node": {
          "highScore" : 681,
          "country": {
            "edges": [
              { 
                "node": {
                  "name": "Germany"
                }
              }
            ]
          }
        }
      },
      { 
        "node": {
          "highScore" : 421,
          "country": {
            "edges": [
              { 
                "node": {
                  "name": "Italy"
                }
              }
            ]
          }
        }
      }
    ]
  }
}

JSON-API

GET /api/v1/playerStats?fields[playerStats]=highScore,country?filter[playerStats]=highScore>300
Content-Type: application/vnd.api+json
Accept: application/vnd.api+json

{
  "data": [
    { 
      "type": "playerStats",
      "id": "AB121AF78972A",
      "attributes": {
        "highScore": "681"
      },
      "relationships": {
        "country":{
          "data": {"type": "country", "id" : 672}
        }
      }
    },
    { 
      "type": "playerStats",
      "id": "F1171AF7811C",
      "attributes": {
        "highScore": "421"
      },
      "relationships": {
        "country":{
          "data": {"type": "country", "id" : 565}
        }
      }
    }
  ]
}

JPQL Generation

The example API queries would generate JPQL similar to:

SELECT MAX(models_PlayerStats.highScore), models_PlayerStats.country.id FROM models_PlayerStats GROUP BY models_PlayerStats.country.id HAVING SUM(models_PlayerStats.highScore) > 300

Model Bindings

The intent is that the JPA Models will be bound to either existing SQL tables or views that are created to project the desired fact tables.

Metric Definition

Metric columns are any column annotated with a new DataStore annotation MetricAggregation. This annotation takes two arguments:

  1. the default aggregation function to apply to the metric.
  2. the complete list of supported aggregations.

Client Supplied Aggregation Function

For GraphQL only, we can support arguments for the metric attributes allowing the client to pick the aggregation function:

{
  playerStats(filter: "highScore>300") {{
    edges {
      node {
        highScore(aggregation: "MAX")
        country
          edges {
            node {
              name
            }
          }
        }
      }
    }
  }
}

Supported aggregations can be specified in the model:

  @MetricAggregation(default = Max.class, aggregations = {Max.class, Median.class})
  private long highScore;

Metric classes can be registered with the AggregationDataStore. They are responsible for generating a JPQL query fragment that wraps the column name.

public interface JPQLWrapper {
    public String toJPQL();
}

The return string is actually a template where the text '{}' is substituted with the annotated column name.

Dimension Definition

Dimension columns are any relationship or alternatively any attribute without a MetricAggregation annotation.

Date Grain Specification

The AggregationDataStore will inject a grain attribute argument into the GraphQL schema for any:

  1. java.util.Date
  2. java.sql.Date
  3. java.time.LocalDate
{
  playerStats(filter: "highScore>300") {{
    edges {
      node {
        highScore(aggregation: "MAX")
        sessionDate(grain = "month")
        }
      }
    }
  }
}

Time grains will be implementations of JPQLWrapper that are used to wrap the date columns with custom JPQL. Given that date truncation will be SQL dialect specific, the wrappers should leverage the FUNCTION capability in JPQL.

Group By

The data store will generate a Group By clause for any dimension column the client retrieves in the query. If no dimension columns are retrieved, no Group By clause is generated.

Dimension Attribute Grouping

Grouping by a dimension attribute (country.region) will not be supported initially. However, it could be supported by a new GraphQL parameter at the relationship level (groupBy):

{
  playerStats(filter: "highScore>300") {{
    edges {
      node {
        highScore(aggregation: "MAX")
        country(groupBy: "region.name")
          edges {
            node {
              name
            }
          }
        }
      }
    }
  }
}

The groupBy parameter can support a '.' separated path through the entity relationship graph (provided that all relationships in the path are to-one relationships).

Having

The data store will generate a Having clause for any client supplied filter for a metric column.

Id Generation

A row identifier will be generated that is unique per query row.

It will not be possible to fetch data through this ID.

Metadata Discovery

Metadata will be discovered through tables defined by this issue.

@aklish aklish added this to the elide-5.0 milestone May 7, 2019
@aklish aklish added the RequestForComment For proposals and ideas in which input is sought. These issues are typically not yet team-accepted. label May 7, 2019
@DennisMcWherter
Copy link
Collaborator

DennisMcWherter commented May 8, 2019

I think this would be a great addition to Elide. Just some questions/comments below.

Minor document fixup: I think the JPQL model should be using MAX instead of SUM in your example.

... single argument which is the default aggregation function

It is unclear to me how this would work for JSON-API. If I am fetching an object, I can't necessarily invoke the @ComputedAttribute as described for GraphQL. Is the intent that JSON-API will not support this?

One way I can think of is by possibly encoding this information in the id parameter. This feels like a pretty difficult-to-use interface to me, though.

However, this capability (computed attributes with model defined parameters) does not yet exist in Elide.

Should this be a separate design/write-up?

In either case, the data store generates the ID during object retrieval.

It should be possible to fetch an individual fact table row by its Id.

Since these tables are read-only, the @Transient id seems like the clearest approach to me. It makes it obvious that this value is not stored anywhere. However, id construction by the datastore would then have to be complete in identifying a unique query. Perhaps base64-encoding all the relevant query parameters is an effective way to do this? The datastore could then construct/deconstruct as necessary.

The discovery aspect of this document is interesting in that it's a small shift from what we have today. Previously, Elide had a uniform policy on how to interact with data (exception: certain computed attributes) and this was defined by which API-format you used (GraphQL vs. JSON-API). This adds some additional power and I wonder if the discovery mechanism needs to be very bold on how it's called out. Just a thought: no real action item here.

... all Elide data stores should implement BridgeableStore

To clarify: as part of this work we would implement the bridgeable store on any official store which does not yet support it, correct?

@aklish
Copy link
Member Author

aklish commented May 8, 2019

Minor document fixup: I think the JPQL model should be using MAX instead of SUM in your example.

Fixed.

It is unclear to me how this would work for JSON-API. If I am fetching an object, I can't necessarily
invoke the @ComputedAttribute as described for GraphQL. Is the intent that JSON-API will not support this?

Correct

Should this be a separate design/write-up?

Yes. We don't see this as an MVP feature for the AggregationDataStore

Perhaps base64-encoding all the relevant query parameters is an effective way to do this?

👍

This adds some additional power and I wonder if the discovery mechanism needs to be very bold on how it's called out. Just a thought: no real action item here.

Can you elaborate on what you mean here?

To clarify: as part of this work we would implement the bridgeable store on any official store which does not yet support it, correct?

Correct.

@jkusa
Copy link
Member

jkusa commented May 8, 2019

I think we need a section on date handling. Specifically how do you group by at different time grains (hour, day, week, month, quarter, year, etc).

@aklish
Copy link
Member Author

aklish commented May 8, 2019

@jkusa Since there is no guarantee the table even has a time dimension, I think the grain of the table should be baked into the table itself. This means there would be separate tables for hour, day, month, etc for a fact table with a time grain. Does Navi need to know the time grain of the table?

@QubitPi
Copy link
Contributor

QubitPi commented May 8, 2019

It looks like to me that SQL already support SUM, COUNT, etc. Are we talking about having Elide to just support these syntax? Because I am not sure why we need a new DataStore

@DennisMcWherter
Copy link
Collaborator

Can you elaborate on what you mean here?

I'm just calling out that call this API will now deviate out-of-the box. The id structure, reading/writing data, etc. While it's still well-formed (and to the remaining spec's) it's a restricted set of operations. While the distinction feels quite apparent to me, I wonder if new users will struggle with this if we don't document it thoroughly.

@jkusa
Copy link
Member

jkusa commented May 9, 2019

there is no guarantee the table even has a time dimension

That is correct, however aggregating across various time grains is a basic feature in any analytics/reporting tool. The datastore can be aware of the presence of time based columns. This could be done with an annotation such as @Temporal or a custom one. The field type could be used as well.

In the case there is a Date/Time field present, a user should be able group by that column at specified time grain. In graphql this could be an argument to the field. JSON-API is a little trickier.

Does Navi need to know the time grain of the table?

At a minimum, it needs to know if there is a Date/Time field present in the table. An optimization would be to communicate to Navi what time grains are supported. For instance, if the table is pre-aggregated at the day level, don't allow the option for hourly grouping and filtering

@jkusa
Copy link
Member

jkusa commented May 9, 2019

It should be possible to fetch an individual fact table row by its Id.

How would this work if a set of filter values no longer return any rows from the data source?

@michael-mclawhorn
Copy link
Collaborator

michael-mclawhorn commented May 9, 2019

In case it's not well understood, time grain is a computed dimension on a dateTime attribute. Which means if we did have a special @Temporal annotation or something like, it would potentially behave differently under grouping than under filtering. Or there could be a computed view on a raw phyiscal timestamp that simply rounds it to a grain, then it would be consistent under grouping and filtering.

@michael-mclawhorn
Copy link
Collaborator

It should be possible to fetch an individual fact table row by its Id.

How would this work if a set of filter values no longer return any rows from the data source?

We definitely need some sort of field seperator so that explicitly null values can be baked into the key as emptystrings and there's no ambiguity if FooBar is Foo, Bar, '' or '', Foo, Bar or Foo, '', Bar

@michael-mclawhorn
Copy link
Collaborator

How do null joins work now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RequestForComment For proposals and ideas in which input is sought. These issues are typically not yet team-accepted.
Projects
None yet
Development

No branches or pull requests

5 participants