Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[compiler] disentangle IR parser from bindings #13990

Merged
merged 6 commits into from
Nov 15, 2023

Conversation

patrick-schultz
Copy link
Collaborator

@patrick-schultz patrick-schultz commented Nov 8, 2023

Currently the binding structure is redundantly specified in two places: Binds.scala, and the parser. We need the binding structure in the parser to propagate the environment, so we can annotate Ref nodes (and a few other things) with their types. But we can't use Binds.scala because we don't yet have an IR.

This PR removes environment maintenance from the parser by deferring type annotation to a separate pass (which is simple, because it can use the Binds.scala infrastructure). One consequence is that we can't assign types to nodes like Ref during parsing, which means we can't ask for the type of any node during parsing, and by extension we can't ask for types of children in IR node constructors. Instead, all typechecking logic is moved to the TypeCheck pass.

Some benefits of this change:

  • The parser is simpler, as it doesn't have to maintain a typing environment.
  • Binds.scala is now the single source of truth on the binding structure of the IR.
  • Instead of typechecking being split in an ad-hoc way between IR constructors and the TypeCheck pass, all typechecking and type error reporting logic is in one place.
  • The parser parses a context-free grammar, no more and no less. If the input is gramatically correct, the parser succeeds.
  • We can round trip IR with type errors through the text representation. For instance, if we log an IR that fails TypeCheck, we can copy the IR from the log, parse it, then debug.

This change was motivated by my work in progress to convert the parser to use the SSA grammar, which this should greatly simplify.

I chose to make the type annotation pass after parsing mutate the IR in place (with the unfortunate exception of Apply, which can change into an ApplyIR or ApplySpecial. Do these really need to be separate nodes?). The type of a Ref node was already mutable to allow this sort of deferred annotation, and I've had to make a few other things mutable as well. Alternatively we could rebuild the entire IR to include type annotations, but I think the mutation is sufficiently well localized, and this is a common compiler pattern.

This touches a lot of lines, but the high level changes are:

  • Remove the refMap from the IRParserEnvironment, and remove all code that modifies the typing environment from the parser. Nodes like Ref that need a type from the environment get types set to null, to be filled in after parsing.
  • Add annotateTypes pass to fill in type annotations from the environment. This is currently written in the parser, and always called after parsing. This means for the moment we can only parse type correct IR. But in the future we could move this to a separate stage of the compilation pipeline.
  • Move typechecking logic on relational nodes from the constructors to a typecheck method, which is called from the TypeCheck pass.
  • Make the typ field on IR nodes consistently lazy (or a def when the type is a child's type without modification). Before we sometimes did this for performance, but it's now required to avoid querying children's types during construction.
  • Make types mutable on AggSignature and ComparisonOp, so they can be filled in after parsing.
  • Ensure that the structure in Binds.scala satisfies the following invariant: to construct the typing environment of child i, we only need the types of children with index less than i. This was almost always satisfied already, and allows us to use the generic binds infrastucture in the pass to annotate types (where when visiting child i, we can only query types of already visited children).
  • Change the text representation of TailLoop/Recur to move the explicit type annotation from Recur to TailLoop. This is necessary to comply with the previous point. It's also consistent with Ref, where types of references are inferred from the environment.
  • Add explicit types in the text representation of TableFilterIntervals and MatrixFilterIntervals, where the types were needed during parsing and we can no longer get them from child types.
  • Fix IR examples used in parser tests to be type correct.
  • Add an explicit return type to the Apply node. Before the parser parsed an Apply node to one of Apply/ApplySpecial/ApplyIR; now it always parses to Apply, and the type annotation pass converts to the appropriate specialization, which needs the parsed return type.

@patrick-schultz patrick-schultz force-pushed the parser-no-env branch 4 times, most recently from 6342978 to 69d25d8 Compare November 9, 2023 21:43
@patrick-schultz patrick-schultz force-pushed the parser-no-env branch 2 times, most recently from 8f644a4 to 4112879 Compare November 13, 2023 14:46
Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has significantly improved the simplicity of the parser, so much so that much of the logic therein could be simplified further, though I think that's beyond the scope of this change.

I like the separation of type-checking and parsing, however I'd prefer in your implementations of typecheck that you assert one thing at a time. That way when things fail, it'll be clear which assertion was fired (ie if (a && b && c) fails, you don't know if it's a or b or c, whereas

assert(a)
assert(b)
assert(c)

would give you that information.

case class NEQ(t1: Type, t2: Type) extends ComparisonOp[Boolean] { val op: CodeOrdering.Op = CodeOrdering.Neq() }
case class NEQ(t1: Type, t2: Type) extends ComparisonOp[Boolean] {
val op: CodeOrdering.Op = CodeOrdering.Neq()
override def copy(t1: Type = t1, t2: Type = t2): NEQ = NEQ(t1, t2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these copy methods not automatically defined by nature of being a case class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that I want a virtual copy method on ComparisonOp. I tried, but the compiler generated copy methods won't override the virtual one.

Comment on lines +18 to 32
case "==" | "EQ" =>
EQ(null, null)
case "!=" | "NEQ" =>
NEQ(null, null)
case ">=" | "GTEQ" =>
GTEQ(null, null)
case "<=" | "LTEQ" =>
LTEQ(null, null)
case ">" | "GT" =>
GT(null, null)
case "<" | "LT" =>
LT(null, null)
case "Compare" =>
Compare(null, null)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ComparisonOps have always seemed a little funny to me. I'm not sure what value they add over lifting the operand types into ApplyComparisonOp and just using the CodeOrdering.Op, ie:

case class ApplyComparisonOp(op: CodeOrdering.Op, aTy: Type, bTy: Type, a: IR, b: IT)

I also wonder what's the point of ApplyComparisonOp over ApplyBinaryOp (or indeed, Apply).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair questions, but I think outside the scope of this PR. For the first question, I think it's mostly about ease of pattern matching; if an optimization wants to apply only for <, you don't want to muck around at the CodeOrdering level.

Similar for the second question, it's just an issue of what information is encoded in the IR, and how easy it is for optimizations to extract that information.

@@ -319,8 +324,7 @@ final case class StreamLen(a: IR) extends IR
final case class StreamGrouped(a: IR, groupSize: IR) extends IR
final case class StreamGroupByKey(a: IR, key: IndexedSeq[String], missingEqual: Boolean) extends IR

final case class StreamMap(a: IR, name: String, body: IR) extends IR {
override def typ: TStream = tcoerce[TStream](super.typ)
final case class StreamMap(a: IR, name: String, body: IR) extends TypedIR[TStream] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the use of TypedIR!

Comment on lines -131 to -194
refMap: BindingEnv[Type] = BindingEnv.empty[Type],
irMap: Map[Int, BaseIR] = Map.empty,
) {

def promoteAgg: IRParserEnvironment = copy(refMap = refMap.promoteAgg)

def promoteScan: IRParserEnvironment = copy(refMap = refMap.promoteScan)

def promoteAggScan(isScan: Boolean): IRParserEnvironment =
if (isScan) promoteScan else promoteAgg

def noAgg: IRParserEnvironment = copy(refMap = refMap.noAgg)

def noScan: IRParserEnvironment = copy(refMap = refMap.noScan)

def noAggScan(isScan: Boolean): IRParserEnvironment =
if (isScan) noScan else noAgg

def createAgg: IRParserEnvironment = copy(refMap = refMap.createAgg)

def createScan: IRParserEnvironment = copy(refMap = refMap.createScan)

def onlyRelational: IRParserEnvironment = {
if (refMap.eval.isEmpty && refMap.agg.isEmpty && refMap.scan.isEmpty)
this
else
copy(refMap = refMap.onlyRelational)
}

def empty: IRParserEnvironment = copy(refMap = BindingEnv.empty)

def bindEval(name: String, t: Type): IRParserEnvironment =
copy(refMap = refMap.bindEval(name, t))

def bindEval(bindings: (String, Type)*): IRParserEnvironment =
copy(refMap = refMap.bindEval(bindings: _*))

def bindEval(bindings: Env[Type]): IRParserEnvironment =
copy(refMap = refMap.bindEval(bindings.m.toSeq: _*))

def bindAggScan(isScan: Boolean, bindings: (String, Type)*): IRParserEnvironment =
copy(refMap = if (isScan) refMap.bindScan(bindings: _*) else refMap.bindAgg(bindings: _*))

def bindAgg(name: String, t: Type): IRParserEnvironment =
copy(refMap = refMap.bindAgg(name, t))

def bindAgg(bindings: (String, Type)*): IRParserEnvironment =
copy(refMap = refMap.bindAgg(bindings: _*))

def bindAgg(bindings: Env[Type]): IRParserEnvironment =
copy(refMap = refMap.bindAgg(bindings.m.toSeq: _*))

def bindScan(name: String, t: Type): IRParserEnvironment =
copy(refMap = refMap.bindScan(name, t))

def bindScan(bindings: (String, Type)*): IRParserEnvironment =
copy(refMap = refMap.bindScan(bindings: _*))

def bindScan(bindings: Env[Type]): IRParserEnvironment =
copy(refMap = refMap.bindScan(bindings.m.toSeq: _*))

def bindRelational(name: String, t: Type): IRParserEnvironment =
copy(refMap = refMap.bindRelational(name, t))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice.

Comment on lines 1367 to 1389
val errorID = int32_literal(it)
val function = identifier(it)
val typeArgs = type_exprs(it)
val rt = type_expr(it)
ir_value_children(env)(it).map { args =>
invoke(function, rt, typeArgs, errorID, args: _*)
ApplyIR(function, rt, typeArgs, args, errorID)
}
case "ApplySpecial" =>
val errorID = int32_literal(it)
val function = identifier(it)
val typeArgs = type_exprs(it)
val rt = type_expr(it)
ir_value_children(env)(it).map { args =>
ApplySpecial(function, typeArgs, args, rt, errorID)
}
case "Apply" =>
val errorID = int32_literal(it)
val function = identifier(it)
val typeArgs = type_exprs(it)
val rt = type_expr(it)
ir_value_children(env)(it).map { args =>
Apply(function, typeArgs, args, rt, errorID)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little sad. Can you define a function that takes the data constructor to clean all this up? ie

def applyLikeParser[Ap <: IR](cons: (String, Type, IndexedSeq[Type], IndexedSeq[IR], Int) => Ap): Parser[Ap] =
  /* The de-duplicated implementation */

val parseApply: Parser[Apply] = 
  applyLikeParser(Apply)
val parseApplySpecial: Parser[ApplySpecial] = 
  applyLikeParser(ApplySpecial)
...

case "Apply" =>
  parseApply(env)(it)
case "ApplySpecial" =>
  parseApplySpecial(env)(it)
...

@patrick-schultz
Copy link
Collaborator Author

This has significantly improved the simplicity of the parser, so much so that much of the logic therein could be simplified further, though I think that's beyond the scope of this change.

Agreed. But as my follow up will be a complete rewrite of the parser, I definitely don't want to do more to simplify the current one.

I like the separation of type-checking and parsing, however I'd prefer in your implementations of typecheck that you assert one thing at a time. That way when things fail, it'll be clear which assertion was fired (ie if (a && b && c) fails, you don't know if it's a or b or c, whereas

assert(a)
assert(b)
assert(c)

would give you that information.

Good suggestion. I only moved assertions here, didn't add any new ones, but I don't mind splitting up some that I moved.

@danking danking merged commit 9dc3495 into hail-is:main Nov 15, 2023
8 checks passed
@patrick-schultz patrick-schultz deleted the parser-no-env branch November 15, 2023 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants