The Basics

The log file and the object model

A SARIF log is a JSON file.[1] The SARIF spec defines an object model to describe the contents of this file, and the top-level object (the object that represents the log file as a whole) is the sarifLog object.

To work with the contents of a log file in your program, you need a set of classes that correspond to the elements of the SARIF object model. The SARIF spec doesn't standardize the binding between its object model and any particular programming language. Today, there are bindings for .NET (in the SARIF SDK NuGet package) and Python (in the sarif-om Python module). Each language binding conforms to normal language conventions. For example, the sarifLog SARIF object is represented by the C# class SarifLog in the SARIF SDK, and by the Python class sarif_log in the sarif-om Python module.

When we discuss the SARIF format, we usually speak in terms of its object model; for example, we might talk about the required properties of the result object, or about various places where message objects are used. So if we tell you that a particular property in the JSON file is a message object, you know what its sub-structure looks like without having to say anything more.

Logs and runs

The sarifLog object has a required version property that must be "2.1.0". version should come first (even though JSON is insensitive to property order) so SARIF consumers such as viewers can "sniff" the version.

The optional $schema property contains the URI of a copy of the SARIF schema. This allows development environments like VS Code to provide schema validation (for example, squiggles under misspelled property names) and Intellisense.

{
  "version": "2.1.0",
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.4.json",
  ...
}

A log file contains an array of one or more runs.[2] Each run represents a single invocation of a single analysis tool, and the run has to describe the tool that produced it.

{
  "version": "2.1.0",
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.4.json",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner"
        }
      },
      ...
    }
  ]
}
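As a rough illustration of the structure above, here is a minimal Python sketch that builds the same skeleton as a dictionary and serializes it. The function name make_minimal_log is made up for this example, and the empty results array is an assumption meaning "the tool ran and reported nothing". Python dictionaries preserve insertion order, so version is written first.

```python
import json

SCHEMA_URI = "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.4.json"

def make_minimal_log(tool_name):
    """Return a minimal SARIF 2.1.0 log: one run, one tool driver, no results."""
    # Insertion order is preserved, so "version" is serialized first.
    return {
        "version": "2.1.0",
        "$schema": SCHEMA_URI,
        "runs": [
            {
                "tool": {"driver": {"name": tool_name}},
                "results": [],  # assumption: the run produced no results
            }
        ],
    }

log = make_minimal_log("CodeScanner")
text = json.dumps(log, indent=2)
print(text)
```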

Usually there is only one run. SARIF allows multiple runs for convenience, so that, for example, you can send the runs over a network in a single request.[3]

Tools: driver and extensions

The tool property is required. It describes the analysis tool that produced the run. The sub-property tool.driver is also required. It describes the tool's driver, which is the tool component that contains the tool's primary executable.

Some tools support additional components called extensions (a.k.a. "plugins"), for example, code libraries that define additional analysis rules. SARIF defines the optional tool.extensions property to represent extensions.[4]

Finally, tool.driver.name is required. Everything else under tool and tool.driver is optional.

Property bags

Before we go any further, let's address an issue that almost every tool vendor cares about: What do I do if my tool produces information that the SARIF specification doesn't mention?

The answer is that every object in the SARIF object model — from logs to runs to results to locations to messages, without exception — defines a property named properties. The spec calls a property named properties a property bag.

A property bag is a set of name/value pairs — a JSON object, in SARIF's JSON serialization — with any name and any value. The values can be integers, Booleans, arrays, nested objects — anything at all. If you can't find an element of your tool's data in the SARIF specification, the property bag is your friend.

For example, the Fortify tool from MicroFocus assesses the "confidence" of each result — the likelihood that the result is a "true positive." It looks like this:

{
  "version": "2.1.0",
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.4.json",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner"
        }
      },
      "results": [
        {
          ...
          "properties": {
            "Confidence": 5.0
          }
        }
      ]
    }
  ]
}

Having said all that, it's important that you do your best to use the properties that SARIF defines (we call them first class properties) rather than using the property bag. Generic SARIF tooling — tooling that is not aware of the details of any particular tool — will at best be able to display property bag properties. It won't be able to extract any meaning from them.

There's a balancing act here, because it's also problematic to populate SARIF's first class properties with information that doesn't match their semantics. Each tool vendor needs to make the call on a property by property basis.
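For a consumer, reading a property bag value could look like the following sketch. The helper name get_bag_value is hypothetical, and the "Severity" key is only there to show the fallback; neither comes from the spec.

```python
def get_bag_value(sarif_object, name, default=None):
    """Look up `name` in a SARIF object's property bag, if the object has one."""
    return sarif_object.get("properties", {}).get(name, default)

result = {
    "message": {"text": "Example result."},
    "properties": {"Confidence": 5.0},
}

print(get_bag_value(result, "Confidence"))       # prints 5.0
print(get_bag_value(result, "Severity", "n/a"))  # prints n/a
```

Because generic tooling cannot know what "Confidence" means, a helper like this is only useful in code that is written for one specific tool.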

Results

The primary purpose of a run is to hold a set of results. A result is an observation about the code. For most tools, the results represent issues — conditions that might detract from the quality of the code — but some results might be purely informational.

In this example, the tool produced one result:

{
  "version": "2.1.0",
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.4.json",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "ESLint"
        }
      },
      "results": [
        {
          "ruleId": "no-unused-vars",
          "level": "error",
          "message": {
            "text": "'x' is assigned a value but never used."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "file:///C:/dev/sarif/sarif-tutorials/samples/Introduction/simple-example.js",
                  "index": 0
                },
                "region": {
                  "startLine": 1,
                  "startColumn": 5
                }
              }
            }
          ]
        }
      ]
    }
  ]
}

The most commonly used properties of a result are:

  • A message describing the violation.
  • An identifier for the rule that was violated.
  • The severity of the violation.
  • The location of the violation.

There are many other properties used in advanced scenarios, which we'll cover in future tutorials.

When you open a SARIF file in a typical SARIF viewer, the viewer will display the list of results. For example, the Microsoft SARIF Viewer VSIX for Visual Studio displays the results in Visual Studio's Error List window. If physical location information is available, then when the user selects one of the results (in Visual Studio, by double-clicking it), the viewer will navigate to the result's location by opening the file specified by physicalLocation.artifactLocation.uri (simple-example.js in the example above). The viewer will typically scroll the portion of the file specified by physicalLocation.region (line 1 in the example) into view, and highlight it.
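To make the viewer behavior concrete, here is a small Python sketch that lists results roughly the way an error list would. It is a simplification under stated assumptions: it looks only at the first location, it treats a missing level as "warning", and the relative URI in the sample data is invented for the example.

```python
def summarize(log):
    """Yield one display line per result, across all runs in the log."""
    for run in log.get("runs") or []:
        tool_name = run["tool"]["driver"]["name"]
        for result in run.get("results") or []:
            where = "(no location)"
            locations = result.get("locations") or []
            if locations:
                physical = locations[0].get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri", "?")
                line = physical.get("region", {}).get("startLine")
                where = uri if line is None else f"{uri}:{line}"
            # Simplification: a missing level is reported as "warning".
            level = result.get("level", "warning")
            rule_id = result.get("ruleId", "-")
            text = result["message"].get("text", "")
            yield f"[{tool_name}] {level} {rule_id} {where}: {text}"

sample_log = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "ESLint"}},
        "results": [{
            "ruleId": "no-unused-vars",
            "level": "error",
            "message": {"text": "'x' is assigned a value but never used."},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "src/simple-example.js"},
                    "region": {"startLine": 1, "startColumn": 5},
                }
            }],
        }],
    }],
}

entries = list(summarize(sample_log))
for entry in entries:
    print(entry)
```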

Message

The only required property of a result object is the message property.

SARIF messages are more than just strings: they are represented by a separate object, the message object. In the simplest case, which we'll show here, a message object contains a simple text string, like this:

{
  "message": {
    "text": "'x' is assigned a value but never used."
  }
}

We'll say much more about the features and capabilities of messages later. Just as important as these technical issues is the quality of the message text. The Appendix Authoring rule metadata and result messages provides guidance on authoring informative and actionable result messages.

Rule identifier

Most tools provide a "code" for each rule, for example CA1304, the Roslyn analyzer code for the rule "Specify CultureInfo". The SARIF property result.ruleId [5] holds this code.

{
  "ruleId": "CA1304"
}

The spec explains why result.ruleId should be a "stable, opaque" identifier. It shouldn't change over time so that build scripts that disable a particular rule never break. It should be opaque (as opposed to being a human-readable string) for ease of web lookup and to avoid language difficulties.

Not all tools provide such an identifier. For example, ESLint uses human-readable rule identifiers such as "no-unused-vars". These tools should do the best they can to populate result.ruleId.

Level

The result.level property says how serious the result is. It usually has one of three values:

  • "error": A serious problem
  • "warning": A problem
  • "note": A minor problem or an opportunity to improve the code

{
  "ruleId": "CA1304",
  "level": "warning"
}

There are advanced scenarios where result.level has the value "none", but we won't discuss them here.[6]

In the simplest case, result.level defaults to "warning". But the complete algorithm for determining the default is complicated because it takes into account certain advanced scenarios.[7] For that reason, you are better off, in my opinion, specifying it explicitly in each result, as in the example above.
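A consumer that only needs the simplest case could tally results by level as in the sketch below. The names effective_level and count_by_level are hypothetical, and the "warning" fallback deliberately ignores the full default algorithm from the spec. Level values are case-sensitive, so the sketch rejects anything outside the four lowercase values.

```python
from collections import Counter

VALID_LEVELS = ("none", "note", "warning", "error")

def effective_level(result):
    """Return the result's level, using "warning" when none is given.

    Simplification: the spec's complete default algorithm is not applied.
    """
    level = result.get("level", "warning")
    if level not in VALID_LEVELS:
        raise ValueError(f"unexpected level: {level!r}")
    return level

def count_by_level(results):
    """Count how many results fall under each level."""
    return Counter(effective_level(r) for r in results)

results = [
    {"message": {"text": "a"}, "level": "error"},
    {"message": {"text": "b"}},
    {"message": {"text": "c"}, "level": "note"},
    {"message": {"text": "d"}},
]
counts = count_by_level(results)
print(dict(counts))
```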

Locations

The locations array

We call a place in the code where a tool detects a result a result location. SARIF represents result locations with the optional property result.locations, an array of location objects which almost always contains exactly one element:

{
  "ruleId": "CA1304",
  "level": "warning",
  "locations": [
    {
      ...
    }
  ]
}

result.locations is optional because a location doesn't always make sense. For example, if a tool tells you that your C# program doesn't have a Main entry point, what location should it mention?

result.locations is an array because sometimes you have to make changes in more than one place to fix a problem. Suppose a style checker tells you that your C# class name doesn't start with a capital letter, and suppose your class is made up of a set of partial classes. You can't change the name in just one place: your code won't compile. You have to change every occurrence, and that's why the result points you at all the occurrences. (Of course IDEs can help you with this.)

Don't use result.locations to specify the locations of multiple problems, even problems of the same kind, if they can be fixed independently. You might choose to fix some occurrences of the problem and not others, for example, if you know that some of the occurrences are in code that is slated for removal, or are false positives. Only put more than one element in result.locations if you have to fix all the locations at once.

Physical and logical locations

SARIF supports two kinds of locations: physical and logical. A physical location describes a location with respect to some programming artifact, for example, a range of lines in a source file or a byte range in an executable file. A logical location describes a location by name, without reference to a programming artifact, for example the name of a method within a class within a namespace.

SARIF supports logical locations for two reasons:

  1. Most important, binary analysis tools don't always have physical location information available (although they might, for example, if they have access to a symbols file). These tools have no choice but to report (for example) a method name; they can't tell you what file the method was defined in.

  2. Even if physical location is available, it can be helpful for a tool to tell you that the problem on line 42 is inside function f().

The most common case is for a tool to report a physical location (rather than a logical location), and to specify the location by line and column number (rather than as a byte range):

{
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "io/kb.c"
        },
        "region": {
          "startLine": 42,
          "startColumn": 9
        }
      }
    }
  ]
}

A physicalLocation object almost always contains an artifactLocation property [8], and it can also contain a region property.

A simple logical location looks like this:

{
  "locations": [
    {
      "logicalLocations": [
        {
          "fullyQualifiedName": "NamespaceN.ClassC.MethodM"
        }
      ]
    }
  ]
}

Note that location.logicalLocations is an array. As usual, this is to support an advanced scenario.[9]
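The two kinds of location can be turned into display text with a sketch like the following. The function name describe_location and the output format are choices made for this example, not anything the spec prescribes. It handles a location object that carries a physical location, logical locations, or both.

```python
def describe_location(location):
    """Build a short human-readable description of a SARIF location object."""
    parts = []
    physical = location.get("physicalLocation")
    if physical:
        text = physical.get("artifactLocation", {}).get("uri", "<unknown artifact>")
        region = physical.get("region", {})
        if "startLine" in region:
            text += f":{region['startLine']}"
            if "startColumn" in region:
                text += f":{region['startColumn']}"
        parts.append(text)
    # logicalLocations is an array, so report every named entry.
    for logical in location.get("logicalLocations", []):
        name = logical.get("fullyQualifiedName") or logical.get("name")
        if name:
            parts.append(f"in {name}")
    return " ".join(parts) or "<no location>"

physical_only = {
    "physicalLocation": {
        "artifactLocation": {"uri": "io/kb.c"},
        "region": {"startLine": 42, "startColumn": 9},
    }
}
logical_only = {
    "logicalLocations": [{"fullyQualifiedName": "NamespaceN.ClassC.MethodM"}]
}

print(describe_location(physical_only))  # prints io/kb.c:42:9
print(describe_location(logical_only))   # prints in NamespaceN.ClassC.MethodM
```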

Artifacts

An artifact is anything you create in the course of programming, such as a source file, an object file, or a web page. In SARIF, every artifact must be URL-addressable. That means, for example, that if you want to write a static database analyzer that produces SARIF, then you need a way to express the locations of the things that you analyze — tables, rows, indices, and so on — as URLs.

The SARIF spec uses the term "artifact" in preference to "file" to emphasize that SARIF doesn't just support tools that analyze files. Having said that, in these tutorials we'll occasionally lapse and use the word "file" because it's just too awkward to say things like "open the artifact" when "open the file" is so much more natural.

Defining artifacts

As we said earlier, almost every result specifies a location, and those locations are often physical locations which in turn contain artifactLocation objects:

{
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "io/kb.c"
        },
        ...
      }
    }
  ]
}

At this point, all you know about the artifact is its location. But SARIF lets you provide more information about each artifact by using the run.artifacts property, whose value is an array of artifact objects:

{
  "runs": [
    {
      "artifacts": [
        {
          "location": {
            "uri": "io/kb.c"
          },
          "length": 3444,
          "sourceLanguage": "c",
          "hashes": {
            "sha-256": "b13ce2678a8807ba0765ab94a0ecd394f869bc81"
          }
        }
      ]
    }
  ]
}

Note that:

  • length is measured in bytes.
  • The SARIF spec suggests values for the sourceLanguage property for many programming languages.[10]
  • hashes can contain any number of hashes calculated by different algorithms. The spec recommends [11] using the algorithms and algorithm names contained in the IANA registry of hash function names, and it particularly recommends that the hashes object contain a property "sha-256".
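A producer could fill in these properties from the artifact's bytes, as in this sketch. The helper name make_artifact and the sample file contents are invented for illustration. Note that a SHA-256 digest written in hexadecimal is 64 characters long.

```python
import hashlib

def make_artifact(uri, data, source_language=None):
    """Build a SARIF artifact object from an artifact's raw bytes."""
    artifact = {
        "location": {"uri": uri},
        "length": len(data),  # length is measured in bytes
        "hashes": {"sha-256": hashlib.sha256(data).hexdigest()},
    }
    if source_language:
        artifact["sourceLanguage"] = source_language
    return artifact

data = b"int main(void) { return 0; }\n"  # hypothetical file contents
artifact = make_artifact("io/kb.c", data, "c")
print(artifact["length"], len(artifact["hashes"]["sha-256"]))
```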

There are more properties to explore. In particular, you can embed the entire contents of each artifact in the SARIF file, which allows people to view the results in context even if they're not enlisted in the code base that was analyzed:

{
  "runs": [
    {
      "artifacts": [
        {
          "location": {
            "uri": "io/kb.c"
          },
          "contents": {
            "text": "#include <stdio>\n#include<stack>..."
          }
        }
      ]
    }
  ]
}

Of course that requires tooling that can extract the source file contents from the log file and display them.

Linking results to artifacts

Now that we have a result that occurs in an artifact, and we have more information about the artifact, how can we link the result to the artifact? For example, if a user is examining a result in a SARIF viewer, how can the viewer present information about the file where the result occurred?

It would be natural to think that you could just find the element of run.artifacts whose location.uri matches the result's physicalLocation.artifactLocation.uri. In fact, early drafts of the SARIF spec did just that: run.artifacts was a JSON object whose property names were the URIs.

The problem is that in obscure cases, two distinct artifacts can have the same URI. So SARIF establishes the link from a result to an artifact by using the artifactLocation.index property, like this:

{
  "runs": [
    {
      "artifacts": [
        {
          "location": {
            "uri": "io/kb.c"
          },
          "contents": {
            "text": "#include <stdio>\n#include<stack>..."
          }
        }
      ],
      "results": [
        {
          "message": {
            "text": "Variable 'x' is used before being initialized."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "io/kb.c",
                  "index": 0
                }
              }
            }
          ]
        }
      ]
    }
  ]
}

The "index": 0 says "To find more information about the artifact at this location, look at the artifact object at index 0 in the array run.artifacts."

When artifactLocation.index is present, artifactLocation.uri is redundant, because you can find it in the linked artifact object. A tool can choose to omit uri to make the log file smaller, or to include it to make the file more understandable to a human reader.[12]

There are many places in SARIF where a property named index (or sometimes a more specific name, like ruleIndex) establishes a link from a SARIF object to another object that resides in an array. For each such property, the spec explains which array to look in.
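Following the index link is a plain array lookup. The sketch below resolves a result's artifact and, if the artifact embeds its contents, returns the source line the result points at. The function names and the sample run are hypothetical, and the code looks only at the first location of the result.

```python
def resolve_artifact(run, artifact_location):
    """Return the artifact object that artifactLocation.index points at, or None."""
    artifacts = run.get("artifacts") or []
    index = artifact_location.get("index", -1)  # treat a missing index as "no link"
    if 0 <= index < len(artifacts):
        return artifacts[index]
    return None

def source_line(run, result):
    """Return the text of the result's start line from embedded contents, if any."""
    physical = result["locations"][0]["physicalLocation"]
    artifact = resolve_artifact(run, physical["artifactLocation"])
    if artifact is None:
        return None
    text = artifact.get("contents", {}).get("text")
    line = physical.get("region", {}).get("startLine")
    if text is None or line is None:
        return None
    lines = text.splitlines()
    # startLine is 1-based.
    return lines[line - 1] if 1 <= line <= len(lines) else None

run = {
    "artifacts": [{
        "location": {"uri": "io/kb.c"},
        "contents": {"text": "int x;\nreturn x;\n"},
    }],
    "results": [{
        "message": {"text": "Variable 'x' is used before being initialized."},
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"index": 0},
                "region": {"startLine": 2},
            }
        }],
    }],
}

print(source_line(run, run["results"][0]))  # prints return x;
```

The sample result omits uri and relies on index alone, which shows why the URI is redundant once the link is present.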

Rule metadata

A SARIF log file can contain information about the analysis rules defined by the static analysis tool. The spec refers to this information as rule metadata. Rule metadata can include a complete description of the rule, its default severity level, one or more message strings (possibly including substitution sequences like {0}) to include in a result, and a URI where you can find more information about the rule.
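As a hedged sketch of how substitution sequences might be expanded: the code below assumes the SARIF properties message.id, message.arguments, and messageStrings on the rule, which this tutorial has not introduced yet, and the function name render_message is made up. It relies on Python's str.format, whose positional {0} placeholders and doubled-brace escapes line up with SARIF's syntax for simple messages; it does not handle other message features such as embedded links.

```python
def render_message(message, rule=None):
    """Return the display text for a message object.

    Uses message.text when present; otherwise looks up message.id in the
    rule's messageStrings. Placeholders like {0} are filled from
    message.arguments.
    """
    if "text" in message:
        template = message["text"]
    else:
        template = rule["messageStrings"][message["id"]]["text"]
    return template.format(*message.get("arguments", []))

rule = {
    "id": "no-unused-vars",
    "messageStrings": {
        "default": {"text": "'{0}' is assigned a value but never used."}
    },
}
message = {"id": "default", "arguments": ["x"]}

print(render_message(message, rule))
```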

If rule metadata is present, then when a user selects a result in a SARIF file, a SARIF viewer can display the metadata for the rule that was violated. Here is a screenshot that shows the Microsoft SARIF Viewer VSIX for Visual Studio displaying the SARIF file shown in the simple example from the introduction. The user has selected the result in the Error List window at the bottom. On the right, the user has selected the Info tab in the SARIF Explorer, and the viewer has displayed the help URI from the metadata for the no-unused-vars rule.

[Figure: A SARIF viewer displays rule metadata for a result]

The Appendix Authoring rule metadata and result messages provides guidance on authoring rule metadata that provides the most useful information to the developer and also works well in automated systems.

Note the presence of a ruleIndex property in the result object in the example. In the same way that result.locations[0].physicalLocation.artifactLocation.index provides a link between a result and the artifact it was found in, result.ruleIndex provides the link between a result and the metadata for the rule that was violated. The link can't be made through result.ruleId because some tools use the same identifier for distinct rules, each with their own metadata.

In the same way that artifactLocation.uri is unnecessary if artifactLocation.index is present, so result.ruleId is unnecessary if result.ruleIndex is present. The same considerations apply when deciding whether to include result.ruleId: omitting it makes the log file smaller, while including it makes the log file more understandable to a human user.

Rule metadata is optional. An analysis tool can choose not to include it at all, to include metadata for only those rules that are relevant to the results, or to include metadata for all rules known to the tool.[13]
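Looking up rule metadata through ruleIndex mirrors the artifact lookup. This sketch assumes the rules live in the tool.driver.rules array and ignores rules defined by extensions. The function name find_rule and the help URI in the sample are placeholders for illustration.

```python
def find_rule(run, result):
    """Return the rule metadata object for a result, or None if it isn't linked.

    Assumption: the rule is defined by the tool's driver, so ruleIndex is an
    index into tool.driver.rules.
    """
    rules = run["tool"]["driver"].get("rules") or []
    index = result.get("ruleIndex", -1)  # treat a missing ruleIndex as "no link"
    if 0 <= index < len(rules):
        return rules[index]
    return None

run = {
    "tool": {
        "driver": {
            "name": "ESLint",
            "rules": [{
                "id": "no-unused-vars",
                "helpUri": "https://example.com/rules/no-unused-vars",  # placeholder
            }],
        }
    },
    "results": [{
        "ruleIndex": 0,
        "message": {"text": "'x' is assigned a value but never used."},
    }],
}

rule = find_rule(run, run["results"][0])
print(rule["id"], rule["helpUri"])
```

Since metadata is optional, callers should be prepared for find_rule to return None.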

Notes

1. In future, SARIF might support other serializations of its underlying object model. See §3.1.

2. In rare cases, runs can be empty or even null. See §3.13.4: runs property for more information.

3. There's also an advanced scenario where multiple runs can share data stored at the log level, reducing the total size of the payload. See §3.13.5 inlineExternalProperties property for an example.

4. See §3.18.3, extensions property.

5. We use the notation objectName.propertyName, so result.ruleId denotes the ruleId property of the SARIF result object.

6. For more information, see §3.27.10, level property and §3.27.9, kind property.

7. Again, see §3.27.10, level property and §3.27.9, kind property.

8. physicalLocation.artifactLocation isn't required because SARIF also allows you to specify a location by its address (see §3.29.6, address property and §3.32, address object). This supports binary analysis tools. It also supports tools that examine memory contents. This isn't something a static analysis tool would do, but as I mentioned earlier, SARIF actually has some level of support for dynamic analysis tools, although the spec never makes that claim.

9. For more information, see §3.28.4, logicalLocations property.

10. See Appendix J. (Informative) Sample sourceLanguage values.

11. See §3.24.11, hashes property.

12. Rather than requiring every analysis tool to implement logic for excluding redundant properties to reduce file size, or including them to improve readability, such "file transformation" operations can be implemented by a post-processor. The Sarif.Multitool NuGet package includes a command line tool that (among other things) can post-process SARIF files, although at the time of this writing it doesn't implement the exact operation I've described here.

13. Appendix E. (Informative) Locating rule and notification metadata discusses how to decide whether to include rule metadata in a SARIF log file.
