Merge lexer from the toolchain repository. (#213)
The only change here is to update the fuzzer build extension path.

The main original commit message:

> Add an initial lexer. (#17)
>
> The specific logic here hasn't been updated to track the most recently
> discussed changes, and it doesn't yet implement many aspects of things
> like Unicode support.
>
> However, this should lay out a reasonable framework and set of APIs.
> It gives an idea of the overall lexer architecture being proposed. The
> actual lexing algorithm is a relatively boring and naive hand-written
> loop. It may make sense to replace this with something generated or
> another more advanced approach in the future; getting the implementation
> right was not the primary goal here. Instead, the focus was entirely
> on the architecture, encapsulation, APIs, and the testing
> infrastructure.
>
> The architecture of the lexer differs from "classical" high-performance
> lexers in compilers. A high-level summary:
>
> -   It is eager rather than lazy, lexing an entire file.
> -   Tokens intrinsically know their source location.
> -   Grouping lexical symbols are tracked within the lexer.
> -   Indentation is tracked within the lexer.
>
> Tracking of grouping and indentation is intended to simplify the
> strategies used for recovery from mismatched grouping tokens, and
> eventually to let that recovery use indentation.
>
> Folding source location into the token itself simplifies the data
> structures significantly, and doesn't lose any fidelity due to the
> absence of a preprocessor with token pasting.
>
> The fact that this is an eager lexer instead of a lazy lexer is
> designed to simplify the implementation and testing of the lexer (and
> subsequent components). There is no reason to expect Carbon to lex so
> many tokens that there are significant locality advantages of lazy
> lexing. Moreover, if we want comparable performance benefits, I think
> pipelining is a much more promising architecture than laziness. For
> now, the simplicity is a huge win.
>
> Being eager also makes it easy for us to use extremely dense memory
> encodings for the information about lexed tokens. Everything is
> created in a dense array, and small indices are used to identify each
> token within the array.
>
> There is a fuzzer included here that we have run extensively over the
> code, but currently toolchain bugs and Bazel limitations prevent it
> from easily building. I'm hoping that I or someone else can push on
> this soon and enable the fuzzer to at least build, if not run fuzz
> tests automatically. We have a significant fuzzing corpus that I'll
> add in a subsequent commit as well.
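
To make the architecture described above more concrete, here is a minimal sketch of an eager lexer that stores every token in one dense array, identifies tokens by small indices, keeps the source location with each token, and matches grouping symbols while lexing. Every name in it (`SketchTokenizedBuffer`, `SketchToken`, `SketchTokenKind`, and their members) is hypothetical and chosen for illustration; it is not the API added by this commit, and it only handles parentheses and identifier-like runs.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// All names here are hypothetical; this is not the API added by this commit.
enum class SketchTokenKind : uint8_t { Identifier, OpenParen, CloseParen, Error };

// A token handle is only a small index into the buffer's dense array.
struct SketchToken {
  int32_t index;
};

class SketchTokenizedBuffer {
 public:
  // Eagerly lexes the entire source text before returning.
  static SketchTokenizedBuffer Lex(std::string_view source) {
    SketchTokenizedBuffer buffer;
    std::vector<int32_t> open_groups;  // Indices of still-unmatched '(' tokens.
    int32_t line = 1;
    int32_t column = 1;
    size_t i = 0;
    while (i < source.size()) {
      char c = source[i];
      if (c == '\n') {
        ++line;
        column = 1;
        ++i;
        continue;
      }
      if (c == ' ' || c == '\t') {
        ++column;
        ++i;
        continue;
      }
      size_t length = 1;
      // Anything not recognized below stays a one-character error token.
      Info info = {SketchTokenKind::Error, line, column, -1};
      int32_t self = static_cast<int32_t>(buffer.infos_.size());
      if (c == '(') {
        info.kind = SketchTokenKind::OpenParen;
        open_groups.push_back(self);
      } else if (c == ')' && !open_groups.empty()) {
        // Link the closing token and its opening partner to each other.
        info.kind = SketchTokenKind::CloseParen;
        info.matched = open_groups.back();
        buffer.infos_[info.matched].matched = self;
        open_groups.pop_back();
      } else if (IsIdentifierChar(c)) {
        while (i + length < source.size() &&
               IsIdentifierChar(source[i + length])) {
          ++length;
        }
        info.kind = SketchTokenKind::Identifier;
      }
      buffer.infos_.push_back(info);
      column += static_cast<int32_t>(length);
      i += length;
    }
    // Simple recovery: a '(' that never found a partner becomes an error token.
    for (int32_t open : open_groups) {
      buffer.infos_[open].kind = SketchTokenKind::Error;
    }
    return buffer;
  }

  int32_t Size() const { return static_cast<int32_t>(infos_.size()); }
  SketchTokenKind Kind(SketchToken t) const { return Get(t).kind; }
  int32_t Line(SketchToken t) const { return Get(t).line; }
  int32_t Column(SketchToken t) const { return Get(t).column; }
  // Returns the partner of a grouping token, or index -1 if there is none.
  SketchToken MatchedGroup(SketchToken t) const {
    return SketchToken{Get(t).matched};
  }

 private:
  // One densely packed record per token; the location lives in the token.
  struct Info {
    SketchTokenKind kind;
    int32_t line;
    int32_t column;
    int32_t matched;
  };

  static bool IsIdentifierChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
  }

  const Info& Get(SketchToken t) const {
    assert(t.index >= 0 && t.index < Size());
    return infos_[t.index];
  }

  std::vector<Info> infos_;
};
```

Because the whole file is lexed up front, a later stage can ask for `MatchedGroup` or the location of any token by index without re-scanning the source. Unmatched grouping tokens are already marked as errors by the time lexing finishes.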

This also includes the fuzzer whose commit message was:

> Add fuzz testing infrastructure and the lexer's fuzzer. (#21)
>
> This adds a fairly simple `cc_fuzz_test` macro that is specialized for
> working with LLVM's LibFuzzer. In addition to building the fuzzer
> binary with the toolchain's `fuzzer` feature, it also sets up the test
> execution to pass the corpus as file arguments, which is a simple
> mechanism to enable regression testing against the fuzz corpus.
>
> I've included an initial fuzzer corpus as well. To run the fuzzer in
> an open-ended fashion, and build up a larger corpus:
> ```shell
> mkdir /tmp/new_corpus
> cp lexer/fuzzer_corpus/* /tmp/new_corpus
> ./bazel-bin/lexer/tokenized_buffer_fuzzer /tmp/new_corpus
> ```
>
> You can parallelize the fuzzer by adding `-jobs=N`, which runs N fuzzing
> jobs in separate worker processes. For more details about running
> fuzzers, see the documentation:
> http://llvm.org/docs/LibFuzzer.html
>
> To minimize and merge any interesting new inputs:
> ```shell
> ./bazel-bin/lexer/tokenized_buffer_fuzzer -merge=1 \
>     lexer/fuzzer_corpus /tmp/new_corpus
> ```
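
For readers unfamiliar with LibFuzzer, a fuzz target is a single `LLVMFuzzerTestOneInput` entry point that the fuzzer calls repeatedly with generated bytes; passing corpus files as arguments replays each file through the same entry point. The sketch below shows the shape of such a target. `SketchCountTokens` is a hypothetical stand-in for the code under test, not the lexer from this commit, and the invariant it checks is only an illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the code under test: counts runs of
// non-whitespace characters. The real fuzzer exercises the lexer instead.
inline size_t SketchCountTokens(std::string_view source) {
  size_t count = 0;
  bool in_token = false;
  for (char c : source) {
    bool is_space = std::isspace(static_cast<unsigned char>(c)) != 0;
    if (!is_space && !in_token) {
      ++count;
    }
    in_token = !is_space;
  }
  return count;
}

// LibFuzzer's entry point. It must tolerate arbitrary bytes, including
// empty input, and report problems by crashing rather than by return value.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  std::string_view source(reinterpret_cast<const char*>(data), size);
  size_t tokens = SketchCountTokens(source);
  // Illustrative invariant: there can never be more tokens than bytes.
  // Note that assert is compiled out when NDEBUG is defined.
  assert(tokens <= size);
  (void)tokens;
  return 0;
}
```

The target holds no state between calls, which is what lets the same binary serve both open-ended fuzzing and regression runs over a fixed corpus.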

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
chandlerc and jonmeow authored Dec 8, 2020
1 parent b72294a commit 3995fc2
Showing 512 changed files with 2,348 additions and 0 deletions.
75 changes: 75 additions & 0 deletions lexer/BUILD
@@ -0,0 +1,75 @@
# Part of the Carbon Language project, under the Apache License v2.0 with LLVM
# Exceptions. See /LICENSE for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

load("@rules_cc//cc:defs.bzl", "cc_library", "cc_test")
load("//bazel/fuzzing:rules.bzl", "cc_fuzz_test")

package(default_visibility = ["//visibility:public"])

cc_library(
    name = "token_kind",
    srcs = ["token_kind.cpp"],
    hdrs = ["token_kind.h"],
    textual_hdrs = ["token_registry.def"],
    deps = ["@llvm-project//llvm:Support"],
)

cc_test(
    name = "token_kind_test",
    srcs = ["token_kind_test.cpp"],
    deps = [
        ":token_kind",
        "@llvm-project//llvm:Support",
        "@llvm-project//llvm:gtest",
        "@llvm-project//llvm:gtest_main",
    ],
)

cc_library(
    name = "tokenized_buffer",
    srcs = ["tokenized_buffer.cpp"],
    hdrs = ["tokenized_buffer.h"],
    deps = [
        ":token_kind",
        "//diagnostics:diagnostic_emitter",
        "//source:source_buffer",
        "@llvm-project//llvm:Support",
    ],
)

cc_library(
    name = "tokenized_buffer_test_helpers",
    testonly = 1,
    hdrs = ["tokenized_buffer_test_helpers.h"],
    deps = [
        ":tokenized_buffer",
        "@llvm-project//llvm:Support",
        "@llvm-project//llvm:gmock",
    ],
)

cc_test(
    name = "tokenized_buffer_test",
    srcs = ["tokenized_buffer_test.cpp"],
    deps = [
        ":tokenized_buffer",
        ":tokenized_buffer_test_helpers",
        "//diagnostics:diagnostic_emitter",
        "@llvm-project//llvm:Support",
        "@llvm-project//llvm:gmock",
        "@llvm-project//llvm:gtest",
        "@llvm-project//llvm:gtest_main",
    ],
)

cc_fuzz_test(
    name = "tokenized_buffer_fuzzer",
    srcs = ["tokenized_buffer_fuzzer.cpp"],
    corpus = glob(["fuzzer_corpus/*"]),
    deps = [
        ":tokenized_buffer",
        "//diagnostics:diagnostic_emitter",
        "@llvm-project//llvm:Support",
    ],
)
Binary file not shown.