Testing and Benchmarking

The ScalaTest library is currently used for testing.

Testing

All (unit) tests are written using ScalaTest. The tests are stored in the ./src/test/scala/ directory. Optional resources for those tests can be stored in the ./src/test/resources/ directory, or rather in a subdirectory thereof (depending on the semantic coherence of the resource data).

All tests should extend the LeoTestSuite class, which automatically brings some testing utility functionality into the current namespace. A general template for a test suite looks like the following:

import scala.io.Source.fromFile

/** Documentation of the test ... */
class MyTestSuite extends LeoTestSuite {
  // some constant definitions
  // include an external test resource by
  val resource = fromFile(getClass.getResource("relative path to resource").getPath)
  // or obtain a directory with multiple resources in it (for loading them later in the code)
  val resourceDir = getClass.getResource("relative path to directory").getPath

  // Tests begin
  /** A simple test with name "Test1" */
  test("Test1" /*, tags ... */) {
    // Test code
  }

  /** Multiple tests by looping */
  for (resourceName <- Seq("...") /* e.g. iterate over the files in resourceDir */) {
    test(s"Test of $resourceName") {
      // Test code
    }
  }
  // Tests end
}

All paths to test resources are resolved relative to ./src/test/resources/, i.e. the statement getClass.getResource("/problems/").getPath creates a reference to the directory located at ./src/test/resources/problems/.

Testing utility

All functionality from TestUtility (also located in the test package leo) is directly usable from any test suite that extends LeoTestSuite. This functionality includes:

  • The pre-defined function time measures the number of microseconds used for a particular calculation. For any value a of type A it returns a tuple (a',t) of type (A, Long), where the first element is the result of the calculation and the second element is the measured calculation time. Usage:

          class MyTestSuite extends LeoTestSuite {
            // Other stuff
            test(...) {
              // use a name other than `time` for the result to avoid shadowing the utility
              val (erg, elapsed) = time(somefunction(...))
              // erg contains the value of somefunction(...)
              // elapsed contains the computation time
            }
          }
    
  • The function getFreshSignature returns a reference to the global signature with the additional assertion that all user-defined constants have been deleted, i.e. the global signature then only contains the predefined HOL constants (such as fixed constants like "or" and "not" and defined constants like "and"). An example invocation is given in the sketch after this list.

  • The function resetTermBank clears the TermBank, i.e. the term store.

  • The function resetBlackBoard clears the formula store of the BlackBoard.

  • WARNING: The function test automatically executes all three of the above reset methods in order to guarantee a clean test environment. If this is not what you want in your test suite, you need to override this behaviour manually by adding

          override def before(fun: => Any) {}
    

to your code.

  • Additionally, TestUtility contains some further methods, e.g. for "pretty printing". Please look into the corresponding trait for more information.
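
A minimal sketch (not taken from the Leo-III sources; the method names are those documented above, the test body is made up) of how these utilities can be invoked explicitly inside a test:

class CleanEnvironmentSuite extends LeoTestSuite {
  test("explicit reset", Checked) {
    // Normally redundant: `test` already performs these resets before each test body (see the warning above).
    val sig = getFreshSignature // global signature without user-defined constants
    resetTermBank               // clear the term store
    resetBlackBoard             // clear the formula store of the BlackBoard

    // ... test code operating on the clean state ...
  }
}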

Tagging the Test

There are currently three tags for tests in Leo-III:

  • Checked
  • Ignored
  • Benchmark

Only tests tagged as "Checked" will be run when invoking mvn test; those tagged with "Ignored" are explicitly excluded from the mvn test command. The "Benchmark" tag's purpose is to indicate that a certain test does not directly test the "correctness" of a function, but rather runs a performance test of some kind. These tests tend to consume a lot of time, which is why tests tagged as "Benchmark" are also not run by default via mvn test.
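
A minimal sketch of how these tags can be attached to tests (assuming, as in the example below, that the tag objects Checked, Ignored and Benchmark are in scope via LeoTestSuite):

class TaggedTestSuite extends LeoTestSuite {
  test("fast correctness check", Checked) {
    // run on every invocation of `mvn test`
  }

  test("expensive performance measurement", Benchmark) {
    // not run by default via `mvn test`
  }

  test("temporarily disabled test", Ignored) {
    // explicitly excluded from `mvn test`
  }
}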

Example Test

package leo
package modules.parsers

import scala.util.parsing.input.CharArrayReader
import scala.io.Source._

/**
 * This suite tests parsing of the SYN000-sample problems of the TPTP library.
 * The module [[leo.modules.parsers.TPTP]], i.e., the method parseFile() is tested.
 * A test succeeds if the parser can successfully parse the input and
 * fails otherwise (i.e. if a parse error occurs).
 *
 * @author Alexander Steen
 * @since 22.04.2014
 * @note Updated 04.03.2015 -- Only test SYN000-files that cover most of the syntax features.
 */
class ParserTestSuite extends LeoTestSuite {
  val source = getClass.getResource("/problems").getPath
  val problem_suffix = ".p"

  val problems = Seq( "SYN000-1" -> "TPTP CNF basic syntax features",
    "SYN000+1" -> "TPTP FOF basic syntax features",
    "SYN000_1" -> "TPTP TF0 basic syntax features",
    "SYN000^1" -> "TPTP THF basic syntax features",
    "SYN000^2" -> "TPTP THF advanced syntax features",
    "SYN000+2" -> "TPTP FOF advanced syntax features",
    "SYN000_2" -> "TPTP TF0 advanced syntax features",
    "SYN000=2" -> "TPTP TFA with arithmetic advanced syntax features"
  )

  for (p <- problems) {
    test(p._2, Checked) {
      printHeading(s"Parsing test for ${p._2}")
      Out.output(s"## Parsing ${p._1} ...")

      val parsed = TPTP.parseFile(new CharArrayReader(fromFile(source + "/" +  p._1 + ".p").toArray))
      if (parsed.isLeft) {
        fail(s"FAILED. Cause: ${parsed.left.get}")
      } else {
        val res = parsed.right.get
        Out.output(s"Parsing succeeded. Parsed ${res.getFormulaeCount} formulae and ${res.getIncludeCount} include statements.")
      }
    }
  }
}

Benchmarking

A simple Python script, located at ./src/test/resources/processBenchmark.py, calculates statistical information about a batch of input data. If you want to use this script, the input file(s) must have the form:

    <Identifier>\t<Value>
    <Identifier>\t<Value>
    <Identifier>\t<Value>
    ...

where \t denotes a tab character, <Identifier> is some arbitrary string not containing any tab character (\t), and <Value> is some numerical value.
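
Such a file could, for example, be written from within a benchmark test. A minimal Scala sketch (the file name, identifiers and values below are made up):

    import java.io.PrintWriter

    object BenchmarkFileWriter {
      def main(args: Array[String]): Unit = {
        // hypothetical measurements: identifier -> measured value (e.g. runtime in microseconds)
        val results = Seq("Prob1" -> 123L, "Prob2" -> 234L, "Prob3" -> 432L, "Prob4" -> 534L)

        // write one "<Identifier>\t<Value>" line per measurement
        val out = new PrintWriter("benchmark-A.txt")
        try results.foreach { case (id, value) => out.println(s"$id\t$value") }
        finally out.close()
      }
    }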

The call ./src/test/resources/processBenchmark.py <file> will then calculate the minimum, maximum, sum, mean, median and standard deviation of the <Value>s inside the file. If multiple files are given, as in ./src/test/resources/processBenchmark.py <file1> <file2> ..., the script will match/group the lines of the input files according to their <Identifier> and calculate the respective statistics for each such series of values, i.e. for each identifier.

Example

With one input file

Assume file A contains

  Prob1	123
  Prob2	234
  Prob3	432
  Prob4	534

then the call ./src/test/resources/processBenchmark.py A will yield:

      Min	Max	Sum	Mean	Median	Std
      123	534	1323	330.75	333.0	161.309291425

With more than one input file

Assume file A contains

  Prob1	123
  Prob2	234
  Prob3	432
  Prob4	534

and file B contains

  Prob1	127
  Prob2	236
  Prob3	123
  Prob4	132

then the call ./src/test/resources/processBenchmark.py A B will yield:

    Key		Min	Max	Sum	Mean	Median	Std

    Prob4		132	534	666	333.0	333.0	201.0
    Prob2		234	236	470	235.0	235.0	1.0
    Prob3		123	432	555	277.5	277.5	154.5
    Prob1		123	127	250	125.0	125.0	2.0

	        Min	Max	Sum	Mean*	Median*
    Global		123	534	1941	242.62	256.25

    (*: mean of the means and median of the medians, respectively)