Skip to content

zeekat/surf-demodata

Folders and files

NameName
Last commit message
Last commit date

Latest commit

49a337e · Feb 19, 2020
Feb 19, 2020
Jan 23, 2020
Feb 12, 2020
Jan 15, 2020
Feb 19, 2020
Feb 11, 2020
Jan 15, 2020
Feb 19, 2020
Jan 29, 2020
Feb 19, 2020
Feb 3, 2020
Feb 19, 2020

Repository files navigation

SURFnet DemoData toolkit

Can generate and export relational data for demo purposes. Useful for filling, for instance, a database with dummy data the looks real enough to use for demonstrations. See OOAPI-demodata for an example of a web service which is exposes demo data generated using this toolkit.

Installation

  • Get leiningen
  • Clone this repository
  • Install locally as a jar: lein install
  • Create standalone jar for command line: lein uberjar

https://github.com/SURFnet/demo-data/workflows/Tests/badge.svg [[https://clojars.org/nl.surf/demo-data][https://img.shields.io/clojars/v/nl.surf/demo-data.svg]]

Usage

The DemoData toolkit requires two descriptors to generate data:

  • schema describing all the types, references between types, the type attributes, their dependencies, constraints and generators

    Basic shape:

    {
      "types": [
        {
          "name": "cat",
          "refs": { ... },
          "attributes": { ... },
        }, {
          "name": "person",
          ...
        }
      ]
    }
        

    For an example see also doc/cats-schema.json.

  • population map providing the number of instances to generate per type

    For an example see doc/cats-population.json.

    {
      "cat": 10,
      "person": 30,
      ...
    }
        

Generation process

First some terminology to avoid confusion: a type is a collection of attributes and refs. The types will be instantiated into entities having properties derived from the attributes and refs of that type.

The data generation process is attribute driven and goes through 3 phases:

  1. Empty entities are created according to the distribution defined by the population descriptor
  2. Refs are transformed into attributes, so a ref becomes a property of an entity
  3. Attributes are sorted in dependency order, so attributes with no deps are first. Then, for every attribute in order, the corresponding properties are generated entities.

Attributes

An attribute is described by the following properties, all optional:

  • value

    Specifying a value will cause the property of the generated entities to always have the given value.

  • generator

    A named generator as string or array. In case of an array the first element of the array is the name of the generator, the rest of the elements are arguments to the generator; ["int", 1, 10] will pick a number between 1 and 10.

  • constraints

    A list of constraints. Currently only unique is available and will cause retries when other entities in the set have already take the generated value.

  • deps

    A list of attribute this property depends on. These are full qualified names like person/name and may be paths into related (see Refs) using a nested list like ["person/cat", "cat/name"].

    If a generator is specified the associated depend values will be added to the argument list to the generator. When no generator is specified the values will be passed as is to a property when instantiating entities.

  • hidden

    Properties generated from an attribute marked as hidden will be discarded in the final output. This defaults to false.

Refs

Refs are just special attributes to define relationships between types. They are hidden by default, cause they have a shape which is almost always considered “internal” to this library and another attribute is most likely used which depends on the ref to give it a proper format.

This toolkit supports the following relations through configuration:

  • many-to-one

    From the person "refs" in the cats example:

    {
      "types": [
        ...
        {
          "name": "person",
          ...
          "refs": {
            ...
            "cat": {
              "deps": ["cat/name"]
            },
            ...
        

    Here a person is associated with a random cat using the cat’s name as a key. This will create a (hidden by default) foreign key property named "cat" for a person which can be used to make a SQL-like join. To get from a person to the cat’s favorite, add an attribute with a dep like ["person/cat", "cat/favorite"].

    From the person "attributes" in the cats example:

    {
      "types": [
        ...
        {
          "name": "person",
          ...
          "attributes": {
            ...
            "dilemma": {
              "deps": ["person/name",
                       ["person/cat", "cat/name"],
                       ["person/cat", "cat/favorite"]],
              "generator": ["format", "%s loves %s but %2$s loves %s"]
            },
            ...
        
  • one-to-one

    Works similar to many-to-one, with a flag to specify that selected values must be unique.

    {
      "types": [
        ...
        {
          "name": "person",
          ...
          "refs": {
            ...
            "cat": {
              "deps": ["cat/name"],
                "unique": true
              },
              ...
        

    This will result in a one-to-one relation provided both persons and cats have the same population count.

  • many-to-many (Warning: needs work)

    We’ll use a linking table which has an association with both side. Similar to the the fed-by "refs" in the cats example:

    {
      "types": [
        ...
        {
          "name": "person",
          ...
          "refs": {
            ...
            "pair": {
              "deps": ["cat/name", "person/name"],
              "attributes": ["cat", "person"]
            },
            ...
        

    This ref yields two attributes as named above associated to the given types respectively with the given keys. The "unique" assignment ensures unique pairs are selected to prevent getting multiple equal relations.

    In the above case the distribution of choices is random. To steer the picking of pairs to select as many different of one side as possible, it’s possible to provide a list of booleans to the unique assignment. Given the above case, having "unique": [true, false] will cause as many cats to be included as possible, the selection of persons is still random.

  • one-to-many

    If a reference is to an entity, the values can be selected via a match on the referenced attribute:

    {
      "types": [
        {
          ...
          "name": "person",
          ...
          "attributes": {
            ...
            "fed-by": {
              "deps": [[["fed-by/cat", "cat/name"], "fed-by/person", "person/name"]]
            },
            ...
        

    The cat/fed-by property will get as a value the list of zero or more person/name values. The same technique can be used to find matching many-to-one or many-to-many refs.

Generators

Here’s a list of the currently implemented generators:

  • uuid

    Generates a Universally Unique Identifier.

  • string

    Generates a random string. Useful of creating test cases, not so much for demo data.

  • int (takes 2 arguments or none)

    Generate an integer between ~MIN_VALUE~ and ~MAX_VALUE~ or between the two given values (inclusive).

  • int-cubic (takes 2 arguments or none)

    Generate a integer between the two arguments with a cubic biased toward the high value.

  • int-log (takes 2 arguments or none)

    Generate a integer between the two arguments with a logarithmic biased toward the high value.

  • increasing-int (takes arguments or none)

    Generate a monotonically increasing integer starting at the given argument. Generated entities are grouped by their deps.

  • bigdec-cubic (takes 2 arguments)

    Generate a bigdecimal between the two arguments with a cubic biased toward the high value.

  • char (takes 2 arguments or none)

    Generate a random printable character or between the two given values (inclusive).

  • one-of (takes 1 argument)

    Take a random element from the given list.

  • one-of-resource-lines (takes 1 argument)

    Take a random line from the given file or resource.

  • one-of-keyed-resource (takes 2 arguments)

    Take a random line from a keys value of the given YAML file or resource. The first argument is the file and the second the key.

  • weighted (takes 1 argument)

    Take a value from a weighted object. For instance: with {"cat" 2, "ferret" 1} there’s a 2 in 3 chance "cat" will be picked.

  • text-from-resource (takes 1 or 2 arguments)

    Generate 3 lines of text from given resource by using markov probability chains. The number of lines can be specified by a second argument.

  • lorum-ipsum (takes 1 argument or none)

    Generate 3 “lorum ipsum” lines of fake Latin text. An optional argument specifies how many lines to generate.

  • date (takes 2 arguments)

    Pick a date between the given arguments formatted 1970-01-31.

  • timestamp (takes 2 arguments)

    Pick a timestamp between the given arguments formatted 1970-01-31T23:59:59+01:00.

And some generators which will transform their arguments in some way or other:

  • join (takes any number of arguments)

    Concatenate all arguments to a string separated by spaces. Empty values will be omitted.

  • format (takes at least 1 argument)

    Uses printf-like format as first argument to render the rest of the arguments. See syntax for details.

  • object (takes an even amount of arguments)

    Construct an object by splitting the list of arguments and zipping them together. For instance: ["name", "spouse", "Fred", "Wilma"] will become {"name": "Fred", "spouse": "Wilma"}.

  • inc (takes 1 or 2 arguments)

    Increments the given argument with one. One extra will be added on a retry attempt when trying to comply to a constraint.

  • dec (takes 1 or 2 arguments)

    Same as inc but decrement instead of increment.

  • first-weekday-of (takes 3 arguments)

    Determines the first given weekday of month in year. For instance "monday", "january", 2020.

  • last-day-of (takes 2 arguments)

    Determines the last day of the given month in year. For instance "january", 2020.

  • abbreviate (takes 1 argument)

    Make an abbreviation of group of words. So "Fred loves Wilma" becomes "BlW". When retrying this the number of retries will be appended.

See Writing generators to write your own.

Standalone

Use this toolkit from the command line as follows:

java -jar target/demo-data-standalone.jar generate \
  doc/cats-schema.json doc/cats-population.json \
  generated/cats.json

Please note: running generate standalone will allow loading resources from the current working directory and up.

Library

This toolkit can be used as library in your (Clojure). This will allow you to generate data in memory and, more importantly, create your own generators from scratch. To include this toolkit in you project, add [nl.surf/demo-data "0.1.0-SNAPSHOT"] as a dependency to your project.clj.

The most important namespaces are:

  • nl.surf.demo-data.config

    Functions here compile your configuration to a list of “executable” attributes. See the ~load~ function.

  • nl.surf.demo-data.world

    Functions here instantiate and populate your demo data by executing the attributes as defined by config/load. See the ~gen~ function.

Minimal example:

(-> {:types [{:name "person"
              :attributes {:name {:generator    ["one-of" ["Fred"
                                                           "Wilma"
                                                           "Barney"
                                                           "Betty"]]
                                   :constraints ["unique"]}}}]}
    (config/load)
    (world/gen {:person 2}))

Possible result:

{:person [#:person{:name "Wilma"} #:person{:name "Barney"}]}

Play with the cats example:

(-> "doc/cats-schema.json"
    (slurp)
    (json/parse-string keyword)
    (config/load)
    (world/gen (-> "doc/cats-population.json"
                   (slurp)
                   (json/parse-string keyword))))

Writing generators

Generators are defined by the generator multi-method in the nl.surf.demo-data.config namespace. An implementation of a generator should return a function which takes a least one argument; state. More arguments are allowed and can be passed as described in Attributes.

The state arguments contains 4 keys:

  • entity

    The current entity populated so far.

  • world

    A map of all the entities populated so far by type.

  • attr

    The internal “executable” representation of the attribute.

  • dep-vals (internal)

    The internal list of deferred deps values. These are also passed on the argument list.

Here a very basic example generator which sticks a exclamation mark after a string, exclaim!:

(defmethod config/generator "exclaim!" [_]
  (fn [world value]
    (str value "!")))

To handle retries for a constraint consider the world/*retry-attempt-nr* binding:

(defmethod config/generator "exclaim!" [_]
  (fn [_ value]
    (apply str value (repeat (inc world/*retry-attempt-nr*) "!"))))

Please note: all provided generator which require randomness use clojure.data.generators. If you are generating something random and need a reproducible result, consider using the primitives in this library and use the *rnd* binding as a seeding mechanism.

Writing constraints

Constraints are defined by the constraint multi-method in the nl.surf.demo-data.config namespace. An implementation of a constraint should return a function which takes two arguments: state and the value being considered. State is the same argument as provided to generator functions, see also Writing generators.

Here a constraint to require an integer is even:

(defmethod config/constraint "even" [_]
  (fn [_ value]
    (even? value)))

When a constraint is not met during generation it is retried up to a 1000 time (configurable with binding ~*retries*~).

Bootstrap from swagger.json

Create the schema and population descriptors from a JSON swagger defining from the command line as follows:

java -jar target/demo-data-standalone.jar bootstrap \
  doc/ooapo-swagger.json \
  generated/ooapi-schema.json generated/ooapi-population.json

This will create two files, demodata-schema.json and demodata-population.json, for you to edit and start generating demo data for your project.

You can generate data using:

java -jar target/demo-data-standalone.jar generate \
  generated/ooapi-schema.json generated/ooapi-population.json

License

Copyright (C) 2020 SURFnet B.V.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.