Create "data dictionary" for all SCTK fields #2008

mjherzog · 2020-04-13T22:44:23Z

We need a comprehensive "data dictionary" of all fields that may be present in a ScanCode output file.
The minimum requirement is a list of fields with type (single value, list or ?) and description. This should, of course, be versioned along with SCTK releases.
We have some of this information for the CSV output at https://scancode-toolkit.readthedocs.io/en/latest/cli-reference/output-format.html#custom-output-format (which may not be current), but we do not have it for the full set of output fields in the JSON output.
There is currently no single file in the codebase (like Django models), but two files seem to contain most of the field definitions:

./src/scancode/api.py - non-Package fields
./src/packagedcode/models.py - Package fields (see section starting line 361)

A first step should be to investigate whether there are existing automated documentation tools for Python that would help us get started.

This Issue supersedes #112 which is pretty stale at this point.

AyanSinhaMahapatra · 2020-11-23T15:01:57Z

Some relevant comments from the gitter/discuss channel:

@pombredanne

the thing is that the output of scancode should be (self) documenting from the models. We are using mostly attrs classes so imho the plan could be:

ensure that we are using @hynek https://github.com/python-attrs/attrs/ attrs classes all the way possibly adopting @Tinche https://github.com/Tinche/cattrs for nested types

define/design a simple way to add a docstring of sorts to each model and attribute ... there is some example on that here https://github.com/nexB/scancode-toolkit/blob/d9ae6e62ebad6a896cf5b58185d833302e95c72d/src/packagedcode/models.py#L143
using this https://github.com/nexB/commoncode/blob/fbe882da6c03352c8043cdd45c72b7ca44239e6d/src/commoncode/datautils.py#L45

There is also some ticket that I have tracked there python-attrs/attrs#357

have/create a way to get all that integrated into sphinx (possibly with custom extensions). Publish that as part of the doc publication

Enjoy and relax reading a beautiful doc :P
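A minimal sketch of the plan above: each attribute carries its own description as metadata, and a small helper walks the model to produce data dictionary rows. To stay runnable with the standard library alone, it uses `dataclasses` as a stand-in for attrs; attrs offers the equivalent `metadata` argument on `attr.ib()` and the `attr.fields()` function. The `documented` helper, the `MatchedRule` model and the description texts are hypothetical, not ScanCode code.

```python
from dataclasses import dataclass, field, fields


def documented(default, help=""):
    """Return a dataclass field that carries its own description (hypothetical helper)."""
    return field(default=default, metadata={"help": help})


@dataclass
class MatchedRule:
    # Hypothetical model. Field names are borrowed from a license scan result;
    # the descriptions are illustrative, not official wording.
    identifier: str = documented("", help="File name of the matched rule.")
    is_license_text: bool = documented(False, help="True if the rule is a full license text.")
    rule_relevance: int = documented(0, help="Relevance of the rule.")


def data_dictionary(model):
    """Return one (name, type, description) row per field of a model."""
    rows = []
    for f in fields(model):
        # f.type is a class here; fall back to str() for string annotations.
        type_name = f.type.__name__ if isinstance(f.type, type) else str(f.type)
        rows.append((f.name, type_name, f.metadata.get("help", "")))
    return rows


rows = data_dictionary(MatchedRule)
for name, type_name, description in rows:
    print(f"{name} ({type_name}): {description}")
```

Rows like these could then be rendered into a table by a Sphinx extension or a small script at doc build time.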

Check also this older approach https://github.com/nexB/scancode-toolkit/tree/d9ae6e62ebad6a896cf5b58185d833302e95c72d/etc/scripts/sch2js

This was taking the generated JSON and reversing a JSON schema from that. That could be OK too as an approach.
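As a rough illustration of reversing a schema from generated JSON (this is not the sch2js script itself, only a sketch of the idea), the function below infers a JSON-Schema-like description from a decoded value. It assumes lists are homogeneous and describes only their first item. The sample values are taken from a license scan result.

```python
import json


def infer_schema(value):
    """Return a minimal JSON-Schema-like description of a decoded JSON value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(item) for key, item in value.items()},
        }
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first item describes them all.
        schema = {"type": "array"}
        if value:
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, bool):
        # Tested before int because bool is a subclass of int in Python.
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}


sample = json.loads(
    '{"key": "mit-old-style", "score": 100.0, "start_line": 9,'
    ' "is_exception": false, "spdx_license_key": null, "licenses": ["mit-old-style"]}'
)
schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

A schema inferred this way only records types. Descriptions would still have to come from the models or be written by hand, and a field that is `null` in the sample gives no hint about its real type.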

Tim Hatch @thatch

I don't know about where to document, but sphinx does handle something called "doc comments" -- comments before an assignment that start with #:

documented itself at https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoattribute
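For illustration, a doc comment looks like the sketch below. The `Rule` class here is a hypothetical, trimmed-down stand-in and the comment wording is illustrative only. With autodoc enabled, a directive such as `.. autoattribute:: mymodule.Rule.identifier` (where `mymodule` is a placeholder module name) would pick up the `#:` comment as the attribute's documentation.

```python
class Rule:
    """Hypothetical, trimmed-down stand-in for a license rule model."""

    #: File name of the rule that was matched (illustrative wording).
    identifier = None

    #: True if the rule text is a full license text (illustrative wording).
    is_license_text = False
```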

robinsingh-ai · 2021-04-03T10:29:07Z

hey @AyanSinhaMahapatra @pombredanne
I have seen this new project idea about creating docs automatically from ScanCode data.
Earlier I worked on a personal Python package that contains minimal, clean example implementations of popular data structures and the major algorithms. To explain how the module can be used I also wrote docs for it, which you can find here.
In those docs I import the main source files where all the programs are written. For example, in the DP section the source code and its docstrings are pulled in with the automodule directive, which includes documentation from the docstrings present in the source code (for more info refer to this). With this directive we can fetch the docstrings from the main files and display them alongside the code, and in the same way we can automate the AboutCode documentation.

autodoc is a Sphinx extension that includes documentation from the docstrings in the source code. It provides several directives, such as automodule, autoclass and autoexception.

.. automodule:: : documents a whole source module, with all its classes and functions.
.. autoclass:: : documents a single class from the source.
The key to using these directives is the :members: option:
- If we don’t include it at all, only the docstring of the object itself is brought in.
  - For example, as @mjherzog commented, we can import api.py from src/scancode in the docs, and Sphinx will then parse only its module docstring.
- If we use :members: with no arguments, all public functions, classes, and methods that have a docstring are brought in.
- If we explicitly list the members, like :members: <function_1>, <function_2>, <function_3>, only those members are brought in.

For more, refer to this.
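As a small sketch of the setup these directives rely on, a Sphinx `conf.py` fragment could enable autodoc and set project-wide defaults. The option values below are an assumption about what a project might choose, not the ScanCode configuration.

```python
# conf.py (Sphinx configuration): a minimal fragment, not the ScanCode one.

# Enable the autodoc extension that provides automodule, autoclass, etc.
extensions = ["sphinx.ext.autodoc"]

# Defaults applied to every autodoc directive in the project.
autodoc_default_options = {
    # Same effect as writing a bare :members: option on each directive.
    "members": True,
    # Keep members in the order they appear in the source file.
    "member-order": "bysource",
}
```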

This way we can restrict the docs to the data we actually want to show. That data is only a small part of each class, and a plain :members: would also document a lot of functions and methods that we don't want. Instead, we would take only the docstrings or class attributes that document a data field and build the docs from those, which lets us automate the docs for ScanCode data.

AyanSinhaMahapatra · 2021-04-06T15:13:15Z

@Robin025 the information about autodoc/automodule/autoclass is known, but the issue here is, we want to document attribute members of classes, (not all members, only the ones that end up in the result data selectively).

Take an example of this.

Here in licensedcode/models.py, there's a class Rule. We don't want to create documentation based on this class itself. We want to document some of its members, the ones that end up in the result data (after running a license scan with -l). You can look at an example scan result, like the one in the output format docs, to see that these attributes end up in the matched_rule section of each detected license.

{
  "path": "samples/zlib/iostream2/zstream.h",
  "type": "file",
  "licenses": [
    {
      "key": "mit-old-style",
      "score": 100.0,
      "name": "MIT Old Style",
      "short_name": "MIT Old Style",
      "category": "Permissive",
      "is_exception": false,
      "owner": "MIT",
      "homepage_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "text_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit-old-style",
      "spdx_license_key": null,
      "spdx_url": "",
      "start_line": 9,
      "end_line": 15,
      "matched_rule": {
        "identifier": "mit-old-style_cmr-no_1.RULE",
        "license_expression": "mit-old-style",
        "licenses": [
          "mit-old-style"
        ],
        "is_license_text": true,
        "is_license_notice": false,
        "is_license_reference": false,
        "is_license_tag": false,
        "matcher": "2-aho",
        "rule_length": 71,
        "matched_length": 71,
        "match_coverage": 100.0,
        "rule_relevance": 100
      }
    }
  ],

Now, we aren't documenting functions or entire classes as they exist in scancode now. We have to document these class attributes selectively, across all of scancode and its plugins.

So, if we use the autodoc extension, we would have to write one new class for each of the attributes we want to document (there are a lot of them), write the docs in its docstring, and then use autoclass to collect all the docs from there. I'm not opposed to this, but we should consider all the options. If you look at the suggestion of @thatch above, autoattribute is a much better fit than autoclass here, if this method is preferred at all.

The documentation generation part would be easier this way, because nothing has to be done to collect the docs. But the original suggestion above yours gives cleaner code and seems a better way to me, though some exploration is needed on the collection and docs generation part.

So consider the method suggested in the comment above by @pombredanne, generate some scan results to see where the data for these attributes is located, and let us know if you have any questions there.
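One way to see where the attributes are located is to flatten a scan result into field paths. The sketch below is a hypothetical helper, not part of ScanCode, and the `scan` dict is a trimmed copy of the example result above.

```python
def field_paths(value, prefix=""):
    """Yield the dotted path of every leaf field in a decoded scan result."""
    if isinstance(value, dict):
        for key, item in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from field_paths(item, path)
    elif isinstance(value, list):
        # Mark list items with [] so that all items share the same path.
        # Note: empty lists and empty dicts yield no path at all.
        for item in value:
            yield from field_paths(item, prefix + "[]")
    else:
        yield prefix


# Trimmed copy of the example scan result above.
scan = {
    "path": "samples/zlib/iostream2/zstream.h",
    "licenses": [
        {
            "key": "mit-old-style",
            "matched_rule": {
                "identifier": "mit-old-style_cmr-no_1.RULE",
                "licenses": ["mit-old-style"],
            },
        }
    ],
}

paths = sorted(set(field_paths(scan)))
for path in paths:
    print(path)
```

Each printed path, such as `licenses[].matched_rule.identifier`, points at one attribute that needs an entry in the data dictionary.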

mjherzog added the documentation label Apr 13, 2020

mjherzog assigned pombredanne and johnmhoran Apr 13, 2020

mjherzog added design needed GUI and outputs must have labels Apr 13, 2020

mjherzog mentioned this issue Apr 13, 2020

Document all scancode toolkit json output keys #112

Closed

AyanSinhaMahapatra mentioned this issue Mar 8, 2021

Add new documentation that highlights JSON scan format and its change with a new release #2425

Open

AyanSinhaMahapatra mentioned this issue Apr 6, 2021

Creating docs automatically for scancode data #2455

Closed

5 tasks

AyanSinhaMahapatra mentioned this issue Jul 16, 2021

Document ScanCode JSON ouput format #2596

Open

pombredanne mentioned this issue Aug 20, 2021

Version JSON output data format #2653

Closed

pombredanne mentioned this issue Jul 5, 2024

Create a schema to store vulnerability and package metadata serialized in a suitable JSON (or YAML) format aboutcode-org/federatedcode#10

Closed
