Feature/question: how to distinguish between unparsed data and strings? #661

Open
felixfontein opened this issue Feb 16, 2025 · 6 comments
Labels
question Further information is requested

Comments

@felixfontein

Is your feature request related to a problem? Please describe.
I'm looking at migrating SOPS to goccy/go-yaml. For that I need to be able to parse arbitrary YAML documents and process their structure without any assumptions about what they look like. While playing around with parser.ParseBytes() and the resulting AST, I noticed that I don't know how to figure out whether something that ends up as a StringNode is actually a string, or something else (date/timestamp, integer, float):

examples:
  - 2024-01-01  # a YAML date
  - 2001-12-15T02:59:43.1Z  # a YAML timestamp
  - 0x123982139013098129831983AB1231231312  # hexadecimal integer
  - 98123918398129831987841872387138712837  # an integer
  - 0o012345670123456701234567  # an octal integer
  - 2.0e308  # a floating point number
  - foo  # an actual string

All the above sequence entries result in a StringNode. If you quote all the above numbers with "...", then you also get StringNodes with the same values. I don't see how to figure out from a StringNode whether it actually represents a string, or real data (like a date, timestamp, large integer, large float).

What gopkg.in/yaml.v3 does:

  • Dates and timestamps are parsed as time.Time objects.
  • The large decimal integer (98123918398129831987841872387138712837) is parsed as a floating point number (distorting the value).
  • Everything else is treated as a string. (With the same problems that I mentioned above.)
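
The distortion mentioned in the second bullet can be reproduced with the standard library alone. The following is a minimal sketch that does not call gopkg.in/yaml.v3 at all; it only assumes that the literal ends up being parsed as a float64. The helper name roundTripsAsFloat64 is made up for illustration.

```go
package main

import (
	"fmt"
	"math/big"
	"strconv"
)

// roundTripsAsFloat64 reports whether the decimal integer literal s keeps its
// exact value when it is parsed as a float64. Hypothetical helper, not part
// of any YAML library.
func roundTripsAsFloat64(s string) bool {
	exact, ok := new(big.Int).SetString(s, 10)
	if !ok {
		return false // not a decimal integer literal
	}
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return false // outside the float64 range
	}
	// Convert the float64 back to an integer and compare with the exact value.
	back, _ := big.NewFloat(f).Int(nil)
	return back.Cmp(exact) == 0
}

func main() {
	for _, s := range []string{"42", "98123918398129831987841872387138712837"} {
		fmt.Printf("%s survives float64: %v\n", s, roundTripsAsFloat64(s))
	}
}
```

A value of that size cannot keep all of its digits in a float64, which is why keeping the original text (or a math/big value) matters for lossless processing.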

Describe the solution you'd like
Describe alternatives you've considered
I'm not sure what the best way to proceed is. The numbers above have been picked so that they cannot be parsed as Golang integers or floats. (The date, and timestamps in general, can be represented by Golang types.)
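
To check that the sample numbers really fall outside Go's native types, here is a small standard-library sketch. The function names are hypothetical. Note one assumption: with base 0, strconv.ParseInt and big.Int.SetString follow Go's literal syntax, which also treats a bare leading 0 as octal, so this is not an exact match for YAML's rules.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
	"strconv"
)

// overflowsInt64 reports whether s is a well-formed integer literal
// (decimal, 0x hex, or 0o octal in Go syntax) that does not fit in an int64.
func overflowsInt64(s string) bool {
	_, err := strconv.ParseInt(s, 0, 64)
	return errors.Is(err, strconv.ErrRange)
}

// overflowsFloat64 reports whether s is a well-formed float literal whose
// magnitude is too large for a float64.
func overflowsFloat64(s string) bool {
	_, err := strconv.ParseFloat(s, 64)
	return errors.Is(err, strconv.ErrRange)
}

// parseBigInt parses the same literal forms with arbitrary precision.
func parseBigInt(s string) (*big.Int, bool) {
	return new(big.Int).SetString(s, 0)
}

func main() {
	samples := []string{
		"0x123982139013098129831983AB1231231312",
		"98123918398129831987841872387138712837",
		"0o012345670123456701234567",
	}
	for _, s := range samples {
		v, ok := parseBigInt(s)
		fmt.Println(overflowsInt64(s), ok, v)
	}
	fmt.Println(overflowsFloat64("2.0e308"))
}
```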

Maybe:

  1. One (or two) new node type(s) for dates/timestamps? (Depending on whether dates and timestamps should be treated separately. gopkg.in/yaml.v3 simply uses time.Time to represent all dates and timestamps.)
  2. Add extra data to the StringNode which tells what the data actually is (actual string; date/timestamp; integer; hexadecimal integer; octal integer; floating point number) if there is no native representation.
  3. Maybe even have an option to parse all values (integers, floats, bools, NaN/Infinity, ...) as StringNode with type info? (That would allow lossless transformations of floats, for example by making it possible to distinguish between 1.10 and 1.1.)

(Obviously all three can also be implemented together. I'm currently tending to like 2. and 3. most.)
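
To make options 2 and 3 more concrete, here is a rough sketch of what "a string plus type info" could look like. Everything in it (ScalarKind, TypedScalar, the Kind constants) is hypothetical and not part of goccy/go-yaml; it only illustrates why keeping the raw text allows 1.10 and 1.1 to stay distinct.

```go
package main

import (
	"fmt"
	"strconv"
)

// ScalarKind and TypedScalar are hypothetical types for illustration only;
// goccy/go-yaml does not define them.
type ScalarKind int

const (
	KindString ScalarKind = iota
	KindInt
	KindFloat
	KindTimestamp
)

// TypedScalar keeps the scalar exactly as written, plus what it resolved to.
type TypedScalar struct {
	Raw  string
	Kind ScalarKind
}

// Float64 converts on demand. Raw is left untouched, so writing the document
// back out can reuse the original spelling.
func (s TypedScalar) Float64() (float64, error) {
	return strconv.ParseFloat(s.Raw, 64)
}

func main() {
	a := TypedScalar{Raw: "1.10", Kind: KindFloat}
	b := TypedScalar{Raw: "1.1", Kind: KindFloat}
	fa, _ := a.Float64()
	fb, _ := b.Float64()
	// The numeric values compare equal, but the raw spellings differ.
	fmt.Println(fa == fb, a.Raw == b.Raw)
	// Formatting the float64 cannot bring the trailing zero back.
	fmt.Println(strconv.FormatFloat(fa, 'f', -1, 64))
}
```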

Ref: getsops/sops#1616

@goccy
Owner

goccy commented Feb 16, 2025

@felixfontein It's difficult to provide hints without understanding exactly what you want to achieve, but basically, anything possible with gopkg.in/yaml.v3 is also possible with this library. According to the YAML specification, these are interpreted as strings at the AST level. If you want to decode them into arbitrary Go types (such as time.Time or int), type information is required. If you need to determine the type at the AST level, you must add a tag like !!timestamp. This is the same when using gopkg.in/yaml.v3.

Also, you can directly decode into a specified Go value using yaml.Unmarshal or yaml.NewDecoder. If you prefer to use parser.ParseBytes manually, you can decode an ast.File to convert it into Go types. Since both ast.File and ast.Node implement the io.Reader interface, they can be passed as arguments to the NewDecoder function.

@felixfontein
Author

@goccy thanks for your reply! What I basically need to achieve is to read a YAML file and walk through its structure and identify the type for each value. I need to do this to create SOPS' internal representation of the data (and comments), where keys and values in mappings and elements in lists have the right types. For that I looked into the AST output, since it seems similar to gopkg.in/yaml.v3's Node (there I'm using yaml.NewDecoder(bytes.NewReader(in)) to create a decoder and successively call Decode() on it to obtain the next document as a yaml.Node). For bools, integers, floats, and dates/timestamps I can use

		var result interface{}
		node.Decode(&result)

to get hold of the value - it will be of type bool, int, float64, string, or time.Time. gopkg.in/yaml.v3 also does not allow me to distinguish between strings and large ints/floats that cannot be parsed into Golang ints/floats, but it does allow me to identify dates/timestamps (I cannot distinguish between timestamps and dates though, which is also a problem).
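
For reference, the step after decoding into an interface{} is usually a type switch. The sketch below uses only the standard library and a made-up helper name (kindOf); it assumes the decoder produced one of the Go types listed above and shows that a large integer that came back as a string can no longer be told apart from a real string at this point.

```go
package main

import (
	"fmt"
	"time"
)

// kindOf names the YAML-level kind that a decoded Go value most likely came
// from. Hypothetical helper; the set of cases mirrors the types listed above.
func kindOf(v interface{}) string {
	switch v.(type) {
	case nil:
		return "null"
	case bool:
		return "bool"
	case int, int64, uint64:
		return "int"
	case float64:
		return "float"
	case string:
		// Could be a real string or an unparsed large number: the
		// information is already lost here.
		return "string"
	case time.Time:
		return "timestamp"
	default:
		return fmt.Sprintf("other (%T)", v)
	}
}

func main() {
	values := []interface{}{
		true, 42, 1.5, "foo",
		time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC),
		nil,
	}
	for _, v := range values {
		fmt.Println(kindOf(v))
	}
}
```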

With goccy/go-yaml, I don't see how I can get that information from ast.Node. Large ints/floats and timestamps and dates are parsed as an ast.StringNode, and there is no information on whether it was actually a string (for example because it was quoted), or whether it was a large integer or float or date/timestamp that goccy/go-yaml simply didn't parse.

So, given a StringNode with value 12398132981498123981, how do I know whether this corresponds to an integer, or a string? Or if it has value 2024-01-01, how do I know whether it corresponds to a date, or a string? In YAML the distinction is very clear:

- 12398132981498123981  # integer
- "12398132981498123981"  # string
- 2024-01-01  # date
- "2024-01-01"  # string

But goccy/go-yaml's StringNode makes it hard to know that information. The only way I see right now is to parse StringNode.Token.Origin and basically manually re-do what goccy/go-yaml is doing.

@felixfontein
Author

Ok, I think I figured out how to distinguish a node of type ast.StringNode that's a quoted string from a string that could be a date/timestamp/integer/float: node.Token.Type == token.SingleQuoteType || node.Token.Type == token.DoubleQuoteType checks whether it was quoted. If it wasn't quoted, I need to check whether it's actually an integer, float, date, or timestamp that wasn't parsed.
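
The second half of that check (what an unquoted scalar actually is) could look roughly like the sketch below. It uses only the standard library: the caller is assumed to pass the scalar text together with a quoted flag obtained from the token type as described above. The regular expressions are simplified approximations of the YAML 1.2 core schema (int, float) and the YAML 1.1 timestamp type, not authoritative patterns; booleans and null are left out, and the names classify, intRE, floatRE, dateRE, and timestampRE are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified, unofficial patterns; treat them as assumptions.
var (
	intRE       = regexp.MustCompile(`^([-+]?[0-9]+|0o[0-7]+|0x[0-9a-fA-F]+)$`)
	floatRE     = regexp.MustCompile(`^([-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?|[-+]?\.(inf|Inf|INF)|\.(nan|NaN|NAN))$`)
	dateRE      = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}$`)
	timestampRE = regexp.MustCompile(`^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}([Tt]|[ \t]+)[0-9]{1,2}:[0-9]{2}:[0-9]{2}(\.[0-9]*)?([ \t]*Z|[ \t]*[-+][0-9]{1,2}(:[0-9]{2})?)?$`)
)

// classify returns the kind of a scalar given its text and whether it was
// quoted in the source. Quoted scalars are always strings.
func classify(raw string, quoted bool) string {
	switch {
	case quoted:
		return "string"
	case intRE.MatchString(raw): // must run before floatRE, which also matches plain digits
		return "int"
	case floatRE.MatchString(raw):
		return "float"
	case dateRE.MatchString(raw):
		return "date"
	case timestampRE.MatchString(raw):
		return "timestamp"
	default:
		return "string"
	}
}

func main() {
	fmt.Println(classify("12398132981498123981", false))
	fmt.Println(classify("12398132981498123981", true))
	fmt.Println(classify("2024-01-01", false))
	fmt.Println(classify("2001-12-15T02:59:43.1Z", false))
	fmt.Println(classify("2.0e308", false))
	fmt.Println(classify("foo", false))
}
```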

@goccy
Owner

goccy commented Feb 16, 2025

@felixfontein Yes, if you just need to determine whether it's a quoted string, that approach will work fine.

@felixfontein
Author

In case anyone has a similar problem, the following documents contain regular expressions that match all strings representing integers, floating point numbers, and timestamps supported by YAML:

I'm a bit torn about sexagesimal support. gopkg.in/yaml.v3, goccy/go-yaml, and ruamel.yaml do not seem to support it, though PyYAML does. (I guess it didn't always support it.)
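
For anyone unfamiliar with sexagesimal integers: in YAML 1.1 a value like 190:20:30 is read as base 60, i.e. 190*3600 + 20*60 + 30 = 685230. A minimal standard-library sketch of that rule follows. The function name and the pattern are assumptions based on the YAML 1.1 int type, overflow during accumulation is not handled, and none of the libraries mentioned above are involved.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sexagesimalInt approximates the YAML 1.1 base-60 integer form.
var sexagesimalInt = regexp.MustCompile(`^[-+]?[1-9][0-9_]*(:[0-5]?[0-9])+$`)

// parseSexagesimal converts a base-60 integer such as "190:20:30".
// Hypothetical helper; overflow of the running total is not checked.
func parseSexagesimal(s string) (int64, bool) {
	if !sexagesimalInt.MatchString(s) {
		return 0, false
	}
	neg := false
	switch s[0] {
	case '-':
		neg = true
		s = s[1:]
	case '+':
		s = s[1:]
	}
	var total int64
	for _, part := range strings.Split(strings.ReplaceAll(s, "_", ""), ":") {
		n, err := strconv.ParseInt(part, 10, 64)
		if err != nil {
			return 0, false
		}
		total = total*60 + n
	}
	if neg {
		total = -total
	}
	return total, true
}

func main() {
	for _, s := range []string{"190:20:30", "1:30", "-1:30", "foo"} {
		v, ok := parseSexagesimal(s)
		fmt.Println(s, v, ok)
	}
}
```

The downside of supporting this form is that unquoted values such as 12:34 silently become integers, which may be one reason several libraries skip it.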

@goccy added the question label (Further information is requested) and removed the feature request label on Feb 16, 2025
@goccy
Owner

goccy commented Feb 16, 2025

Determining the Go type at the ast.Node level is not a common approach, so if you want to do it manually, you will need to implement the type-checking logic yourself.

I want to make this project the de facto standard YAML library for Go. However, even though a large number of users are already using this library, the number of stars required for standardization is still significantly lacking. If you don’t mind, please lend me your support. If you haven’t starred it yet, please do so. I would also greatly appreciate it if you consider becoming a sponsor or recommending it to your friends. I hope that as this project grows, it will benefit all Go developers.
