Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinct YulHexStringLiterals from common Yul string literals while reproducing the source code #11324

Closed
blitz-1306 opened this issue Apr 28, 2021 · 11 comments

Comments

@blitz-1306
Copy link

blitz-1306 commented Apr 28, 2021

Description

Since Solidity 0.8.4 release there is a support for YulHexStringLiterals (let x := hex"11", the commit for reference). While making attempts to reproduce the original source from compiler-provided AST, I met two issues with new literals:

  1. Unable to distinct common string literals from hex"" ones, as there are no markers in Yul AST. It is hard to decide if Yul string literal should produce the source code as common string or hex"" notation. In Solidity AST there is a "kind": "hexString" and/or "hexValue" entries that was helping with that.
  2. Unable to decode the value of hex"" literals from JSON output and get the original bytes, that was supplied in input source. See common Solidity hex string literal and Yul hex string literal examples for hex"112233445566778899aabbccddeeff6677889900" (link):
// Solidity
{
    "hexValue": "112233445566778899aabbccddeeff6677889900",
    "id": 15,
    "isConstant": false,
    "isLValue": false,
    "isPure": true,
    "kind": "hexString",
    "lValueRequested": false,
    "nodeType": "Literal",
    "src": "224:45:0",
    "typeDescriptions": {
        "typeIdentifier": "t_stringliteral_72dfaffeee45e650295b39194e71fce8dd7c225413cd7084c2de1e0a7b9f6057",
        "typeString": "literal_string hex\"112233445566778899aabbccddeeff6677889900\""
    }
}
// Yul
{
    "kind": "string",
    "nodeType": "YulLiteral",
    "src": "581:45:0",
    "type": "",
    "value": "\u0011\"3DUfwșʻ̝wș\u0000" // <- The value requires an additional processing to get the original bytes
}

I understand that the intention was "to support syntax at all", but the AST is also "kind of" important and I guess "producing source from AST" is somehow the expected usage. I believe it may look like a feature request, but the other part that the source code is not producible from AST itself properly looks like a bug. Anyway, the Solidity team should decide if this should be a bugfix or feature implementation. If this would be a feature, can the team suggest a temporary workaround?

Environment

$ node --version
v12.18.3

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

Steps to Reproduce

package.json

{
  "name": "solc-js-084-test",
  "version": "0.0.0",
  "main": "index.js",
  "dependencies": {
    "solc-0.8.4": "npm:solc@0.8.4"
  }
}

sample.sol

/// SPDX-License-Identifier: UNLICENSED

pragma solidity 0.8.4;

contract Sample {
    function testSolidityHexLiterals() public pure returns(bytes memory a, bytes memory b, bytes memory c) {
        a = "test";
        b = hex"112233445566778899aabbccddeeff6677889900";
        c = hex"1234_abcd";
    }

    /// @dev https://github.com/ethereum/solidity/pull/11173
    /// @dev https://github.com/ethereum/solidity/commit/7eb5e27e54c3ffc0e0a1e6d9b6c50e992df6886b
    function testAssemblyHexLiterals() public {
        assembly {
            let a := "test"
            let b := hex"112233445566778899aabbccddeeff6677889900"
            let c := hex"1234_abcd"

            sstore(0, b)
            sstore(1, c)

            pop(hex"2233")
        }
    }
}

index.js

const fs = require("fs");
const compiler = require("solc-0.8.4");

const fileName = "sample.sol";
const content = fs.readFileSync(fileName, { encoding: "utf-8" });

const input = {
    language: "Solidity",
    sources: {
        [fileName]: {
            content
        }
    },
    settings: {
        outputSelection: {
            "*": {
                "*": ["*"],
                "": ["*"]
            }
        }
    }
};

const output = compiler.compile(JSON.stringify(input), {});
const parsed = JSON.parse(output);

console.log(JSON.stringify(parsed, undefined, 4));

Run command

npm install # run once prior to execution to setup dependencies
node index.js

Regards and thank you for your time.

@blitz-1306
Copy link
Author

blitz-1306 commented Apr 28, 2021

P.S.: You can also look at the original source and produced source (and the logic, if you would prefer) in the referenced package.

@chriseth
Copy link
Contributor

Both the Solidity and the Yul AST lose information when they are generated from source and we do not see the reverse operation (generating source from AST) as a use-case of the AST. Since you lose information anyway, is it that important to re-generate the correct kind of string? You also won't be able to determine if certain characters were escaped in the original string, for example.

The internal Yul AST does not contain the distinction anymore because it is not needed for code-generation, so I would be hesitant to add it.

@blitz-1306
Copy link
Author

blitz-1306 commented Apr 28, 2021

@chriseth The "correct kind of string" is the exact wording for what was asked for. So far, there were no issues with code reproduction from AST, as all of the nodes were containing all the necessary compatible information to produce source that would result the same bytecode. It does not really matter if some details are changed while they are still readable and producing same results. With current Yul AST I was not able to achieve this.

I, sure, understand that AST is an abstract representation that loses some details (whitespace and some other, that may be expressed differently), however it is also widely used to analyze, instrument and modify code. It additionally serves for the purpose to indicate certain meta-details of the code structure. These are the nice features of compiler-provided AST. Would these arguments be enough motivation to keep such AST quality level?

@cd1m0
Copy link

cd1m0 commented Apr 28, 2021

@chriseth I think the problem is that the Yul AST seems ambiguous in this particular case. Perhaps if you clarify what the semantics of the string inside the YulLiteral node of the Yul ast is?

Is it always just the string itself utf-8 encoded? Or does it change depending on whether the original string was a normal string or a hex string? In the case of normal Literals there is different behavior which is why we wondered.

Regarding your second point about seeing generating source from AST as a use-case for the AST, I guess we (https://github.com/consensys/scribble) are the first consumer that use it in that way. However I think the exact semantics of the YulLiteral strings migth be of interested of other non-source-generating consumers, such as slither for example.

@chriseth
Copy link
Contributor

I currently cannot see any different behaviour about the AST output in the Solidity case.

For Solidity Literal has the fields kind, value and hexValue. hexValue always contains the hex encoding of the string literal's decoded contents. value is non-null if the string literal's decoded contents is valid utf8 and it is the utf8-encoded contents.

For Yul literals, the mechanism to detect whether it is valid utf8 is missing, so the string just contains (hex) escapes in that case.

Or is this what you meant by "different behaviour"?

So as far as I can see, neither in the Solidity case nor in the Yul case is there any different encoding of the value depending on whether or not the original literal was specified as a hex literal. In the Solidity case, the value field is omitted if the actual content cannot be encoded in utf8. What is different in Solidity is that kind has the values number, string ,unicodeString, hexString and bool, while in Yul, it can only be number, bool and string.

@blitz-1306
Copy link
Author

blitz-1306 commented Apr 29, 2021

Let me clarify something, that I probably missed in the starting post (and sorry if it caused too much confusion). Lets add few code lines to the reproduction sample script (index.js):

// console.log(JSON.stringify(parsed, undefined, 4)); // Comment compiler output

const literal = parsed.sources[fileName]
    .ast            // SourceUnit
    .nodes[1]       // ContractDefinition
    .nodes[1]       // FunctionDefinition (testAssemblyHexLiterals)
    .body           // Block
    .statements[0]  // InlineAssembly
    .AST            // YulBlock
    .statements[1]  // YulVariableDeclaration (let b := hex"112233445566778899aabbccddeeff6677889900")
    .value;         // YulLiteral

console.log(literal);

console.log("Original value", literal.value);

const bytes = new TextEncoder().encode(literal.value);

console.log("Decoded bytes", bytes);
console.log("Decoded bytes (as HEX)", Array.from(bytes, (byte) => byte.toString(16)));

const hex = Array.from(bytes, (byte) => byte.toString(16).padStart(2, "0")).join("");

console.log("HEX string", hex);

The output:

{
  kind: 'string',
  nodeType: 'YulLiteral',
  src: '581:45:0',
  type: '',
  value: '\u0011"3DUfwșʻ̝wș\u0000'
}
Original value "3DUfwșʻ̝wș
Decoded bytes Uint8Array(20) [
   17,  34,  51,  68,  85, 102,
  119, 200, 153, 202, 187, 204,
  157, 238, 191, 166, 119, 200,
  153,   0
]
Decoded bytes (as HEX) [
  '11', '22', '33', '44',
  '55', '66', '77', 'c8',
  '99', 'ca', 'bb', 'cc',
  '9d', 'ee', 'bf', 'a6',
  '77', 'c8', '99', '0'
]
HEX string 11223344556677c899cabbcc9deebfa677c89900

While in the original input source has 112233445566778899aabbccddeeff6677889900. There is something special happening with bytes since 88... I tried to rely on common approach to check if the AST node value is actually corresponds to something that we have in the code and JSON serialization/deserialization not messed things up. Maybe I'm doing something wrong or missing something, ofcourse. If so, could you (or somebody from the team), please, give a hint what is wrong here?

If there would be an original value from source and the "kind": "hex" switch, then there would not be so much problems.

Thanks in advance.

@blitz-1306
Copy link
Author

Guys, would some of you be able to look into a post above and explain the behavior?

@chriseth
Copy link
Contributor

chriseth commented May 3, 2021

This is certainly weird. I think we should at least add a hex value, so that we do not get into some json coding mess.

@blitz-1306
Copy link
Author

It seems that the fix is part of Solidity 0.8.5 release. I will check and close the issue in near future. Thanks.

@blitz-1306
Copy link
Author

Issue is fixed. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants