Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AdditionalVectorCrypto] adding section on addition vector crypto extensions #1306

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

nibrunieAtSi5
Copy link
Contributor

@nibrunieAtSi5 nibrunieAtSi5 commented Mar 28, 2024

This pull request is the riscv-isa-manual version of a pull request started on riscv-crypto: riscv/riscv-crypto#362.

/!\ This pull request is a draft for the future fast track extensions.

This pull requests draft the changes associated with two fast track extensions for vector crypto.

During the specification process for vector crypto 1.0.0 a few items had to be discarded because they appeared too late in the process. This fast track extension tries to address some of them.

The official demand that will be discussed in the Task Group and submitted to the Unpriv Committee is being drafter here: https://docs.google.com/document/d/1zpYhnZi2NxhjfcBGvPOy0oDhx6lTXchscG17Qcl6wv8/edit?usp=sharing

New features:

  • Zvbc32e: Extending vclmul[h].v[vh] instruction to support SEW=32-bit value
    • should be available standalone (ELEN >= 32) or in addition to Zvbc (ELEN >= 64)
    • no new encoding
  • Zvkgs: Adding .vs variants to vghsh and vghmul
    • should depend on Zvkg
    • new encodings

Open questions:

  • Should Zvbc32e be allowed when ELEN >= 32 without depending on Zvbc ? (Answer: YES)
  • Should Zvbc32e support SEW=16 ? (SEW=8 ?)
  • Find encodings
  • How to name the two new extensions
  • Do we need to define a Zvkt(bc/bc32e) to extend Zvkt to the extension of vclmul[h/] defined in Zvbc32e ?

Related changes:

Draft versions:

Version pdf
v0.0.1 (August 31st 2023) https://github.com/riscv/riscv-crypto/files/12487628/riscv-crypto-spec-vector-extra.pdf
v0.0.2 (January 17th 2024) https://github.com/riscv/riscv-crypto/files/13970691/riscv-crypto-spec-vector-extra.pdf
v0.0.3 (February 1st 2024) https://github.com/riscv/riscv-crypto/files/14146438/riscv-crypto-spec-vector-extra_v0.0.3.pdf
v0.0.4 (February 6th 2024) riscv-crypto-spec-vector-extra_v0.0.4.pdf
v0.0.5 (March 7th 2024) riscv-crypto-spec-vector-extra_v0.0.5.pdf
v0.0.6 (June 19th 2024) riscv-crypto-spec-vector-extra_v0.0.6.pdf
v0.0.7 (July 31st 2024) riscv-crypto-spec-vector-extra_v0.0.7.pdf

Changelogs

  • from v0.0.5 to v0.0.6:
    • adding vs2 / vd overlap as reserved encoding for new vghsh.vs / vgmul.vs instructions (review feedback from @QJtaibai )

Original Plan for the fast track schedule

image

References

@QJtaibai
Copy link

QJtaibai commented Jun 5, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

@nibrunieAtSi5
Copy link
Contributor Author

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

@Yunzezhu94
Copy link

Hello! I have a question that in the instruction part of document vghsh.vs shares same encoding with vghsh.vv, and has func6:101100, while in Appendix A it looks vghsh.vs have func6:100011. I wonder which one is correct?

@nibrunie
Copy link

nibrunie commented Jul 5, 2024

Hello! I have a question that in the instruction part of document vghsh.vs shares same encoding with vghsh.vv, and has func6:101100, while in Appendix A it looks vghsh.vs have func6:100011. I wonder which one is correct?

Appendix A and the riscv-v opcode Pull request (https://github.com/nibrunieAtSi5/riscv-opcodes/pull/1/files) are right, the instruciton part of the document was incorrect. I have fixed it.

Note that those opcodes are simply suggestion at this point (but that does not mean the suggestions should not be consistent).

Thank you for pointing that out.

@nibrunieAtSi5
Copy link
Contributor Author

I have developed a small example to demonstrate how Zvbc32e can be leveraged to implement CRC (with the folding method): https://github.com/nibrunie/rvv-examples/blob/e55d529fe54316a95b733aea82cadbfbfbf08e67/src/crc/vector_crc_be_zvbc32e.c

The current implementation provides "performance" of about 0.65 instruction per bytes (on 1MB input buffer).

@nibrunie
Copy link

nibrunie commented Aug 1, 2024

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Aug 14, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
[wavedrom, , svg]
....
{reg:[
{bits: 7, name: 'OP-P'},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have renamed it to OP_VE in #1311.

@wangpc-pp
Copy link

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

@topperc
Copy link
Contributor

topperc commented Aug 15, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

@wangpc-pp
Copy link

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

Yeah, thanks! I get it! But it seems that vd can overlap vs1 for reductions, aren't they the same?

@topperc
Copy link
Contributor

topperc commented Aug 15, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

Yeah, thanks! I get it! But it seems that vd can overlap vs1 for reductions, aren't they the same?

Maybe the assumption for reductions is that vd would only be updated after all of vs1 is read since you need to see all elements before you know the result.

@aswaterman
Copy link
Member

@topperc Right, for reductions, the amount of additional state that’s needed is nil for many implementations, but in the worst case, it’s bounded by ELEN. Here, I guess it’s bounded by EGW. That difference could be material for narrow vector machines. @nibrunieAtSi5 can you illuminate the situation?

@DavidYu360
Copy link

What about the VPR alignment rules for vghsh.vs / vgmul.vs?
As I mentioned in riscv-software-src/riscv-isa-sim#1548

Possible alignment value: 1, 2, 4, 8 (1 for any VPR, 2 for V0, V2, ...)
Prerequisite: VLEN * LMUL >= EGW

  1. Single key in .vs inst: max(EGW/VLEN, 1)
  2. Others: max(LMUL, 1)

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Aug 19, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
wangpc-pp added a commit to llvm/llvm-project that referenced this pull request Aug 19, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
@nibrunieAtSi5
Copy link
Contributor Author

The non-overlap constraint for vector element group is part of the vector crypto specification

Source/Destination overlap constraints::

In the case of the .vs instructions defined in this specification, vs2 holds a 128-bit scalar element group.
For implementations with VLEN ≥ 128, vs2 refers to a single register. Thus, the vd register group must not
overlap the vs2 register.
However, in implementations where VLEN < 128, vs2 refers to a register group comprised of the number
of registers needed to hold the 128-bit scalar element group. In this case, the vd register group must not
overlap this vs2 register group.

[%autowidth]
[%header,cols="4,4,4"]
|===
| Instruction
| Register
| Cannot Overlap

| vaes*.vs | vs2 | vd
| vsm4r.vs | vs2 | vd
| vsha2c[hl] | vs1, vs2 | vd
| vsha2ms | vs1, vs2 | vd
| vsm3me | vs2 | vd
| vsm3c | vs2 | vd

It was defined as no overlapping between vector register groups.

This was not necessarily repeated for each operation

* Only for the `.vs` form: the `vd` register group overlaps the `vs2` scalar element group

  • Only for the .vs form: the vd register group overlaps the vs2 scalar element group

I think this constraint makes sense here as well since allowing overlapping would constraint uarch which cannot save/stage the scalar element group before finishing the operation (as stated by @topperc it is highly likely that for LMUL > EGW / VLEN implementations will iterate to compute the operation over the full vector register groups in a few cycles). Note that .vs makes sense for vl / EGS > 1 since this is where we can actually exploit the fact of having a scalar element group.
The size of vs2 vector register group should be understood as with EMUL = min(EGW / VLEN, 1).
The example in the base vector crypto spec seems to hint this way:

However, in implementations where VLEN < 128, vs2 refers to a register group comprised of the number
of registers needed to hold the 128-bit scalar element group.

but should maybe be generalized to EGW.

@topperc
Copy link
Contributor

topperc commented Aug 29, 2024

Zvkgs: Adding .vs variants to vghsh and vghmul
should depend on Zvkgs

Is the second line supposed to say Zvkg rather than Zvkgs?

@nibrunie
Copy link

nibrunie commented Sep 2, 2024

Zvkgs: Adding .vs variants to vghsh and vghmul
should depend on Zvkgs

Is the second line supposed to say Zvkg rather than Zvkgs?

Yes, you are right @topperc , this is a typo dependency to Zvkg was intended (the assumption is that the vector-scalar(element group) variant do not make sense without the vector-vector variant.

@httoracle
Copy link

In the v0.0.7 spec, I have a question of the operation code of the vghsh.vs and vgmul.vs. Take vghsh.vs for example, It says H is common to all element groups, and you put the "let H = brev8(...)" code outside the loop. I suspect the H of all the group should be the same, and is from the first 128bit of vs2. But inside the each loop, H will be changed.
So if you write just like the operation code in the spec, the H of the next loop will not be the first 128bit of the vs2. I want to ask it is just the purpose of the instruction or is just a miss?
If you want the H to be the same value in the beginning of the each loop, you should put "let H = brev8(...)" inside the loop, so in the beginning of the loop, we can get the H value from vs2 again.

@nibrunie
Copy link

nibrunie commented Sep 10, 2024

In the v0.0.7 spec, I have a question of the operation code of the vghsh.vs and vgmul.vs. Take vghsh.vs for example, It says H is common to all element groups, and you put the "let H = brev8(...)" code outside the loop. I suspect the H of all the group should be the same, and is from the first 128bit of vs2. But inside the each loop, H will be changed. So if you write just like the operation code in the spec, the H of the next loop will not be the first 128bit of the vs2. I want to ask it is just the purpose of the instruction or is just a miss? If you want the H to be the same value in the beginning of the each loop, you should put "let H = brev8(...)" inside the loop, so in the beginning of the loop, we can get the H value from vs2 again.

@httoracle You are right.
H needs to be reset inside the outer loop. this is a mistake in the current behavior code.
It should be fixed by 9982d13.

@nibrunie
Copy link

Additional Vector Crypto extension v0.0.8 (September 9th 2024):
riscv-crypto-spec-vector-extra_v0.0.8.pdf

Changelog from v0.0.7:

  • Fixing behavior of vghsh.vs and vgmul.vs (thanks to @httoracle )
  • Upgrading alias for 0x77 opcode from OP-P to OP-VE (thanks to @wangpc-pp )

Signed-off-by: Nicolas Brunie <nibrunie@gmail.com>
Signed-off-by: Nicolas Brunie <nibrunie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants