Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use full TextAttribute when copying data into Clipboard #16270

Closed
wants to merge 17 commits into from

Conversation

tusharsnx
Copy link
Contributor

@tusharsnx tusharsnx commented Nov 6, 2023

This PR replaces TextBuffer::TextAndColor with TextBuffer::TextAndAttribute which gives us the opportunity to use all of the text attribute we save within buffer while copying data into the clipboard.

We'll use til::small_rle for storing the attributes for each row in the selection rect. This helps in minimizing the memory required for storing all the attribute of a row.

This PR also fixes two bugs in the formatted copy:

  • We were applying line breaks after every selected TextBuffer row, even though the row could have been a wrapped row. This caused wrapped rows to break when they shouldn't. Fix: we now add a newline (<BR> in HTML, \line in RTF) only when we see \r or \n in the text.
  • We were mishandling Unicode text (\uN) within RTF copy. Every next character that uses a surrogate pair or high codepoint was missing in the copied text when pasted to Word. Fix: command \uc4 should have been \uc1, which is used to tell how many characters will be used as a fallback for a unicode character. We always use 1 ? character as the fallback (and not 4, as was the case).

Validation Steps Performed

echo "This is some Ascii \ {}`nLow code units: á é í ó ú `u{2b81} `u{2b82}`nHigh code units: `u{a7b5} `u{a7b7}`nSurrogates: `u{1f366} `u{1f47e} `u{1f440}"

PR Checklist

  • Closes Font variant ignored by copyFormatting #16191
  • Add other attributes to formatted-copy
  • Refactor GenRTF and GenHTML to loop using TextAttrbute RLE runs.
  • Refactor most of our Copy codepaths to get rid of GetText and the stuff we do with TextAndColor/TextAndAttribute. GenHTML and GenRTF read from the buffer directly.

@microsoft-github-policy-service microsoft-github-policy-service bot added Issue-Bug It either shouldn't be doing this or needs an investigation. Area-TerminalControl Issues pertaining to the terminal control (input, selection, keybindings, mouse interaction, etc.) Product-Terminal The new Windows Terminal. labels Nov 6, 2023
@@ -96,7 +96,7 @@ Clipboard& Clipboard::Instance()
// - cchData - Size of the Unicode String in characters
// Return Value:
// - None
void Clipboard::StringPaste(_In_reads_(cchData) const wchar_t* const pData,
void Clipboard::StringPaste(_In_reads_(cchData) PCWCHAR pData,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this codebase, unless we're interfacing with a Windows API we have a negative preference for the windows typedefs. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, ok, will revert this.

The declaration is little confusing though. It uses PCWCHAR, so I thought let's make it consistent. Maybe, we should change PCWCHAR to const wchar_t* const in the declaration too?

void StringPaste(_In_reads_(cchData) PCWCHAR pwchData,
const size_t cchData);

std::vector<COLORREF> selectionFgAttr;
std::vector<COLORREF> selectionBkAttr;
selectionText.reserve(maxCells);
til::small_rle<TextAttribute> attrs{ maxCells, it->TextAttr() };
Copy link
Member

@DHowett DHowett Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I am OK with using something memory-inefficient (like a vector for each row with width capacity, or even a huge height*width vector with sub-spans for each row), because this is a short-lived transport format rather than a long-lived permanent storage format.

If RLE is slower in this use case, we can spend memory to save time by not using it. 😄

Copy link
Member

@DHowett DHowett Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I know that til::rle has useful iteration behavior that specifically helps text attribute run production. That is, you can iterate attr-change-by-attr-change... which could be good for HTML/RTF generation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If RLE is slower in this use case, we can spend memory to save time by not using it

I don't think RLE is any slower than vector type in this case 😅

you can iterate attr-change-by-attr-change... which could be good for HTML/RTF generation.

I totally agree. Though, I wasn't planning to make this change in this PR, but since you mentioned it, I'll see if that can be done in this PR itself. I'll add this to the PR checklist for now 👍

@lhecker
Copy link
Member

lhecker commented Nov 6, 2023

Relatedly, you might be interested in this code I'm currently working on:

bool TextBufferSerializer::SerializeNextRow(std::wstring& out)
{
if (currentRow > lastRowWithText)
{
// If you end up in here it means you ignored a previous `return false`.
assert(false);
return false;
}
const auto& row = _textBuffer.GetRowByOffset(currentRow);
const auto& runs = row.Attributes().runs();
auto it = runs.begin();
const auto end = runs.end();
const auto last = end - 1;
til::CoordType oldX = 0;
for (; it != end; ++it)
{
const auto attr = it->value.GetCharacterAttributes();
const auto hyperlinkId = it->value.GetHyperlinkId();
const auto fg = it->value.GetForeground();
const auto bg = it->value.GetBackground();
const auto ul = it->value.GetUnderlineColor();
if (previousAttr != attr)
{
const auto attrDelta = attr ^ previousAttr;
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::Intense))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::Intense) ? L"\x1b[1m" : L"\x1b[22m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::Italics))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::Italics) ? L"\x1b[3m" : L"\x1b[23m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::Blinking))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::Blinking) ? L"\x1b[5m" : L"\x1b[25m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::Invisible))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::Invisible) ? L"\x1b[8m" : L"\x1b[28m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::CrossedOut))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::CrossedOut) ? L"\x1b[9m" : L"\x1b[29m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::Faint))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::Faint) ? L"\x1b[2m" : L"\x1b[22m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::UnderlineStyle))
{
switch (it->value.GetUnderlineStyle())
{
case UnderlineStyle::NoUnderline:
out.append(L"\x1b[24m");
case UnderlineStyle::SinglyUnderlined:
out.append(L"\x1b[4m");
case UnderlineStyle::DoublyUnderlined:
out.append(L"\x1b[21m");
case UnderlineStyle::CurlyUnderlined:
out.append(L"\x1b[4:3m");
case UnderlineStyle::DottedUnderlined:
out.append(L"\x1b[4:4m");
case UnderlineStyle::DashedUnderlined:
out.append(L"\x1b[4:5m");
default:
out.append(L"\x1b[4m");
}
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::TopGridline))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::TopGridline) ? L"\x1b[53m" : L"\x1b[55m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::ReverseVideo))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::ReverseVideo) ? L"\x1b[7m" : L"\x1b[27m");
}
if (WI_IsAnyFlagSet(attrDelta, CharacterAttributes::BottomGridline))
{
out.append(WI_IsAnyFlagSet(attr, CharacterAttributes::BottomGridline) ? L"\x1b[4m" : L"\x1b[24m");
}
previousAttr = attr;
}
if (previousFg != fg)
{
switch (fg.GetType())
{
case ColorType::IsDefault:
out.append(L"\x1b[39m");
break;
case ColorType::IsIndex16:
{
BYTE index = WI_IsFlagSet(fg.GetIndex(), FOREGROUND_INTENSITY) ? 90 : 30;
index += fg.GetIndex() & 7;
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[{}m"), index);
break;
}
case ColorType::IsIndex256:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[38;5;{}m"), fg.GetIndex());
break;
case ColorType::IsRgb:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[38;2;{};{};{}m"), fg.GetR(), fg.GetG(), fg.GetB());
break;
default:
break;
}
previousFg = fg;
}
if (previousBg != bg)
{
switch (fg.GetType())
{
case ColorType::IsDefault:
out.append(L"\x1b[49m");
break;
case ColorType::IsIndex16:
{
BYTE index = WI_IsFlagSet(fg.GetIndex(), FOREGROUND_INTENSITY) ? 100 : 40;
index += fg.GetIndex() & 7;
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[{}m"), index);
break;
}
case ColorType::IsIndex256:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[48;5;{}m"), fg.GetIndex());
break;
case ColorType::IsRgb:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[48;2;{};{};{}m"), fg.GetR(), fg.GetG(), fg.GetB());
break;
default:
break;
}
previousBg = bg;
}
if (previousUl != ul)
{
switch (fg.GetType())
{
case ColorType::IsDefault:
out.append(L"\x1b[59m");
break;
case ColorType::IsIndex256:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[58:5:{}m"), ul.GetIndex());
break;
case ColorType::IsRgb:
fmt::format_to(std::back_inserter(out), FMT_COMPILE(L"\x1b[58:2::{}:{}:{}m"), ul.GetR(), ul.GetG(), ul.GetB());
break;
default:
break;
}
previousUl = ul;
}
if (previousHyperlinkId != hyperlinkId)
{
if (hyperlinkId)
{
const auto uri = _textBuffer.GetHyperlinkUriFromId(hyperlinkId);
if (!uri.empty())
{
out.append(L"\x1b]8;;");
out.append(uri);
out.append(L"\x1b\\");
previousHyperlinkId = hyperlinkId;
}
}
else
{
out.append(L"\x1b]8;;\x1b\\");
previousHyperlinkId = 0;
}
}
auto newX = oldX + it->length;
// Trim whitespace with default attributes from the end of each line.
if (it == last && it->value == defaultAttr)
{
// This can result in oldX > newX, but that's okay because GetText()
// is robust against that and returns an empty string.
newX = row.MeasureRight();
}
out.append(row.GetText(oldX, newX));
oldX = newX;
}
const auto moreRowsRemaining = currentRow < lastRowWithText;
if (!row.WasWrapForced() || !moreRowsRemaining)
{
out.append(L"\r\n");
}
currentRow++;
return moreRowsRemaining;
}

That's how I serialize a TextBuffer to text with VT sequences. It iterates over the backing run length compressed attributes directly and doesn't depend on any of the unwieldy cell iterators. This has the benefit of being massively faster and counter-intuitively being fairly concise code.

I believe the function could be made to also produce plain text, HTML or RTF with a couple nested if conditions and switch statements.

I think this PR is fine as is though. ^ Just some inspiration. 🙂

@tusharsnx
Copy link
Contributor Author

Relatedly, you might be interested in this code I'm currently working on

Exactly what I was thinking of, thanks 😀

@tusharsnx tusharsnx marked this pull request as ready for review November 14, 2023 14:54
Copy link
Member

@lhecker lhecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one minor issue (textTrimmedLength) and a nit (the slice() usage).

Comment on lines +1968 to +1980
const auto begin = text.rbegin();
const auto end = text.rend();
auto it = begin;
for (; it != end; ++it)
{
const auto chars = cell.Chars();
selectionText.append(chars);

if (copyTextColor)
if (*it != L' ')
{
const auto cellData = cell.TextAttr();
const auto [CellFgAttr, CellBkAttr] = GetAttributeColors(cellData);
for (size_t j = 0; j < chars.size(); ++j)
{
selectionFgAttr.push_back(CellFgAttr);
selectionBkAttr.push_back(CellBkAttr);
}
break;
}
}

++it;
const auto textTrimmedLength = gsl::narrow_cast<uint16_t>(end - it);
colEnd = std::min(colEnd, textTrimmedLength);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the textTrimmedLength code is correct... end - it returns the number of trailing whitespaces so colEnd should be row.size() - (end - it), no? Also, we can simplify this using either find_last_not_of(L' ') or ROW::MeasureRight(). That is:

colEnd = std::min(colEnd, row.MeasureRight());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row.size() - (end - it)

That would be correct if were to use .begin()/.end(), but here we are using .rbegin()/.rend(), isn't?

Also, we can simplify this using either find_last_not_of(L' ') or ROW::MeasureRight()

I faced some problem when I tried to use ROW::MeasureRight() for getting the trimmed length. IIRC, the row with the cursor was always _wrapForced = true, which made MeasureRight() return _columnCount. However, it seems to work now so I guess I'll update this code to use MeasureRight() 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention that MeasureRight() is better in the sense that it doesn't trim trailing whitespaces of wrapped rows. This is because, technically, the trailing whitespaces on a wrapped row are not considered trailing whitespace of the line that was wrapped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, end - it returns the number of characters from the start to the iterator. But that's also wrong, since there isn't a 1:1 mapping between the number of characters and columns.
I'm glad that MeasureRight() works for you now!

// We apply formatting to rows if the row was NOT wrapped or formatting of wrapped rows is allowed
const auto shouldFormatRow = formatWrappedRows || !GetRowByOffset(iRow).WasWrapForced();
// save selected text
selectionText = std::wstring{ row.GetText(colBegin, colEnd) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note, BTW: This text copying is among others a reason for why I really want to remove TextAndAttribute entirely. It'd be much more efficient but just as simple to accumulate the HTML/RTF data right here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean to say accumulate HTML/RTF data for this line, I guess we can do this in these steps:

  1. Generate HTML/RTF header data.
  2. Generate HTML/RTF attr-run-by-attr-run by calling Gen[HTML|RTF] in a loop. Gen[HTML|RTF] would then only be responsible for this span of text (or an RTF group in case of RTF). We'll pass a single attribute which needs to be applied for this run of text.
  3. Generate HTML/RTF footer data.
  4. Final result would be Header + All text spans + Footer.

Does this make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is an approach like the one I linked above in this comment: #16270 (comment)
It accumulates the text directly into std::wstring& out. (FYI it uses an out parameter because that's the only way the caller can control both, the console lock duration, and the batch size for WriteFile calls. For functions like GenHTML and GenRTF, I'd just return the string.)

I'd basically merge GetText into each GenHTML and GenRTF which then do their own text- and attribute-processing directly on the ROWs. Only once they're standalone functions without any external helpers, I'd start abstracting smaller bits away as needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.. I'm implementing this, just letting you know that we will be calling RenderSettings::GetAttributeColors twice now, each in GenHTML and GenRTF. Would this be a concern?

I'm also looking into having one external helper where we would be calculating the exact text span we're interested in from the selection region. This is what will be responsible for trimming trailing whitespaces. We do this once and pass the result into GenHTML and GenRTF.

{
if (shouldFormatRow)
attrs = std::move(row.Attributes().slice(colBegin, colEnd));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to move return values, since C++11 and later have RVO - return value optimization. (Also, they're in the so called prvalue category (p = pure) and everything in the more general rvalue category behaves as if it's being moved.)

In fact, this code is a tiny bit dangerous, because if slice() was to return a mutable reference, you might steal the contents from the ROW instance.

Lastly, I think you can move the attrs variable from the other scope in here. Since slice() returns a new instance every time there's no way to reuse memory here anyways.

const int fontHeightPoints,
const std::wstring_view fontFaceName,
const COLORREF backgroundColor)
const COLORREF backgroundColor,
const bool isIntenseBold)
{
try
{
std::ostringstream htmlBuilder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: If you ever want to, on my end, you're more than welcome to replace the std::ostringstream with a std::string and use regular .append() or += and fmt::format_to(FMT_COMPILE(...), ...) instead. 😄 sstream classes have nasty overhead, both for binary size and runtime performance. And they aren't even particularly powerful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can take the alternative approach we're discussing above. If not, I'll come back to this 🙂

@tusharsnx
Copy link
Contributor Author

Closing in favor of #16377

@tusharsnx tusharsnx closed this Dec 7, 2023
@tusharsnx tusharsnx deleted the copypasta branch December 7, 2023 06:30
DHowett pushed a commit that referenced this pull request Jan 29, 2024
`TextBuffer::GenHTML` and `TextBuffer::GenRTF` now read directly from
the TextBuffer.

- Since we're reading from the buffer, we can now read _all_ the
attributes saved in the buffer. Formatted copy now copies most (if not
all) font/color attributes in the requested format (RTF/HTML).
- Use `TextBuffer::CopyRequest` to pass all copy-related options into
text generation functions as one unit.
- Helper function `TextBuffer::CopyRequest::FromConfig()` generates a
copy request based on Selection mode and user configuration.
- Both formatted text generation functions now use `std::string` and
`fmt::format_to` to generate the required strings. Previously, we were
using `std::ostringstream` which is not recommended due to its potential
overhead.
- Reading attributes from `ROW`'s attribute RLE simplified the logic as
we don't have to track attribute change between the text.
- On the caller side, we do not have to rebuild the plain text string
from the vector of strings anymore. `TextBuffer::GetPlainText()` returns
the entire text as one `std::string`.
- Removed `TextBuffer::TextAndColors`.
- Removed `TextBuffer::GetText()`. `TextBuffer::GetPlainText()` took its
place.

This PR also fixes two bugs in the formatted copy:

- We were applying line breaks after each selected row, even though the
row could have been a Wrapped row. This caused the wrapped rows to break
when they shouldn't.
- We mishandled Unicode text (\uN) within the RTF copy. Every next
character that uses a surrogate pair or high codepoint was missing in
the copied text when pasted to MSWord. The command `\uc4` should have
been `\uc1`, which is used to tell how many fallback characters are used
for each Unicode codepoint (\u). We always use one `?` character as the
fallback.

Closes #16191

**References and Relevant Issues**

- #16270

**Validation Steps Performed**

- Casual copy-pasting from Terminal or OpenConsole to word editors works
as before.
- Verified HTML copy by copying the generated HTML string and running it
through an HTML viewer.
[Sample](https://codepen.io/tusharvickey/pen/wvNXbVN)
- Verified RTF copy by copy-pasting the generated RTF string into
MSWord.
- SingleLine mode works (<kbd>Shift</kbd>+ copy)
- BlockSelection mode works (<kbd>Alt</kbd> selection)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-TerminalControl Issues pertaining to the terminal control (input, selection, keybindings, mouse interaction, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Terminal The new Windows Terminal.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Font variant ignored by copyFormatting
3 participants