Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML syntax error in stream writer #1519

Closed
IAkumaI opened this issue Apr 10, 2023 · 10 comments
Closed

XML syntax error in stream writer #1519

IAkumaI opened this issue Apr 10, 2023 · 10 comments
Labels
confirmed This issue can be reproduced

Comments

@IAkumaI
Copy link
Contributor

IAkumaI commented Apr 10, 2023

Description

We have this line: https://github.com/qax-os/excelize/blob/master/cell.go#L495 which escapes text for xml via xml.EscapeText.
Than buff.String() goes to trimCellValue function which has this https://github.com/qax-os/excelize/blob/master/cell.go#L462

if len(value) > TotalCellChars {
	value = value[:TotalCellChars]
}

Because of how xml.EscapeText works, it makes string much more sometimes (for example, if we put html into it).
Than value[:TotalCellChars] cuts this escaped xml without care of tags.

So, if final escaped XML has ending like this ">" and overflows TotalCellChars by 1 or 2, out XML becomes invalid. It will have invalid unclosed tag.

Steps to reproduce the issue:

value := strings.Repeat("<>", TotalCellChars/2)
f := NewFile()
streamWriter, _ := f.NewStreamWriter("Sheet1")
_ = streamWriter.SetRow("A1", []interface{}{value}) // No errors
_ = streamWriter.Flush() // Still no errors

_, err := f.GetCellValue("Sheet1", "A1") // XML syntax error: invalid character entity &gt (no semicolon)

Describe the results you received:

invalid character entity &gt (no semicolon)

Describe the results you expected:

It works.

Output of go version:

go version go1.19.1 darwin/arm64

Excelize version or commit ID:

v2.7.1

Environment details (OS, Microsoft Excel™ version, physical, etc.):
MacOS Ventura

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Apr 10, 2023

Simple test function:

func TestSetLongCellValuesXML(t *testing.T) {
	value := strings.Repeat("<>", TotalCellChars/2)

	f := NewFile()
	streamWriter, _ := f.NewStreamWriter("Sheet1")
	err := streamWriter.SetRow("A1", []interface{}{value})
	assert.NoError(t, err)
	err = streamWriter.Flush()
	assert.NoError(t, err)

	v, err := f.GetCellValue("Sheet1", "A1")
	assert.NoError(t, err)
	assert.Equal(t, value, v)
}

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Apr 10, 2023

As a temporary solution for someone who faced this bug:

if len(val) > excelize.TotalCellChars/5 {
	for {
		buff := &bytes.Buffer{}
		_ = xml.EscapeText(buff, []byte(val))
		if buff.Len() <= excelize.TotalCellChars {
			break
		}

		val = utils.Utf8Substr(val, 0, utf8.RuneCountInString(val)-(buff.Len()-excelize.TotalCellChars))
	}
}

Use it for every value which you want to add into stream writer.
But this one is very dirty and slow.

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Apr 11, 2023

Any thoughts?

@xuri xuri added the confirmed This issue can be reproduced label Apr 12, 2023
@xuri
Copy link
Member

xuri commented Apr 12, 2023

Thanks for your issue. We need to change the order for trim cell value and escape XML characters. Combine these two steps in trimCellValue, for example:

diff --git a/cell.go b/cell.go
index 1f01ce3..edfcb3c 100644
--- a/cell.go
+++ b/cell.go
@@ -451,17 +451,22 @@ func (f *File) setSharedString(val string) (int, error) {
 	sst.Count++
 	sst.UniqueCount++
 	t := xlsxT{Val: val}
-	val, t.Space = trimCellValue(val)
+	val, t.Space = trimCellValue(val, false)
 	sst.SI = append(sst.SI, xlsxSI{T: &t})
 	f.sharedStringsMap[val] = sst.UniqueCount - 1
 	return sst.UniqueCount - 1, nil
 }
 
 // trimCellValue provides a function to set string type to cell.
-func trimCellValue(value string) (v string, ns xml.Attr) {
+func trimCellValue(value string, escape bool) (v string, ns xml.Attr) {
 	if utf8.RuneCountInString(value) > TotalCellChars {
 		value = string([]rune(value)[:TotalCellChars])
 	}
+	buf := &bytes.Buffer{}
+	if escape {
+		_ = xml.EscapeText(buf, []byte(value))
+		value = buf.String()
+	}
 	if len(value) > 0 {
 		prefix, suffix := value[0], value[len(value)-1]
 		for _, ascii := range []byte{9, 10, 13, 32} {
@@ -492,15 +497,13 @@ func (c *xlsxC) setCellValue(val string) {
 // string.
 func (c *xlsxC) setInlineStr(val string) {
 	c.T, c.V, c.IS = "inlineStr", "", &xlsxSI{T: &xlsxT{}}
-	buf := &bytes.Buffer{}
-	_ = xml.EscapeText(buf, []byte(val))
-	c.IS.T.Val, c.IS.T.Space = trimCellValue(buf.String())
+	c.IS.T.Val, c.IS.T.Space = trimCellValue(val, true)
 }
 
 // setStr set cell data type and value which containing a formula string.
 func (c *xlsxC) setStr(val string) {
 	c.T, c.IS = "str", nil
-	c.V, c.XMLSpace = trimCellValue(val)
+	c.V, c.XMLSpace = trimCellValue(val, false)
 }
 
 // getCellDate parse cell value which containing a boolean.
@@ -1031,7 +1034,7 @@ func setRichText(runs []RichTextRun) ([]xlsxR, error) {
 			return textRuns, ErrCellCharsLength
 		}
 		run := xlsxR{T: &xlsxT{}}
-		run.T.Val, run.T.Space = trimCellValue(textRun.Text)
+		run.T.Val, run.T.Space = trimCellValue(textRun.Text, false)
 		fnt := textRun.Font
 		if fnt != nil {
 			run.RPr = newRpr(fnt)

What do you think about, and whould you like made a PR for it?

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Apr 14, 2023

I'll definitely take a look. Just a bit busy this days.

I was thinking about an edge case.
Assume, we have value of TotalCellChars-1 length. At this point we don't have to trim it.
Then, value goes to XML escaping which makes it length to TotalCellChars++++

What should we do here?
As a user, I know that excel has a limit up to TotalCellChars chars and I DID follow this limitation, but because of xml escaping (I don't know about it as a user), my cell will be trimmed, so I'm gonna lose some data. If the data contains html - i'll lose huge part of this data.

I tried to create an example xlsx file (attached), which contains 32767 > symbols in one cell and it works.
So the question is - why and how? Doesn't Excel escape xml?

32767-test.xlsx

@xuri xuri closed this as completed in 17c0294 Apr 16, 2023
@xuri
Copy link
Member

xuri commented Apr 16, 2023

Thanks for your feedback. I have fixed this issue. Please upgrade to the master branch code. This patch will be released in the next version.

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Apr 16, 2023

Great, thank you!

Just a little fix:
https://github.com/qax-os/excelize/blob/master/cell.go#L465
We don't need buff here. It should be inside if escape clause.

@xuri
Copy link
Member

xuri commented Apr 17, 2023

Thanks for your advice. I've fixed it in the commit d0ad0f3.

@IAkumaI
Copy link
Contributor Author

IAkumaI commented Jun 23, 2023

@xuri Hi, what about the next release with the fix? It's been a long time since April

@xuri
Copy link
Member

xuri commented Jun 26, 2023

This library usually releases a stable version every 3 or 4 months, if you need to use the code on the master branch version, please use the Go module to lock the version.

xuri added a commit to JDavidVR/excelize that referenced this issue Jul 11, 2023
jenbonzhang pushed a commit to jenbonzhang/excelize that referenced this issue Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed This issue can be reproduced
Projects
None yet
Development

No branches or pull requests

2 participants