fix: utf8 encoding (#2555)
# Description

Remove the regex substitution that replaced every run of non-ASCII characters with a space, so that UTF-8 text in parsed documents is preserved.
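
To illustrate what the removed substitution did to non-ASCII text, here is a minimal sketch. The pattern and the NUL-stripping step are copied from the diff; the helper names (`strip_non_ascii`, `strip_nul`) and the sample string are hypothetical and exist only for this example.

```python
import re

# Pattern copied from the line this commit deletes.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text: str) -> str:
    """Old behaviour: replace each run of non-ASCII characters with one space."""
    return NON_ASCII_RUN.sub(" ", text)


def strip_nul(text: str) -> str:
    """Behaviour kept by this commit: remove only NUL characters."""
    return text.replace("\u0000", "")


# Hypothetical sample mixing accented Latin text, a NUL byte, and Japanese.
sample = "Café déjà vu\u0000: 日本語"

old_result = strip_non_ascii(strip_nul(sample))  # before this commit
new_result = strip_nul(sample)                   # after this commit

print(repr(old_result))
print(repr(new_result))
```

With the old step, accented letters and all non-Latin scripts are lost before the content is tokenized; after this commit only the NUL character is removed.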

## Checklist before requesting a review

Please delete options that are not relevant.

- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my code
- [ ] I have commented hard-to-understand areas
- [ ] I have ideally added tests that prove my fix is effective or that
my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] Any dependent changes have been merged

## Screenshots (if appropriate):
chloedia authored May 7, 2024
1 parent cde7580 commit 748733d
Showing 1 changed file with 0 additions and 3 deletions.
3 changes: 0 additions & 3 deletions backend/packages/files/parsers/common.py
@@ -1,5 +1,4 @@
 import os
-import re
 import tempfile
 import time

@@ -84,8 +83,6 @@ async def process_file(
 doc.page_content = f"Filename: {new_metadata['original_file_name']} Content: {doc.page_content}"

 doc.page_content = doc.page_content.replace("\u0000", "")
-# Replace unsupported Unicode characters
-doc.page_content = re.sub(r"[^\x00-\x7F]+", " ", doc.page_content)

 len_chunk = len(enc.encode(doc.page_content))
