Commit

that is still fairly ugly
ArthurZucker committed Jun 19, 2024
1 parent 8c36539 commit 9d389bc
Showing 2 changed files with 4 additions and 14 deletions.
16 changes: 3 additions & 13 deletions tokenizers/src/tokenizer/mod.rs
@@ -862,19 +862,9 @@ where
"Pre-tok String: {} vs token {} vs pret {:?}",
string.original,
token,
string
.get_splits(OffsetReferential::Normalized, OffsetType::Byte)
.first()
.unwrap()
string.splits.first().unwrap().normalized.normalized.clone()

[CI annotation] Check failure on line 865 in tokenizers/src/tokenizer/mod.rs, reported by all four GitHub Actions jobs (Check everything builds & tests (ubuntu-latest); Check it builds for Windows 32-bit (3.7), (3.9), (3.10)): field `splits` of struct `tokenizer::pre_tokenizer::PreTokenizedString` is private
);
-                    Some(
-                        string
-                            .get_splits(OffsetReferential::Normalized, OffsetType::Byte)
-                            .first()
-                            .unwrap()
-                            .0
-                            .to_string(),
-                    )
+                    Some(string.splits.first().unwrap().normalized.normalized.clone())

[CI annotation] Check failure on line 867 in tokenizers/src/tokenizer/mod.rs, reported by the same four GitHub Actions jobs: field `splits` of struct `tokenizer::pre_tokenizer::PreTokenizedString` is private
} else {
println!("String: {}", token);
Some(token)
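The CI failures above come from Rust's field privacy rules: `splits` is declared without `pub` inside the pre-tokenizer module, so `string.splits` only compiles for code in that module, while the removed `get_splits` accessor is the public path. A minimal standalone sketch (simplified, hypothetical types; not the real crate's definitions) of that pattern:

```rust
// Sketch of the privacy issue flagged by CI (simplified names; the real
// struct is tokenizer::pre_tokenizer::PreTokenizedString).
mod pre_tokenizer {
    pub struct PreTokenizedString {
        // Private field: `s.splits` from outside this module is
        // error[E0616]: field `splits` ... is private.
        splits: Vec<String>,
    }

    impl PreTokenizedString {
        pub fn new(splits: Vec<String>) -> Self {
            Self { splits }
        }

        // Public accessor, analogous in spirit to the `get_splits`
        // call the commit removed.
        pub fn get_splits(&self) -> &[String] {
            &self.splits
        }
    }
}

fn main() {
    let s = pre_tokenizer::PreTokenizedString::new(vec!["Hey".into(), "!".into()]);
    // let first = &s.splits[0]; // would not compile: field is private
    let first = &s.get_splits()[0]; // compiles: goes through the accessor
    println!("{first}");
}
```

This is why the direct field access only works if the field (or a chain of fields, as with `normalized.normalized`) is made `pub`, which is exactly what the second file in this commit does for `NormalizedString`.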
@@ -1334,7 +1324,7 @@ mod test {
let mut tokenizer = Tokenizer::from_pretrained("meta-llama/Meta-Llama-3-8B", None).unwrap();
tokenizer.add_tokens(&[AddedToken::from("ĠåĹİ", false)]); // this is the byte-level for 嗎
let encoded = tokenizer
.encode("Hey! how is this token: 嗎", false)
.encode("Hey! how is this token: 嗎 and ĠåĹİ", false)
.unwrap();
println!("Encoded tokens: {:?}", encoded.get_ids());
let decoded = tokenizer.decode(encoded.get_ids(), false);
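The comment in the test says `ĠåĹİ` is "the byte-level for 嗎": GPT-2-style byte-level pre-tokenization maps every byte to a printable character (printable bytes to themselves, the rest shifted past U+0100), so `" 嗎"` (0x20 0xE5 0x97 0x8E in UTF-8) renders as `ĠåĹİ`. A self-contained sketch of that mapping (not code from this commit):

```rust
// GPT-2-style byte-to-unicode table: printable bytes map to themselves,
// all other bytes are assigned characters starting at U+0100 so every
// byte has a visible, unique representation.
fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut shifted: u32 = 0;
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)   // '!'..='~'
            || (0xA1..=0xAC).contains(&b)            // '¡'..='¬'
            || (0xAE..=0xFF).contains(&b);           // '®'..='ÿ'
        table[b as usize] = if printable {
            char::from_u32(b).unwrap()
        } else {
            let c = char::from_u32(256 + shifted).unwrap();
            shifted += 1;
            c
        };
    }
    table
}

fn main() {
    let table = bytes_to_unicode();
    // " 嗎" is the bytes 0x20 0xE5 0x97 0x8E; map each one.
    let token: String = " 嗎".bytes().map(|b| table[b as usize]).collect();
    println!("{token}"); // ĠåĹİ  (Ġ is the mapped space byte 0x20)
}
```

This is why the added token `AddedToken::from("ĠåĹİ", false)` corresponds to `" 嗎"` with a leading space, and why the test appends the raw `ĠåĹİ` string to the input to probe both spellings.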
2 changes: 1 addition & 1 deletion tokenizers/src/tokenizer/normalizer.rs
@@ -100,7 +100,7 @@ pub struct NormalizedString {
/// The original version of the string, before any modification
original: String,
/// The normalized version of the string, after all modifications
-    normalized: String,
+    pub normalized: String,
/// Mapping from normalized string to original one: (start, end) for each
/// byte of the normalized string
alignments: Vec<(usize, usize)>,
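Making `normalized` pub is the one-line enabler for the field-access chain in the first file: callers can now read `...normalized.normalized` directly, while `original` and `alignments` stay private. A simplified, hypothetical sketch of the struct after this change (the `lowercased` constructor is invented for illustration and is not in the crate):

```rust
#[allow(dead_code)]
pub mod normalizer {
    pub struct NormalizedString {
        /// The original version of the string, before any modification
        original: String,
        /// The normalized version of the string, after all modifications
        pub normalized: String, // now pub, per this commit
        /// Mapping from normalized string to original one: (start, end)
        /// for each byte of the normalized string
        alignments: Vec<(usize, usize)>,
    }

    impl NormalizedString {
        // Hypothetical helper for the sketch: ASCII lowercasing is 1:1 in
        // bytes, so each normalized byte aligns to the same byte range.
        pub fn lowercased(original: &str) -> Self {
            let normalized = original.to_ascii_lowercase();
            let alignments = (0..normalized.len()).map(|i| (i, i + 1)).collect();
            Self { original: original.to_string(), normalized, alignments }
        }
    }
}

fn main() {
    let s = normalizer::NormalizedString::lowercased("Hey!");
    println!("{}", s.normalized); // direct field read, no getter needed
}
```

The trade-off (perhaps part of why the commit message calls this "still fairly ugly") is that a pub field becomes part of the crate's public API and can no longer be renamed or restructured without a breaking change, whereas an accessor like `get_splits` keeps the representation private.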

0 comments on commit 9d389bc