Removing characters
AG committed Mar 4, 2024
1 parent ce9f3df commit c182ad7
Showing 1 changed file with 7 additions and 6 deletions.
13 changes: 7 additions & 6 deletions train.py
@@ -13,13 +13,13 @@
# RECIPES #
###########
# All with chunk size of 1024
-# ?.?MB: Target dictionary count 100,000, Prune 10,000,000
-# 8.5MB: Target dictionary count 25,000, Prune 10,000,000
-# 3.6MB: Target dictionary count 10,000, Prune 10,000,000
+# ?.?MB: Target dictionary count 100,000, Prune 1,000,000
+# 8.5MB: Target dictionary count 25,000, Prune 10,000,000
+# 5.4MB: Target dictionary count 10,000, Prune 10,000,000

-PRUNE_FREQUENCY = 10 * 1000 * 1000 # Every this many document positions
+PRUNE_FREQUENCY = 1000 * 1000 # Every this many document positions
CHUNK_SIZE = 1024 # 1KB per chunk
-TARGET_DICTIONARY_COUNT = 10 * 1000 * 1000
+TARGET_DICTIONARY_COUNT = 100 * 1000

# Define a flag to indicate when an interrupt has been caught
interrupted = False
@@ -191,8 +191,9 @@ def main():
word = words[j]
# Define a set of punctuation that is allowed within a word
internal_punctuation = {"'", "-"}
+additional_punctuation = {"“", "”"}
# Create a set of punctuation that signals the end of a word, excluding the internal punctuation
-ending_punctuation = set(string.punctuation) - internal_punctuation
+ending_punctuation = (set(string.punctuation) - internal_punctuation) | additional_punctuation

# Check for and remove ending punctuation from the word
cleaned_word = ''.join(char for char in word if char not in ending_punctuation)
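
The punctuation handling in this hunk can be tried in isolation. The following is a minimal sketch, not the code from train.py: the function name `clean_word` and the upper-case constant names are hypothetical, chosen only for illustration. It assumes the intent is to combine the ASCII punctuation set with the two curly quotes, which requires the set union operator `|`, since Python sets do not define `+`.

```python
import string

# Punctuation that may appear inside a word and is therefore kept.
INTERNAL_PUNCTUATION = {"'", "-"}
# Curly double quotes are not in string.punctuation, so they are listed explicitly.
ADDITIONAL_PUNCTUATION = {"\u201c", "\u201d"}

# Sets are combined with | (union); `set + set` raises TypeError.
ENDING_PUNCTUATION = (set(string.punctuation) - INTERNAL_PUNCTUATION) | ADDITIONAL_PUNCTUATION


def clean_word(word: str) -> str:
    """Drop every character of `word` that is in ENDING_PUNCTUATION."""
    return "".join(char for char in word if char not in ENDING_PUNCTUATION)


if __name__ == "__main__":
    for raw in ["\u201cHello,\u201d", "don't", "well-known."]:
        print(repr(raw), "->", repr(clean_word(raw)))
```

Note that the comprehension filters characters at any position, so punctuation in the middle of a token (for example the period in `a.b`) is removed as well, not only trailing punctuation.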