Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable hash values for k>64 #41

Closed
shenwei356 opened this issue Aug 30, 2022 · 4 comments
Closed

Unstable hash values for k>64 #41

shenwei356 opened this issue Aug 30, 2022 · 4 comments

Comments

@shenwei356
Copy link

shenwei356 commented Aug 30, 2022

Hi, thanks for inventing this great algorithm! And glad to see ntHash2 is also published.

I've used ntHash in several projects of mine, specifically, using a Golang implementation (by @will-rowe) of ntHash v1.0.4. It's really fast!

I just find that the hash values are unstable for k > 64. I guess it's related to the algorithm itself, cause it happens to be 64, not another number. I am not familiar with C/C++, so I can't check the original implementation. Here are steps with tools using the Golang implementation to reproduce the issue.

$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   DNA          1       65       65       65       65

# compute canonical ntHash with https://github.com/shenwei356/unikmer
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
14872199115326727818  ****

# now we add one base in the start position.
$ echo -ne ">s\naACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5042997269030491403
13219011773654478434  ****

$ echo -ne ">s\ncACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5252439041897790003
1248486797404628174  ****

$ echo -ne ">s\ngACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
6432712771380638299
10232415768797241538  ****

$ echo -ne ">s\ntACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5774425108696765737
11299031900349869606  ****

While for k=64, it's stable (7718595180140858881):

$ echo -ne ">s\nCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
7718595180140858881

$ echo -ne ">s\naCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
8752650216170443135
7718595180140858881

$ echo -ne ">s\ncCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
9222164657016850147
7718595180140858881

$ echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881

echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"   \
    | unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881
@parham-k
Copy link
Member

parham-k commented Aug 31, 2022

Hi @shenwei356 and thanks for your comments! This was a known issue with the early versions of ntHash (see this issue and also ntHash2's paper). We've fixed the periodicity problem in ntHash2. The Golang implementation probably contains the old bug that causes your issue.

Unfortunately, we don't have a Golang implementation for ntHash2. That said, ntHash2 can be used in Python with btllib.

@mohamadi
Copy link
Collaborator

mohamadi commented Sep 4, 2022

Thanks @shenwei356 for reporting this! We'll try to sync up with @will-rowe and @luizirber to have new releases for Golang and Rust implementation.

@shenwei356
Copy link
Author

shenwei356 commented Sep 5, 2022

I see, thank you all! Currently, I'll limit the k-mer size below 64 for my projects.

@parham-k
Copy link
Member

parham-k commented Sep 6, 2022

Apparently, SWIG (the tool used in btllib to generate Python wrappers) also supports Go. I'll try it out on ntHash and post any updates here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants