Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate noise from py-ipfs-api due to its mis-handling of 0-length payloads per #485 #572

Merged
merged 2 commits into from
Sep 25, 2018

Conversation

machawk1
Copy link
Member

The issue of using py-ipfs-api to add_bytes() of zero-length payloads is still an outstanding issue per ipfs-shipyard/py-ipfs-http-client#137. This causes a lot of noise by the ipwb indexer when encountering these records, like when indexing samples/warcs/IAH-20080430204825-00000-blackbook.warc.gz.

This PR prevents these records from being attempted to be pushed into IPFS using py-ipfs-api. When ipfs-shipyard/py-ipfs-http-client#137 is fixed, we can change the logic but for now, this will short-circuit the attempt from ipwb.

@machawk1 machawk1 requested a review from ibnesayeed September 25, 2018 17:03
@codecov
Copy link

codecov bot commented Sep 25, 2018

Codecov Report

Merging #572 into master will decrease coverage by 0.07%.
The diff coverage is 22.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
- Coverage   23.84%   23.76%   -0.08%     
==========================================
  Files           6        6              
  Lines        1191     1199       +8     
  Branches      179      180       +1     
==========================================
+ Hits          284      285       +1     
- Misses        887      893       +6     
- Partials       20       21       +1
Impacted Files Coverage Δ
ipwb/indexer.py 52.58% <22.22%> (-1.32%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3979cdc...941855b. Read the comment docs.

@ibnesayeed
Copy link
Member

We might want to add a special hash for now (e.g., 0000000000000000000...) to indicate zero bytes. And then check this well known hash in the replay to not attempt to fetch it. Current changes will ignore such records from being replayed as they are skipped from index.

Copy link
Member

@ibnesayeed ibnesayeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment under the Conversation tab

@machawk1
Copy link
Member Author

@ibnesayeed Fabricating data here seems like the wrong approach. It would be better to know what a blank string equates to as an IPFS hash and include it here rather than the 0s.

@ibnesayeed
Copy link
Member

I agree, but I do not know how to add an empty string/bytesarray to IFPS.

@machawk1 machawk1 merged commit e2caa9e into master Sep 25, 2018
@machawk1 machawk1 deleted the issue-485 branch September 25, 2018 19:32
@machawk1
Copy link
Member Author

Merging this for now. We can revisit it later but keeping the status quo makes for messy testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants