
deleting a file during post-processing messes up the containing folder #7909

Closed
individual-it opened this issue Dec 7, 2023 · 8 comments · Fixed by #8119
Labels
Priority:p2-high Escalation, on top of current planning, release blocker Type:Bug


@individual-it
Member

individual-it commented Dec 7, 2023

Describe the bug

When a file is deleted during post-processing, the whole containing folder can become unusable

Steps to reproduce

folder='ff'
file1='ff.txt'
file2='ff2.txt'
file3='ff3.txt'
# get the drive id
driveid=`curl -ku admin:admin https://host.docker.internal:9200/graph/v1.0/me/drives | jq -c '.[] | .[] | select(.driveAlias | contains("personal")) | .id' -r`

# create a folder
curl -ku admin:admin -XMKCOL 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder

# upload one file
curl -ku admin:admin -XPUT -d"content" 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder'/'$file1

# wait till the post-processing is done, check with:
curl -ku admin:admin -XPROPFIND 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder | xmllint --format -

# create, delete and overwrite the same file over and over again
for i in {1..1000}
do
    curl -ku admin:admin -XPUT -d"content" 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder'/'$file2
    curl -ku admin:admin -XDELETE 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder'/'$file2
    curl -ku admin:admin -XPUT -d"content" 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder'/'$file2
done

# upload a third file
curl -ku admin:admin -XPUT -d"content" 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder'/'$file3

# check the content of the folder
curl -ku admin:admin -XPROPFIND 'https://host.docker.internal:9200/dav/spaces/'$driveid'/'$folder | xmllint --format -

Expected behavior

ff2.txt and ff3.txt should eventually finish post-processing

Actual behavior

ff2.txt and ff3.txt never finish post-processing

Setup

Please describe how you started the server and provide a list of relevant environment variables or configuration files.

POSTPROCESSING_DELAY=30s PROXY_ENABLE_BASIC_AUTH=true OCIS_INSECURE=true PROXY_HTTP_ADDR=0.0.0.0:9200 OCIS_URL=https://host.docker.internal:9200 PROXY_TRANSPORT_TLS_KEY=./ocis.pem PROXY_TRANSPORT_TLS_CERT=./ocis.crt ./ocis/bin/ocis server

Additional context

logs:

{"level":"error","service":"ocdav","name":"com.owncloud.web.ocdav","traceid":"00000000000000000000000000000000","request-id":"deepthought3/Mvoj78s77G-002127","spaceid":"e4d3bb66-6f2d-455c-9ebc-e518e89d3c06$00364da3-abb6-47d0-8002-6cd66ed6eea5","path":"/ff/ff2.txt","error":"Put \"https://host.docker.internal:9200/data\": context canceled","time":"2023-12-07T17:36:02+05:45","message":"error doing PUT request to data service"}
2023/12/07 17:36:02 http: proxy error: context canceled
{"level":"error","service":"frontend","pkg":"rhttp","traceid":"00000000000000000000000000000000","request-id":"deepthought3/Mvoj78s77G-002129","error":"Put \"http://localhost:9158/data/simple/47d1ecc3-888a-4ef7-a159-bbf9f99ca9f9\": context canceled","time":"2023-12-07T17:36:02+05:45","message":"error doing PUT request to data service"}
{"level":"error","service":"frontend","pkg":"rhttp","traceid":"00000000000000000000000000000000","host":"127.0.0.1","method":"PUT","uri":"/data","url":"/","proto":"HTTP/1.1","status":500,"size":0,"start":"07/Dec/2023:17:36:02 +0545","end":"07/Dec/2023:17:36:02 +0545","time_ns":231887,"time":"2023-12-07T17:36:02+05:45","message":"http"}
{"level":"error","error":"failed to upload file to blobstore: could not open blob '/home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs' for writing: open /home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs: is a directory","uploadID":"0645d53a-355c-4aca-9cb5-2177d11c582e","time":"2023-12-07T17:36:02+05:45","message":"could not finalize upload"}
{"level":"error","error":"decomposedfs: root has no parent","uploadID":"0645d53a-355c-4aca-9cb5-2177d11c582e","time":"2023-12-07T17:36:02+05:45","message":"could not propagate tree size change"}
{"level":"error","error":"failed to upload file to blobstore: could not open blob '/home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs' for writing: open /home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs: is a directory","uploadID":"1d13ffb8-24bf-4fdd-b321-d5c774fcf353","time":"2023-12-07T17:36:32+05:45","message":"could not finalize upload"}
{"level":"error","error":"decomposedfs: root has no parent","uploadID":"1d13ffb8-24bf-4fdd-b321-d5c774fcf353","time":"2023-12-07T17:36:32+05:45","message":"could not propagate tree size change"}
{"level":"error","error":"failed to upload file to blobstore: could not open blob '/home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs' for writing: open /home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5/blobs: is a directory","uploadID":"3f94e7d8-837d-446e-a256-2ef1bdd87aaa","time":"2023-12-07T17:37:02+05:45","message":"could not finalize upload"}
{"level":"error","error":"decomposedfs: root has no parent","uploadID":"3f94e7d8-837d-446e-a256-2ef1bdd87aaa","time":"2023-12-07T17:37:02+05:45","message":"could not propagate tree size change"}

@micbar micbar added the Priority:p2-high Escalation, on top of current planning, release blocker label Dec 8, 2023
@micbar micbar moved this from Qualification to Prio 2 in Infinite Scale Team Board Dec 8, 2023
@jvillafanez
Member

Maybe the main problem is the "context canceled" that shows up in the logs.

Assuming the context is attached to the ocdav request, I'm not sure what the consequence of canceling that context is. The request itself probably keeps going (unless there is a thread checking the "done" channel of the context, but I haven't seen any), but further requests from the ocdav service to a different service using the same context might be blocked (because the context is canceled). This will likely cause problems because some operations among the services won't take place.

As for why the context is canceled, my guess is that the connection gets cut. Assuming the ocdav service makes some requests (using that context) in a different thread, it might happen that the request itself finishes while the secondary thread making the asynchronous request is still running. The client would cut the connection after getting the server response, which could cancel the context, so the async request in that secondary thread might get into trouble due to the canceled context.
I haven't taken a deep look at the code, so maybe all requests are synchronous and the problem isn't this.
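
To make the suspicion concrete, a minimal, self-contained sketch (not oCIS code; the handler, port and target URL are made up): a downstream request that reuses the incoming request's context fails with "context canceled" as soon as the client hangs up.

package main

import (
	"net/http"
	"strings"
)

// The downstream PUT is bound to the incoming request's context. If the
// client disconnects before it completes, r.Context() is canceled and the
// PUT aborts with an error wrapping context.Canceled, similar to the
// "Put ... context canceled" lines in the logs above.
func handler(w http.ResponseWriter, r *http.Request) {
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPut,
		"http://localhost:9158/data/simple/some-upload-id", // hypothetical data service URL
		strings.NewReader("content"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
}

func main() {
	http.HandleFunc("/upload", handler)
	_ = http.ListenAndServe(":8080", nil)
}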

The other thing I've noticed is that the postprocessing service is single-threaded (at least the processing part). This means the service processes events one by one, so if one event takes a long time to process for whatever reason, all the events after it can be severely delayed.
This could explain why "ff3.txt" isn't out of postprocessing: the service could be stuck on one of the events from "ff2.txt" (or still processing those), with the events from "ff3.txt" still in the queue.
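
As an illustration of that head-of-line blocking (a toy sketch, not the actual postprocessing service): a single goroutine draining the event queue means one slow event delays everything queued behind it.

package main

import (
	"fmt"
	"time"
)

type event struct{ file string }

func main() {
	events := make(chan event, 10)
	go func() {
		events <- event{"ff2.txt"} // slow to finish (e.g. waiting out a delay)
		events <- event{"ff3.txt"} // has to wait for ff2.txt first
		close(events)
	}()

	start := time.Now()
	for ev := range events { // sequential: one event at a time
		time.Sleep(2 * time.Second) // stand-in for a long postprocessing step
		fmt.Printf("%s done after %v\n", ev.file, time.Since(start).Round(time.Second))
	}
}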

@jvillafanez
Member

From another perspective, the errors seem to indicate that the BlobID of the node is empty. That explains why the target blob points to a directory instead of the actual blob file.

The other, more plausible explanation is that the target node doesn't exist and we fall back to the space root:
https://github.com/cs3org/reva/blob/edge/pkg/storage/utils/decomposedfs/decomposedfs.go#L256
There are some scenarios where ReadNode returns the space root instead of the target node. We might need to check those scenarios.
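
A rough sketch of the first hypothesis (simplified path layout, not the real decomposedfs lookup): with an empty BlobID the joined blob path degenerates to the blobs directory itself, which matches the "is a directory" error above.

package main

import (
	"fmt"
	"path/filepath"
)

// blobPath joins the space root, the blobs directory and the blob ID.
// Simplified: the real decomposedfs shards the blob ID into subdirectories.
func blobPath(spaceRoot, blobID string) string {
	return filepath.Join(spaceRoot, "blobs", blobID)
}

func main() {
	root := "/home/artur/.ocis/storage/users/spaces/00/364da3-abb6-47d0-8002-6cd66ed6eea5"
	fmt.Println(blobPath(root, "47d1ecc3-888a-4ef7-a159-bbf9f99ca9f9")) // normal blob file
	fmt.Println(blobPath(root, ""))                                     // empty BlobID -> ".../blobs" (a directory)
}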

@butonic
Member

butonic commented Dec 18, 2023

I could observe a similar thing when trying to debug https://drone.owncloud.com/owncloud/ocis/30185/27/7

It seems I have a dangling symlink:

ls -l ~/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/so/me/-a/dm/in-user-id-0000-000000000000*
-rw------- 1 jfd jfd     0 24. Nov 16:32 /home/jfd/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/so/me/-a/dm/in-user-id-0000-000000000000.mlock
-rw------- 1 jfd jfd   376 18. Dez 15:59 /home/jfd/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/so/me/-a/dm/in-user-id-0000-000000000000.mpk

/home/jfd/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/so/me/-a/dm/in-user-id-0000-000000000000:
insgesamt 0
lrwxrwxrwx 1 jfd jfd 55  6. Dez 20:27 'Gregorz Rutkowski' -> ../../../../../66/46/ef/54/-420a-4481-ac07-42e3506d950e
lrwxrwxrwx 1 jfd jfd 55 18. Dez 15:58 'New file (1).txt' -> ../../../../../96/8e/bf/ad/-3a2e-4a47-b46a-ca7f2cd44dc9
lrwxrwxrwx 1 jfd jfd 55 18. Dez 15:48 'New file.txt' -> ../../../../../38/4d/8d/9f/-80e6-4e5c-b8d0-0eab4682dcac

New file.txt was a dangling symlink. In the log I would see

{"level":"error","service":"storage-users","pkg":"rgrpc","traceid":"00000000000000000000000000000000","path":"/home/jfd/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/38/4d/8d/9f/-80e6-4e5c-b8d0-0eab4682dcac","nodeid":"384d8d9f-80e6-4e5c-b8d0-0eab4682dcac","attrs":{},"messagepack":"\ufffd","time":"2023-12-18T16:07:18+01:00","message":"missing parent id"}

produced by the decomposedfs node.go with

	attrs, err := n.Xattrs(ctx)
	switch {
	case metadata.IsNotExist(err):
		return n, nil // swallow not found, the node defaults to exists = false
	case err != nil:
		return nil, err
	}
	n.Exists = true

	n.Name = attrs.String(prefixes.NameAttr)
	n.ParentID = attrs.String(prefixes.ParentidAttr)
	if n.ParentID == "" {
		d, _ := os.ReadFile(lu.MetadataBackend().MetadataPath(n.InternalPath()))
		if _, ok := lu.MetadataBackend().(metadata.MessagePackBackend); ok {
			appctx.GetLogger(ctx).Error().Str("path", n.InternalPath()).Str("nodeid", n.ID).Interface("attrs", attrs).Bytes("messagepack", d).Msg("missing parent id")
		}
		return nil, errtypes.InternalError("Missing parent ID on node")
	}

It was trying to read the metadata for a non-existing node in a directory ...

I'm still debugging other postprocessing stuff. It feels related. I will keep an eye on this.

@2403905 2403905 self-assigned this Dec 19, 2023
@2403905 2403905 moved this from Prio 2 to In progress in Infinite Scale Team Board Dec 19, 2023
@jvillafanez
Member

https://github.com/cs3org/reva/blob/edge/pkg/storage/utils/decomposedfs/upload/processing.go#L327 seems suspicious. As far as I know, the "acquireLock" parameter should always be true, otherwise there could be race conditions when multiple threads / processes compete to write extended attributes to the same file.
Same for

I haven't found conclusive evidence on whether the extended attributes are set atomically or not. If the operations aren't atomic, race conditions might happen when we set the extended attributes from multiple sides, which could cause problems with what gets written.
According to https://man7.org/linux/man-pages/man7/xattr.7.html, "Extended attributes are accessed as atomic objects"; however, there was an old bug in btrfs regarding non-atomic access (https://bst.cisco.com/quickview/bug/CSCvu54502). If the real behavior depends on the underlying FS, maybe we shouldn't rely on atomic operations on the extended attributes.
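
If xattr atomicity cannot be relied on across filesystems, one option is to serialize the writes with an explicit advisory lock on the node's metadata file. A minimal sketch (not the reva implementation; path and attribute name are made up):

package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// setXattrLocked takes an exclusive advisory flock on the file before
// writing the extended attribute, so concurrent writers cannot interleave.
func setXattrLocked(path, name string, value []byte) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		return err
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	return unix.Setxattr(path, name, value, 0)
}

func main() {
	// Made-up path and attribute name, purely for illustration.
	_ = setXattrLocked("/tmp/example-node", "user.example.parentid", []byte("parent-id"))
}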

@2403905
Contributor

2403905 commented Dec 22, 2023

@individual-it Because the delay in the test case is so long (POSTPROCESSING_DELAY=30s) and the postprocessing runs sequentially, it only looks like it never finishes: the loop queues roughly 2000 uploads, and if each 30s delay is served one after the other, the backlog alone takes on the order of 16 hours to drain.

The error {"level":"error","error":"failed to upload file to blobstore: could not open blob '/Users/roman/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/blobs' for writing: open /Users/roman/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/blobs: is a directory","uploadID":"d44c8306-1c40-4c71-8927-8f6ac4e45ad1","time":"2023-12-22T14:22:03+01:00","message":"could not finalize upload"} happened because we tried to upload after the node was deleted and we can't find a destination node as well.

@individual-it
Member Author

So is this an unrealistic test-case?

@2403905
Contributor

2403905 commented Dec 22, 2023

It sounds unrealistic to me. But we still need to fix the error that appears when we put a file and delete it immediately. The postprocessing has to handle that properly.

@2403905
Contributor

2403905 commented Dec 22, 2023

Here is a draft PR. @butonic @kobergj
Please have a look at the comment: cs3org/reva#4434 (review)
