Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add special build for testing serialization via a serialization roundtrip in JIT compilation and fix serialization leaks #7763

Merged
merged 36 commits into from
Nov 6, 2023

Conversation

TH3CHARLie
Copy link
Contributor

@TH3CHARLie TH3CHARLie commented Aug 15, 2023

relates #7745

As discussed in dev meeting, I add back the JIT roundtrip test. It works by intercepting the compile_jit call, doing a serialization roundtrip inside it and using serialized results to do the compilation.

Because its rather hacky, the code is included in #ifdef blocks, and a compile option WITH_SERIALIZATION_JIT is added, this should be a test-only option that builds a special version of Halide that have no use other than serialization testing. Under this special build, correctness test suite becomes serialization correctness test.

Welcome all thoughts of comments and feedbacks @steven-johnson @abadams @derek-gerstmann

@TH3CHARLie TH3CHARLie marked this pull request as draft August 15, 2023 21:14
@TH3CHARLie TH3CHARLie changed the title Add special build for testing serialization via JIT roundtrip Add special build for testing serialization via a serialization roundtrip in JIT compilation Aug 15, 2023
@TH3CHARLie TH3CHARLie marked this pull request as ready for review August 15, 2023 22:08
src/CMakeLists.txt Outdated Show resolved Hide resolved
src/Pipeline.cpp Outdated
@@ -582,6 +582,25 @@ void Pipeline::compile_jit(const Target &target_arg) {
// Clear all cached info in case there is an error.
contents->invalidate_cache();

#ifdef WITH_SERIALZATION_JIT
// TODO: replace file serialization with in-memory serialization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please open a Issue for this task and add the comment here e.g. TODO(http://etc):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derek already implement the API in his PR so I will refer to that

src/Pipeline.cpp Outdated Show resolved Hide resolved
src/Pipeline.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor nits

@TH3CHARLie
Copy link
Contributor Author

Updated

@@ -508,6 +508,15 @@ if (WITH_SERIALIZATION)
target_compile_definitions(Halide PRIVATE WITH_SERIALIZATION)
endif ()

# Enable serialization testing by intercepting JIT compilation with a serialization roundtrip;
# This is used only for special builds made specifically for testing, and must be disabled by default.
option(WITH_SERIALIZATION_JIT "Intercepting JIT compilation with a serialization roundtrip, for test only" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: rename this to something more evocative of testing, e.g. WITH_SERIALIZATION_JIT_ROUNDTRIP_TESTING

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I took the liberty of making this change myself, so that I could test the necessary buildbot changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sry, I should have done it last night, thanks!

steven-johnson added a commit to halide/build_bot that referenced this pull request Aug 22, 2023
@steven-johnson

This comment was marked as outdated.

@steven-johnson
Copy link
Contributor

(I'll see if I can get a local build with symbols enabled to get any better info on this)

@steven-johnson

This comment was marked as outdated.

@steven-johnson steven-johnson self-requested a review August 24, 2023 00:28
Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to understand how/why these leaks are occurring before this can land -- are they intrinsic to the serializer, or a byproduct of the testing hack?

@TH3CHARLie
Copy link
Contributor Author

thanks for reporting this, this directly confirms my previous hypothesis that serialization code contains leaks, we found it because our fuzzer driver will oom after running 30k ish pipelines(each run has a deserialization call). i'm pretty sure it's not because of the jit hack, it's part of the serialization. Last time i had no luck using valgrind, tmr i'll try figure it out

@derek-gerstmann
Copy link
Contributor

It’s probably easier/more accurate to use the asan build configuration for Halide. Val grind is notoriously slow since it’s a runtime intercept. If you look at the symbolized call sites that Steven posted, the allocation locations should lead you to areas to investigate. It appears to be IR and Parameter references that aren’t getting cleaned up or have ownership cycles.

@abadams
Copy link
Member

abadams commented Aug 24, 2023

One place to check is any new constructor-like things you have added, e.g. Function::update_with_deserialization

The normal ways Functions acquire right-hand-sides have various checks in place to avoid reference cycles, and that may not be true of the new ways you added.

@TH3CHARLie
Copy link
Contributor Author

One place to check is any new constructor-like things you have added, e.g. Function::update_with_deserialization

The normal ways Functions acquire right-hand-sides have various checks in place to avoid reference cycles, and that may not be true of the new ways you added.

Can you give an example of the existing checks?

@TH3CHARLie
Copy link
Contributor Author

okay, it's just like Andrew has guessed, Function::update_with_deserialization is the root cause of all leaks, why? because in build_reverse_function_mapping I added the function ptr as they are and they are later used in places like wrappers and extern func args, while all these function ptrs are strong ptrs. I have a fix now works on some cases I've tested, will file a PR once all tests passed locally.

@TH3CHARLie
Copy link
Contributor Author

Maybe pushing the updates here in this PR is easier for testing?

@steven-johnson
Copy link
Contributor

OK, fix applied, wait for it to go green (again)

@steven-johnson
Copy link
Contributor

Still one failure:

https://buildbot.halide-lang.org/master/#/builders/42/builds/54/steps/19/logs/correctness_vector_math

/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-64-linux-cmake/halide-build/test/correctness/correctness_vector_math
vector_math test seed: 1695764725
Testing floatx4
Testing floatx8
Testing doublex2
Testing uint8_tx16
Testing int8_tx16
Testing uint16_tx8
Warning: deserialized pipeline is empty
Required regular expression not found. Regex=[Success!]

@TH3CHARLie
Copy link
Contributor Author

TH3CHARLie commented Sep 27, 2023

Locally this test passed, I have a theory that we are deserializing from files and several tests are running together with some race issue on the same file names (i.e. f0.hlpipe), that's why we are seeing deserializing from empty pipeline warnings.

@TH3CHARLie
Copy link
Contributor Author

To verify my theory, the new commit now uses serialize to memory instead of files, passed all tests on CPU locally, let's see how it goes on the buildbots.

@TH3CHARLie
Copy link
Contributor Author

These failures shouldn't be related to this PR.....

@abadams
Copy link
Member

abadams commented Sep 29, 2023

Looks like the serialization tutorial is throwing an error

@TH3CHARLie
Copy link
Contributor Author

I see the same error in the tutorial PR, then what about the fast_pow one https://buildbot.halide-lang.org/master/#/builders/82/builds/75

@derek-gerstmann
Copy link
Contributor

derek-gerstmann commented Sep 29, 2023

I see the same error in the tutorial PR, then what about the fast_pow one https://buildbot.halide-lang.org/master/#/builders/82/builds/75

I'm confused ... I don't see a failure for the serialization tutorial in the PR I made: #7760

Oh ... wait ... nvm. Sorry ... it appears to fail due to WITH_SERIALIZATION not being set on OSX?

@derek-gerstmann
Copy link
Contributor

derek-gerstmann commented Sep 29, 2023

@TH3CHARLie Sorry for not catching this .... adding this to the Makefile should resolve the tutorial issue:


Makefile Line 286:
...
TUTORIAL_CXX_FLAGS ?= -std=c++17 -g -fno-omit-frame-pointer $(RTTI_CXX_FLAGS) -I $(ROOT_DIR)/tools $(SANITIZER_FLAGS) $(LLVM_CXX_FLAGS_LIBCPP)
+ ifneq (,$(shell which flatc))
+ TUTORIAL_CXX_FLAGS += -DWITH_SERIALIZATION -I $(BUILD_DIR) -I $(shell which flatc | sed 's/bin.flatc/include/')
+ endif
...

@TH3CHARLie
Copy link
Contributor Author

@derek-gerstmann tutorial make failure still exsits, the x86 failure seems unrelated though

@derek-gerstmann
Copy link
Contributor

@derek-gerstmann tutorial make failure still exsits, the x86 failure seems unrelated though

That would indicate that flatc isn't installed on the machine. You can add a rule to skip running the tutorial if flatc is not found (right after the

tutorial_%: $(BIN_DIR)/tutorial_% $(TMP_DIR)/images/rgb.png $(TMP_DIR)/images/gray.png
	@-mkdir -p $(TMP_DIR)
	cd $(TMP_DIR) ; $(CURDIR)/$<
	@-echo

+ # Skip the serialization tutorial, if we didn't build -DWITH_SERIALIZATION
+ ifeq (,$(shell which flatc))
+ tutorial_lesson_23_serialization: 
+ 	@echo "Skipping tutorial lesson 23 (serialization not enabled) ..." 
+ endif


@derek-gerstmann
Copy link
Contributor

Also, it looks like the makefile checks for flatc ... shouldn't it be checking for flatcc?

@TH3CHARLie
Copy link
Contributor Author

OK now all failures look unrelated (failing on host targets that are not serialization test), PTAL @steven-johnson @derek-gerstmann @abadams

Makefile Outdated
@@ -288,6 +288,9 @@ LLVM_SHARED_LIBS = -Wl,-rpath=$(LLVM_LIBDIR) -L $(LLVM_LIBDIR) -lLLVM
LLVM_LIBS_FOR_SHARED_LIBHALIDE=$(if $(WITH_LLVM_INSIDE_SHARED_LIBHALIDE),$(LLVM_STATIC_LIBS),$(LLVM_SHARED_LIBS))

TUTORIAL_CXX_FLAGS ?= -std=c++17 -g -fno-omit-frame-pointer $(RTTI_CXX_FLAGS) -I $(ROOT_DIR)/tools $(SANITIZER_FLAGS) $(LLVM_CXX_FLAGS_LIBCPP)
ifneq (,$(shell which flatc))
TUTORIAL_CXX_FLAGS += -DWITH_SERIALIZATION -I $(BUILD_DIR) -I $(shell which flatc | sed 's/bin.flatc/include/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is necessary? These flags are the ones that would be set by users of Halide. It would be bad if they needed to set any of this to use serialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just remove this part, we already skipping tutorial 23 if no flatc found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

src/Serialization.cpp Outdated Show resolved Hide resolved
@derek-gerstmann derek-gerstmann self-requested a review October 9, 2023 20:01
@TH3CHARLie
Copy link
Contributor Author

Most failures are related to webGPU, bad_partition_always is failing due to a missing serialization, which I will fix in a later PR, landing this now.

@TH3CHARLie TH3CHARLie merged commit e5bf7ab into halide:main Nov 6, 2023
3 checks passed
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
…trip in JIT compilation and fix serialization leaks (halide#7763)

* add back JIT testing, enclosed in #ifdef blocks

* fix typo

* nits

* WITH_SERIALIZATION_JIT->WITH_SERIALIZATION_JIT_ROUNDTRIP_TESTING

* fix self-reference leaks: now uses weak function ptr in reverse function mappings

* Move clang-tidy checks back to Linux

Recent changes in the GHA runners for macOS don't play well with clang-tidy; rather than sink any more time into debugging it, I'm going to revert the relevant parts of halide#7746 so that it runs on the less-finicky Linux runners instead.

* bogus

* Update Generator.cpp

* Update Generator.cpp

* call copy_to_host before serializing buffers

* throw an error if we serialize on-device buffer

* Skip specialize_to_gpu

* Update Pipeline.cpp

* Skip two more tests

* use serialize to memory during jit testing

* makefile update

* makefile fix

* skip the tutorial if flatc is not there

* fix

* fix signature

* fix makefile

* trigger buildbot

---------

Co-authored-by: Steven Johnson <srj@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants