-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate our core representation to the typed choice sequence #3921
Comments
This is super interesting! |
Welcome, Jonathan! We'd love to have you continue contributing - I already really appreciate the type-annotation-improvements for our numpy and pandas extras, so this would be a third contribution 😻 @tybug might have some ideas here, but my impression is that the "refactor for an IR" project in this issue is more-or-less a serialized set of tasks and so adding a second person is unlikely to help much - even with just one we've had a few times where there were two or three PRs stacked up and accumulating merge conflicts between them. As an alternative, #3764 should be a fairly self-contained bugfix. On the more ambitious side, #3914 would also benefit from ongoing work on that - testing, observability, reporting whatever bugs you surface, etc. Or of course you're welcome to work on any other open issue which appeals to you! |
I was thinking that we'd still serialize to a bytestring - that's the ultimate interop format, and when we need to handle weird unicode and floats like subnormals or non-standard bitpatterns for |
yeah, this is a hard one to parallelize 😄. Some of the steps may subtly depend on others in ways that aren't obvious until one is knee deep in implementing it. Nice! I agree with the reasoning here. Added a task for this. This probably needs to be the absolute last thing to switch to the ir. |
Definitely the last thing to switch, I just got nerdsniped 😅 |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
I'm working on migrating shrinker block programs. Our upweighting for large integer ranges is giving the shrinker trouble, because it means that a simpler tree can result in a longer buffer: the buffer runs through the weighted distribution and draws Real example of this: b1 = b'\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00'
b2 = b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
s = st.lists(st.integers(0, 2**40))
print("complex result, smaller buffer", ConjectureData.for_buffer(b1).draw(s))
# complex result, smaller buffer [0, 0, 0, 0, 0]
print("simpler result, larger buffer", ConjectureData.for_buffer(b2).draw(s))
# simpler result, larger buffer [0, 0, 0, 0] As a result I'd like to look at moving that weighting logic into |
What if we forced even more instead? If we choose a smaller |
We could do that! I'm fairly confident exactly what you stated, or some small variation, would work. I was thinking of killing two birds with one stone here, though. Do you think the upweighting belongs in the ir or in |
I think doing it 'below' the IR, so we just represent a single integer value with a minimum of redundancy, is the principled approach here. "Literally just give me an integer" feels like it should be bijective 😅 |
The concern is that moving the weighting to class IntegersStrategy(SearchStrategy):
...
def do_draw(self, data):
weights = None
if self.end is not None and self.start is not None:
bits = (self.end - self.start).bit_length()
# For large ranges, we combine the uniform random distribution from draw_bits
# with a weighting scheme with moderate chance. Cutoff at 2 ** 24 so that our
# choice of unicode characters is uniform but the 32bit distribution is not.
if bits > 24:
def weighted():
# INT_SIZES = (8, 16, 32, 64, 128)
# INT_SIZES_SAMPLER = Sampler((4.0, 8.0, 1.0, 1.0, 0.5), observe=False)
total = 4.0 + 8.0 + 1.0 + 1.0 + 0.5
return (
(4.0 / total) * (-2**8, 2**8),
# ...except split these into two ranges to avoid double counting bits=8
(8.0 / total) * (-2**16, 2**16),
(1.0 / total) * (-2**32, 2**32),
(1.0 / total) * (-2**64, 2**64),
(0.5 / total) * (-2**128, 2**128),
)
weights = (
(7 / 8) * weighted()
+ (1 / 8) * uniform()
)
# for bounded integers, make the near-bounds more likely
weights = (
weights
+ (2 / 128) * self.start
+ (1 / 64) * self.end
+ (1 / 128) * (self.start + 1)
+ (1 / 128) * (self.end - 1)
)
# ... also renormalize weights to p=1, or have the ir do that
return data.draw_integer(
min_value=self.start, max_value=self.end, weights=weights
) Now the ir |
That would work! I'm also fine with the IR |
I'm working on a native ordering for the IR (wip branch). My current plan is to have a bijective map This will replace some ad-hoc constructs:
We can also take advantage of this ordering as a unified representation to work over when convenient, just like the bytestring was. I plan to use this ordering to migrate Two things:
>>> sorted([0.01 * n for n in range(100)], key=float_to_lex)
[0.0, 0.5, 0.75, 0.8300000000000001, 0.54, 0.79, 0.71, 0.96, 0.52, 0.77, 0.56, 0.81, 0.73, 0.98, 0.51, 0.76, 0.72, 0.97, 0.55, 0.8, 0.53, 0.78, 0.74, 0.99, 0.5700000000000001, 0.8200000000000001, 0.59, 0.84, 0.67, 0.92, 0.63, 0.88, 0.61, 0.86, 0.6900000000000001, 0.9400000000000001, 0.65, 0.9, 0.68, 0.93, 0.6, 0.85, 0.64, 0.89, 0.7000000000000001, 0.9500000000000001, 0.62, 0.87, 0.58, 0.66, 0.91, 0.25, 0.27, 0.48, 0.26, 0.28, 0.49, 0.38, 0.36, 0.4, 0.39, 0.37, 0.41000000000000003, 0.42, 0.46, 0.44, 0.43, 0.47000000000000003, 0.45, 0.34, 0.3, 0.32, 0.35000000000000003, 0.31, 0.29, 0.33, 0.24, 0.13, 0.14, 0.19, 0.18, 0.2, 0.21, 0.23, 0.22, 0.17, 0.15, 0.16, 0.12, 0.07, 0.09, 0.1, 0.11, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
>>> like I guess a starting point is, how should we order [0, 1)? We don't have to go crazy with this – the ordering only has an impact insofar as the shrinker can intelligently match it1, there are just too many floats – but @Zac-HD I'm curious if you've thought about a good ordering before 🙂. I'm tempted to say Footnotes
|
|
An update: we've steadily been working towards this goal over the past 6 months. The current status is that input generation and redundancy tracking (via The typed choice sequence has been incrementally worsening performance (#4076 (comment)) as we maintain two internal representations simultaneously. We expect this to improve roughly back to the status quo once Hypothesis is fully on the TCS. And it may well already be a wash due to the increased bug-finding power of the TCS (because of e.g. less duplication). |
This epic-style issue tracks our work on migrating the internals of Hypothesis from a bytestring to the typed choice sequence.
This is a huge change that touches every part of Hypothesis. The short version is that we have encountered several shortcomings of Hypothesis internals over time, which have limited the efficiency of input generation and shrinking, and made alternative backends infeasible. We expect this migration to give moderate efficiency gains and pave a clear path forward to future integration with alternative backends or other powerful features which are currently intractable.
Bytestring
Hypothesis currently works at the level of a bytestring (a sequence of bytes).
DataTree
internally) and randomly generating the remaining bytes.But, the bytestring has its limitations.
0
is represented by many different bytestrings, so any strategy usingst.integers()
effectively wastes some number of inputs (except for detecting flakiness). See also generate_novel_prefix interacts poorly with biased_coin (and lists) #1574.These limitations extend equally to alternative backends over the bytestring. e.g. for CrossHair (#3086) integration, bytes is simply too low level to get an efficient SMT algorithm out of.
Typed Choice Sequence
Enter the typed choice sequence, which replaces the bytestring. We lift up the representation from bytes to a slightly higher representation at the level of five types:
boolean
,integer
,float
,string
, andbytes
.This improves redundancy (as
DataTree
operates at this higher level) and precision (as we retain type and shape information about what was previously spans of the bytestring). The end goal is rebuilding input generation, shrinking, targeted pbt, theexplain
phase, the database, etc on the typed choice sequence, and we expect to see performance gains in each (except the database).Roadmap
Completed:
float
,integer
,string
,byte
, andboolean
drawing logic intoPrimitiveProvider
#3788DataTree
to the new IR #3818crosshair
backend #3806Float
shrinker to the ir #3899generate_mutations_from
)weights=
#4138extend
forcached_test_function_ir
#4159explain
phase to the typed choice sequence #4162Optimiser
to the typed choice sequence #4163generate_novel_prefix
to the typed choice sequence #4172choice_{to, from}_index
#4209sort_key_ir
#4215Future work, roughly in order of expected completion:
lower_blocks_together
in the shrinkerThe text was updated successfully, but these errors were encountered: