Yeah, and you know, we had a very strict rule: we do not look at content. So when debugging issues, the saving grace was that most of the issues we saw were metadata issues, around sync not converging, or the client thinking it's in sync with the server but the two disagreeing. So we had a few pretty interesting supporting algorithms for this.
One of them was simple hang detection: when should a client reasonably expect that it's in sync? If it's online and has downloaded all the recent versions and things are still getting stuck, why are they getting stuck? Are they stuck because they can't read from the server, either metadata or data? Are they stuck because they can't write to the file system and there's some permission error? Having a very fine-grained classification of that, having the client do it in a way that doesn't include any private information, sending that up in reports, and then aggregating over all of the clients so we could classify the failures was a big part of getting a handle on it. And I think this is just generally very useful for these sync engines.
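To make that concrete, here is a minimal sketch of what such a content-free classification and report might look like in Rust. The enum variants, struct names, and fields are illustrative assumptions for this sketch, not Dropbox's actual taxonomy or code:

```rust
use std::collections::HashMap;

/// Illustrative reasons a client might report for failing to reach "in sync".
/// These variants are hypothetical, not Dropbox's actual taxonomy.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum StuckReason {
    MetadataReadFailed,    // can't read metadata from the server
    DataReadFailed,        // can't read file data from the server
    FilesystemWriteDenied, // permission error writing locally
    Unknown,
}

/// The report carries only coarse, content-free counters, so no paths, file
/// names, or file data ever leave the machine; the server side can aggregate
/// these across all clients to see where sync is getting stuck.
#[derive(Debug, Default)]
struct HangReport {
    counts: HashMap<StuckReason, u64>,
}

impl HangReport {
    fn record(&mut self, reason: StuckReason) {
        *self.counts.entry(reason).or_insert(0) += 1;
    }
}
```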
The biggest return on investment we got, though, was from consistency checkers. Part of sync is that the same data is duplicated in many places. We had the data on the user's local file system. We had all of the metadata we stored in SQLite: what we think should be on the file system, what the latest view from the server was, and what operations were in progress. And then there's what's stored on the server. For each one of those hops, we would have a consistency checker that would go and see whether the two sides matched. That was the highest return on investment we got.
Because before we had that, people would write in and complain that Dropbox wasn't working, and until we had these consistency checkers we had no idea of the order of magnitude of how many issues were happening. When we started doing it, we were like, wow, there's actually a lot.
So a consistency check in this regard was mostly a hash over some data that you're sending around. And with that you could verify: from A to B to C to D we're all seeing the same hash, but suddenly on the hop from D to E the hash changes. Ah-huh. Let's investigate.
Exactly. And we had to do that in a way that's respectful of the user's resources on their system. We wouldn't just blast their CPU and their disk and their network to churn through a bunch of things. So we had a sampling process where we'd sample a random path in the tree on the client and do the same on the server.
We had machinery with Merkle trees, and when things would diverge, we would try to see whether there was a way we could compare on the client.
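As a rough illustration of one hop of that comparison, here is a hedged sketch, not Dropbox's actual checker; the function names and the choice of metadata fields are assumptions, and a real checker would likely use a cryptographic hash rather than std's hasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Digest of the metadata we'd compare for one sampled path. Hashing means a
/// mismatch can be reported without any file content leaving the machine.
/// `size` and `mtime` stand in for whatever metadata each hop actually stores.
fn entry_digest(path: &str, size: u64, mtime: i64) -> u64 {
    let mut h = DefaultHasher::new();
    path.hash(&mut h);
    size.hash(&mut h);
    mtime.hash(&mut h);
    h.finish()
}

/// One hop of the check: does the local view of a randomly sampled path agree
/// with the remote view? A `false` here is an inconsistency to be classified.
fn hop_consistent(local_digest: u64, remote_digest: u64) -> bool {
    local_digest == remote_digest
}
```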
For example, one of the really important goals for us as an operational team was to have "the power of zero." I think the idea might be from AWS or somewhere; my co-founder James has a really good talk on it.
We wanted a metric saying that the number of unexplained inconsistencies is zero, because the nice thing is that if it's at zero and it regresses, you know it's a regression. If it's fluctuating at 15, or at a hundred thousand, and it goes up by 5%, it's very hard to know, when evaluating a new release, whether that's actually safe or not.
So that meant that whenever we had an inconsistency due to a bit flip, which we would see all the time on client devices, we had to categorize it and bucket it out. We had a baseline expectation of how many bit flips there are across all of the devices on Dropbox, and we would check that that was staying consistent, or increasing, or decreasing, and that the number of unexplained things was still at zero.
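A minimal sketch of that bucketing idea, with hypothetical bucket names chosen for this illustration rather than taken from Dropbox's system:

```rust
/// Hypothetical bucketing of inconsistencies found by the checkers. Anything
/// attributable to a known cause (e.g. a bit flip) goes into its own bucket
/// with its own baseline; the alerting metric is only what's left over.
#[derive(Debug, Default)]
struct InconsistencyBuckets {
    bit_flips: u64,
    explained_other: u64,
    unexplained: u64,
}

impl InconsistencyBuckets {
    /// The "power of zero": this number is expected to sit at exactly zero,
    /// so any non-zero value after a release is unambiguously a regression.
    fn regression_detected(&self) -> bool {
        self.unexplained > 0
    }
}
```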
Now let's take one of those detours, since you got me curious. What would cause bit flips on a local device?
A few causes, I think. One is that in the data center, most memory uses error correction, and you have to pay more for it; usually you have to pay more for a motherboard that supports it, at least back then. On client devices we don't have that. This is a little above my pay grade on the hardware side, cosmic rays or thermal noise or whatever, but memory is much more resilient in the data center. Another is that storage devices vary greatly in quality: the SSDs and hard drives inside the data center are much higher quality than the ones in local devices. So there's that.
It also could be, as I mentioned, that people have all types of weird configurations. On Mac there are all these kernel extensions; on Windows there are all these minifilter drivers. There are all these things interposing between Dropbox, the user-space process, and writing to the file system, and if those have any memory-safety issues where they're corrupting memory, because they're written in archaic C or something, that's a way things can get corrupted. We've seen all types of things. We've seen network routers corrupting data, though usually that fails some checksum. We've even seen bad registers on CPUs, where the memory gets replaced and the memory seems fine, but it turns out the CPU's own on-chip registers are busted. All of that stuff, I think, can just happen at scale.
Right, that makes sense. And I'm happy to say I haven't yet had to worry about bit flips, whether for storage or anything else, but huge respect to whoever has had to tame those parts of the system. So, you mentioned the consistency checkers as probably the biggest lever you had for understanding what health state your sync engine is in in the first place. Was this the only kind of metric and proxy for understanding how well the sync system was working, or were there other aspects that gave you visibility, both macro and micro?
Yeah. So there were the hangs: knowing that a client gets to a synced state, and knowing the duration. The performance of that was one of our top-line metrics. The other one was this consistency check. And then there were specific operations, like uploading a file: how much bandwidth are people able to use? People wanted to use Dropbox to upload lots of data, a huge number of files where each file is really large, and they might do it in Australia or Japan, far away from a data center, so latency is high but bandwidth is very high too. So we tracked making sure we could fully saturate their pipes, and all types of stuff around debugging things in the internet, people having really bad routes to AWS and all that. Other than that, it was mostly the usual quality stuff: exceptions, and making sure that features all work.
When we rewrote this system, we designed it to be very correct, and we moved a lot of these checks into testing before release. To jump ahead a little bit: we decided to rewrite Dropbox's sync engine from this big Python code base into Rust, and one of the specific design decisions was to make things extremely testable. We would have everything be deterministic on a single thread, and have all of the reads and writes to the network and the file system go through a virtualized API. Then we could run simulations exploring what would happen if you uploaded a file here and deleted it concurrently, and then had a network issue that forced a retry. By simulating all of those in CI, we could have very strong invariants: knowing that a file should never get deleted in this case, or that sync should always converge, or things around sharing, like that this file should never get exposed to this other viewer. Having stronger guarantees was something we could only really do effectively once we designed the system to make it easy to test those guarantees.
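A minimal sketch of that "virtualized API" idea, under the assumption of made-up trait and type names (this is not Dropbox's actual interface): the sync core only talks to an environment trait, and the test build swaps in an in-memory, single-threaded implementation with a fake clock and injectable failures.

```rust
use std::collections::HashMap;

/// All non-determinism the sync core may touch goes through one trait.
/// The trait and its methods are illustrative, not Dropbox's real API.
trait Environment {
    fn now_ms(&self) -> u64;
    fn read_file(&self, path: &str) -> Result<Vec<u8>, String>;
    fn write_file(&mut self, path: &str, data: &[u8]) -> Result<(), String>;
}

/// In tests, the "file system" is an in-memory map plus a fake clock, so a
/// whole scenario runs deterministically on a single thread, and a test can
/// inject an I/O error at exactly the step it wants to exercise.
#[derive(Default)]
struct SimEnv {
    clock_ms: u64,
    files: HashMap<String, Vec<u8>>,
    fail_next_write: bool,
}

impl Environment for SimEnv {
    fn now_ms(&self) -> u64 {
        self.clock_ms
    }
    fn read_file(&self, path: &str) -> Result<Vec<u8>, String> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| format!("not found: {path}"))
    }
    fn write_file(&mut self, path: &str, data: &[u8]) -> Result<(), String> {
        if self.fail_next_write {
            self.fail_next_write = false;
            return Err("injected I/O error".to_string());
        }
        self.files.insert(path.to_string(), data.to_vec());
        Ok(())
    }
}
```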
Right, that makes a lot of sense. And I think we're seeing more and more systems, also in the database world, embrace this. TigerBeetle is quite popular for that, and I think the folks at Turso are now also embracing this approach. It goes under the umbrella of simulation testing. That sounds very interesting. Can you explain a little more how this would work in a much smaller program? Is it basically that every assumption, every potential branch, any side effect that might impact the execution of my program now has to be made explicit, almost like a parameter I pass into the arguments of my functions? Then I can call it under chosen circumstances and simulate: if that file suddenly gives me an unexpected error, this is how we're going to handle it.
Yeah, exactly. And there are techniques, like the TigerBeetle folks use, and we do this at Convex in Rust, where with the right abstractions it's not so awkward. But yes, it's this idea of: can you pin down all of the non-determinism in the system, whether it's reading from a random number generator, looking at the time, or reading and writing to files or the network? Can that all be pulled out, so that in production it's just using the regular APIs?
For any of these sync engines, there's a core of the system that represents all the sync rules: when I get a new file from the server, what do I do? If there's a concurrent edit to this, what do I do? And that core of the code is often the part with the most bugs. It doesn't think about some of the corner cases, or whether there are errors or retries needed, or it doesn't handle concurrency and might have race conditions. So the core idea of deterministic simulation testing is to take that core and pull all of the non-determinism out of it into an interface: time, randomness, reading and writing to the network, reading and writing to the file system, and make it so that in production those are just the regular APIs.
But in a testing situation, those can be mocks. They could be things that a particular test sets up in a specific way to exercise a scenario. Or they could be randomized, where, for reading the time for example, the test framework might decide pseudo-randomly to advance it or keep it at the current time, or might serialize things differently. That ability to have random search explore the state space of all the things that are possible is just one of those unreasonably effective ideas for testing, I think.
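As a toy illustration of that seed-driven exploration (deliberately tiny and self-contained, not the sync engine): a seeded PRNG stands in for the framework's scheduler, every run is fully determined by its seed, and an invariant is checked at the end of each run, so a failing seed replays identically.

```rust
/// Simple xorshift PRNG: the seed fixes every "random" scheduling choice.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    for seed in 1..=10_000u64 {
        let mut rng = seed;
        // Toy "system": a client counter and a server counter that must end
        // up equal no matter how edits and deliveries are interleaved.
        let (mut client, mut server) = (0u32, 0u32);
        let mut in_flight = 0u32;

        for _ in 0..100 {
            match xorshift(&mut rng) % 3 {
                0 => {
                    client += 1; // local edit
                    in_flight += 1;
                }
                1 if in_flight > 0 => {
                    server += 1; // deliver one pending edit
                    in_flight -= 1;
                }
                _ => {} // advance time / do nothing this step
            }
        }
        // Drain whatever is still in flight, then check the invariant.
        server += in_flight;
        assert_eq!(client, server, "divergence found with seed {seed}");
    }
    println!("all seeds converged");
}
```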
And then getting a system to pass that type of deterministic simulation testing isn't at the threshold of formal verification, but in our experience it's pretty close, with a much, much smaller amount of work.
And you mentioned Haskell at the beginning. I still remember when, after a lot of time spent writing unit tests in JavaScript (back then it was in the other order: I first had JavaScript and then learned Haskell), I found QuickTest, or was it QuickCheck? I think it was QuickCheck, right? Right. So I found QuickCheck, and I could express: hey, this is this type, it has these aspects, these invariants, and it would just go along and test all of those things. Like, wait, I never thought of that, but of course, yes. And then you combine those, where you'd be way too lazy to write unit tests for the combinatorial explosion of all your different cases, and you can say: sample it like that, focus on this.
So I've actually also started embracing this practice a lot more in the TypeScript work I'm doing, through a great project called Prop Check, which picks up the same ideas, particularly for those scenarios where Murphy's Law will come and haunt you, which in distributed systems is typically the case. Building things in such a way that all the aspects can be specifically injected, and, the sweet spot, doing so in a way that's still ergonomic, I think that's the way to go.
It's so, so valuable, right? And the ability, for proptest, for QuickCheck, for all of these, to also minimize is just magical. It comes up with this crazy counterexample, and it might be a list with 700 elements, but then it's able to shrink it down to the real core of the bug. It's magic, right?
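For readers who haven't seen this style, here's a small Rust example using the proptest crate; the property itself is just a toy chosen for this sketch. Proptest generates many random inputs, and when an assertion fails it shrinks the input toward a minimal counterexample before reporting it.

```rust
use proptest::prelude::*;

proptest! {
    // Property: sorting is idempotent. For any generated vector, sorting it
    // twice gives the same result as sorting it once.
    #[test]
    fn sort_is_idempotent(xs in prop::collection::vec(any::<i32>(), 0..1000)) {
        let mut once = xs.clone();
        once.sort();
        let mut twice = once.clone();
        twice.sort();
        prop_assert_eq!(once, twice);
    }
}
```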
And, you know, on a totally different theme: one thing we're exploring a lot at Convex is that coding has changed a lot in the past year with AI coding tools. One of the things we've observed, for getting coding tools to work very well with Convex, is that these types of very succinct tests, which can be generated easily and have a really high strength-to-weight or power-to-weight ratio, are just really good for autonomous coding. If you're going to take something like the Cursor agent and let it go wild, what does it take to let it operate without you doing anything? It takes something like a prop test, because then it can just continuously make changes, run the test, and know it isn't done until that test passes.
Yeah, that makes a lot of sense. So let's go back for a moment to the point where you were transitioning from the previous Python-based sync engine to the Rust-based sync engine. You were embracing simulation testing to get a better sense of all the different aspects that might influence the outcome. Walk me through how you went about deploying that new system. Were there any big headaches associated with migrating from the previous system to the new one? Since for everything you had a de facto source of truth, the files, could you maybe just forget everything the old system had done and treat it as if the user had installed fresh? Walk me through how you thought about that, since migrating systems at such a big scale is typically quite dreaded