Success with Rust in Production | Tristan King
A quick look into a successful case of using Rust in production.
One of our recent projects was to build a cloud-based system that would communicate with on-site systems running inside a customer's network, without having to make the customer's network directly accessible from the internet. As we were expecting to deal with thousands of customers and wanted to support a flexible and scalable backend architecture, running a typical reverse-tunneling SSH client on the customer's side to make the connection to the cloud did not seem practical. Our idea was to build an application that would run on the customer's machines alongside the other internal services and connect to our cloud services via a websocket, which we could then use to proxy requests through. Simple enough! However, there were two other requirements: the machine the service needed to run on was a Windows machine, and it needed to be a self-contained bundle.
These requirements made us think a bit. If it was a Linux-based system that we could install whatever we wanted on, we probably would have gone with Python, as it's what we use for most of the backend code we write, and what we were using for the cloud services we were also building for the client. We probably could have gone that route anyway using something like PyInstaller, but at the time I was an avid Rust hobbyist, looking for any excuse to write something in Rust that would be used in production, so I pushed the idea based on the fact that it was dead simple to compile a single executable with Rust that could be installed on the remote machines. This was back in September 2020: async/await had been in stable Rust for almost a year, and tokio was a month away from version 1.0, so I was quite confident we could build it in Rust without any costly migration issues due to changing APIs that might have given me pause had we started the previous year. Luckily both the team and the client were happy with this idea, so the journey into production Rust began!
The first steps were getting communication going between Rust and the other internal services we needed to talk to. The services were a mix of XML-RPC and other text-based protocols, and we wanted to combine these into a single interface that our cloud backend could use, so we decided on JSON-RPC. This meant that our application would not just be a simple proxy, but would also involve request/response translation between the different internal protocols and JSON-RPC.
The Rust library serde gave us a good starting point for how to do this. We started by building Rust structs for all of the different messages we would handle, and simply adding #[derive(Deserialize, Serialize)] at the start of each one. Then, we could easily use serde_json to generate the JSON-RPC payload for each request. Similarly, we started out using quick-xml's serde integration to generate XML-RPC, so our translation layer was simply converting the JSON-RPC payload to the Rust struct with a single serde_json::from_str call, then converting that Rust struct back into XML-RPC with a single quick_xml::se::to_string call.
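To make that concrete, here is a minimal sketch of the pattern (the struct and field names are made up for illustration; the real message types are internal, and quick-xml needs its serialize feature enabled):

```rust
use serde::{Deserialize, Serialize};

// A hypothetical message type; one of these existed per request we handled.
#[derive(Debug, Deserialize, Serialize)]
struct SetValueRequest {
    name: String,
    value: i32,
}

fn translate(json_payload: &str) -> Result<String, Box<dyn std::error::Error>> {
    // Parse the incoming JSON payload into the shared struct...
    let req: SetValueRequest = serde_json::from_str(json_payload)?;
    // ...and re-serialize the same struct as XML for the internal service.
    let xml = quick_xml::se::to_string(&req)?;
    Ok(xml)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = translate(r#"{"name": "speed", "value": 42}"#)?;
    println!("{xml}");
    Ok(())
}
```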
After getting into some of the more complicated requests, we had some trouble making quick-xml's serialization generate and parse the XML in the exact structure we needed for the internal service we were talking to, so we ended up writing our own derive macro using quick-xml's regular interface to read and write the XML. This was a bit of an unfortunate side-track to have to take, but one that, as a developer, I thoroughly enjoyed, and it didn't take so long that it put the schedule at risk. quick-xml has also gotten much better documentation and some extra utilities for its serde support in the years since, so I would not be surprised to find that we could remove this custom derive macro now and go back to using quick-xml's serde integration, but at this point the work is done and we have no problems with it, so it's unlikely to be replaced any time soon. The other services we needed to talk with also required doing some manual deserialization, as they were proprietary interfaces, but in the end we ended up with a single JSON-RPC interface to talk to everything we needed.
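For a sense of what our generated code worked against, here is a rough sketch of quick-xml's event-based writer interface, writing a fragment of an XML-RPC methodCall. The API shown matches recent quick-xml versions, not necessarily the one we used in 2020:

```rust
use quick_xml::events::{BytesEnd, BytesStart, BytesText, Event};
use quick_xml::Writer;
use std::io::Cursor;

fn write_method_call(method: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut writer = Writer::new(Cursor::new(Vec::new()));
    // Emit <methodCall><methodName>...</methodName></methodCall> event by event.
    writer.write_event(Event::Start(BytesStart::new("methodCall")))?;
    writer.write_event(Event::Start(BytesStart::new("methodName")))?;
    writer.write_event(Event::Text(BytesText::new(method)))?;
    writer.write_event(Event::End(BytesEnd::new("methodName")))?;
    // ...params would be written here in the same fashion...
    writer.write_event(Event::End(BytesEnd::new("methodCall")))?;
    Ok(String::from_utf8(writer.into_inner().into_inner())?)
}
```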
Now that we had the protocol to speak, it was down to making the actual connection. We used tokio-tungstenite to make the websocket connections to our backend with very little effort, essentially expanding on the example code provided to perform the extra functionality we needed. We used reqwest to make the HTTP calls needed to talk to the internal services, and a simple TCP socket for the other text-based protocol services, and all the pieces were in place. After building the server side of the websockets and JSON-RPC communication in the Python backend, we had a connection into the remote location and could finally talk to all the internal services.
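The client side of that connection looks roughly like the following sketch (the URL is a placeholder, and authentication and reconnection logic are omitted):

```rust
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint; the real URL and credentials are internal.
    let (mut ws, _response) = connect_async("wss://example.com/agent").await?;

    // Announce ourselves, then handle incoming requests.
    ws.send(Message::Text("hello".into())).await?;
    while let Some(msg) = ws.next().await {
        match msg? {
            Message::Text(payload) => {
                // ...translate the JSON-RPC request and proxy it internally...
                println!("received: {payload}");
            }
            Message::Close(_) => break,
            _ => {}
        }
    }
    Ok(())
}
```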
So how did it go? The application has been running in various remote locations for the last 2.5 years. In that time we've had... 0 crashes. Zero. Or, even more specifically, 0 times the service has stopped performing its intended function unexpectedly. Of course there were some bug fixes along the way, but all of those were to correct logic issues. We even shelved the initial work to turn the service into an actual Windows service rather than a plain old exe, which we thought we'd need so we could restart the service in the event that it crashed randomly. It simply didn't happen. This I attribute to Rust's approach to fallibility, forcing you to decide what to do if a function fails at its call site, rather than being able to ignore, or simply not know about, possible failures (e.g. exceptions) that could cause things to go wrong at run time. Of course, you can still be lazy and call unwrap() on a Result and crash the program if something goes wrong, which we have done in some places during startup of the service, where it doesn't make sense to start the service at all if something fails, but it was quite easy to avoid doing in the main loop. There was never the feeling that handling a possible (or even unlikely) error was painful enough to just slap an unwrap() on it and forget about it (or at least try to until it crashed the program at runtime).
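As a toy illustration of that split (the file names here are made up):

```rust
use std::fs;

fn main() {
    // At startup, aborting on failure is a deliberate choice: without its
    // config the service cannot do anything useful.
    let config = fs::read_to_string("service.toml")
        .expect("cannot start without service.toml");

    // In the main loop, failures are handled and the service keeps running.
    match fs::read_to_string("runtime-overrides.toml") {
        Ok(overrides) => println!("applying overrides: {overrides}"),
        Err(err) => eprintln!("no overrides loaded ({err}), continuing"),
    }

    let _ = config; // ...the rest of the service would run here...
}
```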
As for things that did go wrong: some of our beta users got the idea that if something wasn't working with the system as a whole, it could be due to our software not running, probably because we had some cases where the machine would get restarted and we didn't have the application's exe auto-launched. But if they started it again when it was already running, they would then have two instances of the service running, and this would cause the websocket to continuously get disconnected, since the server would force-disconnect any old websocket connections authenticated with the same credentials as a new connection. We fixed this by using the single-instance library to prevent starting up more than one instance.
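The fix is a few lines at startup, roughly like this (the instance name is an arbitrary identifier):

```rust
use single_instance::SingleInstance;

fn main() {
    // On Windows this is backed by a named mutex; the name just has to be
    // unique to our application.
    let guard = SingleInstance::new("our-agent-service")
        .expect("failed to create single-instance guard");
    if !guard.is_single() {
        eprintln!("another instance is already running, exiting");
        return;
    }
    // ...run the service; the lock is held until `guard` is dropped...
}
```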
Another issue we had was with the XML-RPC date format. The service we were talking to would format its dates with whatever format the Windows system was set up to use: D/M/Y for EU date formats and M/D/Y for US. Luckily the US format was the one most commonly used in production, and it would include the AM/PM for the time, which let us distinguish it from the EU format that was used in the testing VMs, so we simply left it to try those two and fail otherwise. I suspect there are some Windows APIs we could use to either get the system locale or even to parse the date string, but that is a rabbit hole that wasn't worth going down since we didn't need to be so generic (and one that might have been easier to go down had we been using C# rather than Rust).
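The try-one-then-the-other parse is short with chrono; the exact format strings below are assumptions for illustration, not the service's documented formats:

```rust
use chrono::NaiveDateTime;

fn parse_service_date(s: &str) -> Option<NaiveDateTime> {
    // US style with AM/PM, e.g. "6/25/2023 3:04:05 PM"...
    NaiveDateTime::parse_from_str(s, "%m/%d/%Y %I:%M:%S %p")
        // ...falling back to EU style 24-hour, e.g. "25/6/2023 15:04:05".
        .or_else(|_| NaiveDateTime::parse_from_str(s, "%d/%m/%Y %H:%M:%S"))
        .ok()
}
```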
We also had an issue one time when a newer version of one of the internal services was deployed that sent us an event we didn't expect; we were turning unexpected events into errors, which stopped the service from starting. Once we changed the logic to simply ignore unexpected events, this wasn't a problem anymore.
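With serde, ignoring unknown events can be as simple as a catch-all variant; the event names and tag field here are hypothetical, not the internal service's actual protocol:

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(tag = "event")]
enum ServiceEvent {
    Started,
    Stopped,
    // Any event type we don't recognize lands here instead of failing
    // deserialization.
    #[serde(other)]
    Unknown,
}
```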
As you can see, these issues were not specific to Rust; we would have hit them with any other language as well.
So, all in all, our experience building and deploying a Rust-built service into production was smooth and painless. Thinking back on things we would do differently, choosing Rust is not something we would change!