Success with Rust in Production

| Tristan King

A quick look into a successful case of using Rust in production.

One of our recent projects was to build a cloud-based system that would communicate with on-site systems running inside a customer's network, without exposing the customer's network directly to the internet. As we were expecting to deal with thousands of customers and wanted a flexible, scalable backend architecture, running a typical reverse-tunneling SSH client on the customer's side to make the connection to the cloud did not seem practical. Our idea was to build an application that would run on the customer's machines alongside the other internal services and connect to our cloud services via a websocket, which we could then use to proxy requests through. Simple enough! However, there were two other requirements: the machine the service needed to run on was a Windows machine, and it needed to be a self-contained bundle.

These requirements made us think a bit. If it had been a Linux-based system that we could install whatever we wanted on, we probably would have gone with Python, as that's what we use for most of the backend code we write, and what we were using for the cloud services we were also building for the client. We probably could have gone that route anyway using something like PyInstaller, but at the time I was an avid Rust hobbyist, looking for any excuse to write something in Rust that would be used in production, so I pushed the idea based on the fact that it is dead simple to compile a single executable with Rust that can be installed on the remote machines. This was back in September 2020: async/await had been in stable Rust for almost a year, and tokio was a month away from version 1.0, so I was quite confident we could build it in Rust without the costly migrations due to changing APIs that might have given me pause had we started the previous year. Luckily both the team and the client were happy with this idea, so the journey into production Rust began!

The first steps were getting communication going between Rust and the other internal services we needed to talk to. The services were a mix of XML-RPC and other text-based protocols, and we wanted to combine these into a single interface that our cloud backend could use, so we decided on JSON-RPC. This meant that our application would not just be a simple proxy, but would also translate requests and responses between the different internal protocols and JSON-RPC.

The Rust library serde gave us a good starting point for how to do this. We began by building Rust structs for all of the different messages we would handle, simply adding #[derive(Deserialize, Serialize)] to each one. Then we could easily use serde_json to generate the JSON-RPC payload for each request. Similarly, we started out using quick-xml's serde integration to generate XML-RPC, so our translation layer was simply converting the JSON-RPC payload into the Rust struct with a single serde_json::from_str call, then converting that struct back into XML-RPC with a single quick_xml::se::to_string call. Once we got into some of the more complicated requests, we had trouble making quick-xml's serialization generate and parse the XML in the exact structure the internal service required, so we ended up writing our own derive macro using quick-xml's regular interface to read and write the XML. That was an unfortunate side-track to have to take, but one I thoroughly enjoyed as a developer, and it didn't take so long that it put the schedule at risk. quick-xml's serde support has gained much better documentation and some extra utilities in the years since, so I would not be surprised to find we could drop our custom derive macro and go back to quick-xml's, but at this point the work is done and we have no problems with it, so it's unlikely to be replaced any time soon. The other services we needed to talk to also required some manual deserialization, as they used proprietary interfaces, but in the end we had a single JSON-RPC interface to talk to everything we needed.
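To give a feel for the shape of that translation layer, here's a minimal sketch, assuming quick-xml with its serialize feature enabled; the struct and field names are hypothetical stand-ins for our real message types:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical message type; our real structs mirrored the internal
// services' actual requests.
#[derive(Debug, Deserialize, Serialize)]
struct SetZoneTemperature {
    zone: u32,
    celsius: f64,
}

// JSON-RPC params in, XML body out, with the shared struct acting as
// the bridge between the two protocols.
fn json_to_xml(json_params: &str) -> Result<String, Box<dyn std::error::Error>> {
    let msg: SetZoneTemperature = serde_json::from_str(json_params)?;
    let xml = quick_xml::se::to_string(&msg)?;
    Ok(xml)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml = json_to_xml(r#"{"zone": 3, "celsius": 21.5}"#)?;
    println!("{xml}"); // e.g. <SetZoneTemperature><zone>3</zone>...
    Ok(())
}
```

The appeal is that the struct is the single source of truth: both wire formats fall out of one type definition, with no hand-written mapping code in the simple cases.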

Now that we had the protocol to speak, it was down to making the actual connection. We used tokio-tungstenite to make the websocket connections to our backend with very little effort, essentially expanding on the example code provided to add the extra functionality we needed. We used reqwest to make the HTTP calls needed to talk to the internal services and a plain TCP socket for the other text-based protocol services, and all the pieces were in place. After building the server side of the websocket and JSON-RPC communication in the Python backend, we had a connection into the remote location and could finally talk to all the internal services.
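For anyone curious what that looks like, here's a simplified sketch of the client loop, close in spirit to tokio-tungstenite's examples; the backend URL and the translation function are hypothetical stand-ins:

```rust
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical backend URL; the real connection authenticates with
    // per-customer credentials.
    let (ws_stream, _response) = connect_async("wss://backend.example.com/agent").await?;
    let (mut write, mut read) = ws_stream.split();

    // Each incoming text frame is a JSON-RPC request from the cloud; the
    // reply carries the translated response from the internal service.
    while let Some(frame) = read.next().await {
        match frame? {
            Message::Text(payload) => {
                let response = translate_request(&payload).await;
                write.send(Message::Text(response.into())).await?;
            }
            Message::Ping(data) => write.send(Message::Pong(data)).await?,
            Message::Close(_) => break,
            _ => {} // ignore other frame types in this sketch
        }
    }
    Ok(())
}

// Stub standing in for the JSON-RPC <-> internal-protocol translation layer.
async fn translate_request(_payload: &str) -> String {
    r#"{"jsonrpc":"2.0","result":"ok","id":1}"#.to_string()
}
```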

So how did it go? The application has been running in various remote locations for the last 2.5 years. In that time we've had... 0 crashes. Zero. Or, even more specifically, 0 times the service has stopped performing its intended function unexpectedly. Of course there were some bug fixes along the way, but all of those were to correct logic issues. We even shelved the initial work to turn the service into an actual Windows service rather than a plain old exe, which we thought we'd need in order to restart the service in the event that it crashed randomly. It simply didn't happen. This I attribute to Rust's approach to fallibility, forcing you to decide what to do if a function fails at its call site, rather than being able to ignore, or simply not know about, possible failures (e.g. exceptions) that could cause things to go wrong at run time. Of course, you can still be lazy, call unwrap() on a Result, and crash the program if something goes wrong. We did that in some places during startup of the service, where it doesn't make sense to start the service at all if something fails, but it was quite easy to avoid in the main loop. There was never the feeling that handling a possible (or even unlikely) error was painful enough to just slap an unwrap() on it and forget about it (or at least try to, until it crashed the program at runtime).
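The pattern this settles into is roughly the following sketch: fail fast during startup, handle everything at the call site once the main loop is running. The config path and the next_request helper are hypothetical:

```rust
use std::time::Duration;

// Hypothetical stand-in for reading the next request off the websocket.
async fn next_request() -> Result<String, std::io::Error> {
    Ok(r#"{"jsonrpc":"2.0","method":"status","id":1}"#.to_string())
}

#[tokio::main]
async fn main() {
    // Startup: if the config can't be read there's no point running at all,
    // so failing fast here is a deliberate choice, not laziness.
    let _config = std::fs::read_to_string("config.toml")
        .expect("config.toml must be readable at startup");

    // Main loop: every fallible call is handled at its call site, so one
    // bad request or dropped connection never takes the service down.
    loop {
        match next_request().await {
            Ok(_request) => {
                // translate and forward the request...
            }
            Err(err) => {
                eprintln!("request failed: {err}; backing off and retrying");
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}
```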

As for things that did go wrong: some of our beta users got the idea that if something wasn't working with the system as a whole, it could be because our software wasn't running, probably because there were cases where the machine had been restarted and we didn't have the application's exe auto-launched. But if they started it again while it was already running, they would end up with two instances of the service, and this caused the websocket to continuously disconnect, since the server would force-disconnect any old websocket connection authenticated with the same credentials as a new connection. We fixed this by using the single-instance library to prevent starting more than one instance.
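The fix is a few lines at the top of main(); a minimal sketch, with a made-up instance name:

```rust
use single_instance::SingleInstance;

fn main() {
    // Creates a named, system-wide handle. A second copy of the exe will
    // see is_single() == false and can exit before it ever opens a
    // competing websocket connection.
    let instance = SingleInstance::new("my-agent-service")
        .expect("failed to create single-instance handle");
    if !instance.is_single() {
        eprintln!("another instance is already running; exiting");
        return;
    }

    // ... start the websocket client and main loop as usual ...
}
```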

Another issue we had was with the XML-RPC date format. The service we were talking to would format its dates with whatever format the Windows system was set up to use: D/M/Y for EU date formats and M/D/Y for US. Luckily the US format was the one most commonly used in production, and it included AM/PM in the time, which let us distinguish it from the EU format used in the testing VMs, so we simply left it trying those two and failing otherwise. I suspect there are some Windows APIs we could use to either get the system locale or even parse the date string, but that is a rabbit hole that wasn't worth going down since we didn't need to be so generic (and one that might have been easier to go down had we been using C# rather than Rust).
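The two-format fallback amounts to something like this sketch, assuming chrono; the format strings are illustrative rather than the exact patterns the service produced:

```rust
use chrono::NaiveDateTime;

// Try the US layout first (its AM/PM marker makes it unambiguous), then
// fall back to the EU layout; anything else is a hard error.
fn parse_service_date(raw: &str) -> Result<NaiveDateTime, chrono::ParseError> {
    NaiveDateTime::parse_from_str(raw, "%m/%d/%Y %I:%M:%S %p")
        .or_else(|_| NaiveDateTime::parse_from_str(raw, "%d/%m/%Y %H:%M:%S"))
}

fn main() {
    assert!(parse_service_date("12/31/2020 11:59:59 PM").is_ok()); // US
    assert!(parse_service_date("31/12/2020 23:59:59").is_ok());    // EU
}
```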

We also had an issue one time when a newer version of one of the internal services sent us an event we didn't expect; we were turning unexpected events into errors, which stopped the service from starting. Once we changed the logic to simply ignore unexpected events, this wasn't a problem any more.
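With serde, that kind of change can be as small as adding a catch-all variant; a sketch with made-up event names:

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(tag = "type")]
enum Event {
    Connected,
    Disconnected,
    // Anything we don't recognize lands here instead of failing to
    // deserialize, so newer service versions can't break startup.
    #[serde(other)]
    Unknown,
}

fn main() -> Result<(), serde_json::Error> {
    // An event type we've never seen deserializes to Event::Unknown.
    let event: Event = serde_json::from_str(r#"{"type":"Rebalanced"}"#)?;
    match event {
        Event::Connected => { /* ... */ }
        Event::Disconnected => { /* ... */ }
        Event::Unknown => {} // ignore events we don't understand
    }
    Ok(())
}
```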

As you can see, these issues were not specific to Rust; we would have hit them with any other language as well.

So, all in all, our experience building and deploying a Rust-built service into production was smooth and painless. Thinking back on what we would do differently, choosing Rust is not something we would change!