Testing Rust web services (part 1): What is testing anyway?

2023-05-07
25 minutes
Rust

Since joining Apollo GraphQL, my professional career shifted from advising people on how to leverage cloud providers to achieve their objectives to getting my hands dirty, and building a managed cloud offering alongside a team of awesome engineers.

One thing I’ve been loving so far as a software engineer is the opportunity to put into practice my previous advices through a complex, long-running project. Many of those recommendations and the fundamental principles behind them still hold, but living with implementation details and tools for almost a year brings lots of nuance to these practices.

For example, while I’ve explored the principles of a hexagonal architecture for Rust serverless applications, this was a relatively simple CRUD microservice with very little business logic. Thus while it had some test coverage, it is nowhere near the reality of a complex, production system.

In this series of articles, I will explore the different components and challenges of testing Rust web services, alongside the tools and methods we’ve built and employed to achieve good test coverage. However, before diving directly into implementation details, I thought it would be helpful to put some context and take a step back. After all, testing means many things for many different people, and one thing I’ve learned working as a team is that we all bring our bias, understanding, and past experiences to the table – with all their positive and negative aspects.

If most of the concepts in this article seem obvious to you – great! This means you’ve integrated the core principles of testing web applications, and feel free to move on to the next article in the series where I start to dive into implementation details and practicalities of Rust services testing.

What is testing anyway?

When we write a change to a web service and deploy it to production, the users will always end up testing it somehow – unless you don’t have users. This would be fine if pieces of software always behaved exactly as we expect it. However, human beings – and generative AIs – are imperfect beings that tend to introduce bugs in code changes.

So, if we do not implement anything to check that our code is actually doing what we want, we risk shipping bugs to production. Ultimately, users may complain that something is not working as expected. This is a feedback loop: we make a change to the system, receive feedback that something is wrong, and can then correct it to hopefully approach the desired outcome.

However, this loop has a few disadvantages. First of all, it is slow: a bug could sit in production for weeks before a user reports that something is wrong. Users might also not report anything at all – just living with the quirks of an unreliable system, or decide to leave entirely. This leads to the most significant disadvantage: unless you provide something unique that your users cannot get anywhere else, people might just stop using your services.

There are some cases where such testing makes sense, such as A/B testing where you want to see how users will react to small changes that don’t impact critical functionality. However, for the rest of this article, I will focus on validating that the system behaves as expected.

Q&A and manual testing

To avoid exposing your users to bugs, security vulnerabilities, and all other potential risks of software engineering, you could test your changes internally first before propagating them to your users. In its simplest approach, this would consist of deploying your service into a staging environment, and then acting like you’d expect a user to interact with your service – browsing through a web interface, making API calls, using a CLI, etc.

Compared to the previous approach, there is now a step in the road to production that will ensure some level of internal testing before exposing end-users to new changes. This is still a feedback loop, but a shorter and (slightly) more reliable one than expecting users to send report errors.

That said, in the same way that engineers can introduce bugs in code, so can they make mistakes or miss scenarios while testing an application. And as the complexity of a codebase increases, so does the number of test cases. Manual testing means someone needs to spend time that can’t be used for something else, and this time spent also increase with the number of test cases. It’s also hard (at least for me) to stay focused on a repetitive task and not lose track of the small details.

From a technical point of view, manual testing requires some form of test environment – a replica of the production environment. This means we need to maintain and run an environment where we can freely deploy changes and run tests without impacting users. Web services are rarely self-contained: they depend on other services, external storage systems, APIs, specific network and system configurations, and more. As we’ll see in the dependencies in tests section, this isn’t specific to manual tests and is something we just have to live with.

While there are drawbacks to this approach as well, it’s quite often useful to just run some manual tests just to validate that everything is behaving as expected. In the testing tests section of this article, we’ll delve into why we cannot always take tests at face value. In this situation, some quick manual tests to validate we are going in the right direction is a good thing.

Automated testing

But for many cases, we can just write code to test our code. The whole reason why we write code in the first place is to automate repetitive tasks so they can be performed over and over in a predictable way. Why not use the same principles when it comes to testing?

If you’ve been coding in Rust for a while, you might’ve come across tests embedded into your Rust code. By initializing a new lib crate with cargo new --lib, you will get a sample function and its related test that looks like this (at least as of Rust 1.69.0):

pub fn add(left: usize, right: usize) -> usize {
    left + right
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let result = add(2, 2);
        assert_eq!(result, 4);
    }
}

The benefits here are obvious – with a simple cargo test, we can compile and run this test in less than a second. As the codebase and number of dependencies, it might take a few minutes to compile a new debug build from scratch, but this is still faster and more reliable than manual testing.

Once again, we have a feedback loop, but much faster and closer to the developers. Since it only takes from a few seconds to a few minutes to complete this loop, we can run it fairly frequently while iterating on our code.

However, there’s a tiny little detail that’s of importance here. As mentioned before, web services are rarely self-contained – which means we must either limit our testing to the self-contained part of our code, find a way to invoke all the dependencies we use at runtime or a way to bypass them altogether.

For example, let’s change the code sample to make calls to an external system. Instead of a function that adds two numbers together, we now have a struct that we can use to call an external service that maps user IDs to names.

struct MyStruct {
    client: AdderClient,
}

impl MyStruct {
    pub fn new() -> Self {
        todo!()
    }

    pub fn get_username(&self, user_id: usize) -> String {
        self.client.call(user_id)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let client = MyStruct::new();

        let result = client.get_username(2);
        assert_eq!(result, "John Doe");
    }
}

Here, this test can no longer work in complete isolation. We would need to run this other service locally to get the right values or have a way to artificially inject expected values.

As implementation details can be complex, we’ll explore this topic further in a future post. What’s important to note is that we have a spectrum of testing options, each with its advantages and disadvantages:

Faster tests that are less precise (a.k.a. unit tests), as they can run locally but cannot test integrations with external dependencies.
Slower tests that are more precise (a.k.a. end-to-end tests), as they run on actual systems that mimic a production environment.

The key concept is that testing exists on a spectrum. While some external dependencies, like SQL databases, can be easily tested locally. On the other hand, testing the APIs of a cloud provider can be more challenging: although tools like LocalStack can imitate some of AWS’s APIs, they may not support all features for every service.

Test pyramid

You might be familiar with the test pyramid for test automation. This is a concept that was invented by Mike Cohn in his book Succeeding with Agile. In short, it collapses our spectrum of tests into three distinct categories. The reality is often messier than these three categories, but it’s a good starting point to think about these different types of tests.

Unit tests are fast, small, self-contained tests that you can run locally. They’re great for testing complex business logic and catching potential problems early on.
Service tests (also called integration tests) ensure that all the components of a web service work together. When a single service is comprised of multiple sub-systems, this might require deploying into an isolated environment containing some of the dependencies of that service (e.g. databases).
Finally, UI tests (a.k.a. end-to-end tests) validate that the system as a whole works as expected. This requires an environment where all components of the system are deployed and working.

By using a pyramid instead of a spectrum, we can represent the number of tests at each layer. Ideally, you should have fewer service tests than unit tests and fewer UI tests than service tests.

We can update our feedback loops one final time to take into account these different types of automated tests:

Common problems with tests

Before closing on this refresher on tests and diving into the specifics of Rust web applications, I’d like to discuss a few problems with testing applications that are applicable across the board.

Testing tests

The first problem is that automated tests are code, and are as subject to bugs as the system under test is. As a consequence, a change could pass all tests but still end up generating errors in production.

While a combination of multiple accidental bugs accidentally giving the expected outcome is unlikely, there are a few situations that could cause false positives.

Incorrect assumptions

When writing self-contained unit tests for systems that make calls to external dependencies, we often have to create test doubles to mimic the behavior of those dependencies. Since this is only mimicking a certain behavior, it’s easy for a developer to misinterpret some documentation which would cause them to write tests with the wrong assumptions.

Let’s take our external service from before – where we pass a user ID and get a name back. In the implementation, we would send a usize and get back a String. However, those user IDs could be UUIDs instead. The whole system – tests included – would have been designed to handle integers and would fail at the first request on a deployed environment.

Incomplete coverage

Complex systems have complex behaviors. If a system has 10 different inputs with 10 different possibilities each, there are already 10 billion different permutations. It’s not always practical to test all possible scenarios, and sometimes hard to identify those in the first place.

To get better coverage, we often need to increase the number of test cases, which increases the burden of test maintenance. A function with many branching statements could necessitate a test per branch combination, which would explode the number of tests.

Edge case example

Even the simple function generated by cargo can be changed to pass a large number of test cases, but still fail in a very specific situation. On a 64 bits system, there are 3.40 * 10^38 different permutations of numbers for this function. It would be impossible to test them all to ensure that they are all behaving correctly.

pub fn add(left: usize, right: usize) -> usize {
    if left == 2345 && right == 1337 {
        0
    } else {
        left + right
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        static TEST_CASES: &[(usize, usize, usize)] = &[
            (2, 2, 4),
            (3, 4, 7),
            (1234, 1111, 2345),
        ];
        
        for &(left, right, expected) in TEST_CASES {
            let result = add(left, right);
            assert_eq!(result, expected);
        }
    }
}

Branching example

To demonstrate this, let’s take a function with multiple if statements. This function takes two floating point numbers and multiply them only if they are between 0.0 and 1.0. Otherwise, it returns an appropriate OutOfBound error. While this is a relatively simple function with 4 branches and 2 inputs, we already need 9 tests just to cover its basic behaviors.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OutOfBound {
    TooLow,
    TooHigh,
    TooHighAndLow,
}

pub fn multiply_bounded(left: f64, right: f64) -> Result<f64, OutOfBound> {
    if (left < 0.0 && right > 1.0) || (left > 1.0 && right < 0.0) {
        Err(OutOfBound::TooHighAndLow)
    } else if left < 0.0 || right < 0.0 {
        Err(OutOfBound::TooLow)
    } else if left > 1.0 || right > 1.0 {
        Err(OutOfBound::TooHigh)
    } else {
        Ok(left * right)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_multiply_bounded() {
        static TEST_CASES: &[(f64, f64, Result<f64, OutOfBound>)] = &[
            (0.5, 0.5, Ok(0.25)),
            // testing limit cases
            (0.0, 0.0, Ok(0.0)),
            (1.0, 1.0, Ok(1.0)),
            // testing errors
            (-1.0, 0.5, Err(OutOfBound::TooLow)),
            (0.5, -1.0, Err(OutOfBound::TooLow)),
            (2.0, 0.5, Err(OutOfBound::TooHigh)),
            (0.5, 2.0, Err(OutOfBound::TooHigh)),
            (2.0, -1.0, Err(OutOfBound::TooHighAndLow)),
            (-1.0, 2.0, Err(OutOfBound::TooHighAndLow)),
        ];
        
        for &(left, right, expected) in TEST_CASES {
            let result = multiply_bounded(left, right);
            assert_eq!(result, expected)
        }
    }
}

Test Maintenance

A corollary of the incomplete coverage problem is that ensuring complete branch coverage for a complex codebase will result in a long list of tests. Already the simple branching example from the previous question had more lines of code for testing (26 lines) than for the function itself (11 lines). When we will get into creating doubles for HTTP servers, creating all the appropriate responses will quickly take hundreds of lines of code. On top of this, we will need to add logic to ensure incoming HTTP requests match the expectations.

All this extraneous code has to be maintained like the rest of the codebase. If we make changes that will modify the behavior of a function, we also need to adapt all those tests accordingly. As the amount of code for tests could be much larger than the functionality under test, this can be a significant undertaking that would impair an engineer’s velocity.

Branching example, part 2

Taking the branching example from the previous section, if we wanted to modify its logic to accept numbers from -1.0 to 1.0 instead of just 0.0-1.0, we would need to modify 5 out of the 9 test cases:

The limit case test that validates when both values are 0.0;
The TooLow cases when one of the input value are -1.0; and
The TooHighAndLow cases when one of the input value are -1.0.

In total, we need to change 2 lines for the function under test, but 5 lines for test cases.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OutOfBound {
    TooLow,
    TooHigh,
    TooHighAndLow,
}

const LOW_BOUND: f64 = -1.0;
const HIGH_BOUND: f64 = 1.0;

pub fn multiply_bounded(left: f64, right: f64) -> Result<f64, OutOfBound> {
    if (left < LOW_BOUND && right > HIGH_BOUND) || (left > HIGH_BOUND && right < LOW_BOUND) { // changed
        Err(OutOfBound::TooHighAndLow)
    } else if left < LOW_BOUND || right < LOW_BOUND { // changed
        Err(OutOfBound::TooLow)
    } else if left > HIGH_BOUND || right > HIGH_BOUND {
        Err(OutOfBound::TooHigh)
    } else {
        Ok(left * right)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_multiply_bounded() {
        static TEST_CASES: &[(f64, f64, Result<f64, OutOfBound>)] = &[
            (0.5, 0.5, Ok(0.25)),
            // testing limit cases
            (-1.0, -1.0, Ok(1.0)), // changed
            (1.0, 1.0, Ok(1.0)),
            // testing errors
            (-2.0, 0.5, Err(OutOfBound::TooLow)), // changed
            (0.5, -2.0, Err(OutOfBound::TooLow)), // changed
            (2.0, 0.5, Err(OutOfBound::TooHigh)),
            (0.5, 2.0, Err(OutOfBound::TooHigh)),
            (2.0, -2.0, Err(OutOfBound::TooHighAndLow)), // changed
            (-2.0, 2.0, Err(OutOfBound::TooHighAndLow)), // changed
        ];
        
        for &(left, right, expected) in TEST_CASES {
            let result = multiply_bounded(left, right);
            assert_eq!(result, expected)
        }
    }
}

The eagle readers among you might have noticed that I have now put the low/high bounds in constants instead of inlining them in the function. As we now have established that those could be changed for business reasons, it’s easier – and safer – to make them into a constant. You might be wondering why I’m not using them in the test cases.

If I were to use them in test cases, I wouldn’t be testing an expected behavior in its entirety anymore. Someone could change those constants and the tests would pass, but might not meet our requirements. By using direct floating point numbers instead, I also test that those bound values are (roughly) where they should be.

One way to reduce the complexity and number of tests is to isolate branches into smaller functions that can be tested in isolation. These would often be private functions, and there are debates on whether this is a good idea. By adding tests for private functions, it will be harder to modify the system’s internals in the future (as we need to update their corresponding tests). On the other side, getting complete test coverage from public functions alone could mean creating a large number of test cases to handle all possible categories of input values and external system states.

Dependencies in tests

All the examples I’ve given so far in this section about common problems with tests only looked at pure functions – functions where the output only depends on the given inputs. In the rest of this series, we will focus a lot on web services with some form of dependencies, such as databases, cloud services, or other services over HTTP.

For UI/end-to-end tests, we want to run tests that call those actual services. This will serve as an ultimate validation that our integrations behave as expected, and that the system as a whole is configured correctly. The first point serves as confirmation that we made the right assumptions when writing lower-level tests (e.g. unit and service level).

The second is something we can only test in a production-like environment. A complete application in the cloud doesn’t consist of just code that calls each other. We also need to configure services to know where they can find the other ones, granting them the right level of permissions, and ensuring that all the cloud architecture components are set up correctly.

AWS IAM Policies

If you are using a cloud provider like AWS, and want to use a service such as Amazon DynamoDB as a NoSQL data store, having code that performs correctly is not enough. You will also need to ensure you are using the right DynamoDB table name in your service, and that your service has an AWS IAM policy that grants the right permission.

Let’s consider the following code base on the example section of the AWS SDK for Rust:

use aws_sdk_dynamodb::Client;

pub async fn movies_in_year(
    client: &Client,
    table_name: &str,
    year: u16,
) -> Result<Vec<Movie>, MovieError> {
    let results = client
        .query()
        .table_name(table_name)
        .key_condition_expression("#yr = :yyyy")
        .expression_attribute_names("#yr", "year")
        .expression_attribute_values(":yyyy", AttributeValue::N(year.to_string()))
        .send()
        .await?;

    if let Some(items) = results.items {
        let movies = items.iter().map(|v| v.into()).collect();
        Ok(movies)
    } else {
        Ok(vec![])
    }
}

Since we are passing a Client object here, writing unit tests might get complicated, but we will explore how to do that in a future article. However, even with such tests, we still need to configure our application to work correctly in a cloud environment. First, we need a way to pass the right value for the table_name argument to this function.

Second, on AWS, an application does not have any permission to do anything by default. We would need to create an IAM role with a permission document that could look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1685383853785",
      "Action": [
        "dynamodb:Query"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:eu-north-1:123456789012:table/my-movie-table"
    }
  ]
}

Here, multiple things could go wrong. While we don’t typically think of JSON documents as code, they too can have bugs that would render the application as a whole dysfunctional. For example, if we set the wrong permissions in the Action section, or the wrong ARN in the Resource section, our application will not be able to perform the query successfully, even though the code is correct.

In this article, we’ve touched on the core principles behind why we write tests. We’ve also explored some of the challenges we might face while implementing good test coverage for a web service. One important part was centered around automated testing, where we looked at a simple example without side effects. However, we saw that by adding a struct that makes calls to an external, we would need a way to artificially inject expected values.

In the next article of this series, we will learn different ways to write test doubles for Rust web applications to solve this problem. We’ll take a look at the differences between different types of doubles such as mocks and fakes, and how we can implement them in Rust.

If you want to be informed when futures articles come out, you can either follow me on Twitter where I’ll post about it, or subscribe to the syndication feed of this blog.

n14n.dev