How does Rust compare to Python (and other programming languages)?
It's important to know how Rust compares to Python if you are considering switching some of your workloads to Rust. Using Rust instead of Python is a tough sell, especially for things like data engineering. For any data problem you can think of, there is most likely a Python implementation somewhere that provides part of the solution, if not all of it.
So why Rust? Let's have a look.
Syntax and readability
With enough precautions, one can write Rust code that almost reads like Python code. I documented many examples in the initial article that kickstarted this website on my personal blog.
It is no secret, though, that Rust code will be a bit more verbose: it's statically typed (function signatures must spell out their types, even if local variables are often inferred), requires a lot of punctuation (semicolons, curly braces, ...) and leverages borrowing, which at times makes it tedious to write and read. However, these things provide you with guarantees of performance and safety.
Writing a similar program in Python, with the same features and guarantees would most certainly yield code that is hard to read. Python's clarity comes from the happy path that it assumes it is always in: it will gladly run wrong code.
Writing and reading Rust definitely needs some getting used to, but the ROI (return on investment) is significant.
Here's an example of a function that concatenates an array in both Python and Rust:
# Python
def main(arr):
    concatenated = "".join(arr)
    print(concatenated)

main(["Hello", "world", "!"]) # works
main("Hello there") # works too
// Rust
fn main() {
    let arr = ["Hello", "world", "!"];
    let concatenated = &arr.join("");
    println!("{:?}", concatenated);
}
You'll notice that the Rust function is a bit more verbose than the Python one. But here's the thing: the Python function still runs if it is given a string as input, whereas the Rust code would not even compile. Don't believe me? Edit the code and try it out.
If you were to change the Python method to provide the same guarantees, this is the code you'll need:
# Python
from typing import List

def main(arr: List[str]) -> None:
    if not isinstance(arr, list):
        raise TypeError("Expected 'arr' to be a list")
    concatenated = "".join(arr)
    print(concatenated)

main(["Hello", "world", "!"]) # works
main("Hello there") # doesn't work
So who's verbose now? There are a lot of things more to do in Python to get the same level of guarantees that you'd have with Rust.
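For a closer side-by-side, here is a minimal sketch of the Rust version written as a function that takes its input as a parameter. The name `concat` is a hypothetical helper introduced for illustration, not something from a library. The point is that the type in the signature does the work of the `isinstance` check above, and it does so at compile time.

```rust
// Hypothetical helper for illustration: joins string slices into one String.
fn concat(parts: &[&str]) -> String {
    parts.join("")
}

fn main() {
    // Works: an array of string slices coerces to `&[&str]`.
    println!("{}", concat(&["Hello", "world", "!"]));

    // Does not compile if uncommented: a plain `&str` is not a `&[&str]`.
    // println!("{}", concat("Hello there"));
}
```

No runtime check is needed here: the wrong call is rejected before the program can be built.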
Dynamic vs static typing
Now there's a case that can be made for Python and typing, since Python has some support for optional typing, as previously shown. However, from experience, introducing types to Python makes it extremely verbose and awkward to read.
Types on the Python side are essentially just type hints, which can be circumvented at runtime if needed. This makes Python at best a dynamically typed language, even if static type checkers for Python exist (mypy, for example) that can be used to approach the benefits of static type checking. Even if there is a benefit to adding static type checkers, using them feels like a limitation in a language like Python, where the main benefit is flexibility and development speed over strict type safety or absolute correctness. Besides, they are yet another tool you have to maintain and create a configuration file for. Typing in Python feels like shoehorning, at least to me.
Don't just take my word for it, here's what the creator of Flask thinks about typing in Python:
So what's the alternative here? I think you guessed it, Rust!
Rust on the other hand is statically typed, meaning types are enforced during compile time and cannot be changed on the fly. Variables that are declared with a specific type can only hold values of that type and errors are caught early on before the code even runs. In general, having typing as a main component in a language instead of an add-on is preferred since enforcement of a good practice is safer than suggestion of a good practice: it's on you and the team if somehow someone forgot to add a type to a variable in Python and it caused an error - and this will happen, eventually. Knowing that all tools can fail, it's better to catch most of the failures during compilation, before the code is even built, rather than afterwards.
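As a small illustration of what "enforced during compile time" means in practice, here is a sketch under simple assumptions. The functions `double` and `parse_and_double` are hypothetical names made up for this example. The commented-out lines are the ones the compiler would reject.

```rust
// Hypothetical helper for illustration: only accepts 64-bit integers.
fn double(n: i64) -> i64 {
    n * 2
}

// Turning text into a number is an explicit step that may fail,
// so the failure case shows up in the return type.
fn parse_and_double(text: &str) -> Option<i64> {
    text.trim().parse::<i64>().ok().map(double)
}

fn main() {
    let count: i64 = 21;
    println!("{}", double(count));

    // Neither of these compiles if uncommented (mismatched types):
    // let wrong: i64 = "21";
    // double("21");

    println!("{:?}", parse_and_double("21"));
    println!("{:?}", parse_and_double("twenty-one"));
}
```

The mistake of passing text where a number is expected never reaches a running program; the only way through is an explicit conversion that forces you to decide what happens on bad input.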
Examples of other statically typed languages are Java, C++ and Go. I'll add some comparisons to Python too, with interactive examples down the line.
Benchmarking & efficiency
Now I long wondered why types are even useful and why anyone would bother with them. Using types is a tradeoff between code flexibility and code readability, but it doesn't seem to bring much at first. Adding types all over the place can make the code difficult to read, and also less flexible to change if you used an incorrect type: you may need to refactor quite a bit if, for example, your values grow beyond the capacity of their current type (type widening from int32 to int64).
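To make the widening concern concrete, here is a minimal sketch. The helper names `sum_i32` and `sum_i64` are hypothetical. It assumes you want overflow to be detected rather than silently wrapped, which is why the 32-bit version uses `checked_add`.

```rust
// Sums in 32 bits; returns None as soon as the total no longer fits.
fn sum_i32(values: &[i32]) -> Option<i32> {
    values.iter().try_fold(0i32, |acc, &v| acc.checked_add(v))
}

// Same input, but each value is widened to 64 bits before summing.
fn sum_i64(values: &[i32]) -> i64 {
    values.iter().map(|&v| i64::from(v)).sum()
}

fn main() {
    let values = [i32::MAX, 1];
    println!("32-bit sum: {:?}", sum_i32(&values));
    println!("64-bit sum: {}", sum_i64(&values));
}
```

The refactoring cost is real (the return type changes, and so does every caller), but the compiler lists every place that needs to follow.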
The case then needs to be made why types help.
Code runs on computers, and how computers work is extremely well understood. What is still an ongoing effort is how to convert abstractions (human readable code) into machine readable code in the most efficient way. This is why there are so many different programming languages. They all cater to different developer audiences and to different use cases; what they have in common is converting human thoughts into things machines can run. During this conversion, many optimisations can be made and obvious bugs caught, provided we can already predict what the code will be doing.
An example: You can think of a compiler like an architect who designs and reviews blueprints for a house before construction starts. The architect can spot errors or inefficiencies in the plans early on and get them corrected before any physical work begins. This avoids wasting time and money on implementation that would not meet requirements or pass inspections. With a solid plan in place upfront, the actual construction can also proceed more efficiently and effectively due to the level of detail in the blueprints. Dynamic/loosely typed "building" lacks these types of careful upfront checks and optimizations.
Knowing types in advance can help the compiler prepare and implement these optimisations for you. In essence, you can consider the code you write almost like a big configuration file for the compiler. The compiler is itself a piece of code purposely made to efficiently translate human input into maximally efficient machine code. When I learned this my approach to programming changed quite a bit.
All of this is of course only as good as our ability to model and predict things in advance. At least we can give guarantees for the things that are predicted and stay flexible/robust to change. But maybe to drive the point home, an example is in order. Let's take the example of summing all prime numbers between 1 and 1,000,000.
Here is the Rust code:
fn is_prime(n: u32) -> bool {
    if n <= 1 {
        return false;
    }
    for i in 2..(n as f64).sqrt() as u32 + 1 {
        if n % i == 0 {
            return false;
        }
    }
    return true;
}

fn main() {
    let mut sum: u64 = 0;
    for i in 1..1000000 {
        if is_prime(i) {
            sum += i as u64;
        }
    }
    println!("Sum: {}", sum);
}
and here is the Python code:
import math

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

sum = 0
for i in range(1, 1000000):
    if is_prime(i):
        sum += i
print("Sum:", sum)
To benchmark the Rust code, you can run cargo bench, but we're not going to use it right now. To keep things simple and get a rough estimate, we'll just use the Unix time tool.
Below are the execution times for the two pieces of code above. The important thing to note is the relative time difference and not the absolute numbers since those will be different on different machines.
# Rust
time ./rustprimes # 0.50s user 0.00s system 74% cpu 0.673 total
# Python
time python3 pythonprimes.py # 2.25s user 0.02s system 99% cpu 2.270 total
Given relatively similar code, the Rust code is a clear winner meaning the Rust compiler did a great job optimising the input it was given.
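If you would rather measure from inside the program than with the time tool, a rough sketch using `std::time::Instant` from the standard library could look like the following. The helper `sum_primes_below` is a hypothetical name introduced here, and the sketch assumes an optimised build (for instance `cargo run --release`); a debug build will be noticeably slower. Like the time tool, this gives a rough estimate rather than a rigorous benchmark.

```rust
use std::time::Instant;

fn is_prime(n: u32) -> bool {
    if n <= 1 {
        return false;
    }
    // Trial division up to the integer square root of n.
    let limit = (n as f64).sqrt() as u32;
    (2..=limit).all(|i| n % i != 0)
}

// Hypothetical helper: sums all primes strictly below `upper`.
fn sum_primes_below(upper: u32) -> u64 {
    (1..upper).filter(|&i| is_prime(i)).map(u64::from).sum()
}

fn main() {
    let start = Instant::now();
    let sum = sum_primes_below(1_000_000);
    let elapsed = start.elapsed();

    println!("Sum: {}", sum);
    println!("Elapsed: {:?}", elapsed);
}
```

Measuring only the computation leaves out process start-up, so the number will not match the time output exactly.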
Now you might object that I'm not comparing the Rust code to a typed version of the Python code:
import math

def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

sum: int = 0
for i in range(1, 1000000):
    if is_prime(i):
        sum += i
print("Sum:", sum)
And guess what? The results are the same:
# Python
time python3 pythonprimestyped.py # 2.24s user 0.02s system 99% cpu 2.257 total
This makes the case that adding types to Python brings no runtime benefit (the interpreter does not use the hints to optimise anything), besides the remote (unsubstantiated) eventuality that it will make it easier for large teams to collaborate on the code base.
All of this to drive the point home that Rust is geared and designed from the ground up for leveraging types (amongst other things) to be efficient, as opposed to Python where types are rather more of a convenience.
Comparing with other data engineering languages
That was Python. Now how does Rust fare compared to other programming languages that are compiled and typed? The following is not meant to be an exhaustive list but rather a comparison of languages that are commonly used in data engineering tasks.
Amongst programming languages usually used for data engineering, either in the development of the big data systems themselves or as interfaces to using those tools, we mainly find: Java, Scala, Go and C++.
C++ might be surprising, but has been used with quite some success for developing the big data systems themselves (Redpanda, DuckDB).
The following is completely biased, but informed by my personal experience throughout the years.
Java/Scala
Java and Scala are currently the most used languages for building big data systems. Many frameworks, tools and libraries are written in and support Java & Scala out of the box. Amongst these are notably Apache Spark, Apache Hadoop, Apache Flink, Apache Beam and many more. Java offers good performance and portability, and can handle large scale projects. Java is however very verbose, which can make it difficult to read, write and maintain. This is why Scala is sometimes presented as a good and more concise alternative to Java that still runs on the JVM.
From my personal experience I can say:
- Scala has a very steep learning curve
- Projects written in Java very quickly grow in complexity, making them extremely difficult to operate and maintain (Kafka is an example of such a behemoth)
- Java projects require a lot of boilerplate code
- Garbage collection in both can cause unnecessary latency
- Java has a larger community than Scala, easier to find help
Both Java and Scala can do the job but they are more tailored towards enterprise settings, where robustness largely outweighs delivery speed. Generally speaking, for every Java project you can find a huge company (or cloud provider) that offers support and services for the tool, since it requires a lot of resources to just understand, let alone maintain, what is going on. It's also always a good thing for an enterprise to have someone else responsible for these things and file the costs incurred by these big systems under operative costs.
Depending on your team setup, they might be a good choice. But I'm here to convince you otherwise, right? ;)
Go
Go is a very interesting programming language. It features fast compile times and has built-in concurrency support. Besides, it's - relatively - simple to learn. In my opinion, it's a great language for things like microservices where there is a lot of chatter/traffic over a network. It has limited library support for data processing, though.
I'd recommend Go for data scrapers or for tasks where it's only important to fetch data from an API/Database for example, but not for data transformation.
C++
I've used C++ quite a lot growing up and it's been one of the first programming languages - with Java - that I learned. It features high performance, low level control over hardware and has many libraries for efficient data processing. DuckDB, Redpanda and other tools are written in C++.
It has a steep learning curve, that's for sure. On top of that, it requires manual memory management and is prone to memory leaks and segmentation faults.
Most of these issues are properly addressed in Rust.
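As a rough illustration of how Rust approaches this, here is a minimal sketch of ownership and borrowing. The functions `total_len` and `consume` are hypothetical names for this example. Memory is released automatically when its owner goes out of scope, and using a value after it has been handed over is rejected by the compiler instead of failing at runtime.

```rust
// Borrows the rows: the caller keeps ownership and can keep using them.
fn total_len(rows: &[String]) -> usize {
    rows.iter().map(|r| r.len()).sum()
}

// Takes ownership of the rows: they are freed when this function returns.
fn consume(rows: Vec<String>) -> usize {
    rows.len()
}

fn main() {
    let rows = vec![String::from("a,b"), String::from("c,d,e")];

    let chars = total_len(&rows); // borrow: `rows` is still usable afterwards
    let count = consume(rows);    // move: `rows` is gone after this line

    println!("{} characters in {} rows", chars, count);

    // Does not compile if uncommented: `rows` was moved into `consume`.
    // println!("{}", rows.len());
}
```

There is no explicit free or delete anywhere in this code, and no garbage collector running in the background either.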
Tooling
The tooling and ecosystem around a programming language are extremely important. They can either stand in your way or boost your efficiency.
For most programming languages (except perhaps Scala), you need to stitch multiple tools together to get something resembling a working environment. For instance, in Python, to get typing, tests and package management you'll need at least:
- mypy: For typing
- pip: Package management for Python, or Poetry or pipenv or or or ... (the investigation is still open on this one)
- venv: virtual environment to not pollute your system libraries
- pytest: Testing framework - let's be real, very few people use the integrated unit test library ;)
- ...
These are all things that need to be properly maintained and configured to provide a setup that is robust in a team setting. Some might say it's not needed, we can abstract all of it away by introducing yet another tool like Docker to hide away the complexity in a CI/CD pipeline, somewhere. But let's be real, it's a lot. Although these libraries are great and do their job pretty well, they add friction to something that should be smooth (this is written in 2023) especially for beginners.
The real challenge is shoveling & processing data, not yak shaving around tooling.
In Rust's case, the tooling is designed to make the development process as smooth and efficient as possible. Most of the work you'll do involves using Cargo, which is Rust's package manager and build tool. It comes bundled with Rust. It even makes it easy to build and bundle projects written in Rust. Cargo also provides a simple way to manage different versions of a package, so you can easily switch between different versions of a dependency without worrying about conflicts.
One of the major advantages of Cargo is its ability to automatically build and link native libraries, which can be a complex task in other programming languages. This means that with Rust and Cargo, you can easily build and distribute cross-platform applications with native performance.
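To give an idea of what "bundled" means for testing, here is a small sketch of a source file that carries its own unit test. The function `normalize` and the module name `normalize_tests` are made up for this example. Assuming a standard Cargo project, `cargo run` executes `main` and `cargo test` picks up the test with no extra framework to install or configure.

```rust
// Hypothetical helper for illustration: trims whitespace and lowercases.
pub fn normalize(name: &str) -> String {
    name.trim().to_lowercase()
}

// Compiled only when running `cargo test`.
#[cfg(test)]
mod normalize_tests {
    use super::*;

    #[test]
    fn trims_and_lowercases() {
        assert_eq!(normalize("  Rust "), "rust");
    }
}

fn main() {
    println!("{}", normalize("  Cargo "));
}
```

Compare this with the Python list above, where the test runner is a separate dependency to pick, install and keep up to date.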
It's not all perfect; one small caveat persists: you sometimes get long compile times when compiling Rust code. So that's that.
There's a lot more to unpack over the next chapters, but for an overview it suffices to say that Rust's tooling has been purposefully built to help with programming and not added as an afterthought.
When to use which?
As my favorite tax consultant would say: "it depends".
This is not a question that can usually be answered easily although my goal with this website is to motivate you to try Rust out and see if it works for you.
These considerations might help in your decision, on top of the things we have already covered. They are mostly covering why you should use Rust and are not meant to be applicable in absolutely every case, your mileage may vary.
The team's capabilities
A very important factor to consider first is how the team is set up and what capabilities are at the team's disposal. If you already have a team of Rust developers, we wouldn't be having this discussion, as Rust would be the obvious choice. If the team is mostly composed of Java or Python developers, it might be tough to convince them to try something new. Refer them to this website and I hope to do a good enough job of getting them comfortable with learning something new by building many of the things they are accustomed to.
Either way, if the team is open minded in terms of what tool to use, Rust can be a good candidate to try out, even as a first MVP (minimum viable product) and compare with a Python one. I'll share more resources as we go, of teams who made the jump and their conclusions.
The budget & constraints
If you have a lot of budget, work in a big enterprise with many teams and a lot of legacy systems you need to interface with, it will be very difficult to avoid Java, unless you're the one calling the shots on a new greenfield project. The question in big companies is usually "who's going to maintain this" and as long as there are no "big Rust shops", it might be a risky bet.
Even if the case can be made that using Rust will make things easier to maintain in the long run, it remains to be proven. What works for one project might not work on another so it's important to take into consideration how much wiggle room there is to try out something new.
The scope & goal
Beyond the points mentioned above, the most important point that should dictate which tools you use is what you actually want to build.
If you're building a landing page, Rust might not be the best tool.
If you're building a data stream processing pipeline and are hitting road blocks, or high costs, with other methods, Rust might well help.
Define your goal and run some tests and comparisons: ultimately, the best tool for the job is the tool that gets the task done within your context and unique setting.
From my personal experience
I'll leave you with this though: if you were to rank the different programming languages (from 1 = good to 4 = less good) on system maintainability, ease of use, flexibility to change and performance, I'd place them as follows (this is very subjective and based on my personal experience):
Language | Maintainability | Ease of use | Flexibility to change | Performance |
---|---|---|---|---|
Rust | 1 | 3 | 3 | 2 |
Java/Scala | 2 | 2 | 2 | 3 |
Python | 3 | 1 | 1 | 4 |
C++ | 4 | 4 | 4 | 1 |
Some comments on this table:
The numbers are completely subjective and aim to provide relative comparison metrics instead of absolute ones. The numbers will look completely different for somebody else, especially depending on the years of experience with those languages. Over the following chapters, I'll provide more color to these metrics and explain a little bit more. For now, the main takeaways are:
- Rust systems are more maintainable than Java, Python or C++ ones, but place second after C++ in terms of performance, even if they come relatively close in different benchmarks
- Python is the language that is the easiest to use and the easiest to change (meaning changing or updating the code)
- Java and Scala are still strong contenders, all things considered.
This table could be extended with further criteria, such as how resource-efficient each language is. That will however be the topic of a longer chapter later on.
I think we've covered a lot so far without even discussing programming language efficiency and resource usage, which are also important. It's time we get going and start programming, don't you think? It's the only way to really get an idea of what we're dealing with.
Let's make a small recap now