+++
title = "Embedded filesystem encryption on flash memory"
date = 2025-12-23
description = "My journey in the world of filesystem encryption and flash memories."
insert_anchor_links = "left"
[taxonomies]
tags = ["cryptography", "ESP32"]
[extra]
katex = true
+++
One of my long-term projects is an ESP32-based phone that uses an SD card for storage. So why not encrypt the SD card?
_In this post, we first explain the basics of filesystem encryption, then explore ways to apply it to the case of an embedded device and flash memory. This last part is quite rarely analyzed in the literature._
## Threat model
The threat model is often a bit vague when it comes to filesystem encryption, and expectations vary depending on the context. Here are the security goals (first two items) and the adversary's capabilities (last two items) we consider:
* If an adversary steals the device, they should not be able to learn anything about its content, except maybe the total size.
* If an adversary steals the device, they should not be able to make us read back plaintexts of their choice.
* The adversary can choose plaintexts and make the defender encrypt and write them. (That happens naturally when you receive a message.)
* The adversary can choose ciphertexts and write them. (Imagine you leave the SD card on your desk when going to lunch.)
## Choosing a cipher
Cryptography is relatively expensive in computation and memory, which is why some chips include hardware implementations that are far more efficient than software ones built from basic instructions. Fortunately, the ESP32 has hardware support for AES, SHA1 and SHA2, and [it is faster than software implementations](https://www.oryx-embedded.com/benchmark/espressif/crypto-esp32-s3.html). AES128 is therefore a good choice.
If we didn't have hardware AES, it would be interesting to look at lightweight ciphers like [Ascon](https://csrc.nist.gov/news/2023/lightweight-cryptography-nist-selects-ascon) that may be slower than AES but more memory-efficient, or more common ones such as ChaCha20 that can be faster in software.
## How filesystem encryption works in general
### Can't we just encrypt files?
The simplest solution would be to use a FAT filesystem and encrypt files directly. But how are files encrypted, again?
#### ECB
Block ciphers like AES process data in blocks of fixed size, for instance 16 bytes. (Note: AES128 has a 128-bit key and AES256 has a 256-bit key, but both process 128-bit blocks.) The simplest way to encrypt data is to divide it into blocks and encrypt each block separately.
<div style="text-align:center"><img alt="ECB" src="ecb.svg"/></div>
$$C = E(K, P)$$
(I love those diagrams so I made [a simple Python script](diagram.py) to generate them in SVG. I could have used existing ones from Wikimedia Commons or used Tikz, but I wanted clean SVG respecting light/dark mode.)
This mode of operation is called ECB for Electronic Code Book. It has, however, fatal flaws:
* If two blocks are identical, an adversary can spot them and learn information.
* An adversary who has access to an encryption oracle (i.e. they can obtain the ciphertext of a chosen plaintext, which could happen if you store incoming messages in the encrypted files) can try different values and decrypt other blocks by brute-forcing well-known formats.
ECB alone is almost never a good idea.
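The first flaw is easy to reproduce. Below is a minimal sketch, assuming a toy stand-in for the block cipher (`toy_encrypt_block` is hypothetical and not secure; real code would call AES). The helper `repeated_blocks` is also an illustration name: it lists what an adversary can observe without the key.
```rust
/// Toy stand-in for E(K, .): NOT a secure cipher, it only keeps the example
/// self-contained. Real code would call AES here.
fn toy_encrypt_block(key: u8, block: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, byte) in block.iter().enumerate() {
        // Mix each byte with the key and move it to another position.
        out[(i + 5) % 16] = byte.rotate_left(3) ^ key;
    }
    out
}

/// ECB: split the data into 16-byte blocks and encrypt each one independently.
fn ecb_encrypt(key: u8, data: &[u8]) -> Vec<u8> {
    assert!(data.len() % 16 == 0, "this sketch only handles whole blocks");
    let mut out = Vec::with_capacity(data.len());
    for chunk in data.chunks_exact(16) {
        let mut block = [0u8; 16];
        block.copy_from_slice(chunk);
        out.extend_from_slice(&toy_encrypt_block(key, &block));
    }
    out
}

/// Pairs of block indices whose ciphertexts are identical: this is visible
/// to anyone who can read the storage, without knowing the key.
fn repeated_blocks(ciphertext: &[u8]) -> Vec<(usize, usize)> {
    let blocks: Vec<&[u8]> = ciphertext.chunks_exact(16).collect();
    let mut pairs = Vec::new();
    for i in 0..blocks.len() {
        for j in (i + 1)..blocks.len() {
            if blocks[i] == blocks[j] {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```
With any deterministic block cipher in place of the toy one, two equal plaintext blocks produce two equal ciphertext blocks, so the structure of the data leaks.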
#### CTR
From ECB we learn that each block must be encrypted differently. One mode that solves this problem is CTR, which uses a counter to generate a secret pseudorandom mask for each block. It is like generating a long one-time pad from a short key.
<div style="text-align:center"><img alt="CTR" src="ctr.svg"/></div>
$$C = P \oplus E(K, N||i)$$
Here, $\oplus$ is bitwise XOR and $||$ is concatenation. $N$ is a nonce and $i$ is the block index (the counter).
> **Could we simply vary the key using the counter, and avoid XORing?** I tried that. Every time we change AES's key, we must recompute the key expansion, which can take roughly as long as encrypting a block. Hence we save a lot when reusing the same key for many blocks.

CTR provides privacy, as long as the same $(K, N, i)$ is used at most once. Otherwise, the attacker can compute $C \oplus C' = (P \oplus E(K, N||i)) \oplus (P' \oplus E(K, N||i)) = P \oplus P'$. If $P$ is only zeros (which is common in binary files), then $P'$ is fully known. Thus we have to update the nonce whenever we update the file. Storing one nonce per block would almost double the storage size! The nonce could be stored per file, but then we have to re-encrypt the whole file on every change. And what if the attacker reverted the storage to an older state, so that we reuse an old nonce?
This mode also has a 1-bit malleability: if the attacker flips a bit, this exact bit will be flipped at decryption.
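Both weaknesses can be reproduced in a few lines. This is a minimal sketch in which the keystream $E(K, N||i)$ is passed in as plain bytes (an assumption that keeps the example independent of any cipher); the function names are illustrative.
```rust
/// XOR `data` with a keystream of the same length. In CTR mode, keystream
/// block i would be E(K, N || i); here it is supplied directly (assumption).
fn xor_keystream(data: &[u8], keystream: &[u8]) -> Vec<u8> {
    assert_eq!(data.len(), keystream.len());
    data.iter().zip(keystream.iter()).map(|(d, k)| d ^ k).collect()
}

/// What an eavesdropper computes from two ciphertexts produced with the same
/// (K, N, i): the keystream cancels out and only P xor P' remains.
fn xor_of_ciphertexts(c1: &[u8], c2: &[u8]) -> Vec<u8> {
    xor_keystream(c1, c2)
}

/// Flip a single bit of a ciphertext in place, without knowing the key.
fn flip_bit(ciphertext: &mut [u8], byte: usize, bit: u8) {
    ciphertext[byte] ^= 1u8 << bit;
}
```
If the first plaintext is all zeros, `xor_of_ciphertexts` returns the second plaintext directly. After `flip_bit`, decryption succeeds and yields the plaintext with exactly that bit flipped.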
These two properties are fatal in our context: let's look elsewhere.
### Flash memories are delicate
#### CBC
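CBC (Cipher Block Chaining) introduces chaining: each plaintext block is XORed with the previous ciphertext block before being encrypted, and the first block is XORed with an initialization vector (IV).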
<div style="text-align:center"><img alt="CBC" src="cbc.svg"/></div>
$$C_0 = E(K, P_0 \oplus IV)$$
$$C_i = E(K, P_i \oplus C_{i-1})$$
CBC is still somewhat malleable: if the adversary can afford to scramble one block, they can trivially flip chosen bits in the next one.
CBC has another fatal property for us: modifying one block requires re-encrypting all the following blocks.
The problem is that flash memories are slow and wear out after a limited number of writes (on the order of 10k to 100k). Rewriting entire files would take too much time and wear out the device quickly.
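To make the rewrite cost of CBC concrete, here is a minimal sketch that is generic over the block cipher: `encrypt_block` stands for $E(K, \cdot)$ with the key already fixed, and `changed_blocks` is a hypothetical helper that counts what would have to be rewritten on flash.
```rust
/// CBC encryption over whole 16-byte blocks. `encrypt_block` stands for
/// E(K, .) with the key already fixed (any block cipher can be plugged in).
fn cbc_encrypt<E>(encrypt_block: E, iv: [u8; 16], blocks: &[[u8; 16]]) -> Vec<[u8; 16]>
where
    E: Fn(&[u8; 16]) -> [u8; 16],
{
    let mut previous = iv;
    let mut out = Vec::with_capacity(blocks.len());
    for plain in blocks {
        // C_i = E(K, P_i xor C_{i-1}), with C_{-1} = IV.
        let mut input = *plain;
        for (x, c) in input.iter_mut().zip(previous.iter()) {
            *x ^= *c;
        }
        previous = encrypt_block(&input);
        out.push(previous);
    }
    out
}

/// Indices of the ciphertext blocks that differ between two encryptions,
/// i.e. the blocks that would have to be rewritten on the storage.
fn changed_blocks(old: &[[u8; 16]], new: &[[u8; 16]]) -> Vec<usize> {
    old.iter()
        .zip(new.iter())
        .enumerate()
        .filter(|(_, (a, b))| a != b)
        .map(|(index, _)| index)
        .collect()
}
```
Changing a single bit in the first plaintext block changes every ciphertext block after it, whereas changing the last block only touches that one.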
#### A dedicated filesystem
Now that we've highlighted an important property of flash memories, it appears FAT32 may not be the best choice as a filesystem. Indeed, a modified block would stay at the same physical address, causing different regions of the storage to wear more rapidly than others. It would be better to spread the write operations across the entire space, in order to maximize the time before a failure happens.
[LittleFS](https://github.com/littlefs-project/littlefs) is made exactly for this purpose. Moreover, it provides atomic operations, meaning it never leaves the filesystem in an inconsistent state if a power loss or a storage failure happens during a write operation.
If we're going down to the filesystem level, why not go further? Instead of encrypting files, we can directly encrypt the filesystem's blocks by placing the cryptographic module between LittleFS and the IO. LittleFS's write length can be customized, so we can set it to our block length and avoid dealing with partial blocks, as we would have to when encrypting files. Another benefit is that we hide the file tree too: directories, names and metadata are encrypted with no additional complexity.
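The layering can be sketched as a wrapper that looks like a block device to the filesystem and forwards encrypted blocks to the real device. Everything below is hypothetical: `BlockDevice` and `PositionalCipher` are illustration traits, not the LittleFS API, and `ToyCipher` is not secure.
```rust
/// Hypothetical minimal storage interface (NOT the real LittleFS API):
/// fixed-size 16-byte blocks addressed by index.
trait BlockDevice {
    fn read_block(&self, index: usize) -> [u8; 16];
    fn write_block(&mut self, index: usize, data: [u8; 16]);
}

/// Hypothetical cipher interface; the index lets the cipher depend on the
/// position of the block, as XTS does.
trait PositionalCipher {
    fn encrypt(&self, index: usize, plain: [u8; 16]) -> [u8; 16];
    fn decrypt(&self, index: usize, cipher: [u8; 16]) -> [u8; 16];
}

/// The encryption layer: a block device for the filesystem above it,
/// a client of the raw device below it.
struct EncryptedDevice<D: BlockDevice, C: PositionalCipher> {
    inner: D,
    cipher: C,
}

impl<D: BlockDevice, C: PositionalCipher> BlockDevice for EncryptedDevice<D, C> {
    fn read_block(&self, index: usize) -> [u8; 16] {
        self.cipher.decrypt(index, self.inner.read_block(index))
    }
    fn write_block(&mut self, index: usize, data: [u8; 16]) {
        let encrypted = self.cipher.encrypt(index, data);
        self.inner.write_block(index, encrypted);
    }
}

/// In-memory device standing in for the SD card.
struct RamDevice {
    blocks: Vec<[u8; 16]>,
}

impl BlockDevice for RamDevice {
    fn read_block(&self, index: usize) -> [u8; 16] {
        self.blocks[index]
    }
    fn write_block(&mut self, index: usize, data: [u8; 16]) {
        self.blocks[index] = data;
    }
}

/// Toy cipher, NOT secure: XORs every byte with a key byte and the index.
struct ToyCipher {
    key: u8,
}

impl PositionalCipher for ToyCipher {
    fn encrypt(&self, index: usize, mut plain: [u8; 16]) -> [u8; 16] {
        for byte in plain.iter_mut() {
            *byte ^= self.key ^ (index as u8);
        }
        plain
    }
    fn decrypt(&self, index: usize, cipher: [u8; 16]) -> [u8; 16] {
        self.encrypt(index, cipher) // XOR is its own inverse
    }
}
```
Because the wrapper implements the same trait as the device it wraps, the filesystem code does not need to know that encryption is happening.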
Such a niche filesystem has the disadvantage that it's not natively supported by Linux, which makes development, debugging and even file transfer between the device and a computer more difficult. A LittleFS kernel module exists, and adding our encryption layer should be feasible.
#### XTS
We need something looking more like ECB or CTR in that it allows small random writes. XTS is a popular mode for filesystem encryption and satisfies this criterion.
<div style="text-align:center"><img alt="XTS" src="xts.svg"/></div>
$$C = E(K_1, P \oplus \Delta) \oplus \Delta$$
$$\Delta = E(K_2, i) \times \alpha^j$$
$$X \times \alpha = (X \ll 1) \oplus (MSB(X) \cdot 135)$$
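Here, the storage is divided into sectors and sectors into blocks. In the diagram, $i$ is the sector number and $j$ is the block number within the sector. $\ll$ is left bitshift and MSB is the most significant bit. The constant 135 (0x87) encodes the reduction polynomial $x^{128} + x^7 + x^2 + x + 1$, so multiplying by $\alpha$ is a multiplication by $x$ in $GF(2^{128})$.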
Why so complicated? First, $E(K_2, i)$ looks like CTR. To make it faster, it remains constant through the entire sector (which is useful because LittleFS prefers to read or write contiguous blocks when possible). Multiplication by $\alpha$ is faster than a block encryption and can be computed iteratively with $x \times \alpha^j = (x \times \alpha^{j-1}) \times \alpha$. The double XOR prevents attacks on chosen ciphertext or known plaintext as described before.
XTS has a way to deal with final partial blocks (when data length is not a multiple of block size), but as we're encrypting full blocks of 16 bytes only, we don't need that mechanism.
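Putting the formulas together, encrypting a sector could look like the sketch below. It is generic over the cipher: `e1`, `d1` and `e2` stand for $E(K_1, \cdot)$, its inverse and $E(K_2, \cdot)$. Representing blocks as `u128` is an assumption made to keep the sketch short; the byte-order conversions of the real standard are left out.
```rust
/// Multiply a 128-bit value by alpha, following the formula above.
fn mul_alpha(x: u128) -> u128 {
    (x << 1) ^ (135 * (x >> 127))
}

/// Encrypt all blocks of one sector in place.
/// `e1` stands for E(K1, .) and `e2` for E(K2, .).
fn xts_encrypt_sector<E1, E2>(e1: E1, e2: E2, sector: u128, blocks: &mut [u128])
where
    E1: Fn(u128) -> u128,
    E2: Fn(u128) -> u128,
{
    // One block encryption per sector: delta = E(K2, i) * alpha^0.
    let mut delta = e2(sector);
    for block in blocks.iter_mut() {
        *block = e1(*block ^ delta) ^ delta;
        // Next block: delta * alpha^(j+1), computed from the previous value.
        delta = mul_alpha(delta);
    }
}

/// Decrypt all blocks of one sector in place.
/// `d1` stands for the inverse of E(K1, .); `e2` is the same as above.
fn xts_decrypt_sector<D1, E2>(d1: D1, e2: E2, sector: u128, blocks: &mut [u128])
where
    D1: Fn(u128) -> u128,
    E2: Fn(u128) -> u128,
{
    let mut delta = e2(sector);
    for block in blocks.iter_mut() {
        *block = d1(*block ^ delta) ^ delta;
        delta = mul_alpha(delta);
    }
}
```
Note that $K_2$ is only ever used in the encryption direction, and that identical plaintext blocks get different ciphertexts depending on both the sector and the position inside it.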
[Rogaway 2011](https://www.cs.ucdavis.edu/~rogaway/papers/modes.pdf) criticized XTS on multiple points.
* XTS is based on a modified version of Rogaway's XEX mode (XOR-Encrypt-XOR) which has well understood security properties.
* Ciphertext stealing, the way to deal with final partial blocks, is poorly designed or at least not proven secure under well-defined security goals. Again, we are not concerned.
* The use of two different keys is unjustified, except that it makes proofs easier. If the sector number $i$ is XORed with a secret random salt, there is no risk of collision between the inputs of the two cipher blocks, as long as we do not store ciphertexts of the secret key or the salt (they should be user inputs stored in volatile memory only).
* It is approved by NIST (SP 800-38E), but only specified in an IEEE standard (IEEE 1619) that is seemingly not available publicly (unless using Sci-Hub of course).
* In the original definition, $\Delta$ is byte-swapped to make implementation easier on little-endian machines, but this has no security implications.
## Benchmarking ciphers
I implemented XTS in Rust and ran a benchmark on the ESP32. As the multiplication by powers of alpha can be implemented in many ways, I also tried different versions.
First version, delta is an unaligned array of bytes, cast to u128 to do the maths:
```rust
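fn mul_delta_u128(delta: &mut [u8; 16]) {
    // Reinterpret the 16 bytes as one big-endian 128-bit integer.
    let mut delta1 = u128::from_be_bytes(*delta);
    // Shift left by one bit; if the top bit falls off, fold it back in as 135.
    delta1 = (delta1 << 1) ^ (135 * (delta1 >> 127));
    *delta = delta1.to_be_bytes();
}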
```
However, the ESP32's registers are only 32 bits wide, so we have to trust the compiler to implement u128 efficiently. Using [cargo-show-asm](https://crates.io/crates/cargo-show-asm), I see that the above function's assembly is 151 lines long and only operates bytewise. We can do better.
Switching to little endian (`from_le_bytes`, `to_le_bytes`) improves this quite a bit, down to 82 lines, but still produces byte operations only. We can still do better!
To ensure the compiler can work efficiently with u128, we can use u128 from the start and avoid casting, so that the value is properly aligned:
```rust
fn mul_delta_u128_aligned(delta: &mut u128) {
    // Same arithmetic as before, without the byte-array conversion.
    *delta = (*delta << 1) ^ (135 * (*delta >> 127));
}
```
Finally we can try implementing the details ourselves:
```rust
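fn mul_delta_u32(delta: &mut [u32; 4]) {
    // delta[0] is the least significant word, delta[3] the most significant.
    // The bit shifted out of the top word is folded back in as 135.
    let term = (delta[3] >> 31) * 135;
    // Shift each word left, carrying in the top bit of the word below it.
    delta[3] <<= 1;
    delta[3] ^= delta[2] >> 31;
    delta[2] <<= 1;
    delta[2] ^= delta[1] >> 31;
    delta[1] <<= 1;
    delta[1] ^= delta[0] >> 31;
    delta[0] <<= 1;
    delta[0] ^= term;
}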
```
The last two implementations each compile to 22 lines of assembly, using 32-bit operations.
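Before comparing speeds, it is worth checking that the hand-written version computes the same thing as the u128 one. A minimal sketch of such a check follows; `to_limbs` and `first_mismatch` are hypothetical helpers, and the limb order (least significant word first) is the one `mul_delta_u32` assumes.
```rust
fn mul_delta_u128_aligned(delta: &mut u128) {
    *delta = (*delta << 1) ^ (135 * (*delta >> 127));
}

fn mul_delta_u32(delta: &mut [u32; 4]) {
    let term = (delta[3] >> 31) * 135;
    delta[3] <<= 1;
    delta[3] ^= delta[2] >> 31;
    delta[2] <<= 1;
    delta[2] ^= delta[1] >> 31;
    delta[1] <<= 1;
    delta[1] ^= delta[0] >> 31;
    delta[0] <<= 1;
    delta[0] ^= term;
}

/// Split a u128 into four 32-bit words, least significant first.
fn to_limbs(x: u128) -> [u32; 4] {
    [x as u32, (x >> 32) as u32, (x >> 64) as u32, (x >> 96) as u32]
}

/// Run both implementations side by side for `steps` multiplications and
/// return the first step at which they disagree, if any.
fn first_mismatch(seed: u128, steps: usize) -> Option<usize> {
    let mut wide = seed;
    let mut limbs = to_limbs(seed);
    for step in 0..steps {
        mul_delta_u128_aligned(&mut wide);
        mul_delta_u32(&mut limbs);
        if to_limbs(wide) != limbs {
            return Some(step);
        }
    }
    None
}
```
Starting from the seed 1, the top bit is first shifted out after 128 steps, so running a few hundred steps exercises both the plain shift and the reduction by 135.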
Here are the benchmark results (encrypting 128kB, 100 times):
| Mode | Implementation | Sector size (blocks) | Time (ms) (1 key) | Time (ms) (2 keys) |
| ---- | ----------------- | -------------------- | ----------------- | ------------------ |
| ECB | - | - | 744 | - |
| XTS | unaligned u128 BE | 8 | 2770 | 2774 |
| XTS | unaligned u128 BE | 16 | 2664 | 2775 |
| XTS | unaligned u128 BE | 32 | 2637 | 2749 |
| XTS | unaligned u128 BE | 64 | 2624 | 2736 |
| XTS | unaligned u128 LE | 8 | 2499 | 2448 |
| XTS | unaligned u128 LE | 16 | 2399 | 2445 |
| XTS | unaligned u128 LE | 32 | 2373 | 2420 |
| XTS | unaligned u128 LE | 64 | 2361 | 2408 |
| XTS | aligned u128 | 8 | 2549 | 2447 |
| XTS | aligned u128 | 16 | 2449 | 2445 |
| XTS | aligned u128 | 32 | 2424 | 2420 |
| XTS | aligned u128 | 64 | 2412 | 2408 |
| XTS | [u32; 4] | 8 | 2495 | 2493 |
| XTS | [u32; 4] | 16 | 2395 | 2490 |
| XTS | [u32; 4] | 32 | 2370 | 2465 |
| XTS | [u32; 4] | 64 | 2357 | 2453 |
Among the XTS variants, the fastest is the one-key version (with salted sector number) with long sectors.
Sectors must not be too long, however, as random access to block $j$ requires computing all $j$ successive powers of $\alpha$. 32 blocks may be a good value, as it matches flash erase size.
## The key
### Deriving the key from a password
AES128 needs a 128-bit key, but the user will only remember ASCII words, not fully random bytes. We need a way to derive a key from a variable-length password. We could just compute a hash of the password, as the ESP32 provides a hardware implementation of SHA2, but for password-based keys it is better to use a dedicated function that is fast enough to run once but expensive to brute-force, even on optimized systems.
[PBKDF2](https://fr.wikipedia.org/wiki/PBKDF2) chains thousands of calls to a hash function, each one depending on the previous one, so a single derivation cannot be parallelized. However, an attacker can still run thousands of independent instances in parallel on a GPU or a cryptocurrency-mining chip.
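The chaining can be sketched as follows. This is one output block of PBKDF2 (the function F of RFC 8018), written generically over the pseudorandom function: `prf(key, message)` would normally be HMAC-SHA256, but it is left as a parameter here so that the sketch needs no external crate. The 32-byte output size is an assumption matching SHA256.
```rust
/// One output block of PBKDF2, generic over the pseudorandom function.
/// `prf(key, message)` is an assumption standing in for HMAC-SHA256.
fn pbkdf2_block<F>(
    prf: F,
    password: &[u8],
    salt: &[u8],
    iterations: u32,
    block_index: u32,
) -> [u8; 32]
where
    F: Fn(&[u8], &[u8]) -> [u8; 32],
{
    assert!(iterations >= 1);
    // U_1 = PRF(password, salt || block_index as 4 big-endian bytes)
    let mut message = salt.to_vec();
    message.extend_from_slice(&block_index.to_be_bytes());
    let mut u = prf(password, &message[..]);
    let mut result = u;
    // U_n = PRF(password, U_{n-1}): each call needs the previous output,
    // which is why one derivation cannot be spread over several cores.
    for _ in 1..iterations {
        u = prf(password, &u[..]);
        for (r, x) in result.iter_mut().zip(u.iter()) {
            *r ^= *x;
        }
    }
    // The block is the XOR of all intermediate values U_1 .. U_c.
    result
}
```
For a 128-bit AES key, the first 16 bytes of block 1 would be enough. Nothing in the loop uses much memory, which is what makes running many instances in parallel cheap for an attacker.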
A popular choice as of today is [Argon2](https://en.wikipedia.org/wiki/Argon2), which is memory-hard: one instance requires efficient access to a big amount of memory, potentially megabytes or even gigabytes, so it is difficult to optimize even on dedicated hardware. The problems are that its implementation is quite complicated (it would take too much ROM) and its specs are not even complete.
[Catena](https://www.researchgate.net/publication/261548591_The_Catena_Password_Scrambler) is a scheme with similar properties but with a very simple description. It takes less than 50 lines of Rust. To run it on the ESP32 (and its 256kB RAM), I used SHA256 and set its memory usage to 128kB with 1024 iterations. In comparison, recommended parameters are between 67MB and 1GB with 3 or 4 iterations. It runs in 911ms. We can expect a speedup of more than 10× on a good CPU, and it can still be parallelized easily on an old GPU: if your GPU has 1GB of RAM, it can hold up to 8192 parallel instances.
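The number of parallel instances follows directly from the memory sizes, assuming binary units (1GB = $2^{30}$ bytes, 128kB = $2^{17}$ bytes) and ignoring any other memory the GPU needs:
$$\frac{1\ \text{GB}}{128\ \text{kB}} = \frac{2^{30}}{2^{17}} = 2^{13} = 8192$$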
The benefit of password hashing functions on the ESP32 is a bit disappointing: we only slow down attacks by a small factor. It seems easier to enforce strong passwords. Picking 10 random words from the [BIP39](https://github.com/bitcoin/bips/blob/04b448b599cb16beae40ba9a98df9f262da522f7/bip-0039/english.txt) wordlist gives $\log_2(2048^{10})=110$ bits of entropy. To make it faster to type, each word can be shortened to its first 4 letters without losing entropy.
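Both claims are easy to check with a few lines of code. The sketch below uses hypothetical helper names: one computes the entropy of a passphrase of uniformly and independently drawn words, the other verifies that truncating words to their first `n` letters keeps them distinct, which is the condition for losing no entropy.
```rust
use std::collections::HashSet;

/// Entropy in bits of a passphrase made of `word_count` words drawn
/// uniformly and independently from a list of `wordlist_size` words.
fn passphrase_entropy_bits(wordlist_size: u32, word_count: u32) -> f64 {
    f64::from(word_count) * f64::from(wordlist_size).log2()
}

/// True if truncating every word to its first `n` characters keeps all of
/// them distinct, in which case the shortened words carry the same entropy.
fn prefixes_are_unique(words: &[&str], n: usize) -> bool {
    let mut seen = HashSet::new();
    words
        .iter()
        .all(|word| seen.insert(word.chars().take(n).collect::<String>()))
}
```
To check the shortening claim for real, `prefixes_are_unique` would be run with `n = 4` on the full 2048-word list rather than on a sample.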
### Storing the key
It can be useful to use two keys: the first one, derived from the password, is used to encrypt the second key, which is written to the storage. The second key is used to encrypt the filesystem. This way, the password can be changed without re-encrypting the data, as the second key does not depend on it. If you have to destroy the data in a hurry and you have reason to think someone with a gun may force you to hand over the password, you just have to erase the stored key.
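As a rough sketch of this scheme, the stored key can be modelled as a small "key slot". `KeySlot` and its methods are hypothetical names, and `encrypt(key, block)` / `decrypt(key, block)` stand for one block encryption or decryption (for instance AES128), passed in as parameters.
```rust
/// Hypothetical on-storage key slot: the filesystem key, encrypted under
/// the key derived from the password.
struct KeySlot {
    wrapped_key: [u8; 16],
}

impl KeySlot {
    /// Wrap a freshly generated filesystem key under the password key.
    fn create<E>(encrypt: E, password_key: &[u8; 16], fs_key: &[u8; 16]) -> KeySlot
    where
        E: Fn(&[u8; 16], &[u8; 16]) -> [u8; 16],
    {
        KeySlot { wrapped_key: encrypt(password_key, fs_key) }
    }

    /// Recover the filesystem key. There is no authentication here: a wrong
    /// password silently yields a wrong key.
    fn unlock<D>(&self, decrypt: D, password_key: &[u8; 16]) -> [u8; 16]
    where
        D: Fn(&[u8; 16], &[u8; 16]) -> [u8; 16],
    {
        decrypt(password_key, &self.wrapped_key)
    }

    /// Changing the password only rewrites these 16 bytes, not the data.
    fn change_password<E, D>(
        &mut self,
        encrypt: E,
        decrypt: D,
        old_key: &[u8; 16],
        new_key: &[u8; 16],
    ) where
        E: Fn(&[u8; 16], &[u8; 16]) -> [u8; 16],
        D: Fn(&[u8; 16], &[u8; 16]) -> [u8; 16],
    {
        let fs_key = self.unlock(decrypt, old_key);
        self.wrapped_key = encrypt(new_key, &fs_key);
    }

    /// Emergency erase: once the slot is overwritten, knowing the password
    /// no longer helps to recover the filesystem key.
    fn erase(&mut self) {
        self.wrapped_key = [0u8; 16];
    }
}
```
On real flash, overwriting a value in place may leave the old copy in another physical location, so the erase step would need care at the storage level.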
## Active attacks and authentication
Assumptions and security goals about malleability are debatable. Lack of authentication allows many attacks which are inherently hard to counter when encrypting a filesystem.
If an adversary **steals your device**, they may copy your encrypted data before handing it back to you. They may also install a keylogger in the program memory. In this case, you should ideally copy your data, destroy the potentially compromised device and set up a fresh one. One motivation to still consider defending against this attack is that, in our context, the executable code is stored in the ESP32 while the data is on the SD card, so it is possible that the SD card gets compromised while the ESP32 stays in your pocket.
**Replay attacks** are trivial. XTS prevents copying a block from one place to another without scrambling its content, but nothing prevents it from being copied through time: the adversary makes a copy of block N one day, you write newer data to block N, the adversary rewrites the old data to block N, and you have no way to detect the attack because the block is valid. LittleFS coincidentally mitigates this problem, because when modifying a block, it writes the new data to an unused block and modifies the link that points to it, so the old one is now unused. The old block will only be used again after some time, to equalize wear through the entire storage. This requires replay attacks to be more subtle but doesn't make them impossible.
**Data can be scrambled.** Altering encrypted blocks will produce valid garbage plaintexts, which may or may not be detected, depending on what files or filesystem structures are affected. Again LittleFS partly mitigates this issue, because every bit of data is covered by a checksum. A checksum is not a cryptographic tool as it has low entropy and is malleable, and its goal is to detect hardware faults, not attacks. However, as XTS is not bitwise malleable, the checksum may help make active attacks harder, as a scrambled block can be marked as faulty.
**Why not authenticate?** We could write authentication tags along with the data (e.g. AES-GCM, HMAC), but that would be very expensive to compute. It would also break the 1:1 correspondence between ciphertext blocks and plaintext blocks, which is vital to performance. We would need either to write all authentication tags to a different partition (outside the filesystem, hence causing performance issues), or to make encryption part of the filesystem itself, which is a lot of work.
## Conclusion
For my project, I will go on with LittleFS over AES128-XTS. Deciding between the one-key and two-key variants will need benchmarking on a more realistic setup. I would also like to measure energy consumption to complement the running time benchmarks, and to decide whether Catena or PBKDF2 is worth it.
If you want to know more about filesystem encryption in general, here is [a quick presentation](fse.pdf) I made. [CryptSetup's FAQ](https://gitlab.com/cryptsetup/cryptsetup/-/wikis/FrequentlyAskedQuestions) is also a great source of information for non-cryptographers.