SMR Drives and UASP

20 Jul 2020
A part of my workflow involves using restic to back up data onto an external hard drive. After a migration, I needed to perform a rather large ingestion into my restic repo (about 500GB). However, in the middle of the backup my drive suddenly disappeared without a trace. No drive letter, not even in Device Manager.
Thinking it was just a fluke, I unplugged the drive and plugged it back in.
It appeared normally in Explorer, and worked just fine. Ran a disk check
on it and it returned no errors. Weird. Ran the backup again, starting from
the very beginning. The drive disappeared again half an hour later.
Something was definitely wrong, so I looked at Event Viewer to see if Windows agreed with me. I saw a number of worrying warnings…
Namely, UASPStor: Reset to device, \Device\RaidPort3, was issued. Burn these words into your mind. Cursory search results pointed towards hardware failure or system instability. I didn’t buy it.
I shucked the drive out of the external enclosure, connected it directly via a SATA interface, and ran the backup again. After about half an hour, the backup was still running. But the speed had dropped to a whopping 1 MB/s. And the average access latency was into the minutes.
I stopped the backup again, frustrated. In complete silence, I tried to come up with any fathomable reason why I could not back up my files. That was when I heard a peculiar noise. The clickety-clackety sound my hard drive makes when it moves to do writes.
I opened Task Manager and checked the current write speed to the drive. Zero. It was at this moment that everything became abundantly clear.
My external drive probably uses Shingled Magnetic Recording (DM-SMR specifically). One fantastic aspect of SMR drives is that they have a much better data density than conventional PMR/CMR drives. One less fantastic aspect of SMR drives is that their write speeds are complete garbage.
SMR drives abuse the fact that you can read from tracks much narrower than you can write, by stacking them on top of each other like the shingles on a roof. This makes them great for write-once, read-many applications, since read performance is not affected. However, if you want to write even a single sector to the drive, you will have to read and write the entire zone of tracks that overlap. Your 4K write just turned into a 256MB one.
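The write amplification above can be put into numbers. This is a back-of-the-envelope sketch with assumed figures (a 256 MiB zone, a 4 KiB sector); actual zone sizes vary by drive model:

```python
# Assumed, illustrative geometry for a DM-SMR drive.
ZONE_SIZE = 256 * 1024 * 1024   # bytes per overlapping SMR zone
WRITE_SIZE = 4 * 1024           # a single 4 KiB sector write

# To update one sector, the drive must read-modify-write the whole zone,
# so the amount of data physically written balloons by this factor:
amplification = ZONE_SIZE // WRITE_SIZE
print(amplification)  # 65536
```

A 65536× amplification on the worst-case path is why random writes are so catastrophic here.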
Device-managed SMR drives (DM-SMR) make the fact that they are SMR drives completely transparent to the host system. That is, they expose a normal interface and handle all the additional work that SMR drives need to do. Since SMR drives have amazingly awful write performance, many levels of caching are usually added to improve performance. For example, a DM-SMR drive may have a fairly large PMR cache, and a relatively small DRAM cache. The drive will try to optimize the use of the PMR region and the DRAM cache to avoid having SMR writes in the critical path. When the drive is idling, the controller will slowly siphon data from the PMR cache into the SMR region, freeing up PMR cache space.
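A toy model of that behaviour, to make the failure mode concrete. Every name, size, and the simplistic two-speed scheme here is made up for illustration; real firmware is far more involved:

```python
class DmSmrDrive:
    """Toy DM-SMR model: writes are fast while they fit in the PMR
    cache, and fall off a cliff once they spill onto SMR zones."""

    def __init__(self, pmr_cache_mb):
        self.pmr_cache_mb = pmr_cache_mb  # assumed cache capacity
        self.cached_mb = 0                # data waiting to be destaged

    def write(self, mb):
        """Absorb a write and report which path it took."""
        if self.cached_mb + mb <= self.pmr_cache_mb:
            self.cached_mb += mb
            return "fast"   # lands in the PMR cache
        return "slow"       # cache full: direct SMR read-modify-write

    def idle(self, mb):
        """While idle, the controller drains the cache into SMR zones."""
        self.cached_mb = max(0, self.cached_mb - mb)


drive = DmSmrDrive(pmr_cache_mb=100)
print(drive.write(80))   # fast
print(drive.write(80))   # slow: the cache can't absorb it
drive.idle(60)           # background destaging frees cache space
print(drive.write(40))   # fast again
```

A sustained 500GB ingest never gives the controller that idle time, so the cache fills and stays full.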
What happens when the PMR cache is full? Writes go directly to the SMR drive. You know what SMR drives are really bad at? Writing. You know what SMR drives are even worse at doing? Random writing. You know what restic likes to do a lot? Yeah. The drive shits the bed so hard that the UASP controller thinks the drive is literally gone.
Whose fault is this?
- The UASPStor driver probably needs to handle timeouts better than it does right now, since apparently minutes of access latency are a real use case.
- Drive manufacturers need to label their disks with the technology they use.
- Software writers need to optimize for what spinning disks do great: sequential reads and writes.
Let block sizes be configurable, because sometimes I want to waste a bit of space if it means saving hours of time.
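To illustrate that last point, here is a minimal copy loop with a configurable block size. The function and its default are hypothetical, not restic's actual code; the idea is just that larger chunks turn many small writes into a few long sequential ones, which is exactly what a spinning (and especially an SMR) disk wants:

```python
import io
import os
import tempfile

def copy_sequential(src, dst, block_size=8 * 1024 * 1024):
    """Copy a stream in large, configurable chunks so the drive sees
    long sequential writes instead of many small scattered ones."""
    while chunk := src.read(block_size):
        dst.write(chunk)

# Usage sketch: round-trip 1 MiB of random data through a temp file.
data = os.urandom(1 << 20)
with io.BytesIO(data) as src, tempfile.TemporaryFile() as dst:
    copy_sequential(src, dst, block_size=256 * 1024)
    dst.seek(0)
    assert dst.read() == data
```

Trading a little slack space for bigger write units is cheap; trading hours of SMR destaging stalls is not.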