Fixing slow AWS uploads
I usually store private datasets on my NAS. This lets me do a significant amount of prototyping locally; you can do a lot with a 10 Gbps network and a modern laptop1. But sometimes you really do need 48 or 96 CPUs chugging on a problem at the same time.
For non-GPU work I usually spin up a beefy box on AWS or GCP. But as I was rsyncing a particularly big set of files, I walked away for a coffee and came back to upload speeds of around 2 MB/s. Sometimes it would drop to 500 kbps and sometimes jump to 10 MB/s, but it never pushed much higher. Let's see if you can spot the issue right off the bat:
rsync -av --progress \
  --exclude='.*' \
  -e "ssh -i ~/.ssh/primary-laptop.pem" \
  /Volumes/Common_Drive/dataset \
  <user>@<ec2-host>:~/dataset/
If you can, then congratulations! No need for this blog post. But if you can't, these results just don't make any sense:
- Client has a 10 Gbps symmetric fiber connection (Sonic in SF); a quick iperf3 sanity check, sketched after this list, is one way to rule out the raw network
- Powerful EC2 instance: a c5d.12xlarge, so neither the network nor the CPU used by rsync should slow things down
- The box has an SSD
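Before blaming anything exotic, it's worth ruling out the raw network path. A minimal sanity check, assuming iperf3 is installed on both ends and port 5201 is open in the instance's security group (the host below is a placeholder):

# On the EC2 instance: start an iperf3 server (listens on port 5201 by default)
iperf3 -s

# On the laptop: measure raw TCP throughput to the instance over 4 parallel streams
iperf3 -c <ec2-host> -P 4 -t 10

If a check like this reports healthy throughput while rsync crawls along at a few MB/s, the bottleneck is almost certainly somewhere past the NIC.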
I provisioned this box with large NVMe instance storage precisely so the compute would have fast local disks to work against. The d in c5d.12xlarge actually means "instance store volumes included." If I'm going to be paying for 48 CPUs, I want to be fully saturating them.2
But I made the mistake of copying my rsync command from a previous run against my local homelab, where the home directory is a perfectly reasonable place to dump a folder until you figure out its permanent location; it's all backed by the same SSD anyway. On AWS, though, the home directory has dragons: by default it lives on the root volume, a slow network-attached EBS device.
Specifically:
/dev/root → Amazon Elastic Block Store (EBS)
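You can confirm this on the instance itself with standard util-linux tools (present on stock Ubuntu and Amazon Linux AMIs):

# Which device backs the root filesystem (and therefore the home directory)?
findmnt -n -o SOURCE /

# List block devices with their models and mount points: the EBS root volume
# shows up alongside the (initially unmounted) instance-store NVMe drives
lsblk -o NAME,MODEL,SIZE,MOUNTPOINT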
So my data path looked like:
Laptop → Network (Internet) → EC2 → Network (internal) → EBS
With TCP, the receiver controls the pace: if the remote rsync can't flush writes to disk fast enough, its receive buffer fills and the sender is told to back off. So even though it looked like an overall network issue, it was really backpressure from the EBS "disk" slowing writes on the EC2 instance, which in turn showed up as slow upload speeds on my end.
EBS volumes, for what it's worth, are pretty basic network-attached disks with low default throughput and IOPS caps. You can provision higher limits, but if you're doing data processing you're almost always better off using the local NVMe instance-store disks.
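If you want to watch the backpressure happen, one option (assuming the sysstat package is installed, which provides iostat) is to watch per-device throughput on the instance while the transfer runs:

# Extended per-device stats in MB, refreshed every 2 seconds. The EBS root
# device (e.g. nvme0n1 on Nitro instances) sits pinned at a low write rate
# while the instance-store drives stay idle.
iostat -xm 2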
The Fix
If you suspect this might be happening to you, check lsblk:
nvme1n1 Amazon EC2 NVMe Instance Storage
nvme2n1 Amazon EC2 NVMe Instance Storage
Format and mount one:
sudo mkfs.ext4 -F /dev/nvme1n1
sudo mkdir -p /mnt/nvme
sudo mount /dev/nvme1n1 /mnt/nvme
sudo chown "$USER" /mnt/nvme   # so the ssh user can write to it
Then change your rsync target to:
/mnt/nvme/
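Putting it together, the original command becomes something like this (user and host are placeholders, as before):

rsync -av --progress \
  --exclude='.*' \
  -e "ssh -i ~/.ssh/primary-laptop.pem" \
  /Volumes/Common_Drive/dataset \
  <user>@<ec2-host>:/mnt/nvme/

One caveat: instance-store volumes are ephemeral, so anything under /mnt/nvme disappears if the instance is stopped or terminated. Sync results you care about back to EBS or S3 before shutting down.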
And in my case I immediately saw speeds jump from ~2 MB/s to ~26 MB/s.
Conclusion
Beware the home directory! And make use of your local disks. Your pipelines will thank you.
Footnotes