Faster file copying on cold AWS EBS volumes (created from snapshot)
TL;DR
Copying from cold AWS EBS volumes seems to go faster with parallelism. This blog shows how to split large files into smaller chunks to be copied using dd, so you can push the parallelism hard enough to go as fast as the infra will allow.
The problem
When you create an AWS EBS volume from a snapshot, it will be “cold”, and therefore slower than its configured maximum throughput, until it has warmed up. From the docs:
For volumes, of any volume type, that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.
I’m trying to copy a subset of the files from the cold volume to an empty volume, and naturally I want to go as fast as possible.
Trivia
Those linked docs talk about warming the volume with dd or fio. Both work, but I’ve seen fio go much faster than dd, presumably because fio is multithreaded and/or performs IO queuing. I was seeing transfer throughput of ~5MB/s for dd, whereas when fio is configured with the right IO depth, parallelism and block size I could get it to read at 100MB/s on the same file. So for warming, fio is hands down the winner.
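For reference, the kind of fio invocation I mean (modelled on the warming example in those AWS docs; the device name and the numbers here are placeholders you’d tune for your own setup) looks something like:
# read every block of the volume with direct IO, 1MB blocks and a deep queue;
# bs, iodepth and numjobs are the block size, IO depth and parallelism knobs
sudo fio \
  --filename=/dev/nvme1n1 \
  --rw=read \
  --bs=1M \
  --iodepth=32 \
  --numjobs=4 \
  --ioengine=libaio \
  --direct=1 \
  --name=volume-warm
Note that with more than one job, each job reads the whole device, so extra jobs add concurrency rather than splitting the work.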
This got me curious about how I could skip the warming and just copy with enough parallelism to max out the available throughput. As I have a few files that make up the majority of the bytes, I needed to chunk them, otherwise I’d end up at slow speeds once I was only copying the few remaining large files.
The solution
I attacked the problem in two passes:
- generate the commands to copy each chunk
- execute the commands in parallel (with a cap on maximum parallelism)
It’s worth noting that I’m “copying” using dd because we can write all the chunks directly into the output file: there’s no need for temp files and cat-ing them together once they’re all copied. This approach works when copying out of order too, which is happening here via shuf. It’s the best of all worlds!
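As a minimal sketch of why the out-of-order part is safe (the file names are hypothetical and it assumes src.bin is exactly 200MB): dd seeks to the chunk’s offset in the output file, and conv=notrunc leaves the rest of the file alone, so the order the chunks land in doesn’t matter.
# write chunk 2 (MB 100-199) before chunk 1 (MB 0-99)
dd if=src.bin of=dst.bin bs=1M count=100 skip=100 seek=100 conv=notrunc
dd if=src.bin of=dst.bin bs=1M count=100 skip=0 seek=0 conv=notrunc
cmp src.bin dst.bin # no output means the two files are identical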
Here’s generating the list of commands:
find /some/src/dir/ \
  -type f \
  -not -name '*.ignored' \
  > /tmp/files-to-copy.txt
function divideCeiling {
  # thanks https://stackoverflow.com/a/12536521/1410035
  local dividend=$1
  local divisor=$2
  echo $(( (dividend + divisor - 1) / divisor ))
}
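# (illustrative) divideCeiling rounds up, e.g. a 2500MB file split into
# 1000MB chunks needs 3 chunks:
#   divideCeiling 2500 1000   # prints 3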
echo "[INFO] generating list of chunk copy commands..."
while read -r currFile; do
  inFile="/some/src/dir/$currFile"
  outFile="/some/dest/dir/$currFile"
  sizeMb=$(du -BM "$inFile" | cut -f1)
  sizeMb=${sizeMb%M} # strip "M" suffix
  chunkSizeMb=1000
  chunks=$(divideCeiling "$sizeMb" "$chunkSizeMb")
  if [ -e "$outFile" ]; then
    echo "[ERROR] dest file $outFile already exists"
    exit 1
  fi
  if [ "$chunks" = "0" ] || [ "$chunks" = "1" ]; then # 0 chunks = 0 byte file
    echo "cp $inFile $outFile || exit 255" >> /tmp/copy-chunk-commands.txt.cp
  else
    for currChunk in $(seq 1 "$chunks"); do
      cmdFile=/tmp/copy-chunk-commands.txt.not-last-chunks
      if [ "$currChunk" = "$chunks" ]; then
        cmdFile=/tmp/copy-chunk-commands.txt.last-chunks
      fi
      echo "CHUNK_SIZE_MB=$chunkSizeMb ~/cp-chunk.sh $inFile $outFile $currChunk $chunks" \
        >> $cmdFile
    done
  fi
done <<< "$(sed 's%/some/src/dir/%%' /tmp/files-to-copy.txt)"
# the larger files don't seem to like all the concurrency focused on them
# (total disk throughput is low) so we'll shuffle things up to hopefully be
# copying many different files at any given time.
shuf /tmp/copy-chunk-commands.txt.not-last-chunks > /tmp/copy-chunk-commands.txt
# only start last chunks after all the other chunks have started
shuf /tmp/copy-chunk-commands.txt.last-chunks >> /tmp/copy-chunk-commands.txt
# the remaining files are small, they can fill the space while big files finish
cat /tmp/copy-chunk-commands.txt.cp >> /tmp/copy-chunk-commands.txt
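For illustration, the final /tmp/copy-chunk-commands.txt ends up looking something like this (the file names and chunk counts are made up): shuffled non-last chunks, then shuffled last chunks, then the plain cp commands for the small files.
CHUNK_SIZE_MB=1000 ~/cp-chunk.sh /some/src/dir/big-b.bin /some/dest/dir/big-b.bin 2 4
CHUNK_SIZE_MB=1000 ~/cp-chunk.sh /some/src/dir/big-a.bin /some/dest/dir/big-a.bin 5 7
...
CHUNK_SIZE_MB=1000 ~/cp-chunk.sh /some/src/dir/big-a.bin /some/dest/dir/big-a.bin 7 7
CHUNK_SIZE_MB=1000 ~/cp-chunk.sh /some/src/dir/big-b.bin /some/dest/dir/big-b.bin 4 4
cp /some/src/dir/small.txt /some/dest/dir/small.txt || exit 255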
That cp-chunk.sh script is:
#!/bin/bash
# copies the specified chunk of a file
set -eu
trap 'exit 255' INT ERR # 255 causes xargs to fail-fast
inFile=$1
outFile=$2
currChunk=$3
totalChunkCount=$4
chunkSizeMb=$CHUNK_SIZE_MB
oneLess=$((currChunk - 1))
# these direct flags seem to make copying faster for EBS volumes. Note: if
# you test this command on your laptop, you might see "Invalid argument",
# which I think is your filesystem saying it doesn't support "direct" mode.
flags="iflag=direct oflag=direct"
dd \
  if="$inFile" \
  of="$outFile" \
  bs=1M \
  count="$chunkSizeMb" \
  skip=$((oneLess * chunkSizeMb)) \
  seek=$((oneLess * chunkSizeMb)) \
  conv=notrunc \
  status=none \
  $flags \
  || {
    rc=$?
    echo "[ERROR] failed to copy $inFile -> $outFile chunk $currChunk/$totalChunkCount; RC=$rc"
    exit $rc
  }
# hashing the large files takes a long time, so we'll only do it for the
# smaller files to prove the code works.
if [ "$totalChunkCount" -lt 5 ] && [ "$currChunk" = "$totalChunkCount" ]; then # is last chunk
while sudo lsof | grep -q "$outFile"; do
# just because the command to copy the last chunk is started last, doesn't
# guarantee it'll finish last: other chunks could copy slower or the last
# chunk could be 1 byte. So we need to make sure no other chunks are still
# copying before we verify.
echo "[WARN] waiting for $outFile to have no more chunks copying"
sleep 30
done
# verify the copied file is "the same" (having the same hash is a pretty good indicator)
diff \
<(cd "$(dirname "$inFile")"; md5sum "$(basename "$inFile")") \
<(cd "$(dirname "$outFile")"; md5sum "$(basename "$outFile")") \
|| {
rc=$?
echo "[ERROR] hashes are different between $inFile and $outFile; copying has mangled the file"
exit $rc
}
echo "[INFO] successfully copied $inFile -> $outFile final chunk and verified hash of file"
else
echo "[INFO] successfully copied $inFile -> $outFile chunk $currChunk/$totalChunkCount"
fi
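If you want to sanity-check a single chunk before launching the whole batch, you can call the script directly (the paths and chunk numbers here are just examples):
CHUNK_SIZE_MB=1000 ~/cp-chunk.sh /some/src/dir/big-a.bin /some/dest/dir/big-a.bin 3 7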
Then when it’s time to execute the commands:
xargs \
  --verbose \
  -n1 \
  --delimiter='\n' \
  -P20 \
  bash -c \
  < /tmp/copy-chunk-commands.txt
…to have max 20 concurrent jobs. You can monitor the VM with iotop to see the total disk throughput.
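For example (assuming iotop is installed):
sudo iotop --only # --only hides processes that aren't currently doing IO
…the total disk read/write rates are shown at the top of its display.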
Or, if you have parallel installed, the command is simpler:
parallel \
  --verbose \
  -j20 \
  < /tmp/copy-chunk-commands.txt
On a maxed out gp3 volume (16k IOPS and 1000MB/s throughput) connected to a VM that has enough throughput, I’m seeing each dd process read/write at ~7MB/s. With parallelism=24, I’m seeing total disk throughput of ~170MB/s. As of this writing I haven’t figured out what the limit is. I thought it was 100MB/s total, as I see with fio on a single file, but even with the numbers above, clearly the limit is higher and is maybe dependent on spreading the “warming” load out over the volume.