Faster file copying on cold AWS EBS volumes (created from snapshot)


Copying from cold AWS EBS volumes seems to go faster with parallelism. This post shows how to split large files into smaller chunks that are copied with dd, so you can push the parallelism hard enough to go as fast as the infrastructure will allow.

The problem

When you create an AWS EBS volume from a snapshot, it starts out “cold” and delivers less than its configured maximum throughput until it has warmed up. From the docs:

For volumes, of any volume type, that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.

I’m trying to copy a subset of the files from the cold volume to an empty volume, and naturally I want to go as fast as possible.


Those linked docs talk about warming the volume with dd or fio. Both work, but I’ve seen fio go much faster than dd, presumably because fio is multithreaded and/or performs IO queuing. I was seeing ~5MB/s with dd, whereas fio configured with the right IO depth, parallelism and block size could read the same file at 100MB/s. So for warming, fio is hands down the winner.
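
For reference, here is a rough sketch of what an fio warming run can look like. The flag values follow the example in the AWS volume initialization docs, not the settings measured above, and the device path and the fio_warm_cmd helper name are made up for illustration. The function only prints the command, so nothing is read until you run the output yourself.

```shell
#!/bin/bash
# Sketch: print (not run) an fio command that reads a whole device or file
# once, so every block gets pulled down from S3.
# Assumptions: the target path and the iodepth/bs defaults are illustrative;
# they mirror the AWS docs example, not the values used in this post.
fio_warm_cmd() {
  local target=$1           # e.g. /dev/xvdf or a large file (hypothetical)
  local ioDepth=${2:-32}    # number of reads kept in flight
  local blockSize=${3:-1M}  # size of each read
  echo "sudo fio --filename=$target --rw=read --bs=$blockSize" \
    "--iodepth=$ioDepth --ioengine=libaio --direct=1 --name=volume-initialize"
}

fio_warm_cmd /dev/xvdf
```

The iodepth setting is the IO queuing mentioned above: fio keeps that many reads in flight instead of waiting for each one to finish like dd does.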

This got me curious about whether I could skip the warming and just copy with enough parallelism to max out the available throughput. A few files make up the majority of the bytes, so I needed to chunk them; otherwise the copy would slow to a crawl once only those few large files remained.

The solution

I attacked the problem in two passes:

  1. generate the commands to copy each chunk
  2. execute the commands in parallel (with a cap on maximum parallelism)

It’s worth noting that I’m “copying” using dd because each chunk can be written directly to its offset in the output file. There’s no need for temp files or cat-ing them together once they’re all copied. This approach works when copying out of order too, which is what happens here via shuf. It’s the best of all worlds!
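
To see the out-of-order trick in isolation, here is a minimal sketch that copies a small random file in shuffled 1MB chunks and then compares the result. The temp paths, the 4.5MB file size and the copy_chunk helper are all made up for illustration; the direct flags are left out because a temp filesystem may not support them.

```shell
#!/bin/bash
set -eu
# Sketch: copy a file in 1MB chunks, in shuffled order, straight into one
# output file. All paths are throwaway temp files (hypothetical).
workDir=$(mktemp -d)
src="$workDir/src.bin"
dest="$workDir/dest.bin"

# 4.5MB of random data, so the final chunk is a partial one
head -c 4718592 /dev/urandom > "$src"

copy_chunk() {
  local chunkIndex=$1  # zero-based chunk number
  # skip= is the read offset and seek= is the write offset, both in units
  # of bs. conv=notrunc stops dd from truncating the output file, which
  # would throw away chunks that were already written past this offset.
  dd if="$src" of="$dest" bs=1M count=1 \
    skip="$chunkIndex" seek="$chunkIndex" conv=notrunc status=none
}

for chunkIndex in $(seq 0 4 | shuf); do
  copy_chunk "$chunkIndex"
done

cmp "$src" "$dest" && echo "files are identical"
```

Whatever order shuf produces, each chunk lands at its own offset, so the finished file matches the source.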

Here’s generating the list of commands:

find /some/src/dir/ \
  -type f \
  -not -name '*.ignored' \
  > /tmp/files-to-copy.txt
function divideCeiling {
  # thanks
  # integer division, rounding up
  local dividend=$1
  local divisor=$2
  echo $(( (dividend+divisor-1) / divisor ))
}
destDir=/some/dest/dir  # assumption: hypothetical destination, adjust to suit
chunkSizeMb=1024        # assumption: example chunk size, tune for your files
echo "[INFO] generating list of chunk copy commands..."
while read -r currFile; do
  inFile="/some/src/dir/$currFile"
  outFile="$destDir/$currFile"
  sizeMb=$(du -BM "$inFile" | cut -f1)
  sizeMb=${sizeMb%M}  # strip "M" suffix
  chunks=$(divideCeiling "$sizeMb" "$chunkSizeMb")
  if [ -e "$outFile" ]; then
    echo "[ERROR] dest file $outFile already exists"
    exit 1
  fi
  if [ "$chunks" = "0" ] || [ "$chunks" = "1" ]; then  # 0 chunks = 0 byte file
    echo "cp $inFile $outFile || exit 255" >> /tmp/copy-chunk-commands.txt.cp
  else
    for currChunk in $(seq 1 "$chunks"); do
      if [ "$currChunk" = "$chunks" ]; then
        cmdFile=/tmp/copy-chunk-commands.txt.last-chunks
      else
        cmdFile=/tmp/copy-chunk-commands.txt.not-last-chunks
      fi
      echo "CHUNK_SIZE_MB=$chunkSizeMb ~/ $inFile $outFile $currChunk $chunks" \
        >> $cmdFile
    done
  fi
done <<< "$(sed 's%/some/src/dir/%%' /tmp/files-to-copy.txt)"
# the larger files don't seem to like all the concurrency focused on them
#  (total disk throughput is low) so we'll shuffle things up to hopefully be
#  copying many different files at any given time.
shuf /tmp/copy-chunk-commands.txt.not-last-chunks > /tmp/copy-chunk-commands.txt
# only start last chunks after all the other chunks have started
shuf /tmp/copy-chunk-commands.txt.last-chunks >> /tmp/copy-chunk-commands.txt
# the remaining files are small, they can fill the space while big files finish
cat /tmp/copy-chunk-commands.txt.cp >> /tmp/copy-chunk-commands.txt

The chunk-copy script that each generated command invokes is:

#!/bin/bash
# copies the specified chunk of a file
set -eu
trap 'exit 255' INT ERR  # 255 causes xargs to fail-fast

inFile=$1
outFile=$2
currChunk=$3
totalChunkCount=$4
chunkSizeMb=$CHUNK_SIZE_MB
oneLess=$((currChunk - 1))

# these direct flags seem to make copying faster for EBS volumes. Note: if
#  you test this command on your laptop, you might see "Invalid argument",
#  which I think is your filesystem saying it doesn't support "direct" mode.
flags="iflag=direct oflag=direct"

dd \
  if="$inFile" \
  of="$outFile" \
  bs=1M \
  count="$chunkSizeMb" \
  skip=$((oneLess * chunkSizeMb)) \
  seek=$((oneLess * chunkSizeMb)) \
  conv=notrunc \
  status=none \
  $flags \
  || {
    rc=$?
    echo "[ERROR] failed to copy $inFile -> $outFile chunk $currChunk/$totalChunkCount; RC=$rc"
    exit $rc
  }

# hashing the large files takes a long time, so we'll only do it for the
#  smaller files to prove the code works.
if [ "$totalChunkCount" -lt 5 ] && [ "$currChunk" = "$totalChunkCount" ]; then  # is last chunk
  while sudo lsof | grep -q "$outFile"; do
    # just because the command to copy the last chunk is started last, doesn't
    #  guarantee it'll finish last: other chunks could copy slower or the last
    #  chunk could be 1 byte. So we need to make sure no other chunks are still
    #  copying before we verify.
    echo "[WARN] waiting for $outFile to have no more chunks copying"
    sleep 30
  done
  # verify the copied file is "the same" (having the same hash is a pretty good indicator)
  diff \
    <(cd "$(dirname "$inFile")"; md5sum "$(basename "$inFile")") \
    <(cd "$(dirname "$outFile")"; md5sum "$(basename "$outFile")") \
    || {
      rc=$?
      echo "[ERROR] hashes are different between $inFile and $outFile; copying has mangled the file"
      exit $rc
    }
  echo "[INFO] successfully copied $inFile -> $outFile final chunk and verified hash of file"
else
  echo "[INFO] successfully copied $inFile -> $outFile chunk $currChunk/$totalChunkCount"
fi
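
To make the skip/seek arithmetic concrete, here is a small sketch that prints the dd offsets for every chunk of a file. The chunk_plan helper, the 2500MB file size and the 1024MB chunk size are made-up numbers for illustration.

```shell
#!/bin/bash
# Sketch: print the dd offsets for each chunk of a file.
# Assumption: the helper name and the example sizes are hypothetical.
chunk_plan() {
  local sizeMb=$1
  local chunkSizeMb=$2
  local chunks=$(( (sizeMb + chunkSizeMb - 1) / chunkSizeMb ))  # round up
  local currChunk offsetMb
  for currChunk in $(seq 1 "$chunks"); do
    # chunk numbers are one-based, offsets are zero-based
    offsetMb=$(( (currChunk - 1) * chunkSizeMb ))
    echo "chunk $currChunk/$chunks: skip=$offsetMb seek=$offsetMb count=$chunkSizeMb"
  done
}

chunk_plan 2500 1024
```

The last chunk asks for 1024 blocks even though only 452MB remain. That is fine because dd stops at the end of the input, so the final chunk needs no special-casing.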

Then when it’s time to execute the commands:

xargs \
  --verbose \
  -n1 \
  --delimiter='\n' \
  -P20 \
  bash -c \
  < /tmp/copy-chunk-commands.txt

…to have max 20 concurrent jobs. You can monitor the VM with iotop to see the total disk throughput.

Or, if you have parallel installed, the command is simpler:

parallel \
  --verbose \
  -j20 \
  < /tmp/copy-chunk-commands.txt
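
The exit 255 in the generated commands and in the script's trap matters for the xargs variant: GNU xargs stops launching new commands as soon as one of them exits with status 255. A minimal sketch of that behaviour, using made-up echo commands in place of real copy commands and -P1 so the run order is deterministic:

```shell
#!/bin/bash
# Sketch: three fake "commands"; the second one fails with status 255.
# Assumption: GNU xargs (needed for --delimiter, as in the command above).
run_demo() {
  printf '%s\n' 'echo first' 'exit 255' 'echo never-runs' \
    | xargs -n1 --delimiter='\n' -P1 bash -c
}

# capture stdout and the exit status of xargs; its abort message goes to stderr
output=$(run_demo 2>/dev/null) && status=0 || status=$?
echo "output: $output"
echo "xargs exit status: $status"
```

Only the first command produces output, and xargs itself exits with 124, which is how it reports that a command returned 255. With a higher -P value, jobs that are already running still get to finish, but nothing new is started.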

On a maxed out gp3 volume (16k IOPS and 1000MB/s throughput) attached to a VM that has enough throughput, I’m seeing each dd process read/write at ~7MB/s. With parallelism=24, I’m seeing total disk throughput of ~170MB/s. As of this writing I haven’t figured out what the limit is. I thought it was 100MB/s total, which is what I see with fio on a single file, but the numbers above show the limit is clearly higher. It may depend on spreading the “warming” load out over the volume.
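
As a back-of-envelope check on those numbers, here is the arithmetic as a tiny sketch. It assumes throughput scales linearly with the number of dd processes, which is untested here and probably optimistic; the variable names are made up.

```shell
#!/bin/bash
# Sketch: rough numbers taken from the measurements above.
# Assumption: throughput scales linearly with process count (untested).
perProcessMBs=7       # observed per-dd throughput
processes=24          # parallelism used for the measurement
volumeLimitMBs=1000   # configured gp3 throughput

echo "expected total: $(( perProcessMBs * processes ))MB/s"
# ceiling division: processes needed to reach the configured volume limit
echo "processes to saturate: $(( (volumeLimitMBs + perProcessMBs - 1) / perProcessMBs ))"
```

The first figure (168MB/s) lines up with the ~170MB/s observed. The second (143 processes) is only an upper-end guess at how far the parallelism might need to go if the per-process rate held steady.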
