Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it’s the one I wanted to work with. I intended to reduce disk space of 3 similar projects. Here is my shell script to find duplicate files (by md5sum) and replace any secondary file with a symlink to the relative path of the original file.

#!/usr/bin/env sh
# File: set-symlinks.sh
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
#    https://stackoverflow.com/questions/2564634/convert-absolute-path-into-relative-path-given-a-current-directory-using-bash
# Improve:
# Dependencies:
#    coreutils >= 8.23

INDIR=~/src/project

#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ;
   do
      x=0
      while test $x -lt 4 ;
      do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ;
         then
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done

I was going to do this task for myself by hand, but then a quick investigation showed 73 files that were duplicates. Because of the small size of the project, I decided to just run it in shell and not revert to Python. I don’t need efficiency; I just need to run it once, really.

The tricky bits are in the very front of the logic. The awk associate array builds a list of all filenames that correspond with an md5sum. Then, stripping out the leading colon (separator), I pipe the output to a while for easy variable naming. And then loop a few times (hard-coded to 4) and if item number X exists, get the relative path to the main file, and force create the symlink.

An improvement upon while true loop

The story

So normally when I want to see output for something, I’ll run a while true loop.

while true; do zabbix_get -s rhevtester.example.com -k 'task.converter_cpu'; done

That doesn’t always stop even when I mash ^C (CTRL+C).

The solution

So I offer my improvement. The way to stop the loop is obvious, and also I will explain it.

while test ! -f /tmp/foo; do zabbix_get -s rhevtester.example.com -k 'task.converter_cpu'; done

To stop the loop, in another shell just:

touch /tmp/foo