Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it’s the one I wanted to work with. I wanted to reduce the disk usage of 3 similar projects. Here is my shell script that finds duplicate files (by md5sum) and replaces each secondary copy with a symlink pointing, by relative path, to the original file.

#!/usr/bin/env sh
# File:
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
# Improve:
# Dependencies:
#    coreutils >= 8.23


#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ; do
      x=0
      while test $x -lt 4 ; do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ; then
            # build a relative path from the duplicate's directory to the original
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done

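Here is a minimal end-to-end sketch of the same pipeline against a throwaway directory. The directory layout and file names are made up for the demonstration, and it assumes GNU coreutils (md5sum, and realpath with --relative-to):

```shell
# Hypothetical demo layout: two projects containing an identical file.
INDIR="$(mktemp -d)"
mkdir -p "${INDIR}/proj1" "${INDIR}/proj2"
echo "same content" > "${INDIR}/proj1/file.txt"
echo "same content" > "${INDIR}/proj2/file.txt"

find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ; do
      x=0
      while test $x -lt 4 ; do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ; then
            # relative path from the duplicate's directory back to the original
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            ln -sf "${linkname}" "${thischild}"
         fi
      done
   done

# proj2/file.txt is now a relative symlink to ../proj1/file.txt
ls -l "${INDIR}/proj2/file.txt"
```

Since the sort orders the two identical-hash lines by path, proj1 becomes the "main" copy and proj2 becomes the symlink.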
I was going to do this task by hand, but then a quick investigation showed 73 files that were duplicates. Because of the small size of the project, I decided to just do it in shell rather than switch to Python. I don’t need efficiency; I really only need to run it once.

The tricky bits are at the very front of the logic. The awk associative array builds a list of all filenames that share an md5sum. Then, after stripping out the leading colon (separator) with sed, I pipe the output to a while read loop for easy variable naming. The loop iterates a few times (hard-coded to 4); if child number X exists, it computes the path of the main file relative to that child's directory and force-creates the symlink.
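To make the grouping step concrete, here is the awk/sed front end run against fake md5sum output (the hashes and paths are invented for the example); files sharing a hash end up joined on one colon-separated line:

```shell
# Fake md5sum output: the first two lines share a hash.
printf '%s\n' \
   'aaaa  dir1/a.txt' \
   'aaaa  dir2/a.txt' \
   'bbbb  dir1/b.txt' |
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' |
   sed -r -e 's/^://;'
# prints (in unspecified order, since awk's "for (i in a)" is unordered):
#   dir1/a.txt:dir2/a.txt
#   dir1/b.txt
```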

Convert input sets of numbers to numerical sequences


I wrote a function for shell (basically bash) that makes it possible to convert a series of numbers such as “1,5-8,15” into a completely enumerated sequence, so 1 5 6 7 8 15.

I needed this to facilitate passing parameters to another function, but with the ability to give arbitrarily-grouped sets of numbers.

You can see my gist on GitHub.

convert_to_seq() {
  printf '%s' "${*}" | xargs -n1 -d',' | tr '-' ' ' | awk 'NF == 2 { system("/bin/seq "$1" "$2); } NF != 2 { print $1; }' | xargs
}

convert_to_seq "$1"
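To see what each stage contributes, here is the pipeline broken apart for the input “1,5-8,15” (using plain seq here rather than a hard-coded /bin/seq path, since seq's location varies across distributions):

```shell
input="1,5-8,15"
# stage 1: split on commas, one token per line: 1 / 5-8 / 15
printf '%s' "$input" | xargs -n1 -d','
# stage 2: turn each range into two awk fields: "5-8" becomes "5 8"
printf '%s' "$input" | xargs -n1 -d',' | tr '-' ' '
# stage 3: two-field lines are expanded with seq, single numbers pass through
printf '%s' "$input" | xargs -n1 -d',' | tr '-' ' ' |
   awk 'NF == 2 { system("seq "$1" "$2); } NF != 2 { print $1; }'
# stage 4: the final xargs joins everything onto one line
printf '%s' "$input" | xargs -n1 -d',' | tr '-' ' ' |
   awk 'NF == 2 { system("seq "$1" "$2); } NF != 2 { print $1; }' | xargs
# final line: 1 5 6 7 8 15
```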

Try it out for yourself! If you are looking for such a function, here you go.


Input: 1,5,8-10
Output: 1 5 8 9 10

Input: 500-510,37
Output: 500 501 502 503 504 505 506 507 508 509 510 37

Remove only certain duplicate lines with awk

A basic solution demonstrates how to use awk to remove duplicate lines from a stream without having to sort it. This statement is really useful:
awk '!x[$0]++'
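A quick way to convince yourself: feed it a small stream with repeats. The first occurrence of each line survives, order stays intact, and no sorting is needed:

```shell
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' | awk '!x[$0]++'
# prints:
#   alpha
#   beta
#   gamma
```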

The fancy solution

But if you need certain duplicated lines preserved, such as the COMMIT statements in the output of iptables-save, you can use this one-liner:
iptables-save | awk '!asdf[$0]++; /COMMIT|Completed on|Generated by/;' | uniq
The second awk rule prints again any line that matches “COMMIT,” “Completed on,” or “Generated by,” which appear multiple times in iptables-save output. The first occurrence of each such line therefore gets printed twice (once by each rule), and the trailing uniq collapses those adjacent duplicates. I was programmatically adding rules, and one host in particular kept appending new copies of a rule even though an identical rule already existed. So I had to remove the duplicates and save the output, while keeping all the duplicate “COMMIT” statements. I also wanted to keep all the comments.
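Here is a miniature reproduction with a made-up rule set (no root or iptables needed), showing that the duplicated rule is dropped while both COMMIT lines survive:

```shell
printf '%s\n' \
   '-A INPUT -j ACCEPT' \
   'COMMIT' \
   '-A INPUT -j ACCEPT' \
   '-A FORWARD -j DROP' \
   'COMMIT' |
   awk '!asdf[$0]++; /COMMIT|Completed on|Generated by/;' | uniq
# prints:
#   -A INPUT -j ACCEPT
#   COMMIT
#   -A FORWARD -j DROP
#   COMMIT
```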