Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it’s the one I wanted to work with. My goal was to reduce the disk usage of 3 similar projects. Here is my shell script that finds duplicate files (by md5sum) and replaces each secondary file with a symlink to the relative path of the original file.

#!/usr/bin/env sh
# File: set-symlinks.sh
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
#    https://stackoverflow.com/questions/2564634/convert-absolute-path-into-relative-path-given-a-current-directory-using-bash
# Improve:
# Dependencies:
#    coreutils >= 8.23

INDIR=~/src/project

# Hash every regular file, then group all paths that share an md5sum into one
# colon-delimited line per hash. Note: assumes filenames contain no spaces or colons.
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read -r main child1 child2 child3 child4 ;
   do
      x=0
      # check up to 4 children per hash; any further duplicates are left alone
      while test $x -lt 4 ;
      do
         x=$(( x + 1 ))
         # indirect reference: thischild takes the value of child$x
         eval thischild="\${child$x}"
         if test -n "${thischild}" ;
         then
            # compute the path to the original file, relative to the duplicate's directory
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done
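
Since the script honors the DEBUG and DRYRUN environment variables, you can preview the symlinks it would create before letting it loose:

DEBUG=1 DRYRUN=1 sh set-symlinks.sh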

I was going to do this task by hand, but a quick investigation showed 73 duplicate files. Because of the small size of the project, I decided to just do it in shell rather than resort to Python. I don’t need efficiency; I only need to run this once, really.

The tricky bits are at the very front of the logic. The awk associative array builds, for each md5sum, a colon-delimited list of all filenames with that hash. Then, after stripping out the leading colon (separator), I pipe the output to a while loop for easy variable naming. The inner loop runs a few times (hard-coded to 4), and if child number X exists, it computes the relative path to the main file and force-creates the symlink.
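
To illustrate with invented paths: if three files share a hash, the sorted md5sum output

d41d8cd98f00b204e9800998ecf8427e  ./a/notes.txt
d41d8cd98f00b204e9800998ecf8427e  ./b/notes.txt
d41d8cd98f00b204e9800998ecf8427e  ./c/notes.txt

comes out of the awk and sed stages as

./a/notes.txt:./b/notes.txt:./c/notes.txt

so the read assigns main=./a/notes.txt, child1=./b/notes.txt, and child2=./c/notes.txt.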

Convert input sets of numbers to numerical sequences

Introduction

I wrote a function for shell (basically bash) that converts a comma-separated series of numbers and ranges such as “1,5-8,15” into a fully enumerated sequence: 1 5 6 7 8 15.

I needed this to facilitate passing parameters to another function, but with the ability to give arbitrarily-grouped sets of numbers.

You can see my gist on GitHub.

convert_to_seq() {
  # split on commas, rewrite ranges like "5-8" as "5 8", expand those pairs
  # with seq, and join the result back onto one line
  printf '%s' "${*}" | xargs -n1 -d',' | tr '-' ' ' | awk 'NF == 2 { system("seq " $1 " " $2); } NF != 2 { print $1; }' | xargs
}

convert_to_seq "$1"

Try it out for yourself! If you are looking for such a function, here you go.

Examples

Input: 1,5,8-10
Output: 1 5 8 9 10

Input: 500-510,37
Output: 500 501 502 503 504 505 506 507 508 509 510 37
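
If you are curious how it works, you can run the pipeline one stage at a time. With the first example as input:

printf '%s' '1,5,8-10' | xargs -n1 -d','
# 1
# 5
# 8-10
printf '%s' '1,5,8-10' | xargs -n1 -d',' | tr '-' ' '
# 1
# 5
# 8 10

The awk stage then calls seq to expand any two-field line ("8 10" becomes 8 9 10) and passes single numbers through, and the final xargs joins everything back onto one line.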

Remove only certain duplicate lines with awk

Basic solution

http://www.unix.com/shell-programming-and-scripting/153131-remove-duplicate-lines-using-awk.html demonstrates and explains how to use awk to remove duplicate lines from a stream without having to sort it first. This one-liner is really useful:
awk '!x[$0]++'
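
For example, feeding it a stream with repeats keeps only the first occurrence of each line, without reordering anything:

printf '%s\n' a b a c b | awk '!x[$0]++'
# a
# b
# c

The array x counts how many times each whole line ($0) has been seen; the expression is true, so the line prints, only while the count is still zero.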

The fancy solution

But if you need certain duplicated lines preserved, such as the COMMIT statements in the output of iptables-save, you can use this one-liner:
iptables-save | awk '!asdf[$0]++; /COMMIT|Completed on|Generated by/;' | uniq
The second awk rule prints again any line that matches “COMMIT,” “Completed on,” or “Generated by,” all of which legitimately appear multiple times in iptables-save output; because the first occurrence of such a line is printed by both rules, the trailing uniq collapses that adjacent double print. I was programmatically adding rules, and one host in particular kept appending new copies despite the identical rule already existing. So I had to remove the duplicates and save the output, while keeping all the duplicate “COMMIT” statements and all the comments.
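
A quick demonstration with fabricated iptables-save-style input (the table names and rule are invented for the example):

printf '%s\n' '*filter' '-A INPUT -j ACCEPT' '-A INPUT -j ACCEPT' 'COMMIT' '*nat' 'COMMIT' | \
   awk '!asdf[$0]++; /COMMIT|Completed on|Generated by/;' | uniq
# *filter
# -A INPUT -j ACCEPT
# COMMIT
# *nat
# COMMIT

The duplicated rule disappears, but both COMMIT lines survive.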