This is probably not the best solution, but it’s the one I wanted to work with. I wanted to reduce the disk space used by 3 similar projects. Here is my shell script that finds duplicate files (by md5sum) and replaces each secondary file with a symlink to the relative path of the original file.
#!/usr/bin/env sh
# File: set-symlinks.sh
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
#    https://stackoverflow.com/questions/2564634/convert-absolute-path-into-relative-path-given-a-current-directory-using-bash
# Improve:
# Dependencies:
#    coreutils >= 8.23

INDIR=~/src/project

#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ; do
      x=0
      while test $x -lt 4 ; do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ; then
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done
I was going to do this task by hand, but a quick investigation showed 73 duplicate files. Because the project is small, I decided to just run it in shell and not resort to Python. I don’t need efficiency; I just need to run it once, really.
The tricky bits are at the very front of the logic. The awk associative array builds, for each md5sum, a colon-separated list of all the filenames that share it. After sed strips the leading colon (the separator), I pipe the output to a while read loop for easy variable naming: the first name on each line is the main file and the rest are the children. The inner loop then runs a fixed number of times (hard-coded to 4), and if child number X exists, it gets the relative path to the main file and force-creates the symlink.
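To make those stages easier to follow, here is a minimal walk-through on a throwaway directory. It is only a sketch: the tree under the mktemp directory, the file names (projA, projB, util.sh, README) and their contents are invented for illustration, and it assumes GNU coreutils (md5sum, realpath with --relative-to, readlink) and file names without spaces or colons, which the awk and read steps rely on.

```shell
#!/usr/bin/env sh
# Walk-through of the pipeline stages on a scratch tree.
# All paths and file contents here are made up for the example.
tmp="$( mktemp -d )"
mkdir -p "${tmp}/projA/lib" "${tmp}/projB/lib"
echo 'echo hello' > "${tmp}/projA/lib/util.sh"
cp "${tmp}/projA/lib/util.sh" "${tmp}/projB/lib/util.sh"
echo 'only in A' > "${tmp}/projA/README"

# Stage 1: one output line per checksum, file names joined with ":".
# Sorting first means the names inside each line come out in sorted order.
grouped="$( find "${tmp}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -e 's/^://' )"

# Only lines that still contain a colon list more than one file.
dupes="$( printf '%s\n' "${grouped}" | grep ':' )"

# Stage 2: split on ":" so the first name is the main file.
# With only two variables, "child" receives everything after the first colon.
IFS=':' read -r main child <<EOF
${dupes}
EOF

# Stage 3: path to the main file as seen from the child's directory.
linkname="$( realpath --relative-to "$( dirname "${child}" )" "${main}" )"
echo "${linkname}"   # ../../projA/lib/util.sh

# Stage 4: replace the duplicate with a relative symlink.
ln -sf "${linkname}" "${child}"
target="$( readlink "${child}" )"
content="$( cat "${child}" )"

rm -rf "${tmp}"
```

Because the link is relative, the whole tree can be moved or copied elsewhere and the symlink still resolves, as long as the two projects stay in the same positions relative to each other.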