Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it’s the one I wanted to work with. I wanted to reduce the disk space used by 3 similar projects. Here is my shell script to find duplicate files (by md5sum) and replace each secondary file with a symlink that uses the relative path to the original file.

#!/usr/bin/env sh
# File:
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
# Improve:
# Dependencies:
#    coreutils >= 8.23


#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
# group filenames by md5sum, then link each duplicate (up to 4) to the first file
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ;
   do
      x=0
      while test $x -lt 4 ;
      do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ;
         then
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done

I was going to do this task for myself by hand, but then a quick investigation showed 73 files that were duplicates. Because of the small size of the project, I decided to just run it in shell and not resort to Python. I don’t need efficiency; I just need to run it once, really.

The tricky bits are at the very front of the logic. The awk associative array builds a colon-separated list of all filenames that share an md5sum. After sed strips the leading colon (the separator), I pipe the output to a while loop for easy variable naming. The inner loop runs a few times (hard-coded to 4), and if child number X exists, it gets the relative path to the main file and force-creates the symlink.
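To see what the awk grouping step produces, here is a minimal sketch that feeds it a few made-up md5sum-style lines. The function name group_by_hash and the sample hashes and paths are assumptions for illustration only. Note that because awk splits on whitespace and the list is joined with colons, filenames containing spaces or colons would not survive this step.

```shell
#!/usr/bin/env sh
# Sketch: group filenames that share a checksum into one colon-separated line.
# group_by_hash is a hypothetical helper name; the hashes below are fake.
group_by_hash() {
   # input: lines of "<hash>  <path>"; output: one line per hash, paths joined by ":"
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -e 's/^://'
}

# awk's "for (i in a)" order is unspecified, so sort the result for display
printf '%s\n' \
   'aaa111  ./proj1/logo.png' \
   'aaa111  ./proj2/logo.png' \
   'bbb222  ./proj1/readme.txt' | group_by_hash | sort
```

The first path on each line plays the role of "main" in the script, and any further paths become child1, child2, and so on.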

Monitor owner and permissions changes

A user on the Fedora forum asked for assistance monitoring owner and permissions changes to files. I whipped up a general solution in shell.

It uses a compressed database to store the last run, and will show the changes of the requested attributes of each file.

Here’s some of the business logic.

   # not empty
   test -n "${CO_DEBUG}" && echo "Comparing ${CO_INPUT} to database ${CO_OUTPUT}"

   # learn current status
   scan_dir "${CO_INPUT}" > "${CO_TMPFILE}"

   # compare to database
   zcat "${CO_OUTPUT}" | diff -W300 --suppress-common-lines -y "-" "${CO_TMPFILE}"

   # replace database
   cat "${CO_TMPFILE}" | gzip > "${CO_OUTPUT}"

And the scan function is pretty simple. Just change what stat outputs if you want to monitor different file characteristics.

scan_dir() {
   # call: scan_dir "${CO_INPUT}"
   # output: uid,user,gid,group,permissions,filename for each file, sorted by filename
   local td="${1}"

   find "${td}" -exec stat -L -c '%u,%U,%g,%G,%a,%n' {} + 2>/dev/null | sort -t ',' -k6
}

The script stores its compressed databases in /var/cache/check-owners/, and it will make files named based on the base directory it scans, so /home would be db file /var/cache/check-owners/co.home.db.gz.
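As a rough illustration of the scan, compare, and replace cycle, the sketch below runs it against a throwaway directory. The snapshot function, the temp paths, and the file names are made up for this demo; it assumes GNU stat and GNU diff, as the original snippets do.

```shell
#!/usr/bin/env sh
# Sketch: how a permission change shows up in the side-by-side diff.
# All paths here are temporary and hypothetical.
workdir="$( mktemp -d )"
mkdir "${workdir}/tree"
touch "${workdir}/tree/example.txt"
chmod 0644 "${workdir}/tree/example.txt"

snapshot() {
   # same stat format as scan_dir: uid,user,gid,group,perms,name
   find "${1}" -exec stat -L -c '%u,%U,%g,%G,%a,%n' {} + 2>/dev/null | sort -t ',' -k6
}

# "last run": store a compressed snapshot
snapshot "${workdir}/tree" | gzip > "${workdir}/db.gz"

# something changes between runs
chmod 0600 "${workdir}/tree/example.txt"

# "this run": compare the stored snapshot to the current state
snapshot "${workdir}/tree" > "${workdir}/now.txt"
changes="$( zcat "${workdir}/db.gz" | diff -W300 --suppress-common-lines -y - "${workdir}/now.txt" )"
printf '%s\n' "${changes}"

rm -rf "${workdir}"
```

Only the changed file is printed, with the old attributes on the left and the new ones on the right.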

You could write a cron entry to call this once a day on a particular directory and email the output to you. A poor man’s AIDE, if you will.

Convert m4a to mp3 while preserving audio quality

If you have a set of m4a files and want to automatically convert them to mp3 so you can tag them the right way, use a snippet I wrote.

See the code in its proper formatting at

# reference:

logfile="/mnt/bgstack15/log/m4a-to-mp3.$( date -u "+%FT%H%M%SZ" ).log"

func() {
for word in "$@" ;
do
   echo "Entering item ${word}";
   outdir="${word}/mp3" ; mkdir "${outdir}" || exit 1 ;
   # read one filename per line; IFS= and -r keep leading spaces and backslashes intact
   find "${word}" -type f \( -regex '.*M4A' -o -regex '.*m4a' \) | while IFS= read -r infile ;
   do
      test -f "${infile}" && echo "Found file: \"${infile}\"" || echo "INVALID! ${infile}"
      outfile="$( echo "${infile}" | sed -r -e "s/\.m4a$/.mp3/i" )"
      echo  ffmpeg -i \"${infile}\" -codec:v copy -codec:a libmp3lame -q:a 2 \"${outfile}\"
      yes | ffmpeg -i "${infile}" -codec:v copy -codec:a libmp3lame -q:a 2 -y "${outfile}" ; test -n "${outdir}" && /bin/mv -f "${outfile}" "${outdir}/" ;
      sleep 2 ;
   done
done
}

time func "$@" | tee -a "${logfile}"
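The output filename is derived from the input name with a case-insensitive sed substitution. Here is a small sketch of just that step, assuming GNU sed for the i flag; the helper name to_mp3_name and the sample path are hypothetical.

```shell
#!/usr/bin/env sh
# Sketch: map an .m4a/.M4A filename to its .mp3 counterpart.
# to_mp3_name is a made-up helper name for illustration.
to_mp3_name() {
   # $1: path ending in .m4a (any case); prints the same path ending in .mp3
   printf '%s\n' "${1}" | sed -r -e 's/\.m4a$/.mp3/i'
}

to_mp3_name "album/01 Intro.M4A"
```

Anchoring the pattern with $ keeps a stray ".m4a" earlier in the path from being rewritten.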

Set world-read permissions on python libs

In an environment where the default umask is 0077 or similar, the pip utility for installing python libraries can set up new files that cannot be read by all users.

I dropped this script into my ~/bin dir so I can enforce other-read permissions easily on all the python libs. I wrote this script after hours and hours of troubleshooting python libs just to find out it’s an old-school permissions issue.

# File: /usr/local/bin/set-readable-python-libs
# note: the brace expansion below needs bash (or another shell that supports it)
for word in /usr/{local/,}lib{,64}/python* ;
do
   find "${word}" ! -perm -o+rX -exec chmod g+rX,o+rX {} + 2>/dev/null
done
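To show the effect of chmod g+rX,o+rX on files created under a restrictive umask, here is a sketch that works on a throwaway directory instead of the real python lib paths. The directory and file names are invented, and the find test is simplified to -perm -o=r (an assumption made for the demo, not the original expression).

```shell
#!/usr/bin/env sh
# Sketch: files created with umask 0077, then opened up for group and other.
# Paths are temporary and hypothetical.
demo="$( mktemp -d )"

# simulate what pip would leave behind under umask 0077:
# directory mode 700, file mode 600
( umask 0077 ; mkdir "${demo}/pkg" ; touch "${demo}/pkg/module.py" )

# select anything "other" cannot read, then add read (and execute where appropriate)
find "${demo}/pkg" ! -perm -o=r -exec chmod g+rX,o+rX {} +

dir_mode="$( stat -c '%a' "${demo}/pkg" )"
file_mode="$( stat -c '%a' "${demo}/pkg/module.py" )"
echo "pkg=${dir_mode} module.py=${file_mode}"

rm -rf "${demo}"
```

The capital X adds execute only to directories and to files that already have an execute bit, so the directory ends up traversable while the plain file stays non-executable.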


Insert filename in LibreOffice Calc spreadsheet ods file in Linux

In LibreOffice Writer, you can use the “Insert Field” tool to easily insert the filename into the document. In Calc, it’s a little different, but still possible. You can access the raw filename and sheet with CELL("filename"). To make it pretty, use a longer formula.




Grep odt file


In the GNU/Linux world, you spend a lot of time on the command line. Searching a text file is a piece of cake:
grep -iE "expression" file1

You might even use a gui, and inside that gui you might even use an open-source office suite for all those times that plain text isn’t enough. But what about when you want to search one of those odt files you vaguely remember is some form of xml?

Easy. You use unoconv or odt2txt (look those up in your package manager) and then grep the output file. Or you can use the --stdout option.

unoconv -f txt foo.odt

unoconv -f txt --stdout foo.odt | grep -iE "Joe Schmoe"


I first started tackling this problem by figuring out how to access the xml inside. I learned an odt file is an archive, but a tar xf didn’t help. It turns out to be a zip archive, which unzip handles.

I also had to actually learn the tiniest bit of perl, as regular GNU grep (and I inferred sed) doesn’t do non-greedy wildcard matching.

So I got this super-complicated one-liner going before I decided to try a different approach and discovered the unoconv and odt2txt applications.

time unzip -p foo.odt content.xml | sed -e 's/\([^n]\)>\n(.*)<\/\1>/\2/;s/<text:h.*?>(.*)<\/text:h>/\1/;' -e 's/<style:(font-face|text-properties).*\/>//g;' | sed -e "s/&apos;/\'/g;s/&quot;/\"/g;s/<text:.*break\/>//g;"





Shell one-liner to show total size of filetype in directory

find /home/bgstack15 -mtime +2 -name "*.csv" | xargs stat -c "%s" | awk '{Total+=$1} END{print Total/1024/1024}'

This one-liner shows the cumulative size, in megabytes, of the .csv files in /home/bgstack15 (and subdirectories) that were last modified more than 2 days ago.

The explanation

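find /home/bgstack15 -mtime +2 -name "*.csv"
Lists all csv files in my home directory that were modified more than 2 days ago. Because find counts age in whole 24-hour periods, that means files at least 3 days old in practice. If the time of the file is insignificant, just remove the -mtime +2.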
Pipe that output to
xargs stat -c "%s"
Some people use ls for this, but others advise against parsing ls output.
Anyway, xargs takes the filenames arriving on standard input from the pipe and appends them to the end of the command, which in this case is stat. Stat here lists just the file size in bytes for each file. It doesn’t even include the name of the file in this case. That’s all adjustable, of course.
Pipe that output to
awk '{Total+=$1} END{print Total/1024/1024}'
This command adds the first whitespace-delimited word from each line to a variable, “Total.” At the end of all the lines, it prints the total divided by 1024 and again by 1024, so the output is in MB (megabytes).
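Here is a sketch of the same pipeline wrapped in a function and run against a throwaway directory with files of known size. The function name total_mib and the demo files are assumptions. It swaps xargs for find’s own -exec ... +, which passes filenames directly to stat and so copes with spaces in names.

```shell
#!/usr/bin/env sh
# Sketch: total size of files matching a pattern, in units of 1024*1024 bytes.
# total_mib and the demo files are hypothetical.
total_mib() {
   # $1: directory to search, $2: filename pattern for find -name
   find "${1}" -type f -name "${2}" -exec stat -c '%s' {} + | \
      awk '{Total+=$1} END{print Total/1024/1024}'
}

demo="$( mktemp -d )"
head -c 1048576 /dev/zero > "${demo}/one.csv"      # exactly 1024*1024 bytes
head -c 524288 /dev/zero > "${demo}/half.csv"      # exactly half of that
head -c 100 /dev/zero > "${demo}/ignored.txt"      # not a csv, so not counted

total="$( total_mib "${demo}" '*.csv' )"
echo "${total}"   # prints 1.5

rm -rf "${demo}"
```

Add -mtime +2 back into the find expression to restrict the total to older files, as in the original one-liner.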