Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it’s the one I wanted to work with. I intended to reduce the disk usage of 3 similar projects. Here is my shell script to find duplicate files (by md5sum) and replace each secondary file with a symlink to the relative path of the original file.

#!/usr/bin/env sh
# File: set-symlinks.sh
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
#    https://stackoverflow.com/questions/2564634/convert-absolute-path-into-relative-path-given-a-current-directory-using-bash
# Improve:
# Dependencies:
#    coreutils >= 8.23

INDIR=~/src/project

#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read main child1 child2 child3 child4 ;
   do
      x=0
      while test $x -lt 4 ;
      do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ;
         then
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            test -z "${DRYRUN}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done

I was going to do this task for myself by hand, but then a quick investigation showed 73 files that were duplicates. Because of the small size of the project, I decided to just run it in shell and not resort to Python. I don’t need efficiency; I just need to run it once, really.

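The tricky bits are at the very front of the logic. The awk associative array builds a colon-separated list of all filenames that share an md5sum. Then, after stripping the leading colon (the separator), I pipe the output to a while read loop for easy variable naming. The inner loop runs a few times (hard-coded to 4), and if child number X exists, it gets the relative path to the main file and force-creates the symlink. Note that read puts everything left over into the last variable, so this only works when no file has more than five copies, and filenames with spaces or colons will confuse the awk and read steps.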
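
For comparison, here is a rough Python sketch of the same idea. It is not the script that was run for this post; the function names md5_of and dedupe_with_symlinks and the dry_run parameter are made up for illustration. It groups files by md5 digest, treats the first path in sorted order as the main file, and handles any number of duplicates as well as filenames with spaces.

```python
#!/usr/bin/env python3
import hashlib
import os
from collections import defaultdict


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_with_symlinks(indir, dry_run=False):
    """Replace duplicate regular files under indir with relative symlinks.

    Returns a list of (linkpath, target) pairs that were (or would be) created.
    """
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(indir):
        for name in files:
            path = os.path.join(root, name)
            # skip existing symlinks, like "! -type l" in the find command
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            groups[md5_of(path)].append(path)

    changes = []
    for paths in groups.values():
        paths.sort()
        main, children = paths[0], paths[1:]
        for child in children:
            # roughly: realpath --relative-to "$(dirname child)" main
            target = os.path.relpath(main, os.path.dirname(child))
            changes.append((child, target))
            if not dry_run:
                os.remove(child)
                os.symlink(target, child)
    return changes
```

Calling dedupe_with_symlinks(os.path.expanduser("~/src/project"), dry_run=True) first returns the planned links without touching anything, similar to setting DRYRUN in the shell script.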

Python3: convert relative date to UTC timestamp

I needed to convert a relative time string, e.g., “Yesterday at 3:08 am,” to a UTC timestamp. Apparently this is not that easy to find on the Internet.
My research combines at least 3 different sources to accomplish my goal.

My incoming timezone is UTC-0500 (“US/Eastern”). And I realize it is unfortunate that I am doing some string-to-ints and back and forth. Well, please share how you could do better!

Code

#!/usr/bin/env python3
from datetime import timedelta
from parsedatetime import Calendar
from pytz import timezone 

indate = "Yesterday at 3:08 am"
print("indate:",indate)

# long version, for explanation
cal = Calendar()
dto, _ = cal.parseDT(datetimeString=indate, tzinfo=timezone("US/Eastern"))
add_hours = int((str(dto)[-6:])[:3])
outdate = (timedelta(hours=-add_hours) + dto).strftime('%Y-%m-%dT%H:%M:%SZ')
print("outdate:",outdate)

# short form
dto, _ = Calendar().parseDT(datetimeString=indate, tzinfo=timezone("US/Eastern"))
outdate = (timedelta(hours=-int((str(dto)[-6:])[:3])) + dto).strftime('%Y-%m-%dT%H:%M:%SZ')
print("outdate:",outdate)

Sample run

$ ./sample.py 
indate: Yesterday at 3:08 am
outdate: 2020-02-02T08:08:00Z
outdate: 2020-02-02T08:08:00Z
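
One possible way to avoid the string-to-int round trip is to let datetime do the offset arithmetic with astimezone. The sketch below uses only the standard library and builds the timezone-aware datetime by hand, so a fixed UTC-05:00 offset stands in for "US/Eastern" (an assumption for the example), and the helper name to_utc_string is made up. The object returned by parseDT with a tzinfo should work the same way, since it is timezone-aware.

```python
#!/usr/bin/env python3
from datetime import datetime, timedelta, timezone


def to_utc_string(aware_dt):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    if aware_dt.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    return aware_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Assumption: a fixed UTC-05:00 offset stands in for "US/Eastern" here
eastern_standard = timezone(timedelta(hours=-5))
local = datetime(2020, 2, 2, 3, 8, tzinfo=eastern_standard)
print("outdate:", to_utc_string(local))
```

This also copes with offsets that are not whole hours, which the string slicing above ignores.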

References

Weblinks

  1. python – Convert relative date string to absolute date – Stack Overflow
  2. bear/parsedatetime: Parse human-readable date/time strings [github.com]
  3. python – Convert UTC datetime string to local datetime – Stack Overflow

Package for devuan: waterfox

In my Internet searching I have not found a Devuan-centric Waterfox classic package. So I decided to bundle it myself!
You can see the build info for it in OBS, as well as use the nice download page to get it for yourself. This build is for Devuan Ceres.

It excludes pulseaudio and dbus, which is why I call it “Devuan centric.” While the Open Build Service only offers Debian builds, the package will install just fine on Devuan ceres. If you are the more paranoid type, you can examine my sources and even build the package yourself on Devuan!

Shoutouts

I follow a number of people on the Internet to learn packaging tricks for dpkgs, or for how to build Waterfox. I know basically nothing about real software development, but I can build scripts pretty well, especially when I can follow good examples.

Install build dependencies from source files, dpkg and rpm

For a dpkg

cd $packagedir ;
mk-build-deps
sudo apt-get install ./${package}-build-deps*.deb

Source: ubuntu – Given a debian source package – How do I install the build-deps? – Server Fault

For an rpm

yum-builddep my-package.spec

Or

dnf builddep my-package.spec

Source: fedora – Automatically install build dependencies prior to building an RPM package – Stack Overflow

Build rpm with Jenkins project

Here is how I build rpm files in Jenkins.

Prerequisites

Add a Fedora node to the cluster. I am running Jenkins on Devuan, which obviously is not ideal for building rpms.
I added a few labels, which are space-delimited, which I found unusual. But whatever. I used a service account, and set up an ssh key for passwordless authentication.
Configuring a Jenkins node, with name fc30x-01a and a remote directory and ssh setup.
Add a user and grant them some specific sudo permissions:

useradd jenkins

cat <<EOF >/etc/sudoers.d/70_jenkins
User_Alias JENKINS = jenkins
Defaults:JENKINS !requiretty
JENKINS fc30x-01a=(root)	NOPASSWD: /usr/bin/dnf -y builddep *
EOF

Install some build tools:

sudo dnf -y install rpm-build rpmdevtools

My rpmbuild workflow in Jenkins

Add a new project. Restrict it to run on label “fedora.”
Project configuration showing restrict where project is run, to label "fedora"
Like last time, I am checking out my git repo to a local subdirectory.
Screenshot of project configuration showing SCM, and additional behavior of "Check out to a sub-directory"
All the build steps for this project are shell commands.
The first command uses some tooling I learned about for this project: spectool. The dnf installs the build dependencies, and spectool downloads all the source files that are not already present in the directory.

pwd ; ls -altr ; mkdir -p rpmbuild ; cd rpmbuild ;
cp -p ../work/veracrypt/* . || :;
sudo dnf -y builddep *.spec ;
spectool -g *.spec ;

The actual build command happens in the second step. I am using a few macro definitions to keep everything happening in the present working directory.

cd rpmbuild ; rpmbuild --define "_topdir %(pwd)" --define "_builddir %{_topdir}" --define "_rpmdir %{_topdir}" --define "_sourcedir %{_topdir}" --define "_srcrpmdir %{_topdir}" --define "_rpmfilename %%{NAME}-%%{VERSION}-%%{RELEASE}.%%{ARCH}.rpm" -ba *.spec
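
The long list of --define options is easier to read when laid out as data. The following is only a sketch of how the same command line could be assembled in Python; the function name rpmbuild_argv is hypothetical, and the directory is passed in explicitly instead of using the %(pwd) macro.

```python
#!/usr/bin/env python3


def rpmbuild_argv(topdir, specfile):
    """Build the rpmbuild argument list that keeps all output in topdir."""
    defines = {
        "_topdir": topdir,
        "_builddir": "%{_topdir}",
        "_rpmdir": "%{_topdir}",
        "_sourcedir": "%{_topdir}",
        "_srcrpmdir": "%{_topdir}",
        # flat filename, so the rpm is not placed in an arch subdirectory
        "_rpmfilename": "%%{NAME}-%%{VERSION}-%%{RELEASE}.%%{ARCH}.rpm",
    }
    argv = ["rpmbuild"]
    for name, value in defines.items():
        argv += ["--define", f"{name} {value}"]
    argv += ["-ba", specfile]
    return argv
```

The resulting list could be handed to subprocess.run(), which avoids the shell quoting around each macro definition.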

And the final step deploys the files to my nfs share for manual curation.

mkdir -p /mnt/public/Public/${JOB_NAME} ;
cp -p *.rpm */*.rpm /mnt/public/Public/${JOB_NAME}/ || :

References

Weblinks

  1. fedora – Automatically install build dependencies prior to building an RPM package – Stack Overflow
  2. rpm – How do I get rpmbuild to download all of the sources for a particular .spec? – Stack Overflow
  3. Rpmdevtools – Fedora Project Wiki
  4. Build RPMs using Jenkins/Hudson

Build dpkg with Jenkins project

Here is how I build dpkgs in Jenkins. I have not yet tried the jenkinsfile syntax, but I’m sure that’s a better way to do it.

Prerequisites

Install the Debian Package Builder plugin. It adds a nice option for a build step which simplifies the debuild process.
Now, I use a single repository to store my rpm specs and dpkg debian/ directories. It’s not ideal, I realize, but it was based on the architecture I understood at the time, and it was modeled after a few upstream places I follow.
So I had to set up the Gitlab plugins:

My debuild workflow in Jenkins

Make a new item, of type Freestyle project.
screenshot of Jenkins wui where the user is about to make a new project named "veracrypt."
To load my repository with its specs and debian/ directories, I pointed to my public gitlab repo which is configured elsewhere in Jenkins.
Screenshot of jenkins project showing General settings, Gitlab connection
I am pulling down the updates branch, of my main git repo, and saving to a local subdirectory.
Screenshot of project configuration showing SCM, and additional behavior of "Check out to a sub-directory"
I’m behind a NAT, so I don’t expect to easily set up a webhook. But I really wish I would bother to get it hooked up so any repo changes would do this. I’ll have to check that out in the future.
For the build steps, I have some shell running before and after the main “Build debian package” section.
Screenshot of build steps from jenkins project, showing parts of the shell statements and Build debian package step.
I had to revamp the debian/watch file in my veracrypt sources, because I wasn’t smart enough to get it to reliably download from sourceforge (oh how the mighty have fallen). But uscan is a wonderful tool that downloads the source tarball for a package.
My first full shell step:

uscan -v -ddd --destdir ../../ --symlink work/veracrypt ; mkdir -p dpkg ; tar -zx -C dpkg --strip-components 1 -f $( find . -maxdepth 1 -iregex '.*\/veracrypt_[0-9]+.*orig.*tar.*z.?' | head -n1 ) ; cp -pr work/veracrypt/debian dpkg/

The syntax got a little weird when I was switching between the bzip2 and the gzip file. Also, I don’t always know the filename, ergo the find command.
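
To illustrate what the find pattern is matching, here is a small Python sketch of the same filename test. The function name pick_orig_tarball is made up for the example, and it sorts the names for a predictable result, whereas find piped to head just takes whichever match is listed first.

```python
#!/usr/bin/env python3
import re


def pick_orig_tarball(filenames, package):
    """Return the first name shaped like <package>_<version>...orig...tar...z?, or None."""
    # same shape as the -iregex pattern: digits after the underscore,
    # then "orig", then "tar", then a compression suffix containing "z"
    pattern = re.compile(
        rf"{re.escape(package)}_[0-9]+.*orig.*tar.*z.?", re.IGNORECASE
    )
    for name in sorted(filenames):
        if pattern.fullmatch(name):
            return name
    return None
```

The trailing "z.?" is what lets one pattern cover .gz, .xz and .bz2 endings.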
And then, the Build debian package step. The source tarball was extracted to the dpkg/ location, and the debian dir copied there. I really appreciate the ability to specify where the debian/ directory is.
And the final shell step deploys the files to my nfs share for final processing by hand. I hand-curate what’s in my own repo. I don’t have a proper pool/ setup like the real Devuan repos, so I curate it myself, plus my volume is low enough I can do it all myself.

mkdir -p /mnt/public/Public/${JOB_NAME} ;
cp -p *.build *.buildinfo *.changes *.deb *.debian.tar* *.dsc /mnt/public/Public/${JOB_NAME}/

And that’s it! For now I will trigger these builds manually.

Mirror an OBS deb repository locally

Story

I run an OBS repository for all my packages, and it is available at the main site: https://build.opensuse.org/project/show/home:bgstack15.

But I wanted to mirror this for myself, so I don’t have to configure all my systems to point outward to get updates. I already host a Devuan Ceres mirror for myself, and so mirroring this Open Build Service repository is the last step to be self-hosting entirely for all systems except the mirror server.

I first dabbled with debmirror, but it kept wanting to try rsync despite my best configuration, plus it really insists on using the dists/ directory which isn’t used in the OBS deb repo design. So, I researched scraping down a whole site, and I found httrack which exists to serve a local copy of an Internet site. Bingo!

After a few hours of work, here is my solution for mirroring an OBS deb repo locally.

Solution

Create a user who will own the files and execute the httrack command, because httrack didn’t want to be run as root. Also, this new user can’t munge other data.

useradd obsmirror

Configure a script (available at gitlab)

#!/bin/sh
# File: /etc/installed/obsmirror.sh
# License: CC-BY-SA 4.0
# Author: bgstack15
# Startdate: 2020-01-05 18:01
# Title: Script that scrapes down OBS site to serve a copy to intranet
# Purpose: save down my OBS site so I can serve it locally
# History:
# Usage:
#    in a cron job: /etc/cron.d/mirror.cron
#       50	12	*	*	*	root	/etc/installed/obsmirror.sh 1>/dev/null 2>&1
# Reference:
#    https://unix.stackexchange.com/questions/114044/how-to-make-wget-download-recursive-combining-accept-with-exclude-directorie?rq=1
#    man 1 httrack
#    https://software.opensuse.org//download.html?project=home%3Abgstack15&package=freefilesync
# Improve:
#    use some text file as a list of recently-synced URLs, and if today's URL matches a recent one, then run the httrack with the --update flag. Probably keep a running list forever.
# Documentation:
#    Download the release key and trust it.
#       curl -s http://repo.example.com/mirror/obs/Release.key | apt-key add -
#    Use a sources.list.d/ file with contents:
#       deb https://repo.example.com/mirror/obs/ /
# Dependencies:
#    binaries: curl httrack grep head tr sed awk chmod chown find rm ln
#    user: obsmirror

logfile="/var/log/obsmirror/obsmirror.$( date "+%FT%H%M%S" ).log"
{
   test "${DEBUG:-NONE}" = "FULL" && set -x
   inurl="http://download.opensuse.org/repositories/home:/bgstack15/Debian_Unstable"
   workdir=/tmp/obs-stage
   outdir=/var/www/mirror/obs
   thisuser=obsmirror
   echo "logfile=${logfile}"

   mkdir -p "${workdir}" ; chmod "0711" "${workdir}" ; chown "${thisuser}:$( id -gn "${thisuser}" )" "${workdir}"
   cd "${workdir}"
   # get page contents
   step1="$( curl -s -L "${inurl}/all" )"
   # get first listed package
   step2="$( echo "${step1}" | grep --color=always -oE 'href="[a-zA-Z0-9_.+\-]+\.deb"' | head -n1 | grep -oE '".*"' | tr -d '"' )"
   # get full url to a package
   step3="$( curl -s -I "${inurl}/all/${step2}" | awk '/Location:/ {print $2}' )"
   # get directory of the mirror to save down
   step4="$( echo "${step3}" | sed -r -e "s/all\/${step2}//;" -e 's/\s*$//;' )"
   # get domain of full url
   domainname="$( echo "${step3}" | grep -oE '(ht|f)tps?:\/\/[^\/]+\/' | cut -d'/' -f3 )"
   echo "TARGET URL: ${step4}"
   test -z "${DRYRUN}" && {
      # clean workdir of specific domain name in use right now.
      echo su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\""
      su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\"*"
      # have to skip the orig.tar.gz files because they are large and slow down the sync process significantly.
      echo su "${thisuser}" -c "httrack \"${step4}\" -*.orig.t* -v --mirror --update -s0 -r3 -%e0 \"${workdir}\""
      time su "${thisuser}" -c "httrack ${step4} -*.orig.t* -v --mirror --update -s0 -r3 -%e0 ${workdir}"
   }
   # -s0 ignore robots.txt
   # -r3 only go down 3 links
   # -%e0 follow 0 links to external sites

   # find most recent directory of that level
   levelcount="$(( $( printf "%s" "${inurl}" | tr -dc '/' | wc -c ) - 1 ))"
   subdir="$( find "${workdir}" -mindepth "${levelcount}" -maxdepth "${levelcount}" -type d -name 'Debian_Unstable' -printf '%T@ %p\n' | sort -n -k1 | tail -n1 | awk '{print $2}' )"

   # if the work directory actually synced
   if test -n "${subdir}" ;
   then

      printf "%s " "DIRECTORY SIZE:"
      du -sxBM "${subdir:-.}"
      mkdir -p "$( dirname "${outdir}" )"
      # get current target of symlink
      current_target="$( find "${outdir}" -maxdepth 0 -type l -printf '%l\n' )"

      # if the current link is pointing to a different directory than this subdir
      if test "${current_target}" != "${subdir}" ;
      then
         # then replace it with a link to this one
         test -L "${outdir}" && unlink "${outdir}"
         echo ln -sf "${subdir}" "${outdir}"
         ln -sf "${subdir}" "${outdir}"
      fi

   else
      echo "ERROR: No subdir found, so cannot update the symlink."
   fi

   # disable the index.html with all the httrack comments and original site links
   find "${workdir}" -iname '*index.html' -exec rm {} +
} 2>&1 | tee -a "${logfile}"

And place this in cron!

50	12	*	*	*	root	/etc/installed/obsmirror.sh 1>/dev/null 2>&1

Explanation of script

So the logic is a little convoluted, because the OBS front page actually redirects downloads to various mirrors where the files are kept. So I needed to learn what the actual site is, and then pull down that whole site.
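
The redirect handling can be sketched in Python as a pure function: given the Location header for one package file, work out the repository base URL and the mirror's hostname. The function name and the example mirror hostname below are hypothetical. This version insists that the URL ends with all/<package file>, which is a little stricter than the sed substitution in the script.

```python
#!/usr/bin/env python3
from urllib.parse import urlsplit


def mirror_base_and_domain(location, package_file):
    """Split a redirect target into the repository base URL and the mirror's host."""
    location = location.strip()  # header values may carry a trailing CR/LF
    suffix = "all/" + package_file
    if not location.endswith(suffix):
        raise ValueError("redirect does not end with all/<package file>")
    base = location[: -len(suffix)]
    domain = urlsplit(location).netloc
    return base, domain


if __name__ == "__main__":
    # hypothetical mirror hostname, for illustration only
    print(mirror_base_and_domain(
        "http://mirror.example.org/repositories/home:/bgstack15/Debian_Unstable/all/foo_1.0-1_all.deb",
        "foo_1.0-1_all.deb",
    ))
```

The base URL corresponds to step4 in the script, and the domain is the directory name that httrack creates under the work directory.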
I couldn’t just use httrack --get-files because it saves everything into one flat directory, and then the paths listed in the Packages file no longer match where the package files actually are. But I didn’t want to serve the whole complex directory structure that httrack saves, just the repository structure. So I make a symlink to the repository directory in my actual web contents location.
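
The symlink step amounts to "point the web directory at the synced repository directory, but only touch the link when the target changed." A minimal Python sketch of that logic follows; the function name point_symlink is made up, and it assumes the web location is either absent or already a symlink (a real directory in the way raises an error).

```python
#!/usr/bin/env python3
import os


def point_symlink(outdir, subdir):
    """Make outdir a symlink to subdir; return True if the link was changed."""
    current = os.readlink(outdir) if os.path.islink(outdir) else None
    if current == subdir:
        return False  # already pointing at the right place
    if os.path.islink(outdir):
        os.unlink(outdir)
    parent = os.path.dirname(outdir)
    if parent:
        os.makedirs(parent, exist_ok=True)
    os.symlink(subdir, outdir)
    return True
```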

Package for devuan: chicago95-theme-all

Because self-promotion (erm, I mean… learning!) is the purpose of this blog, here is a package I have assembled from a nifty upstream: Chicago95!

You can go get it from the OBS repository now.

screenshot of XFCE with chicago95 theme in use
Screenshot from https://github.com/grassmunk/Chicago95 project

chicago95-theme-all is a metapackage that pulls in all the elements needed to configure your Devuan GNU+Linux system to look like a classic non-free OS from 1995!

$ apt-cache search chicago
chicago95-theme-all - XFCE Windows 95 Total Conversion
chicago95-theme-cursors - Mouse cursor themes for Chicago95
chicago95-theme-doc - Documentation for Chicago95
chicago95-theme-fonts - Fonts for Chicago95
chicago95-theme-greeter - Lightdm webkit greeter for Chicago95
chicago95-theme-gtk - GTK and WM themes for Chicago95
chicago95-theme-icons - Icon themes for Chicago95
chicago95-theme-login-sound - Login sound for Chicago95
chicago95-theme-plymouth - Plymouth theme for Chicago95
$ apt-cache policy chicago95-theme-all
chicago95-theme-all:
  Installed: (none)
  Candidate: 0.0.1-1+devuan
  Version table:
     0.0.1-1+devuan 500
        500 http://download.opensuse.org/repositories/home:/bgstack15/Debian_Unstable  Packages

Of course, like any other theme, you need to manually change your settings to use the theme. But these packages make it easy to install it so you can control the files with the package manager.

Backstory

I was helping a family member set up a system to look like a particular non-free OS, and an old one was acceptable, and even preferred. After some research, I discovered a few good places to look for themes:

There were multiple options for the type of theme I wanted, but Chicago95 was the most cohesive and cleanly-installed set.

Devuan generate new ssh keys for freeipa host

If a Devuan system is a freeipa client, but you cannot ssh -o GSSAPIAuthentication=yes to it, even though all the regular troubleshooting steps work, and the logs don’t show you anything, the host ssh keys might be wrong in freeipa.

Generate new ssh keys for freeipa host

All the steps can be taken on the host in question.
As root, make sure you can kinit -k to get a Kerberos ticket with the host keytab. If this step doesn’t work, you need to go fix that, which is beyond the scope of this post.

# kinit -k
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/d2-03a.ipa.example.com@IPA.EXAMPLE.COM

Valid starting       Expires              Service principal
12/31/2019 07:25:47  01/01/2020 07:25:47  krbtgt/IPA.EXAMPLE.COM@IPA.EXAMPLE.COM

Now, generate new ssh keys. Apparently on Devuan systems, restarting the daemon is not good enough. On CentOS, if you delete the ssh host keys, restarting the daemon will just generate new ones which can cause some interesting effects when connecting to a host that did so. However, on Devuan you have to run:

rm -rf /etc/ssh/ssh_host_*_key*
dpkg-reconfigure openssh-server
service ssh restart

And then, with the fresh ticket from the kinit -k earlier, it’s a piece of cake to modify this host in freeipa to use a new set of ssh public keys!

LC_ALL="" LC_CTYPE="C.UTF-8" ipa host-mod --sshpubkey="$( cat /etc/ssh/ssh_host_rsa_key.pub )" --sshpubkey="$( cat /etc/ssh/ssh_host_ecdsa_key.pub )" --sshpubkey="$( cat /etc/ssh/ssh_host_ed25519_key.pub )" $( hostname -s )
----------------------
Modified host "d2-03a"
----------------------
  Host name: d2-03a.ipa.example.com
  Principal name: host/d2-03a.ipa.example.com@IPA.EXAMPLE.COM
  Principal alias: host/d2-03a.ipa.example.com@IPA.EXAMPLE.COM
  SSH public key: ssh-rsa
                  AAAAB3NzaC1yc4EAAAADAQABAAABg[truncated]
                  root@d2-03a, ecdsa-sha2-nistp256
                  AAAAE@VjZHNhLXNoYTItbmlzdHAyNTYAAAAI[truncated]
                  root@d2-03a, ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBU/CbzrNnMivn5kAiHTU6WSadY/FWPG8qZ3sGleDbHr
                  root@d2-03a
  SSH public key fingerprint: SHA256:tMcJ2uFNmx6K+dF+Gm6WUBO4AvBmGVj9247mvg5LxU4 root@d2-03a (ssh-rsa),
                              SHA256:uJeRc0dkao/DmnQm2hyQUSfeC0HgIZppB2NVyA+BoTA root@d2-03a (ecdsa-sha2-nistp256),
                              SHA256:j+trvcJAQx5PeaJbUJ8xImBDgCJ2U/nW3h5D3m2kTj4 root@d2-03a (ssh-ed25519)
  Password: False
  Keytab: True
  Managed by: d2-03a.ipa.example.com

My ipa command kept complaining about all these language problems. Maybe I failed to set them correctly in my preseed. Whatever.
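
If the set of host key types varies between systems, the --sshpubkey arguments can be generated from whatever public keys exist instead of naming each file. Here is a sketch in Python; the function name host_mod_argv is hypothetical, and it only builds the argument list without running ipa.

```python
#!/usr/bin/env python3
import glob
import os


def host_mod_argv(hostname, keydir="/etc/ssh"):
    """Build an 'ipa host-mod' argument list with one --sshpubkey per host key."""
    argv = ["ipa", "host-mod"]
    for pubfile in sorted(glob.glob(os.path.join(keydir, "ssh_host_*_key.pub"))):
        with open(pubfile) as handle:
            argv.append("--sshpubkey=" + handle.read().strip())
    argv.append(hostname)
    return argv
```

The list could then be passed to subprocess.run(), with LC_ALL and LC_CTYPE set in the env argument the same way as on the command line above.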

References

Internet searches

freeipa new ssh host key

Weblinks

6.8. Managing Public SSH Keys for Hosts
How To: Ubuntu / Debian Linux Regenerate OpenSSH Host Keys – nixCraft

Man pages

ipa help host-mod