
The GitLab Issue Preservation (GLIP) Project

Introduction

Welcome to my huge hack of a project, which saves static webpages of the git issues from a private GitLab instance. As part of the Devuan migration from a self-hosted GitLab instance to a self-hosted Gitea instance, the topic came up of preserving the historical information in the git issues. I volunteered. The end results are not on the public Internet yet, but they will be eventually.

Overview of the process

If you merely want to execute the steps, these are the instructions. Most of them invoke scripts, which are explained in the annotated process below.

  1. Use gitlablib to list all issue web URLs, and then remove all the “build”, “buildmodify”, and similar CI/CD issues.
    . gitlablib.sh
    list_all_issues | tee output/issues.all
    <output/issues.all jq '.[]| if(.title|test("build-?(a(ll)?|mod(ify)?|add|del)?$")) then empty else . end | .web_url' | sed -r -e 's/"//g;' > output/issues.all.web_url
    
  2. Use fetch-issue-webpages.py to fetch all those webpages.
    ln -s issues.all.web_url output/files-to-fetch.txt
    ./fetch-issue-webpages.py
    
  3. Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task.
    • Fix newlines
      sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html
    • Find data-original-title attributes and replace the relative timestamps with absolute ones.
      ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
      ./fix-timestamps.py
      
    • Download all relevant images, and then fix them.
      ./fetch-images.sh
      sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
    • Download all stylesheets and then fix them.
      mkdir -p /mnt/public/www/issues/css
      ./fetch-css.sh
      sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html
    • Fix some encoding oddities
      sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
      
    • Remove html components that are not necessary
      ./remove-useless.py
    • Fix links that point to defunct domain without-systemd.org.
      sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
      

The annotated process

    1. Use gitlablib to list all issue web URLs, and then remove all the “build”, “buildmodify”, and similar CI/CD issues.
      . gitlablib.sh
      list_all_issues | tee output/issues.all
      <output/issues.all jq '.[]| if(.title|test("build-?(a(ll)?|mod(ify)?|add|del)?$")) then empty else . end | .web_url' | sed -r -e 's/"//g;' > output/issues.all.web_url
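
      For reference, the regex in the jq filter matches the auto-generated CI issue titles that end in build, buildall, buildmod(ify), buildadd, or builddel (with an optional hyphen), and `then empty` drops them. Here is a quick sanity check of the pattern, as a sketch in Python with hypothetical titles; jq’s test() uses a comparable, unanchored regex search:

      import re
      
      pattern = re.compile(r"build-?(a(ll)?|mod(ify)?|add|del)?$")
      for title in ["packagename build", "packagename buildall", "fix the build system docs", "normal issue"]:
         print(title, "->", "dropped" if pattern.search(title) else "kept")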
      

      I wrote a brief shell script library for interacting with the GitLab REST API. I ended up with more functions than I needed for this project; really, all I needed was the private pagination helper and list_all_issues. Once I figured out how to list all issues (and how to handle the pagination), I didn’t need to bother with the per-project or per-repo perspective.

      #!/bin/sh
      # Startdate: 2020-05-29
      # Dependencies:
      #    jq
      #    my private token
      # Library for interacting with Gitlab API
      # For manual work:
      #    curl --header "${authheader}" "https://git.devuan.org/api/v4/projects/devuan%2Fdevuan-project/issues"
      # References:
      #    https://docs.gitlab.com/ee/api/README.html#pagination
       #    handle transforming the / in the path_with_namespace to %2F per https://docs.gitlab.com/ee/api/README.html#namespaced-path-encoding
      #    https://docs.gitlab.com/ee/api/issues.html
      
      export token="$( cat /mnt/public/work/devuan/git.devuan.org.token.txt )"
      export authheader="Private-Token: ${token}"
      
      export server=git.devuan.org
      
      export GLL_TMPDIR="$( mktemp -d )"
      
      clean_gitlablib() {
         rm -rf "${GLL_TMPDIR:-NOTHINGTODELETE}"/*
      }
      
      # PRIVATE
      _handle_gitlab_pagination() {
         # call: list_all_projects "${startUri}"
         ___hgp_starturi="${1}"
         test -n "${GLL_DEBUG}" && set -x
         # BEGIN
         rhfile="$( TMPDIR="${GLL_TMPDIR}" mktemp -t "headers.XXXXXXXXXX" )"
         done=0
         size=-1
         uri="${___hgp_starturi}"
      
         # LOOP
         while test ${done} -eq 0 ;
         do
            response="$( curl -v -L --header "${authheader}" "${uri}" 2>"${rhfile}" )"
             #grep -iE "^< link" "${rhfile}"
             # determine size
             if test "${size}" = "-1" ; then
                # run only if size is still undefined
                tmpsize="$( awk '$2 == "x-total:" {print $3}' "${rhfile}" 2>/dev/null )"
               test -n "${tmpsize}" && size="${tmpsize}"
               echo "Number of items: ${size}" 1>&2
            fi
      
            tmpnextpage="$( awk '$2 == "x-next-page:" {print $3}' "${rhfile}" 2>/dev/null )"
            # if x-next-page is blank, that means we are on the last page. Also, we could try x-total-pages compared to x-page.
            test -z "${tmpnextpage}" && done=1
            # so if we have a next page, get that link
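             # the pagination URLs come back in the Link response header, which curl -v writes
             # to stderr roughly as:  < link: <https://...&page=2>; rel="next", <...>; rel="first", <...>; rel="last"
             # so blank the first two fields and pull out the URL tagged rel="next"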
            nextUri="$( awk '{$1="";$2="";print}' "${rhfile}" | tr ',' '\n' | awk -F';' '/rel="next"/{print $1}' | sed -r -e 's/^\s*<//;' -e 's/>\s*$//;' )"
            if test -n "${nextUri}" ; then
               uri="${nextUri}"
            else
               echo "No next page provided! Error." 1>&2
               done=1
            fi
      
            # show contents
            echo "${response}"
         done
      
         # cleanup
         rm "${rhfile}"
         set +x
      }
      
      list_all_projects() {
         _handle_gitlab_pagination "https://${server}/api/v4/projects"
      }
      
      list_all_issues() {
         _handle_gitlab_pagination "https://${server}/api/v4/issues?scope=all&status=all"
      }
      
      list_issues_for_project() {
         ___lifp_project="${1}"
         ___lifp_htmlencode_bool="${2}"
         istruthy "${___lifp_htmlencode_bool}" && ___lifp_project="$( echo "${___lifp_project}" | sed -r -e 's/\//%2F/g;' )"
         _handle_gitlab_pagination "https://${server}/api/v4/projects/${___lifp_project}/issues"
      }
      
      
    2. Use fetch-issue-webpages.py to fetch all those webpages.
      ln -s issues.all.web_url output/files-to-fetch.txt
      ./fetch-issue-webpages.py
      

      This script is where the bulk of my learning occurred for this project. Did you know that headless browsers can scroll down a webpage and basically force the AJAX to load, that annoying stuff that doesn’t load when you do a nice, simple wget?

      #!/usr/bin/env python3
      # Startdate: 2020-05-29 16:22
      # History:
      # Usage:
      #    ln -s issues.all.web_url output/files-to-fetch.txt
       #    ./fetch-issue-webpages.py
      # How to make this work:
      #    apt-get install python3-pyvirtualdisplay
      #    download this geckodriver, place in /usr/local/bin
      # References:
      #    basic guide https://web.archive.org/web/20191031110759/http://scraping.pro/use-headless-firefox-scraping-linux/
      #    https://stackoverflow.com/questions/40302006/no-such-file-or-directory-geckodriver-for-a-python-simple-selenium-applicatio
      #    geckodriver https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
      #    https://www.selenium.dev/selenium/docs/api/py/index.html?highlight=get
      #    page source https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html?highlight=title#selenium.webdriver.remote.webdriver.WebDriver.title
      #    make sure all comments load https://stackoverflow.com/questions/26566799/wait-until-page-is-loaded-with-selenium-webdriver-for-python/44998503#44998503
      #    https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/
      # Improve:
      from pyvirtualdisplay import Display
      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      import re, time, getpass
      
      def ask_password(prompt):
          #return input(prompt+": ")
          return getpass.getpass(prompt+": ")
      
      def scrollDown(driver, value):
         driver.execute_script("window.scrollBy(0,"+str(value)+")")
      
      # Scroll down the page
      def scrollDownAllTheWay(driver):
         old_page = driver.page_source
         while True:
            #logging.debug("Scrolling loop")
            for i in range(2):
               scrollDown(driver, 500)
               time.sleep(2)
            new_page = driver.page_source
            if new_page != old_page:
               old_page = new_page
            else:
               break
         return True
      
      server_string="https://git.devuan.org"
      outdir="/mnt/public/www/issues"
      
      with open("output/files-to-fetch.txt") as f:
         lines=[line.rstrip() for line in f]
      
      # ask password now instead of after the delay
      password = ask_password("Enter password for "+server_string)
      
      display = Display(visible=0, size=(800, 600))
      display.start()
      
      browser = webdriver.Firefox()
      
      # log in to gitlab instance
      browser.get(server_string+"/users/sign_in")
      browser.find_element_by_id("user_login").send_keys('bgstack15')
      browser.find_element_by_id("user_password").send_keys(password)
      browser.find_element_by_class_name("qa-sign-in-button").click()
      browser.get(server_string+"/profile") # always needs the authentication
      scrollDownAllTheWay(browser)
      
      for thisfile in lines:
         destfile=re.sub("\.+",".",re.sub("\/|issues",".",re.sub("^"+re.escape(server_string)+"\/","",thisfile)))+".html"
         print("Saving",thisfile,outdir+"/"+destfile)
         browser.get(thisfile)
         scrollDownAllTheWay(browser)
         with open(outdir+"/"+destfile,"w") as text_file:
            print(browser.page_source.encode('utf-8'),file=text_file)
      
      # done with loop
      browser.quit()
      display.stop()
      
    3. Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task.
      • Fix newlines
        sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html

         Nothing fancy here. I guess my encoding choice for saving the output was a little… wrong. So I’m sure this is a crutch that isn’t used by the professionals.
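
         The literal \n sequences come from fetch-issue-webpages.py printing browser.page_source.encode('utf-8'), i.e. the repr of a bytes object, into the output file. A minimal sketch of what the save step could do instead (same browser, outdir, and destfile variables as in that script), which would make this sed pass unnecessary:

         # write the page source as text with an explicit encoding;
         # no b'...' wrapper and no escaped newlines to clean up later
         with open(outdir + "/" + destfile, "w", encoding="utf-8") as text_file:
            text_file.write(browser.page_source)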

      • Find data-original-title attributes and replace the relative timestamps with absolute ones.
        ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
        ./fix-timestamps.py
        

         I’m really fond of this one, partially because it’s entirely my solution (and I’m using it exactly as written for another project), and because it depends on an amazing little piece of metadata that GitLab provides in the web pages! The timestamps for relevant items are included, so while the rendered html shows “1 week ago,” we can convert the text to show the absolute timestamp.
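         To illustrate what that metadata looks like (a made-up sample; attributes on a real page may differ slightly), the rendered relative time sits inside a <time> tag whose data-original-title attribute carries the absolute timestamp:

         from bs4 import BeautifulSoup
         
         # hypothetical markup of the kind fix-timestamps.py relies on
         sample = '<time data-original-title="May 29, 2020 8:15pm EDT">1 week ago</time>'
         tag = BeautifulSoup(sample, "html.parser").find("time")
         print(tag.attrs["data-original-title"])   # May 29, 2020 8:15pm EDT
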
        The script is as follows:

        #!/usr/bin/env python3
        # Startdate: 2020-05-29 20:40
        # Purpose: convert timestamps on gitlab issue web page into UTC
        # History:
        #    2020-05-30 09:24 add loop through files listed in output/files-for-timestamps.txt
        # Usage:
        #    ls -1 /mnt/public/www/issues/output*.html > output/files-for-timestamps.txt
        #    ./fix-timestamps.py
        # References:
        #    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing
        #    https://gitlab.com/bgstack15/vooblystats/-/blob/master/vooblystats.py
        #    https://bgstack15.wordpress.com/2020/02/16/python3-convert-relative-date-to-utc-timestamp/
        # Improve:
        #    this is hardcoded to work when the pages are shown in EDT.
        from bs4 import BeautifulSoup
        from datetime import timedelta
        from parsedatetime import Calendar
        from pytz import timezone 
        
        def fix_timestamps(page_text):
           soup = BeautifulSoup(page_text,"html.parser")
           cal = Calendar()
           x = 0
           for i in soup.find_all(name='time'):
              x = x + 1
              j = i.attrs["data-original-title"]
              if 'EDT' == j[-3:] or 'EST' == j[-3:]:
                 tzobject=timezone("US/Eastern")
              else:
                 tzobject=timezone("UTC")
               dto, _ = cal.parseDT(datetimeString=j, tzinfo=tzobject)
              add_hours = int((str(dto)[-6:])[:3])
              j = (timedelta(hours=-add_hours) + dto).strftime('%Y-%m-%dT%H:%MZ')
              # second precision %S is not needed for this use case.
              i.string = j
           return soup
        
        with open("output/files-for-timestamps.txt") as f:
           lines = [line.rstrip() for line in f]
        
        for thisfile in lines:
           print("Fixing timestamps in file",thisfile)
           with open(thisfile) as tf:
              output=fix_timestamps(tf.read())
           with open(thisfile,"w",encoding='utf-8') as tf:
              tf.write(str(output.prettify()))
        
      • Download all relevant images, and then fix them.
        ./fetch-images.sh
        sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html

        I wrote this script first, because it was the images that were the most important item for this whole project.

        #!/bin/sh
        # startdate 2020-05-29 20:04
        # After running this, be sure to do the sed.
        #    sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
        # Improve:
        #    It is probably an artifact of the weird way the asset svgs are embedded, but I cannot get them to display at all even though they are downloaded successfully. I have seen this before, the little embedded images you cannot easily download and simply display.
        
        INDIR=/mnt/public/www/issues
        INGLOB=*.html
        
        SEDSCRIPT=/mnt/public/work/devuan/fix-images-in-html.sed
        
        INSERVER=https://git.devuan.org
        
        cd "${INDIR}"
        
        # could use this line to get all the assets, but they do not display regardless due to html weirdness
        #orig_src="$( grep -oE '(\<src|xlink:href)="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$0]++{print $2}' )"
         orig_src="$( grep -oE '\<src="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$2]++{print $2}' )"
         
         cat /dev/null > "${SEDSCRIPT}"
        
        echo "${orig_src}" | while read line ; do
           getpath="${INSERVER}${line}"
           outdir="$( echo "${line}" | awk -F'/' '{print $2}' )"
           test ! -d "${outdir}" && mkdir -p "${outdir}"
           targetfile="${outdir}/$( basename "${line}" )"
           test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
           test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
           # dynamically build a sed script
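            # e.g. a generated line might look like (hypothetical path):
            #    s:/uploads/abc123/screenshot.png:uploads/screenshot.png:g;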
           echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
        done
        
      • Download all stylesheets and then fix them.
        mkdir -p /mnt/public/www/issues/css
        ./fetch-css.sh
        sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html

        This is basically a rehash of the previous script.

        #!/bin/sh
        # Startdate: 2020-05-29 20:18
        
        INDIR=/mnt/public/www/issues
        INGLOB=*.html
        
        SEDSCRIPT=/mnt/public/work/devuan/fix-css-in-html.sed
        
         # OUTDIR will be made in INDIR, because of the `cd` below.
         OUTDIR=css
         
         INSERVER=https://git.devuan.org
         
         cd "${INDIR}"
         
         test ! -d "${OUTDIR}" && mkdir -p "${OUTDIR}"
        
         orig_css="$( sed -n -r -e 's/^.*<link.*(href="[^"]+\.css").*/\1/p' ${INGLOB} | awk -F'"' '!x[$2]++{print $2}' )"
         
         cat /dev/null > "${SEDSCRIPT}"
        
        echo "${orig_css}" | while read line ; do
           getpath="${INSERVER}${line}"
           targetfile="${OUTDIR}/$( basename "${line}" )"
           test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
           test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
           # dynamically build a sed script
           echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
        done
      • Fix some encoding oddities
        sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
        

        This is definitely because of my choice of encoding. In fact, I bet my copy-paste of the script contents is entirely messed up for this blog post. You’ll have to check it out in the git repo. Also, this is probably the hackiest part of the whole project.
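
         Most of those escaped sequences are just the UTF-8 bytes of ordinary characters, captured as a bytes repr by the fetch script; a quick illustration (not part of the flow):

         # printing the repr of UTF-8 bytes is what put sequences like
         # \xc2\xb7 (middle dot) and \xe2\x80\x99 (right single quote) into the saved pages
         print("a · b’s".encode("utf-8"))
         # b'a \xc2\xb7 b\xe2\x80\x99s'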

        $ {s/^'//}
        1 {s/^b'//}
        s/·/·/g # do not ask how I made this one
        s/Ã//g
        s/\\'/'/g
        s/\xc2(\x91|\x82|\x)//g
        s/\\xc2\\xb7/·/g # two characters here
         s/\\xc3\\xab/ë/g
        s/\\xe1\\xb4\\x84\\xe1\\xb4\\xa0\\xe1\\xb4\\x87/CVE/g
        s/\\xe2\\x80\\x99/'/g
        s/\\xe2\\x80\\xa6/.../g
        s/(\\x..)*\\xb7/·/g # two characters here
        
      • Remove html components that are not necessary
        ./remove-useless.py

         Thankfully, I know enough BeautifulSoup to be dangerous. In fact, I went with the scrape-and-delete method because we wanted readable issue contents with minimal work. And yes, this was my best-case scenario for “minimal work.” And yes, I know this has way too much duplicated code (a condensed variant is sketched after the script below). It works. Please submit any optimizations as a comment below, or as a PR on the git repo.

        #!/usr/bin/env python3
        # Startdate: 2020-05-30 19:30
        # Purpose: remove key, useless html elements from slurped pages
        from bs4 import BeautifulSoup
        import sys
        
        def remove_useless(contents):
           soup = BeautifulSoup(contents,"html.parser")
           try:
              sidebar = soup.find(class_="nav-sidebar")
              sidebar.replace_with("")
           except:
              pass
           try:
              navbar = soup.find(class_="navbar-gitlab")
              navbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="issuable-context-form")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-sidebar")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-actions")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-noteable-awards")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="disabled-comment")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="notes-form")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="btn-edit")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-edit")
              rightbar.replace_with("")
           except:
              pass
           try:
              mylist = soup.find_all(class_="note-actions")
              for i in mylist:
                 i.replace_with("")
           except:
              pass
           try:
              mylist = soup.find_all(class_="emoji-block")
              for i in mylist:
                 i.replace_with("")
            except:
               pass
            return soup
        
        with open("output/files-for-timestamps.txt") as f:
           lines = [line.rstrip() for line in f]
        
        for thisfile in lines:
           print("Removing useless html in file",thisfile)
           with open(thisfile) as tf:
              output=remove_useless(tf.read())
           with open(thisfile,"w",encoding='utf-8') as tf:
              tf.write(str(output.prettify()))
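
         For what it’s worth, here is a condensed sketch of the same scrape-and-delete idea (untested, using the same class names and behavior as above), looping over the selectors instead of repeating the try/except blocks:

         from bs4 import BeautifulSoup
         
         # classes removed once per page, and classes removed wherever they appear
         SINGLE = ["nav-sidebar", "navbar-gitlab", "issuable-context-form",
                   "js-issuable-sidebar", "js-issuable-actions", "js-noteable-awards",
                   "disabled-comment", "notes-form", "btn-edit", "js-issuable-edit"]
         MULTI = ["note-actions", "emoji-block"]
         
         def remove_useless(contents):
            soup = BeautifulSoup(contents, "html.parser")
            for cls in SINGLE:
               tag = soup.find(class_=cls)
               if tag is not None:
                  tag.replace_with("")
            for cls in MULTI:
               for tag in soup.find_all(class_=cls):
                  tag.replace_with("")
            return soup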
        
      • Fix links that point to defunct domain without-systemd.org.
        sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
        

        This requirement came in late during the development phase. I called this one “scope creep,” but thankfully it was easy enough to automate changing out links to the web.archive.org versions.

        /without-systemd\.org/{
           /archive\.org/!{
              s@(http://without-systemd\.org)@https://web.archive.org/web/20190208013412/\1@g;
           }
        }

Conclusions

I learned how to use a headless browser for this project! I had already dabbled with BeautifulSoup and jq, and of course I already knew the GNU coreutils. Thankfully, I also had a function on hand for fixing relative timestamps.

References

Weblinks

Obviously my scripts listed here also contain the plain URLs of the references, but this is the list of them in html format:

  1. API Docs | GitLab #pagination
  2. API Docs | GitLab #namespaced-path-encoding
  3. Issues API | GitLab
  4. basic guide to headless browser Tutorial: How to use Headless Firefox for Scraping in Linux (Web Archive)
  5. linux – No such file or directory: ‘geckodriver’ for a Python simple Selenium application – Stack Overflow
  6. geckodriver binary https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
  7. Selenium Client Driver — Selenium 3.14 documentation
  8. page source selenium.webdriver.remote.webdriver — Selenium 3.14 documentation
  9. make sure all comments load Wait until page is loaded with Selenium WebDriver for Python – Stack Overflow
  10. Selenium 101: How To Automate Your Login Process | CrossBrowserTesting.com
  11. Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation
  12. relative timestamp to absolute vooblystats.py · master · B Stack / vooblystats · GitLab
  13. Python3: convert relative date to UTC timestamp
