Grep odt file

Overview

In the GNU/Linux world, you spend a lot of time on the command line. Searching a text file is a piece of cake:
grep -iE "expression" file1

You might even use a GUI, and inside that GUI you might use an open-source office suite for those times when plain text isn’t enough. But what about when you want to search one of those .odt files that you vaguely remember being some form of XML?

Easy. Use unoconv or odt2txt (look them up in your package manager) to convert the document, then grep the output file. Or use unoconv's --stdout option and pipe the text straight into grep.

unoconv -f txt foo.odt

unoconv -f txt --stdout foo.odt | grep -iE "Joe Schmoe"
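
To search several documents at once, the same pipe can be wrapped in a small loop. This is only a sketch: grep_odt is a made-up helper name, and it assumes odt2txt is installed and prints the converted text to standard output when no output file is given.

```shell
# Sketch: grep several .odt files and prefix each match with its file name.
# grep_odt is a hypothetical helper, not part of odt2txt or unoconv.
# Assumption: "odt2txt FILE" writes the plain text to standard output.
grep_odt() {
    pattern=$1
    shift
    for f in "$@"; do
        # Convert one document, keep only the matching lines,
        # then label each of them with the file it came from.
        odt2txt "$f" | grep -iE -- "$pattern" | while IFS= read -r line; do
            printf '%s: %s\n' "$f" "$line"
        done
    done
}
```

Called as grep_odt "Joe Schmoe" *.odt, it should print one "file: line" pair per match. Swapping odt2txt "$f" for unoconv -f txt --stdout "$f" ought to work the same way, just more slowly.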

History

I first tackled this problem by figuring out how to access the XML inside. I learned that an .odt file is a compressed archive, but tar xf didn’t help. It turns out to be a ZIP archive, which unzip handles.

I also had to learn the tiniest bit of Perl, as the basic and extended regular expressions in regular GNU grep (and, I inferred, sed) don’t do non-greedy wildcard matching.

So I got this super-complicated one-liner going before I decided to try a different approach and discovered the unoconv and odt2txt applications.

time unzip -p foo.odt content.xml | sed -e 's/\([^n]\)>\n(.*)<\/\1>/\2/;s/<text:h.*?>(.*)<\/text:h>/\1/;' -e 's/<style:(font-face|text-properties).*\/>//g;' | sed -e "s/&apos;/\'/g;s/&quot;/\"/g;s/<text:.*break\/>//g;"
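
For comparison, here is a much smaller sketch of the same idea. It is not the one-liner above cleaned up, just one possible approach under a few assumptions: xml_to_text is a made-up helper name, paragraphs and headings are assumed to end in </text:p> and </text:h>, and the negated character class [^>]* stands in for the non-greedy match that sed lacks.

```shell
# Sketch: turn ODF content.xml (read from standard input) into rough plain text.
# xml_to_text is a hypothetical helper name, not part of any package.
xml_to_text() {
    # 1. End every paragraph and heading with a newline
    #    (a backslash followed by a real newline is the portable sed spelling).
    # 2. Drop every remaining tag; [^>]* stops at the first ">",
    #    so no non-greedy operator is needed.
    # 3. Decode the five predefined XML entities, "&amp;" last so that
    #    "&amp;lt;" ends up as "&lt;" and not "<".
    sed -e 's/<\/text:[ph]>/\
/g' \
        -e 's/<[^>]*>//g' \
        -e "s/&apos;/'/g" -e 's/&quot;/"/g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}
```

A possible use would be unzip -p foo.odt content.xml | xml_to_text | grep -iE "Joe Schmoe". Tabs, line breaks, and footnotes are not handled specially here, which is one more reason to prefer unoconv or odt2txt.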

 

References

Weblinks

  1. Unzipping an odt file https://ubuntuforums.org/showthread.php?t=899179&s=3aa7c303c4a5655e039600c4082d7a2c&p=5653494#post5653494
  2. Perl non-greedy wildcard matching http://stackoverflow.com/a/1103177/3569534