DIFFing PDF and DOCX files

Compare DIFFerences in text between PDF and DOCX files.
Linux
Utility
Author

Caleb Grant

Published

August 8, 2023

If you have a PDF and a DOCX of the same document and want to check for differences in the text, convert both to plain text and compare them with git diff. Because --word-diff compares the files word by word rather than line by line, it doesn’t matter that the two files use wildly different line wrapping.

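# Extract the PDF text with Ghostscript (-o- writes to stdout)
gs -q -sDEVICE=txtwrite -o- file1.pdf > file1.txt
# Convert the DOCX to plain text with pandoc
pandoc -t plain file2.docx > file2.txt
# Compare the two text files word by word (--no-index works outside a Git repository)
git diff --no-index --word-diff file1.txt file2.txt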

Or create a shortcut…

alias pdfcat='gs -q -sDEVICE=txtwrite -o-'
alias doccat='pandoc -t plain'
pdfcat file1.pdf > file1.txt
doccat file2.docx > file2.txt
git diff --no-index --word-diff file1.txt file2.txt
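
One caveat with the aliases: bash does not expand aliases in non-interactive scripts by default, so the shortcut above only works at an interactive prompt. A minimal sketch of the same idea using shell functions follows. The pdfcat and doccat names come from the aliases above, while docdiff is a hypothetical wrapper name introduced here for illustration. It assumes gs, pandoc, git, and mktemp are all on your PATH.

```shell
# Function versions of the pdfcat/doccat aliases, so they also work in scripts.
pdfcat() { gs -q -sDEVICE=txtwrite -o- "$@"; }
doccat() { pandoc -t plain "$@"; }

# docdiff (hypothetical name): word-diff a PDF against a DOCX in one step.
# Usage: docdiff file.pdf file.docx
docdiff() {
  if [ "$#" -ne 2 ]; then
    echo "usage: docdiff file.pdf file.docx" >&2
    return 2
  fi
  # Keep the intermediate text files in a throwaway directory.
  tmpdir=$(mktemp -d) || return 1
  pdfcat "$1" > "$tmpdir/pdf.txt" &&
    doccat "$2" > "$tmpdir/docx.txt" &&
    git diff --no-index --word-diff "$tmpdir/pdf.txt" "$tmpdir/docx.txt"
  status=$?
  rm -rf "$tmpdir"
  # Pass through the exit status: git diff --no-index returns 1 when the files differ.
  return "$status"
}
```

With these defined (for example in your ~/.bashrc), running docdiff file1.pdf file2.docx does the conversion and comparison without leaving file1.txt and file2.txt behind.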

Credits