I have been using the Poppler library for some time, across a number of projects. It’s an open source set of libraries and command-line tools, very useful for dealing with PDF files. Poppler primarily targets Linux, but the developers have included Windows support in the source code as well. Getting the executables (.exe) and/or DLLs for the latest version, however, is very difficult on Windows. So after years of pain, I jumped on oDesk and contracted Ilya Kitaev both to compile Poppler with Microsoft Visual Studio and to prepare automated tools for easy compiling in the future. Update: MSVC isn’t very well supported; these days the download is built with MinGW.
So now, you can run the following utilities from Windows!
- pdftotext – Extracts all the text from a PDF document. I suggest you use the -layout option (lowercase; the options are case sensitive) to get the content in the right order.
- pdftohtml – I use this with the -xml option to get an XML file listing each text segment’s text, position and size, which is very handy for processing in C#.
- pdftocairo – For exporting to image types, including SVG!
- Many more smaller utilities
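To give a feel for the -xml output, here is a minimal Python sketch (Python rather than C#, purely for brevity) that pulls out each text segment with its position and size. The sample below is hand-written to resemble what pdftohtml produces, not real tool output, so treat the element and attribute names (page, text, top, left, width, height) as assumptions to check against a file generated by your own Poppler version. parse_segments is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling `pdftohtml -xml` output (an assumption,
# not captured from a real run).
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="200" height="16" font="0"><b>Invoice</b> 42</text>
<text top="130" left="72" width="90" height="12" font="1">Total: 10.00</text>
</page>
</pdf2xml>
"""


def parse_segments(xml_text):
    """Return one dict per <text> element: page, position, size and plain text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    segments = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            segments.append({
                "page": int(page.get("number")),
                "left": int(el.get("left")),
                "top": int(el.get("top")),
                "width": int(el.get("width")),
                "height": int(el.get("height")),
                # itertext() also collects text nested inside <b>/<i> markup.
                "text": "".join(el.itertext()),
            })
    return segments


for seg in parse_segments(SAMPLE_XML):
    print(seg["page"], seg["left"], seg["top"], seg["text"])
```

Sorting the segments by top and then left is usually a reasonable first step towards rebuilding the reading order.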
Latest binary: poppler-0.67.0_x86
Windows Subsystem for Linux (WSL) is a great option for many Windows users and developers. You can enable WSL and install Ubuntu if you are not using the “S” edition of Windows. Then you can simply install Poppler with “sudo apt install poppler-utils”. If you’re a developer, you can still start the Ubuntu-based Poppler tools from Windows using the wsl command: “wsl pdftocairo …”
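One wrinkle when calling the WSL tools from a Windows program is that they expect Linux-style paths. Here is a minimal Python sketch, assuming WSL’s default drive mounts under /mnt (so C:\ appears as /mnt/c). The helper names to_wsl_path, wsl_command and run_in_wsl are hypothetical, and WSL’s own wslpath utility is the more robust way to translate paths.

```python
import subprocess
from pathlib import PureWindowsPath


def to_wsl_path(windows_path):
    r"""Translate an absolute drive path, e.g. C:\docs\a.pdf -> /mnt/c/docs/a.pdf.

    Assumes the default WSL automount root (/mnt); adjust if yours differs.
    """
    p = PureWindowsPath(windows_path)
    # Only plain drive-letter paths are handled; UNC and relative paths are rejected.
    if not p.is_absolute() or len(p.drive) != 2:
        raise ValueError(f"expected an absolute drive path, got {windows_path!r}")
    drive = p.drive[0].lower()
    return "/".join(["/mnt", drive, *p.parts[1:]])


def wsl_command(tool, *args):
    """Build the argument list for running a Poppler tool inside WSL."""
    return ["wsl", tool, *args]


def run_in_wsl(tool, *args):
    """Run the tool through wsl.exe; needs Windows with WSL and poppler-utils."""
    return subprocess.run(wsl_command(tool, *args), check=True, capture_output=True)


cmd = wsl_command(
    "pdftocairo", "-svg",
    to_wsl_path(r"C:\docs\report.pdf"),
    to_wsl_path(r"C:\docs\report.svg"),
)
print(cmd)
```

On a machine with WSL set up, run_in_wsl takes the same arguments as wsl_command and actually executes the tool.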
As it turns out though, your Poppler version will be limited to whatever your Ubuntu release ships. Ubuntu 18.04 uses Poppler 0.62. So in some ways our work on compiling for Windows can be better: it gives you the latest version.
If you need full support for Qt and other features missing from our MinGW-built Windows version, then WSL might be the best way to go. I’m guessing the WSL/Ubuntu version is 64-bit, for instance, whereas the binary above is an x86 build.
It would be nice if the Poppler team built an mmap-based IPC convention for processing PDF files. That way the process (either WSL or MinGW) could keep running, process PDFs as requests arrive, and return the output to the caller, like a server. It could be simpler still if it just ran as a basic web server.
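To make the web server idea a little more concrete, here is a minimal Python sketch of a wrapper you could write yourself today: a small HTTP server that accepts a PDF in a POST body, shells out to pdftotext, and returns the text. This is not something Poppler provides. The names make_handler, pdftotext_convert and serve, as well as the port, are my own hypothetical choices. It also still starts a new process per request, so it only illustrates the request/response shape, not the long-running process I’d really like.

```python
import os
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


def pdftotext_convert(pdf_bytes):
    """Write the PDF to a temp file, run pdftotext -layout, return the text bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = os.path.join(tmp, "in.pdf")
        with open(pdf_path, "wb") as f:
            f.write(pdf_bytes)
        # Passing "-" as the text file asks pdftotext to write to stdout.
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            check=True, capture_output=True,
        )
        return result.stdout


def make_handler(convert):
    """Build a handler class that answers POSTs using convert(bytes) -> bytes."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                payload, status = convert(body), 200
            except Exception as exc:  # report conversion failures to the caller
                payload, status = str(exc).encode("utf-8"), 500
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return Handler


def serve(port=8080):
    """Serve on localhost only; the port is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), make_handler(pdftotext_convert)).serve_forever()
```

Calling serve() starts the server. Because the converter is passed in as a plain function, the same handler could just as easily front the WSL build by swapping in a converter that runs “wsl pdftotext …”.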