Convert Markdown to PDF with Pandoc and LaTeX
I took a journey lately, trying to turn guides written in Markdown into PDF files, and I did it with Pandoc and LaTeX.
February 6, 2019
I took a job as a technical writer, back in April of 2018. The company is great. Refreshing. They are genuinely concerned with how their customers (in this case, students) are doing. And while I thought I was a geek, many of my coworkers have me beat six ways to Sunday. There are some processes going on behind the scenes that are a bit painful. Everybody is working at fixing them. This little story is about one such process that I was able to rectify for us.
We create what are called Study Guides. They’re downloadable PDF files that students can grab. Students study them for whatever certification the associated course is designed to teach about. The person teaching the course provides the technical writing team with a Markdown file. Then we turn it into the PDF for students. When I showed up, there was one coworker who was able to do that. She’d take the Markdown file, run it through some kind of process, then spit out a PDF.
I remember even during my second interview, in early April, hearing something about InDesign being a bit of a ruckus. My (now) coworker at the time was the only technical writer, so it was all on her to make these. After I’d been working for a couple months, I heard more and more about the process. Her voice in meetings was filled with anguish, as file after file gave her trouble. I didn’t have a full grasp of how it worked. Being an open-source proponent, I wanted absolutely nothing to do with proprietary software. I didn’t volunteer to help, because I didn’t want to touch Adobe-anything with a ten-foot fiberglass rod.
The Current Process
As I understood it, getting from Markdown to PDF was something like:
- Edit the Markdown file for actual content.
- Run it through some conversion process that turned it into a file InDesign could use.
- Use InDesign to turn it into a PDF.
The conversion process, I found out later, happened to also be Pandoc. It converted Markdown to the InDesign format, icml. If all went as planned, well and good. But the problem arose when InDesign choked on something. A page was missing because a code block (how we’d show terminal output: three backticks in Markdown, which equates to pre tags in HTML) was too long, or a table would make the trip wearing the wrong clothes. There were a few different things that made InDesign lose its lunch. If she had to reconvert (Markdown to icml), she’d lose any changes she’d made in InDesign. She had to start over with a fresh icml file. I remember that, just before I unveiled the “Oh, I think I got it” version of what I was working on, she had been trying to make one particular document work for over a week.
This is bad. This is a waste of time. It’s expensive, when you look at man-hours spent dorking with something that should just work. And what about paying monthly license fees, for software that’s designed to do what you’re trying to do, and doesn’t?
For someone used to writing Bash and PHP scripts to get things done, this was complete hogwash. Around June, I saw (remotely) my coworker get so backed up with these kinds of documents that I worried she might end up pulling out enough hair to be balder than me. There had to be open-source alternatives.
I discovered Pandoc.
What Is Pandoc?
Pandoc is an open-source converter. It takes the syntax from one format (Markdown, ODT, RTF, HTML, etc.) and converts it to another. But it leaves the actual content alone. So, if I type in a Markdown document:
## Blah blah
and want to turn it into HTML, Pandoc will leave “Blah blah” alone, but take the ## and turn it into <h2> tags. That’s it. And we were already using it to get from Markdown to an InDesign icml file.
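To see that for yourself, here is a minimal sketch that pipes the one-line heading through Pandoc. It assumes pandoc is installed and on your PATH; the variable names are just for illustration.

```shell
#!/usr/bin/env bash
# Sketch: convert a one-line Markdown heading to HTML with Pandoc.
# Assumes pandoc is installed; the variable names are illustrative only.
markdown='## Blah blah'

if command -v pandoc >/dev/null 2>&1; then
  # --from and --to name the input and output formats explicitly.
  html=$(printf '%s\n' "$markdown" | pandoc --from markdown --to html)
  printf '%s\n' "$html"
else
  echo "pandoc not found on PATH; install it to try this" >&2
fi
```

The result should be an h2 element wrapped around the untouched “Blah blah” text.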
The beauty of it, though, is that it can also talk LaTeX. That’s an insane open-source system designed for typesetting. And I realized that, by way of LaTeX, it can also produce PDF! I got to wondering: “Gee, it should be able to read Markdown and spit out PDF. Wonder if I can do it in a Bash script.” My line of thinking seemed plausible.
It was a long road. I “finished” in mid-December, but I got there. Now we create a Markdown file. Then with a few strange hacks and a LaTeX template, we spit out a PDF in seconds. Put that in your pipe and smoke it, Adobe, right?
How’d I Get There?
Funny you should ask. Like so many other people trying to figure out how to run an open-source app between some kind of sketchy docs and an IRC room, it’s a pretty crazy story.
I started out trying to figure out what exactly Pandoc was, and what LaTeX was. I used to work for a cleaning company in high school. As far as I was concerned, latex was what gloves are made out of. They were what I wore cleaning toilets. Once I got squared away with a working environment, I spent a couple months’ worth of late nights learning simple Pandoc commands. While my results were far from beautiful, those commands created PDF files from Markdown files.
When I mentioned what I wanted to do to my coworkers, lo and behold someone was already doing something similar. A fellow piped up about what HE was already doing. He’d been using Pandoc and a LaTeX template he found on GitHub to make syllabuses for his courses. He found one that was close enough for what he needed, and created downloadable PDFs for his students to grab. He pointed me at the really slick-looking LaTeX template. I ran some of the Markdown files we technical writers used through it. The team had grown by two more at that point, by the way. I was pretty happy with the results. But customizing it was a bit of a bear. I had no idea how lines in the template were affecting the final PDF.
I knew I had to start from scratch.
Creating a Simple LaTeX Template
This proved a bit difficult. Doing something weird in LaTeX requires using a package. If I want to play with how my headings look (H1, H2, etc.), I’ve got to call a package that deals with that stuff. In my case, that’s titlesec. Calling a package reminds me of including a file in PHP. There, I’ll have an index.php file. This has a page’s content. Then I’ve got a header. That includes the top part of the page, like the HTML head, the nav menu, and so forth. Rather than write all that info in each PHP file, I just create something like a header.php file. It will include all of the header, nav, and whatever else I want at the beginning of an HTML document. Then I can include it from whatever page I want to have the information in.
LaTeX packages are like that, vaguely.
The trick is figuring out which packages do what, and then how to call them in an order that won’t cause a conversion to go belly up.
A default Pandoc Markdown to PDF conversion has something ridiculous like 3″ margins by default. So I started with page margins and went from there.
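To give a feel for what “starting from scratch” can look like, here is a rough sketch of a bare-bones template. It is not the template we ended up with. The file name minimal-template.tex and every setting in it (1-inch margins, sans-serif headings) are assumptions for illustration. It should cope with headings, paragraphs, lists, and code blocks; tables and images need more packages (longtable, booktabs, and graphicx in Pandoc’s own default template), and I haven’t tried it against every Pandoc version.

```shell
#!/usr/bin/env bash
# Sketch: write a bare-bones Pandoc LaTeX template to a made-up file name.
# The quoted 'EOF' keeps the shell from touching $body$ and the backslashes.
cat > minimal-template.tex <<'EOF'
\documentclass[11pt]{article}

% Page margins: the first thing worth changing from the defaults.
\usepackage[margin=1in]{geometry}

% Heading appearance: bold, sans-serif, unnumbered.
\usepackage{titlesec}
\titleformat*{\section}{\Large\bfseries\sffamily}
\titleformat*{\subsection}{\large\bfseries\sffamily}
\setcounter{secnumdepth}{0}

% Things Pandoc's LaTeX output expects to find.
\usepackage{hyperref}
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
$if(highlighting-macros)$
$highlighting-macros$
$endif$

\begin{document}
$body$
\end{document}
EOF

echo "Wrote minimal-template.tex"
```

The $body$ line is where Pandoc drops the converted Markdown. Everything above it is the “calling packages in the right order” part.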
Fine Tuning the Process
One of the things that’s nice about this process is that testing was a one-step procedure. Unlike using InDesign, I only need to do the one Pandoc step. Essentially, it’s:
pandoc -s --template=MY_TEMPLATE_FILE filename.md --pdf-engine=xelatex -o filename.pdf
If I get it right, bully for me. If I get it wrong, I can just mess with either the Markdown file or the LaTeX template. Then I just run the command again. I don’t lose anything, like my coworker with the InDesign procedure. Did I say “Put that in your pipe and smoke it, Adobe,” earlier? I ended up running it a lot, and even wrote a Bash script to save me some work. It essentially took the same .md file and overwrote the same PDF file each time.
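I don’t have that script handy, but a wrapper along these lines would do the same job. Treat it as a sketch: the function names, the TEMPLATE variable, and its default of template.tex are all made up for this example, and it assumes pandoc and xelatex are installed.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the pandoc command above.
# TEMPLATE and its default file name are hypothetical; point it at your own.
TEMPLATE="${TEMPLATE:-template.tex}"

# Derive the output name from the input name: guide.md -> guide.pdf
pdf_name_for() {
  printf '%s\n' "${1%.md}.pdf"
}

# Convert one Markdown file, overwriting the PDF from the previous run.
build_pdf() {
  local src="$1"
  local out
  out="$(pdf_name_for "$src")"
  pandoc -s --template="$TEMPLATE" "$src" --pdf-engine=xelatex -o "$out" \
    && echo "Wrote $out"
}

# Only run when a Markdown file is passed, e.g. ./build.sh guide.md
if [ "$#" -gt 0 ]; then
  build_pdf "$1"
fi
```

Saved as something like build.sh, it turns each test run into one short command, and the PDF always lands next to the Markdown file it came from.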
At some point, I’d figured out which packages to include. And I tried to account for all of the Markdown situations I thought we’d ever run into as technical writers. Then I showed the rest of the crew. We had a new coworker who had no idea any of this was going on, and a couple who had only a vague notion of what was happening. I shared my screen in a Slack meeting, and ran the Pandoc command that converted a document successfully. The woman who had been fighting with InDesign cried. It was awesome. We made it.
I’m Almost Done
The way it sits now, we have a pretty spiffy LaTeX template. And we’re creating some schnazzy looking PDFs in just a few seconds. I’ve dorked with the template a bit as I’ve figured out better ways to do things. I’ve got another coworker who’s quite a bit more Mac savvy than me. He’s figured out how to do in macOS what I do in Linux (pretty much just installing packages). He also found a way to display tables way better than I’d done. We’ve managed to get our stuff up on GitHub now. When anyone figures out a better tweak to the LaTeX template, we can all just pull it down to our individual computers and bang away at it like nobody’s business.
What’s It All Mean?
What’s it all mean? It means that we can now produce the PDF documents almost immediately. When someone gives us their Markdown file, we go over it, and BAM. Instant PDF. There’s no bottleneck at the one person with an InDesign license. Did I say “Put that in your pipe and smoke it, Adobe” earlier? And we don’t have to mess around with losing any kind of edits in the process.
Now, we grab the Markdown file from a course author. Then we edit it for grammar. We apply a bit of weirdness (like two spaces and a newline after headings to actually make a new line). Finally, we run the pandoc command and make the PDF. That’s it. We make what students need quickly. And they can get it just as fast, to use as a study aid. For the first time in a while, maybe ever (dunno even who to ask about that on my end), study guides are available before a course even launches.
Am I done, though? Well, not really. I’m pretty happy with the results so far. I spent a weekend on headings and I’d like to tone things down in the Blocks of Code Department. Vertical space is too big. But as it sits, the PDFs we’re making now are pretty user friendly. What’s even better, I’d like to give the template away, on GitHub. I’m hoping to save another person six months of screaming at their monitor. Grab the template, as well as a README (both Markdown and finished PDF) on GitHub.