Regex to grab your citations (provided you’re sensible*)

* and by sensible, I mean, not mucking about with footnotes. Can’t abide footnotes. But I digress.

Earlier today, @archaeo_girl asked,

And Sasha Cuerda came up with this:

This is awesome.

Note, however, that the pattern Sasha shares will not work if, in the middle of a parenthesis, you have a citation like this:

(Doe, 2016; Smith 2010, 2012; Graham, 2008)

See the problem? It’s that pesky comma between 2010 and 2012 for Smith. My regex-fu is not strong, so one shortcut (assuming you’re working on a *copy* of your text in a text editor) is to find all the commas and replace them with semicolons. Then Sasha’s pattern will work. After all, you’re just after the citations.
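If you’d rather not do that find-and-replace by hand, here is a minimal Python sketch of the same shortcut. One assumption of mine: it only swaps commas that sit inside parentheses, so the rest of your copy is left alone. The sample sentence and the helper name `commas_to_semicolons` are made up for illustration, and this is not Sasha’s pattern, just the clean-up step you would run before it.

```python
import re

# Made-up sample sentence containing the troublesome citation from above.
text = "Some claim (Doe, 2016; Smith 2010, 2012; Graham, 2008), and so on."


def commas_to_semicolons(match):
    """Swap every comma inside one parenthetical for a semicolon."""
    return match.group(0).replace(",", ";")


# \([^()]*\) matches one un-nested parenthetical at a time, so commas
# outside parentheses are not touched.
normalized = re.sub(r"\([^()]*\)", commas_to_semicolons, text)
print(normalized)
```

Nested parentheses are not handled here, which is another reason to work on a copy.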

Be sure to click on the ‘replace’ button in Sasha’s regexr to see how you could extract the citations. The replace pattern puts a # at the start of a line with a citation, and makes sure that only the citation is on that line. You could then search for all lines NOT starting with a # and delete them. Hey presto, all your citations in a handy list! (Speaking of lists, I missed the ‘list’ tool at the bottom, which has the relevant regex pattern to replace the text directly with a list. Cool beans!)
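The “delete every line that doesn’t start with a #” step can also be done in a few lines of Python. This is only a sketch: the `processed` text below is invented to stand in for whatever the replace step gives you, and your real output may look a little different.

```python
# Invented stand-in for text that has already been through the replace step:
# citation lines start with '#', everything else is leftover prose.
processed = """#(Doe, 2016)
some leftover prose
#(Graham, 2008)
more leftover prose"""

# Keep only the lines that start with '#', then drop the marker itself.
citations = [
    line[1:].strip()
    for line in processed.splitlines()
    if line.startswith("#")
]

for citation in citations:
    print(citation)
```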

Other handy regexes:

If the text is like this:

Shawn Graham (2008) writes in Electric Archaeology…

`[A-Z]\w+ \(\d{4}\)`

will find `Graham (2008)`.
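To try that pattern outside a text editor, a small Python sketch using the standard `re` module might look like this. The variable names are mine; the pattern and the sample sentence are the ones above.

```python
import re

text = "Shawn Graham (2008) writes in Electric Archaeology…"

# A capitalised word, one space, then a four-digit year in parentheses.
author_year = re.compile(r"[A-Z]\w+ \(\d{4}\)")

matches = author_year.findall(text)
print(matches)  # ['Graham (2008)']
```

Note that “Shawn” is skipped: the pattern only picks up the single capitalised word that sits right before the parenthesis.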

If the text is like this:

According to Graham (2008:45), “Smurfs are the problem…”

`[A-Z]\w+ +\(\d{4}:\d+\)`

will find `Graham (2008:45)`.
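Here is the same kind of Python sketch for the year-and-page form. It also shows that the plain year pattern misses this style, and it tries a combined pattern with an optional `:page` part. That combined pattern is my own variation rather than anything from the original regexr, so treat it as an assumption to test against your own text.

```python
import re

text = "According to Graham (2008:45), “Smurfs are the problem…”"

# Year plus page number, as given above.
author_year_page = re.compile(r"[A-Z]\w+ +\(\d{4}:\d+\)")
with_page = author_year_page.findall(text)
print(with_page)  # ['Graham (2008:45)']

# The year-only pattern finds nothing here, because ':' follows the year.
year_only = re.findall(r"[A-Z]\w+ \(\d{4}\)", text)
print(year_only)  # []

# Assumed variation: make the ':page' part optional so one pattern covers both.
either_form = re.compile(r"[A-Z]\w+ +\(\d{4}(?::\d+)?\)")
both = either_form.findall("Graham (2008) and later Graham (2008:45)")
print(both)  # ['Graham (2008)', 'Graham (2008:45)']
```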


3 thoughts on “Regex to grab your citations (provided you’re sensible*)”

    1. Oh cool!

      Is there an easy way to update the regex to account for spaces in last names, like Mintz and Du Bois 2002 from your example?
