Using GREP in InDesign to find prime numbers

Prime numbers are those that are divisible only by themselves and by the number one. Now I dislike arithmetic, and my heart sinks whenever I hear the word ‘divisible’, because it suggests boring activities such as counting or doing “long division sums”.

But I enjoy working with text, and trying out clever things with “find” and “change to” in applications such as InDesign. So it was a real pleasure to learn recently that GREP can be used to find prime numbers. (GREP is InDesign’s implementation of “regular expressions” for matching text.) Grasping how it can do that also helps to throw light on the concept of a prime number. And it does so in an intuitive and simple way that does not involve doing arithmetic.

Imagine an old-fashioned pavement made out of a fixed number of rectangular paving slabs laid side by side. Imagine a child walking from one end of the pavement to the other. By “avoiding the cracks” between them, the child can always reach the last slab, whatever their number, by simply stepping from one slab to the next. But by jumping over alternate slabs (i.e. every second slab), the child might not be able to land on the last slab – it depends whether there is an even number of them. Likewise, by taking big leaps of three slabs at a time, our child will only be able to land on the last slab if their total number is a multiple of three. And so on.

I hope you can see how this might continue into larger and larger numbers. So imagine this going on, with our child taking larger and larger steps, possibly with the help of a pair of stilts. For example, if the pavement is 15 slabs in length, the fifteenth slab can be reached by taking five big leaps of three slabs each, or three even longer stilted strides of five slabs each.

Now here’s the really important thing about prime numbers: the last slab of a prime number of such slabs can only be reached by taking a single step over all of the slabs that precede it.

GREP can be used to find prime numbers, because a simple GREP expression can match non-prime numbers. It manages to do that by mimicking the behavior of a child stepping over multiple paving slabs as just described.

Let’s build up a GREP expression slowly to see how this works. By analogy with reaching all the way to the final slap of a pavement, we want our GREP expression to match an entire series of letters. Let’s choose any letter at random, such as capital M.

We should start off with the simplest of GREP expressions (for clarity, they have this dark red color): the single letter M will match any single instance of the letter M. If I have a long series of Ms (like this: MMMMMMMMM) the GREP expression M+ will match the whole series at once. That’s because the plus sign asks it to match one or more Ms, and by default GREP is “greedy” – it will match as much as it can, in this case the whole series. We can change that default behavior by adding a question mark. In isolation, M+? will match the same as M on its own.

What we want to build is a GREP expression that will mimic the behavior of a child skipping over whole paving slabs (plural) rather than one-by-one by simply stepping over the cracks between them. The expression MM+? works for that purpose, because it will match two or more instances of the letter M, at the same time as matching as little as possible thanks to the ? at its end. This gets really useful when combined with parentheses to make a “unit” (MM+?), and \1 to match whatever that unit matches (the number one is used here because it’s the “first unit” in the entire expression).

Bearing in mind what I have just said about the default “greediness” of GREP and the way it can be overridden with a question mark ?, consider the following expression:

(MM+?)\1+

This expression is nearly what we’re looking for, as it matches as much as can be matched by repeatedly re-using its smallest constituent parts, where the parts in question are anything bigger than single letters. To illustrate, consider this series of six Ms: MMMMMM.

The MM+? in parentheses matches the first two Ms in MMMMMM. It won’t match all of them because the ? tells it to match as little as possible, and it won’t match just one, because it must match at least two. So now \1 matches a pair of Ms. So \1+ matches as many pairs of Ms as it can, to try and match all six Ms. As it happens, just two further pairs are needed.

This is analogous to a child reaching the final slab of a pavement of six slabs by jumping over the first two slabs in one go, then repeating the same feat twice. Reaching the last of any even number of slabs involves the same procedure, repeating the initial jump as many times as may be necessary.

But now suppose we use the same expression to try to match nine Ms. Just repeating matching pairs won’t work this time, because nine isn’t a multiple of two. This is where GREP does something clever. It “backtracks” as soon as it has to give up on its initial attempt to match the whole series by repeating a matching pair. Next, it tries matching MM+? to three Ms instead of two. This is what it must do, if you think about it, since it is trying to match as little as possible with the part of the expression in parentheses, yet as much as possible with the entire expression. The default “greediness” of GREP remains the “prime directive”, and it might be able to match more by trying repeated triples rather than repeated pairs of matched letters. And in the case of nine letters, it turns out happily again, with \1+ matching two further triples.

This is analogous to a child reaching a ninth slab by leaping over three in one go at first, then repeating it two more times.

I hope it’s obvious how this continues. GREP will keep trying out larger and larger initial matches as long as it fails to match the entire series by repeating its initial match. With non-prime numbers of letters, it will eventually succeed. But with prime numbers, it will never arrive at an initial match whose repetition succeeds in matching the entire series. So prime numbers are those that GREP can’t match when searching in series of the same character (such as the letter M).

There are couple of loose ends to tie up. GREP needs to recognize the start and the end of such a series. We might tell it only to look within entire paragraphs, in which case we should put ^ at the start of the expression and $ at the end (this is a standard GREP convention). Or we might use spaces between series to mark them off from each other, and look for any character except spaces instead of the letter M. Using standard GREP code for “positive lookbehind” (?<= ), “positive lookahead” (?= ), and “anything but” [^ ] set to spaces, it ends up like this:

(?<= )([^ ][^ ]+?)\1+(?= )

The expression (?<= )([^ ][^ ]+?)\1+(?= ) matches any series of the same character whose length is non-prime, but it won’t match any whose length is prime. It’s a straightforward matter to apply this in InDesign or any equivalent application to find prime numbers. (Interested readers familiar with InDesign can download a simple Javascript that demonstrates the basic idea here.)

I have tested several scripts for generating and testing quite large prime numbers, and GREP works remarkably efficiently when put to this unintended purpose. In doing so, I have acquired a more intuitive grasp of what prime numbers are, and why they are part of nature. For example, 13-year cicadas and 17-year cicadas only have to compete against each other every 13 × 17 = 221 years, when they emerge in the same year. It is no accident that evolution stumbles upon prime numbers in this sort of situation.

I can see why we might we might call primes the “building blocks” of the counting numbers. Best of all, I haven’t had to do any arithmetic! Hate arithmetic!

One thought on “Using GREP in InDesign to find prime numbers

  1. Pingback: This Week in InDesign Articles, Number 107 | InDesignSecrets

Leave a Reply

Your email address will not be published. Required fields are marked *