How to Find Duplicate Words

When I start contributing to a new open source project, I sometimes search the codebase for occurrences of "the the".

This doubled word is a common typo in the comments of English-language codebases.

My friend Miroslav came up with an even better way:

Use a regex to find duplicate words!

rg --pcre2 "\b(\w+)\s+\1\b"

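rg is ripgrep, a very fast command-line tool written in Rust that recursively searches files for a regex. The --pcre2 flag is needed here because ripgrep's default regex engine does not support backreferences such as \1.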
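To see what the pattern does piece by piece, here is a minimal sketch that applies the same regex with Python's re module instead of rg. The function name find_duplicate_words and the sample sentence are made up for illustration, and Python's engine may differ from PCRE2 on edge cases such as non-ASCII text.

```python
import re

# Same pattern as the rg command above (Python's re, not PCRE2):
#   \b      word boundary, so the match starts at the beginning of a word
#   (\w+)   capture one word
#   \s+     one or more whitespace characters
#   \1      the captured word again
#   \b      word boundary, so "the theme" is not reported
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_duplicate_words(text):
    """Return each word that appears twice in a row in text (hypothetical helper)."""
    return [match.group(1) for match in DUPLICATE_WORD.finditer(text)]


sample = "Open the the file and and read it. This is is not theme."
print(find_duplicate_words(sample))  # ['the', 'and', 'is']
```

The two \b anchors matter: without the first one, "This is" would match on the trailing "is", and without the second one, "the theme" would match on its leading "the".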

While trying to understand the above regex, I found an interesting Stack Overflow question that mentions an alternative regex, which also handles words with apostrophes, hyphens…

rg --pcre2 "(\b\S+\b)\s+\b\1\b"

The Stack Overflow question linked above also explains how the regex works.
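To compare the two patterns, the following sketch runs both over the same sentence using Python's re module rather than rg. The names WORD_CHARS, NON_SPACE, and duplicates are hypothetical, the sample sentence is invented, and results under PCRE2 may differ in edge cases.

```python
import re

# The two patterns from the rg commands, compiled with Python's re module.
WORD_CHARS = re.compile(r"\b(\w+)\s+\1\b")      # words made of \w characters only
NON_SPACE = re.compile(r"(\b\S+\b)\s+\b\1\b")   # any run of non-whitespace characters


def duplicates(pattern, text):
    """Return the repeated token captured by each match (hypothetical helper)."""
    return [match.group(1) for match in pattern.finditer(text)]


sample = "It's a well-known well-known fact that we don't don't check the the docs."
print(duplicates(WORD_CHARS, sample))  # ['the']
print(duplicates(NON_SPACE, sample))   # ['well-known', "don't", 'the']
```

The first pattern misses "well-known well-known" and "don't don't" because \w+ stops at the hyphen and the apostrophe, while \S+ in the second pattern spans them.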