Using Regular Expressions To Find Words Used Before A Date

If you’re involved with any website that references rules, laws and regulations that need to be updated on a yearly basis, finding them all can be a real pain.

In an ideal world, you’ve used the same format on every page, which means that you can do a global search and replace.

However, those ideal worlds are few and far between, which means that the references can appear in many different formats.

Our client GoSimpleTax, who produce self assessment tax return software, publish a wide variety of blog content on tax issues for the self-employed.

They asked us if there was a quick way to find all the references to the years 2020, 2021 and 2022 on their pages.

The Quick Option Isn’t The Most Useful

The quick option would be to use the search function in your favourite crawling tool to find any page that contains those dates.

However, your pages may well include:

  • Copyright notices
  • Publication dates

This means that you end up with lots of repetitive information that you may or may not want to change.

Use Regex to Capture the Words in Front of the Date

The following regular expression allows you to capture the words that precede the date:

(?:\w+\s+){2,5}(202[0-3])

It breaks down as follows:

(?:\w+\s+){2,5} is a non-capturing group that matches a sequence of 2 to 5 words, each followed by one or more whitespace characters.

(202[0-3]) is the capture group, which matches any year from 2020 to 2023.

It would match:

  • mileage rates 2021
  • uk mileage rates 2022

but not the following, because only one word precedes the year:

  • rates 2023
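If you want to check the expression before running a crawl, a minimal sketch in Python is shown here. It assumes Python's built-in re module, whose handling of \w and \s may differ slightly from the regex engine in your crawling tool, and the sample strings are simply the examples listed above.

```python
import re

# The pattern from the article: 2 to 5 words, then a year from 2020 to 2023.
PATTERN = re.compile(r"(?:\w+\s+){2,5}(202[0-3])")

samples = ["mileage rates 2021", "uk mileage rates 2022", "rates 2023"]

for text in samples:
    match = PATTERN.search(text)
    if match:
        # group(0) is the whole matched phrase, group(1) is just the year.
        print(f"{text!r}: phrase={match.group(0)!r}, year={match.group(1)}")
    else:
        print(f"{text!r}: no match")
```

The first two samples match in full, while the third does not because only one word comes before the year.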

How to Use in Screaming Frog

  1. Go to Configuration > Custom Extraction
  2. Press the Add Button
  3. Complete as follows (screenshot: Screaming Frog custom extraction)
  4. Press OK to Save
  5. Run a crawl as normal
  6. The results are then available in the Custom Extraction Tab

How to Use in SiteBulb

  1. Select Content Extraction (when setting up an audit)
  2. Add New Rule
  3. On the Rule tab you need to add both a CSS selector and the regex
    1. If you want to search the whole page you need to find the selector that covers the whole page.
  4. On the Data tab, check both All Matched Items and Text so that you get every match
  5. Press Add Rule
  6. Run the crawl
  7. The results are then available in the Content Extraction part of the audit
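For a rough idea of what a whole-page extraction returns, the following sketch applies the same expression to the visible text of an HTML page. It is not how Screaming Frog or Sitebulb work internally; the TextExtractor class, the find_phrases function and the sample HTML are all hypothetical names and data made up for this illustration, using only Python's standard library.

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(r"(?:\w+\s+){2,5}(202[0-3])")


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def find_phrases(html):
    """Return every phrase of 2 to 5 words followed by a 2020-2023 year."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return [m.group(0) for m in PATTERN.finditer(text)]


# Made-up sample page, for illustration only.
sample_html = """
<html><body>
<h1>UK mileage rates 2022</h1>
<p>The self assessment deadline for 2021 has passed.</p>
<footer>&copy; 2023</footer>
</body></html>
"""

print(find_phrases(sample_html))
```

In this sample the heading and the paragraph each produce a phrase, while the bare copyright year in the footer is not picked up because no words come directly before it.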

Using it in Google Search Console

You can also use the regular expression within GSC.

It’s Not Just For Dates

You can adapt the regex to do the following:

  • Look for all the words that precede a certain word.
    • In this case, look for 1 to 3 words that precede the word tax
    • (?:\w+\s+){1,3}(tax)
  • Look for all the words that follow a certain word or phrase.
    • In this case, look for 1 to 3 words that follow tax return
    • tax return (?:\w+\s+){1,3}
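Both variations can be tried out with a short Python sketch like the one below. The sample sentence is made up for illustration, and Python's re module is assumed as the engine.

```python
import re

# 1 to 3 words, then the word "tax".
BEFORE_TAX = re.compile(r"(?:\w+\s+){1,3}(tax)")

# The phrase "tax return", then 1 to 3 words, each followed by whitespace.
AFTER_TAX_RETURN = re.compile(r"tax return (?:\w+\s+){1,3}")

# Made-up sample sentence, for illustration only.
text = "file your self assessment tax return online before the deadline"

before = BEFORE_TAX.search(text)
after = AFTER_TAX_RETURN.search(text)

print(repr(before.group(0)))
print(repr(after.group(0)))
```

Note that the second pattern, as written, needs whitespace after every following word, so a word sitting right at the end of the text or directly before punctuation is not counted, and the match keeps its trailing space. Neither pattern uses a word boundary, so tax would also match inside a longer word such as taxes.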