Using Regular Expressions To Find Words Used Before A Date
If you’re involved with any website that references rules, laws and regulations that need to be updated on a yearly basis, finding them all can be a real pain.
In an ideal world, you’ve used the same format on every page, which means that you can do a global search and replace.
However, those ideal worlds are few and far between, which means that the references can appear in many different formats.
Our client GoSimpleTax, who produce self assessment tax return software, publish a wide variety of blog content around different tax issues for the self employed.
They asked us if there was a quick way to find all the references to the years 2020, 2021 and 2022 on their pages.
- The Quick Option Isn’t The Most Useful
- Use Regex to Capture the Words in Front of the Date
- How to Use in Screaming Frog
- How to Use in SiteBulb
- Using it in Google Search Console
- It’s Not Just For Dates
The Quick Option Isn’t The Most Useful
The quick option would be to use the search function in your favourite crawling tool to find any page that contains those dates.
However, your pages may well also contain:
- Copyright notices
- Publication dates
This means that you end up with lots of repetitive information that you may or may not want to change.
Use Regex to Capture the Words in Front of the Date
The following regular expression allows you to capture the words that precede the date:
(?:\w+\s+){2,5}(202[0-3])
It breaks down as follows:
(?:\w+\s+){2,5} is a non-capturing group that matches a sequence of 2 to 5 words, each followed by one or more whitespace characters.
(202[0-3]) is the capture group. It matches any year from 2020 to 2023.
It would match:
- mileage rates 2021
- uk mileage rates 2022
but not the following, because only one word precedes the year:
- rates 2023
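To check the pattern before putting it into a crawler, it can be tried against the three examples above. This is a minimal sketch using Python's built-in re module; the helper name phrase_before_year is made up for illustration.

```python
import re

# The pattern from the article: 2 to 5 words, then a year from 2020 to 2023.
DATE_CONTEXT = re.compile(r"(?:\w+\s+){2,5}(202[0-3])")


def phrase_before_year(text):
    """Return (whole matched phrase, year) for the first match, or None."""
    match = DATE_CONTEXT.search(text)
    if match is None:
        return None
    # group(0) is the whole match (words plus year); group(1) is the year only.
    return match.group(0), match.group(1)


for sample in ["mileage rates 2021", "uk mileage rates 2022", "rates 2023"]:
    print(sample, "->", phrase_before_year(sample))
```

Note that re.findall would return only the year here, because the year is the only capture group. Reading group(0) is what gives you the surrounding words.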
How to Use in Screaming Frog
- Go to Configuration > Custom Extraction
- Press the Add Button
- Complete as follows
- Press OK to Save
- Run a crawl as normal
- The results are then available in the Custom Extraction Tab
How to Use in SiteBulb
- Select Content Extraction (when setting up an audit)
- Add New Rule
- On the Rule tab, add both a CSS selector and the regex
- On the Data tab, check both All Matched Items and Text so that you get all the options
- Press Add Rule
- Run the crawl
- The results are then available in the Content Extraction part of the audit
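If you want to sanity-check the pattern against a page outside of a crawler, the idea can be roughly sketched in Python. This is only an illustration under stated assumptions, not how Screaming Frog or Sitebulb work internally: the class TextCollector, the function extract_phrases and the sample HTML are all invented for the example, and each text node is searched separately so that a match cannot run across two elements.

```python
import re
from html.parser import HTMLParser

DATE_CONTEXT = re.compile(r"(?:\w+\s+){2,5}(202[0-3])")


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags. Script and style content also lands
        # here; a real audit would want to skip those elements.
        self.chunks.append(data)


def extract_phrases(html):
    """Return every 'words + year' phrase, searching each text node alone."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    phrases = []
    for chunk in parser.chunks:
        phrases.extend(m.group(0) for m in DATE_CONTEXT.finditer(chunk))
    return phrases


# Invented sample page: one heading worth reviewing, one copyright notice.
sample_html = (
    "<html><body><h1>UK mileage rates 2022</h1>"
    "<p>Copyright 2022</p></body></html>"
)
print(extract_phrases(sample_html))
```

In this sample the heading is returned and the short copyright line is not, because only one word precedes its year. Longer copyright lines could still match, so treat the output as a list to review.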
Using it in Google Search Console
You can also use the regular expression within Google Search Console (GSC).
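Before pasting the expression into a regex filter, it can help to see which search queries it would keep. The sketch below does this in Python with an invented list of queries. It assumes the filter is applied to query text, and Python's re module is not the same engine as the one GSC uses, so treat it as an approximation.

```python
import re

DATE_CONTEXT = re.compile(r"(?:\w+\s+){2,5}(202[0-3])")

# Hypothetical search queries, invented for illustration only.
queries = [
    "uk mileage rates 2022",
    "self assessment deadline 2021",
    "tax return 2020",
    "gosimpletax login",
    "mileage 2023",
]

# Keep only the queries where at least two words come before the year.
matching = [q for q in queries if DATE_CONTEXT.search(q)]
print(matching)
```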
It’s Not Just For Dates
You can adapt the regex to do the following:
- Look for all the words that precede a certain word.
- In this case, look for 1 to 3 words that precede the word tax:
- (?:\w+\s+){1,3}(tax)
- Look for all the words that follow a certain word.
- In this case, look for 1 to 3 words that follow tax return:
- tax return (?:\w+\s+){1,3}
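Both adapted patterns can be tried on a sample sentence in the same way. This is a small sketch; the sentence and the variable names are invented for the example.

```python
import re

# 1 to 3 words that precede the word "tax".
BEFORE_TAX = re.compile(r"(?:\w+\s+){1,3}(tax)")
# 1 to 3 words that follow "tax return". Each word must be followed by
# whitespace, so a word at the very end of the text is not included.
AFTER_TAX_RETURN = re.compile(r"tax return (?:\w+\s+){1,3}")

text = "You should file your tax return before the deadline passes"

before = BEFORE_TAX.search(text).group(0)
# The second pattern ends on whitespace, so strip it for display.
after = AFTER_TAX_RETURN.search(text).group(0).strip()

print(before)  # should file your tax
print(after)   # tax return before the deadline
```

Because (tax) has no word boundary, the first pattern would also stop at words that merely start with tax, such as taxes. Adding \b after tax is one way to tighten it if that matters for your content.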