Don’t worry, I’m not going to get overly technical. The building blocks of segmentation algorithms aren’t terribly important, unless you happen to work for Google or Yahoo, that is. Basically, segmentation techniques range from analysis of a site’s code (the DOM) to techniques based on what you actually see on the rendered page, known as ‘vision-based’ techniques. There are also hybrid techniques that combine elements of both.
Whatever the technique, the main point is to establish the recurring, or ‘boilerplate’, elements of the site and, if necessary, disregard them on future crawls. By boilerplate elements I mean things like navigation options, footers and headers: things that don’t change and therefore don’t need to be crawled regularly by the search engines.
From there, once the search engine has looked at numerous pages on the website, it can identify more specific recurring elements. These include things like repeated phrases, perhaps a copyright notice that occurs below every photo.
As more pages are crawled, the engine would flag more and more elements as things to disregard, such as ‘home’ tabs at the end of articles or advertising blocks.
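If you are curious what "noticing recurring elements" might look like in practice, here is a minimal sketch in Python. It is not how any search engine actually does it; it simply assumes each page has already been split into text blocks, and it treats any block that shows up on most pages as boilerplate. The function names (`find_boilerplate`, `strip_boilerplate`), the 80% cut-off and the sample pages are all made up for illustration.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return text blocks that appear on at least min_share of the pages.

    `pages` is a list of pages; each page is a list of text blocks.
    The 0.8 cut-off is an arbitrary assumption, not a known engine setting.
    """
    counts = Counter()
    for blocks in pages:
        # Count each block once per page, however often it repeats there.
        counts.update(set(blocks))
    threshold = min_share * len(pages)
    return {block for block, seen in counts.items() if seen >= threshold}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks that were not flagged as boilerplate."""
    return [block for block in blocks if block not in boilerplate]


# Hypothetical site: three pages sharing a navigation bar and a copyright line.
pages = [
    ["Home | About | Contact", "Ten tips for better photos", "© Example Site"],
    ["Home | About | Contact", "How to choose a camera", "© Example Site"],
    ["Home | About | Contact", "Editing basics", "© Example Site"],
]

boilerplate = find_boilerplate(pages)
print(strip_boilerplate(pages[0], boilerplate))
```

The more pages you feed in, the more reliable the counts become, which mirrors the point above: repeated elements only stand out once the engine has seen enough of the site.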


