How To Boost Your Site's Crawlability And Indexability


    Crawlability and indexability are prerequisites for ranking your WordPress site in search engines. If Google cannot crawl your site, it cannot index your pages, and they will not appear in search engine results. If a page can be crawled but not indexed, it cannot rank for any keyword.

    The first step to making sure your WordPress site is crawlable and indexable by Google is to create a sitemap. A sitemap is a file that tells Google which pages exist on your site. You can create a sitemap manually or use a plugin like Yoast SEO to generate one automatically.

    Once you have a sitemap, the next step is to make sure all of your pages are being indexed by Google. You can do this by submitting your sitemap in Google Search Console. You can also use the URL Inspection tool (the successor to Fetch as Google) to test whether Google can crawl and index a specific page on your site.

    The last step to ensuring Google can crawl and index your WordPress site properly is to make sure there are no errors in your Robots.txt file. The robots.txt file tells Google which pages on your site it should not crawl. If there are any errors in your robots.txt file, Google may not be able to properly crawl and index your site.

    Making sure your WordPress site is crawlable and indexable by Google is vital to ranking well in search engines. By following the steps above, you give your pages the best chance of being properly indexed and shown in search engine results.

    The many technical ways to influence the crawling and indexing of your website can be confusing, especially when techniques are used together. In this article, I cover 6 technical SEO aspects used to influence search engine indexing.

    The 6 Technical SEO Aspects Are:

    1. Robots.txt
    2. Meta robots
    3. Canonical tag -> (canonical tag combined with noindex)
    4. Rel=”next”/”prev” tag -> (rel=”next”/”prev” tag combined with canonical tag)
    5. Hreflang tag -> (hreflang tag combined with canonical tag) / (hreflang tag combined with rel=”next”/”prev” tag)
    6. Google Search Console parameters

    The first 5 aspects are also put into a flowchart that you can find at the bottom of this article. You can follow this flowchart on a page-by-page basis to get the page SEO right.

    Beforehand, I want to give a little warning. Adjusting these technical SEO aspects can have major consequences. It is recommended that you handle this with caution.

    Robots.txt

    Robots.txt is a small text file that contains instructions for bots. Through this file, you can instruct bots for the whole domain not to crawl certain directories, pages, files or specific URLs. Often the internal search function of a website is excluded from crawling via robots.txt, because these URLs are not wanted in the results of a search engine: you (often) cannot optimize the internal search results page for every search query.

    Note that the instructions in the robots.txt are guidelines, not obligations: well-behaved bots follow them, but nothing forces a bot to do so. The robots.txt is always located at the root of the domain, directly after the domain extension in the URL. For example, https://www.domein.nl/robots.txt. In the case of a WordPress website, the robots.txt may look like this:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    Sitemap: https://www.domein.nl/sitemap_index.xml

    Robots.txt allows the user to give instructions for all bots or just for a particular bot: for example, only Googlebot or Bingbot. This is indicated by the “user-agent” and it looks like this:

    • User-agent: Googlebot → Hello Googlebot, welcome to my website. The following instructions are for you.
    • User-agent: Bingbot → Hello Bingbot, welcome to my website. The following instructions are for you.
    • User-agent: * → Hello all bots, welcome to my website. The following instructions are for you.

    The last user-agent with a ‘*’ provides instructions for all bots.
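
    To see how user-agent groups behave, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the /search/ path are assumptions made up for this illustration. Note that this parser only does simple prefix matching, so it is a rough approximation of how a real search engine bot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for all other bots.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search/

User-agent: *
Disallow: /wp-admin/
"""

SITE = "https://www.domein.nl"


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    robot_parser = RobotFileParser()
    robot_parser.parse(robots_txt.splitlines())
    return robot_parser


parser = build_parser(ROBOTS_TXT)

# Googlebot follows its own group, so /search/ is off limits for it.
print(parser.can_fetch("Googlebot", SITE + "/search/shoes"))
# Bingbot has no group of its own and falls back to the '*' group.
print(parser.can_fetch("Bingbot", SITE + "/search/shoes"))
print(parser.can_fetch("Bingbot", SITE + "/wp-admin/"))
```

    A bot that finds a group addressed to it only follows that group. In this sketch Googlebot is therefore not bound by the /wp-admin/ rule in the ‘*’ group, which is easy to overlook when you add a bot-specific group.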

    Exclude a single page

    If you do not want a page to be crawled, you can exclude it using the ‘disallow’ instruction. This is done as follows:

    User-agent: Googlebot
    Disallow: /Don-tCrawlThisPageExample
    
    Hello Googlebot, welcome to my website. You are not allowed to visit page /Don-tCrawlThisPageExample.

    Exclude a folder

    If you want to exclude an entire folder, use:

    User-agent: Googlebot
    Disallow: /Don-tCrawlThisFolderExample/
    
    Hello Googlebot, welcome to my website. Folder /Don-tCrawlThisFolderExample/ may not be visited.

    This also means that the pages inside that folder may not be visited. Thus:
    Hello Googlebot, welcome to my website. Page /Don-tCrawlThisFolderExample/Don-tCrawlThisPageInsideTheFolderExample may not be visited either.

    Allow access

    There is also the ‘allow’ instruction. With this you indicate that a page may be crawled. By default, a bot crawls everything, so adding pages with the allow instruction alone is not necessary. So why does the allow instruction exist? It could be that you don’t want a certain folder to be crawled by a bot, but that something within that folder may be crawled after all. You then get:

    User-agent: Googlebot
    Disallow: /Don-tCrawlThisFolderExample/
    Allow: /Don-tCrawlThisFolderExample/ButCrawlThisPageInsideOfIt
    
    Hello Googlebot, welcome to my website. You are not allowed to visit folder /Don-tCrawlThisFolderExample/, 
    but you may visit page "/ButCrawlThisPageInsideOfIt" within that folder.

    Block URLs based on characters

    In addition to blocking entire folders or pages, it is also possible to block URLs that contain certain characters. This is done with the wildcard character ‘*’, which matches any sequence of characters. If I add the following in the robots.txt, all URLs with a question mark in them will be blocked:

    User-agent: Googlebot
    Disallow: /*?
    
    Hello Googlebot, welcome to my website. You may not visit any URL that contains a question mark.

    Block files

    Finally, you can use the dollar sign ($) to exclude URLs with the same ending. If you have a folder with different types of files, of which you want to block only the pdf files then add this:

    User-agent: Googlebot
    Disallow: /*.pdf$
    
    Hello Googlebot, welcome to my website. You are not allowed to visit any URL ending in .pdf.
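
    The rules above can be tried out with a small matcher. The sketch below is my own simplified illustration in Python, not the code any search engine runs. It assumes the precedence Google documents: the longest matching pattern wins, and on a tie ‘allow’ wins. The function names and the test paths are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end.
    Everything else is matched literally, starting at the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether a path may be crawled.

    rules is a list of (directive, pattern) pairs, where directive is
    'allow' or 'disallow'. The longest matching pattern wins; on a tie,
    'allow' wins. A path that matches no rule is allowed.
    """
    allowed, best_length = True, -1
    for directive, pattern in rules:
        if not pattern:  # an empty pattern blocks nothing
            continue
        if not pattern_to_regex(pattern).match(path):
            continue
        longer = len(pattern) > best_length
        tie_for_allow = len(pattern) == best_length and directive == "allow"
        if longer or tie_for_allow:
            allowed, best_length = directive == "allow", len(pattern)
    return allowed


# The example rules from this section, combined into one group.
RULES = [
    ("disallow", "/*?"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/Don-tCrawlThisFolderExample/"),
    ("allow", "/Don-tCrawlThisFolderExample/ButCrawlThisPageInsideOfIt"),
]

for path in ("/menshoes", "/menshoes?ord=price", "/files/manual.pdf"):
    print(path, is_allowed(RULES, path))
```

    Because the allow pattern is longer than the disallow pattern for the folder, the page inside the folder stays crawlable while the rest of the folder is blocked.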

    Sitemap in Robots.txt


    The location of the sitemap can also be added in the Robots.txt. This can help to get certain pages within the website better indexed. The sitemap is an overview of all the pages of a website.
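
    If you want to check which sitemap a robots.txt announces, Python's standard urllib.robotparser can read the Sitemap line for you. This is a minimal sketch that assumes Python 3.8 or newer (where site_maps() is available) and reuses the example WordPress robots.txt from this article; the helper name sitemap_urls is my own.

```python
from urllib.robotparser import RobotFileParser

# The example WordPress robots.txt from this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.domein.nl/sitemap_index.xml
"""


def sitemap_urls(robots_txt):
    """Return the sitemap URLs listed in a robots.txt, or an empty list."""
    robot_parser = RobotFileParser()
    robot_parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return robot_parser.site_maps() or []


print(sitemap_urls(ROBOTS_TXT))
```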

    It is almost always a good idea to add a robots.txt, especially because you can refer to the sitemap in it, which can help get pages indexed. It also allows you to block pages or folders from search engines. If you don’t have a robots.txt, bots may visit anything on your website.

    How to set up the robots.txt depends entirely on the website. Some websites give bots all the space they need, while others restrict visits. It is always good to consider whether you want all pages of your website visited by bots. Keep in mind that blocking a page using robots.txt does not always mean that the page will stay out of the index. If there are many external links to that page, it can still be indexed, but the search engine will not know what is on that page.

    Meta robots


    Meta robots are instructions found in the source code of your web page. With these codes it is possible to give instructions to bots per page of your website. You place the codes in the <head> of your page.

    Instructions for most bots:

    <meta name="robots" content="noindex">
    
    Hello bot, welcome to this page. You are not allowed to index this page.

    And for a specific bot:

    <meta name="googlebot" content="noindex">
    
    Hello Googlebot, welcome to this page. You may not index this page and thus not display it in search results.

    As the code above shows, the first meta robots tag (noindex) tells all bots not to index the page. The second tag instructs Googlebot specifically. The noindex meta robots tag can also be used to avoid duplicate content: you can choose not to have a page indexed if it is too similar to another page.

    There are more meta robots you can use on a web page. I list commonly used meta robots below:

    Nofollow

    Code: <meta name="googlebot" content="nofollow">
    
    Hello Googlebot, welcome to this page. You may not follow the links listed on this page.

    Nosnippet

    Code: <meta name="googlebot" content="nosnippet">
    
    Hello Googlebot, welcome to this page. You may not use the information on this page in the snippet in search results.

    Noarchive

    Code: <meta name="googlebot" content="noarchive">
    
    Hello Googlebot, welcome to this page. In the search results, you may not show an 'in cache' option with this link.

    Unavailable_after

    Code: <meta name="googlebot" content="unavailable_after: [date]">
    
    Hello Googlebot, welcome to this page. After this date, this page may no longer be indexed.

    Noimageindex

    This prevents a bot from indexing images on a page.

    Code: <meta name="googlebot" content="noimageindex">
    
    Hello Googlebot, welcome to this page. The images on this page should not be indexed.

    None

    This is a shortened version of noindex and nofollow together.

    Code: <meta name="googlebot" content="none">
    
    Hello Googlebot, welcome to this page. This page should not be indexed and the links on this page should not be followed.

    These instructions can be used within the source code of a page. The nofollow instruction can also be used for links to other pages within your website or other websites. You then pass the nofollow instruction along to a single link.
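
    To check what a page actually tells a bot, you can read the meta robots tags from its source code. The sketch below uses Python's standard html.parser and is only an illustration: the class and function names are my own, and it assumes that the general "robots" tag and a bot-specific tag are combined, so a restriction in either one applies.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of all <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercase meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def directives_for(html, bot="googlebot"):
    """Directives that apply to a bot: the general ones plus its own."""
    parser = MetaRobotsParser()
    parser.feed(html)
    found = parser.directives
    return found.get("robots", set()) | found.get(bot, set())


def may_index(html, bot="googlebot"):
    """False when the page carries noindex (or none) for this bot."""
    return not ({"noindex", "none"} & directives_for(html, bot))


def may_follow(html, bot="googlebot"):
    """False when the page carries nofollow (or none) for this bot."""
    return not ({"nofollow", "none"} & directives_for(html, bot))


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(may_index(page), may_follow(page))
```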

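    If you want to make sure that a page will not be indexed by a search engine, place a noindex meta robot on the page and do not block that page in the robots.txt. A bot has to crawl the page to read the noindex instruction; if the page is excluded in the robots.txt, the bot never sees the instruction and the page can still end up in the index through external links.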

    Canonical tag


    The canonical tag is a way to avoid duplicate content and to indicate which page is the most important page. This is best explained through an example. Suppose you have a WooCommerce shoe shop and you are looking at all the shoes for men: https://www.woocommercemen.com/menshoes. You land on the web page with all the shoes for men. Often on such a page you can filter by color, size, brand and more attributes. It is also possible to sort the products by price and name. For example: https://www.woocommercemen.com/menshoes?ord=price. The default page has an H1 tag ‘men’s shoes’, possibly several H2 tags and an accompanying text (content).

    If a visitor wants a different product order, the URL changes and you actually have a second page. That second page has the same H1 and H2 tags and content. You have, as it were, the default page with all the men’s shoes and the same page, with products in different order, with the same content: duplicate content. By using a canonical tag, you can let the bot know what the original page is.

    Insert the following code:

    <link rel="canonical" href="https://www.woocommercemen.com/menshoes">

    Hello Googlebot, page https://www.woocommercemen.com/menshoes?ord=price contains the same content as https://www.woocommercemen.com/menshoes but https://www.woocommercemen.com/menshoes is the page to index.

    The same applies when a product falls within multiple categories. For example, this is true in the following case:

    https://www.woocommercemen.com/menshoesG (original)
    https://www.woocommercemen.com/sport/menshoesG
    https://www.woocommercemen.com/brand/menshoesG

    In the above example, three pages exist for the product men’s shoeG. In this case, it makes sense to label one of these pages as the original (the page with the most value, i.e., the most important page) and to give the other two pages a canonical tag that points to this original page:

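    https://www.woocommercemen.com/menshoesG (original)

    https://www.woocommercemen.com/sport/menshoesG
    <link rel="canonical" href="https://www.woocommercemen.com/menshoesG">

    https://www.woocommercemen.com/brand/menshoesG
    <link rel="canonical" href="https://www.woocommercemen.com/menshoesG">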

    If you don’t do this, then the search engine won’t know which is the best page for your website for this content and will choose which page to index on its own. To avoid this, designate a page yourself as original to be indexed. This way you keep more control over indexing. You place the canonical tag in the head of the page.
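
    As a rough sketch of how a shop could work out the canonical URL for a sorted page, the Python snippet below strips the parameters that only change the product order. The set NON_CANONICAL_PARAMS and both function names are assumptions for this illustration; which parameters count as duplicates depends on your own shop.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: 'ord' only changes the product order, not the content.
NON_CANONICAL_PARAMS = {"ord"}


def canonical_url(url):
    """Drop the parameters that only produce duplicate content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the tag to place in the <head> of the page at this URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_tag("https://www.woocommercemen.com/menshoes?ord=price"))
```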

    Combining the canonical tag with noindex

    The canonical tag thus proves to be a powerful means of excluding pages with duplicate content from indexation. The noindex and nofollow tags were discussed earlier. It is not wise to use both a canonical tag and a noindex meta robot on the same page, because you would be sending two conflicting signals. The canonical tag indicates that the pages are (almost) identical and that the original should be indexed on their behalf, while the noindex tag indicates that the page should not be in the index at all. So use either the noindex tag or the canonical tag.

    Rel=”next”/”prev” tag

    If a category has many products then that category can be split into multiple pages.

    You can specify the relationship between this paginated content in the source code of a web page, namely in the <head>. The first page is (almost) always the category page. For example: https://www.woocommercemen.com/menshoes.

    If I have 50 pairs of shoes in a category, with 12 shoes per page, then I have pages 1 through 5, which can look like this:

    https://www.woocommercemen.com/menshoes
    https://www.woocommercemen.com/menshoes/?page=2
    https://www.woocommercemen.com/menshoes/?page=3
    https://www.woocommercemen.com/menshoes/?page=4
    https://www.woocommercemen.com/menshoes/?page=5

    As above, if I have multiple pages that belong one after the other, I can reference the next page in the source code of /menshoes:
    https://www.woocommercemen.com/menshoes/?page=2

    I do this by using the following code on the page https://www.woocommercemen.com/menshoes:

    <link rel="next" href="https://www.woocommercemen.com/menshoes/?page=2">

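    When I am on page 2 I want to link to the previous and next page. In that case I add two pieces of code:

    <link rel="prev" href="https://www.woocommercemen.com/menshoes">
    <link rel="next" href="https://www.woocommercemen.com/menshoes/?page=3">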

    Note here that from page two I am referring back to the category page (/menshoes) and not to /menshoes/?page=1. If I were to do the latter then, without using a canonical, I would have duplicate content for the first page: namely /menshoes and /menshoes/?page=1.

    For page 3, I reference /?page=2 and /?page=4 and so on.

    The last page (in this example, page 5) only has a reference to the previous page:

    <link rel="prev" href="https://www.woocommercemen.com/menshoes/?page=4">

    It is important to be complete in implementing this tag. If you forget (a piece of) the code on one of the pages, the bot will not see the relationship between the pages, or will see it less clearly, and will try to work out the relationship itself. This can cause indexing problems.
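
    Because completeness matters here, it helps to generate these tags rather than write them by hand. The Python sketch below is a minimal illustration with made-up function names; it assumes the URL pattern of the example shop (page 1 is the category URL itself, later pages use /?page=N) and the numbers from the example above, 50 products with 12 per page.

```python
import math


def page_url(category_url, page):
    """Page 1 is the category URL itself, never ?page=1."""
    return category_url if page == 1 else f"{category_url}/?page={page}"


def pagination_tags(category_url, page, total_items, per_page):
    """Return the rel="prev"/"next" tags for one page of a category."""
    total_pages = max(1, math.ceil(total_items / per_page))
    if not 1 <= page <= total_pages:
        raise ValueError(f"page {page} is outside 1..{total_pages}")
    tags = []
    if page > 1:  # every page except the first points back
        tags.append(f'<link rel="prev" href="{page_url(category_url, page - 1)}">')
    if page < total_pages:  # every page except the last points forward
        tags.append(f'<link rel="next" href="{page_url(category_url, page + 1)}">')
    return tags


CATEGORY = "https://www.woocommercemen.com/menshoes"

for tag in pagination_tags(CATEGORY, 2, total_items=50, per_page=12):
    print(tag)
```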

    Combine rel=”next”/”prev” with canonical

    For category pages with pagination, you don’t want a canonical from one page to another page. The rel=”prev”/”next” code reflects the relationship between pages which prevents problems with duplicate content. Some CMSs automatically place a canonical tag on each page. What then often goes wrong is that pages 2 and beyond have a canonical tag pointing to the first page.

    If you want to use the canonical tag in conjunction with the rel=”next”/”prev” tag, the page’s canonical should point to itself. So the page https://www.woocommercemen.com/menshoes/?page=2 has a canonical to itself: <link rel="canonical" href="https://www.woocommercemen.com/menshoes/?page=2">.

    If page 2’s canonical refers to page 1, products and content from page 2 cannot be indexed. A canonical that points to the page it is placed on is called a self-referencing canonical.

    Hreflang tag


    The hreflang tag is used when a website has multiple language versions. With this tag it is possible to refer bots to versions of the website in another language. For example, if I have a Romanian version at https://www.woocommercemen.ro in addition to the English version, I can refer to this version in the source code. A search engine recognizes the hreflang tag and then serves the visitor the correct version of the website based on the visitor’s location and language settings.

    You do this by using the hreflang tag and it looks like this: rel=”alternate” hreflang=”x”. You place the code in the <head> of the page.

    When using the hreflang tag, please note the following. The English website must refer to the Romanian website and vice versa. So it is not sufficient to refer only from the English website to the Romanian version. In addition, when using this tag, you should also add a self-referencing part. A small example:

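    Source code English version
    https://www.woocommercemen.com

    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com">
    <link rel="alternate" hreflang="ro" href="https://www.woocommercemen.ro">

    Source code Romanian version
    https://www.woocommercemen.ro

    <link rel="alternate" hreflang="ro" href="https://www.woocommercemen.ro">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com">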

    The hreflang tag also allows you to set up your website for language regions. For example, Belgium has a French-speaking part and a Dutch-speaking part.

    So I can set up my website for French-speaking Belgium with: rel="alternate" hreflang="fr-be"

    and Dutch-speaking Belgium with: rel="alternate" hreflang="nl-be"

    Keep in mind that this must be set for every page on the website. So it is not enough to set a hreflang tag only on the homepage. So a Dutch category page refers to the Dutch and English category page, and vice versa. The Dutch product page refers to the Dutch and English product page, and vice versa.
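
    The two rules above (every page refers to itself, and every referenced page refers back) are easy to get wrong across many pages. The Python sketch below is a simple check of my own; the function name and the input format, a mapping of each page URL to the hreflang values it declares, are assumptions for this illustration.

```python
def hreflang_problems(annotations):
    """Check hreflang annotations for the two rules above.

    annotations maps each page URL to a dict of {hreflang value: target URL}
    as declared in the <head> of that page.
    """
    problems = []
    for page, alternates in annotations.items():
        # Rule 1: the page must list itself among its alternates.
        if page not in alternates.values():
            problems.append(f"{page}: missing self-reference")
        # Rule 2: every other page it points to must point back.
        for target in alternates.values():
            if target == page:
                continue
            if page not in annotations.get(target, {}).values():
                problems.append(f"{page}: {target} does not link back")
    return problems


EN = "https://www.woocommercemen.com"
RO = "https://www.woocommercemen.ro"

complete = {
    EN: {"en": EN, "ro": RO},
    RO: {"ro": RO, "en": EN},
}
one_way = {
    EN: {"en": EN, "ro": RO},
    RO: {"ro": RO},  # forgets to refer back to the English version
}

print(hreflang_problems(complete))
print(hreflang_problems(one_way))
```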

    Combining hreflang with canonical tag

    If you want to combine the hreflang tag with a canonical tag, the canonical tag must point to a page in the same language. If I am referring to the English version from my Dutch website with the hreflang tag, I want the canonical to point to the Dutch version. This is because of the different signals the two solutions send. As discussed, the canonical tag indicates a preference to have the most important page indexed and the least important pages not. The hreflang tag indicates which other versions of the website you also want in the search results. A canonical that points to another language version would therefore conflict with the hreflang tag.

    To make it complete, below are the examples of hreflang tag with canonical tag.

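    Source code Dutch version

    https://www.woocommercemen.nl

    <link rel="canonical" href="https://www.woocommercemen.nl">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com">

    Source code English version

    https://www.woocommercemen.com

    <link rel="canonical" href="https://www.woocommercemen.com">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl">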

    Combining hreflang with rel=”next”/”prev”

    If you are combining the hreflang tag with the rel=”next”/”prev” tag, then logically you need to consider the following things. Make sure you keep the rel=”next”/”prev” tag the same within a language version of the website. So you should not use the rel=”next”/”prev” tag with a .com web address in the Dutch version of the website. In addition, a Dutch page 2 should refer to the Dutch page 2 and the English page 2 through the hreflang tag.

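    Because examples often make this clearer, I have placed pieces of source code below.

    Source code Dutch version

    https://www.woocommercemen.nl/menshoes

    <link rel="canonical" href="https://www.woocommercemen.nl/menshoes">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl/menshoes">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com/menshoes">
    <link rel="next" href="https://www.woocommercemen.nl/menshoes/?page=2">

    Source code Dutch version (page 2)

    https://www.woocommercemen.nl/menshoes/?page=2

    <link rel="canonical" href="https://www.woocommercemen.nl/menshoes/?page=2">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl/menshoes/?page=2">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com/menshoes/?page=2">
    <link rel="prev" href="https://www.woocommercemen.nl/menshoes">
    <link rel="next" href="https://www.woocommercemen.nl/menshoes/?page=3">

    Source code English version

    https://www.woocommercemen.com/menshoes

    <link rel="canonical" href="https://www.woocommercemen.com/menshoes">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com/menshoes">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl/menshoes">
    <link rel="next" href="https://www.woocommercemen.com/menshoes/?page=2">

    Source code English version (page 2)

    https://www.woocommercemen.com/menshoes/?page=2

    <link rel="canonical" href="https://www.woocommercemen.com/menshoes/?page=2">
    <link rel="alternate" hreflang="en" href="https://www.woocommercemen.com/menshoes/?page=2">
    <link rel="alternate" hreflang="nl" href="https://www.woocommercemen.nl/menshoes/?page=2">
    <link rel="prev" href="https://www.woocommercemen.com/menshoes">
    <link rel="next" href="https://www.woocommercemen.com/menshoes/?page=3">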

    Google Search Console URL parameters


    With the Google Search Console parameters, it is possible to indicate the above points to Google. One or more of these technical aspects may not be customizable in your WordPress CMS. In that case, Google Search Console URL parameters offers a solution. If you log into Google Search Console, you can find the URL parameters option under the crawling tab.

    Within Google Search Console URL parameters you can add your own parameters that visitors can use to organize or filter content.

    After adding a parameter, you can choose from two options:

    1. The parameter does not affect the page content
    2. The page content is changed, rearranged or restricted

    For example, a parameter that does not affect content is session ID. If you do have a parameter that affects content, such as a sorting option or a filter, you can indicate in Google Search Console how this parameter affects content. The image below shows that the content can be affected by sort, restrict, specify, translate and paginate.

    Then you can specify what Google should do with the URLs that contain this parameter:

    1. Let Googlebot decide: if you are not sure what the parameter does or if the behavior is different on various parts of the website.
    2. Each URL: this way, each change to a parameter is seen as a separate URL. Use this option when you are sure that the content changes when changing the parameter.
    3. Only URLs with specified value: this option allows you to specify the value of a parameter to be crawled. If there is a parameter on the website that sorts products by price, you can specify that only the URLs that sort products by price from highest to lowest should be crawled. URLs that contain a price sorting option from low to high will then not be crawled.
    4. No URLs: this option allows you to exclude URLs with a parameter entirely. This can come in handy if you have multiple parameters in a URL in a row.

    Google Search Console URL parameters contains options that can also be handled by the previously mentioned means. For example, parameters that translate or paginate content correspond to the previously mentioned hreflang tag and rel=”next”/”prev” tag, respectively.

    If implementing a particular tag fails, then you can achieve the same through this tool. The thing is, however, that these rules only apply to the Google search engine, while tag implementations apply to (almost) all search engines.
