SEO, URL Rewriting: Don’t Forget About The Robots!

Friday, June 19th, 2009

Sections

  1. Introduction
  2. Examples of URL and Parameter Issues
  3. The Solution – Mod_Rewrite, Htaccess, and Apache
  4. Rewrite Directives and Rules
  5. Regular Expression Basic Cheat Sheet

Section I

Introduction

SEO is an exercise in creating quality, relevant content. SEO is also technical by nature. It is important, in any SEO effort, to focus on both quality content and quality code. Search engines use robots that spider, crawl, and follow links across the internet. When they arrive at your site, they scan the home page and “spider” each sub page linked from it. These robots are algorithmic and exist for only a few distinct objectives, one of which is to discover and crawl all of the pages contained within websites.

These robots, while quite technical, are primitive in the sense that they only do what they are told. This is important to know because if your website has multiple links or multiple URLs pointing to the same content, the robots can, and often do, get hung up on them. This is less of an issue today than in the past, but it remains an issue nonetheless. In this article, we’ll take a look at how and why this occurs, and present solutions to combat these issues.

Section II

Examples of URL and Parameter Issues

Let’s use the example of “printer friendly” pages. These pages are often linked to from a web page with content that the website owner thinks a user may deem print-worthy. Occasionally webmasters will simply add another parameter to the page URL to provide this functionality, such as “&print=yes”. To humans, this makes perfect sense, but robots see this as a whole new page on your website, and attempt to index it accordingly.

Beyond the issue of printer friendly pages is the issue of pages accessible by multiple URL combinations. Take, for example, the following URL:

http://www.domain.com/index.php?id=2&section=4&category=2&page=2

The parameters in the above example are “id,” “section,” “category,” and “page.” This particular page could also be referenced by re-arranging the parameters, like this:

http://www.domain.com/index.php?section=4&page=2&category=2&id=2

Because the two URLs above carry the same parameters with the same values, they load the exact same content. However, search engines treat them as two separate pages, because the query strings (the part of the URL following the “?”) are not identical. This is a large part of the cause of duplicate content issues on websites with dynamically loaded content.
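To make the difference concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of the Apache setup). It shows that the two URLs differ as plain strings, yet become identical once the query parameters are sorted. The helper name canonical_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL with its query parameters sorted by name (hypothetical helper)."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


first = "http://www.domain.com/index.php?id=2&section=4&category=2&page=2"
second = "http://www.domain.com/index.php?section=4&page=2&category=2&id=2"

# A naive string comparison, which is roughly how a robot sees the two URLs.
print(first == second)  # False

# After sorting the parameters, both URLs collapse into one.
print(canonical_url(first) == canonical_url(second))  # True
```

A robot has no way of knowing that parameter order is irrelevant to your application, which is why it is up to the site to present one consistent URL per page.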

Section III

The Solution – Mod_Rewrite, Htaccess, and Apache

So what do we do, now that we are aware of the issues? We need the dynamic nature of the website to remain intact, but the search engines are running in circles trying to discover all the pages on our website (and sometimes even ignoring some of them). For websites that run on Apache web servers, an extremely complex and exceptionally useful module called mod_rewrite has been developed.

What is mod_rewrite? It is described as “the Swiss Army knife of URL manipulation.” A brief definition from Apache is this: “This module uses a rule-based rewriting engine (based on a regular-expression parser) to rewrite requested URLs on the fly.” The rules are commonly written in “.htaccess” files placed in the site’s directories on the web server; they can also be placed in the main server configuration.

For our URL Rewriting purposes, we will only touch on a few of the directives available with the module: RewriteEngine, RewriteCond, and RewriteRule. The purpose of the directive RewriteEngine is simply to enable or disable the runtime rewriting engine. RewriteCond defines a condition under which rewriting will take place. And RewriteRule defines the rules for the rewriting engine. The real workhorse is RewriteRule. In the following section of this article, we’ll discuss how these directives are used and how to accomplish basic tasks related to search engine optimization.

Section IV

Rewrite Directives and Rules

By default, RewriteEngine is off, so the below definition is required to make use of any other directives. To initialize the rewriting engine, the code required in the .htaccess file is as follows:

RewriteEngine On

Now that we have initialized the rewriting engine, we can begin creating rules using the RewriteRule and RewriteCond directives. The RewriteCond directive allows the use of server variables such as HTTP_USER_AGENT, HTTP_REFERER, REMOTE_HOST, REMOTE_ADDR, REQUEST_METHOD, THE_REQUEST, REQUEST_URI, and many more.

The primary variables used are “THE_REQUEST” and “REQUEST_URI”. The former is the full HTTP request line sent by the browser to the server (e.g. “GET /index.html HTTP/1.1”). The latter is the resource requested in the HTTP request line (in the previous example, this would be “/index.html”).
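As a rough illustration of how the two variables relate, the following Python sketch splits a request line into its parts. The function name split_request_line is hypothetical, and the sketch assumes the behavior described in the current Apache documentation, where REQUEST_URI holds only the path component and the query string is exposed separately as QUERY_STRING.

```python
from urllib.parse import urlsplit


def split_request_line(the_request: str) -> dict:
    """Split an HTTP request line into the pieces mod_rewrite exposes (illustrative only)."""
    method, target, protocol = the_request.split(" ", 2)
    parts = urlsplit(target)
    return {
        "method": method,      # comparable to REQUEST_METHOD
        "path": parts.path,    # comparable to REQUEST_URI
        "query": parts.query,  # comparable to QUERY_STRING
        "protocol": protocol,
    }


info = split_request_line("GET /index.php?id=2&section=4 HTTP/1.1")
print(info["path"])
print(info["query"])
```

This is why a condition that needs to inspect parameters is typically written against THE_REQUEST or QUERY_STRING rather than REQUEST_URI.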

Both RewriteCond and RewriteRule employ the use of regular expressions. The syntax of regular expressions is very picky, but very precise. In the below examples, we will cover some examples of regular expressions and what specific characters indicate.

RewriteCond is commonly used to rectify the issue of canonical hostnames. The goal is to force the use of a particular hostname over other hostnames that can be used to reach the same site, for example, www.domain.com over domain.com. Below is an example to accomplish this:

RewriteCond %{HTTP_HOST}      ^domain\.com [NC]

RewriteRule (.*)$ http://www.domain.com/$1 [R=301,L]

Let’s go through the above example step by step. The first line states the condition that the HTTP_HOST begins with domain.com. The “\.” before “com” is interpreted as a literal period. This is called “escaping.” Special characters that need to be escaped in conditions include, but are not limited to, the following: “.”, “?”, “*”, and “$”. The “[NC]” portion specifies that the condition is not case sensitive, meaning that Domain.com or DOMAIN.COM would still match this condition.

The second line is the RewriteRule, which is only executed if the above condition is met. This line would read “store any number of any character to the end of the line and 301 redirect the request to http://www.domain.com/ with the stored value appended.” Let’s break it down piece by piece.

Anything contained within parentheses “()” is used to group text as well as for backreferences. The period in the regular expression means “any single character” and because it is followed by an asterisk, which means 0 or N (where N > 0) occurrences of the preceding text, this means any number of any character. The “$” is the end-of-line anchor.

The second segment of this rule is the destination. We are instructing that the request be permanently redirected with a 301 status code (R=301) to the URL http://www.domain.com/$1. In this case, the “$1” is actually a backreference for the text in the parentheses at the beginning of the rule. So we’re appending the value of “(.*)” to the end of the domain. The “L” at the end means that if this rule is executed, it is the last rule that needs to be checked against.
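The logic of the condition and rule can be imitated in a few lines of Python. This is only a sketch of the matching behavior, not of how Apache works internally; the function name canonical_redirect is hypothetical, and the path is assumed to arrive without its leading slash, as it does when rules are evaluated from a .htaccess file.

```python
import re
from typing import Optional

# Mirrors: RewriteCond %{HTTP_HOST} ^domain\.com [NC]
HOST_CONDITION = re.compile(r"^domain\.com", re.IGNORECASE)

# Mirrors the pattern of: RewriteRule (.*)$ http://www.domain.com/$1 [R=301,L]
PATH_PATTERN = re.compile(r"(.*)$")


def canonical_redirect(host: str, path: str) -> Optional[str]:
    """Return the 301 target, or None when the host condition does not match."""
    if not HOST_CONDITION.search(host):
        return None
    match = PATH_PATTERN.search(path)
    # group(1) plays the role of the $1 backreference.
    return "http://www.domain.com/" + match.group(1)


print(canonical_redirect("DOMAIN.com", "about/team.html"))
print(canonical_redirect("www.domain.com", "about/team.html"))
```

The first call produces a redirect target because the host begins with domain.com regardless of case; the second returns None because www.domain.com does not begin with domain.com, so the request is left alone.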

Next we’ll take a look at a simpler example. Say we have an old page that has been updated, but it now has a new URL. We don’t want to lose the traffic that goes there from the rankings that page has already acquired. This is where a 301 redirect comes in handy. We can use the RewriteRule directive to instruct the server to redirect all requests for the old URL to the new URL. By doing this, we’re also telling the search engines that the old page no longer exists, and that it has been permanently replaced with the destination URL. Here’s the RewriteRule to perform this action:

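RewriteRule ^oldpage\.html$ /newpage.html [R=301,L]

Here, we don’t have a condition to be met other than that the request is exactly oldpage.html: the period is escaped so that it matches only a literal period, and the “$” anchors the end of the request. If the pattern matches, the user (or search engine) will be redirected with a 301 status code to /newpage.html and no other rules will be executed. The leading slash on the destination points the redirect at the site root rather than at a path relative to the current directory.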
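It is worth seeing why escaping the period and anchoring the end of a pattern matter. The Python sketch below compares a loosely written pattern with a strict one; the request strings are made up for illustration.

```python
import re

LOOSE = re.compile(r"^oldpage.html")      # unescaped period, no end anchor
STRICT = re.compile(r"^oldpage\.html$")   # literal period, anchored at the end

requests = ["oldpage.html", "oldpageXhtml", "oldpage.html.bak"]

for request in requests:
    loose_hit = LOOSE.search(request) is not None
    strict_hit = STRICT.search(request) is not None
    print(f"{request}: loose={loose_hit} strict={strict_hit}")
```

The loose pattern matches all three requests, because an unescaped period stands for any character and nothing stops the match from ending early. The strict pattern matches only the first.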

Next we will examine a rule to rewrite URLs that have parameters. Consider this rule:

RewriteRule ^info/([a-z0-9_]*)/content([0-9]*)\.html$ info/index.php?category=$1&id=$2 [NC]

You have probably already figured out what some of this means from the previous examples, but let’s go through step by step once more. The rule requires that the request begins with “info/”, and we’re storing “any number of letters between a and z, digits between 0 and 9, and underscores” to the first backreference. Next, the request must have “/content” followed by “any number of digits between 0 and 9,” which is stored into the second backreference. If the request URL follows that strict layout, the request is rewritten to the destination with the two backreferences, $1 and $2. Again, the [NC] means that the rule is not case sensitive. This would rewrite the request “info/How_to_write/content12.html” to “info/index.php?category=How_to_write&id=12”. If we removed the “_” from the regular expression [a-z0-9_] (i.e. [a-z0-9]), this request would not match because it contains underscores.
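The same pattern can be tried out in Python before it goes anywhere near a server. This is a sketch under the assumption that Python’s regular expressions behave like Apache’s for this particular pattern; the function name rewrite is hypothetical.

```python
import re
from typing import Optional

# Same pattern as the rule, with the period before "html" escaped; [NC] becomes IGNORECASE.
RULE = re.compile(r"^info/([a-z0-9_]*)/content([0-9]*)\.html$", re.IGNORECASE)

# Variant without the underscore in the first character class.
RULE_NO_UNDERSCORE = re.compile(r"^info/([a-z0-9]*)/content([0-9]*)\.html$", re.IGNORECASE)


def rewrite(path: str, rule: re.Pattern = RULE) -> Optional[str]:
    """Return the rewritten path, or None if the request does not match the rule."""
    match = rule.search(path)
    if match is None:
        return None
    # group(1) and group(2) correspond to $1 and $2 in the RewriteRule.
    return f"info/index.php?category={match.group(1)}&id={match.group(2)}"


print(rewrite("info/How_to_write/content12.html"))
print(rewrite("info/How_to_write/content12.html", RULE_NO_UNDERSCORE))
```

The first call yields the rewritten path with both backreferences filled in; the second yields None, since the underscores in the category no longer fit the character class.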

That pretty much covers the basics of URL rewriting. Remember, rewrite rules should always be tested in a staging or sandbox area, because a mistake in the .htaccess file can render the website unusable and cause it to return a 500 Internal Server Error to all requests.

Section V

Regular Expression Basic Cheat Sheet

“.” This matches any character except a new line.

“^” This matches the start of the string.

“$” This matches the end of the string.

“*” This matches 0 or more repetitions of the preceding regular expression, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

“+” This matches 1 or more repetitions of the preceding regular expression. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not just match ‘a’.

“?” This matches 0 or 1 repetitions of the preceding regular expression. ab? will match either ‘a’ or ‘ab’.

“(text)” Grouping of text, used to set borders or to make backreferences.

“[]” Indicates a set of characters, which can be listed individually or as a range. [akm$] will match any of the characters ‘a’, ‘k’, ‘m’, or ‘$’. [^akm$] will match any character except ‘a’, ‘k’, ‘m’, or ‘$’. In this instance, the caret is a “not” operator.

“text1|text2” Either text1 or text2, not both.

“[a-z]” This matches any lowercase letter.

“[a-zA-Z0-9]” This matches any letter or digit.
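The entries above can be checked quickly with Python’s re module. This is a small sketch, assuming Python’s regular expression syntax agrees with Apache’s for these basic constructs; each row holds a pattern, a test string, and whether the whole string is expected to match.

```python
import re

CHECKS = [
    (r".", "x", True),
    (r".", "\n", False),          # "." does not match a new line
    (r"ab*", "a", True),
    (r"ab*", "abbb", True),
    (r"ab+", "a", False),         # "+" needs at least one 'b'
    (r"ab+", "ab", True),
    (r"ab?", "ab", True),
    (r"ab?", "abb", False),       # "?" allows at most one 'b'
    (r"[akm$]", "$", True),       # "$" is an ordinary character inside a set
    (r"[^akm$]", "k", False),     # the caret negates the set
    (r"[^akm$]", "b", True),
    (r"text1|text2", "text2", True),
    (r"[a-z]", "q", True),
    (r"[a-zA-Z0-9]", "_", False),
]

results = [re.fullmatch(pattern, text) is not None for pattern, text, _ in CHECKS]

for (pattern, text, expected), actual in zip(CHECKS, results):
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{pattern!r} vs {text!r}: {actual} ({status})")
```

Note that re.fullmatch requires the entire string to match, which is stricter than a RewriteRule pattern without “^” and “$” anchors.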