
Showing posts with label search engines.

Monday, 10 May 2010

How to add hNews to Blogger blogs

Making your blog hNews-compatible should help it get picked up and indexed by hNews-capable search engines like Value Added News Search.

hNews is a relatively new microformat, designed to enable more useful computer indexing or processing of news stories, blog posts etc. (Technorati tags are another microformat.) Google can already recognise and make use of some microformats through its Rich Snippets.

See this tutorial on how to edit your Blogger template to make it hNews-compatible, so that your blog produces the basic hNews info - post title, author, date - as standard.
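
For a flavour of what the markup involves, here's a minimal hand-rolled sketch of an hNews-style post, based on the class names in the hNews draft spec - the names, dates and text are made up, and your template's exact structure will differ:

<div class="hnews hentry">
  <h3 class="entry-title">My post title</h3>
  <p>By <span class="author vcard"><span class="fn">Improbulus</span></span>,
  <abbr class="published" title="2010-05-10T09:00:00+01:00">10 May 2010</abbr>,
  <span class="source-org vcard"><span class="fn org">A Consuming Experience</span></span></p>
  <div class="entry-content">The post text goes here...</div>
</div>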

Marking up the text within an individual story or post seems to be the main point of hNews, but also seems to be a real pain. Does anyone know any tools that could help bloggers do that, e.g. a plug-in for Windows Live Writer? I'd do a Greasemonkey script if I could, maybe in the summer, but Kirk is the real expert behind the Technorati tagger for Blogger.

Thursday, 8 April 2010

GoogleSharing - fix certificate issue with update

If you've been using the excellent GoogleSharing privacy add-on for Firefox by Moxie Marlinspike, which helps protect your privacy when you search via Google, you may have noticed a problem last week: an "invalid security certificate" error saying the site is not trusted.

That was due to French registrar Gandi.net revoking the site's SSL certificate without prior notice or proper timely explanation.

So a tip: you can fix this problem and get GoogleSharing working properly again by upgrading to the latest version of the extension. Firefox doesn't notify you that an upgrade is available, but upgrading manually worked for me.

Friday, 18 July 2008

6 Sayings for Search Engine Success (Top Tips to Boost Blog Ranking)

Here are 6 sayings for SEO (search engine optimisation) success that I made up earlier. Sayings first, details later:

Improbulus's Illuminating SEO Sayings

  1. Content is king, originality the crown, and text the crown jewels.

  2. Relatedness raises relevance.

  3. The early words get the spider, and headings make things tastier.

  4. Links in to your blog's web address (URL) are good, good links in are better, good link text is best.

  5. Link out liberally - cite sites for authority and authenticity.

  6. A blog post a day keeps bots from going away; a regular diet is just how they like it.

Background

Search engines like Google and Yahoo send out software critters called robots / bots, spiders or crawlers to crawl over or "spider" webpages, following links from page to page and fetching back what they find for indexing and storage in the search company's vast databases (they're even crawling web forms now!).

When someone uses a search engine, they're actually searching its databases, and the search engine decides what results to serve or return to the user (see Google 101: How Google crawls, indexes, and serves the web, which Google recently updated).

Different engines may do that search, and decide what results to show to the user and in what order, in different ways. But they all seem to apply the same broad principles.

You can use those principles to make your blog more likely to appear in search engine results for searches where your site is relevant - i.e. boost your blog's search engine ranking so that your posts are shown higher up the results pages - and thereby attract more visitors to your blog.

The higher up the results page the better, of course; many people don't go beyond the first page of search results.

So I thought up these "SEO sayings" to summarise the main SEO principles I've learned.1 I've tried to make them more memorable by delivering doggerel, mangling metaphors, paraphrasing proverbs and abusing alliteration - I hope that works!

Some of this post will be relevant to websites which aren't blogs, too. And I've written separately on the mechanics of submitting your blog or site and sitemap to the search engines to entice them to crawl your site properly in the first place - this post is on tips for getting a better showing for your blog in the search results, which is a different matter.

Now for the detail (mostly focused on Google searches, as it's the most popular search engine globally, but the same broad principles apply to all of them).

1. Content is king, originality the crown, and text the crown jewels

Search engine bots like lots of text, especially original, non-duplicative text. Sadly, they can't (yet) properly index audio or video, although Google have recently developed algorithms for indexing Flash content like animations (previously, using Flash wasn't so good for Google search purposes).

This principle has several corollaries:
  1. Produce original content - write your own stuff, in your own way - and certainly avoid copying or scraping other sites (not copying others is generally a good idea anyway!)

  2. As text is the best search engine fodder, if you have images (e.g. photos), videos or audio on your blog or elsewhere, try to include ALT text and tag the media with meaningful text descriptions (see the example below). Also avoid horrid Javascript links, and if you use Ajax follow the Ajax tips - accessible sites usually rank better

  3. Eliminate duplicate content - try to minimise duplicate versions of your content on the Web. Duplicates may confuse the bots, who, though they try their best, may not know which version is the "real" one to index, and they can dilute your link popularity (other duplicate content issues). So, monitor and stop people who illegally copy your content (I'll post separately about how to track that and how to stop them), as bots can pick up duplicate content created by scrapers. Also consider cleaning up your feeds, as duplicate content in your feeds could crowd out your main blog from search results. Feeds are now seen less by Google users, but they can still confuse the Googlebot and other bots; I know ACE recovered much faster from the rankings hit I took after my domain name change last year once, on Kirk's advice, I turned off my per-post comment feeds in Blogger. (Adding rel=nofollow to links to your labels pages may also help, if you're on Blogger.)
An example of text being good - Jyri Engeström of Jaiku (which was bought by Google in 2007) has given several talks, with slides, on his 5 principles for successful Web 2.0 services, which have been videoed and recorded several times. I took the time to write up a report of one talk, i.e. a search engine friendly text version. And my post got indexed, and still gets linked to.
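
To illustrate the ALT text point with some made-up HTML (the filename and description are invented), a meaningful description beats a meaningless one:

<img src="tabby-kitten-sleeping.jpg" alt="Tabby kitten asleep in a basket" />
rather than
<img src="IMG_0001.jpg" alt="" />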

2. Relatedness raises relevance

Relevance is good. The more relevant your post is considered to be to the particular word or phrase searched by the user, the more likely it is that the search engine will show your post on its results pages.

How do you improve relevance in the eyes of the search engines? Again, it's the text that counts. Obviously if the user searches for a word that appears a lot in your post, it'll generally do better in the search results than another post that doesn't use that word at all.

But search engines don't just go by exact matches - other words, related to the same concept, will help with relevance too.

So consider what key words describe a particular post - what main concepts, subjects or topics is it about? What are the main words where, if someone searches for those words in a search engine, you'd want your post to appear in the results? (again Google say so too).

For relevance, specific is better than general when it comes to what words you use in your post. "I like hard drive PVRs because you can watch a previously-recorded programme while another programme is recording" will do better than "This is cool stuff, I love it!". (See Google's example of specific being better than general.)

Having figured out the most relevant specific key words for your post, use them judiciously in your post title and post body - i.e. use them in the right places, as often as possible but only where appropriate, as repeating the same word too often may make the bots think it's keyword-stuffing spam. Also, my personal view is you should include them in your tags, Blogger labels, or WordPress categories, though that's not as important as using them in the title, URL and main body.

Good writers will probably be using the "right" words in their posts naturally anyway, just in the course of writing the post: it wouldn't make sense to write about kittens without mentioning the word "kittens"!

But it's better if there are several different words in the post which are "related" to each other, i.e. to do with the same concept.

So, to increase the occurrences of different but related words in the same post or indeed same blog (and maximise chances of the post or blog appearing in appropriate searches), and thereby improve relevance generally, consider these ideas:
  1. Use specific rather than general words as much as possible.
  2. Spell the same word differently in different places, using both British and American spellings (most search engines automatically search for both if you type in one, but I do it anyway; I draw the line at accented characters though).
  3. Include singular and plural versions in different places e.g. mobile, mobiles (again search engines automatically handle those variations, but I do it anyway).
  4. Use synonyms too, e.g. "cellphone" in one place, "mobile" in another; or "cat", "kitten", even "pussycat".
  5. For double words use all variations in different places within the post e.g. "doubleclick", "double-click", "double click" (because single and double words seem to be considered different words by the search engines).
  6. Include the key words and their synonyms in tags, labels or categories (though using them in the title, URL and body is more important)
  7. Consider making your blog a specialised, narrowly themed, niche one - a blog which only ever has posts about movies will, when someone searches for info on movies, generally score better than a blog which has a few posts on movies, some posts on what the writer had for breakfast and more posts about different types of motorbikes, good posts though they may be (the professional bloggers tend to have different specialised blogs; and I have been thinking of splitting out ACE to put the non-technology related posts in a different blog, including the jokes! The funnies etc might help leaven the mix for regular subscribers, but visitors from search engines far outnumber feed subscribers, for this blog anyway).
Note: some people don't think much of using different spellings and singular/plural - I can't say for sure if it's helped me, but it certainly hasn't hurt, so I use them - it's your choice. What I think is more important is the use of synonyms, different but related words, because I believe they help with the relevance of your post.

3. The early words get the spider, and headings make things tastier

Crawlers place importance on the formal structure of your post or webpage. So use your important key words early on in the post and in the more structurally significant parts of your post, i.e.:
  1. Key words in the webpage / blog post title (but don't make the title too long or it may be thought spammy) - blogs do well here because blogging software like Blogger will automatically take your post title and use it for your webpage title (in the post page or item page), and also uses the words used in your post title within the URL of the post's permalink, thus putting key words in the URL and breaking the key words up with dashes in a search-engine friendly way
  2. More important keywords earlier in the title - blogging software like Blogger usually includes words from the title in the URL of the post, which really helps for search engine relevance, but Blogger for one cuts the URL off after about 30 characters, so I sometimes experiment with what words and order to use (by posting the title only to a test Blogger blog) so as to get all the key words into the URL of the post
  3. Use the keywords in the first 50 words (maybe even first 25 words) and of course in the body of your post itself
  4. Key words in your side headings e.g. heading3, heading4 (in my Blogger template at least, heading2 is used for the post title), and
  5. Emphasise or embolden selected keywords (see the markup example below) - good not only for bots but also human readers, who tend to scan or skim Web page content rather than read it.
As the words in the title are very important, remember to ensure your title properly describes the subject of the post. Recall that specific words are more meaningful, and therefore better, than general words: "Cute kitten photos" is a much better title than "Awwww!" or "I love these!" for helping your post rank higher when someone searches for cat pictures. (Don't just take my word for all this, take Google's.)
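
By way of illustration, in raw HTML terms (the text is made up - in Blogger you'd normally do this via the post editor or your template), keywords in a side heading and emphasised in the body might look like:

<h3>Cute kitten photos</h3>
<p>Here are the <strong>kitten</strong> photos I took at the cat shelter...</p>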

4. Links in to your blog's web address (URL) are good, good links in are better, good link text is best

This is the best-known factor - generally, the more links there are to your blog, the higher it will rank in search results; and links from higher-ranking sites or blogs (i.e. sites which themselves have lots of links to them) count for more than links from sites or blogs which aren't so well linked to themselves.

The link text (anchor text) used by the person linking to your blog, i.e. the blue underlined text that people click on to get to your blog, which contains the URL of your post behind the scenes, is crucial. Bots as well as people view that text as a description of your post, so that post will be considered more relevant and rank higher if someone later searches for the link text words, or related words.
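
In raw HTML terms, the anchor text is the bit between the <a> tags - a hypothetical example (example.com URL invented) of meaningful versus meaningless link text:

<a href="http://www.example.com/2008/07/blogger-feed-urls.html">how to use Blogger feed URLs</a>
rather than
<a href="http://www.example.com/2008/07/blogger-feed-urls.html">click here</a>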

Example: I'll refer to my post on how to use Blogger feed URLs in useful ways. Then if people search for info on Blogger feed URLs hopefully they'll see my post (eventually)!

You can get a picture of how other sites link to yours, what words and phrases they use for their anchor text, if you have a free Google Webmaster Tools account - see the Statistics, Page Analysis tab.

The implications:
  1. Cultivate links to your blog by getting the positive attention of high ranking bloggers (without annoying them!) e.g. by commenting on their blogs (UPDATE: have removed Xavier's slides as he's made them private, he obviously didn't want them shared, sorry Xavier.)
    I don't have time to comment on other blogs much these days, but I did in the early days of ACE (e.g. ACE first got real attention through David Sifry linking to my introduction to Technorati tags).
  2. Do NOT post spam comments with irrelevant links to your blog - most blogging platforms set comment links so that bots will ignore and not follow them anyway! (Blogger still have a loophole which lets spammers post links to their own URLs which will be followed; deleting comments like that is the only way round it)
  3. It is acceptable to post links to your own blog in a comment or forum post etc where it's truly relevant or helpful
  4. Remember it's links to a particular URL that count, not links to the content, so if you change your blog's domain steel yourself for a huge drop in visitors for months... (see my account of my travails when changing from blogspot.com to www.consumingexperience.com)
  5. Therefore, if you change domain you might try to get those who linked to your old URL to update their links to point to your new domain (I confess I didn't do that, too many links to figure out, and I didn't want to trouble them - I can't expect other people to put in time changing their links just to help me)
  6. If you're just starting to blog or starting a new blog, frankly the best option is to use your own domain name from the start, or at least as early on as possible (e.g. Blogger custom domains), so that links in to your blog will be to that domain rather than to, say, blogspot.com or wordpress.com.2
  7. If you've already started blogging e.g. on Blogspot.com, get the pain over with early, bite the bullet and switch to your own domain ASAP so that you can start building up links to the new domain's URL sooner
  8. In your new posts, where it's relevant to mention your own previous posts, link to your previous posts using meaningful anchor text - yes, that counts!

5. Link out liberally - cite sites for authority and authenticity

I started doing this because I like to back up what I say. If I read something interesting I may want to look into it further, check out the original news article or government paper quoted from or mentioned, etc - and I thought my readers would want to do the same.

I don't expect readers to just trust me and take my word for something in a vacuum, so I cite (and link to) authoritative sources like Wikipedia, news sites, government sites, academic/university sites etc. If readers wish, they can follow the link to get it from the horse's mouth.

Also, those links are helpful to my readers - if I mention a technical term and link to its Wikipedia definition, they can look it up if they want to. Yes that makes it much more time-consuming for me to write my posts, but I think it's worth it.

And as it turns out, it seems search engines actually like links out which are useful to readers, and tend to favour posts with them.

Finally, of course, linking out to other bloggers is also good in terms of mutual back-scratching!

6. A blog post a day keeps bots from going away; a regular diet is just how they like it.

Search engines rate freshness, so a blog which is regularly updated with new posts is likely to be ranked more highly than one which is only updated every 2 or 3 months. (I confess I'm not very good at frequent blogging myself, as most of my posts are long and very time-consuming to write. I'm sure my blog would do better if I posted more often rather than like buses, nothing for a while then 3 at once!)

Bots also seem to like regularity (an apple a day helps humans with that, of course!), so a predictable publishing schedule, e.g. a post every week on a Sunday, or a post every other day, etc, should help boost your rankings - again an area I personally need to improve on.

So this means:
  1. Try to publish posts often, ideally at regular intervals
  2. Consider scheduling your posts, i.e. writing several posts in advance (if you have the time) and then setting them to publish at regular intervals during the week. I've done this a few times myself, though not as often as I should.

More info

See the search engines' Webmaster guidelines for the full lowdown from the horse's mouth.
Also look at specialist advice e.g. from blogs like Search Engine Land and Search Engine Watch, and SEO Chicks where experts like Judith deCabbit Lewis post - see the Girlygeekdom post with video and MP3s of talks by Judith and Sheila Farrell on SEO at the March 2008 London Girl Geek Dinner.

So, what are your own personal top tips for increasing search engine rankings for blogs?

(With thanks to those at the London bloggers' meetup group on 24 June 2008 (particularly Ged), as always efficiently organised by Andy Bargery, for their helpful discussion on these issues. See e.g. epicurienne's writeup of the meetup.)

Notes

1. Who am I to talk? Well, while I'm no search engine optimization expert or Pro Blogger and ACE is certainly no Boing Boing, I must be doing something right - this blog reached 1 million unique visitors in April 2008 and it currently averages over 2000 visitors a day (over 90% of them through Google searches) [ update: at Feb 2009, now averaging 3000 unique visitors a day], and it's near the top for some Google searches e.g. Technorati tags or Gmail alias. I think when I blog I've always tried to make my posts useful to readers by having properly descriptive titles and headings, and relevant text at the start, and I thereby unconsciously hit the spot in relation to a number of factors that are important to the Googlebot and other bots / crawlers / spiders - so now I try to apply them consciously, most of the time anyway!

2. Domain names are relatively cheap these days especially .com names, so it's well worth getting one. Looks more professional too; you might not plan to go pro with your blog, but if it takes off, you never know...


Search engine submission, indexing, sitemaps & robots.txt - guide for bloggers

THE QUICKIE

This post covers:
  • how to submit your blog or website to the major search engines (Google, Yahoo, Microsoft, and I've thrown in Ask) to invite them to start visiting your site - just use the forms below
  • how to submit your "sitemap" to the search engines (touching on the role of the robots.txt file) to ensure that, if and when they start crawling your blog or web site, they'll index all your webpages comprehensively - use the forms below, and sign up for their various webmaster tools; if your blog is on Blogger you can try submitting several URLs (depending on how many posts you have in your blog) to cover your whole blog: http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=1&max-results=500
    and http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=501&max-results=500 etc
  • how to get your updated web pages re-indexed by the search engines for the changed content - this should happen automatically for Blogger blogs, at least if they're on Blogspot or custom domains.
I don't cover how to get your blog to rank higher in search engines' search results pages - see my separate post on search engine optimisation.

THE LONG & SLOW

It's been well known for some time that most people navigate to websites via search engine searches, e.g. by typing the name of the site in the search box - even when they know the direct URL or web address.

Search engines like Google and Yahoo! send out software critters called bots, robots, web crawlers or spiders to crawl or "spider" webpages, following links from page to page and fetching back what they find for indexing and storage in the search company's vast databases. When someone uses a search engine, they're actually searching its databases, and the search engine decides what results to serve or return to the user (see Google 101: How Google crawls, indexes, and serves the web, which Google recently updated).

So to maximise visitor traffic to your site you need as a minimum to get the search engines to:
  1. start indexing your website or blog for searching by their search engine users, and then
  2. index it as completely and accurately as possible, by giving them the full structure or map of your site (a sitemap), and keep that index updated.
It's important to note that the two points, although both to do with the crawling and indexing of your site, are entirely separate. Trying to do 2 is no good if 1 isn't happening at all in the first place, i.e. if the search engines are turning their noses up at your blog or site and refusing to even set a toe inside your door.

This post is an introductory how-to which focuses mainly on 2 (getting your site indexed comprehensively), but I'll touch first on 1, i.e. search engine submission (submitting your blog to the search engines to ask them to index your posts). Trying to get your posts ranked higher in search engine results (the serve/return aspect and search engine optimisation or SEO) is yet another matter.

I'll only cover Google, Yahoo!, Microsoft Live Search (Live Search is the successor to MSN Search) and Ask.com. Why? Because they're the 3 or 4 key search engines in the English-speaking world, in my view: certainly, the vast majority of visitors to ACE come from one or other of them (94 or 95% Google, 3 or 4% Yahoo!). I have a negligible number of visitors from blogosphere search engines like Technorati, whether via normal searching or tags, and it's the same with Google BlogSearch, so I'm not covering Technorati or BlogSearch in this post.

Realistically, if you want to maximize or just increase traffic to your site or blog, you need to try to get indexed by and then to increase your rankings with the "normal" search engines, because virtually everyone still uses the "normal" search engines as their first port of call, not the specialised blogosphere search engines. Furthermore, the same few big search engines dominate. Independent research also bears out my own experience.1

So if you want to get more traffic to your website, it's a good idea to get indexed, and indexed comprehensively, by at least the big 3 search engines: Google, Yahoo and Microsoft.

Submitting your site to Google, Yahoo, Microsoft and Ask.com

No guarantees...

Remember: submitting your site to the search engines just gets their attention. It doesn't guarantee they'll actually decide to come and visit you. It's the equivalent of "Yoohoo, big boy, come up & see me sometime!" It doesn't mean that they'll necessarily succumb to your blandishments.

In other words, submitting or adding your site to Google etc is better than not doing so, but it's not necessarily enough to persuade them to start indexing your site in the first place or sending their bots round to your site. It takes good links in to your blog to do that, normally.

Below I cover how to try to get your site indexed by the main search engines so that it'll be searchable by users of search engine sites, but the operative word is try.

What are bots or robots and what are search engine indexes?

Bots, robots, web crawlers or spiders are software agents regularly sent out across the Web by search engines to "crawl" or "spider" webpages, sucking in their content, following links from page to page, and bringing it all back home to be incorporated into the vast indexes of content maintained by the search engines in their databases.

Their databases are what you search when you enter a query on Google, Ask.com etc.

Bots can even have names ("user agent identifiers"), e.g. the Googlebot (more on Googlebot) or Yahoo's Slurp. (See the database of bots at robotstxt.org, which may however not always be fully up to date or comprehensive.)

And of course it's not like each search engine service has only one bot - they have armies of the critters ceaselessly spidering the Web.

Also note that not everything retrieved by a bot gets into the search engine's index straightaway; there is usually a time lag before the index is updated with all the info retrieved by all their bots.

How to get your site indexed by the search engines

So how do you get the bots to start stopping by? Most importantly, by getting links to your site from sites they already know and trust - the higher "ranking" the site in their eyes, the better. The search engines are fussy, they set most store by "recommendations" or "references" in the form of links to your site from sites they already deem authoritative.

How you get those first few crucial links is up to you - flatter, beg, threaten, blag, sell yourself... I'll leave that, and how to improve your search engine rankings once you've got them to start indexing you, to the many search engine optimisation (SEO) experts out there (and see my 6 SEO sayings post), but normally shrinking violeting just won't do it.

Return visits are go!

The good news is, once you manage to seduce the search engine bots into starting to visit your site, they're normally creatures of habit and will then usually keep on coming by regularly thereafter to slurp up the tasty new things you have for them on your site or blog.

How to submit your site to Google, Yahoo!, Microsoft Live Search and Ask.com to request them to crawl your site for their search indexes

  1. Google: Add your top-level (root) URL to Google (e.g. in the case of ACE it would be http://www.consumingexperience.com without a final slash or anything after the ".com") to request the Googlebot to crawl your site

  2. Yahoo: Submit your website or webpage URL or site feed URL to Yahoo after you register free with Yahoo and have signed in (I won't cover paid submissions). Again this is to request Yahoo's Slurp to visit your site, but it may not actually do so... (More on Yahoo's Site Explorer.)

  3. Microsoft: Submit your site to Microsoft Live Search to request the MSNBot to crawl your site.

  4. Ask.com: You need to submit your sitemap (covered below) to submit your site to Ask.com.

Sitemaps generally and how to use them

Now once the main search engines have started indexing your site, how do you make sure they do that comprehensively?

By feeding them a sitemap.2 Sitemaps help make your site more search engine-friendly by improving how the search engines crawl and index your site, and I've been blogging about them since they were first introduced.

What's a sitemap, and where can I get further information on sitemaps?

A sitemap is a document in XML file format, usually stored on your own website, which sets out the structure of your site by listing, in a standardised way which the search engines' bots will recognise, the URLs of the webpages on your site that you want the bots to crawl.

The detailed sitemaps protocol can be found on Sitemaps.org (the sitemap can also optionally give the bots what's known as "metadata", i.e. further info such as the last time a web page was updated).
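
For the curious, a minimal sitemap file following the Sitemaps.org protocol looks something like this (the example.com URL is of course a placeholder, and lastmod is one of the optional metadata fields):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/mypage1.html</loc>
    <lastmod>2008-07-18</lastmod>
  </url>
</urlset>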

See also the Sitemaps FAQs, and Google's help pages on sitemaps.

Where should or can a sitemap be located or uploaded to?

A sitemap should normally be stored in the root, i.e. top level, directory of your site's Web server (see the Sitemaps protocol) e.g. http://example.com/sitemap.xml.

Important note (my emphasis):
Normally, "...all URLs listed in the Sitemap must use the same protocol (http, in this example) and reside on the same host as the Sitemap. For instance, if the Sitemap is located at http://www.example.com/sitemap.xml, it can't include [i.e. list] URLs from http://subdomain.example.com." (from the Sitemap file location section of the official Sitemaps protocol).

What does that mean? Well, suppose your site is at http://www.example.com and you want to submit a sitemap listing the URLs of webpages from http://www.example.com which you'd like the search engines to crawl; pages like:
http://www.example.com/mypage1.html
http://www.example.com/mypage2.html, etc.

In that example, the search engines will not generally accept your sitemap submission unless your sitemap file is located at http://www.example.com/whateveryoursitemapnameis.xml; if you try submitting it from anywhere else you'll get an error message. Generally, if the URLs you're trying to include are at a higher level or a different sub-domain or domain, tough luck.

In other words, they will not generally accept a sitemap listing URLs from http://www.example.com (e.g. http://www.example.com/mypage1.html, http://www.example.com/mypage2.html, etc) if the sitemap is located at:
  • http://example.com/sitemap.xml - because it doesn't have the initial "www"
  • http://another.example.com/sitemap.xml - because "another" is a different subdomain from "www", or
  • http://www.someotherdomain.com/sitemap.xml or indeed http://someotherdomain.com/sitemap.xml - because someotherdomain.com is a totally different domain.
The good news is that Google decided to accept what they call Sitemap cross-site submissions, where the sitemap for one domain can be uploaded to (and have the URL of) a different domain, but it will still be crawled by the Googlebot - provided you can verify with Google Webmaster Tools that you own or control both domains (see their FAQ).

And now, via a different method I'll come to below, you can also submit a cross-site, cross-host or cross-domain sitemap which Yahoo and Microsoft will also accept.

How do you create a sitemap?

I won't go into the ins and outs of constructing a sitemap here; there are lots of other resources for that, like Google's Sitemap Generator and third-party tools. My main focus is on blogs, where sitemap generator tools aren't really relevant - as you'll see below.

Sitemaps for blogs (and other sites with feeds)

This section is about sitemaps for blogs, particularly those using Blogger, but much of what I say will also apply to other sites that put out a site feed.

How do you create and submit a sitemap for a blog? As mentioned above, normally your sitemap must be at the root level of your domain before the search engines will accept it.

But one big issue with many blogs is that, unless you completely control your website (e.g. you have a self-hosted Wordpress blog or use Blogger on your own servers via FTP), you can't just upload any ol' file you like to any ol' folder you like on the server which hosts your blog's underlying files.

If you have free hosting for your blog, e.g. from Google via Blogger (typically on blogspot.com), you're stuck with what the host will let you do in terms of uploading. Which isn't much. They usually limit:
  • what types of files you can upload to their servers (e.g. file formats allowed for images on Blogger don't include .ico files for favicons, which to me makes little sense), and
  • where you can store the uploaded files - they (the host), not you, decide all that (e.g. video uploads are stored on Google Video, not the Blogspot servers).
So, even if you could build your own sitemap file for your blog, if you can't upload it to the right place in your blog server, it won't do you any good.

But all is not lost. With a blog you already have a ready-made sitemap of sorts: your feed.

How to submit sitemaps for blogs: your feed is your sitemap

As most bloggers know, your site feed is automatically created by your blogging software and reflects your latest X posts (X is 25 with Blogger, by default), as long as you've turned your feed on i.e. enabled it. (See my introductory tutorial guides on What are feeds (including Atom vs RSS); How to publish and publicise a feed, for bloggers; How to use Feedburner; Quick start guide to using feeds with Feedburner for the impatient; and Podcasting - with a quickstart on feeds and Feedburner, and a guide to Blogger feed URLs.)

You can in fact submit a sitemap by submitting your feed URL: see how to submit a sitemap to Google for Blogger blogs (longer version), and you can even "verify" your Blogger sitemap to get more statistical information about how the Googlebot crawls your blog.

Now, your last 25 posts is not your complete blog, that is true. A sitemap should ideally map out all your blog posts, not just the last 25 or so. But remember, bots faithfully follow links. The vast majority of blogs are set up to have, in their sidebar, an archives section. This has links to archive pages for the entire blog, each of which either links to or contains the text of all individual posts for the archive period (week or month etc), or indeed both. And many blog pages have links to the next and previous posts, while Blogger blogs have links to the 10 most recent posts.

So, even starting from just a single blog post webpage, whether it's the home page or an individual item page / post page, a bot should be able to index the entire blog. That's why my pal and sometime pardner Kirk always says there's not much point in putting out a sitemap for blogs.

What feed URL should you submit?

Because technically the feed, as with other sitemaps, has to be located at the highest level directory of the site you want the search engines to crawl, you shouldn't try to submit a Feedburner feed as your sitemap, as it won't be recognised; you need to submit your original source site feed.

For Blogger blogs you'll recall (see this post on Blogger feed URLs) that your blog's original feed URL is generally:
http://BLOGNAME.blogspot.com/feeds/posts/default
- and generally that feed contains your last 25 posts.

For sitemap submission purposes, if http://BLOGNAME.blogspot.com/feeds/posts/default doesn't work, you can alternatively use: http://yourblogname.com/atom.xml or http://yourblogname.com/rss.xml.

Now here's a trick - if you use Blogger you can in fact submit your entire blog, with all your posts (not just your most recent 25), to the search engines. This is done by using a combination of the max-results and start-index parameters (see post on Blogger feed URLs), and knowing that - for now, at least - Blogger allows you a maximum max-results of 500 posts.

Let's say that you have 600 posts in your blog. You would make 2 separate submissions to each of the search engines, of 2 different URLs, to catch all your posts:
http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=1&max-results=500
http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=501&max-results=500

The first URL produces a "sitemap" of the 500 most recent posts in your blog; the second URL produces a sitemap starting with the 501st most recent post, and all the ones before it up to 500 (if you have 600 total you don't need &max-results=500 at all, in fact; but you might if you had over 1000 posts). And so on. If in future Blogger switch the max-results back to 100 or some other number, just change the max-results figure to match, and submit more URLs instead - you can check by viewing the feed URL in Firefox and seeing if it maxes out at 100 (or whatever) even though you've used =500 for max-results. (For FTP blogs it's similar but you'll need to use your blog ID instead of main URL, see my post on Blogger feed URLs.)
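
So, for instance, a blog with 1,200 posts would need three separate submissions to cover everything - posts 1-500, 501-1000 and 1001-1200:

http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=1&max-results=500
http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=501&max-results=500
http://BLOGNAME.blogspot.com/feeds/posts/default?start-index=1001&max-results=500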

For WordPress blogs - how to find the feed locations. I don't know if you can do similar clever things with query parameters in WordPress so as to submit a full sitemap.

How to submit a sitemap - use these ping forms

To submit a sitemap, assuming it's already created and uploaded and you know its URL (which you will for a blog site feed), you just need to send an HTTP request ("ping"), or submit the sitemap's URL via the relevant search engine's sitemap submission page (see Sitemaps.org on how to submit, and the summary of ping links at Wikipedia).

So I've included some forms below to make it easier for you to submit your sitemap to various search engines. Don't forget to include the "http://" or it won't work.
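
If you'd rather ping by hand, the ping is just a URL you visit in your browser, with your sitemap URL tacked on the end (percent-encoded). For example, at the time of writing Google's and Ask.com's documented formats are along these lines - substitute your own sitemap URL, of course:

http://www.google.com/ping?sitemap=http%3A%2F%2FBLOGNAME.blogspot.com%2Fatom.xml
http://submissions.ask.com/ping?sitemap=http%3A%2F%2FBLOGNAME.blogspot.com%2Fatom.xml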

How to submit a sitemap to Google

  • Google prefer that the very first time you submit your sitemap to them, you do so via Google Webmaster Tools (it's free to create an account or you can sign in with an existing Google Account). This is a bit cumbersome but you can then get stats and other info about how they've crawled your site (and how they've processed your sitemap) by logging in to your account in future:
    1. Get a Google Account if you haven't already got one (if you have Gmail, you can use the same sign in details for your Google Account)
    2. Login with your Google Account to Google Webmaster Tools (GWT).
    3. Add your site / blog and then add your sitemap (more info) - if you're on Blogger see below for the best URL to use.
    4. Then you can resubmit an updated sitemap after that, including via the form below.

  • While Google prefer that you register for a Google Webmaster Tools account, the easiest way to submit a sitemap is just to enter your sitemap URL below (including the initial http://) and hit Submit. Then bookmark the resulting page (i.e. save to your browser Favorites) and if in future you want to re-submit your sitemap, just go back to the bookmark (opens in a new window):

    • Note: this "HTTP request" or "ping" method seems to work for URLs which are rejected when trying the method below (e.g. because they're in the wrong location - wrong domain, as explained above), but Google may well ignore the ping unless you've previously submitted your URL as mentioned below, and who knows whether the sitemap will be accepted for the correct domain when they get to processing it? So it's probably better to use the method below first, unless you can't get it to work.

How to submit a sitemap to Yahoo!

  • Enter your sitemap URL below (including the http://) and hit Submit. Then bookmark the resulting page (i.e. save to your browser Favorites) and if in future you want to re-submit your sitemap, just go back to the bookmark (opens in a new window):

    • Note: as with Google, this "HTTP request" or "ping" method seems to work for URLs which are rejected when trying the method below (e.g. because they're in the "wrong" domain), but again who knows whether the sitemap will be accepted for the correct domain when they get to processing it?

  • Alternatively - more cumbersome, but you can then get stats about how they've crawled your site by logging in to your account in future - sign in to Yahoo's Site Explorer and submit your sitemap there.

How to submit a sitemap to Microsoft Live

Microsoft were later to the Sitemaps party but Live Search does now support sitemaps.
  • Enter your sitemap URL below (including the http://) and hit Submit. Then bookmark the resulting page (i.e. save to your browser Favorites) and if in future you want to re-submit your sitemap, just go back to the bookmark (opens in a new window):

  • Microsoft also have a Webmaster site where you can submit your sitemap.

How to submit a sitemap to Ask.com

  • Enter your sitemap URL below (including the http://) and hit Submit, again this method uses "HTTP request" or "ping"; bookmark the resulting page and go back to the bookmark to re-submit in future (opens in a new window):
    • NB for Blogger feeds, don't use the /posts/default format; you'll have to use http://yourblog.blogspot.com/atom.xml

    (more info).

    • Note: for some reason Ask.com will not currently accept sitemaps which don't point to a specific file like .xml or .php etc - so it won't accept feeds like Wordpress blog feeds in the format http://example.com/?feed=rss, or Blogger feeds in the format http://yourblogname.com/feeds/posts/default or indeed http://yourblogname.com/feeds/posts/default?alt=atom (but e.g. http://yourblogname.com/atom.xml does work with Ask.com).

Updates to your blog / site - how to ping the search engines

Let's assume that the search engine crawlers are now regularly visiting your site. But you don't know at what intervals or how frequently.

What if you update your blog or site by adding new content or editing some existing content? You'll want the search engines to know all about it ASAP so that they can come and slurp up your shiny new or improved content, so that their indexes are as up to date as possible. It's annoying and offputting for visitors to come to your site via a search only to find that the content they were looking for isn't there, or is totally different (they might still be able to get to what they want via the search engine's cache, but that's a different matter).

So, how do you get the search engine bots to visit you ASAP after an update, and pick up your new or edited content? By what's called "pinging" them. How you ping a search engine depends on the search engine, not surprisingly (see e.g. for Yahoo).

But guess what? To ping a search engine, all you have to do is re-submit your sitemap to it. I've already provided the ping forms above, so once you've pinged and saved the bookmark/favorite you just need to click that bookmark. Alternatively, you can also login to a search engine's site management page and use the Resubmit (or the like) buttons there.

Also, blogging platforms like Blogger will automatically ping the search engines for you when you update your blog. In Blogger this should be activated by default: under Dashboard > Settings > Basic, if "Let search engines find your blog?" is set to "Yes", then Blogger will ping Weblogs.com, a free ping server, whenever you update your blog, so that search engines around the Net which get info from Weblogs.com will know to come and check out your updated pages. If your blogging software doesn't do that, you can always use Feedburner's PingShot (which I've previously blogged - see my beginners' detailed introduction to Feedburner).


How do you submit just changed or updated pages?

Sites with zillions of URLs can in fact have more than one sitemap (with an overall sitemap index file) and just submit the sitemaps for recently changed URLs. I'm not going to go into that here, sites that big will have webmasters who'll know a lot more about all this than I do.

Now recall that with blogs the easiest way to provide a sitemap, if you can't upload your own sitemap file into the right directory, is to use your feed as the sitemap.

What if you've updated some old posts on your blog, and you want the search engines to crawl the edited content of the old posts?

Aha. Well if you're on Blogger, there is a way to submit a sitemap of just your most recently-updated posts to the search engines.

This is because Blogger automatically produces a special feed that contains just your most recently-updated posts - not your most recently-published ones, but recently-updated ones. So even an old post you just updated will be in that feed. See my post on Blogger feed URLs (unofficial Blogger feeds FAQ) for more info, but essentially the URL you should use for your sitemap in this situation is
http://YOURBLOGNAME.blogspot.com/feeds/posts/default?orderby=updated (changing it to your blog's URL of course)
- for instance for ACE it would be http://www.consumingexperience.com/feeds/posts/default?orderby=updated

The "orderby=updated" ensures that the feed will contain the most recently-updated posts, even if they were first published some time ago. If I wanted it to contain just the 10 most recently-updated posts, irrespective of how many posts are normally in my main feed, I'd use:
http://www.consumingexperience.com/feeds/posts/default?orderby=updated&max-results=10

Furthermore, Blogger automatically specifies the "updated" feed as your sitemap in your robots.txt file. Which I'll now explain.

Sitemaps auto-discovery via robots.txt files

It's a bit of a pain to have to keep submitting your sitemap to the various search engines every time your site is updated, though obviously it helps if your blogging platform automatically pings Weblogs.com.

Also, the search engine bots don't have an easy time figuring out where your sitemap is located (assuming they're indexing your content in the first place, of course). Remember that sitemaps don't have to be feeds - feeds are just one type of XML file which are acceptable as sitemaps. People can build their own detailed sitemap files, name them anything (there's no standardisation on what to name sitemap files), and store them anywhere as long as they're at their domain's root level.

If you actively submit your sitemap to a search engine manually or through an automatic pinging service, that's fine. But if you don't, how's a poor hardworking lil bot, diligently doing its regular crawl, going to work out exactly what your sitemap file is called and where it lives?

So arose a bright idea - why not use the robots.txt file for sitemap autodiscovery?

What are robots.txt files?

Robots.txt files are simple text files found behind the scenes on virtually all websites.

Now a robots.txt file was initially designed to do the opposite of a sitemap - basically, to exclude sites from being indexed, by telling search engine bots which bits of a site not to index, or rather asking them not to index those bits. (There are other ways (i.e. meta tags) to do that, but I won't discuss them here.)

Legit bots are polite and will abide by the "no entry" signs because it's good form by internet standards - not because they can be forced to, or will be torn apart bit by bit if they don't. (It's important to note that using robots.txt is no substitute for proper security to protect any truly sensitive content - the robots.txt file really contains requests rather than orders, and won't be effective to stop any pushy person or bot who decides to ignore it. Whether anyone can sue for trying to get round robots.txt files is a different matter yet again.)

Using robots.txt it's possible to block specific (law-abiding) bots by name ("user-agent"), or some or all of them; and to block crawlers by sub-directory (e.g. everything in subfolder X of folder Y is off limits), or even just one webpage - for more details see e.g. Wikipedia, the robots.txt FAQ, Google's summary, "using robots.txt files" and Google's help. Google have even provided a robots.txt generator tool, but it won't work for Blogger blogs, see below.
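
For instance, a robots.txt file asking all bots to keep out of one folder, and asking one named bot (Yahoo's Slurp, say) to keep out of the whole site, would look like this - the folder name is made up:

User-agent: *
Disallow: /private-stuff/

User-agent: Slurp
Disallow: /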

Where's the robots.txt file?

A robots.txt file, like a sitemap, has to be located at the root of the domain e.g. http://yoursitename.com/robots.txt. (It can't be in a sub-directory e.g. http://yoursitename.com/somefolder/robots.txt, because the system wasn't designed that way. Well you could put it in a subfolder, but search engines will only look for it at root level, so they won't find it.)

Remember that again, as for sitemaps, different subdomains are technically separate - so they could have different robots.txt files, e.g. http://yoursitename.com/robots.txt is not in the same domain as http://www.yoursitename.com/robots.txt.

However, unlike sitemap files, robots.txt files do have a standard name - robots.txt, wouldja believe. So it's dead easy to find the robots.txt file for a particular domain or sub-domain.

Using robots.txt files for sitemap auto-discovery

Given the ubiquity of robots.txt files, the thought was, why not use them for sitemaps too? More accurately, why not use them to point to where the sitemap is?

That was the solution agreed by the main search engines - Google, Yahoo, Microsoft and Ask.com - for sitemap auto-discovery (Wikipedia): specify the location of your sitemap (or sitemap index if you have more than one sitemap) in your robots.txt file, and bots can auto-discover the location by reading the robots.txt file, which they can easily find on any site. (Since then, those search engines have also agreed on standardisation of robots exclusion protocol directives - big pat on the head, chappies!)

Now how do you use a robots.txt file for your sitemap? In the robots.txt file for yoursitename.com (which will be located at http://yoursitename.com/robots.txt), just stick in an extra line like this (including the http etc):
sitemap: http://yoursitename.com/yoursitemapfile.xml

And then, simply, the bots will know that the sitemap for yoursitename.com can be found at the URL stated - and the main search engines (including Yahoo and Microsoft) have agreed that they'll accept it as a valid sitemap even if it's in a different domain, and even if you've not verified your site. So, to manage cross-host or cross-site sitemap submission, just point to the sitemap from the robots.txt file of the site whose URLs it lists.

Adding a "sitemap" line in the robots.txt file to indicate where the sitemap file is located has two benefits. The search engines regularly download robots.txt files from the sites they crawl (e.g. Google re-downloads a site's robots.txt file about once a day), so they'll know exactly where the sitemap file for a site is. They'll also regularly fetch that sitemap file automatically, and therefore know to crawl updated URLs as indicated in that file.

Can you create or edit your robots.txt file to add the sitemap location?

Now bloggers will have been thinking: so what, even if you can now use robots.txt to tell most of the search engines where your sitemap is?

First, unless you host your blog files on your own servers (e.g. self-hosted WordPress blogs or FTP Blogger blogs or Movable Type blogs), you won't be able to upload a robots.txt file to the root of your domain. For example users of Blogger custom domain or Blogspot.com blogs can't upload a robots.txt file.

Second, is there any point in doing that even if you could, given that for most blogs the sitemap is just the site feed, which the search engines will get to regularly anyway?

Well, there is, at least for Blogger users.

Blogger's robots.txt files

For the first issue, Blogger creates and uploads a robots.txt file automatically for Blogspot.com and custom domain users.

For instance, my blog's robots.txt file is at http://www.consumingexperience.com/robots.txt (and as I have a custom domain, http://consumingexperience.blogspot.com/robots.txt also redirects to http://www.consumingexperience.com/robots.txt - which is as it should be). You can check the contents of your own just by going to http://YOURBLOGNAME.blogspot.com/robots.txt

What Blogger have put in that robots.txt file also helps answer the second issue. Here's the contents of my blog's robots.txt file:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search

Sitemap: http://www.consumingexperience.com/feeds/posts/default?orderby=updated
The sitemap line is there, as it should be. And it gives as my sitemap the URL of the Blogger feed listing my most recently updated posts - not the default feed, which lists the most recently published new posts. Even though that "updated" feed doesn't follow the rule about being located in the root of the domain concerned, that's fine: because it's listed in my blog's robots.txt file, the major search engines will accept it and will regularly spider my updated posts. So even if I've just changed an old post, the changed contents of that old post will be properly re-indexed. And so too will yours - if you're on Blogger, at least.

More info
Don't forget to check out Google Webmaster Tools and the other tools you've signed up for; tweaking some of the settings may help, e.g. increasing your crawl rate - the rate at which the Googlebot crawls your blog or site - to "Faster".

Notes:

1. Of the top 50 web properties in the USA in May 2008, Google Sites ranked as no. 1, followed by Yahoo! Sites and then Microsoft Sites and AOL (comScore Media Metrix); and Nielsen Online told a similar story, with the top 3 of the top 10 US Web sites belonging to Google, Yahoo and Microsoft. In the UK, during 2007 Google was the most popular website by average monthly UK unique audience as well as the most visited by average monthly UK sessions (Nielsen). Of all searches conducted in the Asia Pacific in April 2008, 39.1% were on Google Sites, 24% on Yahoo! Sites (comScore Asia-Pacific search rankings for April 2008); in the USA in May 2008, 61.8% of searches were on Google Sites, 20.6% Yahoo! Sites and 8.5% Microsoft, with AOL and Ask having 4.5% each (comScore May 2008 U.S. search engine rankings); while in the UK in April 2008, Google Sites dominated with 74.2% of all searches, with the second, eBay, at only 6%, followed by Yahoo! Sites (4.3%) and Microsoft Sites (3.4%) (comScore April 2008 UK search rankings).

2. A short history of Sitemaps. Google introduced sitemaps in mid-2005 with the aim of "keeping Google informed about all of your new web pages or updates, and increasing the coverage of your web pages in the Google index", including adding submission of mobile sitemaps, gradually improving its informational aspects and features for webmasters in Google Webmaster Tools (a broader service which replaced "Sitemaps"), and even supporting multiple sitemaps in the same directory. Google rolled Sitemaps out for sites crawled for Google News in November 2006 and Google Maps in January 2007.

In November 2006 Google got Microsoft and Yahoo to agree to support the Sitemaps protocol as an industry standard and they launched Sitemaps.org. In April 2007 Ask.com also decided to support the Sitemaps protocol.

Another excellent Sitemaps-related innovation in 2007, at Yahoo!'s instigation this time, was sitemaps auto-discovery. Third party sitemap tools have also proliferated. There have been teething issues, but it all seems to be working now. There are even video sitemaps, sitemaps for source code and sitemaps for Google's custom search engines, but that's another matter...