Wednesday, 8 March 2006

Technorati tag pages problem: my test results

I've been looking into the problem I and others have experienced where properly-tagged posts don't appear on the appropriate tag pages of the leading blogosphere search engine Technorati, as previously mentioned.
How widespread are these problems?

I even tried doing a survey a while back. Only about 80 people responded but, for those interested, the key results to date are here (opens in a new window - 'scuse the look and the need to scroll, but that's the only one-column template on offer, I needed a one-column layout to fit, and I don't have time to tinker further with it - talk about lack of choice on Google Pages!). (I won't even try to include the graphs here, lest the images or iframes stop this post from being tagged. You never know...)

Now 80 doesn't seem many but it does indicate that the problem isn't unknown, and that more people have experienced the problem than have bothered to report it to Technorati. (If you want to vent, the poll is still open, see the end of this post.)

Just over a week ago, when yet another post didn't get displayed properly on the correct Technorati tag pages (e.g. the Technorati Improbulus tag page, though the post is on another blog search engine IceRocket's Improbulus tag page), I decided to investigate further, as mentioned in my previous post. Technorati's CEO David Sifry had commented there that Technorati were going to get to the bottom of this, which is very welcome news.

To recap, I think there are two aspects that need investigation here. Which bit of the affected post is triggering the failure in Technorati's system? And which part of Technorati's system is it that's going wrong?

The post: what doesn't Technorati like about the post content?

As outlined in my previous post, I can think of several possibilities (apart from the "valid XHTML" point which, as I explained in the previous post, I don't think is an issue in the case of my blog, but I tested it anyway):
  • length of post
  • lots of code in the post, displayed as such
  • lots of HTML other than links/images, e.g. forms, iframes
  • a combination of the previous, or something else I haven't thought of!

My experiment

The problem post "Technorati: favorite blogs; help others add your blog, and thoughts on Technorati Favorites" was long and had lots of code, including a form and iframe. So I split it into different sections and posted each section separately (with some normal posts in between) to see what happened. (I should have given the posts more distinctive titles, but there we are - the end of the first paragraph of each test post does summarise what bits it contains.)

All those individual posts had mostly the same tags. For speed I mainly checked the tag page for the Improbulus tag (my "meblogging" tag), but I also cross-checked other tag pages, e.g. for "A Consuming Experience", and the results were the same.

Here are screenshots of part of the Technorati Improbulus tag page:

and of my list of actual posts:

- from which you'll see that clearly some of the test posts are not on Technorati's tag pages.

Now, just to break down the various test posts by content type (you can doublecheck them on the Improbulus tag page if you want to):

A. Post with text, links, images including buttons, code: Technorati: favorite blogs; help others add your blog - OK

B. Post with text, iframe and code for iframe: Technorati: favorite blogs; show your Technorati favorites on your blog - OK

C. Post with text, links, icons, one URL as code without link, and form: Technorati favorite blogs: benefits for readers of blogs - PROBLEM

D. Post with text, links and images (bugs, issues, thoughts): Technorati favorites: bugs, issues and thoughts - OK

So the problem seemed to be with C. I suspected it was the form, perhaps because the input tags weren't closed (so it was not strictly valid XHTML). Technorati have been telling people (e.g. in their help and via their staff) that having valid XHTML will help your posts get properly indexed by their spider. In fact, that seems to have been their main response over the months to people who have asked for support on this very issue.

Therefore I reposted C, having first tweaked the form so the input tags were closed (making it closer to valid XHTML, just in case - despite my personal view - that was the source of the problem). But still, no go.
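For anyone who wants to check the "closed input tags" point on their own markup, here is a minimal sketch in Python. It only tests whether a fragment is well-formed XML, which is a necessary condition for valid XHTML but not a full validation against the XHTML DTD. The form markup below is made up for illustration and is not the actual form from my post.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML.

    Well-formedness is necessary for valid XHTML, but this does not
    check the fragment against the XHTML DTD.
    """
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical form markup, for illustration only.
unclosed = '<form action="/search"><input type="text" name="q"></form>'
closed = '<form action="/search"><input type="text" name="q" /></form>'

print(is_well_formed(unclosed))  # False: <input> is never closed
print(is_well_formed(closed))    # True: <input ... /> is self-closed
```

A full check would still need a proper validator such as the W3C's; this just catches the unclosed-tag kind of slip.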

Next, I tried breaking C down further into two separate bits - just the form (with a few links), and the rest. Guess what? The post with just the form was fine! It was the post with the rest of the content of C that wouldn't show up on the tag pages. I didn't expect that, because other problem posts I've had in the past have often contained forms and I was really wondering if that was it. Just to be sure, I tried posting C without the form, again. Same thing - a no show.

Finally, on the XHTML validation front (yet again), my template has some warnings, but then those are common to all my posts including A, B and D, which did get picked up. So, ignoring validation issues with the template, the only thing left wrong with the body of that post is that I didn't include "alt" attributes for the images. I reposted that post (C without the form), this time with blank alt attributes (on the basis that another post with blank alt attributes for images did get displayed properly on Technorati's tag pages). And that post was again missing in action from the Improbulus tag page (and indeed other relevant tag pages, e.g. Consuming Experience), even though a subsequent post (on ID cards) did show up. So it can't be the XHTML validity (or rather invalidity) of the main body of the post. I'd also mention that, unlike other people, lately I've had no problems with pinging Technorati - I checked and they correctly showed when my blog was last updated for that particular post; they just didn't show that post on their tag pages.

Now, I'm completely stumped. Whatever Technorati's system doesn't like is, in my case at least, clearly something to do with the problem C bit - whenever I post ANYthing containing that bit (whether the original post, the extract I've called C, C with tweaked form, C without any form, C with no form and with alt attributes for the images), that post just doesn't appear on Technorati's tag pages. Whereas all the other sections of last week's post (A, B and D above) displayed fine when posted separately. (Someone with lots of time could break the problem bit of C down further into paragraphs and post those separately too, then break those down further, in order to pin down exactly which bit of C it is that Technorati's system chokes on - but that someone will not be me...!)
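For that someone with lots of time, the break-it-down-further idea is essentially a halving search. Here is a minimal sketch in Python, assuming exactly one paragraph is at fault and that any post containing it fails to appear. The shows_up function is hypothetical: in real life it means "post these paragraphs as a test post, wait, and look at the tag page by hand", so the stand-in below is only there to make the sketch runnable.

```python
from typing import Callable, Sequence


def find_problem_paragraph(
    paragraphs: Sequence[str],
    shows_up: Callable[[Sequence[str]], bool],
) -> int:
    """Return the index of the one paragraph that keeps a post off the tag page.

    Assumes a single culprit, and that any test post containing it fails.
    """
    lo, hi = 0, len(paragraphs)  # the culprit is somewhere in paragraphs[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if shows_up(paragraphs[lo:mid]):
            lo = mid  # first half is fine, so the culprit is in the second half
        else:
            hi = mid  # first half failed, so the culprit is in it
    return lo


# Stand-in for the manual check: pretend paragraph "p4" is the culprit.
paragraphs = ["p0", "p1", "p2", "p3", "p4", "p5"]


def fake_shows_up(chunk: Sequence[str]) -> bool:
    return "p4" not in chunk


print(find_problem_paragraph(paragraphs, fake_shows_up))  # 4
```

Each round costs one test post, so the number of test posts grows with the logarithm of the number of paragraphs rather than with the number itself. If more than one paragraph is at fault, this sketch will only find one of them.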

What I'm puzzled about is, that section doesn't contain anything out of the ordinary; it's just text, some links, a couple of images. Why should that be a problem? I really have no idea. Well, are there maybe certain words their system doesn't like? I don't think they consciously deploy a censor (though I did briefly wonder, in the case of my long post about female sexuality which didn't get picked up on their tag pages!). If they did have a censorship mechanism it would surely filter out the post for all purposes (like when using their standard search), not just on their tag pages; and besides, from what I've heard David Sifry is not the sort of man who would brook censorship on Technorati.

Technorati: what's going wrong?

This part is really for Technorati to figure out, of course, but it seems to me that when a post isn't on the right tag pages on Technorati, the possibilities are:
  • No Technorati Crawl - Technorati's spider is skipping that post somehow
  • No Post/Tag Association - the post is indexed (and you can find it on a simple full text search on Technorati), but it's not being associated on Technorati's system with some or all of the right tags (e.g. it's not been stored in their tags database in the right place or at all, depending on how Technorati do it)
  • "Recovery" Issue - the post is indexed, it's associated with the right tags, but when you go to the relevant tag page(s), whatever's behind the scenes is not fetching back the right information
  • Tag Pages Wrong - the post is indexed, it's associated with the right tags, when you go to the relevant tag page(s) the right info is returned, but whatever is responsible for displaying the tag page just isn't showing it properly
  • Something Else - again, a combination of some of the above, or something I haven't thought of.
Whatever the problem is, in my case at least, I'm sure it's not "No Technorati Crawl" because my posts which are missing from the tag pages do show up on the basic search results pages on doing a search (e.g. this search, which I've made a bit more complex just to pick out some of those test posts of mine that aren't on the tag pages but clearly can be found just by searching).
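The reasoning above can be written down as a tiny decision function. This is only a sketch of what an outsider can tell from two checks (is the post found by a full-text search, and is it on the tag page); the function name and labels are mine, not anything Technorati have published, and the last three possibilities can't be told apart from the outside.

```python
from typing import List


def remaining_possibilities(found_by_search: bool, on_tag_page: bool) -> List[str]:
    """Narrow down the list of possibilities using two outside observations."""
    if on_tag_page:
        return []  # nothing wrong as far as the tag page goes
    if not found_by_search:
        # The post isn't indexed at all, which points at the crawl.
        return ["No Technorati Crawl", "Something Else"]
    # Indexed but missing from the tag page: the crawl is ruled out,
    # but an outsider can't distinguish the rest.
    return [
        "No Post/Tag Association",
        '"Recovery" Issue',
        "Tag Pages Wrong",
        "Something Else",
    ]


# My case: the test posts are found by search but are not on the tag pages.
print(remaining_possibilities(found_by_search=True, on_tag_page=False))
```

Telling the remaining possibilities apart needs a look inside Technorati's own system, which is why this part is for them to figure out.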


So, it's over to Technorati now. I know that for many people the tag pages problem may be different from mine, and that following the guidance given by Technorati (including on validation and the rel="bookmark" point), or else contacting Technorati support, may help get their blogs picked up by Technorati. (By the way, their new customer support specialist Janice Myint also has her own blog which offers unofficial Technorati help, as I saw via this Blogher post.) But, in the case of my blog, I think you'll agree from all the above that my issue has to be something else entirely.

I really don't think there's anything more useful that I can do, having established that it's not the form or missing alt attributes in images, but that there is some consistency in what things their system doesn't like. It's down to Technorati to work out what that could be, and why, and hopefully fix this ongoing issue - not just in the case of the problem C extract but hopefully also in the case of other posts, from whatever blog.

And I hope that if it's something to do with the content of the post or the underlying code for a post, which certainly seems to be an issue from my tests, they will either sort the issue out internally or else share with us what that thing is, so that we can avoid including in our future posts anything that could result in our posts not appearing on Technorati's tag pages.

This issue has been plaguing Technorati continually for over a year now to my knowledge, i.e. ever since they pioneered blog post tagging. While Technorati have their fans (and I am in fact one of them, despite the tag pages problems), it's clear that people have been getting Technoratty (if you'll forgive the pun) for some time now - even latterly, there is still dissatisfaction with Technorati: see e.g. this Blogher post and the comments on it, or this post - and for the sake of their continued credibility I hope Technorati will get to the bottom of this issue soon.

Update 13 March 2006: After I posted these results Dave Sifry the Technorati CEO emailed me to say they're on it - see this post; if people regularly report this problem to Technorati when they encounter it, it might help them fix it faster.



EGM said...

Awesome research. Thanks for doing this. I think it will be helpful for both Technorati (please read it Technorati folks!) and users alike. I've been reading your blog for a while now and I have finally gotten around to adding you to my blogroll. Please accept my apologies for taking so long. This stuff is too good not to read!

Improbulus said...

Thanks for your kind comment and for adding me to your blogroll, egm.

I've updated my post above to say that David Sifry, Technorati CEO, has emailed me - and they're on it. See this post.

Mark Vicuna said...

Excellent definitive research on this problem! My tags also do not appear. I've done everything possible on my side. So now I've emailed Technorati support and am looking for a quick resolution.

Improbulus said...

Thanks Mark! I think if you have tag problems and you've done everything right the only thing you can do is contact Technorati support ASAP.