<?xml version="1.0"?>

<rdf:RDF 
  xmlns="http://purl.org/rss/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
>

<channel rdf:about="http://simon.incutio.com/syndicate/searchengines/rss1.0">
  <title>Search Engines</title>
  <link>http://simon.incutio.com/</link>
  <description>Simon Willison's Search Engines category</description>
  <language>en-gb</language>
  <webMaster>simon@incutio.com</webMaster>
  <items>
    <rdf:Seq>
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/10/08/yahooNewsRSS" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/09/17/googleConspiracies" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/08/02/onMetadata" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/07/24/learnToSearch" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/07/01/timeSinceOnFeedster" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/06/16/timBrayOnSearch" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/05/01/feedster" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/28/moreFunWithSearch" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/25/siteSearch" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/13/av100RandomPictures" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/12/yahooSearchUsesCSS" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/07/moreOnTheNewYahoo" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/04/07/aNewYahoo" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/03/09/thirtyFiveYearOldCookies" />
      <rdf:li rdf:resource="http://simon.incutio.com/archive/2003/03/08/roogle" />
    </rdf:Seq>
  </items>
</channel>

<item rdf:about="http://simon.incutio.com/archive/2003/10/08/yahooNewsRSS">
  <title>Yahoo News Search RSS feeds</title>
  <description>&lt;p&gt;It's not a new idea (&lt;a href=&quot;http://www.feedster.com/&quot;&gt;Feedster&lt;/a&gt; has been doing it for a while) but it's a first for a major search engine: Yahoo are &lt;a href=&quot;http://jeremy.zawodny.com/blog/archives/001001.html&quot; title=&quot;Yahoo! News Search via RSS&quot;&gt;now offering&lt;/a&gt; &lt;acronym title=&quot;Really Simple Syndication&quot;&gt;RSS&lt;/acronym&gt; feeds of the results of searches within Yahoo news. The feeds are advertisement free, probably because you have to click through to the news stories to read them in full. I wonder how long it will be before someone starts offering custom feeds like this with advertising in the feed itself? As &lt;acronym title=&quot;Really Simple Syndication&quot;&gt;RSS&lt;/acronym&gt; is an &lt;acronym title=&quot;eXtensible Markup Language&quot;&gt;XML&lt;/acronym&gt; format, parsing out adverts before they get to the user is a much more obvious step than ad-blockers in web browsers.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/10/08/yahooNewsRSS</link>
  <dc:subject>Search Engines, RSS and Syndication</dc:subject>
  <dc:date>2003-10-08T00:29:46-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/09/17/googleConspiracies">
  <title>Google conspiracy theories</title>
  <description>&lt;p&gt;Microdoc News have a &lt;a href=&quot;http://microdoc-news.info/home/BloggerNews/2003/09/15.html/1&quot; title=&quot;The Bias Towards Blogs in Search Engines&quot;&gt;poorly researched story&lt;/a&gt; suggesting that Google have been engineering their search results to favour their own properties:&lt;/p&gt;

&lt;blockquote cite=&quot;http://microdoc-news.info/home/BloggerNews/2003/09/15.html/1&quot;&gt;&lt;p&gt;It could be argued that the most important site that should appear when searching for the word &lt;em&gt;blogs&lt;/em&gt; would be the generic site where anyone with a blog can get listed for her/his three minutes of fame, which includes any blog, anywhere in any system. Weblogs.com is a directory of sorts to any current post and is like, if you please, a central nervous system to the world of blogs. However, Google does not list weblogs.com as the primary site -- blogger.com is listed as the prime, first-up site in the listings that result from the blogs search. Is that because Google Inc., owns blogger.com, or is it that blogger.com is really what one would expect as the first result?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Time to break out Python again. I won't explain the following code in detail, but essentially it downloads the &lt;acronym title=&quot;HyperText Markup Language&quot;&gt;HTML&lt;/acronym&gt; source of the front pages of both &lt;a href=&quot;http://www.blogger.com/&quot;&gt;Blogger.com&lt;/a&gt; and &lt;a href=&quot;http://www.weblogs.com/&quot;&gt;Weblogs.com&lt;/a&gt;, strips out the &lt;acronym title=&quot;HyperText Markup Language&quot;&gt;HTML&lt;/acronym&gt; tags (defined as anything between two angle brackets) and counts the number of occurrences of the individual word 'blogs'.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&amp;gt;&amp;gt;&amp;gt; import urllib, re
&amp;gt;&amp;gt;&amp;gt; striptags = re.compile('&amp;lt;[^&amp;gt;]+&amp;gt;')
&amp;gt;&amp;gt;&amp;gt; blogs = re.compile(r'\bblogs\b', re.I)
&amp;gt;&amp;gt;&amp;gt; blogger = urllib.urlopen('http://www.blogger.com/').read()
&amp;gt;&amp;gt;&amp;gt; weblogs = urllib.urlopen('http://www.weblogs.com/').read()
&amp;gt;&amp;gt;&amp;gt; len(blogger), len(weblogs)
(26369, 394323)
&amp;gt;&amp;gt;&amp;gt; blogs.findall(striptags.sub('', blogger))
['blogs', 'blogs', 'blogs', 'Blogs']
&amp;gt;&amp;gt;&amp;gt; blogs.findall(striptags.sub('', weblogs))
['Blogs', 'Blogs', 'blogs']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above code shows that while Blogger.com mentions the word 'blogs' four times in 26,000 characters, Weblogs.com only mentions it three times in 394,000 characters! Blogger has a far higher 'blogs' word density - in fact, the word only occurs on Weblogs.com when it happens to be part of the name of one of the several thousand blogs listed on the page at any one time.&lt;/p&gt;
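
&lt;p&gt;As a rough check on that comparison, the density arithmetic can be sketched in a few lines of Python. This is only an illustration built from the counts and page lengths reported in the session above; the helper name density_per_thousand is made up for this example, and the lengths include markup, since that is what len() measured.&lt;/p&gt;

```python
def density_per_thousand(occurrences, total_chars):
    """Occurrences of a word per 1,000 characters of page source."""
    return 1000.0 * occurrences / total_chars

# Counts and lengths taken from the interpreter session above.
blogger_density = density_per_thousand(4, 26369)
weblogs_density = density_per_thousand(3, 394323)

# How many times denser Blogger.com's front page is in the word 'blogs'.
ratio = blogger_density / weblogs_density

print(round(blogger_density, 3), round(weblogs_density, 4), round(ratio, 1))
```

&lt;p&gt;On those figures Blogger.com's front page comes out roughly twenty times denser in the word 'blogs' than Weblogs.com.&lt;/p&gt;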

&lt;p&gt;Although word density is a reasonably useful metric for telling if Google will like something, everyone knows that Google's secret sauce is PageRank, which is based in part on the number of pages linking to a site. &lt;a href=&quot;http://www.google.com/search?q=link:www.weblogs.com&quot; title=&quot;Searched for pages linking to www.weblogs.com&quot;&gt;Two&lt;/a&gt; &lt;a href=&quot;http://www.google.com/search?q=link:www.blogger.com&quot; title=&quot;Searched for pages linking to www.blogger.com&quot;&gt;quick&lt;/a&gt; link: searches reveal 7,840 links to Weblogs.com, but a whopping 61,500 links to Blogger.com (no doubt helped by all those &quot;powered by blogger&quot; stickers).&lt;/p&gt;

&lt;p&gt;So Blogger.com not only has a higher word density for the designated search term, it also has far more links to it overall. Is it really so surprising that it's coming out on top?&lt;/p&gt;

&lt;p&gt;Furthermore, if you run a search for 'weblogs', Weblogs.com comes out as the &lt;a href=&quot;http://www.google.com/search?q=weblogs&quot;&gt;number one result&lt;/a&gt;. It's all in the name.&lt;/p&gt;

&lt;p&gt;Dave Winer &lt;a href=&quot;http://scriptingnews.userland.com/2003/09/16#When:2:48:50PM&quot; title=&quot;Scripting News, 16th September 2003&quot;&gt;finds it strange&lt;/a&gt; that the &lt;a href=&quot;http://google.blogspot.com/&quot;&gt;Google Weblog&lt;/a&gt; (unaffiliated with Google the company) comes out as the first result in a &lt;a href=&quot;http://www.google.com/search?q=weblog&quot;&gt;search for 'weblog'&lt;/a&gt;. My guess is that this is a result of the blog's name influencing the text of links made to it - when you link to &lt;a href=&quot;http://doc.weblogs.com/&quot;&gt;Doc Searls&lt;/a&gt; or myself (both of whom have 'weblog' in their site title) you can abbreviate it to &quot;Doc Searls&quot; or &quot;Simon Willison&quot;, but when you link to the Google Weblog you &lt;em&gt;have&lt;/em&gt; to use the fully qualified name or your link won't make sense. Google can be strongly affected by link text, as last year's &lt;a href=&quot;http://www.wordspy.com/words/Googlebombing.asp&quot; title=&quot;The Word Spy: Google bombing&quot;&gt;Google bombing&lt;/a&gt; epidemic aptly demonstrated.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/09/17/googleConspiracies</link>
  <dc:subject>Google, Search Engines</dc:subject>
  <dc:date>2003-09-17T00:51:40-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/08/02/onMetadata">
  <title>On Metadata</title>
  <description>&lt;p&gt;Tim Bray's series On Search now has a &lt;a href=&quot;http://tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC&quot; title=&quot;On Search, the Series&quot;&gt;table of contents page&lt;/a&gt; linking to each of the previous entries. The &lt;a href=&quot;http://tbray.org/ongoing/When/200x/2003/07/29/SearchMeta&quot; title=&quot;On Search: Metadata&quot;&gt;most recent article&lt;/a&gt; covers metadata, and includes some insightful commentary on the huge problem of gathering metadata from users in the first place.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/08/02/onMetadata</link>
  <dc:subject>Information Architecture, Search Engines</dc:subject>
  <dc:date>2003-08-02T21:06:50-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/07/24/learnToSearch">
  <title>Learn to search!</title>
  <description>&lt;p&gt;Slate: &lt;a href=&quot;http://slate.msn.com/id/2085668/&quot;&gt;Digging for Googleholes&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite=&quot;http://slate.msn.com/id/2085668/&quot;&gt;&lt;p&gt;
Type in the make and model of a new DVD player, and you'll get dozens of online electronic stores in the top results, all of them eager to sell you the item. But you have to burrow through the results to find an impartial product review that doesn't appear in an online catalog.
&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://www.google.com/search?q=sony+DVP-S550D&quot;&gt;sony DVP-S550D&lt;/a&gt; - shopping sites come out top&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.google.com/search?q=sony+DVP-S550D+review&quot;&gt;sony DVP-S550D review&lt;/a&gt; - review sites come out top&lt;/p&gt;

&lt;blockquote cite=&quot;http://slate.msn.com/id/2085668/&quot;&gt;&lt;p&gt;
Search for &quot;apple&quot; on Google, and you have to troll through a couple pages of results before you get anything not directly related to Apple Computer - and it's a page promoting a public TV show called Newton's Apple. After that it's all Mac-related links until Fiona Apple's home page. You have to sift through 50 results before you reach a link that deals with apples that grow on trees: the home page for the Washington State Apple Growers Association.
&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://www.google.com/search?q=apple&quot;&gt;apple&lt;/a&gt; - lots of stuff about Apple computers&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.google.com/search?q=apple+fruit&quot;&gt;apple fruit&lt;/a&gt; - lots of stuff about Apples, the fruit&lt;/p&gt;

&lt;p&gt;These are not even advanced search techniques. It's a basic rule of searching: if your first set of results aren't what you are looking for, enter more specific terms and try again.&lt;/p&gt;

&lt;blockquote cite=&quot;http://slate.msn.com/id/2085668/&quot;&gt;&lt;p&gt;
So, when you're doing research online, Google is implicitly pushing you toward information stored in articles and away from information stored in books. Assuming this practice continues, and assuming that Google continues to grow in influence, we may find ourselves in a world where, if you want to get an idea into circulation, you're better off publishing a PDF file on the Web than landing a book deal.
&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I'd say that day has already come (but replace &lt;acronym title=&quot;Portable Document Format&quot;&gt;PDF&lt;/acronym&gt; with &lt;acronym title=&quot;HyperText Markup Language&quot;&gt;HTML&lt;/acronym&gt;), but I'm not sure I understand how this is a bad thing. Surely information is more valuable if it is searchable? Books are not going to die out because of the internet (how many people prefer reading from a screen?) but if you have an idea to share, the internet is obviously a better medium - you reach millions more people for a fraction of the cost of traditional publishing.&lt;/p&gt;

&lt;p&gt;There are a lot of legitimate concerns about Google relating to its size and massive influence over the web's traffic, but concerns about skewed results are often the fault of the user rather than the tool. &lt;a href=&quot;http://www.google.com/help/basics.html&quot;&gt;Learn to search&lt;/a&gt;!&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/07/24/learnToSearch</link>
  <dc:subject>Google, Rants, Search Engines</dc:subject>
  <dc:date>2003-07-24T16:17:35-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/07/01/timeSinceOnFeedster">
  <title>time_since() on Feedster</title>
  <description>&lt;p&gt;This is pretty cool: Scott's taken Nat's &lt;a href=&quot;http://blog.natbat.co.uk/archive/2003/Jun/14/time_since&quot; title=&quot;The time_since() function&quot;&gt;time-since function&lt;/a&gt; and &lt;a href=&quot;http://radio.weblogs.com/0103807/2003/06/26.html#a1823&quot; title=&quot;I Did a Bad Thing and I'm Sorry&quot;&gt;added it to Feedster&lt;/a&gt;, giving a quick indication of how long ago an item was posted.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/07/01/timeSinceOnFeedster</link>
  <dc:subject>Search Engines</dc:subject>
  <dc:date>2003-07-01T23:20:17-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/06/16/timBrayOnSearch">
  <title>Tim Bray on search</title>
  <description>&lt;p&gt;I love it when bloggers stick to their word. The other day, while &lt;a href=&quot;http://www.tbray.org/ongoing/When/200x/2003/06/13/PerfectWeb&quot; title=&quot;Antibiotic Days&quot;&gt;describing a quick Perl hack&lt;/a&gt; that really impressed a major client a few years ago, Tim Bray mentioned the following:&lt;/p&gt;

&lt;blockquote cite=&quot;http://www.tbray.org/ongoing/When/200x/2003/06/13/PerfectWeb&quot;&gt;&lt;p&gt;
Then I turned on Microsoft's search engine, at that time called Index Server, now I believe called Index Services, which is a pretty nice tool (we don't have the equivalent in the Open Source world, more on that another time).
&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;And sure enough, he's just posted the &lt;a href=&quot;http://www.tbray.org/ongoing/When/200x/2003/06/15/OnSearch&quot; title=&quot;On Search: Backgrounder&quot;&gt;first in a series&lt;/a&gt; of essays on full-text search. Go read it: it's really interesting stuff. Tim's conclusion is:&lt;/p&gt;

&lt;blockquote cite=&quot;http://www.tbray.org/ongoing/When/200x/2003/06/15/OnSearch&quot;&gt;&lt;p&gt;
What we need is for Apache to come out-of-the-box with a built-in search capability that you just push a button and it works, and it's fast, and doesn't need much care and feeding, and it's internationalized, and it has the right API for when you want to get fancy.
&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Until that happens, I will happily recommend MySQL's built in fulltext search indexing for quickly adding a relatively powerful search facility to a site. I &lt;a href=&quot;http://simon.incutio.com/archive/2003/04/25/siteSearch&quot; title=&quot;Site search finally available&quot;&gt;use it on this blog&lt;/a&gt; and my only real criticism is that it insists on search words of at least 4 letters, which is less than ideal when most of your entries include &lt;acronym title=&quot;Three Letter Acronyms&quot;&gt;TLA&lt;/acronym&gt;s. Hopefully they'll provide a way around this limitation in a future release.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/06/16/timBrayOnSearch</link>
  <dc:subject>Search Engines</dc:subject>
  <dc:date>2003-06-16T15:37:43-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/05/01/feedster">
  <title>Feedster AND searching</title>
  <description>&lt;p&gt;&lt;a href=&quot;http://www.feedster.com/&quot;&gt;Feedster&lt;/a&gt; finally &lt;a href=&quot;http://radio.weblogs.com/0103807/2003/04/30.html#a1614&quot; title=&quot;Two Big Feedster Changes ...&quot;&gt;supports AND&lt;/a&gt; as the default search operator. This is a very good thing. I've decided to leave this site's &lt;a href=&quot;http://simon.incutio.com/archive/2003/04/25/siteSearch&quot; title=&quot;Site search finally available&quot;&gt;search engine&lt;/a&gt; using OR, mainly because I feel that for a small search set (approximately a thousand entries) more search results are better than fewer, and because the relevancy algorithm used by MySQL to order the results appears to be working extremely well. For large data sets such as Feedster or Google I definitely prefer to only see results containing all of my search terms.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/05/01/feedster</link>
  <dc:subject>Information Architecture, Search Engines</dc:subject>
  <dc:date>2003-05-01T15:12:16-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/28/moreFunWithSearch">
  <title>More fun with Search</title>
  <description>&lt;p&gt;While browsing around my &lt;code&gt;phoenix/&lt;/code&gt; directory I spotted a sub-directory called &lt;code&gt;searchplugins&lt;/code&gt;, which appears to control the list of search engines available in the very useful search box at the top right corner of the browser. A bit of digging later and it turns out that adding new search engines to Mozilla based browsers is remarkably easy: &lt;a href=&quot;http://www.mozilla.org/projects/search/&quot;&gt;The Mozilla Search Project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm a sucker for new toys, so &lt;a href=&quot;/search&quot; onclick=&quot;if ((typeof window.sidebar == 'object') &amp;amp;&amp;amp; (typeof window.sidebar.addSearchEngine == 'function')) { window.sidebar.addSearchEngine('http://simon.incutio.com/simon.src','http://simon.incutio.com/simon.gif','simon.incutio.com','Web') } else {alert('This feature is only available in Mozilla based browsers')} return false;&quot;&gt;click here&lt;/a&gt; to add the &lt;code&gt;simon.incutio.com&lt;/code&gt; search engine to your (Mozilla or Firebird) browser :)&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/28/moreFunWithSearch</link>
  <dc:subject>Mozilla, Search Engines</dc:subject>
  <dc:date>2003-04-28T19:58:18-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/25/siteSearch">
  <title>Site search finally available</title>
  <description>&lt;p&gt;I've finally got around to adding a &lt;a href=&quot;/search&quot;&gt;search page&lt;/a&gt; to this site. It uses MySQL's &lt;a href=&quot;http://www.mysql.com/doc/en/Fulltext_Search.html&quot;&gt;full text indexing&lt;/a&gt;, which is extremely fast and provides good results but comes at the expense of flexibility. Search terms less than 4 letters long are ignored, and multi-word searches are handled using OR rather than AND. This nearly put me off using it, but the relevancy algorithm is excellent which I think outweighs the disadvantage of not being able to use pure AND queries.&lt;/p&gt;
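
&lt;p&gt;To make those two constraints concrete, here is a small Python sketch that mimics them. It is not how MySQL implements full text search, and it ignores relevancy ranking entirely; MIN_WORD_LENGTH, usable_terms, matches_or and matches_and are names invented for this illustration, and the sample entry text is made up.&lt;/p&gt;

```python
MIN_WORD_LENGTH = 4  # assumed to mirror the 4 letter minimum described above

def usable_terms(query):
    """Lower-case the query and drop words shorter than the minimum."""
    return [w for w in query.lower().split() if len(w) >= MIN_WORD_LENGTH]

def matches_or(query, text):
    """True if the text contains at least one usable search term."""
    words = set(text.lower().split())
    return any(term in words for term in usable_terms(query))

def matches_and(query, text):
    """True only if the text contains every usable search term."""
    terms = usable_terms(query)
    words = set(text.lower().split())
    return bool(terms) and all(term in words for term in terms)

# Hypothetical entry text, used only to exercise the functions.
entry = "yahoo search uses css for layout"

print(usable_terms("CSS layout"))            # the three letter term is dropped
print(matches_or("google layout", entry))    # one matching term is enough
print(matches_and("google layout", entry))   # every term must be present
```

&lt;p&gt;With OR semantics a query only needs one usable term to hit an entry, which is why a small archive returns more results that way; the AND variant rejects the same query unless every usable term appears.&lt;/p&gt;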

&lt;p&gt;MySQL 4.0 introduces far more powerful boolean mode full text searches which allow all kinds of modifiers and extra syntax, but this site currently runs on 3.23.54 so I can't play with those just yet. Jeremy Zawodny's &lt;a href=&quot;http://www.linux-mag.com/2003-01/mysql_03.html&quot; title=&quot;http://www.linux-mag.com/2003-01/mysql_03.html&quot;&gt;article on MySQL 4&lt;/a&gt; explains boolean mode and describes many other exciting new MySQL features as well.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/25/siteSearch</link>
  <dc:subject>Content Management, Search Engines</dc:subject>
  <dc:date>2003-04-25T16:55:02-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/13/av100RandomPictures">
  <title>100 random pictures</title>
  <description>&lt;p&gt;&lt;a href=&quot;http://ga2so.com/random2.php&quot;&gt;100 random AltaVista pictures&lt;/a&gt; is fascinating, if not guaranteed work-safe.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/13/av100RandomPictures</link>
  <dc:subject>Search Engines</dc:subject>
  <dc:date>2003-04-13T14:11:39-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/12/yahooSearchUsesCSS">
  <title>Yahoo Search uses CSS</title>
  <description>&lt;p&gt;In all the fuss about Yahoo's &lt;a href=&quot;http://simon.incutio.com/archive/2003/04/07/#moreOnTheNewYahoo&quot; title=&quot;More on the new Yahoo&quot;&gt;new search interface&lt;/a&gt; over the past few days, the extensive use of &lt;acronym title=&quot;Cascading Style Sheets&quot;&gt;CSS&lt;/acronym&gt; in the &lt;a href=&quot;http://search.yahoo.com/search?p=example+search&quot; title=&quot;An example search&quot;&gt;results pages&lt;/a&gt; was almost completely overlooked, probably because the page still contains a small layout table for the top and bottom navigation. The results themselves are served up as a styled ordered list, at least for modern browsers (thanks to a server side browser sniffer). More information in &lt;a href=&quot;http://archivist.incutio.com/viewlist/css-discuss/24505&quot; title=&quot;[css-d] CSS-powered Yahoo! Search&quot;&gt;this message&lt;/a&gt; from Yahoo's Brian Ghidinelli, who is seeking feedback.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/12/yahooSearchUsesCSS</link>
  <dc:subject>[X]HTML and CSS, Search Engines</dc:subject>
  <dc:date>2003-04-12T20:11:37-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/07/moreOnTheNewYahoo">
  <title>More on the new Yahoo</title>
  <description>&lt;p&gt;Unsurprisingly, the &lt;a href=&quot;http://new.search.yahoo.com/&quot;&gt;new Yahoo&lt;/a&gt; is generating a whole load of commentary. There's a good thread going on &lt;a href=&quot;http://www.37signals.com/svn/archives/000146.php&quot; title=&quot;New Yahoo vs. Old Google&quot;&gt;Signals vs Noise&lt;/a&gt;, and &lt;a href=&quot;http://iaslash.org/node.php?id=7325&quot; title=&quot;New Yahoo! Search debuts&quot;&gt;ia/&lt;/a&gt; has coverage as well. I've been playing with it a bit and it's definitely an immense improvement on the current Yahoo, although it's still not quite as usable or responsive as Google. I also noticed that the search results are &lt;em&gt;exactly&lt;/em&gt; the same as Google's (even for image search) so it looks like Yahoo haven't switched over to Inktomi just yet.&lt;/p&gt;

&lt;p&gt;It's worth clicking through the &lt;a href=&quot;http://search.yahoo.com/tour&quot;&gt;tour&lt;/a&gt; to get an overview of the new interface. The &quot;open in new window&quot; icon for each search result is a clever addition, but the smartest feature in my opinion is the specialised Yahoo shortcuts. &lt;samp&gt;mail!&lt;/samp&gt; takes you to Yahoo mail, &lt;samp&gt;calendar!&lt;/samp&gt; to Yahoo calendar and &lt;a href=&quot;http://search.yahoo.com/new_search_tour/yahoolist.html&quot; title=&quot;Full list of Yahoo shortcuts&quot;&gt;so on&lt;/a&gt;. It looks like Yahoo are hoping to outdo Google by capitalising on their many other services, which seems like a very sensible approach.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/07/moreOnTheNewYahoo</link>
  <dc:subject>Information Architecture, Search Engines</dc:subject>
  <dc:date>2003-04-07T21:47:43-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/04/07/aNewYahoo">
  <title>A new Yahoo</title>
  <description>&lt;p&gt;New York Times: &lt;a href=&quot;http://www.nytimes.com/2003/04/07/technology/07YAHO.html?ex=1050292800&amp;amp;en=821ae8a3ad2b7af3&amp;amp;ei=5062&amp;amp;partner=GOOGLE&quot;&gt;Yahoo Plans Improvements in Effort to Regain Lost Ground&lt;/a&gt;. I'm guessing &lt;a href=&quot;http://new.search.yahoo.com/&quot; title=&quot;Uncluttered Yahoo interface&quot;&gt;this&lt;/a&gt; is what it's going to look like (via &lt;a href=&quot;http://lists.evolt.org/&quot;&gt;thelist&lt;/a&gt;).&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/04/07/aNewYahoo</link>
  <dc:subject>Google, Search Engines</dc:subject>
  <dc:date>2003-04-07T18:25:26-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/03/09/thirtyFiveYearOldCookies">
  <title>Thirty five year old cookies</title>
  <description>&lt;p&gt;I'm finding myself slightly confused about the Google backlash washing around the blogosphere, which is summarised quite well by &lt;a href=&quot;http://www.gavinsblog.com/2003/03/06.html#a117&quot;&gt;Gavin Sheridan&lt;/a&gt;. Most of the arguments against using Google unsurprisingly centre around privacy issues, in particular the &quot;35 year cookie&quot;. I was under the impression that cookies could only be set for a maximum of a year, but having checked &lt;a href=&quot;http://wp.netscape.com/newsref/std/cookie_spec.html&quot;&gt;Netscape's Cookie Specification&lt;/a&gt; and &lt;a href=&quot;ftp://ftp.rfc-editor.org/in-notes/rfc2965.txt&quot; title=&quot;HTTP State Management Mechanism&quot;&gt;RFC 2965&lt;/a&gt; it appears I was mistaken.&lt;/p&gt;

&lt;p&gt;So let's take a look at the cookies in question, via the Mozilla project's handy &lt;a href=&quot;http://webtools.mozilla.org/web-sniffer/view.cgi?url=http%3A%2F%2Fwww.google.com/&quot; title=&quot;View http://www.google.com/&quot;&gt;Web Sniffer utility&lt;/a&gt; (the front page for this tool is &lt;a href=&quot;http://webtools.mozilla.org/web-sniffer/&quot; title=&quot;View HTTP and HTML Source&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote cite=&quot;http://webtools.mozilla.org/web-sniffer/view.cgi?url=http%3A%2F%2Fwww.google.com/&quot;&gt;&lt;p&gt;&lt;code&gt;
HTTP/1.0 200 OK&lt;br /&gt;
Content-Length: 3403&lt;br /&gt;
Connection: Keep-Alive&lt;br /&gt;
Server: GWS/2.0&lt;br /&gt;
Date: Sun, 09 Mar 2003 14:34:32 GMT&lt;br /&gt;
Content-Type: text/html&lt;br /&gt;
Cache-control: private&lt;br /&gt;
&lt;strong&gt;Set-Cookie: PREF=ID=05ba0c124de8df6e:TM=1047220472:LM=1047220472:S=Ke2RQCqjCEowS1x-; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com&lt;/strong&gt;
&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;There it is - a 35 year cookie. Now let's take a look at some of Google's competitors.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.alltheweb.com/&quot;&gt;AllTheWeb&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite=&quot;http://webtools.mozilla.org/web-sniffer/view.cgi?url=http%3A%2F%2Fwww.alltheweb.com/&quot;&gt;&lt;p&gt;&lt;code&gt;
HTTP/1.1 200 OK&lt;br /&gt;
Date: Sun, 09 Mar 2003 14:36:42 GMT&lt;br /&gt;
Server: Apache/1.3.27 (Unix) PHP/4.2.3-atw&lt;br /&gt;
&lt;strong&gt;Set-Cookie: atw-uid=CgVSBj5rUXoAAQnFAwSFAg==; path=/; domain=.alltheweb.com; expires=Sat, 09-Mar-13 02:36:42 GMT&lt;/strong&gt;&lt;br /&gt;
X-Powered-By: PHP/4.2.3-atw&lt;br /&gt;
Last-Modified: Sun, 09 Mar 2003 14:35:00 GMT&lt;br /&gt;
Expires: Thu, 19 Apr 2001 04:25:21 GMT&lt;br /&gt;
Cache-Control: max-age=0, private&lt;br /&gt;
&lt;strong&gt;Set-Cookie: PREF=frschk=1:_lm=1047220602; expires=Fri, 07-Mar-08 14:36:42 GMT; path=/&lt;/strong&gt;&lt;br /&gt;
Connection: close&lt;br /&gt;
Content-Type: text/html; charset=iso-8859-1
&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;That's two cookies - one for 5 years and one for 10 years. Interesting to see that they're using their own modified version of &lt;acronym title=&quot;PHP: Hypertext Preprocessor&quot;&gt;PHP&lt;/acronym&gt; 4.2.3 :)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.teoma.com/&quot;&gt;Teoma&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite=&quot;http://webtools.mozilla.org/web-sniffer/view.cgi?url=http%3A%2F%2Fwww.teoma.com/&quot;&gt;&lt;p&gt;&lt;code&gt;
HTTP/1.1 200 OK&lt;br /&gt;
Server: Microsoft-IIS/5.0&lt;br /&gt;
Date: Sun, 09 Mar 2003 14:38:50 GMT&lt;br /&gt;
Connection: Keep-Alive&lt;br /&gt;
Content-Length: 6629&lt;br /&gt;
Content-Type: text/html&lt;br /&gt;
&lt;strong&gt;Set-Cookie: CTST=yes; expires=Sun, 09-Mar-2003 15:03:50 GMT; path=/&lt;/strong&gt;&lt;br /&gt;
Cache-control: private
&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;That cookie lasts for about half an hour and doesn't contain a unique identifier. Plus they're running &lt;acronym title=&quot;Internet Information Services&quot;&gt;IIS&lt;/acronym&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.altavista.com/&quot;&gt;Altavista&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite=&quot;http://webtools.mozilla.org/web-sniffer/view.cgi?url=http%3A%2F%2Fwww.altavista.com/&quot;&gt;&lt;p&gt;&lt;code&gt;
HTTP/1.0 200 OK&lt;br /&gt;
&lt;strong&gt;Set-Cookie: AV_POS=pos=1047220999574; path=/; domain=.altavista.com;&lt;/strong&gt;&lt;br /&gt;
&lt;strong&gt;Set-Cookie: AV_USERKEY=AVS03b87123ae55d80a1c21250000022; expires=Tuesday, 31-Dec-2013 12:00:00 GMT; path=/; domain=altavista.com;&lt;/strong&gt;&lt;br /&gt;
Server: AV/1.0.1&lt;br /&gt;
MIME-Version: 1.0&lt;br /&gt;
Cache-Control: no-cache,no-store,max-age=0&lt;br /&gt;
pragma: no-cache&lt;br /&gt;
Expires: Sun, 09 Mar 2003 14:43:19 GMT&lt;br /&gt;
&lt;strong&gt;Set-Cookie: AV_MKT=1; Domain=altavista.com; Path=/; Expires=Thu, 01-Dec-1994 16:00:00 GMT&lt;/strong&gt;&lt;br /&gt;
Content-Type: text/html; charset=ISO-8859-1&lt;br /&gt;
Content-Length: 10020&lt;br /&gt;
Date: Sun, 09 Mar 2003 14:43:19 GMT
&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;What a mess! There's a session cookie (which only lasts until the browser is closed) recording what looks like the time I first visited the front page, a 10 year cookie with a unique ID and another cookie set to expire in 1994, possibly in an attempt to wipe out cookies set by an older version of the site.&lt;/p&gt;
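
&lt;p&gt;The lifetimes quoted in this entry can be double-checked by subtracting each response's Date header from the cookie's expires value. The following Python sketch does that for three of the cookies shown above; parse_http_date and lifetime_years are helper names invented for this example, the 365.25 day year is an approximation, and the month parsing assumes an English locale.&lt;/p&gt;

```python
from datetime import datetime

def parse_http_date(value):
    """Parse the date styles seen in the headers above (weekday ignored)."""
    rest = value.split(", ", 1)[1]
    formats = (
        "%d %b %Y %H:%M:%S GMT",   # Date header style: 09 Mar 2003
        "%d-%b-%Y %H:%M:%S GMT",   # cookie style, four digit year
        "%d-%b-%y %H:%M:%S GMT",   # cookie style, two digit year
    )
    for fmt in formats:
        try:
            return datetime.strptime(rest, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + value)

def lifetime_years(served, expires):
    """Cookie lifetime in years, using an average year of 365.25 days."""
    delta = parse_http_date(expires) - parse_http_date(served)
    return delta.total_seconds() / (365.25 * 24 * 3600)

# Header values copied from the Google, AllTheWeb and Teoma responses above.
google = lifetime_years("Sun, 09 Mar 2003 14:34:32 GMT",
                        "Sun, 17-Jan-2038 19:14:07 GMT")
alltheweb = lifetime_years("Sun, 09 Mar 2003 14:36:42 GMT",
                           "Sat, 09-Mar-13 02:36:42 GMT")
teoma = lifetime_years("Sun, 09 Mar 2003 14:38:50 GMT",
                       "Sun, 09-Mar-2003 15:03:50 GMT")

teoma_minutes = teoma * 365.25 * 24 * 60
print(round(google, 1), round(alltheweb, 1), round(teoma_minutes))
```

&lt;p&gt;On those inputs the Google cookie works out at just under 35 years, the longer AllTheWeb cookie at ten years and the Teoma cookie at 25 minutes.&lt;/p&gt;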

&lt;p&gt;So what have we learnt? Both AllTheWeb and Altavista set 10 year unique identifier cookies, while Teoma appears not to set any. At the end of the day though, what is the difference between a 10 year and a 35 year cookie? How many people are going to go a whole ten years without losing their browser's cookies, through a browser upgrade, PC upgrade, change of job or just wiping the cookie directory? The answer to that question is self-evident, so in practice a 10 year unique identifier cookie is just as big an invasion of privacy as a 35 year cookie.&lt;/p&gt;

&lt;p&gt;On the privacy front, AllTheWeb and Altavista are just as guilty as Google.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/03/09/thirtyFiveYearOldCookies</link>
  <dc:subject>Google, Online Issues, Search Engines</dc:subject>
  <dc:date>2003-03-09T14:58:49-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>
<item rdf:about="http://simon.incutio.com/archive/2003/03/08/roogle">
  <title>Roogle</title>
  <description>&lt;p&gt;Scott Johnson has put together a blog search engine with a difference: it indexes &lt;acronym title=&quot;Randomly Syndicated Something?&quot;&gt;RSS&lt;/acronym&gt; feeds rather than crawling the blogs themselves. &lt;a href=&quot;http://www.fuzzygroup.net/roogle/&quot;&gt;Roogle&lt;/a&gt; is still under heavy development (and Scott is &lt;a href=&quot;http://radio.weblogs.com/0103807/2003/03/07.html#a1434&quot; title=&quot;What 10 odd Hours of Hacking Can Produce: An RSS Search Engine&quot;&gt;blogging it&lt;/a&gt; as he goes) but is shaping up to be a very neat tool. If your blog isn't already being indexed, you can add it using &lt;a href=&quot;http://www.fuzzygroup.net/roogle/add.php&quot; title=&quot;Add RSS Feed&quot;&gt;this form&lt;/a&gt;.&lt;/p&gt;</description>
  <link>http://simon.incutio.com/archive/2003/03/08/roogle</link>
  <dc:subject>Blogging, Search Engines</dc:subject>
  <dc:date>2003-03-08T23:40:48-00:00</dc:date>
  <dc:creator>Simon Willison</dc:creator>
</item>

</rdf:RDF>