iFilters for SharePoint Search

Once you have your SharePoint server up and running and begin to build up a corpus of content, you may find that certain documents aren’t turning up in searches.  This may be due to not having the appropriate iFilters installed.  From the Wikipedia entry:

IFilters are plugins that allow the Windows Indexing Service and the newer Windows Desktop Search to index different various file formats so that they become searchable. Without an appropriate IFilter, contents of a file cannot be indexed.

Here are a few important ones.  If your organization uses a particular type of file frequently, make sure to look into the iFilter for it.  The Wikipedia article above has links to many free and for-pay iFilters.  Note that there may be different versions of iFilters for 32-bit vs. 64-bit operating systems, different file versions, etc., so be sure to get the right ones for your needs.

2007 Office System Converter: Microsoft Filter Pack – This download will install and register IFilters with the Windows Indexing Service. These IFilters are used by Microsoft Search products to index the contents of specific document formats. This Filter Pack includes IFilters for the following formats: .docx, .docm, .pptx, .pptm, .xlsx, .xlsm, .xlsb, .zip, .one, .vdx, .vsd, .vss, .vst, .vsx, and .vtx.

Adobe PDF IFilter v6.0 or Adobe PDF iFilter 9 for 64-bit platforms – Adobe® PDF iFilter is designed for end users or administrators who wish to index Adobe PDF documents using Microsoft indexing clients. This allows the user to easily search for text within Adobe PDF documents.

Displaying the First N Words of a Long Rich Text Column with XSL

When you want to display blog posts and announcements with DVWPs in your SharePoint Site Collection, you usually don’t want to display the full posts, but just enough to indicate what the item is about and to let the user know if they should click to see more.  An example might be showing the last 3 blog posts on your Home Page.  There isn’t an easy out-of-the-box way to do this.

For the following examples, let’s say that the @Body column contains the text: “The <em>quick</em> <span style="color: #a52a2a;">brown</span> fox jumped over the lazy dog.”, which renders like this: “The quick brown fox jumped over the lazy dog.”

One option is to use the ddwrt:Limit function.  This allows you to specify a number of characters to show, along with some text to append if the original text is longer than the limit you set.  So, for instance, ddwrt:Limit(string(@Body), 25, '…') would show the first 25 characters, followed by the '…' string if there are more than 25 characters in the @Body column.  However, since the @Body column usually contains some HTML markup, you usually don’t get what you really want (the tags are all counted as part of the number of characters).  With our example @Body text above, you’ll get “The <em>quick</em> <span …”, which isn’t even valid HTML since the <span> tag is cut off before it is complete.  Depending on the browser you are using, you’ll probably see something like “The quick”.
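
In a DVWP, that call would sit in an xsl:value-of along these lines.  This is only a minimal sketch: the template name BodyPreview.Limit is made up for illustration, and it assumes the ddwrt prefix is bound on the stylesheet element the way it is in the XSL that SharePoint Designer generates.  The ddwrt functions are supplied by SharePoint’s Data View runtime, so this won’t run in a standalone XSLT processor.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ddwrt="http://schemas.microsoft.com/WebParts/v2/DataView/runtime">

  <!-- Hypothetical template name; call it with a list item as the context node -->
  <xsl:template name="BodyPreview.Limit">
    <!-- First 25 characters of @Body (markup counted too), plus '...' if the text was longer -->
    <xsl:value-of select="ddwrt:Limit(string(@Body), 25, '...')" disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```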

So, the first thing you might want to do is to strip out all of the HTML.  The StripHTML XSL template below will do this for you.

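<xsl:template name="StripHTML">
  <xsl:param name="HTMLText"/>
  <xsl:choose>
    <!-- Keep going only while a '<' is followed somewhere by a '>';
         a stray '>' with no opening '<' no longer causes text to be dropped -->
    <xsl:when test="contains(substring-after($HTMLText, '&lt;'), '&gt;')">
      <xsl:call-template name="StripHTML">
        <!-- Text before the '<', plus text after the first '>' that follows it -->
        <xsl:with-param name="HTMLText" select="concat(substring-before($HTMLText, '&lt;'), substring-after(substring-after($HTMLText, '&lt;'), '&gt;'))"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$HTMLText"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>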

Once you have the HTML stripped out, the ddwrt:Limit function will do what you want, but the text will probably be cut off mid-word.  Looking at our example @Body text again, the StripHTML template will return “The quick brown fox jumped over the lazy dog.”, which with the ddwrt:Limit function above will look like “The quick brown fox jumpe…”
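
Putting the two together might look something like the fragment below, which would live inside your row template.  Treat it as a sketch: the variable name PlainBody is just an illustration, and it assumes the StripHTML template above is available in the stylesheet and that the ddwrt prefix is already bound.

```xml
<!-- Capture the output of StripHTML in a variable (PlainBody is a hypothetical name) -->
<xsl:variable name="PlainBody">
  <xsl:call-template name="StripHTML">
    <xsl:with-param name="HTMLText" select="@Body"/>
  </xsl:call-template>
</xsl:variable>

<!-- Limit now counts only visible characters, but it can still cut a word in half -->
<xsl:value-of select="ddwrt:Limit(string($PlainBody), 25, '...')"/>
```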

So, an even better solution is to first strip out the HTML and then return a specific word count.  The FirstNWords XSL template below takes care of this for you.

<xsl:template name="FirstNWords">
  <xsl:param name="TextData"/>
  <xsl:param name="WordCount"/>
  <xsl:param name="MoreText"/>
  <xsl:choose>
    <!-- More than one word still wanted and a space lies ahead:
         emit the next word, then recurse on the rest with the count reduced by one.
         (The double-space test was added for data containing runs of spaces.) -->
    <xsl:when test="$WordCount &gt; 1 and
        (string-length(substring-before($TextData, ' ')) &gt; 0 or
        string-length(substring-before($TextData, '  ')) &gt; 0)">
      <xsl:value-of select="concat(substring-before($TextData, ' '), ' ')" disable-output-escaping="yes"/>
      <xsl:call-template name="FirstNWords">
        <xsl:with-param name="TextData" select="substring-after($TextData, ' ')"/>
        <xsl:with-param name="WordCount" select="$WordCount - 1"/>
        <xsl:with-param name="MoreText" select="$MoreText"/>
      </xsl:call-template>
    </xsl:when>
    <!-- Last wanted word, with more text after it: emit the word followed by $MoreText -->
    <xsl:when test="(string-length(substring-before($TextData, ' ')) &gt; 0 or
        string-length(substring-before($TextData, '  ')) &gt; 0)">
      <xsl:value-of select="concat(substring-before($TextData, ' '), $MoreText)" disable-output-escaping="yes"/>
    </xsl:when>
    <!-- No usable space left: emit whatever remains, without $MoreText -->
    <xsl:otherwise>
      <xsl:value-of select="$TextData" disable-output-escaping="yes"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

With our example, StripHTML returns “The quick brown fox jumped over the lazy dog.” and then a call to FirstNWords with a WordCount of 5 will give you “The quick brown fox jumped…”  Much nicer!

Note that this won’t do a perfect job if there is a lot of odd spacing or punctuation, but most of the time, it’s a much cleaner solution.

NOTE (2009-02-05): I was working with some data today that had lots of double spaces and some escaped characters, so I tweaked my FirstNWords template to work a little better by adding the test for double spaces (though it isn’t foolproof with different types of white space).
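
If odd spacing is the main problem, another thing you could try is cleaning up the text before it ever reaches FirstNWords.  The sketch below assumes you already have a variable (here called $BodyText) holding the HTML-stripped text; it wraps it in the standard XPath normalize-space() function, which trims leading and trailing white space and collapses runs of spaces, tabs, and line breaks into single spaces.  It won’t touch non-breaking spaces, so it isn’t foolproof either.

```xml
<xsl:call-template name="FirstNWords">
  <!-- normalize-space() trims both ends and collapses white space runs to one space -->
  <xsl:with-param name="TextData" select="normalize-space($BodyText)"/>
  <xsl:with-param name="WordCount" select="25"/>
  <xsl:with-param name="MoreText" select="'...'"/>
</xsl:call-template>
```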

UPDATE (2009-02-27): Here’s an example of how I’ve used these templates in the past to display blog posts.  First, I create a variable called BodyText that contains the contents of the @Body column with the HTML stripped out by using the StripHTML template.  Then I output a row with a link to the post and a second row with the first 25 words of the post, followed by ‘…’, using the FirstNWords template.

<xsl:template name="USG_Blog.rowview">
  <xsl:variable name="BodyText">
    <xsl:call-template name="StripHTML">
      <xsl:with-param name="HTMLText" select="@Body"/>
    </xsl:call-template>
  </xsl:variable>
  <tr>
    <td>
      <a href="{$WebURL}Lists/Posts/Post.aspx?ID={@ID}&amp;Source={$URL}" >
        <xsl:value-of select="@Title"/>
      </a>
    </td>
  </tr>
  <tr>
    <td>
      <xsl:call-template name="FirstNWords">
        <xsl:with-param name="TextData" select="$BodyText"/>
        <xsl:with-param name="WordCount" select="25"/>
        <xsl:with-param name="MoreText" select="'...'"/>
      </xsl:call-template>
    </td>
  </tr>
</xsl:template>

As a side note, I always store these “utility” functions in a separate file for reuse and use the xsl:import tag to pull them into the DVWP I’m working on.  XSLT requires xsl:import elements to come before any other top-level element in the stylesheet, so the import goes before the xsl:output tag, as below.

<xsl:import href="/Style Library/XSL Style Sheets/Utilities.xsl"/>
<xsl:output method="html" indent="no"/>
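
The imported file has to be a complete stylesheet in its own right, with its own xsl:stylesheet wrapper.  A bare-bones sketch of what Utilities.xsl might look like, with the template bodies left out (this layout is an assumption for illustration, not a copy of my actual file):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch of /Style Library/XSL Style Sheets/Utilities.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Paste the full StripHTML template here -->

  <!-- Paste the full FirstNWords template here -->

</xsl:stylesheet>
```

Because imported templates have lower precedence than templates in the importing stylesheet, a DVWP can still override one of these utilities locally by defining a template with the same name.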

Microsoft Online Services Pricing

In an email from Microsoft the other day, I saw an offer for Microsoft Online Services and was curious about what’s available and what the pricing might be.  After navigating through the overly byzantine process to get to it, I was able to see the pricing below.

Microsoft Online Services Pricing (as of 2009-01-17)

Here’s the blurb on the Business Productivity Online Standard Suite:

Microsoft Business Productivity Online Standard Suite is a set of messaging and collaboration solutions hosted by Microsoft, and consists of Exchange Online, SharePoint Online, Office Live Meeting, and Office Communications Online (coming soon). These online services are designed to give your business streamlined communication with high availability, comprehensive security, and simplified IT management. Your business benefits from always up-to-date technologies that are deployed rapidly, maximizing your valuable IT resources and reducing your need for infrastructure investments.

I’m not exactly sure why the discount appears (though part of it looks like it may be due to the fact that Office Communications Online is part of the bundle but is not yet available), but even without it the prices look very reasonable.  $7.25 per person per month for SharePoint seems like an out-and-out bargain based on what I’ve seen others charging.  The datasheet for the Business Productivity Online Standard Suite makes it seem like you’d be getting a real hosted MOSS portal for that price.  (As usual with Microsoft’s pricing and marketing materials, I find it a little difficult to understand exactly what you’d get without reading a lot of fine print, or signing up for the 30-day trial, which I may do.)

All in all, well worth a look if you are considering putting your toe in the water and have a difficult internal process to get hardware in-house.

Optimize Your Blog: Dead Links and SEO

Having recently moved my blog here to WordPress from Live Spaces, I’ve been cleaning things up, repointing internal links, retagging, etc.  I’ve found a couple of things that are useful and I wanted to pass them along.

One of the things that I wanted to do was to see if I had any dead links in any of my posts.  I found this nice tool: http://www.dead-links.com.  Free and easy!  It’s not perfect (it reported quite a few links as 404 errors when they were just fine), but it’s a great, quick way to see if anything you consider important isn’t working right.  If you find something missing in someone else’s site and even the search engine caches are dead, consider trying the Wayback Machine to see if you can find an earlier version to link to or copy content from (not forgetting attribution, of course).

Another goal is to increase visibility and traffic to my blog (Search Engine Optimization or SEO), and Joel Oleson’s post entitled Ranking Your Blog – Managing and Gaining Popularity has great step-by-step tips on how to get the major (and many of the minor) search engines to notice your blog faster and more often.

SharePoint Document Library Headscratcher

Here’s a head-scratcher for you…

In a SharePoint Document Library where you have Versioning and Check In/Check Out enabled, when you first upload a document, the “save” button on EditForm.aspx is labeled “Check In” and saves and also checks the document in (good). On a subsequent edit of the metadata (properties), you’re forced to check the document out again (good), but the “save” button is labeled “OK” and doesn’t check the document in (maybe good, maybe bad). Now I can see where this is a good thing because the user might want to fiddle with the document a few times before they check it in. But what if you want to be sure they check it in?

It surely is by design, but in this case, the design isn’t what we want!

We have over 20 Content Types stored in one list, and we don’t want to give up the great capability that is there out of the box where the EditForm.aspx is “Content Type aware”. (When you switch Content Type, the form automagically adjusts the columns.) So I was thinking about using script to add a new Check In button to replace the OK button.  I’ve already got a lot of script on the page that enforces business rules around the metadata interdependencies.  (Of course, it’s a custom version of the form called EditFormCustom.aspx; never edit the original!)

Alternatively, I was also thinking of redirecting on every save to a dashboard page where the user could see all of their docs and their disposition (a la My Site) so that they could manage them there.

Any other ideas?  (Short of writing a lot of crufty C# code.)