SharePoint Online Search Isn’t Displaying What I Expect – Part 1 – Trimmed Duplicates

This entry is part 1 of 1 in the series SharePoint Online Search Isn't Displaying What I Expect

For years, I’ve been skeptical about search indexing in SharePoint, especially in SharePoint Online in Office 365. The fact that we can’t know when a search crawl has run – thus updating the indices – is a huge part of the problem. In the early days, before Content Search Web Parts (CSWPs) were available in SharePoint Online, we routinely saw delays of days or even weeks between content creation and that content showing up in search results. Later the CSWP was enabled on SharePoint Online, and it is a fantastically powerful tool, far better than the Content Query Web Part (CQWP) which it nominally replaced.

But the value of any search-driven mechanism in SharePoint is directly tied to the recency and frequency of updates to the search index. While the CQWP is quite inefficient – since it actually goes out to look for content at the source every time it runs (though there may be some caching) – the CSWP uses the search index and can thus return results using fewer server resources in some cases. (One downside is that you can only retrieve up to 50 results with the CSWP.) Since we don’t know when the search crawls run in SharePoint Online, and we often don’t see the results we expect, we tend to blame indexing for the problem.

There are many things that can contribute to the indexing issues. Load on the indexing servers can mean that your tenant isn’t crawled as frequently as you might want – taking hours or in the worst cases days to display items you know should be there. Unfortunately, there is no way to know if the index is the issue. Based on what I’ve heard, there are usually multiple indexing servers per tenant, and those indices can supposedly get out of sync. Search is also a very complex beast: probably way too complicated for most use cases, as it is based on the old FAST search platform. Most people simply want to be able to see content they add to a SharePoint list or library right away if they search for it. Period. So simple, yet often not what happens.

The other day, Julie and I were certain we had an ironclad instance where search indexing simply wasn’t working correctly. The scenario seems to be a very common one, and we stripped it down to as minimal an example as possible and sent it off to the Product Group.

In the example, we were making a REST call to the search API to retrieve all of the events in calendars across an Intranet. The use case is a very common one: we have a calendar per department, office, etc., and in some cases we’d like to promote those calendar events to display on the home page of the Intranet. Put aside the governance questions here, and just assume that everyone who can create events gets to decide whether to check the Show on the home page box. To make this all work, we have a few custom content types which inherit from Event with a few extra columns. We have some nice, fancy display mechanisms on the home page using AngularJS and on a Company Calendar page using fullcalendar. But most of that doesn’t matter: we were seeing the issue in the call to the search API.

Our query looked something like this:

This query will return all (we thought!) list items which inherit from the Event Content Type, because its ContentTypeId is 0x0102. Of course, our actual query was more complicated: we requested specific fields with selectproperties, asked for more items by setting the rowlimit to 50, etc. But again, at the core we were simply asking for a bunch of events.
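A query along those lines can be sketched as follows. This is a minimal illustration rather than our production code: the helper name, the site URL, and the selected properties are all placeholders I've assumed for the example. Only the `/_api/search/query` endpoint and the `querytext`, `selectproperties`, and `rowlimit` parameters come from the search REST API itself.

```javascript
// Hypothetical helper: builds a SharePoint search REST URL for Event items.
// The site URL and property list passed in below are placeholders.
function buildEventSearchUrl(siteUrl, options) {
  var opts = options || {};
  var params = [
    // KQL: any item whose content type derives from Event (0x0102)
    "querytext='ContentTypeId:0x0102*'"
  ];
  if (opts.selectProperties && opts.selectProperties.length) {
    params.push("selectproperties='" + opts.selectProperties.join(",") + "'");
  }
  if (typeof opts.rowLimit === "number") {
    params.push("rowlimit=" + opts.rowLimit);
  }
  // Drop a trailing slash so we don't end up with "//_api"
  return siteUrl.replace(/\/$/, "") + "/_api/search/query?" + params.join("&");
}

var url = buildEventSearchUrl("https://contoso.sharepoint.com/sites/intranet", {
  selectProperties: ["Title", "Path"],
  rowLimit: 50
});
```

The trailing `*` is what makes this match every content type that inherits from Event, not just Event itself.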

But we weren’t seeing all the events we expected. We assumed that the search index wasn’t working correctly, just like most site admins would.

There was a series of meetings going on about some HR changes, and the company was giving employees a set of webinars from which they could choose one to attend. The events were at four different times during the day. In our call to the search service, we were only getting one of those events. It happened to be the first one added to the calendar; all events had been added over 24 hours before.

When we tried searching for the title of the events in the regular old search box, we still only saw one result. At least that was consistent, and it showed we weren’t doing something stupid in our REST call. I’ve had to blur a number of things out in this screenshot, but here’s what we saw on the search results page. In this case, the results were coming to us in /_layouts/15/osssearchresults.aspx for the particular subsite where the events were, but it didn’t matter if we tried using the search center.

I included Mikael Svenson (@mikaelsvenson) in my email to my Product Group friends because if there’s something about search Mikael doesn’t know, it isn’t worth knowing. I probably should have just asked him in the first place, but we truly believed we had found a bug.

Mikael spotted that all four of the events we expected to retrieve had the same title. This isn’t so unusual: a couple of meetings on a given day with the same title. Maybe we have five interview slots set up for a new candidate, or have several different times when the Red Cross is running a blood drive on the same day, or exactly the example we have here.

We probably should have realized something was wrong when we looked at the search results above. As you can see below – now that I’ve highlighted it – there were three plus one items shown in the histogram for the Modified date.

It turned out that because the four events were so similar, search was considering them duplicates. Of course they weren’t duplicates to us: each is a unique event with its own value to end users.

By adding trimduplicates=false to our REST call, we were able to retrieve all of the events we expected. It was a very simple fix, but given the black art of SharePoint search, not necessarily an obvious one. Perhaps we should have known better, but I don’t think this is an unusual problem. Add to that the fact that the standard SharePoint search results UI doesn’t give you any way to see the duplicates, and I think there is a significant issue.
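As a rough sketch of that fix, here is one way to make sure a search REST URL carries `trimduplicates=false`. The helper name and the sample URL are hypothetical; the only thing taken from the search API is the `trimduplicates` parameter itself.

```javascript
// Hypothetical helper: ensures a search REST URL asks for untrimmed results.
function withDuplicates(searchUrl) {
  // If the parameter is already present, force its value to false
  if (/[?&]trimduplicates=/i.test(searchUrl)) {
    return searchUrl.replace(/([?&]trimduplicates=)[^&]*/i, "$1false");
  }
  // Otherwise append it, using "?" or "&" as appropriate
  var separator = searchUrl.indexOf("?") === -1 ? "?" : "&";
  return searchUrl + separator + "trimduplicates=false";
}

var fixedUrl = withDuplicates("/_api/search/query?querytext='ContentTypeId:0x0102*'");
```

Setting the value explicitly, rather than relying on the default, also makes the intent obvious to whoever reads the call later.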

I’ve made a suggestion on the SharePoint UserVoice to Allow easier management of the trimduplicates setting.

When we search for content, we often (some might say “usually”) need to see all results for our search query. It seems that by default, trimduplicates is set to true in SharePoint Online. This seems to be true in the search Web Parts and in the search API.

My suggestion is that we have far better and clearer ways to manage this setting, including:
* An easy toggle in the Content Search Web Part (CSWP)
* A clear way on the search results pages to choose to show duplicates where there are any. This was present in earlier versions of SharePoint, and I’m not sure when it was removed.

Deciding when duplicates are appropriate is a complex thing, and it varies greatly by use case. Giving people setting up SharePoint pages simpler control over the setting will both help build compelling user experiences on the platform and help confirm that search is indeed working properly. When someone searches for something they know is there and it doesn’t show up in search results, it undermines faith in the entire platform.

If you feel that my ideas have merit, please vote this suggestion up! Suggestions at UserVoice with enough votes truly do get the SharePoint Product Group’s attention.

If you find yourself in a situation like this, there is a tool that can help you work out what is going on. If you do much work with search, check out the SharePoint Search Query Tool on Codeplex. Mikael has contributed to this tool, and it basically allows you to issue REST calls through a UI that “understands” SharePoint search very well.

In the screenshot below I’ve done a search against our Sympraxis tenant using the querytext='test'. That’s certainly nothing fascinating, but it points out a few of the useful aspects of the tool:

  • Simple querytext configuration
  • Easy on/off switches for the various query options; in this case unchecking Trim Duplicates was the winner.
  • An easy way to see the effect of your settings on the REST call on the URL (right at the top of the screen)
  • On the right side, a clear view of the results returned, using a number of useful formats. If you’ve written JavaScript to parse search results, you’ll know that this is really helpful.

As a little bonus, here’s the key JavaScript we use to parse the RelevantResults table from the REST call to the search API. Because we’re requesting items that are all based on the Event Content Type, we can treat all search results the same way. In this example, we’re using AngularJS and jQuery, preparing the data for use with fullcalendar, as I mentioned above.
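The listing itself isn't reproduced here, so what follows is a stand-in sketch of the same kind of parsing, not the code from our project. It uses plain JavaScript rather than AngularJS or jQuery, and it assumes the response was requested as verbose OData (the `d.query.PrimaryQueryResult.RelevantResults.Table.Rows.results` shape). The function names, the sample values, and the `EventStart` and `EventEnd` property names are all placeholders; substitute whichever managed properties hold the event dates in your search schema.

```javascript
// Flatten the RelevantResults table of a verbose OData search response
// into plain objects keyed by managed property name.
function parseRelevantResults(data) {
  var rows = data.d.query.PrimaryQueryResult.RelevantResults.Table.Rows.results;
  return rows.map(function (row) {
    var item = {};
    row.Cells.results.forEach(function (cell) {
      item[cell.Key] = cell.Value;
    });
    return item;
  });
}

// Map a flattened result to the event object shape fullcalendar uses
// (title, start, end, url). startProp and endProp are placeholder names.
function toCalendarEvent(item, startProp, endProp) {
  return {
    title: item.Title,
    start: item[startProp],
    end: item[endProp],
    url: item.Path
  };
}

// Trimmed-down sample response with made-up values, for illustration only.
var sampleCells = [
  { Key: "Title", Value: "HR Webinar" },
  { Key: "Path", Value: "https://contoso.sharepoint.com/Lists/Calendar/DispForm.aspx?ID=1" },
  { Key: "EventStart", Value: "2016-05-02T14:00:00Z" },
  { Key: "EventEnd", Value: "2016-05-02T15:00:00Z" }
];
var sampleResponse = {
  d: { query: { PrimaryQueryResult: { RelevantResults: { Table: { Rows: {
    results: [{ Cells: { results: sampleCells } }]
  } } } } } }
};

var events = parseRelevantResults(sampleResponse).map(function (item) {
  return toCalendarEvent(item, "EventStart", "EventEnd");
});
```

Because every result is based on the Event Content Type, a single mapping function like this is enough for the whole result set.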

Hopefully this post gives you a little more insight into the inner workings of SharePoint search. To me, these little eddies and backwaters of search are what turn it into a black art. I’d love to see things get even simpler than the so-called “Quick Mode” in CSWPs.

Thanks again to Mikael and the Product Group folks who engaged on this with me. Notice that this is Part 1 – I expect to add more entries to this series.

References

Mikael pointed me to these two articles on duplicates and “shingling” in case you’d like to understand the underlying principles more fully.

Mikael’s post about duplicate trimming

SharePoint Search Query Tool

SharePoint UserVoice suggestion:

SharePoint 2013’s Search Continuous Crawl: An Enigma

I’m doing some work in SharePoint 2013 and we want to take advantage of as many out of the box capabilities as possible. We’re replacing an existing Intranet that has grown up in SharePoint from 2007 to 2010, and we’d like to rebuild with as little custom code as possible, since SharePoint 2013 now contains features that had to be custom built in the past.

The Intranet is built using a Publishing Portal and we want to use the Content Search Web Part (CSWP) to surface content in places like a home page rotator (the latest stories of certain Content Types within relative date ranges), in several “archive pages” (a list of the historical content, sorted by descending publishing date), and using search with the Search Results Web Part (SRWP) and the Refinement Web Part (RWP) in a page. The user stories and use cases here are not really all that complex: let’s show people the latest content of predetermined types regardless of where it was created in the Publishing Portal.

The new Continuous Crawl capability in SharePoint 2013 sounds like it will fit the bill for us. We want the content that users see to be as fresh as possible. In fact, the TechNet article I link to below says that with Continuous Crawl “[t]he search results are very fresh, because the SharePoint content is crawled frequently to keep the search index up to date.” Sounds perfect, but we need to understand more about it.

We haven’t done much at all with the Search Service Application. We’ve got one Content Source, which is “Local SharePoint Sites”. In other words, it couldn’t be much simpler. Since search will underlie so much of the functionality, we need to understand exactly how the crawls are going to work and what sort of lag time we can expect users to have before they see content that is published. We can’t figure out exactly how Continuous Crawl works under the hood, so today I tried to do some experiments.

I set things to have no schedules to start out just to make things as fresh as possible, and just in case, I did a Full Crawl.


When the Full Crawl was done, the Content Source showed this status:

[Screenshot: Content Source status after the Full Crawl]

Next I clicked the Enable Continuous Crawls radio button. Note that when I did this, the Incremental Crawl schedule was automatically set to every 4 hours. This can be changed, but the incremental schedule cannot be set to “None” while the Enable Continuous Crawls radio button is selected.


The Content Source status changed to this:

[Screenshot: Content Source status with Continuous Crawls enabled]

In the log, it looks like an Incremental Crawl fired off when I saved that change at 11:34.


I waited for the Incremental Crawl to complete and published a new News Item at 11:37. The new content showed up in the CSWPs and the search results around 11:55. For some reason, a new Incremental Crawl started at 11:55 (21 minutes after the previous crawl).


I added some more new content at 11:58. That content showed up in the CSWP by 12:09. (I’m not sure exactly how many minutes it took to get there, but it was less than 12.) There’s nothing in the logs to indicate that a crawl occurred:

[Screenshot: crawl log showing no new crawl]

At 12:30, there was still nothing new in the logs:

[Screenshot: crawl log at 12:30, still with no new entries]

All in all, this is still confusing to me. Continuous Crawl seems to be working, but on some underlying schedule which isn’t visible. There have been some suggestions that the Continuous Crawl schedule is set to every 15 minutes by default, and the evidence above seems to support that, since the second piece of content showed up in under 12 minutes, about 15 minutes after the last crawl that was visible in the logs.
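To make the arithmetic explicit, here is a small sketch that works out the gaps from the clock times noted above. The helper is purely illustrative and assumes all times fall on the same day. The 12:09 figure is when I first saw the content, so the real lag may have been shorter.

```javascript
// Minutes between two same-day "HH:MM" clock times.
function minutesBetween(from, to) {
  function toMinutes(hhmm) {
    var parts = hhmm.split(":");
    return Number(parts[0]) * 60 + Number(parts[1]);
  }
  return toMinutes(to) - toMinutes(from);
}

var lags = {
  // First item: published 11:37, visible around 11:55
  firstItemVisible: minutesBetween("11:37", "11:55"),
  // Gap between the two Incremental Crawls in the log
  betweenLoggedCrawls: minutesBetween("11:34", "11:55"),
  // Second item: published 11:58, visible by 12:09
  secondItemVisible: minutesBetween("11:58", "12:09"),
  // Last logged crawl (11:55) to second item visible (12:09)
  sinceLastLoggedCrawl: minutesBetween("11:55", "12:09")
};
```

The last value, 14 minutes, is the one that lines up with the suggested 15-minute default.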

There is some PowerShell you can use to get at properties of the Continuous Crawl, but it’s not totally clear what impact they have on the schedule.

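# Get the Search Service Application (run this in the SharePoint Management Shell)
$ssa = Get-SPEnterpriseSearchServiceApplication

# Set the Continuous Crawl interval property to 1 (the value is reportedly in minutes)
$ssa.SetProperty("ContinuousCrawlInterval", 1)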

Another thing that’s not clear is how many Continuous Crawl threads might stack up if things get backed up. One person has suggested an unlimited number and someone else told me there’s a maximum of 8 threads. Obviously, there’s not a clear understanding of this, either.

In researching things, these articles/posts seem useful:

This TechNet article is way too vague and only focuses on what buttons to push to turn Continuous Crawl on or off:

In my opinion, we need some much clearer documentation from Microsoft to explain how all of this holds together, and I’m trying to track down the right people to see if I can help to make that happen. If you know who those people are and could give me an introduction, I’d appreciate it.