Show more relevant Titles in search results in SharePoint 2013 plus some other improvements

We’ve introduced improvements to search in SharePoint 2013 that make it easier to display relevant titles and authors in search results. We’ve also changed how the time of the last document modification is set. This now allows more consistent and intuitive sorting and search refinement based on this time.

In this blog post we’ll tell you about these changes. They’re included in the SharePoint Server 2013 cumulative update published on October 26th 2013.

Tell me in a few words: what has changed?

The metadata extractor in the content processing pipeline extracts metadata from the content that you crawl. Before the changes we’ve introduced, the output of the metadata extractor was directly written to the corresponding managed properties. Now, we’ve created two brand new crawled properties: MetadataExtractorTitle and MetadataExtractorAuthor. The metadata extractor now writes extracted titles and authors from Word documents and PowerPoint presentations to these crawled properties. These new crawled properties map to the managed properties Title and Author.

We’ve also removed extraction of LastModifiedTime from the MetadataExtractor code. Dates that appear in the document body no longer influence the date of last modification.

How can I benefit from these improvements and get the new properties?

SharePoint Server 2013:

· Install the SharePoint Server 2013 cumulative update package published on October 26th 2013.

· Perform a full crawl of all your content sources.

Tell me the details

What has changed to allow search to display better titles?

How can I change which title is shown in the search results?

What’s new with the Author mapping?

What’s new in last saved date/time extraction?

What has changed to allow search to display better titles?

Sometimes, people save or upload Word documents or PowerPoint presentations with titles like “Document1.docx” or “Presentation1.pptx”. Before the MetadataExtractor was introduced, this title would typically show up as the title in the search results. That was not so good.

To present a better title for such files in the search results, we use the MetadataExtractor in the content processing pipeline. It searches for a title in the body of Word and PowerPoint files. Currently, if the MetadataExtractor finds a good candidate for a title in the body, it writes the extracted title to the new crawled property MetadataExtractorTitle that is mapped to the managed property Title by default.

Because the title from the crawled property MetadataExtractorTitle has the first priority in the mapping to the managed property Title, there’s a good chance that the titles of Word and PowerPoint files shown in search results are more relevant.

 

Note: back up any custom mappings for the Title managed property before you install the October CU. Otherwise they will be lost, because the update creates the new crawled properties and rolls the Title mappings back to their defaults.

How can I change which crawled property is shown as the title in the search results?

You can change which crawled property is selected to be shown as the title in the search results. This depends on the priorities of crawled properties in the search schema. If you decide to change the priority order of the mapping, make sure that the crawled property that you give priority is filled with useful titles.

Here’s a table that shows the default priority list for the crawled properties mapped to the managed property Title:

| Priority | Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|---|
| 0 | MetadataExtractorTitle | MetadataExtractor | The title extracted from the body of Word documents and PowerPoint presentations. |
| 1 | TermTitle | SharePoint | The title of the item in SharePoint. |
| 2 | Office:2 | Office | The title of the item in Word or PowerPoint, etc. |
| 3 | Ows_BaseName | SharePoint | Name of the SharePoint page. Ex: https://my/sites/wiki/Home.aspx |
| 4 | Title | Doc Parser | The title as picked up by the content processing component. |
| 5 | MailSubject | Doc Parser | The subject of an email file as picked up by the content processing component. |
| 6 | Mail:5 | Mail | The subject line of an email file. |
| 7 | People:PreferredName (urn:schemas-microsoft-com:sharepoint:portal:profile:PreferredName) | People | Persons first and last name |
| 8 | Basic:displaytitle (urn:schemas.microsoft.com:fulltextqueryinfo:displaytitle) | Basic | Contains file name of an Office doc |
| 9 | ows_Title | SharePoint | SharePoint Page Title |
| 10 | Basic:10 | Basic | Contains Filename metadata associated with file properties |
| 11 | Basic:9 | Basic | Contains Path metadata associated with file properties. |

Even though you can change the priority order of the mapping, if one of the crawled properties is empty, the next crawled property from the priority list will be selected.

So, even though the MetadataExtractorTitle has the first priority for the title, it will only be used if a title was extracted. If that, for some reason, wasn’t possible, the TermTitle from SharePoint will be used as the title, and so on. The same mapping order is active for other document formats. However, the MetadataExtractor doesn’t process other formats such as PDF. For file types other than Word documents and PowerPoint presentations, the MetadataExtractorTitle will be empty, and the next non-empty crawled property in the priority list will be shown as the title.

Alternatively, if you want to use the SharePoint TermTitle as the title for all your documents, change the priority of the crawled property TermTitle to position 0. If, for some reason, the TermTitle has no value, the MetadataExtractorTitle will be shown as the title, and so on.
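To make the fallback behavior concrete, here is a minimal Python sketch of the selection rule described above. It is only an illustration, not the actual SharePoint implementation: the function names pick_title and promote are hypothetical, crawled properties are modeled as a plain dictionary, and treating both a missing value and an empty string as “empty” is an assumption.

```python
from typing import Mapping, Optional, Sequence

# Default priority order of crawled properties mapped to Title (from the table above).
DEFAULT_TITLE_PRIORITY = (
    "MetadataExtractorTitle", "TermTitle", "Office:2", "Ows_BaseName",
    "Title", "MailSubject", "Mail:5", "People:PreferredName",
    "Basic:displaytitle", "ows_Title", "Basic:10", "Basic:9",
)


def pick_title(crawled: Mapping[str, Optional[str]],
               priority: Sequence[str] = DEFAULT_TITLE_PRIORITY) -> Optional[str]:
    """Return the first non-empty crawled property value in priority order."""
    for name in priority:
        value = crawled.get(name)
        if value:  # None and "" both count as empty here (an assumption)
            return value
    return None


def promote(priority: Sequence[str], name: str) -> list:
    """Return a new priority list with `name` moved to position 0."""
    return [name] + [p for p in priority if p != name]


# A PDF: the MetadataExtractor produces nothing, so TermTitle is used.
pdf = {"MetadataExtractorTitle": "", "TermTitle": "Annual report", "Basic:10": "report.pdf"}
print(pick_title(pdf))  # prints: Annual report

# A Word file after TermTitle has been moved to position 0.
doc = {"MetadataExtractorTitle": "Design Overview", "TermTitle": "Document1"}
print(pick_title(doc, promote(DEFAULT_TITLE_PRIORITY, "TermTitle")))  # prints: Document1
```

With the default order, the same Word file would get “Design Overview” as its title, which is the behavior the new crawled property is meant to provide.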

You can change the priority in the search schema. See Manage the search schema (TechNet, on premises) or Manage the search schema

What’s new with the Author mapping?

We’ve added the MetadataExtractorAuthor crawled property. The metadata extractor extracts authors from the body of Word documents and PowerPoint presentations and keeps them in this new crawled property. This can be useful, for example, for scientific articles where all authors are listed inside the document body but are not present in any document properties.

The mapping to the Author managed property for any file format works like this:

1) All possible authors found during crawling are added to a non-prioritized list.

2) From that list, a concatenated string is created that excludes duplicates and empty values.

3) This string is mapped to the Author managed property.

The authors extracted by the metadata extractor are simply added to the list and included in the string.
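The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_author is hypothetical, the semicolon separator is a guess (the actual separator is not stated here), and duplicates are detected by exact string match after trimming whitespace.

```python
from typing import Iterable, Optional


def build_author(candidates: Iterable[Optional[str]], separator: str = ";") -> str:
    """Concatenate author candidates, skipping empty values and duplicates."""
    seen = set()
    unique = []
    for candidate in candidates:
        name = (candidate or "").strip()
        if not name or name in seen:
            continue  # step 2: exclude empty values and duplicates
        seen.add(name)
        unique.append(name)
    # step 3: this string is what would be mapped to the Author managed property
    return separator.join(unique)


# Step 1: all authors found during crawling, in no particular priority.
found = ["Ann Lee", "", None, "Bo Chen", "Ann Lee"]
print(build_author(found))  # prints: Ann Lee;Bo Chen
```

An author found only in the document body simply becomes one more entry in the candidate list.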

Even though the priority is not important for the Author managed property, since all authors extracted from content are included in the string, this is where the crawled properties come from:

| Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|
| Author | Document Parser | Author as picked up by the content processing component. |
| MailFrom | Mail | The people names from the from line of an email file. |
| Mail:6 | Mail | Author, MetadataAuthor |
| Author | Notes | The people names associated with One Note files. |
| Internal:3 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects |
| Internal:105 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects |
| Office:8 | Office | ModifiedBy metadata |
| MetadataExtractorAuthor | MetadataExtractor | The author extracted from the body of Word documents and PowerPoint presentations. |

What’s new in last saved date/time extraction?

We stopped extracting the date of last modification or creation from the document body. Even though this could be useful for PowerPoint presentations where the date of the presentation is mentioned on the first slide, it introduced too much uncertainty. Imagine a presentation about the French Revolution that has its dates on the first slide. It was highly probable that your presentation would get 14.07.1789 as its creation date, which, I believe, is undesired.

So, with this change you can still map crawled properties to LastModifiedTime and use the managed property in the search results, but there will be no output from the MetadataExtractor in this list.

This table shows the default crawled properties mapped to LastModifiedTime and their priority:

| Priority | Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|---|
| 0 | LastSavedDateTime | Document Parser | The timestamp showing when the item was last saved as picked up by the content processing component. |
| 1 | Basic:14 | Basic | LastModifiedTime metadata |
| 2 | Basic:16 | Basic | LastModifiedTime metadata |
| 3 | ows_Modified | SharePoint | The timestamp showing when the item was last saved in SharePoint. |
| 4 | Lastaccessed | Notes | The timestamp showing when the item was last accessed in One Note. |

You can now sort search results based on the preferred date of modification by changing the priority order, or you can implement more sophisticated logic, such as deleting documents that are too old from your site collection.
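As a rough illustration of that kind of logic, the Python sketch below sorts items by LastModifiedTime and separates those older than a cutoff. It does not call any SharePoint API: search results are modeled as plain dictionaries, and the function name split_by_age and the "Path" key are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def split_by_age(results, max_age, now):
    """Return (fresh, stale) lists, each sorted newest first by LastModifiedTime."""
    cutoff = now - max_age
    ordered = sorted(results, key=lambda r: r["LastModifiedTime"], reverse=True)
    fresh = [r for r in ordered if r["LastModifiedTime"] >= cutoff]
    stale = [r for r in ordered if r["LastModifiedTime"] < cutoff]
    return fresh, stale


utc = timezone.utc
results = [
    {"Path": "a.docx", "LastModifiedTime": datetime(2013, 10, 1, tzinfo=utc)},
    {"Path": "b.pptx", "LastModifiedTime": datetime(2009, 5, 4, tzinfo=utc)},
    {"Path": "c.docx", "LastModifiedTime": datetime(2013, 10, 20, tzinfo=utc)},
]

# Items not modified within the last year are candidates for cleanup.
fresh, stale = split_by_age(results, timedelta(days=365),
                            now=datetime(2013, 11, 1, tzinfo=utc))
print([r["Path"] for r in fresh])  # prints: ['c.docx', 'a.docx']
print([r["Path"] for r in stale])  # prints: ['b.pptx']
```

Because dates in the document body no longer feed into LastModifiedTime, a cutoff like this is less likely to misclassify a presentation just because of a historical date on its first slide.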

We hope that by adding these changes, we’ve improved the way in which you can control search results.


Posted by: Srinivas Dutta [MSFT], Ievgeniia Zhovtobriukh [MSFT]