Show more relevant Titles in search results in SharePoint 2013 plus some other improvements


We’ve improved search in SharePoint 2013 to make it easier to display relevant titles and authors in search results. We’ve also changed how the time of the last document modification is set, which allows more consistent and intuitive sorting and search refinement based on this time.

In this blog post we’ll tell you about these changes. They’re included in the SharePoint Server 2013 cumulative update published on October 26th 2013.

Tell me in a few words: what has changed?

The metadata extractor in the content processing pipeline extracts metadata from the content that you crawl. Before the changes we’ve introduced, the output of the metadata extractor was directly written to the corresponding managed properties. Now, we’ve created two brand new crawled properties: MetadataExtractorTitle and MetadataExtractorAuthor. The metadata extractor now writes extracted titles and authors from Word documents and PowerPoint presentations to these crawled properties. These new crawled properties map to the managed properties Title and Author.

We’ve also removed LastModifiedTime extraction from the MetadataExtractor. Dates that appear in the document body no longer influence the date of last modification.

How can I benefit from these improvements and get the new properties?

SharePoint Server 2013:

· Install the SharePoint Server 2013 cumulative update package published on October 26th 2013.

· Perform a full crawl of all your content sources.

Tell me the details

What has changed to allow search to display better titles?

How can I change which title is shown in the search results?

What’s new with the Author mapping?

What’s new in last saved date/time extraction?

What has changed to allow search to display better titles?

Sometimes, people save or upload Word documents or PowerPoint presentations with titles like “Document1.docx” or “Presentation1.pptx”. Before the MetadataExtractor was introduced this title would typically show up as the title in the search results. That was not so good.

To present a better title for such files in the search results, we use the MetadataExtractor in the content processing pipeline. It searches for a title in the body of Word and PowerPoint files. Currently, if the MetadataExtractor finds a good candidate for a title in the body, it writes the extracted title to the new crawled property MetadataExtractorTitle that is mapped to the managed property Title by default.

Because the title from the crawled property MetadataExtractorTitle has the first priority in the mapping to the managed property Title, there’s a good chance that the titles of Word and PowerPoint files shown in search results are more relevant.

 

Note: back up any custom mapping for the Title managed property before you install the October CU. Otherwise the custom mapping will be lost, because creating the new crawled properties rolls the Title mappings back to their defaults.

How can I change which crawled property is shown as the title in the search results?

You can change which crawled property is selected to be shown as the title in the search results. This depends on the priorities of crawled properties in the search schema. If you decide to change the priority order of the mapping, make sure that the crawled property that you give priority is filled with useful Titles.

Here’s a table that shows the default priority list for the crawled properties mapped to the managed property Title:

| Priority | Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|---|
| 0 | MetadataExtractorTitle | MetadataExtractor | The title extracted from the body of Word documents and PowerPoint presentations. |
| 1 | TermTitle | SharePoint | The title of the item in SharePoint. |
| 2 | Office:2 | Office | The title of the item in Word or PowerPoint, etc. |
| 3 | Ows_BaseName | SharePoint | Name of the SharePoint page. Ex: http://my/sites/wiki/Home.aspx |
| 4 | Title | Doc Parser | The title as picked up by the content processing component. |
| 5 | MailSubject | Doc Parser | The subject of an email file as picked up by the content processing component. |
| 6 | Mail:5 | Mail | The subject line of an email file. |
| 7 | People:PreferredName (urn:schemas-microsoft-com:sharepoint:portal:profile:PreferredName) | People | Person’s first and last name |
| 8 | Basic:displaytitle (urn:schemas.microsoft.com:fulltextqueryinfo:displaytitle) | Basic | Contains file name of an Office doc |
| 9 | ows_Title | SharePoint | SharePoint Page Title |
| 10 | Basic:10 | Basic | Contains Filename metadata associated with file properties |
| 11 | Basic:9 | Basic | Contains Path metadata associated with file properties. |

Even if you change the priority order of the mapping, an empty crawled property is skipped and the next crawled property in the priority list is selected.

So, even though MetadataExtractorTitle has first priority for the title, it is only used if a title was extracted. If that wasn’t possible for some reason, the TermTitle from SharePoint is used as the title, and so on. The same mapping order applies to other document formats, but the MetadataExtractor doesn’t process them: PDFs, for example, are not covered. For file types other than Word documents and PowerPoint presentations, MetadataExtractorTitle is empty and the next non-empty crawled property is shown as the title.

Alternatively, if you want to use the SharePoint TermTitle as the title for all your documents, move the crawled property TermTitle to position 0. If TermTitle has no value for some reason, MetadataExtractorTitle is shown as the title, and so on.
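
The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not SharePoint code: the function names `resolve_title` and `promote`, the dictionary representation of an item’s crawled properties, and the treatment of whitespace-only values as empty are all assumptions made for the example. The priority list itself is copied from the default mapping table.

```python
from typing import List, Mapping, Optional, Sequence

# Default priority order for the Title managed property (from the table in this post).
DEFAULT_TITLE_PRIORITY: List[str] = [
    "MetadataExtractorTitle",
    "TermTitle",
    "Office:2",
    "Ows_BaseName",
    "Title",
    "MailSubject",
    "Mail:5",
    "People:PreferredName",
    "Basic:displaytitle",
    "ows_Title",
    "Basic:10",
    "Basic:9",
]


def resolve_title(
    crawled: Mapping[str, Optional[str]],
    priority: Sequence[str] = DEFAULT_TITLE_PRIORITY,
) -> Optional[str]:
    """Return the first non-empty crawled property value in priority order.

    Hypothetical helper: `crawled` maps crawled property names to values.
    """
    for name in priority:
        value = crawled.get(name)
        # Assumption: missing, None, and whitespace-only values count as empty.
        if value and value.strip():
            return value.strip()
    return None


def promote(priority: Sequence[str], name: str) -> List[str]:
    """Return a new priority list with `name` moved to position 0."""
    return [name] + [p for p in priority if p != name]
```

With these definitions, an item whose MetadataExtractorTitle is empty (a PDF, for example) falls through to TermTitle, and `promote(DEFAULT_TITLE_PRIORITY, "TermTitle")` models moving TermTitle to position 0.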

You can change the priority in the search schema; see Manage the search schema (TechNet, on premises) or Manage the search schema.

What’s new with the Author mapping?

We’ve added the MetadataExtractorAuthor crawled property. The metadata extractor extracts authors from the body of Word documents and PowerPoint presentations and stores them in this new crawled property. This can be useful for scientific articles, for example, where all authors are listed inside the document body but do not appear in any document properties.

The mapping to the Author managed property for any file format works like this:

1) All possible authors found during crawling are added to a non-prioritized list.

2) From that list, a concatenated string is created that excludes duplicates and empty values.

3) This string is mapped to the Author managed property.

The authors extracted by the metadata extractor are simply added to the list and included in the string.
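
The three steps can be sketched as a small Python function. This is a minimal illustration, not the actual pipeline code: the function name `merge_authors`, the `;` separator, the case-insensitive duplicate check, and keeping the first-seen order are all assumptions made for the example.

```python
from typing import Iterable, Optional


def merge_authors(candidates: Iterable[Optional[str]], separator: str = ";") -> str:
    """Build one author string from all candidate values.

    Empty values and duplicates are dropped. Assumptions: duplicates are
    detected case-insensitively, and the separator is ';'.
    """
    seen = set()
    unique = []
    for raw in candidates:
        name = (raw or "").strip()
        if not name:
            continue  # step 2: exclude empty values
        key = name.casefold()
        if key in seen:
            continue  # step 2: exclude duplicates
        seen.add(key)
        unique.append(name)
    # step 3: the concatenated string that would be mapped to Author
    return separator.join(unique)
```

For instance, passing the values found in Author, Office:8, and MetadataExtractorAuthor for one document yields a single string in which each distinct name appears once.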

Priority doesn’t matter for the Author managed property, because all authors extracted from the content are included in the string. Here is where the crawled properties come from:

| Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|
| Author | Document Parser | Author as picked up by the content processing component. |
| MailFrom | Mail | The people names from the from line of an email file. |
| Mail:6 | Mail | Author, MetadataAuthor |
| Author | Notes | The people names associated with One Note files. |
| Internal:3 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects |
| Internal:105 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects |
| Office:8 | Office | ModifiedBy metadata |
| MetadataExtractorAuthor | MetadataExtractor | The author extracted from the body of Word documents and PowerPoint presentations. |


What’s new in last saved date/time extraction?

We stopped extracting the date of last modification or creation from the document body. Although this can be useful for PowerPoint presentations where the date of the presentation appears on the first slide, it introduced too much uncertainty. Imagine a presentation about the French Revolution with its dates on the first slide: it was highly probable that the presentation would get 14.07.1789 as its creation date, which is undesirable.

So, with this change you can still map crawled properties to LastModifiedTime and use the managed property in the search results, but there is no output from the MetadataExtractor in this list.

This table shows the default crawled property mapping and priority to LastModifiedTime:

| Priority | Crawled Property | Origin | What kind of value does this crawled property contain? |
|---|---|---|---|
| 0 | LastSavedDateTime | Document Parser | The timestamp showing when the item was last saved as picked up by the content processing component. |
| 1 | Basic:14 | Basic | LastModifiedTime metadata |
| 2 | Basic:16 | Basic | LastModifiedTime metadata |
| 3 | ows_Modified | SharePoint | The timestamp showing when the item was last saved in SharePoint. |
| 4 | Lastaccessed | Notes | The timestamp showing when the item was last accessed in One Note. |

By changing the priority order, you can now sort search results on your preferred modification date, or you can apply more sophisticated logic, such as deleting documents that are too old from your site collection.
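
To make the effect of the priority order on sorting and age-based cleanup concrete, here is a hedged Python sketch. It does not call any SharePoint API. The item layout (a dict with `path` and `crawled` keys), the function names, and the sample dates in the usage note are hypothetical; only the default priority order is taken from the table above.

```python
from datetime import datetime
from typing import List, Mapping, Optional, Sequence

# Default priority order for LastModifiedTime (from the table in this post).
DEFAULT_LAST_MODIFIED_PRIORITY: List[str] = [
    "LastSavedDateTime",
    "Basic:14",
    "Basic:16",
    "ows_Modified",
    "Lastaccessed",
]


def resolve_last_modified(
    crawled: Mapping[str, Optional[datetime]],
    priority: Sequence[str] = DEFAULT_LAST_MODIFIED_PRIORITY,
) -> Optional[datetime]:
    """Return the first crawled timestamp that has a value, in priority order."""
    for name in priority:
        value = crawled.get(name)
        if value is not None:
            return value
    return None


def sort_newest_first(items, priority=DEFAULT_LAST_MODIFIED_PRIORITY):
    """Sort items by resolved modification time; items without one go last."""
    return sorted(
        items,
        key=lambda item: resolve_last_modified(item["crawled"], priority) or datetime.min,
        reverse=True,
    )


def older_than(items, cutoff: datetime, priority=DEFAULT_LAST_MODIFIED_PRIORITY):
    """Return items whose resolved modification time is before `cutoff`."""
    result = []
    for item in items:
        modified = resolve_last_modified(item["crawled"], priority)
        if modified is not None and modified < cutoff:
            result.append(item)
    return result
```

Suppose a document was last saved in Word in January but uploaded to SharePoint in September. With the default order it sorts by the January date; with ows_Modified moved to the front of the list it sorts by the September date, and `older_than` would select it or skip it accordingly.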

We hope that by adding these changes, we’ve improved the way in which you can control search results.


Posted by: Srinivas Dutta [MSFT], Ievgeniia Zhovtobriukh [MSFT]

 


Comments (18)

  1. Nick Hurst says:

    So first off thank you for allowing us to now remove the MetadataExtractorTitle from the Title managed property.  This will be a huge improvement as we have tons of templated documents for meeting minutes, agendas and the like which all had the same titles being displayed in search.  

    A question though, if we remove the MetadataExtractorTitle property for Title, will this also change the _layouts/15/osssearchresults.aspx site specific search page search results as well as the Enterprise Search Center?    

  2. Spses says:

    MetadataExtractorTitle is just a crawled property. It affects how the title of a document is extracted. Title is extended as MP only. If MDE Title is removed from the mapping, the next one in the order will be used as the title.

    For the changes to take effect, a recrawl (of any type) needs to happen, so if the customer wants to affect all documents simultaneously, a full recrawl is needed.

  3. Nick Hurst says:

    So we tried both removing the MetadataExtractorTitle from the Title Managed Property, and moving it down in the crawled property order but the MetadataExtractorTitle was still being displayed in search results.  We ran several Full Crawls and Incremental Crawls with no changes being displayed.  

    Finally we did an Index Reset, then a full crawl, and now the MetadataExtractorTitle is no longer being displayed in search results.  This was all in our Dev environment, so is an Index Reset required for this change to work?  Or is there a nightly job that perhaps will update this?  We really do not want to do a Index Reset in Production if we can avoid it.  

    Any information would be appreciated.

    Thanks.  

  4. David says:

    This is a very informative post – thank you!

    How do you add properties like "Mail:6" back if you remove them. We'd like to try some different configurations but I don't see these crawled properties listed.

  5. Nick Hurst says:

    We've done several full crawls and no change has occurred.

    Does the change to no longer show the MetadataExtractorTitle require the Index to be reset?

    Thank you.

  6. SharePoint Foundation 2013 says:

    Is there not an option to do the same with MetaDataExtractorTitle on SharePoint Foundation? At the moment it is reading the footer page number in some documents, which makes it completely useless.

  7. MSP1024 says:

    Is this different for O365? I have two different environments (E1 and E3), and neither has TermTitle as an option for the mapped property.

    We have documents going back to 2011 which already have a title field. We trained users to add a "title" and now they can no longer find documents with the title in search. Any suggestions on how to force the SharePoint title field if the mapped TermTitle is not there? We tried mapping other fields to the refinableString predefined fields but this did not give us consistent or accurate results.

  8. Bart says:

    I've changed the mapping order so 'TermTitle' is the top one (reset the index, full crawl)

    However for PDF files the PDF Title Property is still showing instead of the Title set in SharePoint!?

    Any suggestions would be much appreciated!

  9. ThomasWalz says:

    Thanks for these detailed instructions and the explanation. I am not running the October CU. We are using a Content Search Web Part with a custom display template which should simply show the title column. Therefore, I mapped one of the standard RefinableString managed properties to the crawled properties "ows_Title", "Title" and "Office:2". My goal is to show ONLY the title column of the lists. For Office documents, this works. But for PDF files, it always shows two titles when both the column AND the document property are filled. It then looks like this: TITLE_COLUMN_TITLE;TITLE_DOCUMENT_PROPERTY, for example: "Test-Document;Test2"

    What could I do to tell SharePoint to display ONLY the list column? If I remove everything besides "ows_title" from the managed property, it shows nothing. And of course, ows_title is the first entry there…

  10. PANoone says:

    How is MetadataExtractorTitle any different from the Office:2 property, which was the initial cause of all this madness?

    I've tried everything officially suggested here to no avail. My next step is to simply remove these unwanted properties and perform an index reset.

    Surely having the SharePoint title (when it exists) appear first for any document is not a big ask.

  11. Ross McLean says:

    For what it's worth, I changed the order of the Managed Property to be SharePoint Title, Doc Title and then MetadataExtractorTitle.

    A full crawl didn't make a difference to the results, still using MetadataExtractorTitle where it thought it had a good one.

    I  reset the index and ran the full crawl and it is now picking up the SharePoint Title where populated.

  12. Sadiq says:

    Tried both removing the MetadataExtractorTitle from the Title Managed Property, and moving it down in the crawled property order but the MetadataExtractorTitle was still being displayed in search results.  We ran several Full Crawls and Incremental Crawls with no changes being displayed.  

    Finally we did an Index Reset, then a full crawl, and now the MetadataExtractorTitle is no longer being displayed in search results. So is an Index Reset required for this change to work?  Or is there a nightly job that perhaps will update this?  We really do not want to do a Index Reset in Production if we can avoid it.  

    Any information would be appreciated.

    Thanks.  

  13. Prameela says:

    I have removed metadata extractor but then in my CQWP results the items display with the title of the "View" at the end of the item name.

    How to remove this

  14. Prameela says:

    I have changed some settings with " title" managed property by removing OWS_title and Basic:Displaytitle crawled properties but it does not work even then

  15. Peter says:

    Basic:displaytitle is the Office 'Title' property, and not as stated here the Office doc file name.

  16. duminda Weebaddage says:

    @Title had issues with PDF files. In the ULS logs (enable VerboseEx), look for the event af7zf together with the document name to see how the CTS feeds the document, including each property mapping. This is an easy way to define the correct properties.

  17. Nv says:

    Is this also applicable to SharePoint Wiki Library pages or only limited to office documents? Thanks!

  18. Bikash says:

    This is very informative. Thank you. Can you please help me with more info on how to fix the issue with PDF files? I have a document library with 40k PDF files, and for a few PDF files the title from the PDF property is displayed on the search results page. I want to display the title field from the document library for all files and ignore the title property from the PDF file. Having ows_title as the first item mapped to the managed property Title and removing MetadataExtractorTitle from the mapping, followed by a full crawl, did not fix the issue.
