Sitecore Standard Analyzer : Managing your own stop words filter

Sitecore uses the Standard Analyzer as its default analyzer for most of its internal search operations (for example, searches inside the Content Editor). The Standard Analyzer uses the StopFilter, which removes stop words from a token stream. As a result, you will run into a scenario where searches for terms that contain common words such as “a”, “an”, “the” or “and” fail, because Lucene’s StopFilter removes those stop words.

For example, if you search for an item called “Seller and Buyer”, the Standard Analyzer processes that as “Seller Buyer”. The stop words are removed from the phrase, and since no field in the index holds that value, the search returns 0 results.

Out of the many ways to solve this issue, I will show you one where you manage your own stop words list, which means you provide the list of stop words to Lucene yourself.

Currently, the following stopwords are declared in Sitecore:

“a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”, “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with”
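
To make the effect of this list concrete, here is a rough Python sketch of what stop-word removal does to the “Seller and Buyer” example. It is only a simplified simulation, not Lucene's or Sitecore's actual code: the function name `analyze` is hypothetical, and the lowercase-and-split tokenization is an assumption standing in for the real tokenizer.

```python
# Simplified simulation of stop-word removal. NOT Lucene's implementation.
# The word list is the one quoted in this post.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


# With the default list, "and" disappears from the phrase.
print(analyze("Seller and Buyer"))                    # ['seller', 'buyer']

# With an empty stop-word list, every word survives.
print(analyze("Seller and Buyer", stop_words=set()))  # ['seller', 'and', 'buyer']
```

The second call mirrors the idea behind this post: once you control the stop-word list, words such as “and” are no longer stripped.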

Solution to set up your own stopwords filter

  • Download the stopwords file
  • Edit the contents of that file to suit your needs and save it with a .txt extension (the rename is needed because I am unable to attach .txt files in WordPress :))
  • Place the text file in your Data/indexes folder.
  • Make the config changes shown below in the Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config file
  • Rebuild your Sitecore indexes

<param desc="defaultAnalyzer" type="Sitecore.ContentSearch.LuceneProvider.Analyzers.DefaultPerFieldAnalyzer, Sitecore.ContentSearch.LuceneProvider">
  <param desc="defaultAnalyzer" type="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net">
    <param hint="version">Lucene_30</param>
    <param desc="stopWords" type="System.IO.FileInfo, mscorlib">
      <param hint="fileName">[FULL_PATH_TO_SITECORE_ROOT_FOLDER]\Data\indexes\stopwords.txt</param>
    </param>
  </param>
</param>

Please note that changes made to the stopwords.txt file are applied only after the config file is changed or the application pool is restarted. If you do not want any stop words at all, you can provide an empty .txt file instead.
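
If you prefer to generate the file rather than edit it by hand, a small script along the following lines could do it. This is a sketch under assumptions: the helper name `build_stop_word_file` is hypothetical, the file is assumed to hold one stop word per line, and the output goes to a temporary folder here, so you would point it at your own Data/indexes folder instead.

```python
import tempfile
from pathlib import Path

# The stop words quoted earlier in this post.
DEFAULT_STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
]


def build_stop_word_file(path, keep_searchable):
    """Write the default list minus `keep_searchable` to `path`.

    Assumption: one stop word per line. Returns the words written.
    """
    words = [w for w in DEFAULT_STOP_WORDS if w not in keep_searchable]
    text = "\n".join(words) + ("\n" if words else "")
    Path(path).write_text(text, encoding="utf-8")
    return words


# Hypothetical location; replace with your own Data/indexes folder.
out_file = Path(tempfile.mkdtemp()) / "stopwords.txt"

# Keep "and" and "or" searchable by leaving them out of the stop-word file.
remaining = build_stop_word_file(out_file, keep_searchable={"and", "or"})
print(len(remaining), "stop words written to", out_file)
```

Passing the whole default list as `keep_searchable` produces an empty file, which corresponds to the "no stop words" case mentioned above.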

My colleague Brent Svac also blogged another way to solve the stop words issue.

Adam Conn also has an excellent post which explains the Sitecore 7 analyzers in detail.

Should you have any questions, please do not hesitate to comment or contact me via sjain@horizontalintegration.com

9 Responses to “Sitecore Standard Analyzer : Managing your own stop words filter”

  1. Sitecore ContentSearch Fails for Lucene Reserved Keywords like and/or | Horizontal Integration Says:

    […] If you would like to fix this without changing the Standard Analyzer, please see this blog post by Sheetal Jain […]

  2. Chen Hendrawan Says:

    Hi,

    In case anybody is interested, I found a way to avoid hard-coding the full path to the stop words file in your configuration. First, add this config section under /configuration/sitecore/contentSearch (or anywhere else you like, just make sure you reference it correctly)

    $(1)/stopwords.txt

    Then, in the line where you configure parameter to StandardAnalyzer, replace

    [FULL_PATH_TO_SITECORE_ROOT_FOLDER]\Data\indexes\stopwords.txt

    with

  3. Sitecore Standard Analyzer : Turn off the stop words filter | Horizontal Integration Says:

    […] a previous blog post, my colleague and friend, Sheetal Jain wrote this blog post: Sitecore Standard Analyzer : Managing your own stop words filter. If you only want to turn stop words off, you can use the following patch.config file without […]

  4. Pavan Says:

    I am trying to implement stop words in Solr search. When I look at the Solr GUI the words are being removed, but how do we remove them when querying through the Sitecore API?

    • Pavan Says:

      We are using the analyzer below for the fields from which we need to remove stop words.

      <!-- in this example, we will only use synonyms at query time

      -->

      Note: I have added words to stopwords.txt

  5. Pavan Says:

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

