Search Developer's Guide — Chapter 25

Distinctive Terms and cts:similar-query

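MarkLogic Server includes the cts:similar-query and cts:distinctive-terms search APIs. With them, you can find what is distinctive, from a search perspective, about a set of nodes (typically nodes returned from a search). This chapter describes cts:similar-query and cts:distinctive-terms, and includes the following sections:

  • Understanding cts:similar-query
  • Finding the Distinctive Terms of a Set of Nodes
  • Understanding the cts:distinctive-terms Output
  • Example Design Pattern: Making a Tag Cloud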

Understanding cts:similar-query

You can use cts:similar-query to find nodes that are similar, from a search perspective, to the model nodes that you pass in as the first parameter. The cts:similar-query constructor is a cts:query constructor, and you can combine it with other cts:query constructors as described in Composing cts:query Expressions.

Other cts:query constructors look in the indexes to find the terms that match the query. In contrast, cts:similar-query takes the nodes passed in, runs them through an indexing process, and returns a cts:query that would match the model nodes with a high degree of relevance. You can pass various index and score options into cts:similar-query to influence the cts:query that it produces.

The query that it generates finds distinctive terms of the model nodes based on the other documents in the database.
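As a minimal sketch of how this might look in practice, the following query builds a cts:similar-query from a model document, combines it with another cts:query constructor, and runs the result through cts:search. The weight, the max-terms value, and the directory URI are illustrative assumptions rather than recommended settings; the document URI is the sample one used elsewhere in this chapter.

```xquery
xquery version "1.0-ml";

(: Model node: the sample document used elsewhere in this chapter. :)
let $model := fn:doc("/shakespeare/plays/hamlet.xml")

(: Build a query matching documents similar to the model node.
   The weight (1.0) and max-terms value are illustrative assumptions. :)
let $similar := cts:similar-query($model, 1.0,
  <options xmlns="cts:distinctive-terms">
    <max-terms>20</max-terms>
  </options>)

(: Combine with another cts:query constructor; the directory URI is
   an assumption inferred from the sample document's URI. :)
let $query := cts:and-query((
  $similar,
  cts:directory-query("/shakespeare/plays/", "infinity")))

(: Return the URIs of the ten most relevant matches. :)
for $doc in cts:search(fn:doc(), $query)[1 to 10]
return xdmp:node-uri($doc)
```

Because the generated query is built to match the model nodes with high relevance, the model document itself will likely appear near the top of the results.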

Finding the Distinctive Terms of a Set of Nodes

If you want to find the terms that cts:similar-query uses to generate its cts:query, you can use cts:distinctive-terms. The output of cts:distinctive-terms is a cts:class element with several cts:term children. Each cts:term element contains a cts:query constructor, representing a term. Each cts:term element also contains scores and confidence for that term. MarkLogic Server uses these scores in calculating relevance.

You can pass many different options into cts:distinctive-terms to control which terms it generates. The database options control which terms will be most relevant to the model nodes, and therefore affect the cts:distinctive-terms output. If you take an iterative approach, you can try different indexing options to see which ones give the best results for your model nodes.

The terms generated are distinctive relative to the other documents in the database. Therefore, you will get much better results running this against a sizable database.

Understanding the cts:distinctive-terms Output

The following shows a simple cts:distinctive-terms query with its output:

let $node := doc("/shakespeare/plays/hamlet.xml") 
return cts:distinctive-terms($node, 
   <options xmlns="cts:distinctive-terms"
         xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>3</max-terms>  
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:fast-element-word-searches>false</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
   </options>)
=>
<cts:class name="dterms /shakespeare/plays/hamlet.xml" offset="0" xmlns:cts="http://marklogic.com/cts">
  <cts:term id="7783238741996929314" val="981" score="981" confidence="0.811494" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">guildenstern</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="4731147985682913359" val="956" score="956" confidence="0.801087" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">polonius</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="1100490632300558572" val="949" score="949" confidence="0.798149" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">horatio</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
</cts:class>

The output is a cts:class element, and each child is a cts:term element. The cts:term elements represent terms in a database, identified by a cts:query. Each term has numbers for val, score, confidence, and fitness.

The val and score attributes are values that approximate the score contribution of that term. The confidence attribute represents the cts:confidence value for the term. The fitness attribute represents the cts:fitness value for the term. For details on score, fitness, and confidence, see Relevance Scores: Understanding and Customizing.
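To make the structure of the output concrete, here is a small sketch that walks the cts:term children and prints each term's text with its score and confidence. It uses the same options as the query above; the output format of the concatenated string is purely illustrative.

```xquery
xquery version "1.0-ml";

let $class := cts:distinctive-terms(
  fn:doc("/shakespeare/plays/hamlet.xml"),
  <options xmlns="cts:distinctive-terms"
           xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>3</max-terms>
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:fast-element-word-searches>false</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
  </options>)
for $term in $class/cts:term
(: Highest-scoring terms first. :)
order by xs:double($term/@score) descending
return fn:concat(
  (: Join every cts:text under the term, in case a term's query
     holds more than one text value. :)
  fn:string-join($term//cts:text/fn:string(), " "),
  " score=", $term/@score,
  " confidence=", $term/@confidence)
```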

The previous query considers only word-query terms. You can also have cts:element-word-query terms and cts:near-query terms, for terms that are within an element or that are a word pair (a cts:near-query with a distance of 1). To see some of these kinds of terms, try running a query like the following:

let $node := doc("/shakespeare/plays/hamlet.xml") 
return cts:distinctive-terms($node, 
   <options xmlns="cts:distinctive-terms"
         xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>100</max-terms>  
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
    <db:fast-element-word-searches>true</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
   </options>)

This query enables the db:fast-phrase-searches, db:fast-element-word-searches, and db:fast-element-phrase-searches options, which cause word-pair terms and terms constrained to a particular element to appear in the output. Changing the database options passed to cts:distinctive-terms and looking at the differences in the output will help you understand how the index options affect which terms are distinctive. Because cts:similar-query can use these same settings, it will also help you understand how cts:similar-query decides whether a document is similar to the model nodes.

Example Design Pattern: Making a Tag Cloud

A tag cloud is a popular visualization that shows various terms, usually relevant to a search, with the more relevant ones in a larger and/or more colorful font. You can use cts:distinctive-terms to feed the data used to make a tag cloud. The basic design pattern is as follows:

  • Experiment with options to create a cts:distinctive-terms query that produces results you are happy with.
  • Set a max-terms size that is equal to the number of terms you want in your tag cloud.
  • Come up with some algorithm to convert score (or fitness) into font size. For example, you might want to take the fitness and multiply it by 20 to get a font size.
  • Use the above algorithm to iterate through your results and generate some HTML that creates a tag cloud.

The following sample code is a simplified example of this design pattern:

xquery version "1.0-ml";

let $hits := 
  let $terms :=
   let $node := doc("/shakespeare/plays/hamlet.xml")//LINE
   return cts:distinctive-terms($node, 
   <options xmlns="cts:distinctive-terms"
         xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>100</max-terms>  
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:fast-element-word-searches>false</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
   </options>)//cts:term
  for $wq in $terms
  where $wq/cts:word-query
  return element word {
           attribute score {
             fn:round(($wq/@val div 20))},
           $wq/cts:word-query/cts:text/string() }
return <p>{
  for $hit in $hits
  order by $hit/string()
  return (
    <span style="font-size: {$hit/@score}px">{$hit/string()}</span>,
    " " )
}</p>

The above query returns HTML which, when displayed in a browser, shows the 100 most distinctive terms in alphabetical order, with the most relevant terms in a larger font.
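Dividing val by a fixed number works when the values fall in a convenient range, but the resulting sizes depend on the data. As one possible alternative, the sketch below scales each val linearly between a minimum and maximum font size. The function local:font-size is a hypothetical helper, not part of any MarkLogic API, and the 12 to 36 pixel range is an arbitrary choice.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not part of any MarkLogic API: linearly scale
   $val from the range [$lo, $hi] into a font size in [$min-px, $max-px]. :)
declare function local:font-size(
  $val as xs:double,
  $lo as xs:double,
  $hi as xs:double,
  $min-px as xs:double,
  $max-px as xs:double
) as xs:integer
{
  if ($hi le $lo)
  then xs:integer(fn:round($min-px))  (: all terms share one val :)
  else xs:integer(fn:round(
    $min-px + (($val - $lo) div ($hi - $lo)) * ($max-px - $min-px)))
};

(: val attributes from the three-term sample output earlier in this chapter :)
let $vals := (981, 956, 949)
let $lo := fn:min($vals)
let $hi := fn:max($vals)
for $v in $vals
return local:font-size($v, $lo, $hi, 12, 36)
```

For the three sample values, this maps 981 to 36, 956 to 17, and 949 to 12. In the tag cloud query, you could compute $lo and $hi from the val attributes of the returned cts:term elements and call the helper in place of the division by 20.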
