Web Services Team
UW-Extension Information Systems
Google Mini Search Appliance CFC
The new Google search appliance (actually a Google Mini) has many of the same capabilities as the Google search engine. As such, I was able to create a new version of the search CFC specifically for the Google Mini. This document is a quick rundown of how to use the Google Mini and all its wonderful features.
Introduction
This CFC is only in its first revision. As such, it is subject to change and may contain bugs. However, the methods for searching are nearly identical to those of the other search CFCs out there: google.cfc and search.cfc.
The CFC exists on all three instances of the CFMX servers. Most of the examples assume you're coding in cfscript. You can create an instance of the CFC by using:
// googlesa is short for google search appliance
search = createobject("component", "cfc.googlesa");
You can inspect the contents, including the functions available to you, by dumping the object (not in cfscript):
<cfdump var="#search#">
The key functions are:
- doSearch() // returns a query
- doSearchCollection() // returns a query
- doSearchLimit() // returns a query
- getRecordCount() // returns a number
- getSpellingSuggestion() // returns a string
- getKeywordMatch() // returns a query
- getSynonyms() // returns a query
These functions are discussed in more detail below.
Basic search querying
The basics of searching have not changed from other incarnations of the search CFCs. There are three basic searching functions: doSearch(), doSearchCollection() and doSearchLimit().
In general, the parameter list for all three functions is similar. They each require the search string as the first parameter. In addition, the functions can all optionally take the start position and the number of records to return as the last two parameters.
Each of these functions returns a ColdFusion query containing the results of the search. The individual records in the result contain the following columns:
- url: the address of the page
- title: the displayable name of the document
- rank: the score, although it is usually not reported
- date: the age of the document, usually from the DateLastModified property of the document
- description: the brief description returned by Google, including contextual highlighting of search terms
- cacheURL: the URL of the Google cache of the document
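As a rough illustration of working with these columns, the sketch below loops over a result query and builds a simple HTML listing. The function name renderResults is hypothetical and not part of the CFC; it only assumes the url, title and description columns described above, and it leaves the description unescaped because Google already includes highlighting markup in it.

```cfml
<cfscript>
// Hypothetical helper (not part of googlesa.cfc): builds a simple HTML
// listing from a result query using the documented columns.
function renderResults(results) {
    var html = "";
    var i = 0;
    for (i = 1; i LTE results.recordcount; i = i + 1) {
        // Bracket notation avoids any confusion with the built-in URL scope
        html = html & '<p><a href="' & results["url"][i] & '">'
             & results["title"][i] & '</a><br>'
             & results["description"][i] & '</p>';
    }
    return html;
}
</cfscript>
```

With a search object created as shown earlier, a call might look like: writeOutput(renderResults(search.doSearch("my search string")));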
doSearch()
The first function, doSearch(), performs a search across the entire UW-Extension index of files. This search includes results from WPR, WPT, and Extension. You can run this search as follows:
results = search.doSearch("my search string");
This search returns the first set of results for "my search string": it starts at the first record and returns the default number of results, currently 10.
You can alter the default set of results by supplying the second and third parameters, which specify the "startRow" and the "maxRecords", respectively. This allows the developer to repeatedly search the Google index and create paged results. For example, to show the next 20 results starting at the 100th record:
results = search.doSearch("my search string", 100, 20);
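For paged results, the start row for a given page can be computed rather than hard-coded. The sketch below is only an illustration: pageStartRow is a hypothetical helper, and whether the CFC counts rows from 1 or from 0 is an assumption that should be checked against real results, which is why the first row position is passed in explicitly.

```cfml
<cfscript>
// Hypothetical helper: first row of a given results page.
// firstRow is the position of the very first record. Whether the CFC
// counts from 1 or from 0 is an assumption to verify, so it is a parameter.
function pageStartRow(page, perPage, firstRow) {
    return firstRow + (page - 1) * perPage;
}

perPage = 20;
// Third page of 20 results, assuming rows are counted from 1
startRow = pageStartRow(3, perPage, 1);
</cfscript>
```

The computed values would then be passed straight through, for example: results = search.doSearch("my search string", startRow, perPage);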
A later section discusses how to determine the record count correctly.
doSearchCollection()
The second function, doSearchCollection(), searches a limited set of documents as specified in the Google Mini administrator. The results will not, by definition, include every document in the Extension index, but rather a subset called a "collection". There are currently eight collections. Each collection has a title and a name, and collections are referenced by their names. The eight collections are listed below (in title: name format):
- Business services forms: bsvcs_forms
- Business services policies: bsvcs_policy
- Cooperative extension: ces
- Extension news: ext_news
- SBDC: sbdc
- Extension only: uwex
- WPR: wpr
- WPT: wpt
Other collections are available upon request. Collections simply contain a list of URLs which act as a pre-filter to your search results. For example, the uwex collection contains only pages with "uwex.edu" in the URL. This includes www.uwex.edu, conferencing.uwex.edu, https://www.uwex.edu, etc.
A collection search requires both a search string and a collection name. You can do a collection search by using the following syntax:
results = search.doSearchCollection("my search string", "collectionname");
Like doSearch(), doSearchCollection() can also take the starting record position and the maximum number of records to return. For example, to show the next 20 results starting at the 100th record:
results = search.doSearchCollection("my search string", "collectionname", 100, 20);
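If the collection name comes from user input (a form field, say), it may be worth checking it against the known names before searching. The sketch below is one possible way to do that; isKnownCollection is a hypothetical helper, the list is copied from the eight names above and would need updating when collections are added, and the case-sensitive comparison is a conservative assumption.

```cfml
<cfscript>
// Collection names copied from the list above; update as collections are added.
knownCollections = "bsvcs_forms,bsvcs_policy,ces,ext_news,sbdc,uwex,wpr,wpt";

// Hypothetical helper: true when name is one of the listed collections.
// listFind is case-sensitive, which is a conservative assumption here.
function isKnownCollection(name, collections) {
    return listFind(collections, trim(name)) GT 0;
}
</cfscript>
```

A page could then call doSearchCollection() when isKnownCollection(collectionName, knownCollections) is true and fall back to doSearch() otherwise.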
doSearchLimit()
The final type of search is a limited search using the doSearchLimit() function. A limited search restricts the ordinary search to a fully qualified URI. This is a change from other search CFCs. The limited search allows the developer to create searches restricted to a particular location on the web. As an example, the new Extension web site allows the visitor to "Search this area only" by using a limited search.
The fully qualified URI must contain both the server name and the path. Unlike Verity, Google will not accept just any arbitrary string. So, to limit a search, the user must specify: server.name/path/to/search. You do not need to include the protocol.
Appending a trailing slash in a limited search will restrict results to only that directory and will not include subdirectories. For example, limiting a search to "server.name/path/to/" will find the page "server.name/path/to/file.cfm", but not "server.name/path/to/search/file.cfm". However, both files would be found if the search was limited to "server.name/path/to".
A limited search requires both a search string and a fully qualified URI. The following will return a query with any files within "uwex.edu/ces":
results = search.doSearchLimit("my search string", "uwex.edu/ces");
Like the previous two functions, doSearchLimit() can also take the starting record position and the maximum number of records. For example, to show the next 20 results starting at the 100th record:
results = search.doSearchLimit("my search string", "uwex.edu/ces", 100, 20);
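Because the trailing slash changes which documents are found, it can help to normalize the URI before passing it to doSearchLimit(). The following is a minimal sketch, not part of the CFC: limitUri is a hypothetical helper that drops the protocol and then adds or removes the trailing slash depending on whether subdirectories should be included.

```cfml
<cfscript>
// Hypothetical helper: prepare a URI for doSearchLimit().
// Drops the protocol (it is not needed) and adds or removes the trailing
// slash depending on whether subdirectories should be searched.
function limitUri(uri, includeSubdirectories) {
    var cleaned = REReplaceNoCase(trim(uri), "^https?://", "");
    // Remove any trailing slashes first
    cleaned = REReplace(cleaned, "/+$", "");
    if (NOT includeSubdirectories) {
        // A trailing slash restricts results to this directory only
        cleaned = cleaned & "/";
    }
    return cleaned;
}
</cfscript>
```

For example, results = search.doSearchLimit("my search string", limitUri("https://www.uwex.edu/ces/", true)); would search www.uwex.edu/ces and everything beneath it.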
Recordcounts: Google is not to be believed
The Google Mini reports only an "estimated" number of records for a given search. The actual number of records may be far lower than the number reported, so you cannot entirely rely on it.
The getRecordCount() function returns the "estimated" number of results from the Google search appliance, as a number. One of the three search functions must be called before calling this function. The number will usually differ from the recordcount reported by the query that the search function returns.
The search function returns a query containing only a subset of all the possible results. By default, it returns only the first ten. Therefore, results.recordcount will be at most 10 and will not accurately reflect all the possible results. If a particular search returns fewer than 10 results, the two counts may, of course, be the same.
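One way to present the two numbers together is a label that shows the rows actually returned and marks the total as approximate. The sketch below is an illustration only; resultsLabel is a hypothetical helper, and it assumes the start row passed in is the 1-based position of the first row shown.

```cfml
<cfscript>
// Hypothetical helper: label such as "Results 1 - 10 of about 340".
// Assumes startRow is the 1-based position of the first row shown.
function resultsLabel(startRow, returnedRows, estimatedCount) {
    if (returnedRows LT 1) {
        return "No results";
    }
    return "Results " & startRow & " - " & (startRow + returnedRows - 1)
         & " of about " & estimatedCount;
}
</cfscript>
```

After a search, the call might look like: writeOutput(resultsLabel(1, results.recordcount, search.getRecordCount()));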
If you request a range of documents exceeding the number of actual results, Google will simply return a single record containing the very last record.
Recognizing "the actual last document"
There is no simple way to retrieve the "actual recordcount" from Google. Instead, the developer can infer when they have reached the last record using the following logic:
// Fewer rows than requested means the last record has been reached
if (results.recordcount AND results.recordcount LT variables.maxResultsPerPage)
    recordcount = variables.resultStartingRow + results.recordcount;
This statement says that if at least one record was found and the number of records returned by Google is fewer than the maximum number requested, then the record count is the sum of the starting row and the number of records returned this time. This is only a pseudo-actual number. As mentioned before, requesting a range of documents exceeding the actual number will always return one record, so it is likely that there are actually fewer documents. Nevertheless, this check will help with your pagination script, if you have one.
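The same check can be wrapped in a small function so a pagination script can fall back to the Google estimate until the last set of results is reached. This is a sketch under the same logic as the statement above, not a definitive implementation; inferRecordCount and its parameter names are hypothetical.

```cfml
<cfscript>
// Hypothetical helper wrapping the check above.
// Returns the inferred count when fewer rows than requested came back,
// otherwise the Google estimate passed in by the caller.
function inferRecordCount(returnedRows, startRow, maxPerPage, estimatedCount) {
    if (returnedRows GT 0 AND returnedRows LT maxPerPage) {
        // Same formula as the statement above: starting row plus rows returned
        return startRow + returnedRows;
    }
    return estimatedCount;
}
</cfscript>
```

A call might look like: total = inferRecordCount(results.recordcount, startRow, perPage, search.getRecordCount()); Note that when the final set of results happens to contain exactly the maximum number of rows, the check does not fire and the estimate is still returned.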
Additional functions
There are three other very interesting functions: getSpellingSuggestion(), getKeywordMatch(), and getSynonyms(). Generally, Google only reports values for these functions when returning the first record set. Therefore, a paginated search results page will be able to show the results of these functions only on the very first page.
In addition, these functions require that one of the search functions be run first.
Checking spelling
Google actually has a spellchecker that works! Where Google recognizes a possible typo in a search query, the "corrected search query" gets stored in the CFC. The getSpellingSuggestion() function returns a string. As always, to retrieve a spelling suggestion, one of the three search functions has to run first.
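A typical use of the suggestion is a "Did you mean" link on the first results page. The sketch below shows one way this might look. The helper didYouMean, the search page address and the q URL parameter are all made up for illustration, and it assumes that the function returns an empty string when Google has no suggestion and that the suggestion is plain text.

```cfml
<cfscript>
// Hypothetical helper: build a "Did you mean" link from a suggestion.
// Assumes an empty string means no suggestion. searchPage and the "q"
// URL parameter are made-up names for illustration.
function didYouMean(suggestion, searchPage) {
    if (NOT len(trim(suggestion))) {
        return "";
    }
    return 'Did you mean <a href="' & searchPage & '?q='
         & URLEncodedFormat(suggestion) & '">'
         & HTMLEditFormat(suggestion) & '</a>?';
}
</cfscript>
```

After running a search, the call might look like: writeOutput(didYouMean(search.getSpellingSuggestion(), "search.cfm"));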
Keyword Matches
In the administrator, Google allows for a few hand-specified results. That is, when a particular search string is entered, Google will return the manually entered search result apart from the ordinary results. On the Extension search page, the first result page for a search on "ces" will have a "Top Results" section including a link for "Cooperative Extension".
For obvious reasons, a large number of keyword matches will be difficult to maintain and will slow down the server. Suggestions for additional matches will be taken into consideration.
The function getKeywordMatch() returns a query containing at most three "matches". Most search queries will not report any results.
The query contains two columns: url and title. URL is the address manually entered in the administrator. Title is the name of the result, again, as specified manually in the administrator.
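As an illustration, the keyword matches could be rendered as a short "Top Results" list above the ordinary results. The helper below is a hypothetical sketch that relies only on the url and title columns described above and returns an empty string when there are no matches.

```cfml
<cfscript>
// Hypothetical helper: render keyword matches as a "Top Results" list.
// Uses the two documented columns, url and title.
function renderKeywordMatches(matches) {
    var html = "";
    var i = 0;
    if (matches.recordcount EQ 0) {
        return "";
    }
    html = "<h3>Top Results</h3><ul>";
    for (i = 1; i LTE matches.recordcount; i = i + 1) {
        html = html & '<li><a href="' & matches["url"][i] & '">'
             & HTMLEditFormat(matches["title"][i]) & '</a></li>';
    }
    return html & "</ul>";
}
</cfscript>
```

On the first results page, the call might look like: writeOutput(renderKeywordMatches(search.getKeywordMatch()));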
Synonyms
The final function is getSynonyms(). Synonyms are manually entered alternative search suggestions: phrases offered in place of a search that may return very bad results and for which there is probably a better query. Get that? On the Extension search page, the first results page for the search query "ces" will have a "Search suggestion" of "Cooperative Extension". In this case, Google reports the manually specified alternative to "ces".
Synonyms are very specific. A search for "ces books" will not return the synonym. Like keywords, a large list can quickly become unmaintainable and will slow down the search process.
The function getSynonyms() returns a query containing any synonyms entered for the given query string. The query contains only a single column: title. Title contains the "search alternative."
That's all
That's all there is to know in order to use the googleSA.cfc. Good luck!