Web Search Service

Indexing and searching of Columbia University websites

CUIT’s Web Services team manages the search index and search functionality at Columbia through a CUIT-maintained Google Search Appliance (GSA). The GSA is a special hardware/software combination housed at Columbia. Having our own search appliance allows us to customize the indexing process, collection definitions, the format of results pages, and other aspects of search.

The Google-powered search engine provides:

  • Fast results
  • Relevance matching using Google's proprietary algorithm
  • Sorting by date
  • Inclusion of personal web pages in search results

The following services are offered:

  • Basic indexing and implementation for CUIT-hosted sites
  • A unique search collection and front end (results design) for sites that are not hosted by CUIT
  • Site indexing for sites that are not managed by CUIT

You can also submit a ticket to the team if you need the following:

  • To have a web page indexed immediately
  • To have a page removed from the index
  • To ask a question or report a problem related to searching

Guides

You can add a basic search box to your web site to help visitors find content.

1. Insert the following HTML into your web page where you want the search box to appear:

<form method="get" action="http://search.columbia.edu/search">
<input type="text" name="q" alt="Search" value="" maxlength="256" size="32" />
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="Columbia" />
<input type="hidden" name="client" value="columbia" />
<input type="hidden" name="proxystylesheet" value="columbia2" />
<input type="hidden" name="output" value="xml_no_dtd" />
<input type="hidden" name="filter" value="0" />
</form>

2. Customize the following required parameters:

  • Name="q" size="32"

Sets the width (in number of characters) of the search box. You can change the size to suit your site's layout, but don't change the maxlength value.

  • Name="btnG" value="Search"

The text that appears on the search button, e.g.: value="Search Departmental Site"

  • Name="client" value="columbia"
  • name="proxystylesheet" value="columbia" 
  • name="site" value="Columbia" 


The client parameter specifies the Google frontend your search will use. The proxystylesheet parameter specifies the XSLT stylesheet. You will probably not change these unless you have a custom frontend built for your site.

Note: In most cases, the client value should be identical to the proxystylesheet value. The only time you would want different values for the two parameters is when you want to retain the frontend's KeyMatch, Synonyms, Filters, and Remove URL settings, but change to a different output format.

The site parameter specifies the Google collection. The default value="Columbia" searches the entire Columbia Google collection. To search a different collection, change the value to the name of the collection you want to use.

  • e.g.: value="CUIT"

If you want to restrict your search feature to a specific directory (and its subdirectories), include the following two lines in your form:

<input type="hidden" name="as_dt" value="i"/>
<input type="hidden" name="as_sitesearch" value="<yoururl>"/>

If you don't include these lines, the search feature on your site will search the entire Columbia collection.

  • Name="as_dt" value="i"
    This setting determines whether your search should include or exclude the directory specified in "as_sitesearch". Values can be:

    • "i" (include only results in the web directory specified by as_sitesearch)
    • "e" (exclude all results in the web directory specified by as_sitesearch)
       
  • Name="as_sitesearch" value="<yoururl>"
    Pages in the specified directory will be included in or excluded from your search (according to the value of "as_dt").
    e.g.: name="as_sitesearch" value="www.columbia.edu/cu/biology"
    • You must specify the complete canonical name of the host server followed by the path of the directory.
      e.g.: www.columbia.edu/services not www/services
    • If you include a slash ("/") character at the end of the web directory path, only files directly within that directory will be searched and files in sub-directories will not be considered. e.g.:
      • www.columbia.edu/services to include sub-directories
      • www.columbia.edu/services/ to exclude sub-directories
    • as_sitesearch allows you to specify only one directory (and all its sub-directories) as the domain to be searched; you cannot specify multiple disparate directories using this option. A complete example form follows this list.
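Putting these pieces together, a form that searches only the biology directory from the example above might look like this sketch (the directory path is illustrative; replace it with your own):

<form method="get" action="http://search.columbia.edu/search">
<input type="text" name="q" alt="Search" value="" maxlength="256" size="32" />
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="Columbia" />
<input type="hidden" name="client" value="columbia" />
<input type="hidden" name="proxystylesheet" value="columbia2" />
<input type="hidden" name="output" value="xml_no_dtd" />
<input type="hidden" name="filter" value="0" />
<!-- include only results from this directory and its sub-directories -->
<input type="hidden" name="as_dt" value="i" />
<input type="hidden" name="as_sitesearch" value="www.columbia.edu/cu/biology" />
</form>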

If you want to restrict your search feature to more than one specific directory (and their subdirectories), include the following in your form instead of the as_dt and as_sitesearch parameters:

<input type="hidden" name="as_oq" value="<firsturl secondurl>"/>
  • Name="as_oq" value="<firsturl secondurl>"
    This parameter adds one or more search terms (or URLs), combined with boolean OR. e.g. as_oq="http://www.worldleaders.columbia.edu http://www.sipa.columbia.edu"/
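For example, a hidden input that allows results from the two sites mentioned above would look like this:

<input type="hidden" name="as_oq" value="http://www.worldleaders.columbia.edu http://www.sipa.columbia.edu"/>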

Use the following code to include a drop-down list of different areas of your site from which people can choose.

<form method="get" action="http://search.columbia.edu/search">
<input type="text" name="q" alt="Search" value="" maxlength="256" size="32"/>
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="Columbia"/>
<input type="hidden" name="client"value="columbia" />
<input type="hidden" name="proxystylesheet" value="columbia2" />
<input type="hidden" name="output" value="xml_no_dtd"/>
<input type="hidden" name="filter" value="0"/>
<input type="hidden" name="as_dt" value="i"/>

<br />Select an area to search:
<select name="as_sitesearch">
<option value="url1">Section Name 1
<option value="url2">Section Name 2
</select>
</form>

For each URL/Name pair, enter a complete URL to limit the search and a label to describe it, e.g.

<option value="http://www.law.columbia.edu/faculty">Faculty Directory
  • You must specify the complete canonical name of the host server followed by the path of the directory.
    e.g.: www.columbia.edu/services not www/services
     
  • If you include a slash ("/") character at the end of the web directory path, only files directly within that directory will be searched and files in sub-directories will not be considered. e.g.:
    • www.columbia.edu/services to include sub-directories
    • www.columbia.edu/services/ to exclude sub-directories

You can include as many options in the drop-down menu as you want, although the sample only shows two.

Tip: You will probably want to include an option to search your entire site. Place it first in the list, and it will appear as the default selection:

<option value="http://www.law.columbia.edu">Entire Law School
<option value="http://www.law.columbia.edu/faculty">Law School Faculty

There are several ways to prevent some or all of your web pages from being indexed:

  • Use a robots meta tag (entire page)
  • Use googleoff/googleon tags (partial page)
  • Use a no_crawl directory (entire directory, multiple pages)
  • Use a robots.txt file (entire site)

If you need to get a page out of the index urgently, contact the CUIT Service Desk by submitting a ticket or calling 212-854-1919.

Use a robots meta tag

If you don't want a page to be indexed, you can insert this <meta> tag within your page's HEAD section:

<meta name="robots" content="noindex, nofollow">

This tells all robots (not just Columbia's search engine) not to index the page, and not to follow any links from the page. If the page has already been indexed, it will be removed from the index the next time Google crawls the page.

You should put this tag on all pages you don't want indexed.

If you have an entire directory of files you don't want indexed, consider putting them in a no_crawl directory.

If you want a page indexed but do not want any of the links on the page to be followed, you can use the following instead:

<meta name="robots" content="index, nofollow">

Use googleoff/googleon tags

By embedding googleoff/googleon tags with their flags in your HTML page, you can disable:

  • The indexing of a word or portion of a web page
  • The indexing of anchor text
  • The use of text to create a snippet in search results

Each googleoff/googleon tag takes one of the following flags: index (words between the tags are not indexed), anchor (anchor text between the tags is not associated with the page it links to), snippet (text between the tags is not used to create snippets), or all (turns off all three behaviors at once).
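For example, to keep a block of repeated footer text out of the index, you might wrap it in googleoff/googleon comments like this (the surrounding content is illustrative):

<p>Spring seminar schedules are posted at the start of each semester.</p>
<!--googleoff: index-->
<p>This footer text is repeated on every page of the site.</p>
<!--googleon: index-->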

Use a no_crawl directory

The Google Search Appliance will not crawl any directory named "no_crawl." You can keep files and directories out of the index by creating a directory called "no_crawl" and putting all the files you want to hide from Google inside.

Using a "no_crawl" directory does not provide directory security or block people from accessing the directory. 

Use a robots.txt file

If you run your own web server and don't want any pages to be visited by one or more robots, you can use a robots.txt file. For more information about how to do this, refer to the Robots Exclusion standard at http://www.robotstxt.org/wc/exclusion.html.
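As an illustration, a robots.txt file placed at the root of your server that blocks all crawlers from two directories might look like this (the directory names are examples only):

User-agent: *
Disallow: /internal/
Disallow: /drafts/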

When working with the Google Search Appliance, use these tips and guidelines provided by Google to improve the search experience for users trying to find your content.

Content and Design
  • Make web pages for users, not for search engines

Create a useful, information-rich content site. Write pages that clearly and accurately describe your content. Don't load pages with irrelevant words. Think about the words users would type to find your pages, and make sure that your site actually includes those words within it.

  • Focus on text

Focus on the text on your site. Make sure that your TITLE and ALT tags are descriptive and accurate. Since the Google crawler doesn't recognize text contained in images, avoid using graphical text and instead place information within the alt and anchor text of pictures. When linking to non-HTML documents, use strong descriptions within the anchor text that describe the links your site is making.
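For example, an image and a link to a non-HTML document might carry their information in alt and anchor text rather than in the graphic itself (the file names here are hypothetical):

<img src="lab-photo.jpg" alt="Students working in the molecular biology teaching lab" />
<a href="annual-report.pdf">Biology Department annual report (PDF)</a>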

  • Make your site easy to navigate

Make a site with a clear hierarchy of hypertext links. Every page should be reachable from at least one hypertext link. Offer a site map to your users with hypertext links that point to the important parts of your site. Keep the links on a given page to a reasonable number (fewer than 100).

  • Ensure that your site is linked

Ensure that your site is linked from all relevant sites within your network. Interlinking between sites and within sites gives the Google crawler additional ability to find content, as well as improving the quality of the search.

Technical
  • Make sure that the Google crawler can read your content

Validate all HTML content to ensure that the HTML is well-formed. Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If extra features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine crawlers may have trouble crawling your site.

Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in multiple copies of the same document being indexed for your site, as crawl robots will see each unique URL (including session ID) as a unique document.

Ensure that your site's internal link structure provides a hypertext link path to all of your pages. The Google search engine follows hypertext links from one page to the next, so pages that are not linked to by others may be missed. Additionally, you should consult the administrator of your Google Search Appliance to ensure that your site's home page is accessible to the search engine.

  • Use robots' standards to control search engine interaction with your content

Make use of the robots.txt file on your web server. This file tells crawlers which files and directories can or cannot be crawled, including various file types. If the search engine gets an error when getting this file, no content will be crawled on that server. The robots.txt file will be checked on a regular basis, but changes may not have immediate results. Each port (including HTTP and HTTPS) requires its own robots.txt file.

Use robots meta tags to control whether individual documents are indexed, whether the links on a document should be crawled, and whether the document should be cached. The "NOARCHIVE" value for robots meta tags is supported by the Google search engine to block cached content, even though it is not mentioned in the robots standard.
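For example, to let a page be indexed but prevent a cached copy from being served, you could add the following to the page's HEAD section:

<meta name="robots" content="noarchive">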

For information on how robots.txt files and ROBOTS meta tags work, see the Robots Exclusion standard at:
http://www.robotstxt.org/wc/exclusion.html

If the search engine is generating too much traffic on your site during peak hours, contact your Google Search Appliance administrator to have the crawl traffic adjusted.

  • Let the search engine know how fresh your content is

Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell the Google Search Appliance whether your content has changed since it last crawled your site. Supporting this feature saves you bandwidth and overhead. Columbia's webservers support this feature.

  • Understand why some documents may be missing from the index

Each time that the Google Search Appliance updates its database of web pages, the documents in the index can change. Here are a few examples of reasons why pages may not appear in the index.

  • Your content pages may have been intentionally blocked by a robots.txt file or ROBOTS meta tags.
  • Your web site was inaccessible when the crawl robot attempted to access it, due to network or server outage. If this happens, the Google Search Appliance will retry multiple times; but if the site cannot be crawled, it will not be included in the index.
  • The Google crawl robot cannot find a path of links to your site from the starting points it was given.
  • Your content pages may not be considered relevant to the query you entered. Ensure that the query terms exist on your target page.
  • Your content pages contain invalid HTML code.
  • Your content pages were manually removed from the index by the Google Search Appliance administrator.

If you still have questions, contact your Google Search Appliance administrator to get more information.

  • Avoid using frames

The Google search engine supports frames to the extent that it can. Frames tend to cause problems with search engines, bookmarks, e-mail links and so on, because frames don't fit the conceptual model of the web (where every document corresponds to a single URL).

Searches that return framed pages will most likely only produce hits against the "body" HTML page and present it back without the original framed "Menu" or "Header" pages. Google recommends that you use tables or dynamically generate content into a single page (using ASP, JSP, PHP, etc.), instead of using FRAME tags. This will ultimately maintain the content owner's originally intended look and feel, as well as allow most search engines to properly index your content.

  • Avoid placing content and links in script code

Most search engines do not read any information found in SCRIPT tags within an HTML document. This means that content within script code will not be indexed, and hypertext links within script code will not be followed when crawling. When using a scripting language, make sure that your content and links are outside SCRIPT tags. Investigate alternative HTML techniques for creating dynamic web pages, such as HTML layers.

Search Collection Definitions

Columbia's main search collection includes all the web pages on the main Columbia website, including personal web pages, as long as they are not excluded by:

  • The Google administrators
  • A noindex meta tag in the page's HTML
  • Password protection or restricted-access files/directories

Web Pages Excluded

The following web pages have been excluded by the Google administrators.
  • Dynamically-generated content
  • Event calendars
  • Specific pages, at the request of their owners

Pages are excluded for a variety of system performance, copyright, license, and University policy reasons. If you think your page may have been excluded and you don't want it to be, please submit a ticket.

Crawling Schedule

The GSA is configured to crawl the entire Columbia website continuously. If your new page must be included in search results immediately, or if you have questions about the indexing of your content, please submit a ticket.

KeyMatches

KeyMatches allow you to promote specific web pages on your site by associating specific search terms, such as housing, with a set of web pages. The site with the KeyMatch appears at the top of search results with the label Suggested Link.

Related Queries 

Related Queries can be used to suggest alternate words or phrases for search queries. For example, if you search the main Columbia collection for macintosh, you will see "You could also try: Apple" at the top of your search results.

Submit a ticket to CUIT if you would like to have a KeyMatch or Related Query added to the main Columbia collection/front end.

Collections

A collection is a list of URL patterns that can be referred to by a single name, such as Libraries. When a search is restricted to the collection called Libraries, the query returns only search results matching the URL patterns specified in that collection's definition. Schools and other large organizations within the University may be eligible to define a collection in the Google Search Appliance. A collection may be appropriate when:

  • The organization wishes to have a custom search page for its website, and
  • The organization's web presence includes three or more distinct URL patterns. (Searches over fewer URL patterns can be accomplished by adding parameters to the search query, as described in the guides above.)

As a general policy, CUIT only allows one collection per school or large organization. A collection name will be assigned by CUIT once it's been approved. Defining and using a collection requires startup effort from both the CUIT Google team and the web site developer.


Request a New Collection

Submit a ticket for new collections, including the following:

  • Name of the school or organization
  • Contact person and contact information
  • The list of URL patterns you want included in your collection

New collections will use the Columbia front end and stylesheet if you do not also request a custom front end.

Custom Front Ends and Stylesheets

Front ends allow search administrators to create different search and search results pages by editing their XSLT stylesheets. Administrators can also edit KeyMatch, Related Query, and Filter information, or remove URLs from a specific front end.

  • KeyMatch lets you promote specific web pages on your site by associating specific search terms, such as housing, with a set of web pages. The site with the KeyMatch appears at the top of search results with the label Suggested Link.
  • Related queries can be used to suggest alternate words or phrases for search queries.
  • Filters can restrict searches based on domain, language, file types, or meta tags.
  • Removing URLs from a front end prevents particular URLs from being served in search results.

Request a New Front End
Submit a ticket for new front ends, including the following:

  • Name of the school or organization
  • Contact person and contact information
  • The desired name for the front end (all lower-case letters without spaces)

There is no charge for having a new front end set up if you are going to create the associated stylesheets yourself.

CUIT Web Services will create a custom front end for you for a basic fee of $1600.