Search Engine API

Invenio Search Engine can be called from within your Python programs
via both a high-level and low-level API interface.

1. High-level API

   Description:

      The high-level access to the search engine is provided by
      exactly the same function as called from the web interface when
      users submit their queries.  This should guarantee exactly the
      same behaviour, and means that you can pass to the high-level
      API all the arguments as you see them in the URL.

      There are two things to note: (i) the function does not check
      for eventual restricted status of the collection, so the
      restricted collections will be searched without asking for a
      password; (ii) the output format argument (``of'') should be set
      to ``id'' (which is the default value) meaning to return list of
      recIDs.  The function returns the list of recIDs in this case.

   Signature:

       def perform_request_search(req=None, cc=CFG_SITE_NAME, c=None, p="", f="", rg=CFG_WEBSEARCH_DEF_RECORDS_IN_GROUPS, sf="", so="d", sp="", rm="", of="id", ot="", aas=0,
                                  p1="", f1="", m1="", op1="", p2="", f2="", m2="", op2="", p3="", f3="", m3="", sc=0, jrec=0,
                                  recid=-1, recidb=-1, sysno="", id=-1, idb=-1, sysnb="", action="",
                                  d1y=0, d1m=0, d1d=0, d2y=0, d2m=0, d2d=0, dt="", verbose=0, ap=0, ln=CFG_SITE_LANG, ec=None):
          """Perform search or browse request, without checking for
             authentication.  Return list of recIDs found, if of=id.
             Otherwise create web page.

             The arguments are as follows:

               req - mod_python Request class instance.

                cc - current collection (e.g. "ATLAS").  The collection the
                     user started to search/browse from.

                 c - collection list (e.g. ["Theses", "Books"]).  The
                     collections user may have selected/deselected when
                     starting to search from 'cc'.

                ec - external collection list (e.g. ['CiteSeer', 'Google']). The
                     external collections may have been selected/deselected by the
                     user.

                 p - pattern to search for (e.g. "ellis and muon or kaon").

                 f - field to search within (e.g. "author").

                rg - records in groups of (e.g. "10").  Defines how many hits
                     per collection in the search results page are
                     displayed.  (Note that `rg' is ignored in case of `of=id'.)

                sf - sort field (e.g. "title").

                so - sort order ("a"=ascending, "d"=descending).

                sp - sort pattern (e.g. "CERN-") -- in case there are more
                     values in a sort field, this argument tells which one
                     to prefer

                rm - ranking method (e.g. "jif").  Defines whether results
                     should be ranked by some known ranking method.

                of - output format (e.g. "hb").  Usually starting "h" means
                     HTML output (and "hb" for HTML brief, "hd" for HTML
                     detailed), "x" means XML output, "t" means plain text
                     output, "id" means no output at all but to return list
                     of recIDs found.  (Suitable for high-level API.)

                ot - output only these MARC tags (e.g. "100,700,909C0b").
                     Useful if only some fields are to be shown in the
                     output, e.g. for library to control some fields.

               aas - advanced search ("0" means no, "1" means yes).  Whether
                     search was called from within the advanced search
                     interface.

                p1 - first pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f1 - first field to search within in the advanced search
                     interface.  Much like 'f'.

                m1 - first matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

               op1 - first operator, to join the first and the second unit
                     in the advanced search interface.  ("a" add, "o" or,
                     "n" not).

                p2 - second pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f2 - second field to search within in the advanced search
                     interface.  Much like 'f'.

                m2 - second matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

               op2 - second operator, to join the second and the third unit
                     in the advanced search interface.  ("a" add, "o" or,
                     "n" not).

                p3 - third pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f3 - third field to search within in the advanced search
                     interface.  Much like 'f'.

                m3 - third matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

                sc - split by collection ("0" no, "1" yes).  Governs whether
                     we want to present the results in a single huge list,
                     or splitted by collection.

              jrec - jump to record (e.g. "234").  Used for navigation
                     inside the search results.  (Note that `jrec' is ignored
                     in case of `of=id'.)

             recid - display record ID (e.g. "20000").  Do not
                     search/browse but go straight away to the Detailed
                     record page for the given recID.

            recidb - display record ID bis (e.g. "20010").  If greater than
                     'recid', then display records from recid to recidb.
                     Useful for example for dumping records from the
                     database for reformatting.

             sysno - display old system SYS number (e.g. "").  If you
                     migrate to Invenio from another system, and store your
                     old SYS call numbers, you can use them instead of recid
                     if you wish so.

                id - the same as recid, in case recid is not set.  For
                     backwards compatibility.

               idb - the same as recid, in case recidb is not set.  For
                     backwards compatibility.

             sysnb - the same as sysno, in case sysno is not set.  For
                     backwards compatibility.

            action - action to do.  "SEARCH" for searching, "Browse" for
                     browsing.  Default is to search.

                d1 - first datetime in full YYYY-mm-dd HH:MM:DD format
                     (e.g. "1998-08-23 12:34:56"). Useful for search limits
                     on creation/modification date (see 'dt' argument
                     below).  Note that 'd1' takes precedence over d1y, d1m,
                     d1d if these are defined.

               d1y - first date's year (e.g. "1998").  Useful for search
                     limits on creation/modification date.

               d1m - first date's month (e.g. "08").  Useful for search
                     limits on creation/modification date.

               d1d - first date's day (e.g. "23").  Useful for search
                     limits on creation/modification date.

                d2 - second datetime in full YYYY-mm-dd HH:MM:DD format
                     (e.g. "1998-09-02 12:34:56"). Useful for search limits
                     on creation/modification date (see 'dt' argument
                     below).  Note that 'd2' takes precedence over d2y, d2m,
                     d2d if these are defined.

               d2y - second date's year (e.g. "1998").  Useful for search
                     limits on creation/modification date.

               d2m - second date's month (e.g. "09").  Useful for search
                     limits on creation/modification date.

               d2d - second date's day (e.g. "02").  Useful for search
                     limits on creation/modification date.

                dt - first and second date's type (e.g. "c").  Specifies
                     whether to search in creation dates ("c") or in
                     modification dates ("m").  When dt is not set and d1*
                     and d2* are set, the default is "c".

           verbose - verbose level (0=min, 9=max).  Useful to print some
                     internal information on the searching process in case
                     something goes wrong.

                ap - alternative patterns (0=no, 1=yes).  In case no exact
                     match is found, the search engine can try alternative
                     patterns e.g. to replace non-alphanumeric characters by
                     a boolean query.  ap defines if this is wanted.

                ln - language of the search interface (e.g. "en").  Useful
                     for internationalization.

                ec - list of external search engines to search as well
                     (e.g. "SPIRES HEP").
          """

   Examples: (retrieving record IDs)

      >>> # import the function:
      >>> from invenio.search_engine import perform_request_search
      >>> # get all hits in a collection:
      >>> perform_request_search(cc="ATLAS Communications")
      >>> # search for the word `of' in Theses and Books:
      >>> perform_request_search(p="of", c=["Theses","Books"])
      >>> # search for `muon or kaon' within title:
      >>> perform_request_search(p="muon or kaon", f="title")
      >>> # phrase search (not the quotes):
      >>> perform_request_search(p='"Ellis, J"', f="author")
      >>> # regexp search for a system number
      >>> perform_request_search(p1="^CERN.*2003-001$", f1="reportnumber", m1="r")
      >>> # moi inside Standards gives no hits...
      >>> perform_request_search(p="moi", cc="Standards")
      >>> # but it does if we use alternative patterns:
      >>> perform_request_search(p="moi", cc="Standards", ap=1)

   Example: (retrieving MARCXML)

      >>> import cStringIO
      >>> tmp = cStringIO.StringIO()
      >>> perform_request_search(req=tmp, p='ellis', of='xm')
      >>> out = tmp.getvalue()
      >>> tmp.close()
      >>> # `out' now contains MARCXML of 12 records found

   Example: (retrieving Text MARC, certain tags only)

      >>> import cStringIO
      >>> tmp = cStringIO.StringIO()
      >>> perform_request_search(req=tmp, p='higgs', of='tm', ot=['100', '700'])
      >>> out = tmp.getvalue()
      >>> tmp.close()
      >>> print out
      000000085 100__ $$aGirardello, L$$uINFN$$uUniversita di Milano-Bicocca
      000000085 700__ $$aPorrati, Massimo
      000000085 700__ $$aZaffaroni, A
      000000001 100__ $$aPhotolab

2. Mid-level API

   Description:

      The mid-level API is provided by a search_pattern() function
      that only searches for the given pattern in the given field
      according to the given matching pattern.  This function does not
      know anything about collection.  The function does not wash its
      arguments, it expects them to be `clean' already.  The pattern
      is split into `basic search units' for which a boolean query is
      launched.  The function returns an instance of the intbitset class.
      Note that if you want to obtain the list of recIDs (as with the
      high-level API), you can invoke the ``tolist()'' method on a
      hitset.

   Signature:

      def search_pattern(req=None, p=None, f=None, m=None, ap=0, of="id", verbose=0):
          """Search for complex pattern 'p' within field 'f' according to
             matching type 'm'.  Return hitset of recIDs.

             The function uses multi-stage searching algorithm in case of no
             exact match found.  See the Search Internals document for
             detailed description.

             The 'ap' argument governs whether an alternative patterns are to
             be used in case there is no direct hit for (p,f,m).  For
             example, whether to replace non-alphanumeric characters by
             spaces if it would give some hits.  See the Search Internals
             document for detailed description.  (ap=0 forbits the
             alternative pattern usage, ap=1 permits it.)

             The 'of' argument governs whether to print or not some
             information to the user in case of no match found.  (Usually it
             prints the information in case of HTML formats, otherwise it's
             silent).

             The 'verbose' argument controls the level of debugging information
             to be printed (0=least, 9=most).

             All the parameters are assumed to have been previously washed.

             This function is suitable as a mid-level API.
          """

   Examples:

      >>> # import the function:
      >>> from invenio.search_engine import search_pattern
      >>> # search for muon or kaon in any field:
      >>> search_pattern(p="muon or kaon").tolist()
      >>> # the following finds nothing by default...
      >>> search_pattern(p="cern-moi").tolist()
      >>> # ...but it does find something if we allow alternative patterns:
      >>> search_pattern(p="cern-moi", ap=1).tolist()
      >>> # wildcard search for a report number:
      >>> search_pattern(p="CERN-LHC-PROJECT-REPORT-40*", f="reportnumber").tolist()
      >>> # regexp search for a report number with possible trailing subjects:
      >>> search_pattern(p="^CERN-LHC-PROJECT-REPORT-40(-|$)", f="reportnumber", m="r").tolist()

3. Low-level API

   Description:

      The low-level API is provided by search_unit() function that
      assumes its arguments to be already the basic search units.
      Therefore it does not know anything about boolean queries, etc.
      The function returns an instance of the intbitset class.  Note that
      if you want to obtain the list of recIDs (as with the high-level
      API), you can invoke the ``tolist()'' method on a hitset.

   Signature:

      def search_unit(p, f=None, m=None):
          """Search for basic search unit defined by pattern 'p' and field
             'f' and matching type 'm'.  Return hitset of recIDs.

             All the parameters are assumed to have been previously washed.
             'p' is assumed to be already a ``basic search unit'' so that it
             is searched as such and is not broken up in any way.  Only
             wildcard and span queries are being detected inside 'p'.

             This function is suitable as a low-level API.
          """

   Examples:

      >>> # import the function:
      >>> from invenio.search_engine import search_unit
      >>> # search moi in any field:
      >>> search_unit(p="moi").tolist()
      >>> # this one will not match:
      >>> search_unit(p="muon or kaon").tolist()
      >>> # regexp search for a report number with possible trailing subjects:
      >>> search_unit(p="^CERN-PS-99-037(-|$)", f="reportnumber", m="r").tolist()

More entry points may be created, but I think this threesome kind of
access to the search engine should cover all your needs.