Tue Sep 18 15:26:21 PDT 2007





    A Filter to help Case Insensitive Searches on a WWW Bibliography

      Situation

      I operate a bibliography of software development. It grew from a HyperCard stack on a Macintosh into a formal document plus a search engine on the WWW. The search engine is a prototype that uses awk and other UNIX tools to scan the MATHS version of the data and select items that fit a pattern input by a user. See [ lookup ] to try the current version and [ lookup ] for the code.


      (March 2000): At this time the match is case sensitive but allows the full awk regular_expressions. So to select all items that contain "Object", "object", "OBJECT", etc. you need to use the pattern:

       		[Oo][Bb][Jj][Ee][Cc][Tt]
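
      Patterns of this kind are tedious to type and easy to get wrong. As a minimal sketch, the expansion can be generated from a plain word. The function name expand_case is hypothetical and not part of the search engine, and the sketch assumes a POSIX awk in which toupper and tolower work, which is not the case for every awk discussed on this page. It only handles plain words and simple operators: an existing bracket expression or a backslash escape in the input would be mangled.

```shell
# expand_case WORD: hypothetical helper, not part of the original search engine.
# Prints WORD with each ASCII letter replaced by a two-letter bracket expression.
expand_case() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /^[A-Za-z]$/)
                out = out "[" toupper(c) tolower(c) "]"
            else
                out = out c    # digits, spaces, and regex operators pass through
        }
        print out
    }'
}

expand_case object    # prints [Oo][Bb][Jj][Ee][Cc][Tt]
```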
      After three or four months of thought I decided that it would be nice to have optional case sensitivity, but that it is better to have non-optional case insensitivity than non-optional case sensitivity:
      Option                    Value
      Case sensitive search     Worst
      Case insensitive search   Better
      Case sensitive option     Best

      The program executes as a CGI on a Solaris UNIX box. In time the box may become either a Linux box or a BSD server, so the solution must survive porting. Older awk does not have a simple way to turn off case sensitivity in pattern matching. A quick test (testawk below) shows that I can't use IGNORECASE on a Solaris box.

      The standard technique is to change all letters to the same case before the match.
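
      As a small illustration of that technique, the sketch below folds both the text and the search word to upper case before comparing them. The helper names to_upper and contains_nocase are hypothetical, and the comparison is a plain substring test rather than an awk regular expression. The alphabets are spelled out in full, as a cautious choice, because range syntax in tr differs between UNIX variants.

```shell
# Hypothetical helpers illustrating case folding; names are not from the original code.
to_upper() {
    printf '%s\n' "$1" |
        tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
}

# contains_nocase TEXT WORD: succeeds if WORD occurs in TEXT, ignoring ASCII case.
contains_nocase() {
    cn_text=$(to_upper "$1")
    cn_word=$(to_upper "$2")
    case $cn_text in
        *"$cn_word"*) return 0 ;;
        *)            return 1 ;;
    esac
}

contains_nocase 'An OBJECT model' 'Object' && echo match    # prints match
```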

      However, this must not be done for every line in the file being searched! Case is significant in identifying items in the bibliography and in recognising the beginning and end of items. The layout of the file (in XBNF) is as follows:

    1. BIBLIOGRAPHY::=following
      Net
      As a general rule, directives should not need to be searched and their case is significant, but non-directive lines can be insensitive to case and should be searched for pattern matches.

      Design Choices

    2. TECHNIQUES::=following
      Net
      1. (mawk): Use awk in a more complex way -- if possible(testawk).
      2. (perl): Move from awk to perl.
      3. (prefilter): Write a special prefilter that changes some lines to uppercase.

      (End of Net TECHNIQUES)

      Detailed Design

      The design can be syntax directed. The syntax can also be greatly simplified:
    3. DATA::=following
      Net
      This would lead to an excellently simple awk solution:
       	!/^\./{ $0 = toupper( $0 ); };
      if it worked ( see testawk ) on a Sun Solaris.

      The UNIX sed command can do the prefilter with a 'y' command:

       sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
      This works when tested(testsed).
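
      One way the prefilter might be combined with a search is sketched below. The pattern is folded to upper case as well, so that it can match the folded lines. The function names nocase_filter and search_nocase are hypothetical. The real search engine uses awk and selects whole bibliography items, whereas this sketch only prints matching lines and assumes a grep that accepts -E. Folding the pattern with tr also changes any letters inside backslash escapes or character class names, so it is only safe for simple patterns.

```shell
# nocase_filter: the sed prefilter from the text, wrapped in a function.
nocase_filter() {
    sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

# search_nocase PATTERN: hypothetical wrapper; reads the bibliography on stdin
# and prints the non-directive lines that match PATTERN regardless of case.
search_nocase() {
    sn_pat=$(printf '%s\n' "$1" |
        tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    nocase_filter | grep -v '^\.' | grep -E -e "$sn_pat"
}

printf '.Key99\n sample title object\n.Set\n' | search_nocase 'obj[a-z]*'
# prints " SAMPLE TITLE OBJECT" (without the quotes)
```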

      Code

      If this had not worked (or if I need the speed) I can recode in C. The operations are
    4. OPERATIONS::=following
      Number  Description             When
      1       putchar(ch);            Once for each character in a directive line
      2       putchar(toupper(ch));   Once for each character in a normal line
      3       ch=getchar();           Read ahead one character at the start, then read again to replace each character.
      Where ch is an int. The program is (after 3 or 4 debugging runs) therefore [ nocase0.c ] which can be optimized to make it look like a traditional C hack.

      Further Developments

      TBA

    . . . . . . . . . ( end of section A Filter to help Case Insensitive Searches on a WWW Bibliography) <<Contents | End>>

    Other Data


      (testawk):
       	orion:/u/faculty/dick
       	$ awk 'BEGIN{IGNORECASE=1;}
       	    > /x/{print "x is in " $0}
       	    > !/x/{print "x is not in " $0}'
       	xxx
       	x is in xxx
       	x
       	x is in x
       	X
       	x is not in X
       	XXX
       	x is not in XXX

      Apparently even the 'toupper' function is dysfunctional:

       	$ awk '{print toupper($0);}'
       	test
       	test
       	xxx yyy zzz AAA
       	xxx yyy zzz AAA
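
      A test like this can be automated, so that a script finds out at run time whether a given awk can be trusted with toupper. The sketch below is an assumption-laden illustration: the function name awk_has_toupper is hypothetical, and the list of candidate commands (awk, nawk, /usr/xpg4/bin/awk) is a guess at what may be installed, not something the tests above establish. One plausible explanation for the transcript, not confirmed here, is that an old awk treats toupper as an empty variable concatenated with $0.

```shell
# awk_has_toupper [AWK]: succeeds if the given awk (default: awk) uppercases "x".
awk_has_toupper() {
    aht_out=$(echo x | "${1:-awk}" '{ print toupper($0) }' 2>/dev/null)
    [ "$aht_out" = "X" ]
}

# Pick the first candidate that passes the probe; the candidate list is an assumption.
for aht_cmd in awk nawk /usr/xpg4/bin/awk; do
    if awk_has_toupper "$aht_cmd"; then
        echo "toupper works in $aht_cmd"
        break
    fi
done
```

      If no candidate passes, the sed prefilter remains the fallback.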


      (testsed): with outputs tabbed in:

       	orion:/u/faculty/dick
       	$ sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
       	This is a test
       	.Set
       			THIS IS A TEST
       	.Key99
       			.Set
       	 sample title object
       			.Key99
       	 		SAMPLE TITLE OBJECT

    . . . . . . . . . ( end of section Other Data) <<Contents | End>>

    Glossary and Links

  1. awk::=A Pattern matching and report generating program developed by Aho+Weinberger+Kernighan, see awk0 & awk1 below.
  2. awk0::= See http://www.csci.csusb.edu/cs360/notes/awk.html
  3. awk1::= See http://www.csci.csusb.edu/cs360/notes/awk.doc.html

  4. MATHS::= See http://www.csci.csusb.edu/dick/maths/.

  5. regular_expressions::= See http://www.csci.csusb.edu/dick/samples/regular_expressions.html

  6. sed::= See http://www.csci.csusb.edu/dick/cs360/notes/34.sed.html.

  7. TBA::="To Be Announced".

  8. XBNF::=eXtreme(?) BNF, a version of BNF that I developed as a subset of MATHS, see [ XBNF in comp.lang.Glossary ] and [ XBNF in comp.text.Meta ] for more.

End