.Open A Filter to help Case Insensitive Searches on a WWW Bibliography
. Situation
I operate a bibliography of software development. It grew from a HyperCard stack on a Macintosh to a formal document plus search engine on the WWW. The search engine is a prototype and uses $awk and other UNIX tools to scan the $MATHS version of the data and select items that fit a pattern input by a user. See
.See http://www/cgi-bin/dick/lookup
to try the current version and
.See http://www/dick/tools/lookup
for the code.

(March 2000): At this time the match is case sensitive but allows the full $awk $regular_expressions. So to select all items that contain "Object", "object", "OBJECT", etc. you need to use the pattern:
.As_is [Oo][Bb][Jj][Ee][Cc][Tt]

After three or four months of thought I decided that it would be nice to have optional case sensitivity, but that non-optional case insensitivity is still better than non-optional case sensitivity:
.Table Option	Value
.Row Case sensitive search	Worst
.Row Case insensitive search	Better
.Row Case sensitive option	Best
.Close.Table

The program executes as a $CGI on a Solaris UNIX box. In time the box may become either a Linux box or a BSD server. `The solution must survive porting`.

Older $awk does not have a simple way to turn off case sensitivity in pattern matching. A quick test ($testawk below) shows that I can't use IGNORECASE on a Solaris box.

The standard technique is to change all letters to the same case before the match. However this must not be done for every line in the file being searched! Case is significant in identifying items in the bibliography and in recognising the beginning and end of items. The layout of the file (in $XBNF) is as follows:
BIBLIOGRAPHY::=following
.Net
bibliography::= #$item.
item::=$identifier_line $set_directive_line #$descriptive_line $close_set_directive_line.
identifier_line::=$dot $space #$char $endline.
set_directive_line::=$dot"Set" $endline.
descriptive_line::=$directive_line | $normal_line.
directive_line::=$dot $directive $endline.
normal_line::= $line ~ $directive_line.
line::=#($char~$endline) $endline.
close_set_directive_line::=$dot"Close.Set" $endline.
char::=`any ASCII character`.
endline::=`character indicating end of line`.
dot::=".".
.Close.Net BIBLIOGRAPHY

As a general rule, directive lines do not need to be searched and their case is significant, but non-directive lines can be insensitive to case and should be searched for pattern matches.

. Design Choices
TECHNIQUES::=following
.Net
(mawk): Use $awk in a more complex way -- if possible ($testawk).
(perl): Move from $awk to perl.
(prefilter): Write a special prefilter that changes some lines to uppercase.
.Close.Net TECHNIQUES

. Detailed Design
The design can be syntax directed. The syntax can also be greatly simplified:
DATA::=following
.Net
input::= #($directive_line | $normal_line ).
output::= #($copied_directive_line | $upper_case_normal_line ).
.Close.Net DATA

This would lead to an excellently simple $awk solution:
.As_is !/^\./{ $0 = toupper( $0 ); } { print }
if it worked (see $testawk) on a Sun Solaris.

The UNIX $sed command can do the $prefilter with a 'y' command:
.As_is sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
This works when tested ($testsed).

. Code
If this had not worked (or if I need the speed) I can recode in C. The operations are
OPERATIONS::=following
.Table Number	Description	When
.Row 1	putchar(ch);	Once for each character in a directive line
.Row 2	putchar(toupper(ch));	Once for each character in a normal line
.Row 3	ch=getchar();	Read ahead one character at the start, and again to replace each character once it has been output.
.Close.Table
Where `ch` is an int.

The program is (after 3 or 4 debugging runs) therefore
.See http://www/dick/samples/nocase0.c
which can be optimized to make it look like a traditional C hack.

. Further Developments
$TBA

.Close A Filter to help Case Insensitive Searches on a WWW Bibliography

.Open Other Data
(testawk):
.As_is orion:/u/faculty/dick
.As_is $ awk 'BEGIN{IGNORECASE=1;}
.As_is > /x/{print "x is in " $0}
.As_is > !/x/{print "x is not in " $0}'
.As_is xxx
.As_is x is in xxx
.As_is x
.As_is x is in x
.As_is X
.As_is x is not in X
.As_is XXX
.As_is x is not in XXX

Apparently even the 'toupper' function is dysfunctional:
.As_is $ awk '{print toupper($0);}'
.As_is test
.As_is test
.As_is xxx yyy zzz AAA
.As_is xxx yyy zzz AAA

(testsed): with outputs tabbed in:
.As_is orion:/u/faculty/dick
.As_is $ sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
.As_is This is a test
.As_is .Set
.As_is THIS IS A TEST
.As_is .Key99
.As_is .Set
.As_is sample title object
.As_is .Key99
.As_is SAMPLE TITLE OBJECT
.Close Other Data

. Glossary and Links
awk::=A pattern matching and report generating program developed by Aho+Weinberger+Kernighan, see $awk0 & $awk1 below.
awk0::=http://www.csci.csusb.edu/cs360/notes/awk.html
awk1::=http://www.csci.csusb.edu/cs360/notes/awk.doc.html
MATHS::=http://www.csci.csusb.edu/dick/maths/.
regular_expressions::=http://www.csci.csusb.edu/dick/samples/regular_expressions.html
sed::=http://www.csci.csusb.edu/dick/cs360/notes/34.sed.html.
TBA::="To Be Announced".
XBNF::=eXtreme(?) BNF, a version of BNF that I developed as a subset of $MATHS, see
.See http://www.csci.csusb.edu/dick/samples/comp.lang.Glossary.html#XBNF
and
.See http://www.csci.csusb.edu/dick/samples/comp.text.Meta.html#XBNF
for more.
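
A minimal sketch of how the $sed prefilter from the Detailed Design might be wrapped for use in a lookup script. The function names `prefilter`, `upcase` and `sample`, the made-up sample item, and the step that upper-cases the user's pattern are all assumptions for illustration and are not taken from the existing lookup code. Upper-casing a pattern this way is only safe for plain words, because it would also alter escapes such as `\n` inside a regular expression.

```shell
#!/bin/sh
# Hypothetical helper: map a-z to A-Z on every line (used for the pattern).
upcase() {
    sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

# Hypothetical wrapper around the sed command given in the text:
# lines starting with "." (directives) are copied unchanged,
# all other lines are changed to upper case.
prefilter() {
    sed '/^\./!y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

# Made-up sample data in the layout of the bibliography.
sample() {
    printf '%s\n' '.Key99' '.Set' 'Sample title Object' '.Close.Set'
}

# Upper-case the user's pattern (assumed here to be a plain word) so
# that it can match the upper-cased normal lines.
pattern=$(printf '%s\n' 'object' | upcase)

sample | prefilter                    # directives intact, title in upper case
sample | prefilter | grep "$pattern"  # selects the title line whatever its case
```

Because directive lines keep their case, whatever recognises the beginning and end of items can run unchanged on the filtered data; only the normal lines and the pattern are folded to one case.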