This Course has been replaced by CSE557


    Data Element Design

      Story -- Naming Names Handout

      To understand how arbitrary coding can be, and how confusing the results are, I hand out some email from the CSUSB electronic bulletin board discussing why this campus's zip code (92407) is listed on the Internet as "Arrowhead Farms", and why some sites on the WWW list the phone numbers (909-537-nnnn) as being in Riverside.

      There is also a study of some well known codes. Brian Hayes calculates how many unique codes can exist, and how many are already in use. This was published in the "American Scientist" magazine, Vol 93, Jan-Feb 2005. It covers the following coding schemes:

      1. Internet account IDs
      2. NY Stock Exchange Ticker Symbols -- up to 3 Uppercase letters -- 26+26*26+26*26*26 possible

      3. Universal Product Codes -- bar codes -- (UPC) -- 12 digits until January 2005, and then 13 digits.
      4. Global Traded Item Numbers (GTIN)
      5. European Article Numbers (EAN)

      6. Biological Species -- Two Latin-like words
      7. The Chemical Elements -- One uppercase letter + an optional lowercase letter. (I had to program these while working for ICI in the 1960s)
      8. Organic Chemicals -- Not elementary: they have complex syntax rules and semantics.
      9. Internet Assigned Numbers Authority (IANA) names for countries -- two ASCII letters
      10. Radio Call Signs -- 3..4 Capital letters -- KVCR, KFROG, ...
      11. Telephone numbers -- 10 digits (in the USA).
      12. Social Security Numbers -- 9 digits in the USA
      13. Airport codes ( IATA ) -- Three letters.
      14. Names of Horses -- 2..18 characters ( letters plus space, period, and apostrophe).
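
      The size of each code space in the list above can be checked with a few lines of Python. This is only a sketch: the helper name capacity is illustrative (it is not from Hayes's article), horse names are assumed to ignore letter case, and the results are upper bounds because real schemes reserve or forbid some codes.

```python
def capacity(alphabet_size: int, min_len: int, max_len: int) -> int:
    """Count the strings over an alphabet whose length is min_len..max_len."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))

nyse_tickers = capacity(26, 1, 3)    # 26 + 26*26 + 26*26*26
country_codes = capacity(26, 2, 2)   # two ASCII letters
airport_codes = capacity(26, 3, 3)   # three letters
ssn_numbers = capacity(10, 9, 9)     # nine decimal digits
horse_names = capacity(29, 2, 18)    # 26 letters + space, period, apostrophe

print(nyse_tickers, country_codes, airport_codes, ssn_numbers)
print(horse_names)
```

      Comparing these capacities with the number of codes already in use is the kind of calculation the article carries out.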

      Story -- Changing Student IDs

      Once upon a time, it took this campus a year to change all its records from using the Social Security Number (SSN) to a campus assigned number (SID). Every record in a dozen different data bases had to be changed. Why did CSUSB spend so much time and money changing one element? -- Because (1) it is illegal to use the SSN for non-SSN type purposes. (2) We also wanted to avoid identity theft. (3) We are required to anonymize grades.

      Story -- how to code diseases -- the ICD

      [ SB10001424053111904103404576560742746021106.html?mod=WSJ_hp_MIDDLENexttoWhatsNewsFifth ]

      Data Options

        In computing we need to look for the best technology to represent our data. Each data flow in a DFD has many ways to be implemented. It could be a phone call, a memo, a signal between devices, or the sending of a record between parts of the system. Each flow from an external entity into your system needs an input device. Each store needs a storage device. Each flow between processes can be internal to a program or use a network. And each flow out to an external entity will use an output device. Choosing a good device and technology can make a system fly rather than crash. So you need to know about the options. The hardware options were covered in [ a3.html ] (Architecture) and [ c1.html ] (Selecting a project) earlier. When the data is digital we still need to choose how to encode different things. Encoding data is a return to Physical Design from Logical Design. On the other hand drawing up a logical model of data is all about ignoring the actual encoding of the data.

        Reminder -- use UML Classes to model logical data groups

          A logical data group is a collection of smaller data items that always appear together. Logical data groups tend to appear as records in files in traditional data processing systems. They are implemented as rows or tuples in a relational data base. Printed documents, input forms, and screen layouts are all associated with logical data groups.
        1. Draw a class box with two compartments (suppress the third compartment if using a tool).
        2. Put the group name in the center of the top compartment -- the class name.
        3. List the attributes inside the second compartment -- these are the elements.
        4. Two Examples of Logical Data Groups:

          [Example of two groups described in UML]

      Story -- what do you mean by data element

      I was once (late 1970s) analyzing a Social Work Department in a part of London. One of the workers, Edna, had a manual system for retrieving information on clients. She used 3×5 cards with data written on them and a large automated storage device for cards. I observed the following typical scenario:
      1. Someone makes a phone call and asks for information on a client.
      2. Edna asks for a number to call back.
      3. Edna checks and finds the number is an extension and so secure.
      4. Edna turns to the machine.
      5. Edna dials for the right drawer, and the machine rotates it to the front and opens it.
      6. Edna finds the right card (a quick binary search).
      7. Edna phones back the information.

      I noted that it took 25 seconds to do this.

      I noted the subtle but effective security. I also noted that some cards were red or green. I asked about this and it turned out that the red cards indicated threats to social workers... So in this case one data item was the color of the media.

      By the way... I recommended not computerizing this process.

      Types of code

        One piece of data has many encodings

        Be aware that data have many representations. Data is presented, input, stored in a database, and processed in many different formats. Here are some of them in one example:
        1. The way the user inputs a value. Example: by selecting an input from a menu of 12 items.
        2. The way the value is encoded and sent through the internet. Example: URL Encoded.
        3. The way the software stores it inside the program. Example: A short int in range 0..11.
        4. The way the values are stored in the data base. Example: a SMALLINT.
        5. The way that a value is output to the user. Example: as a three letter abbreviation for a month ("Jan", "Feb", ..., "Dec").

        Clever choices can improve system qualities: security, reliability, time to program, ...
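
        The five representations in the month example can be traced in a short Python sketch. The form field names "month" and "note" are hypothetical, and the menu is assumed to send the values 1..12.

```python
from urllib.parse import urlencode, parse_qs

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# 1. The user picks the third entry of a 12-item menu (assumed to send "3").
form = {"month": "3", "note": "due & payable"}

# 2. The browser URL-encodes the form for its trip through the internet.
wire = urlencode(form)

# 3. The program decodes it and keeps a small integer in the range 0..11.
received = parse_qs(wire)
month_index = int(received["month"][0]) - 1
if not 0 <= month_index <= 11:
    raise ValueError("month out of range")

# 4. The same small integer is what would go into a SMALLINT column.
stored = month_index

# 5. On output the value becomes a three letter abbreviation.
shown = MONTHS[stored]
print(wire, month_index, shown)
```

        The value never changes, only its encoding: "3" in the menu, "month=3" on the wire, 2 in the program and the data base, and "Mar" on the screen.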

        Special Encodings

          Encoding Data Elements

            There are a number of well known ways to encode data elements.
            Common Numerical Codes
              Principle -- Always specify the Units for Numbers
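              Forgetting this is how NASA managed to crash a probe being sent to Mars: the Mars Climate Orbiter was lost in 1999 because one piece of software supplied values in pound-force seconds while another expected newton-seconds.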

              Roman Notation
              Humans tend to use a notation for numbers based on the number of fingers on a hand. In fact the word digital comes from the Latin word for finger. In Roman times, the 'V' stood for a hand (5 = one thumb + four fingers) and the 'X' was two hands (one up + one down) or 10 fingers.

              The Roman numerals have complex syntax and semantics [ Mini-Project2.html ] (BNF) and are not good for doing arithmetic ( can you divide XXX by VII ?). Avoid them on input, use them judiciously on output, and do not store numbers in this form!
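
              One way to follow this advice is to convert at the edges and do the arithmetic on ordinary integers. The sketch below assumes the usual subtractive notation for 1..3999; the function names to_roman and from_roman are illustrative only.

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n: int) -> str:
    """Format an integer in 1..3999 as a Roman numeral (output only)."""
    if not 0 < n < 4000:
        raise ValueError("out of range for Roman notation")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

def from_roman(text: str) -> int:
    """Parse a canonical Roman numeral; reject anything else."""
    total, pos = 0, 0
    for value, symbol in ROMAN:
        while text.startswith(symbol, pos):
            total += value
            pos += len(symbol)
    if pos != len(text) or to_roman(total) != text:
        raise ValueError("not a canonical Roman numeral: " + repr(text))
    return total

# Dividing XXX by VII: convert, divide as integers, convert back.
quotient = from_roman("XXX") // from_roman("VII")
print(to_roman(quotient))
```

              Note that the division itself never touches the Roman form, which is the point of not storing numbers this way.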

              Decimal -- Arabic Notation
              The Western World learned the decimal system from the Arabs who got it from India. We encode the digits as 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Then we can represent any whole number by stringing them together:
            1. digit::=0..9.
            2. number::= one or more digits.
            3. 987 = 100*9 + 10*8 + 1*7.

              To be more formal we could define:

            4. For number n, digit d, value( n d ) = 10*value(n) +d.
            5. value("987") = 10*value("98") + 7 = 10*(10*value("9")+8) + 7 = 10*(10*9+8) + 7 = 987.
              Computers use binary (you can find examples of this in Ancient China and Leibniz). This has two digits: 0, 1. So:
            6. 110011 = 2**5 + 2**4 + 2**1 + 2**0 = 32+16+2+1 = 51.

              Having only two digits fits well with electrical and electronic circuits which tend to be either "on" or "off". In 1948 Shannon (crediting John Tukey for the word) defined

            7. bit::="Binary digIT", is either 0 or 1.
            8. binary_number::= one or more bits.
            9. For binary_number n, bit b, value( n b ) = 2*value(n) +b.

              This has resulted in many computer people being able to recite the powers of two up to the highest address on their favorite machine.
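
              The two recursive definitions of value above differ only in the base, so one loop covers both. A minimal sketch in Python; the function name value is taken from the definitions, the base parameter is my addition.

```python
def value(digits: str, base: int = 10) -> int:
    """Apply value(n d) = base*value(n) + d from left to right."""
    result = 0
    for ch in digits:
        d = int(ch, base)          # raises ValueError for a digit not in this base
        result = base * result + d
    return result

print(value("987"))                # decimal
print(value("110011", 2))          # binary
print([2 ** k for k in range(11)]) # the powers of two worth reciting
```

              The built-in int("110011", 2) does the same job; the loop is shown only to mirror the definition.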

              Octal and Hex
              However, calling out or typing 20 binary digits is inefficient and error prone. Using decimal notation makes it hard to know what pattern of bits is needed. So older computer people tend to use base 8 (octal) and newer ones base 16 (hexadecimal) notation:
            10. nibble::= 4 bits.

              Nibbles are written and spoken using the hexadecimal digits, 0 (=0000), 1 (=0001), 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F (=1111).

            11. byte::=2 nibbles.

              So, for example, "2A" in hex means "00101010" in binary.
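
              Because each hexadecimal digit stands for exactly one nibble, the conversion can be done digit by digit. A small Python sketch (the function names are illustrative):

```python
def hex_to_bits(hex_text: str) -> str:
    """Expand each hexadecimal digit into its 4-bit nibble."""
    return " ".join(format(int(ch, 16), "04b") for ch in hex_text)

def bits_to_hex(bits: str) -> str:
    """Collapse a string of bits (a whole number of nibbles) into hex digits."""
    bits = bits.replace(" ", "")
    if len(bits) % 4 != 0:
        raise ValueError("need a whole number of nibbles")
    return format(int(bits, 2), "0{}X".format(len(bits) // 4))

print(hex_to_bits("2A"))
print(bits_to_hex("00101010"))
print(format(0x2A, "o"))   # the same byte in octal
```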

              Binary Coded Decimal
              In commercial systems it was common to find numbers encoded using
            12. BCD::="binary coded decimal".

              Here a number like 987 was encoded by three decimal digits each represented as a nibble in binary:

            13. 1001 1000 0111

              This wastes some bits but is very convenient for important things like dollars and cents.
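
              The 987 example can be reproduced in Python. This sketch covers only unsigned numbers with one digit per nibble; the function names are illustrative.

```python
def to_bcd(n: int) -> str:
    """Encode a non-negative integer with one 4-bit nibble per decimal digit."""
    if n < 0:
        raise ValueError("this sketch handles unsigned numbers only")
    return " ".join(format(int(d), "04b") for d in str(n))

def from_bcd(bits: str) -> int:
    """Decode space-separated nibbles; nibbles above 1001 are not decimal digits."""
    digits = []
    for nibble in bits.split():
        d = int(nibble, 2)
        if len(nibble) != 4 or d > 9:
            raise ValueError("not a BCD digit: " + nibble)
        digits.append(str(d))
    return int("".join(digits))

print(to_bcd(987))          # 12 bits in BCD
print(format(987, "b"))     # 10 bits in pure binary
```

              The two extra bits are the waste mentioned above; what you get in return is that every decimal digit can be read off or rounded exactly.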

              Signed Integers
              In scientific computations integers are encoded using binary typically with 8, 16, 32, ... bits. One of these bits indicates whether the number is negative or positive. In SQL these can be INTEGER or SMALLINT items.
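
              The text does not say which signed encoding is meant. The sketch below assumes two's complement, which is what most current hardware uses, and assumes a 16-bit SMALLINT; the function name is illustrative.

```python
def to_twos_complement(n: int, bits: int) -> str:
    """Show n as a two's complement bit pattern of the given width."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= n <= highest:
        raise OverflowError("{} does not fit in {} bits".format(n, bits))
    return format(n & ((1 << bits) - 1), "0{}b".format(bits))

print(to_twos_complement(5, 8))
print(to_twos_complement(-5, 8))       # leading 1 marks a negative number
print(to_twos_complement(32767, 16))   # largest value of a 16-bit integer
```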

              Real Numbers
              Real numbers (measurements) are often encoded using "floating point" where a number has two parts called the mantissa and the exponent, both encoded in binary. The value is then
            14. mantissa * 2**exponent

              Floating point works well when we need a wide range of values and can put up with larger errors on the larger numbers.
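
              Python's float is a binary floating point number on essentially every platform (IEEE 754 double precision is assumed here), so the mantissa, the exponent, and the growing error can be seen directly:

```python
import math

# math.frexp splits a float into mantissa and exponent: x == mantissa * 2**exponent
mantissa, exponent = math.frexp(9.875)
print(mantissa, exponent)

# Near 1.0 the gap between neighboring floats is tiny ...
small_gap = (1.0 + 2.0 ** -52) - 1.0

# ... but near 1e16 the gap is bigger than 1, so adding 1.0 is lost entirely.
lost = (1e16 + 1.0) - 1e16

# Decimal fractions such as 0.1 have no exact binary representation.
print(small_gap, lost, 0.1 + 0.2 == 0.3)
```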

              Again in commerce and finance we need precision and speed rather than range. So a Fixed Point notation was preferred. Here you use BCD and the machine scales the number by dividing by a fixed power of ten. This is available and common in COBOL. In SQL we have DECIMAL(p,q) (p digits, q after the decimal point).

              Scaling and Fixed Point Notation
              Suppose you are working with carpenters. They measure things in eighths of an inch. And 8 is a power of 2. So we can use a binary number with the last three bits representing the fraction in eighths. For example
            15. 1001111 = 9 and 7/8. This is a fixed point binary notation. You can think of it as a scaling of the integers:
            16. i/8 = whole number
            17. i%8 = fraction

              But a canny programmer would use these expressions

            18. i>>3
            19. i&7 which will be a lot faster!
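
              Both pairs of expressions give the same answers, as this Python sketch shows for the carpenter's 1001111. The function names are illustrative. How much faster the shift and mask are depends on the language and compiler, and many compilers make the substitution themselves.

```python
def split_eighths(i: int):
    """Split a length stored in eighths of an inch into (whole inches, eighths)."""
    return i >> 3, i & 7

def format_eighths(i: int) -> str:
    whole, eighths = split_eighths(i)
    return "{} and {}/8".format(whole, eighths)

length = 0b1001111                 # 79 eighths of an inch
print(split_eighths(length))       # shift and mask
print((length // 8, length % 8))   # divide and remainder
print(format_eighths(length))
```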

              Again -- Money is naturally expressed as fixed point decimal with two decimal places. So

            20. 1234 in DECIMAL(4,2) (COBOL "PIC 99V99") will mean
            21. $12 and 34 cents.

              If your language supports this -- use it. If not, store money as long integers meaning the number of cents. Then

            22. dollars = money/100
            23. cents = money%100 (in C/C++/Java/etc).
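
              Python offers both routes: integers holding cents, and a decimal type in the standard library. A minimal sketch; the function name format_money is illustrative.

```python
from decimal import Decimal

money = 1234                          # an amount stored as a whole number of cents
dollars, cents = divmod(money, 100)   # the same as money // 100 and money % 100

def format_money(total_cents: int) -> str:
    """Render integer cents as dollars and cents."""
    sign = "-" if total_cents < 0 else ""
    d, c = divmod(abs(total_cents), 100)
    return "{}${}.{:02d}".format(sign, d, c)

print(dollars, cents, format_money(money))

# A decimal type keeps cents exact where binary floating point does not.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))
print(0.10 + 0.20 == 0.30)
```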

            Standard Character Codes
              At one time each manufacturer had its own binary code for characters. This continued up to the 1980s.

              In the 1960s the American standards people ( ANSI ) proposed what has become the standard coding for characters -- ASCII. It is a 7-bit code, normally stored one character to an 8-bit byte.

            1. ASCII::="American Standard Code for Information Interchange".

              ASCII covers the characters needed for American use, and has become the de facto standard on the Internet, and whenever data needs to be shared. The International Standards Organization treats ASCII as a specialized code for use in America. In the UK, the American "#" becomes the symbol for the British pound. Each European country has its own special symbols.

              IBM tried to create its own standard -- an Extended Binary Coded Decimal code named EBCDIC. This will disappear with the last mainframe.

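              Recently, a new standard -- Unicode -- has been created that covers just about every character in every alphabet in the world. It began as a 16-bit code but has since grown beyond 16 bits, and it is usually stored using an encoding such as UTF-8 or UTF-16. ASCII and the ISO codes appear within it.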
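
              The differences between these character codes are easy to see in Python, which ships with codecs for all three. The choice of UTF-8 and of the EBCDIC code page cp037 below is mine, for illustration only.

```python
text = "A£Σ"

# Unicode gives every character a number (its code point).
code_points = [ord(ch) for ch in text]

# UTF-8 stores ASCII characters in one byte and others in two or more.
utf8 = text.encode("utf-8")

# ASCII itself has no pound sign, so this encoding fails.
try:
    "£".encode("ascii")
    ascii_has_pound = True
except UnicodeEncodeError:
    ascii_has_pound = False

# The same letter has a different byte value in ASCII and in EBCDIC.
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")

print(code_points, utf8, ascii_has_pound, ascii_a, ebcdic_a)
```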

              The Web uses HTML and HTML has introduced a number of special "entities" for showing non-ASCII characters like Σ and α. These are given numbers and names and are encoded in HTML like this:

            2. "&#" digit digit digit ";"


            3. "&" mnemonic_identifier ";"

              For example the symbols "<" and ">" are encoded as "&lt;" and "&gt;". The double quote symbol is encoded as "&quot;".

              This link [ mathchart.html ] shows how to encode Unicode mathematical symbols in HTML and [ arrows.html ] how to do arrows. I have a partial encoding for Greek letters and other ΤΕΧ symbols in [ ../samples/tex2html.html ] (ΤΕΧ is a mathematical typesetting system developed by Donald Knuth).
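
              Python's standard html module does this encoding and decoding, which is safer than doing it by hand. A short sketch with made-up sample strings:

```python
import html

# Encoding: the characters with special meaning in HTML become entities.
raw = 'a < b and "c" > d'
encoded = html.escape(raw)

# Decoding: named and numeric entities both turn back into characters.
decoded = html.unescape("&Sigma; &#931; &alpha; &lt;")

print(encoded)
print(decoded)
```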

              There is a link to more on the HTML below.

            A key is an item of data that is used to uniquely define a single element in a file, database, or table. These are typically a fixed number of characters.
            Sequence numbers
            Numbers assigned in a specific order -- typically when created. Example: Number of a student on roster.

            Block sequences
            Blocks of numbers given to different parts of the organization to allocate (in sequence) to records of data in a file or input stream.

            Alphabetic Abbreviations and Mnemonics
            A string of letters is chosen to identify an entity or a group of entities of a given type. A few letters stand in for a word or phrase. Examples: States (CA, AL, ...), IANA and DNS countries (uk, tv, ru, ...), IATA airports (LAX, ONT, LHR, MSP, ...), abbreviations for a department teaching a course (CSE, PHYS, MATH, ENG), and abbreviations for buildings on campus.

            Arbitrary Systematic Codings
            Examples: the Library of Congress subject classification, the Dewey Decimal system for books, the Association for Computing Machinery's CCS system for computer science, and Linnaeus's technique for species.

            Significant subgroups
              In many cases a data element is actually made up of smaller elements each with their own coding.

              Digit Groups
              Different groups of digits/characters in the data are themselves coded data. For example in a 9-digit ZIP code the first digit determines a broad geographical area, the next two a regional sorting center, and the fourth and fifth digits a post office or delivery area. The last 4 digits identify a delivery point. Examples: ZIP codes (92407-1133), phone numbers ((909)-537-5257), ..., SSN, URLs.

              Derived Codes
              Mixes different coded data into one element. Examples: my UK driving license number, CSUSB Library call numbers, subscriber codes for magazines, rooms on campus.

              Ciphers and Encrypted data
              Numbers are disguised for security or mnemonic purposes. Example: "Spoof" at Imperial Chemical Industries was a number added to the paint sales. A lot of good work has been done since then -- look up DES, PGP, etc. on Wikipedia if you need more detail. Passwords should be encrypted as soon as they are entered and never stored without salting and hashing!
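
              As an illustration of salting and hashing, here is a sketch using only the Python standard library. The choice of PBKDF2 with SHA-256, the 16-byte salt, and the iteration count are assumptions of mine, not a recommendation from these notes; check current guidance before using anything like this in a real system.

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 200_000):
    """Return (salt, iterations, digest); store these, never the password."""
    salt = os.urandom(16)   # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def verify_password(password: str, salt: bytes, iterations: int, digest: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, digest)

salt, rounds, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, rounds, stored))
print(verify_password("wrong horse", salt, rounds, stored))
```

              Because the salt is random, two users with the same password get different stored values.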

              Actions
              Codes that represent actions. Examples: A=Add, D=Delete, ...; the 50+ actions that the 'vi' editor has built in to it; mnemonic codes in assembler; transaction codes -- for example with a banking application we might find deposits (coded D) and withdrawals (coded W).

              Self-checking Elements
              Uses an added digit or character calculated from the rest. Example: 9s remainder and 11s remainder check digits are added to a decimal number. For a detailed analysis see [ ] (SKORKS, some Ruby included).
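
              The notes do not say exactly how the two check digits are computed, so the sketch below makes assumptions: the 9s remainder is taken as the digit sum modulo 9, and the 11s remainder uses the weights 10, 9, ..., 2 familiar from ISBN-10, with "X" standing for a check value of ten. The function names are illustrative.

```python
def nines_check_digit(digits: str) -> int:
    """Digit sum modulo 9, which equals the number itself modulo 9."""
    return sum(int(d) for d in digits) % 9

def mod11_check_char(digits: str) -> str:
    """Weighted 11s remainder check character (ISBN-10 style weights)."""
    weights = range(len(digits) + 1, 1, -1)
    total = sum(w * int(d) for w, d in zip(weights, digits))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(nines_check_digit("987"), nines_check_digit("897"))
print(mod11_check_char("030640615"), mod11_check_char("030640651"))
```

              Swapping two digits leaves the 9s remainder unchanged, so that check misses transposition errors, while the weighted 11s remainder catches them. That difference is one reason the weighted form is popular for codes that people type.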

          Encoding Compound data

            Elements usually occur together in logical grouping. The problem to be solved is to design an efficient way to combine them so that the data can be extracted in only one way. Extracting data elements from compound data is called parsing. When there is more than one way of parsing a compound data group then the coding is said to be ambiguous. Ambiguous Encodings are bad encodings.

          1. Parse::CSE372="to divide a string of characters into possibly meaningful parts and name them".

            There are five ways of encoding compound data:

            1. Fixed Length fields: Phone number. SSN.
            2. Delimiters: Dates, times, tab-delimited spreadsheet.
            3. Added count fields give the length of the next field.
            4. Complex Syntax: C++ programs.
            5. Marked Up Text: HTML, XML, ...

            Record Structures
            In a record structure a number of fields are defined, each with a given length (bits or bytes). Each field has a name and its own encoding. A very popular technique in RAM and in Data Bases. Record structures rely on each element being a fixed length. And this in turn gave us one of the most successful bugs (so far) in the history of computing: the Y2K Bug. Just about every system was at risk of breaking down when a 2 digit field could no longer handle years greater than 1999. I lost my first laptop to the Y2K bug. But just about all of these problems were found and fixed before the date arrived. This is business as usual -- computer people had to work hard and late into the night to make computers do what they were supposed to do.
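
            A fixed-length record is parsed by position alone. The sketch below uses the campus phone number from earlier in these notes; the field names and the function name parse_fixed are hypothetical.

```python
PHONE_FIELDS = [("area", 3), ("exchange", 3), ("line", 4)]

def parse_fixed(record: str, fields):
    """Cut a record into named fields using nothing but their widths."""
    if len(record) != sum(width for _, width in fields):
        raise ValueError("record has the wrong length")
    parsed, pos = {}, 0
    for name, width in fields:
        parsed[name] = record[pos:pos + width]
        pos += width
    return parsed

print(parse_fixed("9095375257", PHONE_FIELDS))

# The Y2K problem in miniature: with a 2 digit year field,
# the year 2000 ("00") sorts before 1999 ("99").
print(sorted(["99", "00"]))
```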
            Delimiters
            A common kind of data is the string. It can have any length. When the data does not have a fixed length the end can be marked by a special sentinel character -- a delimiter. In programming languages, for example, we indicate the start and end of a string by using a quotation mark. Other well known delimiters are used in dates ("/") and times (":").
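
            Delimited data is parsed by splitting at the delimiter. A short Python sketch with made-up sample data (a US style month/day/year date is assumed):

```python
import csv
import io

# A date delimited by "/".
month, day, year = (int(part) for part in "1/25/2005".split("/"))

# A time delimited by ":".
hours, minutes = (int(part) for part in "14:05".split(":"))

# One line of a tab delimited spreadsheet.
line = "CSE\t372\tData Element Design\n"
row = next(csv.reader(io.StringIO(line), delimiter="\t"))

print(month, day, year, hours, minutes, row)
```

            The weakness of delimiters is that the delimiter character itself must never appear inside a field unless there is a quoting or escaping rule, which is what the csv module provides.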
            Count Fields
            Another technique is to add a field that indicates the length of the following data element -- an encoded counter or integer. Humans do not find this as easy to read as delimited data.
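
            A sketch of count fields in Python follows. The 3-digit decimal count is an arbitrary choice of mine for readability; binary formats usually use one or more bytes instead.

```python
def encode_counted(fields) -> str:
    """Prefix every field with its length as a 3-digit decimal count."""
    parts = []
    for field in fields:
        if len(field) > 999:
            raise ValueError("field too long for a 3-digit count")
        parts.append("{:03d}{}".format(len(field), field))
    return "".join(parts)

def decode_counted(data: str):
    """Read a count, then exactly that many characters, until the data ends."""
    fields, pos = [], 0
    while pos < len(data):
        count = int(data[pos:pos + 3])
        pos += 3
        if pos + count > len(data):
            raise ValueError("data ends before the counted field does")
        fields.append(data[pos:pos + count])
        pos += count
    return fields

packed = encode_counted(["Edna", "a/b:c"])
print(packed)
print(decode_counted(packed))
```

            Unlike delimited data, a counted field may contain any character at all, including "/" and ":".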
            Simple Syntax
            A very simple way to place several variable length fields into a single string is to make sure that the first character of each field cannot be a character in the previous field. A simple example would be to alternate fields made of digits with fields made of letters.
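
            With alternating letter and digit fields, a regular expression can find the boundaries. A minimal sketch (the function name is illustrative, and only ASCII letters and digits are assumed):

```python
import re

def split_alternating(code: str):
    """Split a code made of alternating runs of letters and runs of digits."""
    if not re.fullmatch(r"[A-Za-z0-9]+", code):
        raise ValueError("only letters and digits are allowed")
    return re.findall(r"[A-Za-z]+|[0-9]+", code)

print(split_alternating("CSE201"))
print(split_alternating("A12BC3"))
```

            There is only one way to split such a string, so the encoding is not ambiguous.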
            Markup Languages
              In recent years there have been a lot of markup languages defined for different purposes. The idea is to have explicit identification of the parts of the data: a marked-up document is a piece of text with added "tags" that indicate the meaning of the parts.
              In a [ Record Structure ] (above) the "tags" are not needed because their sequence is known and the lengths are fixed (or at least predictable). With tags we get an encoding that is guaranteed not to be ambiguous, is easy to read (kind of), but is somewhat inefficient.
              RTF -- Rich Text Format
              Here the tags indicate the appearance of the text. Probably developed with early word processor software. Still a format used with MS Word.

              SGML -- The Standard Generalized Markup Language
              IBM defined SGML so that you could create and define tags for any purpose. It is an amazingly difficult encoding to use. You can find the details on the web if you absolutely have to.

              HTML -- The HyperText Markup Language
              HTML is a specific application of SGML defined to describe and link pages on the World-Wide Web. It has gone through half-a-dozen versions. A new one stays within the XML (next) conventions.

              XML -- the eXtensible Markup Language
              HTML describes the appearance of pages. XML tries to describe the meaning of data not its appearance. It has a very simple basis using
               		<start tags> ... </end tags>
              to delimit data. Tags can also have attributes:
               		<certificate type="participation">Unix Training</certificate>.

              XML also allows some tags to be unpaired and these are shown like this:

               		<endless tag attributes... />
              XML documents can be parsed fairly easily.

              Each application that uses XML must have a published DTD -- Document Type Definition -- that defines the structure of the data -- what tags can appear inside others. Defining a DTD takes a significant amount of work. But once defined you can use tools to check validity, ...
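
              To show how easily XML can be parsed, here is the certificate example read with Python's standard xml.etree.ElementTree module. Note that this parser only checks that the document is well formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

document = '<certificate type="participation">Unix Training</certificate>'
element = ET.fromstring(document)

# The tag name, the attributes, and the enclosed data are all labeled parts.
print(element.tag, element.attrib["type"], element.text)

# A start tag without its end tag is rejected.
try:
    ET.fromstring("<certificate>Unix Training")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```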

            . . . . . . . . . ( end of section Markup Languages)

            Complex Syntax
            Complex syntax gets us into natural and artificial languages. It is rare that we need to express natural data groupings using complex syntax. When we do we can use an extension of the syntactic meta-languages like Backus-Naur Form (BNF).

            In computer science most of our knowledge about linguistic design has been put into designing programming languages. Programming languages are the most complicated schemes for encoding a domain in existence. There are hundreds of them. For more take a CSE Programming Language class like our CSE320 [ ../cs320/ ] (Advert).

          . . . . . . . . . ( end of section Encoding Compound data)

          Experience -- coding data in the ICI Infra-Red Spectrum Analysis Program.

        . . . . . . . . . ( end of section Special Encodings)

        Guidelines for encoding data

        1. Make your codes -
          1. Concise
          2. Expandable
          3. Stable
          4. Unique
          5. Sortable
          6. Meaningful
          7. Single Purpose
          8. Consistent
          9. Clear -- Don't allow digits and letters in one field. They look too alike: (0=O, 1=l=|, 2=Z, 5=S, 6=b, 8=B)

        2. Use standard codes and technologies whenever possible!
        3. Collect input automatically if at all possible.
        4. Get the input where it is available.
        5. All data storage needs back up and archiving.... but for how long?
        6. All output needs data control, security, and destruction.
        7. Who needs dead trees? Paper data is dead data.
        8. All stored or transmitted data implies a security problem. All data needs some kind of security -- how much and at what cost?
        9. Design I/O with care and with the help of the users.

        Information theory and combinatorics

        Information Theory is an elegant theory of coding data and signals created by Claude Shannon in the 1940s. It shows that you can measure the information needed for a given purpose and calculate the maximum rate at which data can be sent through a given channel using the same measure: bits. Do the math. MATH272+MATH262.
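
        Two of the basic calculations can be sketched in Python: how many bits a code needs to distinguish a given number of values, and Shannon's entropy of a set of probabilities. The function names are illustrative.

```python
import math

def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can distinguish n_values different codes."""
    if n_values < 1:
        raise ValueError("need at least one value")
    return (n_values - 1).bit_length()

def entropy(probabilities) -> float:
    """Shannon entropy in bits: the average information per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(bits_needed(12))        # a month
print(bits_needed(18278))     # a ticker symbol of up to 3 letters
print(bits_needed(10 ** 9))   # a 9-digit number
print(entropy([0.5, 0.5]), entropy([0.25] * 4))
```

        A month needs only 4 bits, yet the three letter abbreviation takes 24 bits in ASCII; the difference buys readability, not information.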

        Reference and Online Resources

          Universal Product Code

          I don't expect you to understand UPC for this course but if you are interested in these ubiquitous bar-codes see [ Universal_Product_Code ] for the details and history.

          Samples of Syntax Definitions

          My [ ] define a large number of sophisticated coding schemes including programming languages and meta-languages.


          For reference purposes see [ comp.text.ASCII.html ]

          Markup Languages

            For reference purposes see [ Mark Up Languages in index ] (my notes).
            [ ../samples/comp.html.syntax.html ]
            [ ../samples/xml.html ]
            XML reference
            1. Robert Eckstein
            2. The XML Pocket Reference
            3. O'Reilly ISBN 3692082709 1999
          1. Up to date reference [ 9780596100506 ]

          . . . . . . . . . ( end of section XML)


          1. Controversy about the meaning and format(?) of <time> in HTML5: [ ] (Wired Online November 2011).

          . . . . . . . . . ( end of section HTML5)

        . . . . . . . . . ( end of section Markup Languages)

      . . . . . . . . . ( end of section Reference and Online Resources)

      Review Questions

      1. When did changing the coding of one element become important in CSUSB and the recoding take more than 6 months to complete?
      2. What is IANA and what does it do?
      3. Define: ASCII, Unicode, BCD, EBCDIC, HTML, XML.
      4. Parse the following data strings -- label the parts:
        1. CSE201-02
        2. for encoding Data
        4. telnet:
        5. LOL

      5. Take the above parsings and express them using a plausible XML syntax. For example the first might be
      6. How many ways can you encode a date?
      7. How many ways can you encode a time on a given day?
      8. How about encoding a time on any day....
      9. Research: How does the California DMV assign number plates? Can you have an obscene vanity plate? How can California enforce this rule?
      10. Compare doing the same calculation in Arabic, Roman, and Binary:
        1. 32 * 16 = ?
        2. XXXII * XVI = ?
        3. 100000 * 10000 = ?

      11. Count from 0000 to 1111 in Binary, and in Hexadecimal.
      12. Visit my office and figure out how I've encoded the day of the month in it.


  1. TBA::="To Be Announced".
  2. TBD::="To Be Done".

