This Course has been replaced by CSE557


    Detailed Data Design

      A critical part of system analysis is abstracting the underlying Logical Data from the given (or planned) Physical Data. Computer Science has discovered a powerful procedure that helps this process. It is called normalization. It is an important process that becomes, with experience, an intuitive part of good design. The research was done by Edgar F. Codd [ Edgar_F._Codd ] (Wikipedia) -- who won the Turing Award for it. His breakthrough was to realize that all data can be expressed using tables, with the connections between the tables being handled by storing key values. He called such a data base a Relational Data Base and this has now become the dominant model. Converting existing and planned data to relational form is called normalizing it.

      Story -- There is nothing so practical as a good theory

      Back in the 1970s I taught at a university which sent all its students out, every summer, to do real work. Faculty would visit the students at their enterprises to check that the student was not too unhappy and that the enterprise liked the work the student was doing. One effect was that the faculty had regular updates on the technology used in industry. I had several students working for a well-known international oil company. I had several surprises when talking with their bosses. One surprise was about relational data bases. I had written them off as an academic theory of no practical worth. The company was replacing all its files, "CODASYL" data bases, and proprietary data base management systems with relational ones. Then I realized the coming importance of Codd's work.

      Data Models in Projects

      In any real project you should make both Conceptual Models [ d3.html ] and Normalized Models. Both should be expressed using data dictionaries and ERDs. As a general rule, the conceptual model gives you a simpler picture of the data needed in a project. Normalization provides a complete and detailed design for a data base. The two techniques are complementary. You can do informal and mental normalization as you do conceptual modeling. Your conceptual models will provide names for tables that you later derive during normalization. Occasionally you will uncover a gigantic mismatch between what people think (their conceptual model) and what the data actually is. In this case you have an interesting political problem to solve. Programming such a mismatch without resolving the underlying problem will probably make the situation worse.

      In this page I will review the words and processes we use, give some hints, give rules for mapping a relational model to ERDs, and describe a simplified UML profile for modeling data.

      Data Tables and Normalization

        Keys and Identifiers

          When an enterprise needs to keep track of an entity it nearly always assigns a different identifier to each individual instance. As an example, all the project reports and theses in the department office have just had a number assigned to them -- and stuck on the back. Now we can refer to a document just by that number. The number identifies the report or thesis. We say that the thesis or report depends on the number. We say that the number is a key identifying the thesis. Typically, the reports and theses are stored so that the key increases. This makes it easy to find one given the value of its key. However, it is harder to find a report if you only know the name of the student who wrote it. The name is not the prime key for the reports and theses.

          Pay attention to identifiers and how they are assigned. A system can work or fail depending on how it handles them. For example, a university library (a long way from here) saved thousands of dollars by assigning identifiers to books that were meaningless serial numbers, and letting the computer store the relation between this identifier and the traditional Library of Congress call number.

          In computerized systems, identifiers are mirrored by keys. All data is stored as a set of records of the same type in a file or table. Each record is a set of data fields -- attributes of some real entity. In these records, there should be a special field called a prime key. A prime key uniquely identifies one record in the table (= file). The value of the prime key is unique in the table. The rest of the attributes in the record all depend on that key. Keys act as surrogates for records. We can then store the value of the key in another record where it links back to the first record. There it is a foreign key because it usually identifies a record in a foreign file/table. So a foreign key is a field in a table/file that is a [ prime key ] in a different table/file. However, sometimes an attribute identifies a record/row in its own table. For example a table of employees may have a column that identifies the employee's boss. This is legal. And we still call the attribute/item/column a foreign key.

          Candidate Key

          A candidate key for a table can be any set of fields in the table that might be used to uniquely identify the rows in the table.

          Prime Key

          A prime key is a set of fields in a table that uniquely identifies a row in the table. Notice that several fields may be needed to uniquely identify a row. Together they form a compound key.

          Foreign Key

          A foreign key is a field in a table that is a prime key in a different table. A foreign key defines a natural many-to-one relationship between one table and another. There is one row identified by the prime key. But the same value can appear in many records as a foreign key. In a diagram you should connect the box for the table with the foreign key to the box that has the prime key. Each object with the foreign key identifies one prime key entity.

          By the way, sometimes a foreign key actually identifies a different entity/row in the same table. For example, in a medical database, a table of People may have a prime key of an SSN, but a foreign key may identify the person to be contacted in the case of an emergency by storing an SSN. As another example: I have a bank account with a numerical prime key, but it also designates another account for when I die. They store that account's ID as a foreign key. Treat these as just another foreign key and don't panic.
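          Here is a minimal Python sketch of these ideas (the People table, names, and SSNs are invented for illustration). Each table is a dict keyed by its prime key; a self-referencing foreign key is just another field:

```python
# Each table is a dict keyed by its prime key; the other fields are
# plain attributes, some of which are foreign keys.

# Prime key: SSN.  emergency_contact_ssn is a foreign key that happens
# to point back into the same People table -- just another foreign key.
people = {
    "111-22-3333": {"name": "Alice", "emergency_contact_ssn": "444-55-6666"},
    "444-55-6666": {"name": "Bob",   "emergency_contact_ssn": "111-22-3333"},
}

def emergency_contact(ssn):
    """Follow the self-referencing foreign key to another row."""
    return people[people[ssn]["emergency_contact_ssn"]]["name"]

print(emergency_contact("111-22-3333"))  # Bob
```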

          Object Identifiers

          You will also hear about OIDs:
        1. OID::="Object Identifier", an invisible internal attribute associated with each object in an object-oriented program that uniquely identifies it. OIDs are based on pointers in C++.

          Physical Keys

          Once upon a time you would find people using the address in the storage device to indicate a different item of data. This is faster but very hard to change. It is out of favor. Only in manual systems are you likely to find instructions for where to find objects!

          Secondary Keys

          You will also find Secondary Keys in many data bases. A secondary key identifies a set of rows in a table. A database system is told about these so that it can attempt to optimize performance by creating indexes. You may also find secondary keys called views. Secondary keys should be ignored when normalizing data. Once the data is normalized, any secondary keys can be documented and added to the data base schema. They should have a specific identified purpose -- satisfying some stakeholder's need.

          Notation for keys

          In word-processed texts and handwritten documents, underline keys. In the UML, keys are attributes marked with stereotypes: prime keys are marked << PK >> and foreign keys << FK >>.

          Informally, when using ASCII to normalize some data I mark all keys with an asterisk in front. Foreign keys have one, and prime keys have two.

          Here is an example of the data we would associate with a Thesis in a data base about graduate students:

           	Thesis(**number, author, date, title, *supervisor,...)

        . . . . . . . . . ( end of section Keys) <<Contents | End>>

        Extracting data structure from given data -- Normalization


        1. ** Normalization is the most reliable way to design a data base to support your system.
        2. ** Normalization forces you to get answers to important questions. It reduces guess work and corrects errors.
        3. ** Normalization does not give you the most efficient data base. However, it gives you a collection of the simplest possible data structures.
        4. ** Normalization makes certain anomalies impossible. Example anomalies: (1) You can't insert an instance of an entity unless another entity is already in the data base. (2) You dare not delete the last entity of a given type because it destroys all record of a parent type. (3) You can't update certain records without being forced to change a completely different one.
        5. Normalization is a way of going from existing data to the logical data: A collection of simple tables linked solely by keys.
        6. If you start with a rough design for your data, normalization improves it.
        7. Normalization is a quality control process. If you design normalized data, you won't have to normalize it.
        8. It is possible to map normalized data onto just about any data storage system. Normalized data is easy to store in text files, traditional fixed-length record files, spreadsheets, and even XML. Indeed -- even object-oriented data inside a program should be normalized whenever possible. The attributes of an object should depend only on that object's identity.
        9. Normalized data is not the fastest or smallest design for the data, but it is simple, obvious, and the basis for optimized designs. Design normalized data first and then optimize. If worried do the math on the volume of the data vs your resources, and the timing of performance critical processes. Then you can always re-factor the design to work faster or use less disk space; and sometimes both.
        10. Normalized data gives you the freedom to code data elements any way you wish.
        11. Normalization, in class, is not like real normalization. In a real project normalization raises questions that you must get answered. You have to go out in the field and ask questions. In fact, many developers have found that normalization generates precisely the right questions that you have to ask stakeholders.
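          As a small illustration of the update anomaly (point 4, example 3), here is a Python sketch with invented order data. In the unnormalized form a product description is repeated in every order line; in the normalized form it depends only on the product key and so is stored exactly once:

```python
# Unnormalized: the product description is repeated in every order line.
order_lines_0nf = [
    {"order#": 1, "product#": "WI007B", "description": "Widget size 7 type B", "qty": 2},
    {"order#": 2, "product#": "WI007B", "description": "Widget size 7 type B", "qty": 5},
]
# Update anomaly: renaming the product means editing every order line,
# and missing one leaves the data inconsistent.

# Normalized: the description depends only on the product key, so it is
# stored exactly once and every order line finds it via the foreign key.
products = {"WI007B": {"description": "Widget size 7 type B"}}
order_lines = [
    {"order#": 1, "product#": "WI007B", "qty": 2},
    {"order#": 2, "product#": "WI007B", "qty": 5},
]

products["WI007B"]["description"] = "Widget size 7 type B (new)"  # one update fixes all lines
for line in order_lines:
    print(line["order#"], products[line["product#"]]["description"])
```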

          Starting Point

          The first step in generating a logical data base from a sample of data is called 0NF -- "Zero Normal form". All you do is:
        12. Parse the sample -- Mark up or highlight identifiers and other data. Ignore delimiters. Attach a name to each piece of data.
        13. List names of fields in parentheses. Add asterisks for keys. Show repeated groups in parentheses:
              ( **Identifier, data, (repeating group ) , *foreign_key )

          Functional Dependencies

            One data item depends on other data if its value can be deduced from the other data values. For example, you look at some samples of "Customer Order" which describes some items that a customer wishes to get from us. When we look at our sample we notice that whenever a customer orders an item with product number "WI007B" it is described by "Widget size 7 type B". In fact the product number is all we need to know to predict the product description. We say to ourselves: "Product description depends on Product number". This is an example of a dependency.

            Looking further we notice that the price seems to depend on the product type in some of our samples, but not in others. Sometimes a "WI007B" costs $5.00 and sometimes $4.95. So we say that "Price does not depend on the Product Type".

            If we go and interview the clerk who handles Customer Orders, we discover that different customers get different discounts, and that the price varies with time. So you can say "The price depends on the time, the customer, and the product type".

            You'll find that experts tend to use the phrase "Functional Dependency". Here is an explanation why.

          1. functions::mathematics=following,
              In mathematics a function models a many-to-one dependency. If a mathematician writes
            1. sin: Real >-> Real

              then they say that "sin is a function". They mean that each real number x maps to a real number called sin(x). Similarly, in C/Java/C++ when we declare

                       double distance(Point x, Point y);
              we are saying that the distance depends on the two Points.

              The rule is that a relation R is a function when

            2. for all x there is exactly one y such that (x R y).

              So, the relation between Product Number and Product Description (above) is a function!

            3. Product_number >-> Product_description.

              Hence this is a functional dependency.

            (End of Net)
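            A functional dependency in a sample can be checked mechanically. Here is a small Python sketch (the helper `determines` is invented for illustration) applied to the Customer Order example above:

```python
def determines(rows, lhs, rhs):
    """True if, in this sample, each lhs value maps to exactly one rhs
    value -- i.e. rhs is functionally dependent on lhs in the sample."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

sample = [
    {"product#": "WI007B", "description": "Widget size 7 type B", "price": 5.00},
    {"product#": "WI007B", "description": "Widget size 7 type B", "price": 4.95},
]
print(determines(sample, "product#", "description"))  # True
print(determines(sample, "product#", "price"))        # False
```

            Note that a sample can only refute a dependency (as with price here); confirming one for real needs a question to the stakeholders.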

          . . . . . . . . . ( end of section Functional Dependencies) <<Contents | End>>

          Normalization Process -- The key, the whole key, and nothing but the key

          Memorize the above mnemonic. Practice the steps to prepare for exercises, quizzes, finals, and real live projects.

          Normalization has numbered levels: Zero normalized form, 1st normal form, 2nd normal form, 3rd normal form, ... We abbreviate these: 0NF, 1NF, 2NF, 3NF, ...

        14. normalize::procedure=
          1. (0NF): List tables and their fields. (field, . . .). Indicate candidate keys.
          2. (1NF): The Key: no repeating groups or options. Generates two tables from one.
          3. (2NF): The Whole Key: No part key dependencies. Moves attributes between tables.
          4. (3NF): Nothing but the Key: No non-key dependencies. Generates extra tables.
          5. We won't cover 4NF and 5NF in this course.

          (End of Net)

          Tool kit for normalization

          I've used chalk boards, flip-charts, paper and pencil, PDAs, palmtops, laptops, and desktop PCs. Simple editors/word processors are a good tool for normalizing data. For example, on Microsoft Windows I use Notepad or WordPad. On UNIX, vi.

          Copy/paste each successive stage into the next (DEMO in class if time).

          Show keys by typing an asterisk "*" in front. Use "**" for Prime Keys and "*" for Foreign ones. Show repeated and optional groups in parentheses:

              Name( **prime key, *foreign key, attribute, (repeating group ) (optional data) )

          Suggested Abbreviations for Normalizing data using a simple ASCII editor

          1. (....) = Repeating and/or optional group.
          2. ** = Primary Key. Note: usually underlined.
          3. * = Foreign Key.
          4. # = "number".
          5. ? = "Yes/No" | "True/False".
          6. N = Name.
          7. C = Code.

          Examples of Normalizing Samples

            CSUSB SIS+ Screen 105 Course Section Scan
              The following is a step by step normalization of a sample of data on the old CSUSB Student Information System (SIS+).

              First the sample of data:

               105 Course Section Scan
               Screen: ___ SID: _________ Course: CSCI        Term: 071             G S
                  tS    SEATS                    INSTRUCTORS/                  PRIM R T CALL
               CSCI-121  - COMPTR TECH&PEOPLE
               01         76  T      1200-0150PM WATSON P        80   4   2.00 LEC    O 10649
               CSCI-122  - BIOINFORMATICS
               01        113  M      0200-0350PM DEVLIN K       120   7   2.00 LEC    O 10650
               02        118  T      1200-0150PM DEVLIN K       120   2   2.00 LEC    O 10652
               CSCI-123  - USING COMPTR SOFTWARE
               01         30  M      0900-0950AM PERLEPES T      30       2.00 LEC  A O 10653
               02    NG   30  M      1000-1150AM PERLEPES T      30       0.00 ACT  A O 10654
               01         59  R      0200-0250PM WATSON P        60   1   2.00 LEC  A O 10655
               02    NG   29  R      0300-0450PM WATSON P        30   1   0.00 ACT  A O 10656
               03    NG   30  R      1200-0150PM WATSON P        30       0.00 ACT  A O 10657
               01         30  TR     0600-0715PM HAN K           30       4.00 LEC    O 10681
               02    NG   30  T      0730-0920PM HAN K           30       0.00 ACT    O 10682
               01         30  TR     0400-0600PM BOTTING R       30       4.00 LEC    O 10683

              0NF Screen 105
                Describe the data: give names to fields, mark keys (*), guess at prime keys (**), and note repeating groups (....).

              1. Screen( **dept, ** term , ( *course, title ( *sect#, available, days, times, teachers, max, enrolled, units, mode, group, status, *call#, subtitle)))
              1NF Screen 105
                The key -- Remove repeating groups and add new tables. Each removed group takes all the prime keys of the outer groups, and its own key. It takes just the fields that repeat in the group. This may leave some key only tables. Don't worry.

                In this example there are two (nested) groups -- data determined by the class and those depending on the section.

              1. DepartmentTerm (**dept, **term )
              2. Course( **dept, **term, ** course, title )
              3. Section (**dept, **term, ** course, **sect#, days, times, #available, #enrolled, maxEnrollment, mode, group, status, * call #, subtitle)

              2NF Screen 105
                The whole key -- Now look for part key dependencies. Parts of prime keys that have data that is determined by them alone, independent of the rest of the prime key.

                In this example, we notice that the course determines its title. So we extract Course, leaving Offering (an invented name).

              1. DepartmentTerm (**dept, **term )
              2. Course( ** course, title )
              3. Offering( **dept, **term, ** course )
              4. Section (**dept, **term, ** course, **sect#, days, times, #available, #enrolled, maxEnrollment, mode, group, status, * call #, subtitle)
              3NF Screen 105
                Nothing but the key -- now we look for dependencies between data items that are not in prime keys.

                In this case the "mode" and the "call#" determine other data. This gives us two new tables. In the Section table we have: the dept, term, course, and section depending on the call# and so these prime keys are demoted to foreign keys. Be careful to scan other records for data items that are now foreign keys (mode in Section).

              1. DepartmentTerm (**dept, **term )
              2. Course( ** course, title )
              3. Offering( **dept, **term, ** course )
              4. Section ( ** call# , *dept, *term, * course, *sect#, days, times, #available, #enrolled, maxEnrollment, *mode, group, status)
              5. ModeSubtitle(**mode, subtitle)
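              To check the result, the 3NF tables can be sketched as Python dicts (sample values copied from the screen above; only a few fields shown) and the original screen line rebuilt by following the foreign keys:

```python
# A few of the 3NF tables above, keyed by their prime keys.
courses = {"122": {"title": "BIOINFORMATICS"}}
sections = {  # prime key: call#
    "10650": {"dept": "CSCI", "term": "071", "course": "122", "sect#": "01",
              "days": "M", "times": "0200-0350PM", "enrolled": 7, "mode": "LEC"},
}

def screen_line(call):
    """Rebuild one line of Screen 105 by following the foreign keys."""
    s = sections[call]
    title = courses[s["course"]]["title"]
    return f"{s['dept']}-{s['course']} {s['sect#']} {title} {s['days']} {s['times']}"

print(screen_line("10650"))  # CSCI-122 01 BIOINFORMATICS M 0200-0350PM
```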

          1. Note -- we normalize many sample documents -- one at a time. Each sample tends to add fields to existing tables, and add new tables to the data base. The question to ask is: where does this bit of data on the new document fit into the already discovered data base? What data does it depend on? What determines its value?

          2. Here is another example from the old SIS+ system

            Screen 105 + CSUSB CSCI Dept Winter Schedule 2007

            1. Separates the Course identifier into two parts: dept and number.
            2. Adds information on the building and room.
              3NF Screen 105 + Winter Schedule
              1. Department(**dept)
              2. DepartmentTerm (**dept, **term )
              3. Course( **dept, ** course#, title )
              4. Offering( **dept, **term, ** course# )
              5. Section ( ** call# , *dept, *term, *course#, *sect#, days, times, *building, *room#, #available, #enrolled, maxEnrollment, mode, group, status)
              6. ModeSubtitle(**mode, subtitle)
              7. Room(**building, **room# )
              8. Building( **building )
                For all courses
              1. #available = maxEnrollment - #enrolled.

                More: when I asked the head of department how the maximum enrollment was chosen, he explained it was a property of the room.

              2. Section ( ** call , *dept, *term, *course#, *sect#, days, times, *building, *room#, #enrolled, mode, group, status)
              3. Room(**building, **room#, maxEnrollment )
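              Because #available is derivable, it should be computed rather than stored. A minimal Python sketch (table layouts as above; the building and room values are invented):

```python
# Section stores only #enrolled; maxEnrollment lives with the Room.
rooms = {("JB", "102"): {"maxEnrollment": 120}}   # prime key: (building, room#)
sections = {                                      # prime key: call#
    "10650": {"building": "JB", "room#": "102", "enrolled": 7},
}

def available(call):
    """#available = maxEnrollment - #enrolled, derived at query time."""
    s = sections[call]
    return rooms[(s["building"], s["room#"])]["maxEnrollment"] - s["enrolled"]

print(available("10650"))  # 113
```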

        . . . . . . . . . ( end of section Examples of Normalizing Sample data) <<Contents | End>>


      1. Don't use upper case letters unless you write badly.
      2. (Agile Normalization): remove data that is positively not a key and replace by "...".
      3. Use abbreviations: #=number and ?=Boolean for example.
      4. When you have more experience you may often be able to combine several steps, going directly from 0NF to 3NF quite quickly. But this is a matter of experience and skill. Don't do this, yet. Do normalization step by step, and keep a record of each normal form. This lets you review what you have done.
      5. Mistakes can be made. So get your work checked by colleagues and stakeholders!

      6. I often start out with an incomplete set of 3NF tables with missing columns. As I uncover each data element I work out which table/entity it belongs in, and where it is a key, and whether I need more tables to record relationships and data.

      7. Most books don't mention Optimized 3NF. After getting 3NF from different samples, you may have many tables with the same prime keys.... try combining them.

        Showing 3NF in ERD Notation

          It is easy to change a 3NF schema back into an ERD.
        1. Mark all the key fields: prime(**) and foreign(*).
        2. Start at the top of the page/screen with tables that have no foreign keys and the simplest prime keys.
        3. Put each new entity with combined prime keys below these and connect them to entities that have a part of the prime key.
        4. Connect foreign keys to prime key.
        5. Connect parts of a composite key to one prime key. The prime key is the "one" end, the other key is the "many" end.
        6. Validate against process and, if needed, add extra entities/tables (views and indexes) to provide starting points into the database.
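          Steps 1-4 above can even be mechanized. Here is a Python sketch (the schema description format is invented for illustration) that derives the many-to-one connections from the foreign keys of the Screen 105 tables:

```python
# Each table lists its prime-key fields and its foreign keys
# (field -> name of the table whose prime key it is).
schema = {
    "Course":       {"pk": ["course"], "fk": {}},
    "Offering":     {"pk": ["dept", "term", "course"], "fk": {"course": "Course"}},
    "Section":      {"pk": ["call#"], "fk": {"course": "Course", "mode": "ModeSubtitle"}},
    "ModeSubtitle": {"pk": ["mode"], "fk": {}},
}

# Step 4: draw an edge from each table with a foreign key (the "many"
# end) to the table holding the matching prime key (the "one" end).
edges = sorted((table, target)
               for table, t in schema.items()
               for target in t["fk"].values())
print(edges)
```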

          Computer Science Brochure

            I took the 2004 CSci brochure and normalized it. This is what I found:


            1. Department(**department name, address, phone, fax, email, blurb)
            2. AccreditingBody(**body name, url)
            3. DepartmentAccredition(**department name, *body name)
            4. Program(**department name, **type, **subject, description)
            5. FacultyDepartment(**department name, **faculty name)
            6. Faculty(**faculty name, faculty title, rank, highest degree, date of highest degree, email, research description)
            7. FacultyResearch(**faculty name, **research topic)
            8. Prospect(**prospect name, prospect title, address, phone, email, international?, Financial assistance?)
            9. ProspectResponse(**prospect name, **department name, **sequence#, date)
            10. ProspectRequestsQualification(**prospect name, **department name, **sequence#, **type, **subject)
            Here is the conceptual ERD derived from the 3NF.

            Diagram summarizing above tables.

          . . . . . . . . . ( end of section Showing 3NF in ERD Notation) <<Contents | End>>

        Using the UML to model data

          Profiles customize the UML

          A UML profile is a set of stereotypes, tags, and constraints that extends the UML for a particular purpose.

          Stereotypes restrict the meaning of things in UML diagrams. They look like this: << stereotype_name >>.

          These are often associated with a particular graphic notation.

          Tags provide extra information. They take the form: {tag_name=tag_value}.

          Constraints describe additional rules that should hold in a model. They are statements that must be true. There is a specialized language (the Object Constraint Language or OCL) for writing them if needed.

          UML Data Modeling Profiles

            I've given you links to some data base profiles below that I think are too complex for this course. My advice: KEEP IT SIMPLE.
            (proposed profile for data bases): My proposed profile for describing data bases
            stereotype          a special kind of  meaning
            conceptual model    class diagram      picture of the world outside the computer in terms of entity types and relationships between entities
            logical data base   class diagram      a picture of an ideal data base independent of a particular technology or technological requirements
            physical data base  class diagram      a detailed solution, for a particular technology, to a set of requirements
            table               class              database table
            PK                  attribute          prime key
            FK                  attribute          foreign key
            view                class              view of a database (for a function)
            index               class              index into a database (for a function)

            (Close Table)

            In conceptual models do not show operations, keys, tables, views, indexes, composition, aggregation, etc.
            In logical and physical data bases unmarked classes are tables.
            Use <<table>> only if you have a diagram that has both persistent (data base) classes and run time objects in the same diagram.


          . . . . . . . . . ( end of section UML Data Modeling Profiles) <<Contents | End>>

      Online Resources and Examples

        Definition of Normalization

        Here is an excellent definition: [ Database_normalization ] (Wikipedia). In this class we won't go further than third normal form.


        1. A Flyer for a meeting at CSUSB [ 372doc4.pdf ] Flyer(OrgN, Chapter, (* SpeakerN, Description, title, abstract, motivation, date, time, building, room#), (TypeOfMember, Price), (Reason), URL )

        2. Example [ 372doc1.gif ] Schedule( DeptCode, DeptName, (Day), (Person#, PersonName), (Day, Person# (State) ) )

        3. Example [ 372doc5.pdf ] Container(*name, description, (property), (operation), (picture) )

        4. Example -- Tally Roll from a drive thru with my analysis: [ 372doc3.pdf ]
        5. Also see the example at [ ltnorm.html ]

          ISSA Flyer

            Albert Garcia brought this [ 372doc4.pdf ] (in 2006) to class.

            0NF -- ISSA Flyer

            1. Flyer(OrgN, Chapter, (SpeakerN -> Description, title, abstract, motivation, date, time, building, room#), (TypeOfMember, Price), (Reason), URL )

            1NF -- ISSA Flyer

            1. Organization(**OrgN, chapter, URL)
            2. Reason(**OrgN, **reason)
            3. Speaker(**SpeakerN, Description)
            4. Talk(**OrgN, **Speaker, *Date, *Time, title, abstract, motivation, building, room#)
            5. Price(**OrgN, **TypeOfMember, price)

            2NF -- ISSA Flyer = 1NF

            3NF -- ISSA Flyer = 1NF

          . . . . . . . . . ( end of section ISSA Flyer) <<Contents | End>>

          Info Service Schedule

            Aaron produced the following print out from a SQL data base [ 372doc2.pdf ]

            0NF - Info Service Schedule

            1. Schedule( **DeptCode, DeptName, (*Day), (*Person#, *PersonName), (*Day, *Person# (State) ) )

            1NF - Info Service Schedule

            1. Department(**DeptCode, DeptName)
            2. Day(**day#, day)
            3. Person(**Person#, PersonName)
            4. Slot( **DeptCode, **day#, **Person#)
            5. SlotState( **DeptCode, **day#, **Person#, **StateNumber, StateCode)

            2NF - Info Service Schedule

            3NF - Info Service Schedule

          . . . . . . . . . ( end of section Info Service Schedule) <<Contents | End>>

      1. Old CSci Brochure [ CSBrochure2004.pdf ] (Large sideways PDF)

      UML Profiles

      (Ambler's Proposed UML profile for Data Base design): Visit Ambler's UML DB Profile [ umlDataModelingProfile.html ]
      (Rational's proposed UML profile for mapping objects to data models): PDF [ tp185.pdf ]





      . . . . . . . . . ( end of section Online Resources and Examples) <<Contents | End>>

      Review Questions and Exercises on Normalized Data

      1. Bring any piece of data to class -- we will normalize it.
      2. Terminology: Can you define these terms from [ d3.html ] ? Table, Row, Tuple, Field, Attribute, Key. (some terms refer to the same idea).
      3. Define: 0NF, 1NF, 2NF, 3NF.
      4. Describe the normalization process as a scenario.
      5. Take any 3NF table described above and show what it looks like as a table in a data base.
      6. What is normalized data good for? What is normalized data bad for?
      7. Draw a rough DFD of normalization: what data does it start with and what does it output?
      8. Print out any CMS screen -- normalize it. Do you discover the same data as implicit in SIS+ above?
      9. Find any tally roll or printout that you have (and don't mind sharing) -- normalize it. Use the Template at the end of this page.
      10. Do you have any sample data from a field trip? -- Normalize it.
      11. You can apply normalization to software development artifacts. Do you have any source code to normalize? How about a data dictionary?

    . . . . . . . . . ( end of section Detailed Data Design) <<Contents | End>> By the way, a Tally Roll is a British term for a receipt printed on a strip of paper in the form of a roll.


  1. TBA::="To Be Announced".
  2. TBD::="To Be Done".

