Contents


    Modeling the Data in a System

      Story -- Data determines the feasibility of systems

      Recently, at this campus we rolled out a new system for handling registration etc -- including grades. It is called the "Common Management System" or CMS. A faculty member asked in the Fall of 2009 about posting "Incomplete" grades.
        Given that Faculty members now must use CMS to enter grades, why is the hardcopy multiple copy form required?

      Notice that this is a classic system improvement pattern: remove paperwork.

      And here is the reply


        It just so happens that CMS has developed an incomplete form that will be delivered to us soon. We plan to roll it out in the Winter Quarter as we need time to do set up and training. The form does have a dialog box to enter what the student needs to complete and a deadline for that work. It also allows for a default grade other than the "F" or "NC" if sufficient work was completed at the time of the contract.

        Since some students need to sign or accept (they can do the acceptance through MyCoyote Student Grades) the contract before the grade rosters will be available, this feature will be added to the class roster, too.

        Our campus requested the Incomplete form as a result of the CSU Student Records Audit that noted we had not received forms for all of the incomplete grades, they were not completed properly when received, and did not have a student signature.....This is considered a contract between the faculty member and the student, so it does require both signatures.


      The answer was that the originally implemented system could neither input nor store the required data for the function of filing an Incomplete grade.

      Moral: find out about what data exists, what is needed, and how it can be computed or input.

      By the way, also note the iterative implementation strategy: roll out only some of the functionality at each iteration. Start with a good (but incomplete) system and add to it periodically. We will compare this to some alternatives (Big Bang for example) later.

      This is a continuing story, see later in these notes on an unexpected problem with this nice new feature.

      Story -- The Available Data Affects Reliability and Usage

        When we could only get rosters in hard-copy we would have to type the names and student IDs into a spreadsheet. This always introduced errors in the data.

        But when I could copy the rosters from a terminal screen connected to the central "SIS+" system and paste the data into a spreadsheet, the errors almost disappeared. Again note the pattern -- remove paperwork.

        The latest system (CMS) gives a teacher the option to download his or her roster directly as a spreadsheet. This is a very useful feature. It is the fastest and most reliable system I have used. However it is less secure because the spreadsheet has to be downloaded in an unencrypted form... and pieces of unencrypted data may be left on my hard disk -- even after I have deleted the downloaded file.

        Similarly, Course Management Systems on this campus (Blackboard, Moodle) also extract CMS data to populate the grading subsystem. Again, having easy access to the data makes the system an improvement over previous systems. On the other hand, faculty who have uploaded materials (data) to Blackboard will not want to upload them again into Moodle -- so Moodle's success will depend on processes that download and re-upload data.

      Stories of data transfer between systems

      Another recent change also illustrates the importance of data. The campus has just moved students' Email handling from an internal server and data base to the cloud, in the form of Google Gmail. In theory it was a simple matter: extract the data from the old system and send it to the new one. It was scheduled to be done overnight. In fact some of the data in the old system was not as expected, and so erroneous data was created on the new one. It took another 24 hours to fix this. Something similar occurs when I move an address book from my old Palm Pilot to my iPod. The Palm will output a text file (export) and the iPod/iMac will import it. But the Palm Pilot allows blank lines in the Note field and the iMac/iPod does not. So I have to filter the file using some simple Unix spells...

      Introduction to data models

      System flow charts are a popular, traditional, and simple way to picture a system. They show stuff moving through it (flows), being 'stored' in it, being 'processed' by it, entering, and leaving it. This is a classic PowerPoint slide! We call the movements 'flows'. Here are some classic flows: goods, money, data, and objects. All shown as an arrow... which can be confusing.

      Computer based systems are almost entirely about handling data. In systems, the data that exists and can be created, processed, collected, and output drives the selection of good designs. We need a way to trace and define the data in our systems. We need a way to picture and visualize the data in new systems.

      To analyze and design systems that handle data we need a specialized diagram for showing the flow of data. These are called Data Flow Diagrams. We also need specialized diagrams for showing the structure and meaning of data. These are called Entity-Relationship-Diagrams.

      If you want to change a system it is vital to understand the data in it. The technical feasibility of a new system will often depend on what data is already available. Samples of data (printouts, forms, manual files and records) are a good starting point. So are the descriptions of data in the documentation and source code of any software in the system. But you need to make a more abstract or essential model of two things: (1) the dynamic flow of the data through the system, and (2) the static structure of the data in the system. To master the complexity of a real domain you need diagrams that just show the essentials: how the data moves, where it is stored, and how different data is related. These are best done by drawing DFDs (Data Flow Diagrams) and simple ERDs (Entity Relationship Diagrams). The details are often described in a Data Dictionary and we will cover these later.

      Information Technology is all about delivering information to people. Information is data provided to the people who need it, in their preferred format, at the right time. Information needs to be computed reliably, cheaply, and securely. Tracing the flow of data from source to sink is a vital technique to achieve this aim.

      Story -- Go with the flow

      When I worked in the British Civil Service a colleague described the following meeting. He had been invited to visit a branch and was there for the day to consult with them about a new computer system they wanted to develop. They explained that they wanted a program to print out a 20 page report. Each page had 20 columns and 50 rows.... He and they worked on the content and format of the report and half-an-hour before lunch they had the whole thing defined and ready to be programmed. The programming would be done by another team. There was, apparently, nothing to do for the next 30 minutes.

      So the analyst asked -- "What do you do with the 20 page report?". And a manager replied -- "I look for the row with the largest value in column 17." So my friend asked: "would you like the computer to do that for you?" They replied: "Can a computer do that?" He said "Yes -- and printing one line instead of 1,000 will save money!". They liked it.

      So my friend asked: what do you do with the row of data? They told him "We multiply the 2nd column by the 4th column and subtract the 5th column". And he said: "The computer can do that too, if you want". They liked it.

      So my friend -- now hot on the track of the end of the data flow -- asked "what do you do with the result you calculated" -- they said "if it is greater than 100 we send a memo to the manager listed in column 1." At last, my friend had found an action... "Could we just send the memo for you and let you know it was sent?".

      They then went to lunch at a local pub...

      Moral -- always ask where the output data goes to. And contrariwise -- ask where input data comes from.
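
      The chain of questions in the story can be sketched as one small process. This is only an illustration: the function name, the `make_row` helper, and the sample rows are hypothetical, and the report is assumed to be a list of 20-column rows with numeric values in the columns used.

```python
from typing import Optional


def memo_recipient(report: list, threshold: float = 100) -> Optional[str]:
    """Return the manager who should get a memo, or None if no memo is needed.

    The story numbers columns from 1, so column 17 is index 16 here.
    """
    if not report:
        return None
    # Step 1: find the row with the largest value in column 17.
    row = max(report, key=lambda r: r[16])
    # Step 2: multiply the 2nd column by the 4th and subtract the 5th.
    result = row[1] * row[3] - row[4]
    # Step 3: send a memo to the manager in column 1 if the result exceeds 100.
    return row[0] if result > threshold else None


def make_row(manager, col2, col4, col5, col17):
    """Build a hypothetical 20-column row; the unused columns are zero."""
    row = [0] * 20
    row[0], row[1], row[3], row[4], row[16] = manager, col2, col4, col5, col17
    return row


report = [make_row("Smith", 10, 5, 20, 7), make_row("Jones", 30, 4, 10, 9)]
print(memo_recipient(report))
```

      The point of the sketch is the same as the moral: once the whole flow is known, the 1,000 line report is no longer an output at all, only an intermediate value.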

      ERDs and DFDs

      We analyze and design data flows using: external entities (input/source, output/sink), processes, and stores. The data flow diagram or DFD is the central diagram used in information technology. We also analyze the data. We need to know its organization and meaning. UML Entity-Relationship Diagrams are a simple tool that does this.

      Once you have a DFD it is useful for pinpointing the changes the enterprise needs to make. You can use DFDs to present the choices to management. They form an excellent start for specifying the hardware and software that will be needed. Meanwhile the static model -- the ERD -- is the starting point for designing a data base and then designing objects inside software.

      In summary DFDs and ERDs are a useful intermediate step between problems/opportunities and solutions/plans.

      DFDs -- Data Flow Diagrams

        Here is a simple example of a DFD for a project called AMP -- The Absent Minded Professor System.

        Help an absent minded prof recall the students he has taught

        A DFD is a circuit diagram of a system. When done right -- following some very specific rules -- it becomes a rigorous picture of an information processing system. Sometimes we inherit DFDs as documentation of an existing legacy software system. This can be very helpful.

        They are good for

        1. Making rough notes when interviewing people.
        2. Mapping out existing systems to find out things to change and things to leave alone.
        3. Planning new systems.
        4. Planning the installation of a new system.
        5. Verifying our designs: how will they work?
        6. Presenting our plans to management and stakeholders.
        7. Specifying a process or function as a black box -- with hidden details inside.
        8. Documenting a system to help others understand it.
        9. Getting a list of data stores to start an Entity-Relationship or Conceptual Business Model.

        Definitions of DFDs

      1. DFDs::="Data Flow Diagrams".
      2. DFD::="A diagram that shows how data moves through processes and stores, from sources to sinks, in a system". A data flow diagram has
        1. External Entities -- Where data comes from or goes to
          • Sources -- Where data comes from
          • Sinks -- Where data goes to
          • Some External Entities are both sources for some data and sinks for other data.
        2. Processes -- Where things happen to data
        3. Stores -- Where data is held ready for future use
        4. Data Flows -- connecting processes to and from entities, processes, and stores.
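
        The four parts in this definition can be written down as a small data structure. The following is a minimal sketch, not a standard notation: the class names, method names, and the sample nodes ("Teacher", "Registrar", "Record grade", "Student") are all hypothetical.

```python
from dataclasses import dataclass, field

KINDS = ("entity", "process", "store")


@dataclass(frozen=True)
class Flow:
    source: str      # name of the node the data leaves
    target: str      # name of the node the data arrives at
    data: str = ""   # name of the data carried, if the arrow is labeled


@dataclass
class DFD:
    nodes: dict = field(default_factory=dict)   # node name -> kind
    flows: list = field(default_factory=list)

    def add(self, name, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        self.nodes[name] = kind

    def connect(self, source, target, data=""):
        for name in (source, target):
            if name not in self.nodes:
                raise KeyError(f"unknown node: {name}")
        self.flows.append(Flow(source, target, data))

    def sources(self):
        """External entities that data comes from."""
        return sorted({f.source for f in self.flows
                       if self.nodes[f.source] == "entity"})

    def sinks(self):
        """External entities that data goes to."""
        return sorted({f.target for f in self.flows
                       if self.nodes[f.target] == "entity"})


dfd = DFD()
dfd.add("Teacher", "entity")
dfd.add("Registrar", "entity")
dfd.add("Record grade", "process")
dfd.add("Student", "store")
dfd.connect("Teacher", "Record grade", "grade")
dfd.connect("Record grade", "Student")
dfd.connect("Record grade", "Teacher", "confirmation")
dfd.connect("Record grade", "Registrar", "grade report")
print(dfd.sources())
print(dfd.sinks())
```

        In this made-up example "Teacher" is both a source and a sink, while "Registrar" is only a sink.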

        Here is an example of a rough pencil and paper DFD:

        [Author (entity) writes Document(store) and prints Document to Printer]

        Each DFD summarizes a collection of simple statements. The above diagram implies some of the following facts:

        1. The Author makes changes to the document.
        2. The Author reads a preview of the document.
        3. The document is printed on the printer.

        Physical and Logical Data Flow Diagrams

        A DFD can be used to model the physical structure of a system. The physical model describes and names the hardware and software involved. Each process is one program, but may be a subsystem of programs. Each store is a separate file (think of a folder in a filing cabinet) or a table in a data base. In other words, physical DFDs show the architecture of the system not its underlying logic. However this information is better shown using a UML Deployment Diagram.

        In a logical DFD there is no mention of how something is done. No technology is mentioned. Several programs may be inside a single process. Avoid drawing DFDs that show the inner workings of a program -- there are better ways to picture the internal architecture of software. One program may even implement several processes. Stores are not described in terms of their media (data base, mag tape, disk, RAM,...) but are named for the entities (outside the system) that they store information about (student, teacher, ...).

        As a rule you should aim to move to logical DFDs as soon as possible. You can then solve the logical problems in the system without getting confused in the technology. This process produces a top-level design for a new system and is the start for specifying data and programs.

        Notations for DFDs

        There are three different icons in a DFD: External entity, Process, and store.

        There are several different notations for DFD icons:

        1. Yourdon and/or De Marco
        2. Gane & Sarson
        3. SSADM (Structured Systems Analysis and Design Method)
        4. Unified Modeling Language Component diagrams.

        [Four Notations for DFDs]

        The SSADM DFD notation was developed by the British Civil Service (with LBMS Ltd.) from the Gane and Sarson notation. It is used in England and what used to be the British Commonwealth. As far as I can judge the Gane and Sarson form is the most often used notation in the USA. The Gane and Sarson notation also allows a process box to have three compartments. These are used for: (top) a unique process ID, (middle) a description of the function of the process, and (bottom) the location where the process is carried out or the actor responsible for the process.

        I will use Gane and Sarson and encourage you to do so as well in this course. But different enterprises will use different notations.

        Below I have some notes [ UML notations for DFDs ] that show how the UML is used and explain why you should, for now, use one of the other notations rather than the UML.

        Semantics of DFDs

          Many people misunderstand DFDs -- they don't know what they mean. They have the wrong semantics. This section is about the meaning of the parts of a DFD. It is vital that you study the meaning of diagrams as well as just learning the notation (syntax).

          Semantics of External Entities in DFDs

          External entities are outside the current system. There are sources and sinks. Sources show where data flows into the system from outside. Sinks show where data leaves the system. Some entities are both sources and sinks. We tend to think of entities as being people. But they can be parts of other systems -- hardware and/or software. The key point is that we can not redesign external entities. Our system has to fit them. They are also the main source of disturbances that the system must handle. We can not control the input from an external source unless we have a process to handle anything that can happen and sieve out the data that is needed for our system. As a result we must include processing that accepts the good data and rejects the bad. This in turn means you need to specify not only the desirable input but also the physically possible data coming from a Source.
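
          A process that sieves the input from a source might look like the sketch below. The input format is an assumption made up for the illustration: each line is "id, name", and a student ID is taken to be exactly nine digits.

```python
import re

# Hypothetical rule for this sketch: a student ID is exactly nine digits.
STUDENT_ID = re.compile(r"\d{9}")


def validate_roster_lines(lines):
    """Split raw input into accepted (id, name) records and rejected lines."""
    accepted, rejected = [], []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and STUDENT_ID.fullmatch(parts[0]) and parts[1]:
            accepted.append((parts[0], parts[1]))
        else:
            # Anything physically possible but not desirable ends up here.
            rejected.append(line)
    return accepted, rejected


raw = [
    "123456789, Ada Lovelace",   # the desirable input
    "12345, Too Short",          # wrong ID length
    "",                          # empty line
    "987654321,",                # missing name
    "no comma here",             # not a record at all
]
good, bad = validate_roster_lines(raw)
print(good)
print(len(bad))
```

          Note that the rejected lines are kept rather than silently dropped, so that a later process (or a person) can deal with them.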

          Semantics of Processes in DFDs

          Processes are the only active part of a DFD. They are the only place where results can be computed, data processed, and decisions made. Data does not flow without there being a process to move it. A process is best thought of as a continuously running program. Processes handle whole streams of data. They may wait when the data is not available but they do not stop. They may repeat the same computation on each item of data as it arrives. They can make decisions and route input data to different outputs. Processes can also wait to be asked for data and then provide it to one of their outputs. Try to not see them as steps in an algorithm -- use an activity diagram (later) for algorithms.

          Some processes are subsystems. This helps keep the diagrams of complex systems simple. They are shown as a whole process in some DFDs. Each is also defined by a DFD. This is called the refinement of the process. Such processes can contain hidden data stores and sub-processes. There is a potential tree of refinements.
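
          One way to get a feel for a process that handles a stream is a Python generator. This is a rough sketch with hypothetical names (`route_grades`, the output labels, and the pass mark of 60 are all invented): it consumes items one at a time and routes each to one of two outputs.

```python
def route_grades(scores, pass_mark=60):
    """Consume a stream of (student, score) items and route each one.

    Each result is tagged with the name of the output flow it belongs to.
    """
    for student, score in scores:
        if score >= pass_mark:
            yield ("passed", student)
        else:
            yield ("needs_review", student)


def incoming():
    """A stand-in for data arriving over time from another process."""
    yield ("Ann", 85)
    yield ("Bob", 40)
    yield ("Cy", 60)


routed = list(route_grades(incoming()))
print(routed)
```

          The analogy is imperfect: this generator finishes when its input ends, whereas a process in a DFD would wait for more data.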

          Semantics of Data Stores in DFDs

          Stores are places where data is placed, and where it waits to be used. Some people use the CRUD mnemonic to describe the interaction between a process and a data store:
        1. CRUD::acronym=Create + Read + Update + Delete.

          Ultimately the data flows between processes and data stores are (nowadays) programmed inside a process using the Structured Query Language --(SQL).

            SELECT StudentName FROM Student WHERE Student.id = '123-45-6789';
          However it is a mistake to go into this level of detail in a DFD. A single data flow attached to a data store can be implemented by any number of SQL-type statements. All you have to do in the DFD is name the table ("Student" above) as the name of the data store and note the required data in the process or on the data flow.
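
          To see how one arrow can hide several statements, here is a small sketch using Python's built-in sqlite3 module and an in-memory data base. The table layout and the function names are assumptions for the illustration; only the table name "Student" and the column "StudentName" come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id TEXT PRIMARY KEY, StudentName TEXT)")


def record_student(conn, student_id, name):
    """One logical flow into the Student store, implemented by two statements."""
    cur = conn.execute(
        "UPDATE Student SET StudentName = ? WHERE id = ?", (name, student_id)
    )
    if cur.rowcount == 0:
        # No existing record was updated, so create one instead.
        conn.execute(
            "INSERT INTO Student (id, StudentName) VALUES (?, ?)",
            (student_id, name),
        )


def student_name(conn, student_id):
    """One logical flow out of the Student store."""
    row = conn.execute(
        "SELECT StudentName FROM Student WHERE id = ?", (student_id,)
    ).fetchone()
    return row[0] if row else None


record_student(conn, "123-45-6789", "Ada")
record_student(conn, "123-45-6789", "Ada Lovelace")
print(student_name(conn, "123-45-6789"))
```

          In the DFD both functions would appear as nothing more than arrows between a process and the store named "Student".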

          You should aim to have each data store labeled with the name of a single type of real world object. The data store holds records about all entities of some type or other. The name of the data store should reflect the type of entity. Ultimately they will become either tables in a database or flat files of data.

          Traditionally, creating data in a data store -- adding new records -- is shown by an arrow that flows from a process to a data store. Reading data is indicated by an arrow from the store to the process that needs it. Updates and deletions are shown as two way arrows since data has to be read and then rewritten.
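
          The traditional convention just described can be summarized as a lookup. This is only a sketch of that one convention; the arrow strings and the function name are made up for the illustration.

```python
# Arrow direction for each CRUD operation, following the traditional convention.
ARROW_FOR_OPERATION = {
    "create": "process -> store",
    "read": "store -> process",
    "update": "process <-> store",
    "delete": "process <-> store",
}


def arrow_for(operations):
    """Return the single arrow that covers a collection of CRUD operations."""
    arrows = {ARROW_FOR_OPERATION[op.lower()] for op in operations}
    if not arrows:
        raise ValueError("at least one operation is needed")
    if len(arrows) == 1:
        return arrows.pop()
    # Data moves both ways, so a two way arrow is needed.
    return "process <-> store"


print(arrow_for(["create"]))
print(arrow_for(["create", "read"]))
```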

          A data store is needed whenever data is reordered or reorganized. On the other hand if the store is a queue or buffer, so that the first item of data to arrive is the first to be output then we don't show a data store: arrows are understood to be buffered by a queue.

          Another simplification: you can put the same data store in several places. Traditionally you mark stores like this with an extra stripe at the left hand end. It also helps if you give each store a unique Id.

          Semantics of Arrows in DFDs

          The meaning of a data flow (arrow in a DFD) is subtler than you might guess. It depends on the symbols at each end: process, Entity, or Store.

          Notice that only a process can move data. So each data flow must either come from or go to a process. We do not permit data flows to connect entities or stores unless a process is involved.

          Connections between processes and entities define the interfaces between the system and its environment. It is rarely unambiguous what data is communicated. Thus these data flows must be described -- at least given a name. You need to describe the expected and normal data, the data that is erroneous, and the worst case possible data.

          Similarly, it is not clear when you connect one process to another process with an unlabeled arrow what is going on. The arrow needs to be named with the data being transmitted. The name will need further definition (later) in a Data Dictionary. Occasionally you will meet a double-headed arrow -- here someone has to define the protocol that describes the conversation between the two connected processes.

          Notice that in real systems (unlike computer programs) data flows between processes are buffered. One process writes the data and the data waits in a queue until the other process reads it. The writer doesn't have to wait for the data to be taken away. For example when you send me Email it is automatically stored before I read it. Similarly "Snail Mail" is put in my box. Memos, rosters, etc. are all buffered for me. So when modeling a real system you don't have to say that data in a data flow is in a queue. This buffering is implicit in the Data Flow model.
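
          Inside a program you have to build this buffering yourself. The sketch below uses Python's standard queue and threading modules to imitate a buffered flow. The names `mailbox`, `writer`, and `reader` are hypothetical, and the `None` end marker exists only so that the demonstration terminates.

```python
import queue
import threading

mailbox = queue.Queue()   # an unbounded buffer standing in for the data flow
received = []


def writer():
    for n in range(3):
        mailbox.put(f"memo {n}")   # returns at once; no reader is needed yet
    mailbox.put(None)              # end marker, used only by this demo


def reader():
    while True:
        item = mailbox.get()       # waits if nothing has arrived yet
        if item is None:
            break
        received.append(item)


w = threading.Thread(target=writer)
w.start()
w.join()                           # the writer has finished before the reader starts

r = threading.Thread(target=reader)
r.start()
r.join()
print(received)
```

          The writer completes before the reader even exists, and the memos still arrive in the order they were sent.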

          A data flow out of a store can only go to a process. It indicates that the process reads the data in the store but does not change it. External entities and stores are not allowed to read data directly -- they must get the data indirectly via a process. However, you don't have to label and document these data flows if the process can read the whole store. You only have to document the data flow from a data store if the process accesses only a part of the store.

          A data flow into a store must again come from a process. It indicates any combination of the three basic operations: Create, Update, or Delete. Again if the arrow is unlabeled then it is assumed that the process can (or will) change any item in the store.

          A double-headed arrow between a data store and a process indicates that the process may: create, read, delete and update the data in the store. Some omit the arrow heads in this case.


        Drawing DFDs

        Keep DFDs simple by keeping them abstract, logical, or essential -- don't document the media and format of the information -- just give it a meaningful name. Note: you can keep a list of the current or planned media/formats in a "data dictionary". Similarly a DFD should not show the current type of a part: people, procedures, hardware, and software all tend to be implementations of processes. The type of a component should be noted in a data dictionary (see [ a5.html ] ). Neither should a DFD show steps in a user scenario like "login" and "logout". These can be analyzed later in the process using more suitable tools.

        Do DFDs quickly -- pencil and paper, chalk-board. Only tidy them up when someone else needs to see them. Use a tool only to impress people. However, even when sketching roughly, follow the rules and avoid the errors listed on this page.

        Some people put unique short identifiers on each part of a DFD. Avoid this if you can! But in those cases where the boxes are numbered, here are the rules: processes are numbered 1,1.1, 1.2, ... and data stores have an id that starts with "D" plus a number. External entities can be given single lower case letters to be their unique id. These ids are good for linking the same part in different diagrams. For example, the parts numbered 1.1, 1.2, 1.3, etc. are all parts of the process numbered 1. Similarly, 1.2.1, 1.2.3, etc. are subparts of process 1.2.
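
        The numbering rule for processes is easy to check mechanically. Here is a small sketch, with hypothetical function names, that finds the parent of a numbered process and tests whether one process is a part of another.

```python
def parent_id(process_id):
    """Return the id of the process that this one refines, or None at the top."""
    head, dot, _ = process_id.rpartition(".")
    return head if dot else None


def is_part_of(process_id, ancestor_id):
    """True if process_id is a part (at any depth) of ancestor_id."""
    return process_id.startswith(ancestor_id + ".")


print(parent_id("1.2.1"))
print(is_part_of("1.2.1", "1"))
print(is_part_of("12.1", "1"))
```

        Comparing against the ancestor id plus a dot matters: without it, process 12.1 would wrongly look like a part of process 1.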

        Never use more than one piece of paper for a DFD. The trick is to have layers of detail. We do this by expanding, exploding, or refining a process into a lower level diagram. This is done by taking a process and drawing a DFD that would replace it in the original DFD. There are three levels of detail commonly needed: context, level-0, and level-1. Here is a picture of how refinement works:

        Three levels of DFD

        The table shows the three types of DFD and is followed by definitions and examples.
        Level            Content
        Process Context  Shows one process with its inputs and outputs only.
        System Context   One process + surrounding external entities
        Level-0          Make the central process BIG and draw stores, processes, and flows inside
        Level-1          Take a process on the level-0 and repeat the expansion in another DFD
        Level-n+1        Take a level n process and refine it.

        Note: 3 or 4 levels is usually enough. Don't get too detailed. Other techniques [ r1.html ] are better.

        Examples of DFD Levels -- Context, Level 0, Level 1

        [Context]

        [Level 0]

        [Level 1]

        A Note on level terminology

        I will be following well known textbooks on the naming of the levels. The Wikipedia seems to use a different form.

        Definitions of DFDs

      3. Context_DFD::DFD=Shows a system as a single process surrounded by external entities. This should show a single process -- your system surrounded by the external entities that send it data and get data from it. Each data flow should be named. No internal details allowed -- they come later. No data stores, no sub-processes: just establishes the Boundary between the system and its environment.

        One Process takes questions and answers from faculty and uses them to tutor students

      4. Level_0_DFD::DFD=Shows the main functions in a system as processes. At this level you show up to about a dozen main functions that the system provides, plus the data stores and external entities that interact with the processes. A Level_0_DFD always expands a Context_DFD.

        Expansion of Tutor DFD into 2 data stores and 5 processes

      5. Level_1_DFD::DFD=Takes a single process in a Level_0_DFD and shows the details inside it.

      6. Fish_eye_DFD::DFD=Shows a DFD inside a box representing a process in another DFD. We have a central focus where we show the details but round the edge we have higher level symbols. An excellent way to refine a Context DFD to Level 0, a Level 0 process to Level 1, and so on. It is called a fish-eye diagram because when a fish looks up out of the water it sees the whole 180 degree view compressed into a small circle. In the center of the view things look big. Further out they look small.

        Refining a DFD

        The process of finding out what is inside a process has many names: leveling, refinement, filling in the details, partitioning, exploding, decomposing, ... It is an important strategy for analyzing a problem. Start with the big picture -- the context -- then break it into smaller and smaller parts. Ultimately, as you decompose or refine processes, you will find yourself needing to express logical rules, algorithms, and types of data. Do not use a DFD to express complex logic, algorithms, or data structures. Instead, record these details by using techniques introduced later in this course:
        DFD part           Techniques for recording the details
        Processes          Activity diagrams, Use Cases, and Scenarios. Prototypes.
        External Entities  Persona
        Data flows         Data dictionary entries and coding techniques.
        Stores             Entity Relationship Diagrams, Tables, and Normalization

        Bottom-Up DFDs are chaotic

        The above is a top-down procedure. You can also draw rough DFDs of parts of the organization and link them together to get an "end-to-end" model. Here is an example from the first time this course was taught.

        [Free Information System for students 2003]

        These tend to be a little chaotic and unstructured. You may be forced to do this when interviewing people and starting design. But as soon as possible shift to top-down/refinement.

        Principle -- DFDs are systems not programs

        My Law: DFDs are good for recording how a system works. They are a way of choosing what parts of a system to change and which to protect. They can be used to define the inputs and outputs to a program. You can use them to plan a collection of new and old software (system design). BUT don't use them to design the internals of a program. You will make errors. There are more modern techniques for designing programs.

        Rules of DFDs -- DFD Errors

        Notice and learn the rules below. The key thought is that data never moves unless a process moves it.

      7. DFD_Errors::=following,
        1. Process names must start with a verb and describe an action. Try the "Hey Mom Test." A process name should make sense when prefixed by "Hey, Mom, I'm going to .....". Some describe producing an output for each input (Calculate tax) but most do more -- Prepare monthly summary from weekly data. Stores and external entities should be named with specific noun phrases. They must not indicate any activity. They are passive. All data stores must be named after the specific data they store. Information about people would be in a data store called "Person" for example.

        2. Data flows do not transfer control. An arrow is not a function call or a go to! The processes run in parallel. They can stop and wait for incoming data. It is OK for a flow to send a message, trigger, or signal without other data. However control is not transferred. The sender does not have to wait for a reply.

        3. No Flowcharts. Do not use normal flow chart symbols like decision diamonds, START, STOP etc. in a DFD. All parts of a DFD exist at the same time and operate in parallel. A process can read and store data long before and after producing an output. Processes consume streams of data and produce streams of data.

        4. Name all data flows between processes. The data flows define the interfaces between components. There must be a match. Naming the data on a data flow is the first step to defining these interfaces. Using a data dictionary to define the data that flows makes it more certain that a design will work. Unlabeled arrows between processes are often control flows and so wrong. However: arrows leaving and entering a well-named store only have to be named when they provide access to only parts of the data in the store.

        5. Data flows that change their data. Data flows are a perfect communication channel. At most they may delay the data. They can not reorganise, process, delete, or insert data that is not input into them. The data flowing in is precisely what flows out at the other end. If the data can be lost, or interfered with you add a process to describe the changes. If the data can be reordered show this as a data store and label the outgoing data to indicate the new ordering.

        6. No magic data flows. Data does not move without a process to move it. So each arrow must have a process at one end at least. Never show arrows connecting an external entity to another entity, or to a data store. Never have an arrow that connects a data store to another store. Examples: Waiter and cook. Coordinator and secretary. Teacher and student. Student to student records. Customer to bank account.

          [no magic flows]

        7. All user input must be validated. User input has to flow into a process. This process must be able to reject data that is incorrect. But we typically omit a small feedback loop between a user and the data validation process. Be careful to think about the weirdest possible data as well as the perfect and good data.

        8. No spontaneous generation. All processes and stores must have input. There must be a data flow into them.

        9. No black holes. All processes and data stores have outputs. There must be at least one data flow out of any process.

        10. No miracles. The input data flows must make it possible to compute all of the output data flows.

        11. Maintain balance. Each upper level matches its lower level expansions.

        12. No forks or joins. When flows meet or split you must have a process to control the joining and/or splitting.

        13. Not specific. A common mistake by beginners is overgeneralization. This cartoon [ http://xkcd.com/974/ ] expresses the error perfectly. In a DFD a common error is to show a data store labelled "Data Base". This merely sweeps the problem under the rug. It is an error. You may not use the word "Database" -- it is too general and conveys no information about the data in the store.
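
        Several of these rules are mechanical enough to check with a short program. The sketch below covers only rules 4, 6, 8, 9, and 13, and it is an illustration rather than a complete checker. The representation (a dictionary of node kinds plus a list of flows) and all the node names are hypothetical.

```python
def check_dfd(nodes, flows):
    """Check a DFD against a few of the rules.

    nodes maps a name to 'entity', 'process', or 'store'.
    flows is a list of (source, target, label) tuples.
    """
    errors = []
    for source, target, label in flows:
        kinds = (nodes[source], nodes[target])
        # Rule 6: every arrow needs a process at one end at least.
        if "process" not in kinds:
            errors.append(f"magic flow: {source} -> {target}")
        # Rule 4: flows between two processes must be named.
        if kinds == ("process", "process") and not label:
            errors.append(f"unnamed flow: {source} -> {target}")
    for name, kind in nodes.items():
        if kind == "entity":
            continue
        # Rule 8: processes and stores must have an input.
        if not any(t == name for _, t, _ in flows):
            errors.append(f"spontaneous generation: {name} has no input")
        # Rule 9: processes and stores must have an output.
        if not any(s == name for s, _, _ in flows):
            errors.append(f"black hole: {name} has no output")
        # Rule 13: a store called "Database" says nothing about its data.
        if kind == "store" and name.replace(" ", "").lower() == "database":
            errors.append(f"vague store name: {name}")
    return errors


nodes = {
    "Student": "entity",
    "Register student": "process",
    "Database": "store",
    "Teacher": "entity",
}
flows = [
    ("Student", "Register student", "registration"),
    ("Register student", "Database", ""),
    ("Teacher", "Database", "roster request"),
]
for error in check_dfd(nodes, flows):
    print(error)
```

        A program like this can not judge rule 10 (no miracles) or rule 1 (good names); those still need a person who understands the data.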


        Example of a DFD Error leading to a better design -- AMP Level 0

          First Iteration Context

          First iteration of context DFD

          Resulting Level 0 with error

          First level 0 has store with no input

          Fixing the error improves the design

          Improved level 0 has processes to create and upload student data

          And the Context must change to fit the level 0

          Change context to fit level 0
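
          Keeping the context and the level 0 in step is the "maintain balance" rule, and it can be checked by comparing the flows that cross the system boundary in each diagram. The sketch below is an assumption-laden illustration: the node and flow names are invented and are not taken from the AMP diagrams.

```python
def external_flows(nodes, flows):
    """Return the set of (entity, direction, label) crossing the boundary."""
    result = set()
    for source, target, label in flows:
        if nodes[source] == "entity":
            result.add((source, "in", label))    # data entering the system
        if nodes[target] == "entity":
            result.add((target, "out", label))   # data leaving the system
    return result


def balance_problems(context, level0):
    """Compare two (nodes, flows) pairs and report flows found in only one."""
    upper = external_flows(*context)
    lower = external_flows(*level0)
    return {
        "missing_from_level0": upper - lower,
        "missing_from_context": lower - upper,
    }


context = (
    {"Professor": "entity", "AMP": "process"},
    [("Professor", "AMP", "student query"),
     ("AMP", "Professor", "student details")],
)
level0 = (
    {"Professor": "entity", "Look up student": "process",
     "Upload roster": "process", "Student": "store"},
    [("Professor", "Look up student", "student query"),
     ("Look up student", "Professor", "student details"),
     ("Student", "Look up student", ""),
     ("Professor", "Upload roster", "roster"),
     ("Upload roster", "Student", "")],
)
print(balance_problems(context, level0))
```

          Here the level 0 has gained a "roster" input that the context does not show, so the context is the diagram that has to change.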

          DFD Advice


          1. Number your nodes only if you have to. Example: the boss says so.

          2. Don't clutter the DFD with the format or media: phone calls, forms, EMail, disks, tapes, print outs, HTML, XML, ... (1) The DFD shows what exists, not what form it takes. (2) Our job always involves changing the format and/or media. (3) Describe media and formats in a separate document called a data dictionary. (4) Note content as attributes in a separate ERD (below).

          3. Keep DFDs simple by omitting backup, support, and maintenance processes as long as you can. Focus on the operation of the system first.

          Data Flow Analysis of System Development

          In this class we will look at applying systems techniques to the systems work itself. This leads to a model of system development as three parallel processes. One is concerned with understanding the current system plus the latest plans and changes -- call this "Analysis". The second process is concerned with taking ideas from the Analysis process and designing plans that need implementing. The last process carries out the plan and changes the system.

          [Analysis -> Design -> Implement -> ...]

          Notice we can schedule the above DFD in many ways. We can run the analysis process until it produces an idea, then pass it to the design process, which can modify the plan that triggers implementation activity. Whether we get a traditional or an agile life cycle depends on the size of each change to the model and the plan.

          DFD Smells and Patterns

            Much of the expertise that helps us understand and plan systems is encapsulated in the following hints. They are classified as smells that are to be avoided and patterns that work well enough for repeated use.

            Pattern -- Stores contain a model of reality
            In nearly all systems the purpose of storing data is to capture a picture of some real object. So, name stores after the entities that they model. For example a file containing student records should be shown as a store named "Student".

            DFD Smell -- useless storage
            Be suspicious of data stores that have inputs without outputs or have outputs without inputs. Storing something that is never needed is wasteful. Having data that can not be altered or created (no input) is a problem waiting to happen. Example: When I moved office I found I had two filing cabinet drawers full of unread paperwork. I threw it out and plan to not keep it again.

            Exception: there may be some law that requires you to keep some data for a number of years. Find out if this is true.

            DFD Smell -- wasted motion
            Take note of processes that merely move things around in a system, especially when it is data transmitted as paperwork!
            Pattern -- Remove paperwork
            One of the traditional improvements is to replace paperwork data flows and storage by electronic forms. You need to be sure that the electronic form is as secure and reliable as the paper form. Will you be able to read it 30 years from now?

            DFD Smell -- old technology
            As you abstract away from the current technology to an abstract set of data flows, processes, and stores, take note of processes and storage that use old technology. But don't clutter the DFD! These are candidates for replacement in the new system. Perhaps, when you present your problem to management you could color the old technology red? Don't forget that sometimes an old technology is more reliable than brand new technology. Some old ways of doing things need to be preserved. As an example look at [ Word-Processors-One-Writers-Retreat ] (Slashdot Features Story | Word Processors: One Writer's Retreat) which argues that simple editors are more efficient for writing than high powered "word processors".

            DFD Smell -- Overloaded Process
            When data (and stuff) flows through an organization it can pile up in buffer zones. A person's desk can slowly disappear under the incoming paperwork, for example. Look for processes that handle their input slower than it arrives. Even if a process can just keep up with the average arrival rate, queuing theory shows that when arrivals are irregular the expected length of the queue grows without limit.
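
            The queuing-theory claim can be made concrete with the standard single-server (M/M/1) model. That model, with random arrivals and exponentially distributed service times, is an assumption added here for illustration. If the utilization is rho = arrival rate / service rate, then the mean number of items in the system is rho / (1 - rho), which grows without limit as rho approaches 1. The function name and the clerk example are hypothetical.

```python
def mean_items_in_system(arrival_rate, service_rate):
    """Mean number of items waiting or in service in an M/M/1 queue."""
    if arrival_rate >= service_rate:
        return float("inf")  # the process never catches up
    utilization = arrival_rate / service_rate
    return utilization / (1 - utilization)


# A clerk who can process 10 forms an hour, at rising arrival rates.
for arrivals_per_hour in (5, 8, 9, 9.9, 10):
    backlog = mean_items_in_system(arrivals_per_hour, 10)
    print(f"{arrivals_per_hour:>4} forms/hour -> about {backlog:.1f} on the desk")
```

            The backlog stays small while the clerk has spare capacity and climbs steeply as the arrival rate nears the service rate.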

            The ideal solution is to have multiple copies of the process running in parallel, on multiple processors if necessary. Input is distributed to the least loaded or first available server that can run the process. The next solution is to find ways of speeding up the process: better technology, simpler logic, ... Simple examples of this strategy are upgrading the CPU or adding RAM. A subtler variation is reorganizing the data storage to give faster access to the data. This trick includes defragmenting disk drives.

            As an example, high-traffic web sites may have a dozen web servers and a special load balancing "switching" server front end.

            Note -- multiple computers all running the same process are still a single process in the DFD!

            DFD Smell -- Inefficient, Intractable, and/or Non-computable Processes
            Look out for inefficient processes. These are often concerned with reorganizing data in some way or other. Many times a clever design can make them run a lot more efficiently. Sometimes you can remove the need for the process entirely by rethinking the design. Be aware that a process can often be implemented with many different algorithms and each algorithm will perform differently. You may have to specify an algorithm or give feasible limits on the efficiency of the implementation of a process.

            Computer Scientists have discovered a large family of problems that can not be solved by a computer. These can not be programmed. An example is checking to see if a program will stop or not. We have also discovered problems that apparently demand very inefficient processes to solve them. A classic example is the "Traveling Salesman problem". It is worth studying computer science theory to be able to spot these.
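
            To see why the Traveling Salesman problem is a warning sign, count the routes that an exhaustive search would have to examine. For n cities with the same distance in both directions there are (n - 1)! / 2 distinct round trips. The sketch below only does that arithmetic. The function name is illustrative, and exhaustive search is just the simplest approach, not the only one.

```python
import math


def distinct_round_trips(cities):
    """Number of distinct round trips visiting every city exactly once."""
    if cities < 3:
        raise ValueError("need at least 3 cities for distinct round trips")
    return math.factorial(cities - 1) // 2


for n in (5, 10, 20):
    print(n, "cities:", distinct_round_trips(n), "round trips")
```

            Going from 10 to 20 cities multiplies the work by hundreds of billions, so a process that needs an exact answer for large inputs deserves a second look.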

            There are also processes that are better done by a human than a machine. Ethical questions should not be handled by machines! Questions needing discretion should involve humans. Sometimes you need to design systems that support communication and cooperation so that complex (political) problems can be resolved by humans.

            DFD Smell -- Under-worked Human
            You will often find systems where an event occurs and triggers a message that is sent to a person, who in return does nothing with the message but pass it on to another part of the system. This smell is worst when the human has a fixed simple procedure that they use to respond to the disturbance. I recently heard of an example of a computer system that turned a light on and expected a human to hit a button to avoid a disaster. The person did nothing and the disaster occurred. This is bad design. Another classic example is any web site that expects you to type in data shown to you or even input by you on a previous page!

            A version of this is sending data to a human to re-input later. This introduces errors... there must be a better way to handle the problem. Here is the smelly system and a possible improvement.

            [Output comes straight back in]

            Pattern -- Automate Simple feedback
            If the choice of action in the above system can be computed from the message then a better system automatically carries out the action and reports to the person. The sensing + acting system should only ask for help on the difficult decisions. The best systems allow the person to input and update the desirable actions.

            [Adding a feedback loop to save human work]

            Examples: EMail -- automatically deleting messages that we don't need to see. Inventory -- automatically reorder when stocks get below a certain level. Record people's browsing, let them replay and/or edit the recordings.
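
            The inventory example can be sketched in a few lines of Python. The sketch assumes each stock report names an item and the quantity on hand, and that people maintain a table of reorder rules. The function name, the rule format, and the sample items are hypothetical.

```python
def handle_stock_report(item, on_hand, rules):
    """Decide what to do with one stock-level message.

    rules maps item -> (reorder_level, reorder_quantity); people maintain it.
    Returns (action, note). The system acts alone on the simple cases and
    asks a person only when it has no rule to apply.
    """
    if item not in rules:
        return ("ask_person", f"No reorder rule for {item}; please decide.")
    reorder_level, reorder_quantity = rules[item]
    if on_hand < reorder_level:
        return ("reorder", f"Ordered {reorder_quantity} of {item}; {on_hand} left.")
    return ("none", f"{item} is fine at {on_hand}.")


# Hypothetical rules; the person in charge can update them at any time.
rules = {"printer paper": (20, 100)}
print(handle_stock_report("printer paper", 12, rules))
print(handle_stock_report("printer paper", 55, rules))
print(handle_stock_report("toner", 3, rules))
rules["toner"] = (5, 10)  # a person adds a rule, so the next report is automatic
print(handle_stock_report("toner", 3, rules))
```

            The person still sees a note for every action, and the rules table is the place where they input and update the desirable actions.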

            DFD Smell -- Human Input can not be trusted
            Whenever the input from a human is supposed to be simple feedback reporting what the human has done, you must find out (1) whether the human is rewarded for inaccurate input and (2) whether there is some way to discover and correct the errors. An example was [ a1.html#LIBOR ] where traders made more money if a bank reported inaccurate values for the rates they had used... Or consider [ contentdetail.htm?contentguid=gbsKCsrZ ] (San Bernardino County welfare fraud defendants plead guilt) where an employee could change the addresses in a data base of welfare recipients without anybody finding out for several years.

            Pattern -- Keep people in the loop
            It is common for people to reject designs and sabotage resulting systems if they take control away from the people who used to be in charge. In fact the better an automated system is, the worse people will feel about being replaced by it. They will fight it.

            On the other hand there is something wrong about forcing people to work as computers. You need a balance.

            For example, it is highly rational to insist on being able to undo things that can be undone. When the CSUSB system automated the handling of Incomplete Contracts (2010) it became incredibly easy to create a contract -- no forms to fill in. No signatures to gather. Unfortunately it became impossible to remove an incomplete contract that was on file -- in the old days you ripped it up and put it in the trash can. At this time (2010) you can not do that. So small mistakes can not be corrected. There is no "Undo" feature.

            Another concern is to make sure that people can not game the system to their own advantage, and the system's loss, without being discovered. The old rule of "trust, but verify" applies here.

            Summary -- let people do the thinking and machines do the boring stuff.

          . . . . . . . . . ( end of section Smells and Patterns) <<Contents | End>>

          UML notations for DFDs.

          The UML is not designed to do DFDs. The designers (the OMG -- Object Management Group) are more concerned with the details and internals of software than with interactions between parts of a larger system. But in the specification of UML2.0 there is a way to document flows between components:

          [UML DFD Symbols and sample context DFD]

          [UML Level 0 DFD]

          At this time (Fall 2009) it is still better to use a traditional notation like the Gane and Sarson notation used in these notes.

        . . . . . . . . . ( end of section DFDs -- Data Flow Diagrams) <<Contents | End>>

        UML Data Models

          We need a simple way to describe, explore, and design data. This turns out to be a powerful technique in analyzing and designing systems.

          Data is always organized in clumps called records. A record has a collection of items of mostly different data types in it. For example the CMS probably has a record that contains all the information about a student in it. Each type of record tends to reflect a real world Entity. Each type of record is given a meaningful name and this is put in the top compartment of a UML class. These entity names should be in your DFD as well.

          Story -- Sharp Wizard Contacts Data

            I've been using small portable computers as Personal Digital Assistants (PDAs) for a long time. And all have had problems of one kind or another. The Sharp Wizard series, for example, had a very annoying way of handling phone numbers. You couldn't enter the name and number of a person without also inputting the title, rank, department, and organization. Not a bad model for a business person, but very irritating for your mother or spouse. Of course: you had to include the address of the organization as well... People didn't have an address of their own. You had to start top-down inputting the company, the department, and then individuals.

            The Palm Pilots and iPods I've been using for 6 or 7 years have a simpler model with each contact having optional data about companies and titles. But don't get me talking about the different models of events and tasks on iPods and Palm Pilots.

          Modeling Entities and Relationships in the UML

          Use UML class diagrams [ uml1.html ] (notes introducing UML for beginners) with no operations to describe data!

          Here is an example based on a project set in a restaurant.

          [Orders are made of items from a single table served by a single waiter]

          The boxes are logical groups of data, each referring to a real entity. The lines connecting the boxes are significant relationships, for example: a Table has a single Waiter assigned to it, but a Waiter can be handling several Tables. Notice that this model does not show any attributes (the properties of the entities). It does not show the waiter's name, for example. This kind of reduced model, based on ideas about the real world, is sometimes called a Domain Model. Domain Models are very useful for planning data bases (later) and for designing object oriented code (CSCI375).
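
          As a rough illustration of how such a domain model can guide code, here is a Python sketch of the restaurant entities. The associations follow the description above. The attribute names (`name`, `number`, `items`) and the helper `tables_of` are assumptions added only so that the example runs; the domain model itself shows no attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Waiter:
    name: str  # illustrative attribute; the domain model itself shows none


@dataclass
class Table:
    number: int
    waiter: Waiter  # exactly one Waiter is assigned to each Table


@dataclass
class Order:
    table: Table  # an Order comes from a single Table
    items: list = field(default_factory=list)  # an Order is made of many items


def tables_of(waiter, tables):
    """The other direction of the association: one Waiter, several Tables."""
    return [t for t in tables if t.waiter is waiter]


pat = Waiter("Pat")
sam = Waiter("Sam")
tables = [Table(1, pat), Table(2, pat), Table(3, sam)]
order = Order(tables[1], ["soup", "coffee"])
print(order.table.waiter.name)
print([t.number for t in tables_of(pat, tables)])
```

          Each "1" end of an association becomes a single reference held by the object at the "many" end, so an Order reaches its Waiter by following two references.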

          Each item of data is given a name and a type:

           		name : type
          Examples
           		address : string
           		initial : char
           		age : int
          Notice I used C++ data types... because my audience (you) has taken CS202 and can be expected to understand them. In general, you should use the words of your audience. With multiple audiences put different meanings in a data dictionary as aliases.

          When you first draw these diagrams you can just list the attribute names and jot down more information in a prototype data dictionary. For example here is a UML diagram of the data I found in a class roster.

          [TBA]

          If an item is repeated use square brackets:

           		salary_each_month [12] : money
           		children : Person[*]
           		spouse : Person [0..1]
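
          For readers who think in code, the three repetition markers correspond roughly to a fixed-length list, an open-ended list, and an optional value. The Python sketch below is one possible reading, not a standard mapping. The class layout, the use of `Decimal` for money, and the sample people are assumptions.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


def _twelve_zeros():
    return [Decimal("0")] * 12


@dataclass
class Person:
    name: str
    # salary_each_month [12] : money -- exactly twelve entries
    salary_each_month: List[Decimal] = field(default_factory=_twelve_zeros)
    # children : Person[*] -- zero or more
    children: List["Person"] = field(default_factory=list)
    # spouse : Person [0..1] -- optional
    spouse: Optional["Person"] = None

    def __post_init__(self):
        # Python lists have no fixed length, so the [12] is checked by hand.
        if len(self.salary_each_month) != 12:
            raise ValueError("salary_each_month must have exactly 12 entries")


ada = Person("Ada")
ada.children.append(Person("Kim"))
print(len(ada.salary_each_month), len(ada.children), ada.spouse)
```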

          When you meet attributes that are actually other entities/records you should connect the boxes with an association.

          If you know of a significant relationship between records/entities then show it as a line (an association) between the boxes. In fact, in some analysis and design methods, you check every pair (and grouping) of entities looking for important relationships between them.

          Mark these relations with multiplicities:

          • Optional: 0..1
          • Many: *
          • One: 1

          Here is an ERD showing the relationships between Questions, Answers, and Comments in the DFD of my Tutoring System (above):

          Questions have many Answers and Answers are associated with Comments.

          Keep Entity-Relation-Diagrams Simple

          Note: the official database notation developed by Chen is too cumbersome for everyday data analysis and design. Use it only when you have to!

          My old student edition of Rational Rose did UML ERDs well. Dia and Visio can also handle them. But the quickest way (after a field trip, say) is on a board or a piece of paper. Keep the edges of the boxes incomplete until done. Notice that you can just note the relationships without any need for attributes. Here is an example that I drew on my Palm Pilot one day.

          [TBA]

          Sometimes I even omit the boxes:

          [TBA]

          Smell -- Unreal Data

          In an ideal system the data perfectly reflects the reality -- it forms a "mirror-world". In real systems, the data is often approximate, omits details, and lags behind the real world. When the data also has the wrong structure it provides a distorted mirror of the world. The system will not work as well as it could. But the people in the system may not be aware of this: the file becomes the reality, the computer is the only truth they know.

          Look for lags, errors, missing data, and misfitting structures whenever you are analyzing a system.

          Normalizing a UML Data Base

            The following procedure improves the design of the data. It exposes logical structures that are implicit in your data.
            1. Draw an ERD of the entities and relationships with attributes inside the boxes.
            2. Extract all attributes marked with [*] as relations.
            3. Turn all many-to-many and n-ary relations into entities.
            4. Look at 1-to-1 associations: is either (or both) of the '1's really a '0..1'? If so add the "0.." and treat '0..1' as a many '*'. If not, then coalesce the two boxes into one.
            5. All associations end up being many-to-1. Redraw with the 1's above the 'many's.

            Simple Normalization in UML
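
            Step 3 is the one that most often changes a design, so here is a small Python sketch of it. The sketch assumes a hypothetical many-to-many relation between students and courses, recorded as a courses[*] attribute on each student, and replaces it with a link entity called Enrollment. All the names and sample data are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enrollment:
    """Link entity: each Enrollment has exactly one student and one course."""
    student: str
    course: str


def to_enrollments(courses_by_student):
    """Replace a many-to-many Student/Course relation by Enrollment records."""
    return [
        Enrollment(student, course)
        for student, courses in sorted(courses_by_student.items())
        for course in sorted(courses)
    ]


# Hypothetical un-normalized data: each student carries a courses[*] attribute.
courses_by_student = {
    "Ana": {"Systems Analysis", "Databases"},
    "Ben": {"Systems Analysis"},
}
enrollments = to_enrollments(courses_by_student)
for e in enrollments:
    print(e.student, "->", e.course)
# Both directions are now simple lookups over many-to-1 links.
print(sorted(e.student for e in enrollments if e.course == "Systems Analysis"))
```

            After the change, Student and Course are each on the "1" end of an association with Enrollment, which matches step 5. The new entity also gives a natural home for attributes such as a grade.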

          . . . . . . . . . ( end of section Normalizing a UML Data Base) <<Contents | End>>

        . . . . . . . . . ( end of section UML Data Models) <<Contents | End>>

      Review Questions

      1. Describe and distinguish a DFD from an ERD.
      2. Distinguish physical from logical DFDs.
      3. Name the three types of icon in a DFD. What do they represent?
      4. If there is an arrow from one icon to another in a DFD, what does it mean?
      5. What are the Gane and Sarson icons?
      6. How can you show data flows in the UML2.0?
      7. What is shown on a context DFD of a system? What is not shown?
      8. What is shown in a Level-0 DFD? What is shown in a level-1 DFD?
      9. Give an example of a simple context DFD and its matching Level-0 DFD.
      10. How do you document the contents of data stores?
      11. How do you document the detailed processing of data?
      12. List the rules that a valid DFD must follow. Then check the list [DFD_Errors] above.
      13. Below is a bad first attempt at the level 0 DFD of my automatic tutoring system. It has many errors. Mark the errors with a big "X" and the number in the list [DFD_Errors] above. For example the "Make Comment" process should be marked X7.

        [Teacher provides Questions and answers and a student answers them ...]

      14. If you discover a person who does no more than reinput some previous output -- how can you improve the system?
      15. Name 6 DFD smells.
      16. Here is a recent example scenario that I experienced:
        1. My doctor told the computer system that I needed a certain screening test.
        2. 1 Month later...
        3. My doctor's assistant sent me snail mail asking why I hadn't had the test done.
        4. I phoned the testing center and they told me that I was not eligible.
        5. When I explained why I needed the test they said the doctor would have to re-input the request in a different form that specified the reason for the test. (Note: to save money, this data must come from the doctor, not the patient.)
        6. I phoned my doctor and left a message explaining the situation.
        7. The doctor resubmitted the request.

        What smells here? Draw a partial DFD of the situation. Redesign the system to work better.
      17. What is the ultimate reason for storing data in a system?
      18. Is the ERD below normalized? If not, show how to normalize it.

        [Question (1)-(*)Answer (*)-(*)Comment]

      . . . . . . . . . ( end of section Review Questions) <<Contents | End>>

      Online Exercises on DFDs

      1. Here [ images?hl=en&q=DFD&btnG=Search+Images&gbv=2 ] is a Google search that produces thousands of DFDs! Some of them are very good and some not so good. Look at them and figure out which notation they use. What do you like and/or dislike about some of them?
      2. List some strange ways that information/data is transmitted/stored in an enterprise that you know about.

      3. Take this diagram [ manufacturing.gif ] and redraw it as a DFD -- note: you can treat some money and material flows as data flows.

      . . . . . . . . . ( end of section Online Exercises on DFDs) <<Contents | End>>

      Typical Exam Questions and Exercises on DFDs

      1. Draw a context DFD of:
        1. my web site.
        2. CSUSB's current registration and student records system.
        3. CSUSB CSCI web site.

      2. Draw a simple but correct DFD <TBA>
      3. Given a Context DFD draw a plausible and correct level 0 fish-eye DFD.
      4. Given the fish-eye DFD of a system draw its context DFD.
      5. Given a process in a DFD draw a correct and plausible expansion/fish-eye DFD.
      6. Correct a given DFD model.
      7. Answer questions about a given DFD.

      . . . . . . . . . ( end of section Typical Exam Questions and Exercises on DFDs) <<Contents | End>>

      Exercise -- Context Diagram of a Possible Project

      Either to be done in class and/or assigned as out-of-class project work.

    1. UML::="Unified Modeling Language", [ samples/uml1.html ]

    . . . . . . . . . ( end of section Modeling the Data in a System) <<Contents | End>>

    Abbreviations

  1. TBA::="To Be Announced".
  2. TBD::="To Be Done".

End