There is also a study of some well-known codes. Brian Hayes calculates how many unique codes can exist, and how many are already in use. This was published in "American Scientist" magazine, Vol 93, Jan-Feb 2005. It covers the following coding schemes:
Demonstrates the need to document the rules for data -- and stick with these rules in every program and in every part of every program.
I noted the subtle but effective security. I also noted that some cards were red or green. I asked about this and it turned out that the red cards indicated threats to social workers... So in this case one data item was the color of the media.
By the way... I recommended not computerizing this process.
Clever choices can improve system qualities: security, reliability, time to program, ...
. . . . . . . . . ( end of section Blobs) <<Contents | End>>
For example the month in a year can be encoded as a number -- in C a short int -- because the operation of adding and subtracting one makes sense. The computer operation of addition reflects something that the clients will recognize as meaningful. In SQL you could specify "SMALLINT".
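As a minimal sketch of why a number suits months, the hypothetical function `add_months` below adds or subtracts months and wraps around the end of the year. The name and the 1..12 convention are assumptions made for illustration.

```python
def add_months(month: int, delta: int) -> int:
    """Return the month (1..12) that lies `delta` months after `month`."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    # Shift to 0..11, add, wrap around the year, shift back to 1..12.
    return (month - 1 + delta) % 12 + 1


print(add_months(12, 1))   # December plus one month gives 1 (January)
print(add_months(1, -1))   # January minus one month gives 12 (December)
```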
For example somebody's height is a measurement. It makes sense to talk about twice the height. We may need to do arithmetic on it. So store it as a floating point number. This could be a "double" in C++/C/Java or a "DOUBLE" in SQL.
Money is an interesting data type because it makes sense to add, subtract and multiply money, but division does not work -- you cannot cut a cent in half. More importantly, accountants do not approve of any form of rounding error! So you should probably make sure that money is stored as a whole number representing the number of cents (or in England pennies). But you also should specify a very wide range -- in C/C++ a long int (or a long long int on systems where long is only 32 bits). In COBOL you had the ability to describe currency very clearly by using the "PIC" (picture) notation. Microsoft SQL provides a data type called "MONEY" but the standard and rival SQL systems force you to specify a data type like "DECIMAL(19,2)" which has 19 digits, 2 of them to the right of the decimal point -- perfect for cents.
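The sketch below illustrates the rounding problem and the whole-number-of-cents remedy. The helper names `to_cents` and `format_cents` are hypothetical, and the parser assumes non-negative amounts written with a decimal point and at most two decimal places. Python integers have unlimited range, so the width question raised above does not arise here.

```python
def to_cents(text: str) -> int:
    """Parse a non-negative amount such as '19.99' into whole cents."""
    dollars, _, cents = text.partition(".")
    # Pad so that '0.5' means 50 cents and '5' means 500 cents.
    return int(dollars) * 100 + int((cents + "00")[:2])


def format_cents(cents: int) -> str:
    """Format a non-negative number of cents as dollars and cents."""
    return f"{cents // 100}.{cents % 100:02d}"


print(0.10 + 0.20 == 0.30)         # False: binary floating point rounds
total = to_cents("0.10") + to_cents("0.20")
print(total == to_cents("0.30"))   # True: integer cents are exact
print(format_cents(total))         # 0.30
```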
The Roman numerals have complex syntax and semantics [ Mini-Project2.html ] (BNF) and are not good for doing arithmetic ( can you divide XXX by VII ?). Avoid them on input, use them judiciously on output, and do not store numbers in this form!
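One way to follow this advice is to keep the number as an ordinary integer and convert only when printing. The sketch below assumes the usual subtractive notation and a range of 1 to 3999; `to_roman` is a hypothetical helper, not part of any library.

```python
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to a Roman numeral for output only."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be in 1..3999")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)   # how many times this symbol fits
        parts.append(symbol * count)
    return "".join(parts)


# Do the arithmetic on integers, then convert the result for display.
print(to_roman(30), "divided by", to_roman(7), "is", to_roman(30 // 7))
```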
To be more formal we could define:
Having only two digits fits well with electrical and electronic circuits which tend to be either "on" or "off". In about 1945 Shannon defined
This has resulted in many computer people being able to recite the powers of two up to the highest address on their favorite machine.
Nibbles are written and spoken using the hexadecimal digits: 0 (=0000), 1 (=0001), 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F (=1111).
So, for example, "2A" in hex means "00101010" in binary.
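The nibble-by-nibble translation can be sketched as follows. The function names are hypothetical; the binary input is assumed to be a whole number of nibbles.

```python
def hex_to_binary(hex_text: str) -> str:
    """Replace each hexadecimal digit by its 4-bit nibble."""
    return "".join(format(int(digit, 16), "04b") for digit in hex_text)


def binary_to_hex(bits: str) -> str:
    """Replace each group of 4 bits by one hexadecimal digit."""
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(nibble, 2), "X") for nibble in nibbles)


print(hex_to_binary("2A"))        # 00101010
print(binary_to_hex("00101010"))  # 2A
```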
Here a number like 987 was encoded by three decimal digits each represented as a nibble in binary: 1001 1000 0111.
This wastes some bits but is very convenient for important things like dollars and cents.
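A small sketch of this digit-by-digit coding, compared with ordinary binary, shows where the wasted bits come from. The function name `to_bcd` is an assumption made for illustration.

```python
def to_bcd(n: int) -> str:
    """Encode a non-negative integer with one 4-bit nibble per decimal digit."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return " ".join(format(int(digit), "04b") for digit in str(n))


print(to_bcd(987))       # 1001 1000 0111  (12 bits)
print(format(987, "b"))  # 1111011011      (10 bits in pure binary)
```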
Floating point works well when we need a wide range of values and can put up with larger errors on the larger numbers.
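To see the larger errors on larger numbers, the sketch below prints the gap between a double and the next representable double. It assumes Python 3.9 or later, where `math.ulp` is available; `gap` is just a hypothetical wrapper name.

```python
import math


def gap(x: float) -> float:
    """Distance from x to the next representable double above it."""
    return math.ulp(x)


for x in (1.0, 1e8, 1e16):
    print(x, gap(x))

# Near 1e16 the gap is 2.0, so adding 1.0 is lost to rounding.
print(1e16 + 1.0 == 1e16)   # True
```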
Again in commerce and finance we need precision and speed rather than range. So a Fixed Point notation was preferred. Here you use BCD and the machine scales the number by dividing by a fixed power of ten. This is available and common in COBOL. In SQL we have DECIMAL(p,q) (p digits, q after the decimal point).
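Python's standard `decimal` module offers a software analogue of fixed point decimal (it is not BCD hardware arithmetic). The sketch below keeps every result at two decimal places, roughly like a DECIMAL(p,2) column. The helper `fixed2`, the price, the tax rate, and the choice of half-up rounding are all assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def fixed2(value) -> Decimal:
    """Round a value (string or Decimal) to exactly two decimal places."""
    return Decimal(value).quantize(CENT, rounding=ROUND_HALF_UP)


price = fixed2("19.99")
total = fixed2(price * 3)
tax = fixed2(total * Decimal("0.0775"))   # hypothetical tax rate

print(total)   # 59.97
print(tax)     # 4.65
```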
But a canny programmer would use these expressions
Again -- Money is naturally expressed as fixed point decimal with two decimal places. So
If your language supports this -- use it. If not, store money as long integers meaning the number of cents. Then
In the 1960s the American standards people ( ANSI ) proposed what has become the standard coding for characters -- ASCII. It is a 7-bit code, normally stored one character per 8-bit byte.
ASCII covers the characters needed in America, yet it has become the de facto standard on the Internet and wherever data needs to be shared. The International Organization for Standardization (ISO) treats ASCII as a specialized code for use in America. In the UK variant, the code for the American "#" becomes the symbol for the British pound. Each European country has its own special symbols.
IBM tried to create its own standard -- the Extended Binary Coded Decimal Interchange Code, named EBCDIC. This will disappear with the last mainframe.
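Recently, a new standard -- Unicode -- has been created that covers just about every character in every alphabet in the world. It began as a 16-bit code and has since grown to allow code points up to U+10FFFF, usually stored in an encoding such as UTF-8 or UTF-16. ASCII and the ISO codes appear within it.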
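As a small illustration, the sketch below prints the Unicode code point of a few characters and their bytes in UTF-8, one common way of storing Unicode. The function name `describe` is hypothetical. Note that the ASCII letter keeps its ASCII value.

```python
def describe(ch: str) -> tuple[str, str]:
    """Return the code point (U+XXXX) and the UTF-8 bytes (hex) of one character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ")


for ch in ("A", "£", "Σ"):
    print(ch, *describe(ch))
# A U+0041 41
# £ U+00A3 c2 a3
# Σ U+03A3 ce a3
```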
The Web uses HTML and HTML has introduced a number of special "entities" for showing non-ASCII characters like Σ and α. These are given numbers and are encoded in HTML like this:
For example the symbols "<" and ">" are encoded as "&lt;" and "&gt;". The double quote symbol is encoded as "&quot;".
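Python's standard `html` module can do this encoding and decoding. The sketch below escapes the reserved symbols, builds a numeric entity from a character's code number, and decodes named and numeric entities back to characters. The helper `numeric_entity` is a hypothetical name.

```python
import html


def numeric_entity(ch: str) -> str:
    """Encode one character as a decimal numeric HTML entity."""
    return f"&#{ord(ch)};"


print(html.escape('<a href="x">'))   # &lt;a href=&quot;x&quot;&gt;
print(numeric_entity("Σ"))           # &#931;
print(html.unescape("&Sigma; &alpha; &#931;"))   # Σ α Σ
```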
This link [ mathchart.html ] shows how to encode Unicode mathematical symbols in HTML and [ arrows.html ] how to do arrows. I have a partial encoding for Greek letters and other ΤΕΧ symbols in [ ../samples/tex2html.html ] (ΤΕΧ is a mathematical typesetting system developed by Donald Knuth).
There is a link to more on HTML below.
Block sequences: Blocks of numbers are given to different parts of the organization to allocate (in sequence) to records of data in a file or input stream.
Alphabetic Abbreviations and Mnemonics: A string of letters is chosen to identify an entity or a group of entities of a given type. A few letters stand in for a word or phrase. Example: States (CA, AL, ...), IANA and DNS countries (uk, tv, ru, ...), IATA (LAX, ONT, LHR, MSP, ...), abbreviations for a department teaching a course (CSE, PHYS, MATH, ENG), abbreviations for buildings on campus.
Arbitrary Systematic Codings: Example: Library of Congress subject classification, the Dewey Decimal system for books, the Association for Computing Machinery (ACM) CCS system for computer science, Linnaeus's technique for species.
Digit Groups: Different groups of digits/characters in the data are themselves coded data. For example in a 9-digit ZIP code the first digit determines a geographical area, the next two a sectional center (a region or large city), and the 4th and 5th digits a post office or delivery area. The next 4 digits identify a delivery point. Example: ZIP codes (92407-1133), phone numbers ((909)-537-5257), ..., SSN, URLs.
Derived Codes: Mixes different coded data into one element. Example: My UK Driving license number, CSUSB Library call numbers, Subscriber codes for magazines, Rooms on campus.
Ciphers and Encrypted data: Numbers are disguised for security or mnemonic purposes. Example: Spoof at the Imperial Chemical Industries was a number added to the paint sales. We have a lot of good work done since then -- look up DES, PGP, etc. on Wikipedia if you need more detail. Passwords should be encrypted as soon as they are entered and never stored without salting and hashing!
Actions: Codes that represent actions. Examples: A=Add, D=Delete, ..., the 50+ actions that the 'vi' editor has built in to it, mnemonic codes in assembler. Transaction codes -- for example with a banking application we might find deposits (coded D) and withdrawals (coded W).
Self-checking Elements: Uses an added digit or character calculated from the rest. Example: 9s remainder and 11s remainder check digits are added to a decimal number. For a detailed analysis see [ http://www.skorks.com/2011/08/even-boring-form-data-can-be-interesting-for-a-developer/ ] (SKORKS, some Ruby included).
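The self-checking idea in the last item can be sketched in a few lines. The first pair of functions appends and checks a 9s remainder digit. The third checks an 11s remainder code using the ISBN-10 weighting (weights 10 down to 1, with "X" standing for 10), which is only one of several weightings in use and is chosen here as an assumption. All function names are hypothetical.

```python
def add_nines_check(number: str) -> str:
    """Append the remainder modulo 9 as a check digit."""
    return number + str(int(number) % 9)


def nines_ok(coded: str) -> bool:
    """True when the last digit equals the remainder of the rest modulo 9."""
    return int(coded[:-1]) % 9 == int(coded[-1])


def mod11_ok(coded: str) -> bool:
    """ISBN-10 style check: the weighted sum must be divisible by 11."""
    values = [10 if c == "X" else int(c) for c in coded]
    weights = range(len(values), 0, -1)
    return sum(w * v for w, v in zip(weights, values)) % 11 == 0


print(add_nines_check("987"))    # 9876
print(nines_ok("9877"))          # False: a wrong digit is caught
print(nines_ok("9786"))          # True: swapped digits slip through
print(mod11_ok("0306406152"))    # True
print(mod11_ok("0306406125"))    # False: the weighted check catches the swap
```

The 9s remainder is cheap but blind to transposed digits; the weighted 11s remainder catches them at the cost of needing an eleventh symbol.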
There are five ways of encoding compound data:
<name><first>Richard</first><initial>J</initial><family>Botting</family></name> is a piece of text with added "tags" that indicate the meaning of the parts. In a [ Record Structure ] (above) the "tags" are not needed because their sequence is known and the lengths are fixed (or at least predictable). Thus we get an encoding that is guaranteed not to be ambiguous, is easy to read (kind of), but is somewhat inefficient.
</end tags> to delimit data. Tags can also have attributes:
<certificate type="participation">Unix Training</certificate>.
XML also allows some tags to be unpaired and these are shown like this:
<endless tag attributes... />
XML documents can be parsed fairly easily.
Each application that uses XML must have a DTD -- Document Type Definition -- published that defines the structure of the data: what tags can appear inside others. Defining a DTD takes a significant amount of work. But once defined you can use tools to check validity, ...
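As an illustration of how easily such text can be parsed, the sketch below reads the two examples above with Python's standard `xml.etree.ElementTree` module. This parser checks only that the text is well formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

text = ("<name><first>Richard</first><initial>J</initial>"
        "<family>Botting</family></name>")
name = ET.fromstring(text)

# Each child element carries its tag (the meaning) and its text (the data).
fields = {child.tag: child.text for child in name}
print(name.tag, fields)

cert = ET.fromstring(
    '<certificate type="participation">Unix Training</certificate>')
print(cert.get("type"), cert.text)   # participation Unix Training
```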
. . . . . . . . . ( end of section Markup Languages) <<Contents | End>>
In computer science most of our knowledge about linguistic design has been put into designing programming languages. Programming languages are the most complicated schemes for encoding a domain in existence. There are hundreds of them. For more take a CSE Programming Language class like our CSE320 [ ../cs320/ ] (Advert).
. . . . . . . . . ( end of section Encoding Compound data) <<Contents | End>>