A Computer Security course: Cryptographic Techniques

Copyright © 2000 - 2002, Giovanni Chiola.

This file was last updated on Oct. 21st, 2002.



Main Cryptographic Techniques




Information Coding and Redundancy

Information is usually coded in so called "natural language" (English, Italian, etc.). A well known characteristic of natural languages is their high level of redundancy. Indeed, the application of "compression" algorithms can usually shrink a text file to a small fraction of its original size, sometimes to one tenth or even less. The presence of a high level of redundancy has a strong impact on the practical conditions of application of cryptographic techniques, so it is worthwhile to discuss this characteristic in some detail.

Let us first attempt a numerical evaluation of the level of redundancy used to encode a document. Consider the use of the 8-bit ASCII code for the representation of each character. An average, purely textual document may be composed of a few thousand characters, thus amounting to roughly 10,000 bits. An optimal encoding scheme would therefore be able to express some 10**3000 different documents using the same number of bits. Clearly human imagination is much more limited than that, so such a huge number of documents will never be produced by the whole human species over all the millennia of its history. Suppose for instance that each individual were able to produce 100 different documents each day. Assuming a total population of 5 billion, this would imply a total document production not exceeding 500 billion per day, i.e. less than 200,000 billion documents per year. Even the entire production of humanity over one million years would result in less than 2*10**20 different documents, which could be optimally encoded using less than 70 bits. Therefore the redundancy level of our average document of a few thousand characters could be as high as some 10**2980, i.e., practically infinite.
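
As a rough check of these figures, here is a short back-of-the-envelope computation in Python. The population, documents-per-day, and time-span values are just the illustrative assumptions made above; the exact exponents come out slightly larger than the rounded figures in the text, since 2**10000 is actually about 10**3010.

  import math

  # Illustrative assumptions taken from the text above, not measured data.
  bits_per_document = 10_000                 # a few thousand 8-bit ASCII characters
  population = 5_000_000_000                 # 5 billion people
  docs_per_person_per_day = 100
  years = 1_000_000

  documents_ever_written = population * docs_per_person_per_day * 365 * years
  bits_needed = math.ceil(math.log2(documents_ever_written))   # about 68 bits

  # Work with base-10 exponents to avoid overflowing ordinary floats.
  possible_exp = bits_per_document * math.log10(2)     # about 3010
  written_exp = math.log10(documents_ever_written)     # about 20
  redundancy_exp = possible_exp - written_exp          # about 2990

  print(f"distinct 10,000-bit patterns : about 10**{possible_exp:.0f}")
  print(f"documents ever written       : about 10**{written_exp:.0f}")
  print(f"bits needed to index them    : {bits_needed}")
  print(f"redundancy coefficient       : about 10**{redundancy_exp:.0f}")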



Symmetric Key Encryption

Symmetric key encryption algorithms are the main technique that can be adopted to temporarily make information unreadable even if it becomes available to an attacker. The general idea is that the piece of information to be protected (which is called "clear text" in the Crypto jargon) is modified so as to produce a so called "cypher text." The modification is obtained by applying a deterministic transformation based on a so called "encryption key." The cypher text may be available to the attacker, but it is impossible for him/her to understand because it does not follow normal lexical or syntactical rules.

The cypher text can be considered as a letter written in an unknown language: one notices the presence of the message but is unable to discover its meaning. The original clear text can be retrieved from the cypher text by applying the inverse transformation, which requires knowledge of the original encryption key. Therefore only those who know (or are able to "guess") the encryption key can retrieve the clear text given the cypher text.

The key is called "symmetric" since the same key is used for "encryption" (transformation from clear text to cypher text) and for "decryption" (transformation from cypher text to clear text). Therefore, the person who encrypts the piece of information is also able to decrypt it (it is sufficient to remember the encryption key), and vice-versa.
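
As a concrete illustration of the idea (a minimal sketch, assuming the third-party Python package "cryptography" is installed; the package and the message are examples of mine, not part of the course material):

  from cryptography.fernet import Fernet   # off-the-shelf symmetric cypher

  key = Fernet.generate_key()              # the single shared, secret key
  cypher = Fernet(key)

  clear_text = b"attack at dawn"
  cypher_text = cypher.encrypt(clear_text)           # unreadable without the key
  assert cypher.decrypt(cypher_text) == clear_text   # the same key decrypts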

General ideas and principles

Among the encryption algorithms that have been used in the past, transliteration codes are probably the simplest to understand. We shall thus start by briefly discussing Caesar's code and its more or less trivial extensions, in order to have some simple running examples to refer to while presenting general concepts. Julius Caesar used to write his secret messages (orders to be sent to his soldiers during battles) by replacing alphabetical letters. In particular, he used the following transformation table (not really, as the Ancient Romans used a 21 character alphabet, which was a proper subset of the current 26 character English alphabet ...):
Clear text A B C ... W X Y Z
Cypher text D E F ... Z A B C

i.e., he replaced each letter of the clear text with the letter that follows it at distance 3 in alphabetical order (folding back at the end of the alphabet). For example, the word "ENEMY" would be translated into "HQHPB" (using the 26 letter English alphabet).

This trivial cypher algorithm can be generalized by allowing one to choose the distance (in alphabetic order) of letters when comparing clear text and cypher text. One could use for instance distance 2, or distance 4, obtaining respectively the following translation tables:
Clear text A B C ... W X Y Z
Cypher text C D E ... Y Z A B

Clear text A B C ... W X Y Z
Cypher text E F G ... A B C D

If we use this generalized version of Caesar's code, we have to specify the "key" (ranging from 1 to 25) that allows us to write the correct translation table. In the original code the "key" was implicitly fixed to the value 3.
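
A minimal sketch of this generalized Caesar code in Python (the function names are chosen just for this example):

  ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

  def caesar_encrypt(clear_text: str, key: int) -> str:
      # Replace each letter with the one 'key' positions later, wrapping at 'Z'.
      return "".join(ALPHABET[(ALPHABET.index(c) + key) % 26] for c in clear_text)

  def caesar_decrypt(cypher_text: str, key: int) -> str:
      # Decryption is simply a shift in the opposite direction.
      return caesar_encrypt(cypher_text, -key)

  print(caesar_encrypt("ENEMY", 3))   # -> HQHPB (Caesar's original fixed key 3)
  print(caesar_decrypt("HQHPB", 3))   # -> ENEMY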

If the key is fixed, the only protection offered by the code is the attacker's lack of knowledge of the coding algorithm. If instead the key may vary, one can tolerate the attacker knowing the coding algorithm, provided one makes sure that the attacker does not know and cannot easily guess the encryption key. In our particular example, this is not really the case: guessing one key out of 25 possible values is feasible in a relatively short time. In particular, the high level of redundancy of the natural language that is used to encode messages helps the attacker. In many cases the "correct key" is the only one that produces a meaningful result, so the attacker can solve the problem by so called "brute force", i.e., by trying all possible key values until a (the only) meaningful result is found.

One of the main issues in cryptography is the "strength" of the key as compared to the time and effort needed to perform a "brute force attack" (i.e. simply trying all possible values until the correct key is found). The feasibility and effectiveness of a brute force attack, however, depend not only on the number of different values that the key may assume, but also on the characteristics of the clear text to be encrypted. A crucial point here is the amount of redundancy of the language that is used to encode the clear text. Remember that in a coding scheme the redundancy coefficient is defined as the ratio between the total number of code configurations and the number of used (or legal, according to the language syntax and semantics) ones. A natural language such as English has an enormous amount of redundancy that may give a substantial advantage to the cryptanalyst (i.e., the person who tries to break the cypher in order to retrieve the clear text from the cypher text). For example, out of the 26 possible shifts of the 5 character word "HQHPB" (including the trivial shift by zero), only 1 (namely "ENEMY") is a legal word that can be found in an English dictionary. This makes for a redundancy coefficient of 26/1 in the decryption scheme. Therefore, decrypting the cypher text "HQHPB" is as simple as trying all 25 non-trivial transliterations and comparing each result to what can be found in an English dictionary, an operation that can be performed manually in a few minutes. However, this is not always the case.

Consider a second transliteration example, namely the 3 character cypher text "LZQ". If we try all 25 possible transliterations we find two potential clear texts that are legal English words, namely "MAR" (abbreviation for the month of March, obtained with key 25) and "OCT" (abbreviation for the month of October, obtained with key 23). In this case the redundancy of the decryption scheme is lower than in the previous example (26/2, i.e., half). The reduced redundancy coefficient implies harder work for the cryptanalyst, who indeed has no way of determining which of the two months (March or October) is encoded in the cypher text unless he discovers which of the two possible keys, 25 or 23, was actually used.
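
A brute force attack on these two examples can be sketched as follows (reusing the caesar_decrypt function from the previous sketch; the tiny word list stands in for a real English dictionary):

  ENGLISH_WORDS = {"ENEMY", "MAR", "OCT"}   # stand-in for a real dictionary

  def brute_force(cypher_text: str) -> list:
      # Try every non-trivial key and keep the candidates that are legal words.
      candidates = []
      for key in range(1, 26):
          clear = caesar_decrypt(cypher_text, key)
          if clear in ENGLISH_WORDS:
              candidates.append((key, clear))
      return candidates

  print(brute_force("HQHPB"))   # [(3, 'ENEMY')]: unique, the key is recovered
  print(brute_force("LZQ"))     # [(23, 'OCT'), (25, 'MAR')]: ambiguous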

From the above examples it should be clear that the lower the redundancy coefficient of the clear text, the harder the job of the cryptanalyst in reconstructing the clear text from the cypher text without knowing the encryption key. This is the reason why in modern cryptography data compression is normally applied to the clear text before encryption.
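
A minimal sketch of this compress-then-encrypt ordering (using the standard Python zlib module; the cypher object is the hypothetical one from the earlier symmetric encryption sketch):

  import zlib

  clear_text = ("the enemy will attack at dawn " * 20).encode()   # very redundant
  compressed = zlib.compress(clear_text, level=9)
  print(len(clear_text), len(compressed))   # 600 bytes shrink to a few dozen

  # Compression must come first: a good cypher text looks random and
  # cannot be compressed afterwards.
  # cypher_text = cypher.encrypt(compressed)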

The simple transliteration of words has several drawbacks in terms of strength of the cypher. One of these drawbacks is that, once a given key is chosen, each letter is always translated into the same letter. This fact can help a lot in reducing the effort needed to guess the encryption key (especially in case of high redundancy in the clear text). For instance, if one knows the statistics of occurrence of the various letters in "standard" sentences of a given language, one can exploit such information to reduce the number of trials compared to a brute force attack, as in the sketch below.
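
A toy sketch of such a frequency-based guess (reusing the ALPHABET string defined above, and assuming that "E" is the most frequent letter in typical English text):

  from collections import Counter

  def guess_caesar_key(cypher_text: str) -> int:
      # Assume the most frequent cypher letter stands for the clear letter 'E'.
      most_common_letter = Counter(cypher_text).most_common(1)[0][0]
      return (ALPHABET.index(most_common_letter) - ALPHABET.index("E")) % 26

  # Reliable only on long texts; on a 5 letter word it succeeds by sheer luck.
  print(guess_caesar_key("HQHPB"))   # -> 3 ('H' happens to be the most frequent)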

Such a weakness can be overcome by adopting a more sophisticated coding. Consider for example a book of several hundred pages that is available in the same edition to both the sender and the receiver of the cypher text. The encryption key can be defined as the position of a word contained in the book (identified by the number of the page, the number of the line from the top of the page, and the number of the word starting from the left margin of that line). Each character of the clear text is then encrypted using a transliteration table identified by the corresponding character of the book, starting from the first character of the word identified by the key. Assume that we have to encrypt the clear text "ENEMY", and that the encryption key identifies the following sequence of words in the reference book: "Mister Hide was hiding in the dark room ...". Then the first character of the clear text would be transliterated using the table that associates "A" to "M", "B" to "N", etc.; the second character using the table that associates "A" to "I", "B" to "J", etc.; the third using the table that associates "A" to "S", "B" to "T", etc.; the fourth using the table that associates "A" to "T", "B" to "U", etc.; and the fifth using the table that associates "A" to "E", "B" to "F", etc. The cypher text would thus become "QVWFC". Also in this case, statistics and redundancy in the clear text as well as in the cypher book could help the cryptanalyst reduce the effort needed to perform a brute force attack; however, such an attack would be almost impossible to perform without the help of a computer.
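
A minimal sketch of this book-key scheme, often called a running key cipher (reusing the ALPHABET string from the earlier sketch; the key stream below is just the example phrase from the text, written without spaces):

  def running_key_encrypt(clear_text: str, key_stream: str) -> str:
      # Shift each clear text letter by the alphabetic position of the
      # corresponding key stream letter ('A' means shift 0, 'B' shift 1, ...).
      out = []
      for c, k in zip(clear_text, key_stream):
          shift = ALPHABET.index(k)                       # e.g. 'M' -> 12
          out.append(ALPHABET[(ALPHABET.index(c) + shift) % 26])
      return "".join(out)

  key_stream = "MISTERHIDEWASHIDING"   # characters taken from the reference book
  print(running_key_encrypt("ENEMY", key_stream))   # -> QVWFC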

The role of Computers in Cryptography

Before the advent of computers, cryptanalysis capabilities were extremely limited, and most brute force attacks on sophisticated encryption codes were totally unfeasible because of the huge number of cases to consider and the long time needed to examine a single case. The availability of ever increasing computing power has completely changed this scenario in the last few decades, allowing one to break codes once thought to be secure even using low-end personal computers. Therefore cryptography has to face the risk that in a few years' time current codes might become breakable by brute force attack, due to the rapid increase of the computer processing power available to cryptanalysts.

To be written ...

The main symmetric algorithms

The ancestor of modern symmetric encryption algorithms is the so called Data Encryption Standard (DES), a block cypher standardized by the US National Bureau of Standards (NBS) between 1972 and 1977.
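
A minimal usage sketch (assuming the third-party PyCryptodome package; single DES is shown only for historical illustration, since its 56-bit key is far too short by today's standards):

  from Crypto.Cipher import DES
  from Crypto.Random import get_random_bytes
  from Crypto.Util.Padding import pad, unpad

  key = get_random_bytes(8)              # 56 key bits plus 8 parity bits
  cypher = DES.new(key, DES.MODE_ECB)    # ECB: the simplest (and weakest) mode

  clear_text = b"ATTACK AT DAWN"
  cypher_text = cypher.encrypt(pad(clear_text, DES.block_size))   # 64-bit blocks

  decrypted = unpad(DES.new(key, DES.MODE_ECB).decrypt(cypher_text), DES.block_size)
  assert decrypted == clear_text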

To be written ...



Keys, Hash Functions, and Pass-phrases

Random Keys

To be written ...

Hash Functions

To be written ...

Passwords versus Pass-phrases

To be written ...



Public Key Cryptosystems

Challenges, Identification, and Digital Signature

To be written ...

Diffie-Hellman protocol

To be written ...

RSA

To be written ...



Digital Certificates and Authorities

DSA

To be written ...

X.509 v3

To be written ...

Secure Socket Layer

To be written ...



Steganography

Historical perspective

To be written ...

Digital Steganography

To be written ...