DATA ENCODING
ASCII
ASCII is a 7-bit coding scheme. Valid hex values for an ASCII character are 00 through 7F (0 to 127). While the standard ASCII character set
uses just 7 bits per character, there are several larger character sets that use 8 bits, which gives them 128 additional characters. The
extra characters are used to represent non-English characters, graphics symbols, and mathematical symbols. Several companies and
organizations have proposed extensions for these 128 characters. The DOS operating system uses a superset of ASCII called extended ASCII or
high ASCII. A more universal standard is the ISO Latin 1 set of characters, which is used by many operating systems, as well as web browsers.
The ASCII chart (7-bit)
Decimal Octal Hex Binary Value Decimal Octal Hex Binary Value
------- ----- --- ------ ----- ------- ----- --- ------ -----
000 000 000 00000000 NUL (Null char.) 064 100 040 01000000 @ (AT symbol)
001 001 001 00000001 SOH (Start of Header) 065 101 041 01000001 A
002 002 002 00000010 STX (Start of Text) 066 102 042 01000010 B
003 003 003 00000011 ETX (End of Text) 067 103 043 01000011 C
004 004 004 00000100 EOT (End of Transmission) 068 104 044 01000100 D
005 005 005 00000101 ENQ (Enquiry) 069 105 045 01000101 E
006 006 006 00000110 ACK (Acknowledgment) 070 106 046 01000110 F
007 007 007 00000111 BEL (Bell) 071 107 047 01000111 G
008 010 008 00001000 BS (Backspace) 072 110 048 01001000 H
009 011 009 00001001 HT (Horizontal Tab) 073 111 049 01001001 I
010 012 00A 00001010 LF (Line Feed) 074 112 04A 01001010 J
011 013 00B 00001011 VT (Vertical Tab) 075 113 04B 01001011 K
012 014 00C 00001100 FF (Form Feed) 076 114 04C 01001100 L
013 015 00D 00001101 CR (Carriage Return) 077 115 04D 01001101 M
014 016 00E 00001110 SO (Shift Out) 078 116 04E 01001110 N
015 017 00F 00001111 SI (Shift In) 079 117 04F 01001111 O
016 020 010 00010000 DLE (Data Link Escape) 080 120 050 01010000 P
017 021 011 00010001 DC1 (XON) (Device Control 1) 081 121 051 01010001 Q
018 022 012 00010010 DC2 (Device Control 2) 082 122 052 01010010 R
019 023 013 00010011 DC3 (XOFF)(Device Control 3) 083 123 053 01010011 S
020 024 014 00010100 DC4 (Device Control 4) 084 124 054 01010100 T
021 025 015 00010101 NAK (Negative Acknowledgement) 085 125 055 01010101 U
022 026 016 00010110 SYN (Synchronous Idle) 086 126 056 01010110 V
023 027 017 00010111 ETB (End of Trans. Block) 087 127 057 01010111 W
024 030 018 00011000 CAN (Cancel) 088 130 058 01011000 X
025 031 019 00011001 EM (End of Medium) 089 131 059 01011001 Y
026 032 01A 00011010 SUB (Substitute) 090 132 05A 01011010 Z
027 033 01B 00011011 ESC (Escape) 091 133 05B 01011011 [ (left bracket)
028 034 01C 00011100 FS (File Separator) 092 134 05C 01011100 \ (back slash)
029 035 01D 00011101 GS (Group Separator) 093 135 05D 01011101 ] (right bracket)
030 036 01E 00011110 RS (Request to Send/Record Separator) 094 136 05E 01011110 ^ (caret/circumflex)
031 037 01F 00011111 US (Unit Separator) 095 137 05F 01011111 _ (underscore)
032 040 020 00100000 SP (Space) 096 140 060 01100000 ` (grave accent)
033 041 021 00100001 ! (exclamation mark) 097 141 061 01100001 a
034 042 022 00100010 " (double quote) 098 142 062 01100010 b
035 043 023 00100011 # (number sign) 099 143 063 01100011 c
036 044 024 00100100 $ (dollar sign) 100 144 064 01100100 d
037 045 025 00100101 % (percent) 101 145 065 01100101 e
038 046 026 00100110 & (ampersand) 102 146 066 01100110 f
039 047 027 00100111 ' (single quote) 103 147 067 01100111 g
040 050 028 00101000 ( (left parenthesis) 104 150 068 01101000 h
041 051 029 00101001 ) (right parenthesis) 105 151 069 01101001 i
042 052 02A 00101010 * (asterisk) 106 152 06A 01101010 j
043 053 02B 00101011 + (plus) 107 153 06B 01101011 k
044 054 02C 00101100 , (comma) 108 154 06C 01101100 l
045 055 02D 00101101 - (minus or dash) 109 155 06D 01101101 m
046 056 02E 00101110 . (dot) 110 156 06E 01101110 n
047 057 02F 00101111 / (forward slash) 111 157 06F 01101111 o
048 060 030 00110000 0 112 160 070 01110000 p
049 061 031 00110001 1 113 161 071 01110001 q
050 062 032 00110010 2 114 162 072 01110010 r
051 063 033 00110011 3 115 163 073 01110011 s
052 064 034 00110100 4 116 164 074 01110100 t
053 065 035 00110101 5 117 165 075 01110101 u
054 066 036 00110110 6 118 166 076 01110110 v
055 067 037 00110111 7 119 167 077 01110111 w
056 070 038 00111000 8 120 170 078 01111000 x
057 071 039 00111001 9 121 171 079 01111001 y
058 072 03A 00111010 : (colon) 122 172 07A 01111010 z
059 073 03B 00111011 ; (semi-colon) 123 173 07B 01111011 { (left brace)
060 074 03C 00111100 < (less than) 124 174 07C 01111100 | (vertical bar)
061 075 03D 00111101 = (equal sign) 125 175 07D 01111101 } (right brace)
062 076 03E 00111110 > (greater than) 126 176 07E 01111110 ~ (tilde)
063 077 03F 00111111 ? (question mark) 127 177 07F 01111111 DEL (delete)
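The relationship between a character and its numeric codes in the chart can be reproduced with a few lines of Python (an illustrative sketch, not part of the chart): the built-in ord() returns the code point, which for the range 0 to 127 is the ASCII code.

for ch in ("A", "a", "0", " "):
    code = ord(ch)                                    # ASCII code of the character
    print(f"{code:03d} {code:03o} {code:03X} {code:08b} {ch!r}")
# 065 101 041 01000001 'A'
# 097 141 061 01100001 'a'
# 048 060 030 00110000 '0'
# 032 040 020 00100000 ' '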
Base64 encoding
Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64
representation. Each Base64 digit represents 6 bits of data and thus three bytes can be represented by four Base64 digits (33% more
storage).
Base64 is designed to carry data stored in binary formats across channels that only reliably support text content. Base64 is particularly
prevalent on the World Wide Web where its uses include the ability to embed image files or other binary content inside text files such as
HTML and CSS.
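A minimal illustration of the 3-bytes-to-4-digits mapping, using Python's standard base64 module (the sample string is just an example):

import base64

data = b"Man"                               # 3 bytes = 24 bits
encoded = base64.b64encode(data)            # 4 Base64 digits of 6 bits each
print(encoded)                              # b'TWFu'
print(base64.b64decode(encoded) == data)    # True - the round trip is lossless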
EBCDIC
EBCDIC is an 8-bit coding scheme, mostly used in IBM computers such as mainframes and mid-range machines like the AS/400. Valid hex values
for an EBCDIC character are 00 through FF. EBCDIC uses the full 8 bits available to it, so parity checking cannot be used on an 8-bit
system. Also, EBCDIC has a wider range of control characters than ASCII.
The character encoding is based on Binary Coded Decimal (BCD), so the contiguous characters in the alphanumeric range are formed up in
blocks of up to 10 from 0000 binary to 1001 binary. Non-alphanumeric characters are almost all outside the BCD range.
There are four main blocks in the EBCDIC code page: 0000 0000 to 0011 1111 is reserved for control characters; 0100 0000 to 0111 1111 are
for punctuation; 1000 0000 to 1011 1111 for lowercase characters and 1100 0000 to 1111 1111 for uppercase characters and numbers.
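The block layout can be observed with Python, which ships an EBCDIC code page ('cp500', i.e. IBM-500) among its standard codecs; this is an illustrative sketch rather than a definitive mapping for every EBCDIC variant:

for ch in ("A", "a", "0"):
    print(ch, hex(ord(ch)), hex(ch.encode("cp500")[0]))
# A 0x41 0xc1   - uppercase block (1100 0000 to 1111 1111)
# a 0x61 0x81   - lowercase block (1000 0000 to 1011 1111)
# 0 0x30 0xf0   - digits sit at the top of the uppercase/numbers block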
The Unicode standard for data representation
The Unicode standard is a character code designed to encode text for storage in computer files, with the capacity to encode all of the
characters used for written languages throughout the world. It is a profile (a subset) of the ISO 10646 standard. The design of the
Unicode standard is based on the simplicity and consistency of the ASCII character code set (and of Latin-1, an extended version of the
ASCII code set), but goes far beyond ASCII's limited ability to encode only the Latin alphabet.
Unicode and ISO 10646 were developed to create a single unified character set and closely resemble each other.
Universal Character Set (UCS)
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It
guarantees round-trip compatibility to other character sets. No information will be lost if any text string is converted to UCS and back to
the original encoding. UCS contains the characters required to represent practically all known languages.
UCS is based on a 4-octet (32-bit) coding scheme known as the 'canonical form' (UCS-4), but a 2-octet (16-bit) form (UCS-2) is used for the
Basic Multilingual Plane (BMP) that contains characters of most modern languages, where octets 1 and 2 are assumed to be 00 00.
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value
is commonly preceded by 'U+' as in U+0041 for the character 'A'. The UCS characters U+0000 to U+007F are identical to those in US-ASCII
(ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside
the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such
as UTF-8 and UTF-16.
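The code-number-plus-official-name pairing and the U+ notation can be inspected with Python's unicodedata module (a small sketch for illustration):

import unicodedata

for ch in ("A", "é", "€"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN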
UTF (Unicode transformation formats)
UCS represents most of the world's writing systems by multi-octet characters. Multi-octet characters, however, are not compatible with many
current applications and protocols, and this has led to the development of a few so-called Unicode transformation formats (UTF). A UTF is an
algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO 10646 standard uses the
term 'UCS transformation format' for UTF.
Each UTF is reversible, i.e. it supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and
back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte
sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
The UTF-8 format
UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. In UTF-8, characters are encoded using sequences of 1 to 6 octets.
UTF-8 has the following properties:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility), which means that files and
strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust against missing bytes
All possible 2^31 UCS codes can be encoded
UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long
The sorting order of big-endian UCS-4 byte strings is preserved
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding
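These properties are easy to observe by encoding a few characters in Python (an illustrative sketch; note that the current UTF-8 specification, RFC 3629, restricts sequences to at most four bytes, covering code points up to U+10FFFF):

for ch in ("A", "é", "€", "𐍈"):
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ->", " ".join(f"{x:02X}" for x in b))
# U+0041 -> 41            (ASCII byte, MSB clear)
# U+00E9 -> C3 A9         (lead byte >= 0xC0, continuation bytes 0x80-0xBF)
# U+20AC -> E2 82 AC      (three bytes for a 16-bit BMP character)
# U+10348 -> F0 90 8D 88  (four bytes for a character outside the BMP)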
The UTF-16 format
UTF-16 basically arose from UCS-2 when it became clear that more than 2^16 code points were needed to represent all the
characters. UTF-16 uses a single 16-bit code unit to encode the most common characters, and a pair of 16-bit code units, called surrogates,
to encode the less commonly used characters in Unicode.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code
units in UTF-16. Leading (high) surrogates are from D800 to DBFF, and trailing (low) surrogates are from DC00 to DFFF. They are called
surrogates since they do not represent characters directly, but only as a pair.
For a BMP character, UTF-16 and UCS-2 have identical storage characteristics: same code values, same encoding, same length (16 bits).
For a supplementary character (a code point above 0xFFFF), UTF-16 uses a special 32-bit sequence called a surrogate pair: 0x10000 is
subtracted from the code point to give a 20-bit value, the high 10 bits are added to 0xD800 and stored in the first 16-bit code unit, and
the low 10 bits are added to 0xDC00 and stored in the next 16-bit code unit. Consequently, all supplementary characters require 32 bits,
where the first 16 bits are a number between 0xD800 and 0xDBFF, and the last 16 bits are a number between 0xDC00 and 0xDFFF.
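A worked example in Python of the surrogate computation just described, checked against the built-in UTF-16 encoder (the chosen character, U+1D11E MUSICAL SYMBOL G CLEF, is only an example):

cp = 0x1D11E                     # a supplementary character (outside the BMP)
v = cp - 0x10000                 # 20-bit value
high = 0xD800 + (v >> 10)        # leading (high) surrogate
low = 0xDC00 + (v & 0x3FF)       # trailing (low) surrogate
print(hex(high), hex(low))                       # 0xd834 0xdd1e
print(chr(cp).encode("utf-16-be").hex(" ", 2))   # d834 dd1e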
UTF-16 is used internally by Windows, Java and JavaScript.
File Formats
Text (ASCII) Files
An ASCII file or a text file is one in which each byte represents one ASCII-coded character. An ASCII file can be considered to be a binary
file that consists of just ASCII characters. Since ASCII characters are 7-bit encodings stored in a byte, each byte of an ASCII file has its
MSB set to 0.
Text files generally have a one-to-one correspondence between bytes and ordinary readable characters (such as letters and digits) and
control characters (such as tabs, line feeds and carriage returns), without any embedded information such as font information, hyperlinks or
inline images. They are therefore human-readable with any simple program that can display a file.
Text files can have the MIME type "text/plain", often with suffixes indicating an encoding. Common encodings for plain text include ASCII,
UTF-8, UTF-16 and ISO 8859.
Binary Files
In a binary file there is no one-to-one mapping between bytes and characters. Files that have been formatted with a word processor must be
stored and transmitted as binary files to preserve the formatting. Each byte of a binary file can take any of the 256 possible bit patterns
(as opposed to an ASCII file, whose bytes are limited to 128 patterns).
All executable programs are stored in binary files, as are most numeric data files. Binary files can also be image, sound or compressed
files etc., in short, any file content other than plain text. Usually the specification of a binary file's file format indicates how to
handle that file.
Binary files are often encoded into a plain text representation to improve survivability during transit, using encoding schemes such as
Base64.
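A rough way to tell a pure 7-bit ASCII text file from a binary file, following the MSB observation above (a sketch only; real detection is more involved):

def looks_like_ascii_text(path):
    # every byte of a 7-bit ASCII file has its most significant bit set to 0
    with open(path, "rb") as f:
        return all(byte < 0x80 for byte in f.read())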
DATA ARCHITECTURE
Data architecture deals with developing and maintaining a formal description of the data and data structures within an organization - this
can include data definitions, data models, data flow diagrams etc. (in short, metadata). Data architecture includes topics such as metadata
management, business semantics, data modeling and metadata workflow management.
Data architecture encompasses
Logical Data modeling
Physical Data modeling
Development of a data strategy and associated policies
Selection of capabilities and systems to meet business information needs
A data architect is responsible for designing, creating, deploying and managing an organization's data architecture. A data architect
ensures that an organization follows a formal data standard and that its data assets are aligned with the defined data architecture and
with the goals of the business. Typically, a data architect designs, builds and optimizes databases, maintains the metadata registry and
develops strategies for data acquisition, archival and recovery.
Data architects are usually skilled in data modeling, data policies development, data strategy, data dictionaries, data warehousing,
metadata registry, relational databases, data retention, data querying languages like SQL and identifying the best system for addressing
data storage, retrieval and management.
Data modeling
Data modeling is the analysis of data objects that are used in a business or other context and the identification of the relationships among
these data objects. A data model can be thought of as a diagram or flowchart that illustrates the relationships between data.
There are several different approaches to data modeling, including:
Conceptual Data Modeling - identifies the highest-level relationships between different entities.
Enterprise Data Modeling - similar to conceptual data modeling, but addresses the unique requirements of a specific business.
Logical Data Modeling - illustrates the specific entities, attributes and relationships involved in a business function. Serves as the basis
for the creation of the physical data model.
- includes entities (tables), attributes (columns) and relationships (keys)
- independent of technology (platform/DBMS)
- Uses business names for entities and attributes
- Normalized to the fourth normal form
Physical Data Modeling - represents an application and database-specific implementation of a logical data model.
- includes tables, columns, keys, data types, triggers, procedures and access constraints
- includes keys/indices for fast data access
- Uses specific names for tables and columns, limited by DBMS
- may be denormalized to meet performance requirements
ER diagram
An entity relationship diagram (ERD) shows the relationships of entities (data components) stored in a database. It is a data-modeling
technique that can help define business processes and be used as the foundation for a relational database.
There are three basic components of an ER diagram:
Entities, which are objects or concepts that can have data stored about them
Attributes, which are properties or characteristics of entities. An ERD attribute can be denoted as a primary key or a foreign key
The relationships between and among those entities
The relational model
A relational database stores data in tables (relations) and enforces relationships among those tables. It supports normalization and
allows the setting up of integrity constraints to protect data integrity.
Database design
Steps
Identifying entities - entities (tables) are the types of information saved. Generally all information falls into four categories -
people, things, events and locations. For example, in a database for a shop, 'shop' is a location, 'sale' is an event, 'products' are things
and 'customers' are people
Identifying relationships - determine the relationships between the entities and the cardinality of each relationship,
e.g. one customer can be involved in multiple sales
Identifying attributes - attributes (columns) are the data elements desired to be saved for each entity
Assigning keys
Defining the data types for attributes
Normalization
The database lifecycle
Database initial study - gather requirements and specifications
Database design - involves conceptual design, software selection, logical design and physical design
Implementation and loading
Testing and evaluation - the database is tested and fine-tuned for performance, integrity, concurrent access and security constraints
Operation
Maintenance and evaluation
Populating the database
After a database has been created, there are two ways of populating the tables – either from existing data or through the use of the user
applications developed for the database.
Existing data can be from another database or data files. Facilities to import/export data in standard formats are usually available in a
DBMS. When data are held in a file format that is not appropriate for using the import function, then it is necessary to use a program/tool
that reads in the old data, transforms them as necessary and then inserts them.
The transfer of large quantities of existing data into a database is referred to as a bulk load, and it may be prudent to postpone
constraint checking until the end of the bulk load.
Normalization
Normalization is the process of efficiently organizing data in a database. The two goals of normalization are eliminating redundant data
(e.g. storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). A
brief sketch follows the list of normal forms below.
First normal form (1NF) sets the very basic rules for an organized database
Eliminate duplicative columns from the same table
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key)
Second normal form (2NF) further addresses the concept of removing duplicative data
Meet all the requirements of the first normal form
Remove subsets of data that apply to multiple rows of a table and place them in separate tables
Create relationships between these new tables and their predecessors through the use of foreign keys
Third normal form (3NF) goes one large step further
Meet all the requirements of the second normal form
Remove columns that are not dependent upon the primary key
The Boyce-Codd Normal Form, also referred to as BCNF or 3.5NF, adds one more requirement
Meet all the requirements of the third normal form
Every determinant must be a candidate key
The fourth normal form (4NF) has one additional requirement
Meet all the requirements of the Boyce-Codd normal form
A relation is in 4NF if it has no multi-valued dependencies
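A brief sketch of the idea using Python's built-in sqlite3 module (the table and column names are invented for illustration): the customer's attributes are removed from the orders table and placed in their own table, eliminating the repeated data and leaving each table's non-key columns dependent only on its key.

import sqlite3

con = sqlite3.connect(":memory:")
# Unnormalized: customer name and city are repeated on every order
con.execute("CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, "
            "cust_name TEXT, cust_city TEXT, item TEXT)")
# Normalized: customer attributes move to their own table; orders reference it
con.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, "
            "cust_name TEXT, cust_city TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "cust_id INTEGER REFERENCES customers(cust_id), item TEXT)")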
Design considerations
Tables should be given meaningful, recognizable names based on the data contained
Efficient datatypes should be chosen for columns
Integer columns can be configured as unsigned when negative values are not expected to be stored to increase the range of permissible
values
The default character set should be a Unicode encoding such as UTF-8 for all tables where non-English characters are expected to be stored
Foreign keys should be used to preserve data integrity. Foreign keys need to be named as 'fk_tablename_colname'
Metadata
Metadata is data that describes other data. Metadata summarizes basic information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified and file size are examples of very basic document
metadata. Having the ability to filter through that metadata makes it much easier for someone to locate a specific document. Metadata can
be stored and managed in a database, often called a metadata registry or metadata repository.
Most popular DBMSs provide access to their metadata with a set of tables or views, often called the system catalog or data dictionary, which
can be accessed using plain SQL. Many of them implement the standard information schema.
DATABASE MANAGEMENT SYSTEMS
A DBMS is a software package designed to define, manipulate, retrieve and manage data in a database.
A relational DBMS (RDBMS) is a DBMS based on the relational model wherein the data is stored in a tabular form (rows and columns) in tables
(relations).
DATABASE ARCHITECTURE
The main components of a DBMS are the process manager, client communications manager, query processor, transactional storage manager and
shared components/utilities.
The client communications manager is responsible for establishing connections from clients, remembering their connection state, responding
to SQL commands and returning both data and control messages. Upon receiving an SQL command, the process manager assigns a thread for
executing the command. The query processor compiles the query into an internal query plan and executes it. The operators in the query plan
make calls to the transactional storage manager which manages all data access and manipulation functions.
The storage manager includes algorithms and data structures for organizing and accessing data on disk ('access methods'). It also includes a
buffer management module that decides what data to transfer between disk and memory buffers. The access methods return control to the query
executor’s operators, and as result tuples are generated, they are placed in a buffer for the client communications manager, which ships
the results back to the caller. For large result sets, the client typically will make additional calls to fetch more data incrementally from
the query, resulting in multiple iterations through the communications manager, query executor, and storage manager.
At the end of the query the transaction is completed and the connection closed - this results in the transaction manager cleaning up state
for the transaction, the process manager freeing any control structures for the query and the communications manager cleaning up
communication state for the connection.
+----------------------------------------------------------------------------------------------------+
| +----------+ +------------------------------------------+ |
| | Admission| | Client protocols | |
| | Control | +------------------------------------------+ +---------------------------+ |
| | | Client communications manager | Catalog manager | |
| | | | Administration/Monitoring | |
| | Dispatch | +------------------------------------------+ | Replication | |
| | and | | Query parsing and authentication | | Loading | |
| |scheduling| | Query Optimizer | | Batch utilities | |
| | | | Plan Executor | | Memory manager | |
| +----------+ | DDL and utility processing | +---------------------------+ |
| Process manager +------------------------------------------+ Shared components and |
| Query processor utilities |
| |
| +------------------------------------------+ |
| | Access methods Buffer manager | |
| | Lock manager Log manager | |
| +------------------------------------------+ |
| Transactional storage manager |
+----------------------------------------------------------------------------------------------------+
Process management
The process manager assigns a thread to an SQL command received from a client and ensures that the thread's data and control outputs are
connected to the client through the communications manager. A thread is a single, sequential execution unit of a program, which consists of
its own program counter, a stack and a set of registers.
A process is an instance of the execution of a program on a single computer. A process can consist of one or more threads executing
concurrently. The private memory used by a process cannot be accessed by any other process and all threads of a process share its virtual
address space.
Transaction log
A transaction log is a sequential record of all changes made to the database. It contains the information needed to redo or undo the changes
made by any individual transaction and is a critical part of database crash recovery, when committed transactions need to be applied to the data
files and all uncommitted transactions need to be rolled back. A transaction log provides greater protection against failure, better
performance and the ability to replicate data.
Like the database file, the transaction log is organized into pages - fixed size areas of memory. When a change is recorded in the
transaction log, it is made to a page in memory. The change is forced to disk when the earlier of the following happens - the page is full
or a COMMIT is executed. Completed transactions are guaranteed to be stored on disk, while performance is improved by avoiding a write to
the disk on every operation.
Every log record is uniquely identified by an incremental log sequence number (LSN). DBMSs periodically issue checkpoint commands in the
background, at which point the changes recorded in the transaction log are written to the data files. The checkpointed LSN is stored in the log and is used
during recovery.
Transaction logging improves performance by deferring database updates to disk during periods of high server activity. Instead of recording
database updates to random, non-sequential parts of a disk, transactions are logged sequentially in the log files, saving processing time
and updating the database when server activity is lower.
DATA STRUCTURES
Tables
Tables are the primary form of data storage organized as rows and columns. Columns form the table’s structure and rows form the content.
A single row in a table is called a tuple or a record.
Each column in a table has a unique name; columns are often referred to as attributes. A domain is the set of values permitted for an attribute in
a table. Columns can have constraints defined on them to enforce data integrity including
primary key - enforces entity integrity which ensures that each row of a table has a unique and non-null primary key value
unique key - enforces candidate key integrity
foreign key - enforces referential integrity, if a value of an attribute (column) of a relation (table) references a value of another
attribute (either in the same or a different relation), then the referenced value must exist
check and not null constraints - restricts a column's values, a partial enforcement of domain integrity
Partitioning
Large tables can be subdivided into smaller, manageable pieces called partitions to enhance performance, manageability and availability.
Each partition has its own name and optionally its own storage characteristics. Using partitioning, different portions of a table are
stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a
partitioning function.
The following tables can be considered for partitioning
large tables
tables containing historical data, in which new data is added into the newest partition, e.g. a historical table where only the
current month's data is updatable and the other 11 months are read only
when the contents of a table need to be distributed across different types of storage devices
Partitioning types
Horizontal partitioning (sharding) - all partitions have the same schema. Each partition is known as a shard and holds a specific
subset of the data
Range Partitioning - maps data to partitions based on ranges of values of the partitioning key established for each partition. It is the
most common type of partitioning and is often used with dates
Hash Partitioning - maps data to partitions based on a hashing algorithm that applies to the partitioning key. The hashing algorithm evenly
distributes rows among partitions, giving partitions approximately the same size. Hash partitioning is the ideal method for distributing
data evenly across devices. Hash partitioning is also an alternative to range partitioning when the data to be partitioned is not
historical or has no obvious partitioning key
List Partitioning - explicitly controls how rows map to partitions by specifying a list of discrete values for the partitioning key in the
description for each partition. The advantage of list partitioning is that unordered and unrelated sets of data can be grouped and organized
in a natural way
Vertical partitioning - each partition holds a subset of columns of a table
Views
A view is a virtual table that derives its data from tables or other views and is defined by a query. All operations performed on a view
actually affect the base tables of the view. Views can be used in almost the same way - query, update, insert and delete - as regular tables.
A view can be used for the following purposes:
- to customize the perception each user has of the database by providing a different representation (such as subsets or supersets) of the
data that resides within other tables and views
- as a security mechanism by allowing users to access data through the view, without granting permissions for the base tables
- to provide a backward compatible interface to emulate a table whose schema has changed
Indexes
An index is a data structure defined on columns in a table to speed up data retrieval. An index is usually structured as a set of pairs
(k,p) where k is a key value of the column(s) on which the index is built and p is a pointer to the record(s) with the key value k.
The cardinality of an index refers to the uniqueness of the data values contained in the indexed columns. A higher cardinality implies more
unique values; the lower the cardinality, the more duplicated values in the columns. A column with the highest possible cardinality would
have unique values in each row and one with lowest possible cardinality would have the same value for every row. DBMSs use cardinality to
help determine the optimal query plan for a given query.
Types of indexes
Indexes are basically of two types - ordered and hashed. Ordered indexes have the key values in sorted order for quick access while in
hash indexes, a hash function is used to find out the location of entries with key value k.
Ordered indexes can be classified into dense and sparse indexes. In a dense index, an index record appears for every search key value in the
file, containing a pointer to the actual record. In a sparse index, index records are created only for some of the key values. To locate a
record, the index record with the largest key value less than or equal to the search value is found, the data it points to is located, and
a sequential scan from there finds the required record. Dense indices are faster in general, but sparse indices require less space and less
maintenance for insertions and deletions.
Indexes can be unique or nonunique. Unique indexes guarantee that no two rows of a table have duplicate values in the key column or columns.
A composite index, also called a concatenated index, is an index on multiple columns in a table.
Some databases, like Oracle, allow functions to be indexed and these indexes to be used in querying, e.g.
create index emp_upper_idx on emp(upper(ename));
Clustered and non-clustered indexes
Every table usually has a clustered index (typically synonymous with the primary key) where the data for the rows is stored in sorted
order. Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. A
table without a clustered index is referred to as a heap.
All other indexes are secondary, non-clustered indexes. A nonclustered index contains the index key values and pointers to the data rows
that contain the key value.
Index structure
B-trees
Indexes are usually organized as B-trees (balanced trees). A B-tree is a self-balancing tree data structure that keeps data sorted and
allows searches, sequential access, insertions and deletions in logarithmic time. B-trees save time by using nodes with many branches
(called children), compared with binary trees where each node has only two children. B-trees are preferred when decision points, called
nodes, are on hard disk rather than in RAM.
The B-tree structure consists of three main levels: the Root level - the top node that contains a single index page, from which the data
search is started, the Leaf level - the bottom level of nodes that contains the data pages, with the number of leaf pages depending on the
amount of data stored in the index, and the Intermediate level - one or multiple levels between the root and the leaf levels that holds
the index key values and pointers to the next intermediate level pages or the leaf data pages. The number of intermediate levels depends
on the amount of data stored in the index.
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may
waste some space, since nodes are not entirely full.
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to
the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.
An index that is too deep will suffer from performance degradation, while an index with a large number of nodes in each level can produce a
very flat index structure. An index with only 3 to 4 levels is very common.
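A back-of-the-envelope illustration in Python of why so few levels are needed: with a fanout of F child pointers per node, roughly ceil(log_F(N)) levels cover N keys (the fanout of 500 below is an assumed, illustrative value).

import math

def btree_levels(rows, fanout=500):
    return max(1, math.ceil(math.log(rows, fanout)))

print(btree_levels(1_000_000))      # 3
print(btree_levels(100_000_000))    # 3 - still a very flat structure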
Hash indexes
A hash index organizes the search keys with their associated pointers into a hash file structure. A hash function is applied to a search
key to identify a bucket, and the key and its associated pointers are stored in that bucket (or in overflow buckets). Hash indices are secondary
index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
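A minimal in-memory sketch of the idea (buckets keyed by a hash of the search key, each holding key/row-id pairs); real hash indexes add overflow handling and persistence.

from collections import defaultdict

class HashIndex:
    def __init__(self, n_buckets=8):
        self.n = n_buckets
        self.buckets = defaultdict(list)          # bucket number -> [(key, rowid)]

    def add(self, key, rowid):
        self.buckets[hash(key) % self.n].append((key, rowid))

    def lookup(self, key):
        # scan only the one bucket the hash function points to
        return [rid for k, rid in self.buckets[hash(key) % self.n] if k == key]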
CHARACTER SETS
DBMSs allow storage of data in a variety of character sets and perform comparisons using a variety of collations. Character sets not only
affect storage but also how client programs communicate with the database server.
After a database is created and accumulates data, changing the character set is time consuming and complex and hence it is important to
select the right character set at installation time. Multilingual unicode character sets have an additional cost such as slightly slower
text processing compared to single-byte character sets and higher storage space requirements for non-ASCII text.
Collations
The physical storage of character strings in a database is controlled by collations. A collation specifies the bit patterns that represent
each character and the rules by which characters are sorted and compared.
Different DBMSs provide different collation options. Depending on the DBMS, collation can be specified at the server level, database level,
table level and the column level. Collations can also be specified at the expression level (by specifying which collation to use when the
query is run) and at the identifier level.
SQL
SQL (Structured Query Language) is a standardized programming language used for managing relational databases and performing various
operations on the data in them. SQL consists of many types of statements, which may be classed as Data Query Language (DQL), Data Definition Language (DDL), Data Control Language (DCL) and Data Manipulation Language (DML). Although SQL is essentially a declarative language, it also includes procedural elements.
An official SQL standard was adopted by ANSI in 1986 and new versions of the SQL standard are published every few years. Some versions of SQL include proprietary extensions to the standard language for procedural programming and other functions e.g T-SQL, PL/SQL.
SQL queries and other operations take the form of commands written as statements and are aggregated into programs that enable users to add, modify or retrieve data from database tables.
Joins
The join clause combines columns from one or more tables into a new table. ANSI-standard SQL specifies five types of JOIN - INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.
Inner join
In an inner join, the columns being joined are compared using a comparison operator. Only non-null values are compared and column values for each matched pair of rows of the joined tables are combined into a result row.
Inner joins can be expressed in two syntactical ways. The explicit join notation uses the JOIN keyword, optionally preceded by the INNER keyword, to specify the table to join, and the ON keyword to specify the join condition. The implicit join notation simply lists the tables for joining in the FROM clause of the SELECT statement.
SELECT * FROM orders o INNER JOIN customers c ON o.cust_id = c.cust_id;
SELECT * FROM orders o, customers c WHERE o.cust_id = c.cust_id;
Outer join
Outer joins return all the rows from at least one of the tables referenced in the join provided the rows meet the WHERE conditions. All
rows are retrieved from the left (right) table with a LEFT (RIGHT) OUTER JOIN and all rows from both tables are retrieved in a FULL OUTER JOIN.
SELECT a.orderid, a.custid, b.custname FROM orders a LEFT OUTER JOIN customers b ON a.custid = b.custid;
Cross join
A cross join that does not have a WHERE clause produces the Cartesian product of the tables involved in the join. The size of a Cartesian
product result set is the number of rows in the first table multiplied by the number of rows in the second table.
SELECT au_fname, au_lname, pub_name FROM authors CROSS JOIN publishers ORDER BY au_lname DESC;
If a WHERE clause is added, the cross join behaves as an inner join. The queries below produce the same result set
SELECT au_fname, au_lname, pub_name FROM authors CROSS JOIN publishers
WHERE authors.city = publishers.city ORDER BY au_lname DESC;
SELECT au_fname, au_lname, pub_name FROM authors INNER JOIN publishers
ON authors.city = publishers.city ORDER BY au_lname DESC;
Join algorithms
Joins are performed using different algorithms like nested-loop, sort-merge or hash join.
nested-loop join
A nested loop join is a naive algorithm that joins two sets by using two nested loops.
def nested_loop_join(R, S, condition):
    for r in R:                      # outer loop over every tuple r in R
        for s in S:                  # inner loop over every tuple s in S
            if condition(r, s):      # join condition
                yield (r, s)
The block nested loop join algorithm is a generalization of the simple nested loops algorithm that takes advantage of additional memory to reduce the number of times that S is scanned. It loads large chunks of relation R into main memory and for each chunk, it scans S and evaluates the join condition on all tuple pairs currently in memory. This reduces the number of times S is scanned to once per chunk.
sort-merge join
The join algorithm needs to find, for each distinct value of the join attribute, the set of tuples in each relation which exhibit that value. The logic of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time. In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order.
hash join
The hash join algorithm for an inner join of two relations proceeds as follows
First, prepare a hash table using the contents of one relation, ideally whichever one is smaller after applying local predicates. This relation is called the build side of the join. The hash table entries are mappings from the value of the (composite) join attribute to the remaining attributes of that row (whichever ones are needed).
Once the hash table is built, scan the other relation (the probe side). For each row of the probe relation, find the relevant rows from the build relation by looking in the hash table.
The first phase is usually called the 'build' phase, while the second is called the 'probe' phase. Similarly, the join relation on which the hash table is built is called the 'build' input, whereas the other input is called the 'probe' input. Hash joins are typically more efficient than nested loops joins, except when the probe side of the join is very small.
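A compact Python sketch of the build and probe phases for a single-attribute equijoin (rows are represented as dicts; this is illustrative, not how any particular DBMS implements it):

from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    table = defaultdict(list)
    for r in build_rows:                    # build phase: hash the smaller input
        table[r[key]].append(r)
    for s in probe_rows:                    # probe phase: look up matching rows
        for r in table.get(s[key], ()):
            yield {**r, **s}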
Subqueries
A subquery is a query that is nested inside a query or another subquery. A subquery is also called an inner query or inner select, while the statement containing a subquery is also called an outer query or outer select. Subqueries provide alternative ways to perform operations that would otherwise require complex joins and unions.
Correlated and uncorrelated subqueries
A correlated subquery is one that has a correlation name as a table or view designator in the FROM clause of the outer query and the same
correlation name as qualifier of a search condition in the WHERE clause of the subquery.
A correlated subquery is reevaluated for every row of the table or view named in the outer query, while a
noncorrelated subquery is evaluated only once.
SELECT empno, lastname FROM employee
WHERE dept = 'A00' AND salary > (SELECT AVG(salary) FROM employee WHERE dept = 'A00'); -- uncorrelated subquery
SELECT e1.empno, e1.lastname, e1.dept FROM employee e1
WHERE salary > (SELECT AVG(salary) FROM employee e2 WHERE e2.dept = e1.dept) ORDER BY e1.dept; -- correlated subquery
Derived tables
A derived table is a statement-local temporary table created by means of a subquery in the FROM clause of a SQL SELECT statement. It exists
only in memory and behaves like a standard table or view. Derived tables are useful for generating aggregates in a table and joining them
with the details in another table.
Stored procedures
A stored procedure is a group of SQL statements named and stored in the DBMS and can be called by applications. Stored procedures resemble constructs in other programming languages because they can
accept input parameters and return multiple values in the form of output parameters to the calling program
contain programming statements that perform operations in the database
return a status value to a calling program to indicate success or failure and the reason for failure
Benefits of Using Stored Procedures
Reduced server/client network traffic - the commands in a procedure are executed as a single batch of code. This reduces network traffic between the server and client because only the call to execute the procedure is sent across the network. Without the code encapsulation provided by a procedure, every individual line of code would have to cross the network.
Stronger security - multiple users and client programs can perform operations on underlying database objects through a procedure, even if the users and programs do not have direct permissions on those underlying objects. The procedure controls what processes and activities are performed and protects the underlying database objects. This eliminates the requirement to grant permissions at the individual object level and simplifies the security layers.
When calling a procedure over the network, only the call to execute the procedure is visible. Therefore, malicious users cannot see table and database object names, embed SQL statements of their own or search for critical data.
Reuse of code - the code for any repetitious database operation is a perfect candidate for encapsulation in procedures. This eliminates needless rewrites of the same code and reduces code inconsistency
Easier maintenance - when client applications call procedures and keep database operations in the data tier, only the procedures must be updated for any changes in the underlying database. The application tier remains separate and does not have to know about any changes to database layouts, relationships or processes.
Improved performance - by default, a procedure compiles the first time it is executed and creates an execution plan that is reused for subsequent executions. Since the query processor does not have to create a new plan, it typically takes less time to process the procedure.
If there has been significant change to the tables or data referenced by the procedure, the precompiled plan may actually cause the procedure to perform slower. In this case, recompiling the procedure and forcing a new execution plan can improve performance.
Triggers
A trigger defines a set of SQL statements that are performed in response to an insert, update or delete operation on a specified table. The trigger is activated when such an SQL operation is executed.
Triggers can be used, along with referential constraints and check constraints, to enforce data integrity rules. Triggers can also be used to cause updates to other tables, automatically generate or transform values for inserted or updated rows or invoke functions to perform tasks such as issuing alerts.
TRANSACTIONS
A transaction is a logical unit of work of database processing carried out by a single user or application process which includes one or
more access operations. Transactions are ended by a commit or rollback.
COMMIT - ends a unit of work and makes the database changes that were made permanent
ROLLBACK - rolls back all changes to the last commit point and ends the transaction
A savepoint is a way of implementing subtransactions (or nested transactions) by indicating a point within a transaction that can
be 'rolled back to' without affecting any work done in the transaction before the savepoint was created. Multiple savepoints can exist
within a single transaction. Savepoints are useful for implementing complex error recovery in database applications by enabling the
application to recover from an error without aborting the entire transaction.
SAVEPOINT s1;
ROLLBACK TO SAVEPOINT s1;
RELEASE SAVEPOINT s1; -- discard the savepoint
A transaction can be in one of these states - active, partially committed (final operation executed but data is not committed yet), failed,
committed or aborted (rolled back to the old consistent state).
Schedules are sequences that indicate the order in which instructions of concurrent transactions are executed. A schedule for a set of
transactions contains all the instructions of those transactions preserving the order. A serial schedule is one in which transactions
execute serially one after the other while in a non-serial schedule the operations of transactions may be interleaved. The serializability
of schedules is used to find non-serial schedules that allow transactions to execute concurrently without interfering with one another.
Isolation levels
The isolation level associated with an application process defines the degree of isolation of that process from other concurrently
executing processes. The isolation level of an application process, P, therefore specifies
The degree to which rows read and updated by P are available to other concurrently executing application processes
The degree to which update activity of other concurrently executing application processes can affect P.
Transaction isolation level sets the default transactional behavior and determines how isolated each transaction is, or what kind of
locks are associated with queries inside a transaction. The four levels, in ascending order of strictness, are
READ UNCOMMITTED: Barely transactional, this setting allows for so-called 'dirty reads', where queries inside one transaction are
affected by uncommitted changes in another transaction
READ COMMITTED: Committed updates are visible within another transaction. This means identical queries within a transaction can
return differing results. This is the default in some DBMSs
REPEATABLE READ: Within a transaction, all reads are consistent
SERIALIZABLE: Updates are not permitted in other transactions if a transaction has run an ordinary SELECT query, i.e. queries are
treated as if they had a LOCK IN SHARE MODE
ACID
ACID is an acronym for the four primary attributes ensured to any transaction by a transaction manager (or transaction monitor).
Atomicity - In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.
Consistency - A transaction either creates a new and valid state of data, or if any failure occurs, returns all data to its state before
the transaction was started.
Isolation - A transaction in process and not yet committed must remain isolated from any other transaction.
Durability - Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in
its correct state.
Locking
The rationale behind locking is to isolate applications in such a way as to provide as much concurrent access as possible while at the same
time guaranteeing as much data integrity as necessary.
There are two different approaches to transactional locking - optimistic and pessimistic. In pessimistic locking, a resource is
locked from the time it is first accessed in a transaction until the transaction is finished, making it inaccessible to other transactions
during that time.
With optimistic locking, a resource is not actually locked when it is first accessed by a transaction and instead, the state of the
resource is saved. Other transactions are able to concurrently access the resource, so conflicting changes are possible.
At commit time, when the resource is about to be updated in persistent storage, the state of the resource is read from storage again and
compared to the saved state. If the two states differ, a conflicting update was made and the transaction will be rolled back.
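A small sketch of optimistic locking using a version column and Python's sqlite3 module (the table, columns and values are invented for illustration): the UPDATE only succeeds if the row still carries the version that was read, so a concurrent change is detected at commit time.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 100, 0)")

balance, version = con.execute(
    "SELECT balance, version FROM accounts WHERE id = 1").fetchone()
cur = con.execute(
    "UPDATE accounts SET balance = ?, version = version + 1 "
    "WHERE id = ? AND version = ?", (balance - 10, 1, version))
if cur.rowcount == 0:
    # someone else updated the row since it was read - roll back or retry
    con.rollback()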
Locks can be of three types - shared, update and exclusive.
Shared locks allow two or more programs to read simultaneously but not change the locked space.
An exclusive lock bars all other users from accessing the space.
An update lock is less restrictive; it allows other transactions to read or acquire shared locks on the space.
Concurrency issues
Lost update - Occurs when the same data is retrieved by two applications (users), both work with the data concurrently and both change and
save the data. The last successful change to the data is kept and the first change is lost.
Uncommitted read ('dirty read') - uncommitted changes to data are read by an application (user), the changes are rolled back and the reading
application gets invalid data.
Non-repeatable read - occurs when an application (user) reads a query result and later on, in the same transaction, the same query yields a
different result, the reason being the rows in the result set were updated or deleted by someone else. Query results aren't repeatable and
hence unreliable and the reading application gets invalid data.
Phantom read - occurs when an application (user) reads a query result and later on, the same query returns additional rows (inserted by
someone else). Acceptable for many applications.
Deadlocks
A deadlock occurs when two or more application processes each hold locks on a resource that the others need and without which they cannot
proceed. After a preset time interval, the DBMS can roll back the current unit of work for one of the processes or request a process to
terminate. That frees the locks and allows the remaining processes to continue.
An example of deadlock - Client A takes an S lock on a table row. Client B tries to update the row and needs an X lock that cannot be
granted because of the S lock that A holds and the request goes on the queue of lock requests for the row. Client A now tries to update the
row and needs an X lock which cannot be granted because B already has a request for an X lock and is waiting for A to release its S lock.
Deadlocks are a common occurrence in transactional databases, but they are not a problem unless they are so frequent that certain transactions
cannot be run at all. Applications must be written so that they are always prepared to re-issue a transaction if it gets rolled back
because of a deadlock.
The occurrence of deadlocks can be reduced by keeping transactions short and less prone to collision and committing them immediately after
making a set of related changes.
Semaphores
A semaphore is a variable or abstract data type used to control access to a common resource by multiple threads in a concurrent system.
Semaphores are a useful tool in the prevention of race conditions. Semaphores which allow an arbitrary resource count are called counting
semaphores, while semaphores which are restricted to the values 0 and 1 (or locked/unlocked, unavailable/available) are called binary
semaphores and are used to implement locks.
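A binary semaphore protecting a shared counter, using Python's threading module (a minimal sketch; a counting semaphore would simply be created with a larger initial value):

import threading

sem = threading.Semaphore(1)        # binary semaphore used as a lock
counter = 0

def work():
    global counter
    for _ in range(100_000):
        with sem:                   # acquire/release around the critical section
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                      # 400000 - no lost updates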
A mutex (short for 'mutual exclusion') is a locking mechanism that sometimes uses the same basic implementation as the binary semaphore. While a binary semaphore may be colloquially referred to as a mutex, only the task that locked the mutex is supposed to unlock it.
A 'hot' mutex is one that has other threads blocked waiting on it for large amounts of time. This
contention reduces scalability on multi-core machines because some threads sit idle, essentially waiting for the ability to run.
In many situations, data is read more often than it is modified or written. An rw-lock (readers-writer lock) allows concurrent access for read-only operations, while write operations require exclusive access. When a writer is writing the data, all other writers or readers will be blocked until the writer is finished writing. Rw-locks are usually constructed on top of mutexes and condition variables or on top of semaphores. Mutexes and rw-locks are known collectively as latches.
Latches are internal to the DBMS engine and are used to provide memory consistency, whereas locks are used to provide logical transactional consistency. To allow for maximum concurrency and provide maximum performance, latches are held only for the duration of the physical operation on the in-memory structure, unlike locks which are held for the duration of the logical transaction.
QUERY PROCESSING
Query optimizer
A query optimizer is a DBMS component that analyzes SQL queries and determines efficient execution mechanisms. A query optimizer generates
one or more query plans for each query and selects the most efficient query plan to run the query.
Query plan
The output of the Query Optimizer is a query plan, also referred to as a query execution plan or just execution plan. A query plan is a
definition of
the sequence in which the source tables are accessed
the methods used to extract data from each table
Each plan is made of one or more execution steps that describe the database operations performed to produce the query result.
Join algorithms
Nested-loop join
A nested-loop join algorithm reads rows from the first table in a loop one at a time, passing each row to a nested loop
that processes the next table in the join. This process is repeated as many times as there remain tables to be joined.
In general, nested loops joins work best on small tables with indexes on the join conditions. The optimizer always tries to put the
smallest row source first, making it the driving table.
Block nested-loop join
A block nested-loop join uses buffering of rows read in outer loops to reduce the number of times that tables in inner loops must be read.
For example, if 10 rows are read into a buffer and the buffer is passed to the next inner loop, each row read in the inner loop can be compared
against all 10 rows in the buffer. This reduces by an order of magnitude the number of times the inner table must be read.
Merge join
The merge join requires both inputs to be sorted on the merge columns. The query optimizer typically scans an index, if one exists on the
proper set of columns, or places a sort operator below the merge join. In rare cases, there may be multiple equality clauses, but the
merge columns are taken from only some of the available equality clauses.
Because each input is sorted, the merge join operation gets a row from each input and compares them. For example, for inner join operations,
the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input.
This process repeats until all rows have been processed.
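Assuming both inputs are already sorted on the join column, a simplified merge join over in-memory lists (unique keys only, for brevity) could be sketched in Python as:

def merge_join(left, right, key):
    # Inner join of two inputs pre-sorted on `key`; duplicate key handling omitted
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] == right[j][key]:
            result.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif left[i][key] < right[j][key]:
            i += 1                   # discard the lower-valued row and advance that input
        else:
            j += 1
    return result

a = [{"id": 1, "x": "a"}, {"id": 3, "x": "c"}]
b = [{"id": 1, "y": "A"}, {"id": 2, "y": "B"}, {"id": 3, "y": "C"}]
print(merge_join(a, b, "id"))        # rows with id 1 and 3 match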
Hash join
The optimizer uses the smaller of two data sets to build a hash table on the join key in memory using a deterministic hash function to
specify the location in the hash table in which to store each row. The database then scans the larger data set probing the hash table to
find the rows that meet the join condition.
While nested loops join works well with smaller data sets and merge join with moderately sized data sets, hash join is better at performing
the largest joins. Hash joins parallelize and scale better than any other join. Like merge join, hash join requires at least one equijoin
predicate, supports residual predicates and supports all outer and semi-joins and unlike merge join, it does not require ordered input
sets.
A hash join is most cost-effective when the smaller data set fits in memory, in which case the cost is limited to a single read pass over
the two data sets.
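A minimal Python sketch of the build and probe phases, using a dict as the in-memory hash table (the table and column names are invented):

from collections import defaultdict

def hash_join(small, large, key):
    # Build phase: hash the smaller input on the join key
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    # Probe phase: scan the larger input and look up matches
    for row in large:
        for match in table.get(row[key], []):
            yield {**match, **row}

dims  = [{"cust_id": 10, "name": "Ann"}, {"cust_id": 20, "name": "Bob"}]
facts = [{"cust_id": 10, "amount": 5}, {"cust_id": 20, "amount": 7}, {"cust_id": 10, "amount": 3}]
print(list(hash_join(dims, facts, "cust_id")))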
Prepared statements
A prepared statement or parameterized statement is a feature used to execute identical SQL statements repeatedly with high efficiency. The
prepared statement takes the form of a template into which constant values are substituted during each execution.
The typical workflow of using a prepared statement is as follows:
Prepare: At first, the application creates the statement template and sends it to the DBMS. Certain values are left unspecified; they are
called parameters, placeholders or bind variables (labelled '?').
INSERT INTO products (name, price) VALUES (?, ?);
Then, the DBMS compiles (parses, optimizes and translates) the statement template, and stores the result without executing it.
Execute: At a later time, the application supplies (or binds) values for the parameters of the statement template, and the DBMS executes
the statement (possibly returning a result). The application may execute the statement as many times as it wants with different values.
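Using Python's built-in sqlite3 module as one concrete client API (placeholder syntax varies between drivers), the prepare-and-bind workflow looks roughly like this; the products table mirrors the template above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")

# Statement template with '?' placeholders; values are bound at execute time,
# so they are never interpolated into the SQL text (which guards against SQL injection)
stmt = "INSERT INTO products (name, price) VALUES (?, ?)"
conn.executemany(stmt, [("widget", 9.99), ("gadget", 24.50)])
conn.commit()

for row in conn.execute("SELECT name, price FROM products WHERE price > ?", (10,)):
    print(row)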
Prepared statements offer two main advantages:
The overhead of compiling the statement is incurred only once. However, not all optimization can be performed at the time the statement
template is compiled, since the plan may depend on the specific values of the parameters, and the plan may change as tables and indexes
change over time.
Prepared statements are resilient against SQL injection, because parameter values, which are transmitted later using a different protocol,
are not compiled like the statement template.
DISTRIBUTED DATABASES
A distributed database appears to a user as a single database but is a set of databases stored on multiple computers within a network or
across networks. The data on several computers can be simultaneously accessed and modified across the network. Each database server in the
distributed database is controlled by its local DBMS and each cooperates to maintain the consistency of the global database.
Distributed databases can be homogeneous (identical DBMS in all locations) or heterogeneous (different DBMSs) and offer these advantages
Management of data with transparency
Increased reliability and availability
Easier expansion
Improved performance through the proximity of the data to its points of use and parallelism in query execution
Drawbacks include increased complexity, difficulty in maintaining data integrity, and concurrency and security issues.
Some of the strategies and objectives in designing a distributed database are
Data fragmentation - applied to partition the relations among network sites.
Data allocation - each fragment is stored at the site with optimal distribution
Data replication - increases the availability and improves the performance of the system
Location transparency - enables a user to access data without being concerned with the site at which the data resides
Data fragmentation
The following information is used to decide fragmentation
Quantitative information - cardinality of relations, frequency of queries, site where query is run, selectivity of the queries etc.
Qualitative information - predicates in queries, types of access of data, read/write etc.
Types of Fragmentation
Horizontal - partitions a relation along its tuples
Vertical - partitions a relation along its attributes
Mixed/hybrid - a combination of horizontal and vertical fragmentation
Distributed transactions
A distributed transaction is a transaction that includes one or more statements that update data on two or more distinct nodes of a
distributed database. Like any other transaction, a distributed transaction should include all four ACID properties.
Back
SERVER ADMINISTRATION
Connection handling
Network protocols
Database servers can support and service requests on several protocols at the same time, but clients connect with a single protocol. Some of
the available protocols are
TCP/IP - is a common protocol widely used over the Internet. It communicates across interconnected networks of computers that have
diverse hardware architectures and various operating systems. TCP/IP includes standards for routing network traffic and offers advanced
security features.
Named pipes - a named pipe is a one-way or duplex pipe that provides communication between a pipe server and one or more pipe clients. A
pipe is a section of memory that is used for interprocess communication (IPC). A named pipe (also known as a FIFO for its behavior) is an
extension to the traditional pipe concept on Unix-like systems. A traditional pipe is 'unnamed' and lasts only as long as the process. A
named pipe, however, can last as long as the system is up, beyond the life of the process, and can be deleted if no longer used. Usually a
named pipe appears as a file, and generally processes attach to it for IPC. Named pipes can be used to provide communication between
processes on the same computer or between processes on different computers across a network.
Generally, TCP/IP is preferred in a slow LAN, WAN or dial-up network, whereas named pipes can be a better choice when network speed is not
an issue, as they offer more functionality, ease of use and configuration options.
Shared memory - a simple protocol for same-computer communications only, i.e. clients can connect only to a server running on the same
machine; it is available on all platforms. For same-computer communications, shared memory tends to provide better performance than TCP/IP.
Connection pooling
Connection pooling is a technique that developers can use to share database connections among requesting clients. When a connection has
been created and is placed in a runtime object pool, an application can use that connection again. Each application does not have to perform
the complete connection process every time that it uses a connection.
When the application closes the connection, the connection is cached in the runtime object pool again. Connection pooling permits an
application to use a connection from a pool of connections that do not have to be reestablished for each use.
By using pooled connections, applications can realize significant performance gains because applications do not have to perform all of the
tasks that are involved in making a connection. This can be particularly significant for middle-tier applications that connect over a
network, or for applications that repeatedly connect and disconnect, such as internet applications.
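A toy pool built on a thread-safe queue (using sqlite3 purely as a stand-in for a networked DBMS driver; the class and parameter names are invented) illustrates the check-out/check-in cycle:

import queue
import sqlite3

class ConnectionPool:
    # Very small connection pool: connections are created once and then reused
    def __init__(self, size, dsn):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self):
        return self._pool.get()      # blocks if every connection is checked out

    def release(self, conn):
        self._pool.put(conn)         # 'closing' simply returns the connection to the pool

pool = ConnectionPool(size=2, dsn=":memory:")
conn = pool.acquire()
try:
    conn.execute("SELECT 1")
finally:
    pool.release(conn)               # reused by the next caller, avoiding the reconnect cost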
Performance improvement
Improving Disk I/O
Since disk I/O is a slow operation, high disk I/O is a common performance bottleneck. High disk I/O can be optimized by placing the
transaction log on a different volume equipped with a high-I/O device. All updates to the database pass through the transaction log, so the
speed at which transactions can be written to the log limits the speed at which data can be processed, and that speed depends on the device
where the log resides.
Types of hard disks
SAS (Serial Attached SCSI) is a faster and historically more expensive interface. Because SAS drives can rotate much faster (up to 15K RPM)
than SATA (Serial Advanced Technology Attachment) drives (typically 7.2K RPM), seek times can be more than twice as fast. SATA is less
expensive and suited for non-critical storage and archives, while SAS is expensive and fit for mission-critical, high-I/O workloads.
Solid-state drives (SSDs) can deliver substantially better I/O performance than traditional HDDs and are capable of driving tens of thousands
of I/O operations per second, as opposed to hundreds for HDDs. However, SSDs may not be as durable/reliable as HDDs and are not suitable for
long-term archival data.
Using RAID
RAID (redundant array of independent disks) is a disk system that contains multiple disk drives (array) to provide greater performance,
reliability, storage capacity, and lower cost. Fault-tolerant arrays are categorized in six RAID levels: 0 through 5. Each level uses a
different algorithm to implement fault tolerance.
A hardware disk array improves I/O performance because I/O functions, such as striping and mirroring, are handled efficiently in firmware.
Conversely, an OS–based RAID offers lower cost, but consumes processor cycles. When cost is an issue and redundancy and high performance
are required, RAID-5 volumes are a good solution.
Data striping (RAID 0) gives the best performance, but if one disk fails, all the data on the stripe set becomes inaccessible. A common
technique for RDBMSs is to configure the database on a RAID 0 drive and put the transaction log on a mirrored drive (RAID 1), obtaining
the best disk I/O performance for the database and maintaining data recoverability through a mirrored transaction log.
If data must be quickly recoverable, mirroring the transaction log and putting the database on a RAID 5 disk can be considered. RAID 5
provides redundancy of all data on the array. This allows for a single disk to fail and be replaced without system down time. RAID 5
offers lower performance, compared to RAID 0 or RAID 1, but greater reliability and faster recovery.
Back
BACKUP/RECOVERY
Backup
Hot and Cold backups
Hot backups are performed while the database is running. Hot backup does not block normal database operations and captures even changes
that occur while the backup is happening.
Cold backups are done while the database is stopped. To avoid service disruption, they are usually done from a replication slave, which
can be stopped without taking down the entire application or website.
Physical and logical backups
Physical backups consist of raw copies of the directories and files that store database contents. They are suitable for large, important
databases that need to be recovered quickly when problems occur.
Logical backups save information represented as logical database structure (CREATE DATABASE/TABLE statements) and content (INSERT
statements or delimited-text files). The backup does not include log or configuration files.
Physical backup methods are faster than logical because they involve only file copying without conversion.
Back
HIGH AVAILABILITY/SCALABILITY
High Availability
High availability (HA) is the ability of a system to be continuously operational for a desirably long length of time and to quickly recover
from any sort of failure to minimize interruptions for the end user. A database must be made highly available to ensure data is available
if a failure occurs.
A database can fail as a result of hardware failure, software failure, network interruption, human intervention or environmental causes
(fire, water etc.). Best practices for achieving high availability include
ensuring that multiple copies of data exist in different locations using techniques like log shipping, replication and mirroring
eliminating single points of failure or any node that would impact the system as a whole
continuously monitoring the health of servers
distributing resources geographically in case of power outages or natural disasters
implementing reliable crossover or failover. In terms of storage, a redundant array of independent disks (RAID) or storage area network
(SAN) are common approaches
setting up a system that detects failures as soon as they occur
Depending on the type of redundancy to be provided for high availability, database servers can be configured in the following
configurations
Cold standby - The secondary node acts as a backup for another identical primary system. In case the primary node fails, the secondary node
is started up and the data restored before replacing the failed node. Data from the primary node can be backed up on a storage system and
restored on the secondary server as and when required. This generally provides a recovery time of a few hours.
Warm standby - The secondary node is turned on periodically to receive updates from the primary node. Data is regularly mirrored to
the secondary node using disk based replication or shared disk. This generally provides a recovery time of a few minutes.
Hot standby - The primary and secondary nodes both run simultaneously, and data is mirrored in real time, so both nodes hold identical data,
but the secondary node does not process data or requests. This generally provides a recovery time of a few seconds.
Scalability
Scalability is the ability of a DBMS to handle an increase or decrease of a workload without affecting its productivity. A DBMS that is scalable is an adaptable one and should be able to grow with the company and handle varying workloads. DBMSs can be scaled either vertically or horizontally. Horizontal scaling involves adding (scaling out) or removing (scaling in) servers to handle the workload. Vertical scaling involves adding CPU, memory or other resources (scaling up) to the existing system or removing resources (scaling down).
Log shipping
Log shipping is a technology in which the backup is copied from the production server to a standby server and restored there, and then
transaction logs backed up on the production server are copied or shipped to the standby server and applied. This process keeps the two
databases synchronized.
The main advantages of log shipping are
It is simple and low-cost. All database dependencies are copied to the standby server (server dependencies such as logins are not)
It is minimally intrusive on the server
The disadvantages of log shipping are
Latency can be as low as one minute but can be more. During a backup, the log cannot be generated or shipped and during an indexing
operation, the log can be huge
Log shipping is not scalable to large numbers of databases per server
Replication
Replication is a set of technologies for copying and distributing data and database objects from one database to another and then
synchronizing between databases to maintain consistency. Using replication, data can be distributed to different locations and to remote
or mobile users over local and wide area networks.
Replication may be implemented in three different ways
Synchronous replication - when a primary server commits a transaction, all the secondary servers also commit the transaction and
acknowledge the same before the primary server marks the transaction as successful. This protects against data loss if the primary server
fails but it introduces a delay in completing a transaction
Asynchronous replication - committed data is placed in a queue on the primary server, but the primary server does not wait for the
acknowledgments of the secondary servers. Any data that did not get copied across the network to the secondary server is lost if the
primary server fails
Semi-synchronous replication - when a primary server commits, at least one of the secondary servers must send a receipt that the
committed data has been received and the data is committed later on the secondary servers. This provides improved data integrity compared
to asynchronous replication because when a commit returns successfully, it is known that the data exists in at least two places.
Master-slave replication
Master-slave replication enables data from one database server (the master) to be replicated to one or more other database servers (the
slaves). The master logs all its updates, which are then propagated to the slaves. Master-slave replication can be either synchronous or
asynchronous. Master-slave replication has an easy-to-maintain architecture, the consistency of the data is maintained, and analytical applications can read from the slave(s) without impacting the master. However, it is not scalable for writes, since a single server handles all client updates, and the master is a single point of failure, so some downtime will be involved if it fails.
Multi-Master replication
Multi-master replication, also known as peer-to-peer replication, consists of multiple master servers, each having its own copy of the
complete database, equally participating in an update-anywhere model. Updates made to an individual master are propagated to all other
participating masters. It is scalable, since multiple servers service clients, there is no single point of failure, and automatic failover is possible. However, database design becomes more complex to take care of multiple masters updating different copies of a single table. For example, auto-increment columns may not work, because two servers may insert a record into the table at the same time, causing a collision in the generated id value. Database integrity can also be compromised due to communication issues between masters.
Clustering
Clustering refers to the ability of several servers or instances to connect to a single database or multiple data nodes. An instance is
the collection of memory and processes that interacts with a database, which is the set of physical files that actually store data.
Clustering offers two major advantages, especially in high-volume database environments:
Fault tolerance - Because there is more than one server or instance for users to connect to, clustering offers an alternative, in the
event of individual server failure
Load balancing - The clustering feature is usually set up to allow users to be automatically allocated to the server with the least
load
The disadvantages of clustering include expensive setup and maintenance and possible performance issues with joins or range/scan queries.
Clustering architectures
Clustering takes different forms, depending on how the data is stored and the allocated resources.
The first type is known as the shared-nothing architecture. In this clustering mode, each node/server is fully independent, so there is
no single point of contention. Shared-nothing is also known as 'database sharding', wherein a single database is split to reside on
multiple servers and each server will service client requests based on which server holds the rows that need to be accessed.
By contrast, in a shared-disk architecture, all data is stored centrally and then accessed via instances running on different servers or
nodes.
In between the above two is grid computing or distributed caching. In this setup, data is still centrally managed but controlled by a
powerful 'virtual server' composed of many servers that work together as one.
Database sharding
Sharding is a database architecture pattern related to horizontal partitioning - the practice of separating a table’s rows into multiple
different tables, known as partitions. Database shards exemplify a shared-nothing architecture, which means that the shards are
autonomous and they do not share any of the same data or computing resources.
Sharding a database can help facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more
machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted
with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more
RAM or CPU.
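As a sketch of how an application tier might route rows to shards (the shard names and hashing scheme below are invented; production systems often use consistent hashing or a lookup service so that adding shards does not remap every key):

import hashlib

# Hypothetical shard map: each entry is an independent database server
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(customer_id: str) -> str:
    # Hash the shard key and map it onto one of the available shards
    digest = hashlib.sha1(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))      # every row for this customer lives on the same shard
print(shard_for("customer-77"))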
Back
SECURITY
Encryption
Encryption is the process of obfuscating data by the use of a key or password which can make the data useless without the corresponding
decryption key or password. Encryption does not solve access control problems, but enhances security by limiting data loss even if access
controls are bypassed.
Methods of database encryption
Transparent or External Database encryption - The database and all backups are encrypted using a single encryption key. Only the data
that is stored on physical media is encrypted. This method is also called transparent because when the data is queried and loaded into
system memory or cache, the contents are decrypted. One advantage of transparent or external database encryption is that since the data
is encrypted at the database level, it cannot be read or queried without the key.
The risk of using this method is that only data inside the stored database tables is encrypted and data that is queried or retrieved from
the tables and is in memory or cache is not encrypted.
Column-Level encryption - Instead of encrypting the entire database, column-level encryption allows for individual columns within
the database tables to be encrypted using separate encryption keys. Data is more secure because each encrypted column in the table
requires a separate decryption method. This makes it more difficult for all the data within each table to be accessed at the same time,
but slows down database index and search performance, because each column now needs a unique key.
Symmetric encryption - Symmetric encryption is an improvement over transparent encryption because not only is the database encrypted
when stored and decrypted when open and accessed, but it also requires the user to have a copy of a private key in order to view and read
the data.
Asymmetric encryption - In the asymmetric encryption method, there is one public key that is used to encrypt the data and a single
private decryption key per authorized user. The advantage here is that even if the key is accidentally shared, no one else would be able
to use the key that was created for a specific, authorized user. Asymmetric encryption and decryption are relatively resource-intensive,
but they provide a higher level of security than symmetric encryption. An asymmetric key can be used to encrypt a symmetric key for
storage in a database.
Application-Level encryption - In application-level encryption, all encryption is done at the application level. This means that
unlike transparent encryption where the stored data is encrypted, all data is encrypted by the applications that use it. When implemented,
data is encrypted before it is written to the database. Any attempts at a security breach would require knowledge of how to decrypt.
Encryption algorithms define data transformations that cannot be easily reversed by unauthorized users. No single algorithm is ideal for
all situations but the following general principles apply
Strong encryption generally consumes more CPU resources than weak encryption
Long keys generally yield stronger encryption than short keys
Asymmetric encryption is slower than symmetric encryption
Long, complex passwords are stronger than short passwords
Symmetric encryption is generally recommended when the key is only stored locally, asymmetric encryption is recommended when keys
need to be shared across the wire
If a lot of data is being encrypted, it should be encrypted using a symmetric key, and the symmetric key should be encrypted with an
asymmetric key (a sketch of this pattern follows this list)
Encrypted data cannot be compressed, but compressed data can be encrypted. If compression is used, data should be compressed before
encrypting it
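A sketch of the last two points, assuming the third-party Python 'cryptography' package is available: bulk data is encrypted with a symmetric (Fernet) key, and that key is then wrapped with an RSA public key for storage.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Symmetric key encrypts the bulk data (fast)
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"row data to protect")

# Asymmetric key pair; the public key wraps (encrypts) the symmetric key for storage
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = private_key.public_key().encrypt(sym_key, oaep)

# Later: unwrap the symmetric key with the private key, then decrypt the data
plaintext = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)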
Back
DATA WAREHOUSING
Data warehouses/Data marts
A data warehouse is an enterprise-wide solution for data collection. It may consist of numerous databases and data marts. Data warehouses
and data marts are critical components of decision support systems because they collect and organize the data upon which crucial decisions
are based.
The difference between a data warehouse and a data mart is debatable. In general, a data warehouse is used on an enterprise level, while
data marts are used on a business division/department level. A data mart only contains the required subject specific data for local
analysis.
Benefits of data warehousing
The main advantage of data warehousing is that the data stored for business analysis can most effectively be accessed by separating it from
the data in the operational systems.
The chief reason for separating data for business analysis from the operational data has always been the potential performance degradation
on the operational system that can result from the analysis processes.
Organizations that use a data warehouse to assist their analytics and business intelligence see a number of substantial benefits
Better data — adding data sources to a data warehouse enables organizations to ensure that consistent and relevant data from that
source is being collected. This ensures higher data quality and data integrity for sound decision making
Faster decisions — data in a warehouse is stored in consistent formats ready to be analyzed, and provides the analytical power and a more
complete dataset needed to base decisions on accurate facts, without relying on hunches or on incomplete or poor-quality data that risk
slow and inaccurate results
Business intelligence
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for
business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and create
new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data.
Business intelligence is a term commonly associated with data warehousing. Many of the tool vendors position their products as BI software
rather than data warehousing software. There are other occasions where the two terms are used interchangeably.
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or
data mart) system is the backend, or infrastructural, component for achieving business intelligence. BI also includes the insight
gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems).
Attributes of data warehouses
Data is usually combined from more than one operational system. This provides the ability to cross-reference data from these
applications.
Data is mostly non-volatile, i.e. after the data is loaded into the data warehouse, it is not modified.
Data is saved for longer periods than in transaction systems.
Setting up a data warehouse
After the tools and team personnel selections are made, the data warehouse project can begin. The following are the typical processes
involved in the data warehousing project cycle
Requirement gathering
Physical environment setup
Data modeling
ETL
OLAP cube design
Front end development
Performance tuning
Quality assurance
Rolling out to production
Production maintenance
Incremental enhancements
ETL (Extracting, Transforming and Loading)
ETL is the data warehouse acquisition process of extracting, transforming (or transporting) and loading data from source systems into the
data warehouse. Applications require real-time data for processing and there are a variety of ETL tools available that make the data
management task easier.
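A minimal, hand-rolled ETL step in Python (the file name, column names and staging table are invented) shows the three phases in order:

import csv
import sqlite3

# Extract: read raw rows from a source extract file
with open("sales_extract.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: cleanse and reshape - trim names, convert amounts, reject bad rows
clean_rows = []
for r in raw_rows:
    if not r.get("amount"):
        continue
    clean_rows.append((r["customer"].strip().title(), float(r["amount"])))

# Load: write the conformed rows into a warehouse staging table
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS stg_sales (customer TEXT, amount REAL)")
dw.executemany("INSERT INTO stg_sales VALUES (?, ?)", clean_rows)
dw.commit()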
ETL Tools
While evaluating ETL tools, the following characteristics need to be looked at
Functional capability - This includes both the 'transformation' piece and the 'cleansing' piece. In general, the typical ETL tools are
either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in
both. As a result, if the data is going to be dirty coming in, the ETL tool needs to have strong cleansing capabilities. If there are going
to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
Ability to read directly from the data source - For each organization, there is a different set of data sources. The ETL tool selected
must be able to connect directly to the source data
Metadata support - The ETL tool plays a key role in the metadata because it maps the source data to the destination, which is an
important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata
source. As a result, it is very important to select an ETL tool that works with the overall metadata strategy.
Popular ETL tools include - Data Junction, Ascential DataStage, Ab Initio, Informatica, Oracle Warehouse Builder (OWB), SeeBeyond ETL
integrator
ETL points of measurement:
Architecture
ETL functionality
Ease of use
Reusability
Debugging
Real-time
Connectivity
General ETL tool characteristics
Buy vs Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three
things:
Complexity of the data transformation - the more complex the data transformation is, the more suitable it is to purchase an ETL tool
Data cleansing needs - if the data needs to go through a thorough cleansing exercise before it is suitable to be stored in the data
warehouse, it is preferable to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build
the ETL routine from scratch
Data volume - available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial
product is a better approach if the volume of data transferred is large
Data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for
patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
Data mining has been defined as
- the nontrivial extraction of implicit, previously unknown, and potentially useful information from data
- the science of extracting useful information from large data sets or databases
Online Analytical Processing (OLAP)
OLAP is a computing method that provides the ability to analyze metrics in different dimensions such as time, geography, gender, product
etc. For example: sales for the company are up. Which region is most responsible for this increase? Which store in this region is most
responsible for the increase? What particular product category or categories contributed the most to the increase? Answering these types
of questions in order means that an OLAP analysis is being performed.
To facilitate this kind of analysis, data is collected from multiple data sources and stored in data warehouses, then cleansed and
organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time
period) derived from dimension tables in the data warehouse. Dimensions are then populated by members (such as customer names, countries
and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time
over relational databases.
OLAP operations on data cubes include (a small worked example follows this list)
Rollup - dimension reduction
Drill-down - introducing a new dimension
Slice - select a particular dimension from a given cube and provide a new sub-cube
Dice - select two or more dimensions from a given cube and provide a new sub-cube
Pivot - rotate the data axes in view in order to provide an alternative presentation of data
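Using pandas as a stand-in for a cube engine (the sales data below is invented), these operations map roughly onto pivoting, grouping and filtering:

import pandas as pd

# Toy fact table: sales by region, store, product category and month
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "East", "West"],
    "store":    ["E1",   "E2",   "W1",   "W1",   "E1",   "W2"],
    "category": ["Food", "Toys", "Food", "Toys", "Food", "Food"],
    "month":    ["Jan",  "Jan",  "Jan",  "Feb",  "Feb",  "Feb"],
    "amount":   [100,    80,     120,    60,     90,     70],
})

# A small 'cube': amounts summarized by region, category and month
cube = sales.pivot_table(values="amount", index=["region", "category"],
                         columns="month", aggfunc="sum", fill_value=0)

rollup    = sales.groupby("region")["amount"].sum()              # roll-up: drop the category dimension
drilldown = sales.groupby(["region", "store"])["amount"].sum()   # drill-down: add the store dimension
slice_jan = sales[sales["month"] == "Jan"]                       # slice: fix one dimension
dice      = sales[(sales["region"] == "East") & (sales["category"] == "Food")]  # dice: restrict two dimensions

print(cube, rollup, drilldown, slice_jan, dice, sep="\n\n")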
OLAP can be used for data mining, i.e. the discovery of previously undiscerned relationships between data items. An OLAP database does not
need to be as large as a data warehouse, since not all transactional data is needed for trend analysis.
OLAP products include IBM Cognos, Oracle OLAP and Oracle Essbase. OLAP features are also included in Microsoft Excel and Microsoft SQL
Server's Analysis Services.
Back
BIG DATA
Big data is high-volume, high-velocity and high-variety data that demands cost-effective, innovative forms of information processing for
enhanced insight and decision making. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization
and information privacy.
NoSQL
NoSQL ("Not Only SQL") represents a framework of non-relational databases that allows for high-performance, agile processing of information
at massive scale. i.e it is a database infrastructure that as been very well-adapted to the heavy demands of big data.
Unlike relational databases that are highly structured, NoSQL databases are unstructured in nature, trading off stringent consistency
requirements for speed and agility. NoSQL centers around the concept of distributed databases, where unstructured data may be stored across
multiple processing nodes and servers. This distributed architecture allows NoSQL databases to be horizontally scalable; as data continues
to explode, just add more hardware to keep up, with no slowdown in performance.
Hadoop
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to
scale up from a single server to thousands of machines, with very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these
clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce and YARN.
Hadoop Common - A module containing the utilities that support the other Hadoop components.
MapReduce - A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of
machines, in a reliable, fault-tolerant manner (the programming model is sketched after this component list).
HDFS - A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local
nodes to create a single file system.
Yet Another Resource Negotiator (YARN) - The next-generation MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It
enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
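The MapReduce programming model itself can be mimicked in plain Python; this is only an illustration of the map, shuffle and reduce phases, not Hadoop's actual API:

from collections import defaultdict

documents = ["big data needs big clusters", "data moves to data"]

# Map phase: each mapper emits (key, value) pairs - here, (word, 1)
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each reducer aggregates the values for one key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)                   # e.g. 'data' -> 3, 'big' -> 2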
MongoDB
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. A record in MongoDB
is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of
fields may include other documents, arrays, and arrays of documents.
MongoDB provides high-performance data persistence. Support for embedded data models reduces I/O activity on the database system.
Indexes support faster queries and can include keys from embedded documents and arrays.
MongoDB’s replication facility, called replica set, provides automatic failover and data redundancy. A replica set is a group of MongoDB
servers that maintain the same data set, providing redundancy and increasing data availability.
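Assuming the pymongo driver and a locally running MongoDB server, a minimal insert/index/query session might look like this (the database, collection and field names are invented):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["inventory"]

# A document with an embedded sub-document and an array field
db.products.insert_one({
    "name": "widget",
    "price": 9.99,
    "supplier": {"name": "Acme", "country": "US"},   # embedded document
    "tags": ["hardware", "small"],                   # array of values
})

# Indexes can include keys from embedded documents
db.products.create_index("supplier.name")

for doc in db.products.find({"price": {"$lt": 20}}):
    print(doc["name"], doc["price"])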
Oracle Big Data SQL
Provides single, optimized SQL query for distributed data
Supports multiple data sources, including Hadoop, NoSQL and Oracle Database
Includes automatic, extensible Oracle Database external table generation
Provides Smart Scan on Hadoop to minimize data movement and maximize performance
Ensures advanced security for Hadoop and NoSQL data
Back
SAP
SAP Basics
SAP is an abbreviation of Systems, Applications and Products in Data Processing. It is an ERP (Enterprise Resource Planning) package.
The SAP framework of applications communicates using the Application Link Enabling (ALE) protocol.
Layers in the SAP client-server architecture
Presentation layer - comprises the SAPGUI or user interfaces
Application layer - applications written in the ABAP language run within this layer and communicate with the other two layers
Database layer - is an RDBMS like Oracle, SQL Server or Informix
Basis
Basis is the SAP Kernel/Solution that
Provides a runtime environment for all SAP applications
Provides a stable architecture for system enhancements
Facilitates the distribution of resources and system components
Supports decentralization of system components [and non-SAP solutions].
SAP Modules
Finance (FI) components include:
FI-GL, FI-AR, FI-AP, FI-FC (Financial Control), FI-LC (Legal Consolidation), FI-FM (Financial Resources Control).
GL, AR and AP all meet basic documentation and accounting requirements.
LC and FM are built to provide retrieval of specific information for strategic decisions.
Sales/Distribution (SD) components include:
SD-SLS (Sales), SD-SHP (Shipping), SD-BIL (Billing), SD-CAS (Sales Support), Sales Information Systems.
Materials Management (MM) components include:
MM-MRP (Material Requirements Planning), MM-PUR (Purchasing), MM-IS (Purchasing Information System), MM-WM (Warehouse Management),
MM-IM (Inventory Management/Valuation), MM-EV (Invoice Verification).
Human Resources (HR) components include:
HR-ORG (organization and Planning...education and training), HR-P&C (Planning and Control), HR-PAD (Personnel Management), HR-TIM (Time
Management), HR-TRV (Travel Expenses), HR-PAY (Payroll Accounting).
Controlling (CO) components include (Supports all known accounting procedures):
CO-CCA (Cost Element Accounting), CO-ABC (Activity-Based Costing), CO-OPA (Order and Project Cost Accounting), CO-PC (Product Cost
Accounting), CO-PA (Financial Statement and Market Segment Accounting), CO-PCA (Profit-Center Accounting), CO-BPC (Company Control).
Known accounting procedures include accounting for actual costs, rigid and flexible planning cost accounting, activity-based costing,
and financial statements.
Back
The ABAP language
ABAP or Advanced Business Applications Programming is the programming language used in SAP. The ABAP workbench is a collection of tools that enable the development and maintenance of ABAP programs and applications. The ABAP editor is the tool used to write ABAP programs, class methods, function modules, screen flow logic, type groups and logical databases. The Screen Painter is a tool that enables creating screens for transactions. It is used to create the screen itself, with fields and other graphical elements and to write the flow logic behind the screen.
Transaction codes used in the ABAP workbench
SE24 - Class Builder
SE30 - Runtime Analysis
SE39 - Splitscreen editor
ABAP Datatypes
Elementary datatypes
C - Character
N - Numeric
I - Integer
P - Packed
X - Hexadecimal
F - Float
D - Date
T - Time
STRING - Character sequence
XSTRING - Byte sequence
Complex datatypes
Structures
Internal tables
Reference datatypes
Objects
Defining Datatypes
Predefined datatypes - built into the kernel
Local datatypes - that can be defined in ABAP programs
Datatypes in the ABAP dictionary - that are available to all programs in the ABAP system.
Predefined ABAP Datatypes
These data types are predefined in the R/3 System kernel, and are visible in all ABAP programs. Predefined types can be used to define
local data types and objects in a program and to specify the type of interface parameters and field symbols.
Predefined elementary ABAP types with fixed length
These predefined elementary datatypes are used to specify the types of individual fields whose lengths are always fixed at runtime. The
following table shows the different fixed-length data types. All field lengths are specified in bytes.
Data Type Initial field length Valid field length Initial value Meaning
Numeric types
I 4 4 0 Integer (whole number)
F 8 8 0 Floating point number
P 8 1 - 16 0 Packed number
Character types
C 1 1 - 65535 ' … ' Text field (alphanumeric characters)
D 8 8 '00000000' Date field (Format: YYYYMMDD)
N 1 1 - 65535 '0 … 0' Numeric text field (numeric characters)
T 6 6 '000000' Time field (format: HHMMSS)
Hexadecimal type
X 1 1 - 65535 X'0 … 0' Hexadecimal field
Data types D, F, I, and T fully describe the technical attributes of a data object. Data
types C, N, P, and X are generic. When you use a generic type to define a local data type in a
program or a data object, you must specify the field length and, in the case of type P, the
number of decimal places. When you use generic types to specify the types of interface parameters or field symbols, you do not have to
specify the technical attributes. The initial value (and the initial field length in the case of the generic types) are values that are used
implicitly in short forms of the TYPES and DATA statements.
The fixed-length predefined types are divided into Numeric, Character and Hexadecimal datatypes.
Numeric Datatypes
As well as the five non-numeric types (text field (C), numeric text field (N), date field (D), time field (T), and hexadecimal field (X)),
there are three numeric types, used in ABAP to display and calculate numbers. Data type N is not a numeric type. Type N objects can only
contain numeric characters (0...9), but are not represented internally as numbers. Typical type N fields are account numbers and zip
codes.
The value range of type I numbers is -2**31 to 2**31-1 and includes only whole numbers. Non-integer results of arithmetic
operations (e.g. fractions) are rounded, not truncated.
Type I data can be used for counters, numbers of items, indexes, time periods, and so on.
Type P data allows digits after the decimal point. The number of decimal places is generic and is determined in the program. The
value range of type P data depends on its size and the number of digits after the decimal point. The valid size can be any value from 1
to 16 bytes. Two decimal digits are packed into one byte, while the last byte contains one digit and the sign. Up to 14 digits are allowed
after the decimal point. The initial value is zero. When working with type P data, it is a good idea to set the program attribute
Fixed point arithmetic. Otherwise, type P numbers are treated as integers.
Type P data can be used for such values as distances, weights, amounts of money, and so on.
- Floating point numbers - type F
The value range of type F numbers is 1x10**-307 to 1x10**308 for positive and negative numbers, including 0 (zero). The accuracy
range is approximately 15 decimals, depending on the floating point arithmetic of the hardware platform. Since type F data is
internally converted to a binary system, rounding errors can occur. Although the ABAP processor tries to minimize these effects, type F
data should not be used if high accuracy is required. Instead, use type P data.
Type F fields are used when there is a need to cope with very large value ranges and rounding errors are not critical.
Using I and F fields for calculations is quicker than using P fields. Arithmetic operations using I and F fields are very similar to
the actual machine code operations, while P fields require more support from the software. Nevertheless, type P data has to be used to
meet accuracy or value range requirements.
Character types
Of the five non-numeric types, the four types C, D, N, and T are character types. Fields with these types are known as character fields.
Each position in one of these fields takes up enough space for the code of one character. Currently, ABAP only works with single-byte
codes such as ASCII and EBCDIC. However, an adaptation to UNICODE is in preparation. Under UNICODE, each character occupies two or four
bytes.
Hexadecimal type
The remaining non-numeric type - X - always interprets individual bytes in memory. One byte is represented by a two-digit hexadecimal
display. The fields with this type are called hexadecimal fields. In hexadecimal fields, single bits can be processed.
Predefined Elementary ABAP Types with Variable Length
These predefined elementary datatypes are used to specify the types of individual fields whose lengths are not fixed until runtime.
There are two predefined ABAP data types with variable length that are generically known as strings:
- STRING for character strings
A string is a sequence of characters with variable length. A string can contain any number of alphanumeric characters. The length
of a string is the number of characters multiplied by the length required for the internal representation of a single character.
- XSTRING for byte strings
A byte string is a hexadecimal type with variable length. It can contain any number of bytes. The length of a byte string is the
same as the number of bytes.
When a string is created as a data object, only a string header is created statically. This contains administrative information. The
actual data objects are created and modified dynamically at runtime by operational statements.
The initial value of a string is the empty string with length 0. A structure that contains a string is handled like a deep structure.
This means that there are no conversion rules for structures that contain strings.
Predefined Complex Data Types
ABAP contains no predefined complex data types that can be used to define local data types or data objects in a program. All complex
data types are based on elementary ABAP types, and are constructed in ABAP programs or in the ABAP Dictionary.
Field symbols
Field symbols are placeholders or symbolic names for other fields. They do not physically reserve space for a field but point to its
contents. They are like dereferenced pointers in C (i.e. pointers to which the contents operator * is applied).
All operations programmed with field symbols are applied to the field assigned to them. For example, a MOVE statement between two field
symbols moves the contents of the field assigned to the first field symbol to the field assigned to the second field symbol.
Some of the different ways of defining field symbols are shown below:
FIELD-SYMBOLS <FS> TYPE ANY TABLE.
FIELD-SYMBOLS <FS> LIKE LINE.
FIELD-SYMBOLS: <F1> STRUCTURE LINE1 DEFAULT WA,
<F2> STRUCTURE LINE2 DEFAULT WA.
Internal Tables
Internal tables are temporary storage areas where data processing can be carried out. Internal tables provide a means of taking data from
a fixed structure and storing it in working memory in ABAP.
Internal tables are declared as follows
DATA ITAB LIKE <TABLENAME> OCCURS 0 [WITH HEADER LINE].
The Header Line can be used as a work area while processing the internal table. The statements used to access internal tables are shorter.
INSERT <wa> INTO TABLE ITAB. - without a header line
INSERT TABLE ITAB. - with a header line
Types of internal tables
Indexed tables
Standard tables
Sorted tables
Hashed tables
Back
ABAP Database Access
There are two ways of accessing the database from a program -
Open SQL - Open SQL statements are a subset of Standard SQL that is fully integrated in ABAP which allow access to data irrespective
of the database system that the SAP installation is using
Native SQL
Open SQL
Open SQL consists of a set of ABAP statements that perform operations on the central database in the SAP System. The results of the
operations and any error messages are independent of the database system in use. Open SQL thus provides a uniform syntax and semantics for
all of the database systems supported by SAP.
ABAP programs that only use Open SQL statements will work in any SAP System, regardless of the database system in use. Open SQL statements
can only work with database tables that have been created in the ABAP Dictionary.
Open SQL consists of the Data Manipulation Language (DML) part of Standard SQL, i.e. it allows reading (SELECT) and changing (INSERT, UPDATE,
DELETE) data.
Open SQL also goes beyond Standard SQL to provide statements that, in conjunction with other ABAP constructions, can simplify or speed up
database access. It also allows you to buffer certain tables on the application server, saving excessive database access.
Native SQL
Native SQL is only loosely integrated into ABAP, and allows access to all of the functions contained in the programming interface of the
respective database system. Unlike Open SQL statements, native SQL statements are not checked and converted, but instead are sent directly
to the database system. When using Native SQL, the function of the database-dependent layer is minimal. Programs that use Native SQL are
specific to the database system for which they were written. Native SQL should be avoided wherever possible. It is used, however, in some
parts of the Basis System, for example for creating or changing table definitions in the ABAP Dictionary.
Logical databases
Logical databases are special ABAP programs that retrieve data and make it available to application programs. The most common use of
logical databases is to read data from database tables by linking them to executable ABAP programs.
An LDB provides a particular view of database tables. The data structure in an LDB is hierarchical. Many tables are linked to each other
using foreign key relationships.
Objects in ABAP
Local and global classes
Classes in ABAP can be declared either locally or globally. Global classes and interfaces are declared in the Class builder (Transaction
code SE24) in the ABAP workbench. All ABAP programs can access global classes.
Local classes are defined within an ABAP program and can be only used in the program in which they are defined.
Definition of local classes
The definition of local classes consists of two parts: The declaration part contains the declaration for all the components (attributes,
methods, events) of the class. The implementation part contains the implementation of all the methods of the class.
CLASS classname DEFINITION.
ENDCLASS.
CLASS classname IMPLEMENTATION.
METHOD method1.
......
ENDMETHOD.
ENDCLASS.
Visibility sections in classes
Public Section - All of the components declared in the public section are accessible to all users of the class and to the methods of the
class and any classes that inherit from it. The public components of the class form the interface between the class and its users.
Protected Section - All of the components declared in the protected section are accessible to all methods of the class and of classes that
inherit from it. Protected components form a special interface between a class and its subclasses.
Private Section - Components that are declared in the private section are only visible in the methods of the same class. The private
components are not part of the external interface of the class.
CLASS classname DEFINITION.
PUBLIC SECTION.
...
PROTECTED SECTION.
...
PRIVATE SECTION.
...
ENDCLASS.
Comparison of popular RDBMSs
Feature | Oracle | SQL Server | MySQL | PostgreSQL
Indexed views | | Supported | Not possible to create an index on a view |
Indexes on functions/expressions | Supported | Allows indexes on computed columns | Supported from MySQL 8.0 | Supported
In-memory tables | | Memory-optimized tables | Through the MEMORY storage engine |
Locking | | Multi-granular locking | Row-level locking (InnoDB) |
Online schema changes | Online table redefinition through the DBMS_REDEFINITION package | | Supported for most ALTERs like creating/dropping indexes | Supported for some ALTERs like adding nullable columns, dropping columns
OS support | Unix/Linux/Windows/Mac | Windows only | Unix/Linux/Windows/Mac | Unix/Linux/Windows/Mac
Partitioning | | | Horizontal partitioning is supported but not vertical partitioning |
Replication | Oracle multimaster replication supports both synchronous and asynchronous modes | Both synchronous and asynchronous modes are supported by Always On availability groups (AG) | Asynchronous by default; also supports semi-synchronous replication through an interface implemented by plugins |
Back
Maintained by: VINCENT KANDASAMY, Database Architect/Administrator (kandasf@hotmail.com)
Last updated: Nov 25, 13:24