DATA ENCODING
ASCII
ASCII is a 7-bit coding scheme. Valid hex values for an ASCII character are 00 through 7F (0 to 127). While the standard ASCII character set
uses just 7 bits per character, there are several larger character sets that use 8 bits, which gives them 128 additional characters. The
extra characters are used to represent non-English characters, graphics symbols, and mathematical symbols. Several companies and
organizations have proposed extensions for these 128 characters. The DOS operating system uses a superset of ASCII called extended ASCII or
high ASCII. A more universal standard is the ISO Latin 1 set of characters, which is used by many operating systems, as well as web browsers.
The ASCII chart (7-bit)
Decimal Octal Hex Binary Value Decimal Octal Hex Binary Value
------- ----- --- ------ ----- ------- ----- --- ------ -----
000 000 000 00000000 NUL (Null char.) 064 100 040 01000000 @ (AT symbol)
001 001 001 00000001 SOH (Start of Header) 065 101 041 01000001 A
002 002 002 00000010 STX (Start of Text) 066 102 042 01000010 B
003 003 003 00000011 ETX (End of Text) 067 103 043 01000011 C
004 004 004 00000100 EOT (End of Transmission) 068 104 044 01000100 D
005 005 005 00000101 ENQ (Enquiry) 069 105 045 01000101 E
006 006 006 00000110 ACK (Acknowledgment) 070 106 046 01000110 F
007 007 007 00000111 BEL (Bell) 071 107 047 01000111 G
008 010 008 00001000 BS (Backspace) 072 110 048 01001000 H
009 011 009 00001001 HT (Horizontal Tab) 073 111 049 01001001 I
010 012 00A 00001010 LF (Line Feed) 074 112 04A 01001010 J
011 013 00B 00001011 VT (Vertical Tab) 075 113 04B 01001011 K
012 014 00C 00001100 FF (Form Feed) 076 114 04C 01001100 L
013 015 00D 00001101 CR (Carriage Return) 077 115 04D 01001101 M
014 016 00E 00001110 SO (Shift Out) 078 116 04E 01001110 N
015 017 00F 00001111 SI (Shift In) 079 117 04F 01001111 O
016 020 010 00010000 DLE (Data Link Escape) 080 120 050 01010000 P
017 021 011 00010001 DC1 (XON) (Device Control 1) 081 121 051 01010001 Q
018 022 012 00010010 DC2 (Device Control 2) 082 122 052 01010010 R
019 023 013 00010011 DC3 (XOFF)(Device Control 3) 083 123 053 01010011 S
020 024 014 00010100 DC4 (Device Control 4) 084 124 054 01010100 T
021 025 015 00010101 NAK (Negative Acknowledgement) 085 125 055 01010101 U
022 026 016 00010110 SYN (Synchronous Idle) 086 126 056 01010110 V
023 027 017 00010111 ETB (End of Trans. Block) 087 127 057 01010111 W
024 030 018 00011000 CAN (Cancel) 088 130 058 01011000 X
025 031 019 00011001 EM (End of Medium) 089 131 059 01011001 Y
026 032 01A 00011010 SUB (Substitute) 090 132 05A 01011010 Z
027 033 01B 00011011 ESC (Escape) 091 133 05B 01011011 [ (left bracket)
028 034 01C 00011100 FS (File Separator) 092 134 05C 01011100 \ (back slash)
029 035 01D 00011101 GS (Group Separator) 093 135 05D 01011101 ] (right bracket)
030 036 01E 00011110 RS (Request to Send/Record Separator) 094 136 05E 01011110 ^ (caret/circumflex)
031 037 01F 00011111 US (Unit Separator) 095 137 05F 01011111 _ (underscore)
032 040 020 00100000 SP (Space) 096 140 060 01100000 ` (grave accent)
033 041 021 00100001 ! (exclamation mark) 097 141 061 01100001 a
034 042 022 00100010 " (double quote) 098 142 062 01100010 b
035 043 023 00100011 # (number sign) 099 143 063 01100011 c
036 044 024 00100100 $ (dollar sign) 100 144 064 01100100 d
037 045 025 00100101 % (percent) 101 145 065 01100101 e
038 046 026 00100110 & (ampersand) 102 146 066 01100110 f
039 047 027 00100111 ' (single quote) 103 147 067 01100111 g
040 050 028 00101000 ( (left parenthesis) 104 150 068 01101000 h
041 051 029 00101001 ) (right parenthesis) 105 151 069 01101001 i
042 052 02A 00101010 * (asterisk) 106 152 06A 01101010 j
043 053 02B 00101011 + (plus) 107 153 06B 01101011 k
044 054 02C 00101100 , (comma) 108 154 06C 01101100 l
045 055 02D 00101101 - (minus or dash) 109 155 06D 01101101 m
046 056 02E 00101110 . (dot) 110 156 06E 01101110 n
047 057 02F 00101111 / (forward slash) 111 157 06F 01101111 o
048 060 030 00110000 0 112 160 070 01110000 p
049 061 031 00110001 1 113 161 071 01110001 q
050 062 032 00110010 2 114 162 072 01110010 r
051 063 033 00110011 3 115 163 073 01110011 s
052 064 034 00110100 4 116 164 074 01110100 t
053 065 035 00110101 5 117 165 075 01110101 u
054 066 036 00110110 6 118 166 076 01110110 v
055 067 037 00110111 7 119 167 077 01110111 w
056 070 038 00111000 8 120 170 078 01111000 x
057 071 039 00111001 9 121 171 079 01111001 y
058 072 03A 00111010 : (colon) 122 172 07A 01111010 z
059 073 03B 00111011 ; (semi-colon) 123 173 07B 01111011 { (left brace)
060 074 03C 00111100 < (less than) 124 174 07C 01111100 | (vertical bar)
061 075 03D 00111101 = (equal sign) 125 175 07D 01111101 } (right brace)
062 076 03E 00111110 > (greater than) 126 176 07E 01111110 ~ (tilde)
063 077 03F 00111111 ? (question mark) 127 177 07F 01111111 DEL (delete)
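The relationship between a character and its numeric codes in the chart can be reproduced with a few lines of Python (an illustrative sketch, not part of the chart): the built-in ord() returns the code point, which for the range 0 to 127 is the ASCII code.

for ch in ("A", "a", "0", " "):
    code = ord(ch)                                    # ASCII code of the character
    print(f"{code:03d} {code:03o} {code:03X} {code:08b} {ch!r}")
# 065 101 041 01000001 'A'
# 097 141 061 01100001 'a'
# 048 060 030 00110000 '0'
# 032 040 020 00100000 ' '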
Base64 encoding
Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64
representation. Each Base64 digit represents 6 bits of data and thus three bytes can be represented by four Base64 digits (33% more
storage).
Base64 is designed to carry data stored in binary formats across channels that only reliably support text content. Base64 is particularly
prevalent on the World Wide Web where its uses include the ability to embed image files or other binary content inside text files such as
HTML and CSS.
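A minimal illustration of the 3-bytes-to-4-digits mapping, using Python's standard base64 module (the sample string is just an example):

import base64

data = b"Man"                               # 3 bytes = 24 bits
encoded = base64.b64encode(data)            # 4 Base64 digits of 6 bits each
print(encoded)                              # b'TWFu'
print(base64.b64decode(encoded) == data)    # True - the round trip is lossless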
EBCDIC
EBCDIC is an 8-bit coding scheme, mostly used in IBM computers such as mainframes and mid-range machines like the AS/400. Valid hex values
for an EBCDIC character are 00 through FF. EBCDIC uses the full 8 bits available to it, so parity checking cannot be used on an 8-bit
system. Also, EBCDIC has a wider range of control characters than ASCII.
The character encoding is based on Binary Coded Decimal (BCD), so the contiguous characters in the alphanumeric range are formed up in
blocks of up to 10 from 0000 binary to 1001 binary. Non-alphanumeric characters are almost all outside the BCD range.
There are four main blocks in the EBCDIC code page: 0000 0000 to 0011 1111 is reserved for control characters; 0100 0000 to 0111 1111 are
for punctuation; 1000 0000 to 1011 1111 for lowercase characters and 1100 0000 to 1111 1111 for uppercase characters and numbers.
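The block layout can be observed with Python, which ships an EBCDIC code page ('cp500', i.e. IBM-500) among its standard codecs; this is an illustrative sketch rather than a definitive mapping for every EBCDIC variant:

for ch in ("A", "a", "0"):
    print(ch, hex(ord(ch)), hex(ch.encode("cp500")[0]))
# A 0x41 0xc1   - uppercase block (1100 0000 to 1111 1111)
# a 0x61 0x81   - lowercase block (1000 0000 to 1011 1111)
# 0 0x30 0xf0   - digits sit at the top of the uppercase/numbers block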
The Unicode standard for data representation
The Unicode standard is a character code designed to encode text for storage in computer files, with the capacity to encode all of the
characters used for written languages throughout the world. It is a profile (a subset) of the ISO 10646 standard. The design of the
Unicode standard is based on the simplicity and consistency of the ASCII character code set (and of Latin-1, an extended version of the
ASCII code set), but goes far beyond ASCII's limited ability to encode only the Latin alphabet.
Unicode and ISO 10646 were developed to create a single unified character set and closely resemble each other.
Universal Character Set (UCS)
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It
guarantees round-trip compatibility to other character sets. No information will be lost if any text string is converted to UCS and back to
the original encoding. UCS contains the characters required to represent practically all known languages.
UCS is based on a 4-octet (32-bit) coding scheme known as the 'canonical form' (UCS-4), but a 2-octet (16-bit) form (UCS-2) is used for the
Basic Multilingual Plane (BMP) that contains characters of most modern languages, where octets 1 and 2 are assumed to be 00 00.
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value
is commonly preceded by 'U+' as in U+0041 for the character 'A'. The UCS characters U+0000 to U+007F are identical to those in US-ASCII
(ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside
the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such
as UTF-8 and UTF-16.
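The code-number-plus-official-name pairing and the U+ notation can be inspected with Python's unicodedata module (a small sketch for illustration):

import unicodedata

for ch in ("A", "é", "€"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN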
UTF (Unicode transformation formats)
UCS represents most of the world's writing systems by multi-octet characters. Multi-octet characters, however, are not compatible with many
current applications and protocols, and this has led to the development of a few so-called Unicode transformation formats (UTF). A UTF is an
algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO 10646 standard uses the
term 'UCS transformation format' for UTF.
Each UTF is reversible, i.e. it supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and
back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte
sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
The UTF-8 format
UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. In UTF-8, characters are encoded using sequences of 1 to 6 octets.
UTF-8 has the following properties:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility), which means that files and
strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust against missing bytes
All possible 2^31 UCS codes can be encoded
UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long
The sorting order of big-endian UCS-4 byte strings is preserved
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding
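These properties are easy to observe by encoding a few characters in Python (an illustrative sketch; note that the current UTF-8 specification, RFC 3629, restricts sequences to at most four bytes, covering code points up to U+10FFFF):

for ch in ("A", "é", "€", "𐍈"):
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ->", " ".join(f"{x:02X}" for x in b))
# U+0041 -> 41            (ASCII byte, MSB clear)
# U+00E9 -> C3 A9         (lead byte >= 0xC0, continuation bytes 0x80-0xBF)
# U+20AC -> E2 82 AC      (three bytes for a 16-bit BMP character)
# U+10348 -> F0 90 8D 88  (four bytes for a character outside the BMP)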
The UTF-16 format
UTF-16 basically arose from UCS-2 when it became clear that more than 2^16 code points were needed to represent all the
characters. UTF-16 uses a single 16-bit code unit to encode the most common characters, and a pair of 16-bit code units, called surrogates,
to encode the less commonly used characters in Unicode.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code
units in UTF-16. Leading (high) surrogates are from D800 to DBFF, and trailing (low) surrogates are from DC00 to DFFF. They are called
surrogates since they do not represent characters directly, but only as a pair.
For a BMP character, UTF-16 and UCS-2 have identical storage characteristics: same code values, same encoding, same length (16 bits).
For a supplementary character (a code point above 0xFFFF), UTF-16 uses a special 32-bit sequence called a surrogate pair: 0x10000 is
subtracted from the code point to give a 20-bit value, the high 10 bits are added to 0xD800 and stored in the first 16-bit code unit, and
the low 10 bits are added to 0xDC00 and stored in the next 16-bit code unit. Consequently, all supplementary characters require 32 bits,
where the first 16 bits are a number between 0xD800 and 0xDBFF, and the last 16 bits are a number between 0xDC00 and 0xDFFF.
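A worked example in Python of the surrogate computation just described, checked against the built-in UTF-16 encoder (the chosen character, U+1D11E MUSICAL SYMBOL G CLEF, is only an example):

cp = 0x1D11E                     # a supplementary character (outside the BMP)
v = cp - 0x10000                 # 20-bit value
high = 0xD800 + (v >> 10)        # leading (high) surrogate
low = 0xDC00 + (v & 0x3FF)       # trailing (low) surrogate
print(hex(high), hex(low))                       # 0xd834 0xdd1e
print(chr(cp).encode("utf-16-be").hex(" ", 2))   # d834 dd1e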
UTF-16 is used internally by Windows, Java and JavaScript.
File Formats
Text (ASCII) Files
An ASCII file or a text file is one in which each byte represents one ASCII-coded character. An ASCII file can be considered to be a binary
file that consists of just ASCII characters. Since ASCII characters are 7-bit encodings stored in a byte, each byte of an ASCII file has its
MSB set to 0.
Text files generally have a one-to-one correspondence between bytes and ordinary readable characters (such as letters and digits) and
control characters (such as tabs, line feeds and carriage returns), without any embedded information such as font information, hyperlinks or
inline images. They are therefore human-readable with any simple program that can display a file.
Text files can have the MIME type "text/plain", often with suffixes indicating an encoding. Common encodings for plain text include ASCII,
UTF-8, UTF-16 and ISO 8859.
Binary Files
In a binary file there is no one-to-one mapping between bytes and characters. Files that have been formatted with a word processor must be
stored and transmitted as binary files to preserve the formatting. Each byte of a binary file can take any of the 256 possible bit patterns
(as opposed to an ASCII file, whose bytes are limited to 128 patterns).
All executable programs are stored in binary files, as are most numeric data files. Binary files can also be image, sound or compressed
files etc., in short, any file content other than plain text. Usually the specification of a binary file's file format indicates how to
handle that file.
Binary files are often encoded into a plain text representation to improve survivability during transit, using encoding schemes such as
Base64.
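A rough way to tell a pure 7-bit ASCII text file from a binary file, following the MSB observation above (a sketch only; real detection is more involved):

def looks_like_ascii_text(path):
    # every byte of a 7-bit ASCII file has its most significant bit set to 0
    with open(path, "rb") as f:
        return all(byte < 0x80 for byte in f.read())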
DATA ARCHITECTURE
Data architecture deals with developing and maintaining a formal description of the data and data structures within an organization - this
can include data definitions, data models, data flow diagrams etc. (in short, metadata). Data architecture includes topics such as metadata
management, business semantics, data modeling and metadata workflow management.
Data architecture encompasses
Logical Data modeling
Physical Data modeling
Development of a data strategy and associated policies
Selection of capabilities and systems to meet business information needs
A data architect is responsible for designing, creating, deploying and managing an organization's data architecture. A data architect
ensures that an organization follows a formal data standard and that its data assets are aligned with the defined data architecture and
with the goals of the business. Typically, a data architect designs, builds and optimizes databases, maintains the metadata registry and
develops strategies for data acquisition, archival and recovery.
Data architects are usually skilled in data modeling, data policies development, data strategy, data dictionaries, data warehousing,
metadata registry, relational databases, data retention, data querying languages like SQL and identifying the best system for addressing
data storage, retrieval and management.
Data modeling
Data modeling is the analysis of data objects that are used in a business or other context and the identification of the relationships among
these data objects. A data model can be thought of as a diagram or flowchart that illustrates the relationships between data.
There are several different approaches to data modeling, including:
Conceptual Data Modeling - identifies the highest-level relationships between different entities.
Enterprise Data Modeling - similar to conceptual data modeling, but addresses the unique requirements of a specific business.
Logical Data Modeling - illustrates the specific entities, attributes and relationships involved in a business function. Serves as the basis
for the creation of the physical data model.
- includes entities (tables), attributes (columns) and relationships (keys)
- independent of technology (platform/DBMS)
- Uses business names for entities and attributes
- Normalized to the fourth normal form
Physical Data Modeling - represents an application and database-specific implementation of a logical data model.
- includes tables, columns, keys, data types, triggers, procedures and access constraints
- includes keys/indices for fast data access
- Uses specific names for tables and columns, limited by DBMS
- may be denormalized to meet performance requirements
ER diagram
An entity relationship diagram (ERD) shows the relationships of entities (data components) stored in a database. It is a data-modeling
technique that can help define business processes and be used as the foundation for a relational database.
There are three basic components of an ER diagram:
Entities, which are objects or concepts that can have data stored about them
Attributes, which are properties or characteristics of entities. An ERD attribute can be denoted as a primary key or a foreign key
The relationships between and among those entities
The relational model
A relational database stores data in tables (relations) and enforces relationships among those tables. It supports normalization and
allows the setting up of integrity constraints to protect data integrity.
Database design
Steps
Identifying entities - entities (tables) are the types of information saved. Generally all information falls into four categories -
people, things, events and locations. For example, in a database for a shop, 'shop' is a location, 'sale' is an event, 'products' are things
and 'customers' are people
Identifying relationships - determine the relationships between the entities and the cardinality of each relationship,
e.g. one customer can be involved in multiple sales
Identifying attributes - attributes (columns) are the data elements desired to be saved for each entity
Assigning keys
Defining the data types for attributes
Normalization
The database lifecycle
Database initial study - gather requirements and specifications
Database design - involves conceptual design, software selection, logical design and physical design
Implementation and loading
Testing and evaluation - the database is tested and fine-tuned for performance, integrity, concurrent access and security constraints
Operation
Maintenance and evaluation
Populating the database
After a database has been created, there are two ways of populating the tables – either from existing data or through the use of the user
applications developed for the database.
Existing data can be from another database or data files. Facilities to import/export data in standard formats are usually available in a
DBMS. When data are held in a file format that is not appropriate for using the import function, then it is necessary to use a program/tool
that reads in the old data, transforms them as necessary and then inserts them.
The transfer of large quantities of existing data into a database is referred to as a bulk load, and it may be prudent to postpone
constraint checking until the end of the bulk load.
Normalization
Normalization is the process of efficiently organizing data in a database. The two goals of normalization are eliminating redundant data
(e.g. storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). A
brief sketch follows the list of normal forms below.
First normal form (1NF) sets the very basic rules for an organized database
Eliminate duplicative columns from the same table
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key)
Second normal form (2NF) further addresses the concept of removing duplicative data
Meet all the requirements of the first normal form
Remove subsets of data that apply to multiple rows of a table and place them in separate tables
Create relationships between these new tables and their predecessors through the use of foreign keys
Third normal form (3NF) goes one large step further
Meet all the requirements of the second normal form
Remove columns that are not dependent upon the primary key
The Boyce-Codd Normal Form, also referred to as BCNF or 3.5NF, adds one more requirement
Meet all the requirements of the third normal form
Every determinant must be a candidate key
The fourth normal form (4NF) has one additional requirement
Meet all the requirements of the Boyce-Codd normal form
A relation is in 4NF if it has no multi-valued dependencies
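A brief sketch of the idea using Python's built-in sqlite3 module (the table and column names are invented for illustration): the customer's attributes are removed from the orders table and placed in their own table, eliminating the repeated data and leaving each table's non-key columns dependent only on its key.

import sqlite3

con = sqlite3.connect(":memory:")
# Unnormalized: customer name and city are repeated on every order
con.execute("CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, "
            "cust_name TEXT, cust_city TEXT, item TEXT)")
# Normalized: customer attributes move to their own table; orders reference it
con.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, "
            "cust_name TEXT, cust_city TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "cust_id INTEGER REFERENCES customers(cust_id), item TEXT)")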
Design considerations
Tables should be given meaningful, recognizable names based on the data contained
Efficient datatypes should be chosen for columns
Integer columns can be configured as unsigned when negative values are not expected to be stored to increase the range of permissible
values
The default character set should be a Unicode encoding such as UTF-8 for all tables where non-English characters are expected to be stored
Foreign keys should be used to preserve data integrity. Foreign keys need to be named as 'fk_tablename_colname'
Metadata
Metadata is data that describes other data. Metadata summarizes basic information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified and file size are examples of very basic document
metadata. Having the ability to filter through that metadata makes it much easier for someone to locate a specific document. Metadata can
be stored and managed in a database, often called a metadata registry or metadata repository.
Most popular DBMSs provide access to their metadata with a set of tables or views, often called the system catalog or data dictionary, which
can be accessed using plain SQL. Many of them implement the standard information schema.
DATABASE MANAGEMENT SYSTEMS
A DBMS is a software package designed to define, manipulate, retrieve and manage data in a database.
A relational DBMS (RDBMS) is a DBMS based on the relational model wherein the data is stored in a tabular form (rows and columns) in tables
(relations).
DATABASE ARCHITECTURE
The main components of a DBMS are the process manager, client communications manager, query processor, transactional storage manager and
shared components/utilities.
The client communications manager is responsible for establishing connections from clients, remembering their connection state, responding
to SQL commands and returning both data and control messages. Upon receiving an SQL command, the process manager assigns a thread for
executing the command. The query processor compiles the query into an internal query plan and executes it. The operators in the query plan
make calls to the transactional storage manager which manages all data access and manipulation functions.
The storage manager includes algorithms and data structures for organizing and accessing data on disk ('access methods'). It also includes a
buffer management module that decides what data to transfer between disk and memory buffers. The access methods return control to the query
executor’s operators, and as result tuples are generated, they are placed in a buffer for the client communications manager, which ships
the results back to the caller. For large result sets, the client typically will make additional calls to fetch more data incrementally from
the query, resulting in multiple iterations through the communications manager, query executor, and storage manager.
At the end of the query the transaction is completed and the connection closed - this results in the transaction manager cleaning up state
for the transaction, the process manager freeing any control structures for the query and the communications manager cleaning up
communication state for the connection.
+----------------------------------------------------------------------------------------------------+
| +----------+ +------------------------------------------+ |
| | Admission| | Client protocols | |
| | Control | +------------------------------------------+ +---------------------------+ |
| | | Client communications manager | Catalog manager | |
| | | | Administration/Monitoring | |
| | Dispatch | +------------------------------------------+ | Replication | |
| | and | | Query parsing and authentication | | Loading | |
| |scheduling| | Query Optimizer | | Batch utilities | |
| | | | Plan Executor | | Memory manager | |
| +----------+ | DDL and utility processing | +---------------------------+ |
| Process manager +------------------------------------------+ Shared components and |
| Query processor utilities |
| |
| +------------------------------------------+ |
| | Access methods Buffer manager | |
| | Lock manager Log manager | |
| +------------------------------------------+ |
| Transactional storage manager |
+----------------------------------------------------------------------------------------------------+
Process management
The process manager assigns a thread to an SQL command received from a client and ensures that the thread's data and control outputs are
connected to the client through the communications manager. A thread is a single, sequential execution unit of a program, which consists of
its own program counter, a stack and a set of registers.
A process is an instance of the execution of a program on a single computer. A process can consist of one or more threads executing
concurrently. The private memory used by a process cannot be accessed by any other process and all threads of a process share its virtual
address space.
Transaction log
A transaction log is a sequential record of all changes made to the database. It contains the information needed to redo or undo the changes
made by any individual transaction and is a critical part of database crash recovery, when committed transactions need to be applied to the data
files and all uncommitted transactions need to be rolled back. A transaction log provides greater protection against failure, better
performance and the ability to replicate data.
Like the database file, the transaction log is organized into pages - fixed size areas of memory. When a change is recorded in the
transaction log, it is made to a page in memory. The change is forced to disk when the earlier of the following happens - the page is full
or a COMMIT is executed. Completed transactions are guaranteed to be stored on disk, while performance is improved by avoiding a write to
the disk on every operation.
Every log record is uniquely identified by an incremental log sequence number (LSN). DBMSs periodically issue checkpoint commands in the
background, at which point the changes recorded in the transaction log are written to the data files. The checkpointed LSN is stored in the log and is used
during recovery.
Transaction logging improves performance by deferring database updates to disk during periods of high server activity. Instead of recording
database updates to random, non-sequential parts of a disk, transactions are logged sequentially in the log files, saving processing time
and updating the database when server activity is lower.
DATA STRUCTURES
Tables
Tables are the primary form of data storage organized as rows and columns. Columns form the table’s structure and rows form the content.
A single row in a table is called a tuple or a record.
Each column in a table has a unique name; columns are often referred to as attributes. A domain is the set of values permitted for an attribute in
a table. Columns can have constraints defined on them to enforce data integrity including
primary key - enforces entity integrity which ensures that each row of a table has a unique and non-null primary key value
unique key - enforces candidate key integrity
foreign key - enforces referential integrity, if a value of an attribute (column) of a relation (table) references a value of another
attribute (either in the same or a different relation), then the referenced value must exist
check and not null constraints - restricts a column's values, a partial enforcement of domain integrity
Partitioning
Large tables can be subdivided into smaller, manageable pieces called partitions to enhance performance, manageability and availability.
Each partition has its own name and optionally its own storage characteristics. Using partitioning, different portions of a table are
stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a
partitioning function.
The following tables can be considered for partitioning
large tables
tables containing historical data, in which new data is added into the newest partition, e.g. a historical table where only the
current month's data is updatable and the other 11 months are read only
when the contents of a table need to be distributed across different types of storage devices
Partitioning types
Horizontal partitioning (sharding) - all partitions have the same schema. Each partition is known as a shard and holds a specific
subset of the data
Range Partitioning - maps data to partitions based on ranges of values of the partitioning key established for each partition. It is the
most common type of partitioning and is often used with dates
Hash Partitioning - maps data to partitions based on a hashing algorithm that applies to the partitioning key. The hashing algorithm evenly
distributes rows among partitions, giving partitions approximately the same size. Hash partitioning is the ideal method for distributing
data evenly across devices. Hash partitioning is also an alternative to range partitioning when the data to be partitioned is not
historical or has no obvious partitioning key
List Partitioning - explicitly controls how rows map to partitions by specifying a list of discrete values for the partitioning key in the
description for each partition. The advantage of list partitioning is that unordered and unrelated sets of data can be grouped and organized
in a natural way
Vertical partitioning - each partition holds a subset of columns of a table
Views
A view is a virtual table that derives its data from tables or other views and is defined by a query. All operations performed on a view
actually affect the base tables of the view. Views can be used in almost the same way - query, update, insert and delete - as regular tables.
A view can be used for the following purposes:
- to customize the perception each user has of the database by providing a different representation (such as subsets or supersets) of the
data that resides within other tables and views
- as a security mechanism by allowing users to access data through the view, without granting permissions for the base tables
- to provide a backward compatible interface to emulate a table whose schema has changed
Indexes
An index is a data structure defined on columns in a table to speed up data retrieval. An index is usually structured as a set of pairs
(k,p) where k is a key value of the column(s) on which the index is built and p is a pointer to the record(s) with the key value k.
The cardinality of an index refers to the uniqueness of the data values contained in the indexed columns. A higher cardinality implies more
unique values; the lower the cardinality, the more duplicated values in the columns. A column with the highest possible cardinality would
have unique values in each row and one with lowest possible cardinality would have the same value for every row. DBMSs use cardinality to
help determine the optimal query plan for a given query.
Types of indexes
Indexes are basically of two types - ordered and hashed. Ordered indexes have the key values in sorted order for quick access while in
hash indexes, a hash function is used to find out the location of entries with key value k.
Ordered indexes can be classified into dense and sparse indexes. In a dense index, an index record appears for every search key value in the
file, containing a pointer to the actual record. In a sparse index, index records are created only for some of the key values. To locate a
record, the index record with the largest key value less than or equal to the search value is found, the data it points to is located, and
a sequential scan from there finds the required record. Dense indices are faster in general, but sparse indices require less space and less
maintenance for insertions and deletions.
Indexes can be unique or nonunique. Unique indexes guarantee that no two rows of a table have duplicate values in the key column or columns.
A composite index, also called a concatenated index, is an index on multiple columns in a table.
Some databases, like Oracle, allow functions to be indexed and these indexes to be used in querying, e.g.
create index emp_upper_idx on emp(upper(ename));
Clustered and non-clustered indexes
Every table usually has a clustered index (typically synonymous with the primary key) where the data for the rows is stored in sorted
order. Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. A
table without a clustered index is referred to as a heap.
All other indexes are secondary, non-clustered indexes. A nonclustered index contains the index key values and pointers to the data rows
that contain the key value.
Index structure
B-trees
Indexes are usually organized as B-trees (balanced trees). A B-tree is a self-balancing tree data structure that keeps data sorted and
allows searches, sequential access, insertions and deletions in logarithmic time. B-trees save time by using nodes with many branches
(called children), compared with binary trees where each node has only two children. B-trees are preferred when decision points, called
nodes, are on hard disk rather than in RAM.
The B-tree structure consists of three main levels: the Root level - the top node that contains a single index page, from which the data
search is started, the Leaf level - the bottom level of nodes that contains the data pages, with the number of leaf pages depending on the
amount of data stored in the index, and the Intermediate level - one or multiple levels between the root and the leaf levels that holds
the index key values and pointers to the next intermediate level pages or the leaf data pages. The number of intermediate levels depends
on the amount of data stored in the index.
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may
waste some space, since nodes are not entirely full.
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to
the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.
An index that is too deep will suffer from performance degradation, while an index with a large number of nodes in each level can produce a
very flat index structure. An index with only 3 to 4 levels is very common.
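A back-of-the-envelope illustration in Python of why so few levels are needed: with a fanout of F child pointers per node, roughly ceil(log_F(N)) levels cover N keys (the fanout of 500 below is an assumed, illustrative value).

import math

def btree_levels(rows, fanout=500):
    return max(1, math.ceil(math.log(rows, fanout)))

print(btree_levels(1_000_000))      # 3
print(btree_levels(100_000_000))    # 3 - still a very flat structure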
Hash indexes
A hash index organizes the search keys with their associated pointers into a hash file structure. A hash function is applied to a search
key to identify a bucket, and the key and its associated pointers are stored in that bucket (or in overflow buckets). Hash indices are secondary
index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
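A minimal in-memory sketch of the idea (buckets keyed by a hash of the search key, each holding key/row-id pairs); real hash indexes add overflow handling and persistence.

from collections import defaultdict

class HashIndex:
    def __init__(self, n_buckets=8):
        self.n = n_buckets
        self.buckets = defaultdict(list)          # bucket number -> [(key, rowid)]

    def add(self, key, rowid):
        self.buckets[hash(key) % self.n].append((key, rowid))

    def lookup(self, key):
        # scan only the one bucket the hash function points to
        return [rid for k, rid in self.buckets[hash(key) % self.n] if k == key]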
CHARACTER SETS
DBMSs allow storage of data in a variety of character sets and perform comparisons using a variety of collations. Character sets not only
affect storage but also how client programs communicate with the database server.
After a database is created and accumulates data, changing the character set is time consuming and complex and hence it is important to
select the right character set at installation time. Multilingual unicode character sets have an additional cost such as slightly slower
text processing compared to single-byte character sets and higher storage space requirements for non-ASCII text.
Collations
The physical storage of character strings in a database is controlled by collations. A collation specifies the bit patterns that represent
each character and the rules by which characters are sorted and compared.
Different DBMSs provide different collation options. Depending on the DBMS, collation can be specified at the server level, database level,
table level and the column level. Collations can also be specified at the expression level (by specifying which collation to use when the
query is run) and at the identifier level.
SQL
SQL (Structured Query Language) is a standardized programming language used for managing relational databases and performing various
operations on the data in them. SQL consists of many types of statements, which may be classed as Data Query Language (DQL), Data Definition Language (DDL), Data Control Language (DCL) and Data Manipulation Language (DML). Although SQL is essentially a declarative language, it also includes procedural elements.
An official SQL standard was adopted by ANSI in 1986 and new versions of the SQL standard are published every few years. Some versions of SQL include proprietary extensions to the standard language for procedural programming and other functions e.g T-SQL, PL/SQL.
SQL queries and other operations take the form of commands written as statements and are aggregated into programs that enable users to add, modify or retrieve data from database tables.
Joins
The join clause combines columns from one or more tables into a new table. ANSI-standard SQL specifies five types of JOIN - INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.
Inner join
In an inner join, the columns being joined are compared using a comparison operator. Only non-null values are compared and column values for each matched pair of rows of the joined tables are combined into a result row.
Inner joins can be expressed in two syntactical ways. The explicit join notation uses the JOIN keyword, optionally preceded by the INNER keyword, to specify the table to join, and the ON keyword to specify the join condition. The implicit join notation simply lists the tables for joining in the FROM clause of the SELECT statement.
SELECT * FROM orders o INNER JOIN customers c ON o.cust_id = c.cust_id;
SELECT * FROM orders o, customers c WHERE o.cust_id = c.cust_id;
Outer join
Outer joins return all the rows from at least one of the tables referenced in the join provided the rows meet the WHERE conditions. All
rows are retrieved from the left (right) table with a LEFT (RIGHT) OUTER JOIN and all rows from both tables are retrieved in a FULL OUTER JOIN.
SELECT a.orderid, a.custid, b.custname FROM orders a LEFT OUTER JOIN customers b ON a.custid = b.custid;
Cross join
A cross join that does not have a WHERE clause produces the Cartesian product of the tables involved in the join. The size of a Cartesian
product result set is the number of rows in the first table multiplied by the number of rows in the second table.
SELECT au_fname, au_lname, pub_name FROM authors CROSS JOIN publishers ORDER BY au_lname DESC;
If a WHERE clause is added, the cross join behaves as an inner join. The queries below produce the same result set
SELECT au_fname, au_lname, pub_name FROM authors CROSS JOIN publishers
WHERE authors.city = publishers.city ORDER BY au_lname DESC;
SELECT au_fname, au_lname, pub_name FROM authors INNER JOIN publishers
ON authors.city = publishers.city ORDER BY au_lname DESC;
Join algorithms
Joins are performed using different algorithms like nested-loop, sort-merge or hash join.
nested-loop join
A nested loop join is a naive algorithm that joins two sets by using two nested loops.
def nested_loop_join(R, S, condition):
    for r in R:                      # outer loop over every tuple r in R
        for s in S:                  # inner loop over every tuple s in S
            if condition(r, s):      # join condition
                yield (r, s)
The block nested loop join algorithm is a generalization of the simple nested loops algorithm that takes advantage of additional memory to reduce the number of times that S is scanned. It loads large chunks of relation R into main memory and for each chunk, it scans S and evaluates the join condition on all tuple pairs currently in memory. This reduces the number of times S is scanned to once per chunk.
sort-merge join
The join algorithm needs to find, for each distinct value of the join attribute, the set of tuples in each relation which exhibit that value. The logic of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time. In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order.
hash join
The hash join algorithm for an inner join of two relations proceeds as follows
First, prepare a hash table using the contents of one relation, ideally whichever one is smaller after applying local predicates. This relation is called the build side of the join. The hash table entries are mappings from the value of the (composite) join attribute to the remaining attributes of that row (whichever ones are needed).
Once the hash table is built, scan the other relation (the probe side). For each row of the probe relation, find the relevant rows from the build relation by looking in the hash table.
The first phase is usually called the 'build' phase, while the second is called the 'probe' phase. Similarly, the join relation on which the hash table is built is called the 'build' input, whereas the other input is called the 'probe' input. Hash joins are typically more efficient than nested loops joins, except when the probe side of the join is very small.
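A compact Python sketch of the build and probe phases for a single-attribute equijoin (rows are represented as dicts; this is illustrative, not how any particular DBMS implements it):

from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    table = defaultdict(list)
    for r in build_rows:                    # build phase: hash the smaller input
        table[r[key]].append(r)
    for s in probe_rows:                    # probe phase: look up matching rows
        for r in table.get(s[key], ()):
            yield {**r, **s}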
Subqueries
A subquery is a query that is nested inside a query or another subquery. A subquery is also called an inner query or inner select, while the statement containing a subquery is also called an outer query or outer select. Subqueries provide alternative ways to perform operations that would otherwise require complex joins and unions.
Correlated and uncorrelated subqueries
A correlated subquery is one that has a correlation name as a table or view designator in the FROM clause of the outer query and the same
correlation name as qualifier of a search condition in the WHERE clause of the subquery.
A correlated subquery is reevaluated for every row of the table or view named in the outer query, while a
noncorrelated subquery is evaluated only once.
SELECT empno, lastname FROM employee
WHERE dept = 'A00' AND salary > (SELECT AVG(salary) FROM employee WHERE dept = 'A00'); -- uncorrelated subquery
SELECT e1.empno, e1.lastname, e1.dept FROM employee e1
WHERE salary > (SELECT AVG(salary) FROM employee e2 WHERE e2.dept = e1.dept) ORDER BY e1.dept; -- correlated subquery
Derived tables
A derived table is a statement-local temporary table created by means of a subquery in the FROM clause of a SQL SELECT statement. It exists
only in memory and behaves like a standard table or view. Derived tables are useful for generating aggregates in a table and joining them
with the details in another table.
Stored procedures
A stored procedure is a group of SQL statements named and stored in the DBMS and can be called by applications. Stored procedures resemble constructs in other programming languages because they can
accept input parameters and return multiple values in the form of output parameters to the calling program
contain programming statements that perform operations in the database
return a status value to a calling program to indicate success or failure and the reason for failure
Benefits of Using Stored Procedures
Reduced server/client network traffic - the commands in a procedure are executed as a single batch of code. This reduces network traffic between the server and client because only the call to execute the procedure is sent across the network. Without the code encapsulation provided by a procedure, every individual line of code would have to cross the network.
Stronger security - multiple users and client programs can perform operations on underlying database objects through a procedure, even if the users and programs do not have direct permissions on those underlying objects. The procedure controls what processes and activities are performed and protects the underlying database objects. This eliminates the requirement to grant permissions at the individual object level and simplifies the security layers.
When calling a procedure over the network, only the call to execute the procedure is visible. Therefore, malicious users cannot see table and database object names, embed SQL statements of their own or search for critical data.
Reuse of code - the code for any repetitious database operation is a perfect candidate for encapsulation in procedures. This eliminates needless rewrites of the same code and reduces code inconsistency
Easier maintenance - when client applications call procedures and keep database operations in the data tier, only the procedures must be updated for any changes in the underlying database. The application tier remains separate and does not have to know about any changes to database layouts, relationships or processes.
Improved performance - by default, a procedure compiles the first time it is executed and creates an execution plan that is reused for subsequent executions. Since the query processor does not have to create a new plan, it typically takes less time to process the procedure.
If there has been significant change to the tables or data referenced by the procedure, the precompiled plan may actually cause the procedure to perform slower. In this case, recompiling the procedure and forcing a new execution plan can improve performance.
Triggers
A trigger defines a set of SQL statements that are performed in response to an insert, update or delete operation on a specified table. The trigger is activated when such an SQL operation is executed.
Triggers can be used, along with referential constraints and check constraints, to enforce data integrity rules. Triggers can also be used to cause updates to other tables, automatically generate or transform values for inserted or updated rows or invoke functions to perform tasks such as issuing alerts.
TRANSACTIONS
A transaction is a logical unit of work of database processing carried out by a single user or application process which includes one or
more access operations. Transactions are ended by a commit or rollback.
COMMIT - ends a unit of work and makes the database changes that were made permanent
ROLLBACK - rolls back all changes to the last commit point and ends the transaction
A savepoint is a way of implementing subtransactions (or nested transactions) by indicating a point within a transaction that can
be 'rolled back to' without affecting any work done in the transaction before the savepoint was created. Multiple savepoints can exist
within a single transaction. Savepoints are useful for implementing complex error recovery in database applications by enabling the
application to recover from an error without aborting the entire transaction.
SAVEPOINT s1;
ROLLBACK TO SAVEPOINT s1;
RELEASE SAVEPOINT s1; -- discard the savepoint
A transaction can be in one of these states - active, partially committed (final operation executed but data is not committed yet), failed,
committed or aborted (rolled back to the old consistent state).
Schedules are sequences that indicate the order in which instructions of concurrent transactions are executed. A schedule for a set of
transactions contains all the instructions of those transactions preserving the order. A serial schedule is one in which transactions
execute serially one after the other while in a non-serial schedule the operations of transactions may be interleaved. The serializability
of schedules is used to find non-serial schedules that allow transactions to execute concurrently without interfering with one another.
Isolation levels
The isolation level associated with an application process defines the degree of isolation of that process from other concurrently
executing processes. The isolation level of an application process, P, therefore specifies
The degree to which rows read and updated by P are available to other concurrently executing application processes
The degree to which update activity of other concurrently executing application processes can affect P.
Transaction isolation level sets the default transactional behavior and determines how isolated each transaction is, or what kind of
locks are associated with queries inside a transaction. The four levels, in ascending order of strictness, are
READ UNCOMMITTED: Barely transactional, this setting allows for so-called 'dirty reads', where queries inside one transaction are
affected by uncommitted changes in another transaction
READ COMMITTED: Committed updates are visible within another transaction. This means identical queries within a transaction can
return differing results. This is the default in some DBMSs
REPEATABLE READ: Within a transaction, all reads are consistent
SERIALIZABLE: Updates are not permitted in other transactions if a transaction has run an ordinary SELECT query, i.e. queries are
treated as if they had a LOCK IN SHARE MODE
ACID
ACID is an acronym for the four primary attributes ensured to any transaction by a transaction manager (or transaction monitor).
Atomicity - In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.
Consistency - A transaction either creates a new and valid state of data, or if any failure occurs, returns all data to its state before
the transaction was started.
Isolation - A transaction in process and not yet committed must remain isolated from any other transaction.
Durability - Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in
its correct state.
Locking
The rationale behind locking is to isolate applications in such a way as to provide as much concurrent access as possible while at the same
time guaranteeing as much data integrity as necessary.
There are two different approaches to transactional locking - optimistic and pessimistic. In pessimistic locking, a resource is
locked from the time it is first accessed in a transaction until the transaction is finished, making it inaccessible to other transactions
during that time.
With optimistic locking, a resource is not actually locked when it is first accessed by a transaction and instead, the state of the
resource is saved. Other transactions are able to concurrently access the resource, so conflicting changes are possible.
At commit time, when the resource is about to be updated in persistent storage, the state of the resource is read from storage again and
compared to the saved state. If the two states differ, a conflicting update was made and the transaction will be rolled back.
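A small sketch of optimistic locking using a version column and Python's sqlite3 module (the table, columns and values are invented for illustration): the UPDATE only succeeds if the row still carries the version that was read, so a concurrent change is detected at commit time.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 100, 0)")

balance, version = con.execute(
    "SELECT balance, version FROM accounts WHERE id = 1").fetchone()
cur = con.execute(
    "UPDATE accounts SET balance = ?, version = version + 1 "
    "WHERE id = ? AND version = ?", (balance - 10, 1, version))
if cur.rowcount == 0:
    # someone else updated the row since it was read - roll back or retry
    con.rollback()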
Locks can be of three types - shared, update and exclusive.
Shared locks allow two or more programs to read simultaneously but not change the locked space.
An exclusive lock bars all other users from accessing the space.
An update lock is less restrictive; it allows other transactions to read or acquire shared locks on the space.
Concurrency issues
Lost update - Occurs when the same data is retrieved by two applications (users), both work with the data concurrently and both change and
save the data. The last successful change to the data is kept and the first change is lost.
Uncommitted read ('dirty read') - uncommitted changes to data are read by an application (user), the changes are rolled back and the reading
application gets invalid data.
Non-repeatable read - occurs when an application (user) reads a query result and later on, in the same transaction, the same query yields a
different result, the reason being the rows in the result set were updated or deleted by someone else. Query results aren't repeatable and
hence unreliable and the reading application gets invalid data.
Phantom read - occurs when an application (user) reads a query result and later on, the same query returns additional rows (inserted by
someone else). Acceptable for many applications.
Deadlocks
A deadlock occurs when two or more application processes each hold locks on a resource that the others need and without which they cannot
proceed. After a preset time interval, the DBMS can roll back the current unit of work for one of the processes or request a process to
terminate. That frees the locks and allows the remaining processes to continue.
An example of deadlock - Client A takes an S lock on a table row. Client B tries to update the row and needs an X lock that cannot be
granted because of the S lock that A holds and the request goes on the queue of lock requests for the row. Client A now tries to update the
row and needs an X lock which cannot be granted because B already has a request for an X lock and is waiting for A to release its S lock.
Deadlocks are a common occurrence in transactional databases, but they are not a problem unless they are so frequent that certain transactions
cannot be run at all. Applications must be written so that they are always prepared to re-issue a transaction if it gets rolled back
because of a deadlock.
The occurrence of deadlocks can be reduced by keeping transactions short and less prone to collision and committing them immediately after
making a set of related changes.
Semaphores
A semaphore is a variable or abstract data type used to control access to a common resource by multiple threads in a concurrent system.
Semaphores are a useful tool in the prevention of race conditions. Semaphores which allow an arbitrary resource count are called counting
semaphores, while semaphores which are restricted to the values 0 and 1 (or locked/unlocked, unavailable/available) are called binary
semaphores and are used to implement locks.
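A binary semaphore protecting a shared counter, using Python's threading module (a minimal sketch; a counting semaphore would simply be created with a larger initial value):

import threading

sem = threading.Semaphore(1)        # binary semaphore used as a lock
counter = 0

def work():
    global counter
    for _ in range(100_000):
        with sem:                   # acquire/release around the critical section
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                      # 400000 - no lost updates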
A mutex (short for 'mutual exclusion') is a locking mechanism that sometimes uses the same basic implementation as the binary semaphore. While a binary semaphore may be colloquially referred to as a mutex, only the task that locked the mutex is supposed to unlock it.
A 'hot' mutex is one that has other threads blocked waiting on it for large amounts of time. This
contention reduces scalability on multi-core machines because some threads sit idle, essentially waiting for the ability to run.
In many situations, data is read more often than it is modified or written. An rw-lock (readers-writer lock) allows concurrent access for read-only operations, while write operations require exclusive access. When a writer is writing the data, all other writers or readers will be blocked until the writer is finished writing. Rw-locks are usually constructed on top of mutexes and condition variables or on top of semaphores. Mutexes and rw-locks are known collectively as latches.
Latches are internal to the DBMS engine and are used to provide memory consistency, whereas locks are used to provide logical transactional consistency. To allow for maximum concurrency and provide maximum performance, latches are held only for the duration of the physical operation on the in-memory structure, unlike locks which are held for the duration of the logical transaction.
QUERY PROCESSING
Query optimizer
A query optimizer is a DBMS component that analyzes SQL queries and determines efficient execution mechanisms. A query optimizer generates
one or more query plans for each query and selects the most efficient query plan to run the query.
Query plan
The output of the Query Optimizer is a query plan, also referred to as a query execution plan or just execution plan. A query plan is a
definition of
the sequence in which the source tables are accessed
the methods used to extract data from each table
Each plan is made of one or more execution steps that describe the database operations performed to produce the query result.
Join algorithms
Nested-loop join
A nested-loop join algorithm reads rows from the first table in a loop one at a time, passing each row to a nested loop
that processes the next table in the join. This process is repeated as many times as there remain tables to be joined.
In general, nested loops joins work best on small tables with indexes on the join conditions. The optimizer always tries to put the
smallest row source first, making it the driving table.
Block nested-loop join
A block nested-loop join uses buffering of rows read in outer loops to reduce the number of times that tables in inner loops must be read.
For example, if 10 rows are read into a buffer and the buffer is passed to the next inner loop, each row read in the inner loop can be compared
against all 10 rows in the buffer. This reduces by an order of magnitude the number of times the inner table must be read.
Merge join
The merge join requires both inputs to be sorted on the merge columns. The query optimizer typically scans an index, if one exists on the
proper set of columns, or places a sort operator below the merge join. In rare cases, there may be multiple equality clauses, but the
merge columns are taken from only some of the available equality clauses.
Because each input is sorted, the merge join operation gets a row from each input and compares them. For example, for inner join operations,
the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input.
This process repeats until all rows have been processed.
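Assuming both inputs are already sorted on the join column, a simplified merge join over in-memory lists (unique keys only, for brevity) could be sketched in Python as:

def merge_join(left, right, key):
    # Inner join of two inputs pre-sorted on `key`; duplicate key handling omitted
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] == right[j][key]:
            result.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif left[i][key] < right[j][key]:
            i += 1                   # discard the lower-valued row and advance that input
        else:
            j += 1
    return result

a = [{"id": 1, "x": "a"}, {"id": 3, "x": "c"}]
b = [{"id": 1, "y": "A"}, {"id": 2, "y": "B"}, {"id": 3, "y": "C"}]
print(merge_join(a, b, "id"))        # rows with id 1 and 3 match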
Hash join
The optimizer uses the smaller of two data sets to build a hash table on the join key in memory using a deterministic hash function to
specify the location in the hash table in which to store each row. The database then scans the larger data set probing the hash table to
find the rows that meet the join condition.
While nested loops join works well with smaller data sets and merge join with moderately sized data sets, hash join is better at performing
the largest joins. Hash joins parallelize and scale better than any other join. Like merge join, hash join requires at least one equijoin
predicate, supports residual predicates and supports all outer and semi-joins and unlike merge join, it does not require ordered input
sets.
A hash join is most cost-effective when the smaller data set fits in memory, in which case the cost is limited to a single read pass over
the two data sets.
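A minimal Python sketch of the build and probe phases, using a dict as the in-memory hash table (the table and column names are invented):

from collections import defaultdict

def hash_join(small, large, key):
    # Build phase: hash the smaller input on the join key
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    # Probe phase: scan the larger input and look up matches
    for row in large:
        for match in table.get(row[key], []):
            yield {**match, **row}

dims  = [{"cust_id": 10, "name": "Ann"}, {"cust_id": 20, "name": "Bob"}]
facts = [{"cust_id": 10, "amount": 5}, {"cust_id": 20, "amount": 7}, {"cust_id": 10, "amount": 3}]
print(list(hash_join(dims, facts, "cust_id")))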
Prepared statements
A prepared statement or parameterized statement is a feature used to execute identical SQL statements repeatedly with high efficiency. The
prepared statement takes the form of a template into which constant values are substituted during each execution.
The typical workflow of using a prepared statement is as follows:
Prepare: At first, the application creates the statement template and sends it to the DBMS. Certain values are left unspecified; they are
called parameters, placeholders or bind variables (labelled '?').
INSERT INTO products (name, price) VALUES (?, ?);
Then, the DBMS compiles (parses, optimizes and translates) the statement template, and stores the result without executing it.
Execute: At a later time, the application supplies (or binds) values for the parameters of the statement template, and the DBMS executes
the statement (possibly returning a result). The application may execute the statement as many times as it wants with different values.
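Using Python's built-in sqlite3 module as one concrete client API (placeholder syntax varies between drivers), the prepare-and-bind workflow looks roughly like this; the products table mirrors the template above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")

# Statement template with '?' placeholders; values are bound at execute time,
# so they are never interpolated into the SQL text (which guards against SQL injection)
stmt = "INSERT INTO products (name, price) VALUES (?, ?)"
conn.executemany(stmt, [("widget", 9.99), ("gadget", 24.50)])
conn.commit()

for row in conn.execute("SELECT name, price FROM products WHERE price > ?", (10,)):
    print(row)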
Prepared statements offer two main advantages:
The overhead of compiling the statement is incurred only once. However, not all optimization can be performed at the time the statement
template is compiled, since the plan may depend on the specific values of the parameters, and the plan may change as tables and indexes
change over time.
Prepared statements are resilient against SQL injection, because parameter values, which are transmitted later using a different protocol,
are not compiled like the statement template.
DISTRIBUTED DATABASES
A distributed database appears to a user as a single database but is a set of databases stored on multiple computers within a network or
across networks. The data on several computers can be simultaneously accessed and modified across the network. Each database server in the
distributed database is controlled by its local DBMS and each cooperates to maintain the consistency of the global database.
Distributed databases can be homogeneous (identical DBMS in all locations) or heterogeneous (different DBMSs) and offer these advantages
Management of data with transparency
Increased reliability and availability
Easier expansion
Improved performance through the proximity of the data to its points of use and parallelism in query execution
Drawbacks include increased complexity, difficulty in maintaining data integrity, and concurrency and security issues.
Some of the strategies and objectives in designing a distributed database are
Data fragmentation - applied to partition the relations among network sites.
Data allocation - each fragment is stored at the site with optimal distribution
Data replication - increases the availability and improves the performance of the system
Location transparency - enables a user to access data without being concerned with the site at which the data resides
Data fragmentation
The following information is used to decide fragmentation
Quantitative information - cardinality of relations, frequency of queries, site where query is run, selectivity of the queries etc.
Qualitative information - predicates in queries, types of access of data, read/write etc.
Types of Fragmentation
Horizontal - partitions a relation along its tuples
Vertical - partitions a relation along its attributes
Mixed/hybrid - a combination of horizontal and vertical fragmentation
Distributed transactions
A distributed transaction is a transaction that includes one or more statements that update data on two or more distinct nodes of a
distributed database. Like any other transaction, a distributed transaction should include all four ACID properties.
Back
SERVER ADMINISTRATION
Connection handling
Network protocols
Database servers can support and service requests on several protocols at the same time, but clients connect with a single protocol. Some of
the available protocols are
TCP/IP - is a common protocol widely used over the Internet. It communicates across interconnected networks of computers that have
diverse hardware architectures and various operating systems. TCP/IP includes standards for routing network traffic and offers advanced
security features.
Named pipes - a named pipe is a one-way or duplex pipe that provides communication between a pipe server and one or more pipe clients. A
pipe is a section of memory that is used for interprocess communication (IPC). A named pipe (also known as a FIFO for its behavior) is an
extension to the traditional pipe concept on Unix-like systems. A traditional pipe is 'unnamed' and lasts only as long as the process. A
named pipe, however, can last as long as the system is up, beyond the life of the process, and can be deleted if no longer used. Usually a
named pipe appears as a file, and generally processes attach to it for IPC. Named pipes can be used to provide communication between
processes on the same computer or between processes on different computers across a network.
Generally, TCP/IP is preferred in a slow LAN, WAN or dial-up network, whereas named pipes can be a better choice when network speed is not
an issue, as they offer more functionality, ease of use and configuration options.
Shared memory - a simple protocol for same-computer communications only, i.e. clients can connect only to a server running on the same
machine; it is available on all platforms. For same-computer communications, shared memory tends to provide better performance than TCP/IP.
Connection pooling
Connection pooling is a technique that developers can use to share database connections among requesting clients. When a connection has
been created and is placed in a runtime object pool, an application can use that connection again. Each application does not have to perform
the complete connection process every time that it uses a connection.
When the application closes the connection, the connection is cached in the runtime object pool again. Connection pooling permits an
application to use a connection from a pool of connections that do not have to be reestablished for each use.
By using pooled connections, applications can realize significant performance gains because applications do not have to perform all of the
tasks that are involved in making a connection. This can be particularly significant for middle-tier applications that connect over a
network, or for applications that repeatedly connect and disconnect, such as internet applications.
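A toy pool built on a thread-safe queue (using sqlite3 purely as a stand-in for a networked DBMS driver; the class and parameter names are invented) illustrates the check-out/check-in cycle:

import queue
import sqlite3

class ConnectionPool:
    # Very small connection pool: connections are created once and then reused
    def __init__(self, size, dsn):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self):
        return self._pool.get()      # blocks if every connection is checked out

    def release(self, conn):
        self._pool.put(conn)         # 'closing' simply returns the connection to the pool

pool = ConnectionPool(size=2, dsn=":memory:")
conn = pool.acquire()
try:
    conn.execute("SELECT 1")
finally:
    pool.release(conn)               # reused by the next caller, avoiding the reconnect cost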
Performance improvement
Improving Disk I/O
Since disk I/O is a slow operation, high disk I/O is a common performance bottleneck. High disk I/O can be optimized by placing the
transaction log on a different volume equipped with a high-I/O device. All updates to the database pass through the transaction log, so the
speed at which transactions can be written to the log limits the speed at which data can be processed, and that speed depends on the device
where the log resides.
Types of hard disks
SAS (Serial Attached SCSI) is a faster and historically more expensive interface. Because SAS drives can rotate much faster (up to 15K RPM)
than SATA (Serial Advanced Technology Attachment) drives (typically 7.2K RPM), seek times can be more than twice as fast. SATA is less
expensive and suited for non-critical storage and archives, while SAS is expensive and fit for mission-critical, high-I/O workloads.
Solid-state drives (SSDs) can deliver substantially better I/O performance than traditional HDDs and are capable of driving tens of thousands
of I/O operations per second, as opposed to hundreds for HDDs. However, SSDs may not be as durable/reliable as HDDs and are not suitable for
long-term archival data.
Using RAID
RAID (redundant array of independent disks) is a disk system that contains multiple disk drives (array) to provide greater performance,
reliability, storage capacity, and lower cost. Fault-tolerant arrays are categorized in six RAID levels: 0 through 5. Each level uses a
different algorithm to implement fault tolerance.
A hardware disk array improves I/O performance because I/O functions, such as striping and mirroring, are handled efficiently in firmware.
Conversely, an OS–based RAID offers lower cost, but consumes processor cycles. When cost is an issue and redundancy and high performance
are required, RAID-5 volumes are a good solution.
Data striping (RAID 0) gives the best performance, but if one disk fails, all the data on the stripe set becomes inaccessible. A common
technique for RDBMSs is to configure the database on a RAID 0 drive and put the transaction log on a mirrored drive (RAID 1), obtaining
the best disk I/O performance for the database and maintaining data recoverability through a mirrored transaction log.
If data must be quickly recoverable, mirroring the transaction log and putting the database on a RAID 5 disk can be considered. RAID 5
provides redundancy of all data on the array. This allows for a single disk to fail and be replaced without system down time. RAID 5
offers lower performance, compared to RAID 0 or RAID 1, but greater reliability and faster recovery.
Back
BACKUP/RECOVERY
Backup
Hot and Cold backups
Hot backups are performed while the database is running. Hot backup does not block normal database operations and captures even changes
that occur while the backup is happening.
Cold backups are done while the database is stopped. To avoid service disruption, they are usually done from a replication slave, which
can be stopped without taking down the entire application or website.
Physical and logical backups
Physical backups consist of raw copies of the directories and files that store database contents. They are suitable for large, important
databases that need to be recovered quickly when problems occur.
Logical backups save information represented as logical database structure (CREATE DATABASE/TABLE statements) and content (INSERT
statements or delimited-text files). The backup does not include log or configuration files.
Physical backup methods are faster than logical because they involve only file copying without conversion.
Back
HIGH AVAILABILITY/SCALABILITY
High Availability
High availability (HA) is the ability of a system to be continuously operational for a desirably long length of time and to quickly recover
from any sort of failure to minimize interruptions for the end user. A database must be made highly available to ensure data is available
if a failure occurs.
A database can fail as a result of hardware failure, software failure, network interruption, human intervention or environmental causes
(fire, water etc.). Best practices for achieving high availability include
ensuring that multiple copies of data exist in different locations using techniques like log shipping, replication and mirroring
eliminating single points of failure or any node that would impact the system as a whole
continuously monitoring the health of servers
distributing resources geographically in case of power outages or natural disasters
implementing reliable crossover or failover. In terms of storage, a redundant array of independent disks (RAID) or storage area network
(SAN) are common approaches
setting up a system that detects failures as soon as they occur
Depending on the type of redundancy to be provided for high availability, database servers can be configured in the following
configurations
Cold standby - The secondary node acts as a backup for another identical primary system. In case the primary node fails, the secondary node
is started up and the data restored before replacing the failed node. Data from the primary node can be backed up on a storage system and
restored on the secondary server as and when required. This generally provides a recovery time of a few hours.
Warm standby - The secondary node is turned on periodically to receive updates from the primary node. Data is regularly mirrored to
the secondary node using disk based replication or shared disk. This generally provides a recovery time of a few minutes.
Hot standby - The primary and secondary nodes both run simultaneously, and data is mirrored in real time, so both nodes hold identical data,
but the secondary node does not process data or requests. This generally provides a recovery time of a few seconds.
Scalability
Scalability is the ability of a DBMS to handle an increase or decrease of a workload without affecting its productivity. A DBMS that is scalable is an adaptable one and should be able to grow with the company and handle varying workloads. DBMSs can be scaled either vertically or horizontally. Horizontal scaling involves adding (scaling out) or removing (scaling in) servers to handle the workload. Vertical scaling involves adding CPU, memory or other resources (scaling up) to the existing system or removing resources (scaling down).
Log shipping
Log shipping is a technology in which the backup is copied from the production server to a standby server and restored there, and then
transaction logs backed up on the production server are copied or shipped to the standby server and applied. This process keeps the two
databases synchronized.
The main advantages of log shipping are
It is simple and low-cost. All database dependencies are copied to the standby server (server dependencies such as logins are not)
It is minimally intrusive on the server
The disadvantages of log shipping are
Latency can be as low as one minute but can be more. During a backup, the log cannot be generated or shipped and during an indexing
operation, the log can be huge
Log shipping is not scalable to large numbers of databases per server
Replication
Replication is a set of technologies for copying and distributing data and database objects from one database to another and then
synchronizing between databases to maintain consistency. Using replication, data can be distributed to different locations and to remote
or mobile users over local and wide area networks.
Replication may be implemented in three different ways
Synchronous replication - when a primary server commits a transaction, all the secondary servers also commit the transaction and
acknowledge the same before the primary server marks the transaction as successful. This protects against data loss if the primary server
fails but it introduces a delay in completing a transaction
Asynchronous replication - committed data is placed in a queue on the primary server, but the primary server does not wait for the
acknowledgments of the secondary servers. Any data that did not get copied across the network to the secondary server is lost if the
primary server fails
Semi-synchronous replication - when a primary server commits, at least one of the secondary servers must send a receipt that the
committed data has been received and the data is committed later on the secondary servers. This provides improved data integrity compared
to asynchronous replication because when a commit returns successfully, it is known that the data exists in at least two places.
Master-slave replication
Master-slave replication enables data from one database server (the master) to be replicated to one or more other database servers (the
slaves). The master logs all its updates, which are then propagated to the slaves. Master-slave replication can be either synchronous or
asynchronous. Master-slave replication has an easy-to-maintain architecture, the consistency of the data is maintained, and analytical applications can read from the slave(s) without impacting the master. However, it is not scalable for writes, since a single server handles all client updates, and the master is a single point of failure, so some downtime will be involved if it fails.
Multi-Master replication
Multi-master replication, also known as peer-to-peer replication, consists of multiple master servers, each having its own copy of the
complete database, equally participating in an update-anywhere model. Updates made to an individual master are propagated to all other
participating masters. It is scalable, since multiple servers service clients, there is no single point of failure, and automatic failover is possible. However, database design becomes more complex to take care of multiple masters updating different copies of a single table. For example, auto-increment columns may not work, because two servers may insert a record into the table at the same time, causing a collision in the generated id value. Database integrity can also be compromised due to communication issues between masters.
Clustering
Clustering refers to the ability of several servers or instances to connect to a single database or multiple data nodes. An instance is
the collection of memory and processes that interacts with a database, which is the set of physical files that actually store data.
Clustering offers two major advantages, especially in high-volume database environments:
Fault tolerance - Because there is more than one server or instance for users to connect to, clustering offers an alternative, in the
event of individual server failure
Load balancing - The clustering feature is usually set up to allow users to be automatically allocated to the server with the least
load
The disadvantages of clustering include expensive setup and maintenance and possible performance issues with joins or range/scan queries.
Clustering architectures
Clustering takes different forms, depending on how the data is stored and the allocated resources.
The first type is known as the shared-nothing architecture. In this clustering mode, each node/server is fully independent, so there is
no single point of contention. Shared-nothing is also known as 'database sharding', wherein a single database is split to reside on
multiple servers and each server will service client requests based on which server holds the rows that need to be accessed.
By contrast, in a shared-disk architecture, all data is stored centrally and then accessed via instances running on different servers or
nodes.
In between the above two is grid computing or distributed caching. In this setup, data is still centrally managed but controlled by a
powerful 'virtual server' composed of many servers that work together as one.
Database sharding
Sharding is a database architecture pattern related to horizontal partitioning - the practice of separating a table’s rows into multiple
different tables, known as partitions. Database shards exemplify a shared-nothing architecture, which means that the shards are
autonomous and they do not share any of the same data or computing resources.
Sharding a database can help facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more
machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted
with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more
RAM or CPU.
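As a sketch of how an application tier might route rows to shards (the shard names and hashing scheme below are invented; production systems often use consistent hashing or a lookup service so that adding shards does not remap every key):

import hashlib

# Hypothetical shard map: each entry is an independent database server
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(customer_id: str) -> str:
    # Hash the shard key and map it onto one of the available shards
    digest = hashlib.sha1(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))      # every row for this customer lives on the same shard
print(shard_for("customer-77"))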
Back
SECURITY
Encryption
Encryption is the process of obfuscating data by the use of a key or password which can make the data useless without the corresponding
decryption key or password. Encryption does not solve access control problems, but enhances security by limiting data loss even if access
controls are bypassed.
Methods of database encryption
Transparent or External Database encryption - The database and all backups are encrypted using a single encryption key. Only the data
that is stored on physical media is encrypted. This method is also called transparent because when the data is queried and loaded into
system memory or cache, the contents are decrypted. One advantage of transparent or external database encryption is that since the data
is encrypted at the database level, it cannot be read or queried without the key.
The risk of using this method is that only data inside the stored database tables is encrypted and data that is queried or retrieved from
the tables and is in memory or cache is not encrypted.
Column-Level encryption - Instead of encrypting the entire database, column-level encryption allows for individual columns within
the database tables to be encrypted using separate encryption keys. Data is more secure because each encrypted column in the table
requires a separate decryption method. This makes it more difficult for all the data within each table to be accessed at the same time,
but slows down database index and search performance, because each column now needs a unique key.
Symmetric encryption - Symmetric encryption is an improvement over transparent encryption because not only is the database encrypted
when stored and decrypted when open and accessed, but it also requires the user to have a copy of a private key in order to view and read
the data.
Asymmetric encryption - In the asymmetric encryption method, there is one public key that is used to encrypt the data and a single
private decryption key per authorized user. The advantage here is that even if the key is accidentally shared, no one else would be able
to use the key that was created for a specific, authorized user. Asymmetric encryption and decryption are relatively resource-intensive,
but they provide a higher level of security than symmetric encryption. An asymmetric key can be used to encrypt a symmetric key for
storage in a database.
Application-Level encryption - In application-level encryption, all encryption is done at the application level. This means that
unlike transparent encryption where the stored data is encrypted, all data is encrypted by the applications that use it. When implemented,
data is encrypted before it is written to the database. Any attempts at a security breach would require knowledge of how to decrypt.
Encryption algorithms define data transformations that cannot be easily reversed by unauthorized users. No single algorithm is ideal for
all situations but the following general principles apply
Strong encryption generally consumes more CPU resources than weak encryption
Long keys generally yield stronger encryption than short keys
Asymmetric encryption is slower than symmetric encryption
Long, complex passwords are stronger than short passwords
Symmetric encryption is generally recommended when the key is only stored locally, asymmetric encryption is recommended when keys
need to be shared across the wire
If a lot of data is being encrypted, it should be encrypted using a symmetric key, and the symmetric key should be encrypted with an
asymmetric key (a sketch of this pattern follows this list)
Encrypted data cannot be compressed, but compressed data can be encrypted. If compression is used, data should be compressed before
encrypting it
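A sketch of the last two points, assuming the third-party Python 'cryptography' package is available: bulk data is encrypted with a symmetric (Fernet) key, and that key is then wrapped with an RSA public key for storage.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Symmetric key encrypts the bulk data (fast)
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"row data to protect")

# Asymmetric key pair; the public key wraps (encrypts) the symmetric key for storage
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = private_key.public_key().encrypt(sym_key, oaep)

# Later: unwrap the symmetric key with the private key, then decrypt the data
plaintext = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)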
Back
DATA WAREHOUSING
Data warehouses/Data marts
A data warehouse is an enterprise-wide solution for data collection. It may consist of numerous databases and data marts. Data warehouses
and data marts are critical components of decision support systems because they collect and organize the data upon which crucial decisions
are based.
The difference between a data warehouse and a data mart is debatable. In general, a data warehouse is used on an enterprise level, while
data marts are used on a business division/department level. A data mart only contains the required subject specific data for local
analysis.
Benefits of data warehousing
The main advantage of data warehousing is that the data stored for business analysis can most effectively be accessed by separating it from
the data in the operational systems.
The chief reason for separating data for business analysis from the operational data has always been the potential performance degradation
on the operational system that can result from the analysis processes.
Organizations that use a data warehouse to assist their analytics and business intelligence see a number of substantial benefits
Better data — adding data sources to a data warehouse enables organizations to ensure that consistent and relevant data from that
source is being collected. This ensures higher data quality and data integrity for sound decision making
Faster decisions — data in a warehouse is stored in consistent formats ready to be analyzed, and provides the analytical power and a more
complete dataset needed to base decisions on accurate facts, without relying on hunches or on incomplete or poor-quality data that risk
slow and inaccurate results
Business intelligence
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for
business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and create
new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data.
Business intelligence is a term commonly associated with data warehousing. Many of the tool vendors position their products as BI software
rather than data warehousing software. There are other occasions where the two terms are used interchangeably.
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or
data mart) system is the backend, or infrastructural, component for achieving business intelligence. BI also includes the insight
gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems).
Attributes of data warehouses
Data is usually combined from more than one operational system. This provides the ability to cross-reference data from these
applications.
Data is mostly non-volatile, i.e. after the data is loaded into the data warehouse, it is not modified.
Data is saved for longer periods than in transaction systems.
Setting up a data warehouse
After the tools and team personnel selections are made, the data warehouse project can begin. The following are the typical processes
involved in the data warehousing project cycle
Requirement gathering
Physical environment setup
Data modeling
ETL
OLAP cube design
Front end development
Performance tuning
Quality assurance
Rolling out to production
Production maintenance
Incremental enhancements
ETL (Extracting, Transforming and Loading)
ETL is the data warehouse acquisition process of extracting, transforming (or transporting) and loading data from source systems into the
data warehouse. Applications require real-time data for processing and there are a variety of ETL tools available that make the data
management task easier.
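A minimal, hand-rolled ETL step in Python (the file name, column names and staging table are invented) shows the three phases in order:

import csv
import sqlite3

# Extract: read raw rows from a source extract file
with open("sales_extract.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: cleanse and reshape - trim names, convert amounts, reject bad rows
clean_rows = []
for r in raw_rows:
    if not r.get("amount"):
        continue
    clean_rows.append((r["customer"].strip().title(), float(r["amount"])))

# Load: write the conformed rows into a warehouse staging table
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS stg_sales (customer TEXT, amount REAL)")
dw.executemany("INSERT INTO stg_sales VALUES (?, ?)", clean_rows)
dw.commit()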
ETL Tools
While evaluating ETL tools, the following characteristics need to be looked at
Functional capability - This includes both the 'transformation' piece and the 'cleansing' piece. In general, the typical ETL tools are
either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in
both. As a result, if the data is going to be dirty coming in, the ETL tool needs to have strong cleansing capabilities. If there are going
to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
Ability to read directly from the data source - For each organization, there is a different set of data sources. The ETL tool selected
must be able to connect directly to the source data
Metadata support - The ETL tool plays a key role in the metadata because it maps the source data to the destination, which is an
important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata
source. As a result, it is very important to select an ETL tool that works with the overall metadata strategy.
Popular ETL tools include - Data Junction, Ascential DataStage, Ab Initio, Informatica, Oracle Warehouse Builder (OWB), SeeBeyond ETL
integrator
ETL points of measurement:
Architecture
ETL functionality
Ease of use
Reusability
Debugging
Real-time
Connectivity
General ETL tool characteristics
Buy vs Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three
things:
Complexity of the data transformation - the more complex the data transformation is, the more suitable it is to purchase an ETL tool
Data cleansing needs - if the data needs to go through a thorough cleansing exercise before it is suitable to be stored in the data
warehouse, it is preferable to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build
the ETL routine from scratch
Data volume - available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial
product is a better approach if the volume of data transferred is large
Data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for
patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
Data mining has been defined as
- the nontrivial extraction of implicit, previously unknown, and potentially useful information from data
- the science of extracting useful information from large data sets or databases
Online Analytical Processing (OLAP)
OLAP is a computing method that provides the ability to analyze metrics in different dimensions such as time, geography, gender, product
etc. For example: sales for the company are up. Which region is most responsible for this increase? Which store in this region is most
responsible for the increase? What particular product category or categories contributed the most to the increase? Answering these types
of questions in order means that an OLAP analysis is being performed.
To facilitate this kind of analysis, data is collected from multiple data sources and stored in data warehouses, then cleansed and
organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time
period) derived from dimension tables in the data warehouse. Dimensions are then populated by members (such as customer names, countries
and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time
over relational databases.
OLAP operations on data cubes include (a small worked example follows this list)
Rollup - dimension reduction
Drill-down - introducing a new dimension
Slice - select a particular dimension from a given cube and provide a new sub-cube
Dice - select two or more dimensions from a given cube and provide a new sub-cube
Pivot - rotate the data axes in view in order to provide an alternative presentation of data
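Using pandas as a stand-in for a cube engine (the sales data below is invented), these operations map roughly onto pivoting, grouping and filtering:

import pandas as pd

# Toy fact table: sales by region, store, product category and month
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "East", "West"],
    "store":    ["E1",   "E2",   "W1",   "W1",   "E1",   "W2"],
    "category": ["Food", "Toys", "Food", "Toys", "Food", "Food"],
    "month":    ["Jan",  "Jan",  "Jan",  "Feb",  "Feb",  "Feb"],
    "amount":   [100,    80,     120,    60,     90,     70],
})

# A small 'cube': amounts summarized by region, category and month
cube = sales.pivot_table(values="amount", index=["region", "category"],
                         columns="month", aggfunc="sum", fill_value=0)

rollup    = sales.groupby("region")["amount"].sum()              # roll-up: drop the category dimension
drilldown = sales.groupby(["region", "store"])["amount"].sum()   # drill-down: add the store dimension
slice_jan = sales[sales["month"] == "Jan"]                       # slice: fix one dimension
dice      = sales[(sales["region"] == "East") & (sales["category"] == "Food")]  # dice: restrict two dimensions

print(cube, rollup, drilldown, slice_jan, dice, sep="\n\n")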
OLAP can be used for data mining, i.e. the discovery of previously undiscerned relationships between data items. An OLAP database does not
need to be as large as a data warehouse, since not all transactional data is needed for trend analysis.
OLAP products include IBM Cognos, Oracle OLAP and Oracle Essbase. OLAP features are also included in Microsoft Excel and Microsoft SQL
Server's Analysis Services.
Back
BIG DATA
Big data is high-volume, high-velocity and high-variety data that demands cost-effective, innovative forms of information processing for
enhanced insight and decision making. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization
and information privacy.
NoSQL
NoSQL ("Not Only SQL") represents a framework of non-relational databases that allows for high-performance, agile processing of information
at massive scale. i.e it is a database infrastructure that as been very well-adapted to the heavy demands of big data.
Unlike relational databases that are highly structured, NoSQL databases are unstructured in nature, trading off stringent consistency
requirements for speed and agility. NoSQL centers around the concept of distributed databases, where unstructured data may be stored across
multiple processing nodes and servers. This distributed architecture allows NoSQL databases to be horizontally scalable; as data continues
to explode, just add more hardware to keep up, with no slowdown in performance.
Hadoop
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to
scale up from a single server to thousands of machines, with very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these
clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce and YARN.
Hadoop Common - A module containing the utilities that support the other Hadoop components.
MapReduce - A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of
machines, in a reliable, fault-tolerant manner (the programming model is sketched after this component list).
HDFS - A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local
nodes to create a single file system.
Yet Another Resource Negotiator (YARN) - The next-generation MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It
enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
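The MapReduce programming model itself can be mimicked in plain Python; this is only an illustration of the map, shuffle and reduce phases, not Hadoop's actual API:

from collections import defaultdict

documents = ["big data needs big clusters", "data moves to data"]

# Map phase: each mapper emits (key, value) pairs - here, (word, 1)
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each reducer aggregates the values for one key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)                   # e.g. 'data' -> 3, 'big' -> 2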
MongoDB
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. A record in MongoDB
is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of
fields may include other documents, arrays, and arrays of documents.
MongoDB provides high-performance data persistence. Support for embedded data models reduces I/O activity on the database system.
Indexes support faster queries and can include keys from embedded documents and arrays.
MongoDB’s replication facility, called replica set, provides automatic failover and data redundancy. A replica set is a group of MongoDB
servers that maintain the same data set, providing redundancy and increasing data availability.
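Assuming the pymongo driver and a locally running MongoDB server, a minimal insert/index/query session might look like this (the database, collection and field names are invented):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["inventory"]

# A document with an embedded sub-document and an array field
db.products.insert_one({
    "name": "widget",
    "price": 9.99,
    "supplier": {"name": "Acme", "country": "US"},   # embedded document
    "tags": ["hardware", "small"],                   # array of values
})

# Indexes can include keys from embedded documents
db.products.create_index("supplier.name")

for doc in db.products.find({"price": {"$lt": 20}}):
    print(doc["name"], doc["price"])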
Oracle Big Data SQL
Provides single, optimized SQL query for distributed data
Supports multiple data sources, including Hadoop, NoSQL and Oracle Database
Includes automatic, extensible Oracle Database external table generation
Provides Smart Scan on Hadoop to minimize data movement and maximize performance
Ensures advanced security for Hadoop and NoSQL data
Back
SAP
SAP Basics
SAP is an abbreviation of Systems, Applications and Products in Data Processing. It is an ERP (Enterprise Resource Planning) package.
The SAP framework of applications communicates using the Application Link Enabling (ALE) protocol.
Layers in the SAP client-server architecture
Presentation layer - comprises the SAPGUI or user interfaces
Application layer - applications written in the ABAP language run within this layer and communicate with the other two layers
Database layer - is an RDBMS like Oracle, SQL Server or Informix
Basis
Basis is the SAP Kernel/Solution that
Provides a runtime environment for all SAP applications
Provides a stable architecture for system enhancements
Facilitates the distribution of resources and system components
Supports decentralization of system components [and non-SAP solutions].
SAP Modules
Finance (FI) components include:
FI-GL, FI-AR, FI-AP, FI-FC (Financial Control), FI-LC (Legal Consolidation), FI-FM (Financial Resources Control).
GL, AR and AP all meet basic documentation and accounting requirements.
LC and FM are built to provide retrieval of specific information for strategic decisions.
Sales/Distribution (SD) components include:
SD-SLS (Sales), SD-SHP (Shipping), SD-BIL (Billing), SD-CAS (Sales Support), Sales Information Systems.
Materials Management (MM) components include:
MM-MRP (Material Requirements Planning), MM-PUR (Purchasing), MM-IS (Purchasing Information System), MM-WM (Warehouse Management),
MM-IM (Inventory Management/Valuation), MM-EV (Invoice Verification).
Human Resources (HR) components include:
HR-ORG (organization and Planning...education and training), HR-P&C (Planning and Control), HR-PAD (Personnel Management), HR-TIM (Time
Management), HR-TRV (Travel Expenses), HR-PAY (Payroll Accounting).
Controlling (CO) components include (Supports all known accounting procedures):
CO-CCA (Cost Element Accounting), CO-ABC (Activity-Based Costing), CO-OPA (Order and Project Cost Accounting), CO-PC (Product Cost
Accounting), CO-PA (Financial Statement and Market Segment Accounting), CO-PCA (Profit-Center Accounting), CO-BPC (Company Control).
Known accounting procedures include accounting for actual costs, rigid and flexible planning cost accounting, activity-based costing,
and financial statements.
Back
The ABAP language
ABAP or Advanced Business Applications Programming is the programming language used in SAP. The ABAP workbench is a collection of tools that enable the development and maintenance of ABAP programs and applications. The ABAP editor is the tool used to write ABAP programs, class methods, function modules, screen flow logic, type groups and logical databases. The Screen Painter is a tool that enables creating screens for transactions. It is used to create the screen itself, with fields and other graphical elements and to write the flow logic behind the screen.
Transaction codes used in the ABAP workbench
SE24 - Class Builder
SE30 - Runtime Analysis
SE39 - Splitscreen editor
ABAP Datatypes
Elementary datatypes
C - Character
N - Numeric
I - Integer
P - Packed
X - Hexadecimal
F - Float
D - Date
T - Time
STRING - Character sequence
XSTRING - Byte sequence
Complex datatypes
Structures
Internal tables
Reference datatypes
Objects
Defining Datatypes
Predefined datatypes - built into the kernel
Local datatypes - that can be defined in ABAP programs
Datatypes in the ABAP dictionary - that are available to all programs in the ABAP system.
Predefined ABAP Datatypes
These data types are predefined in the R/3 System kernel, and are visible in all ABAP programs. Predefined types can be used to define
local data types and objects in a program and to specify the type of interface parameters and field symbols.
Predefined elementary ABAP types with fixed length
These predefined elementary datatypes are used to specify the types of individual fields whose lengths are always fixed at runtime. The
following table shows the different fixed-length data types. All field lengths are specified in bytes.
Data Type Initial field length Valid field length Initial value Meaning
Numeric types
I 4 4 0 Integer (whole number)
F 8 8 0 Floating point number
P 8 1 - 16 0 Packed number
Character types
C 1 1 - 65535 ' … ' Text field (alphanumeric characters)
D 8 8 '00000000' Date field (Format: YYYYMMDD)
N 1 1 - 65535 '0 … 0' Numeric text field (numeric characters)
T 6 6 '000000' Time field (format: HHMMSS)
Hexadecimal type
X 1 1 - 65535 X'0 … 0' Hexadecimal field
Data types D, F, I, and T fully describe the technical attributes of a data object. Data
types C, N, P, and X are generic. When you use a generic type to define a local data type in a
program or a data object, you must specify the field length and, in the case of type P, the
number of decimal places. When you use generic types to specify the types of interface parameters or field symbols, you do not have to
specify the technical attributes. The initial value (and the initial field length in the case of the generic types) are values that are used
implicitly in short forms of the TYPES and DATA statements.
The fixed-length predefined types are divided into Numeric, Character and Hexadecimal datatypes.
Numeric Datatypes
As well as the five non-numeric types (text field (C), numeric text field (N), date field (D), time field (T), and hexadecimal field (X)),
there are three numeric types, used in ABAP to display and calculate numbers. Data type N is not a numeric type. Type N objects can only
contain numeric characters (0...9), but are not represented internally as numbers. Typical type N fields are account numbers and zip
codes.
The value range of type I numbers is -2**31 to 2**31-1 and includes only whole numbers. Non-integer results of arithmetic
operations (e.g. fractions) are rounded, not truncated.
Type I data can be used for counters, numbers of items, indexes, time periods, and so on.
Type P data allows digits after the decimal point. The number of decimal places is generic and is determined in the program. The
value range of type P data depends on its size and the number of digits after the decimal point. The valid size can be any value from 1
to 16 bytes. Two decimal digits are packed into one byte, while the last byte contains one digit and the sign. Up to 14 digits are allowed
after the decimal point. The initial value is zero. When working with type P data, it is a good idea to set the program attribute
Fixed point arithmetic. Otherwise, type P numbers are treated as integers.
Type P data can be used for such values as distances, weights, amounts of money, and so on.
- Floating point numbers - type F
The value range of type F numbers is 1x10**-307 to 1x10**308 for positive and negative numbers, including 0 (zero). The accuracy
range is approximately 15 decimals, depending on the floating point arithmetic of the hardware platform. Since type F data is
internally converted to a binary system, rounding errors can occur. Although the ABAP processor tries to minimize these effects, type F
data should not be used if high accuracy is required. Instead, use type P data.
Type F fields are used when there is a need to cope with very large value ranges and rounding errors are not critical.
Using I and F fields for calculations is quicker than using P fields. Arithmetic operations using I and F fields are very similar to
the actual machine code operations, while P fields require more support from the software. Nevertheless, type P data has to be used to
meet accuracy or value range requirements.
Character types
Of the five non-numeric types, the four types C, D, N, and T are character types. Fields with these types are known as character fields.
Each position in one of these fields takes up enough space for the code of one character. Currently, ABAP only works with single-byte
codes such as ASCII and EBCDIC. However, an adaptation to UNICODE is in preparation. Under UNICODE, each character occupies two or four
bytes.
Hexadecimal type
The remaining non-numeric type - X - always interprets individual bytes in memory. One byte is represented by a two-digit hexadecimal
display. The fields with this type are called hexadecimal fields. In hexadecimal fields, single bits can be processed.
Predefined Elementary ABAP Types with Variable Length
These predefined elementary datatypes are used to specify the types of individual fields whose lengths are not fixed until runtime.
There are two predefined ABAP data types with variable length that are generically known as strings:
- STRING for character strings
A string is a sequence of characters with variable length. A string can contain any number of alphanumeric characters. The length
of a string is the number of characters multiplied by the length required for the internal representation of a single character.
- XSTRING for byte strings
A byte string is a hexadecimal type with variable length. It can contain any number of bytes. The length of a byte string is the
same as the number of bytes.
When a string is created as a data object, only a string header is created statically. This contains administrative information. The
actual data objects are created and modified dynamically at runtime by operational statements.
The initial value of a string is the empty string with length 0. A structure that contains a string is handled like a deep structure.
This means that there are no conversion rules for structures that contain strings.
Predefined Complex Data Types
ABAP contains no predefined complex data types that can be used to define local data types or data objects in a program. All complex
data types are based on elementary ABAP types, and are constructed in ABAP programs or in the ABAP Dictionary.
Field symbols
Field symbols are placeholders or symbolic names for other fields. They do not physically reserve space for a field but point to its
contents. They are like dereferenced pointers in C (i.e. pointers to which the contents operator * is applied).
All operations programmed with field symbols are applied to the field assigned to them. For example, a MOVE statement between two field
symbols moves the contents of the field assigned to the first field symbol to the field assigned to the second field symbol.
Some of the different ways of defining field symbols are shown below:
FIELD-SYMBOLS <FS> TYPE ANY TABLE.
FIELD-SYMBOLS <FS> LIKE LINE.
FIELD-SYMBOLS: <F1> STRUCTURE LINE1 DEFAULT WA,
<F2> STRUCTURE LINE2 DEFAULT WA.
Internal Tables
Internal tables are temporary storage areas where data processing can be carried out. Internal tables provide a means of taking data from
a fixed structure and storing it in working memory in ABAP.
Internal tables are declared as follows
DATA ITAB LIKE <TABLENAME> OCCURS 0 [WITH HEADER LINE].
The Header Line can be used as a work area while processing the internal table. The statements used to access internal tables are shorter.
INSERT <wa> INTO TABLE ITAB. - without a header line
INSERT TABLE ITAB. - with a header line
Types of internal tables
Indexed tables
Standard tables
Sorted tables
Hashed tables
Back
ABAP Database Access
There are two ways of accessing the database from a program -
Open SQL - Open SQL statements are a subset of Standard SQL that is fully integrated in ABAP which allow access to data irrespective
of the database system that the SAP installation is using
Native SQL
Open SQL
Open SQL consists of a set of ABAP statements that perform operations on the central database in the SAP System. The results of the
operations and any error messages are independent of the database system in use. Open SQL thus provides a uniform syntax and semantics for
all of the database systems supported by SAP.
ABAP programs that only use Open SQL statements will work in any SAP System, regardless of the database system in use. Open SQL statements
can only work with database tables that have been created in the ABAP Dictionary.
Open SQL consists of the Data Manipulation Language (DML) part of Standard SQL, i.e. it allows reading (SELECT) and changing (INSERT, UPDATE,
DELETE) data.
Open SQL also goes beyond Standard SQL to provide statements that, in conjunction with other ABAP constructions, can simplify or speed up
database access. It also allows you to buffer certain tables on the application server, saving excessive database access.
Native SQL
Native SQL is only loosely integrated into ABAP, and allows access to all of the functions contained in the programming interface of the
respective database system. Unlike Open SQL statements, native SQL statements are not checked and converted, but instead are sent directly
to the database system. When using Native SQL, the function of the database-dependent layer is minimal. Programs that use Native SQL are
specific to the database system for which they were written. Native SQL should be avoided wherever possible. It is used, however, in some
parts of the Basis System, for example for creating or changing table definitions in the ABAP Dictionary.
Logical databases
Logical databases are special ABAP programs that retrieve data and make it available to application programs. The most common use of
logical databases is to read data from database tables by linking them to executable ABAP programs.
An LDB provides a particular view of database tables. The data structure in an LDB is hierarchical. Many tables are linked to each other
using foreign key relationships.
Objects in ABAP
Local and global classes
Classes in ABAP can be declared either locally or globally. Global classes and interfaces are declared in the Class builder (Transaction
code SE24) in the ABAP workbench. All ABAP programs can access global classes.
Local classes are defined within an ABAP program and can be only used in the program in which they are defined.
Definition of local classes
The definition of local classes consists of two parts: The declaration part contains the declaration for all the components (attributes,
methods, events) of the class. The implementation part contains the implementation of all the methods of the class.
CLASS classname DEFINITION.
ENDCLASS.
CLASS classname IMPLEMENTATION.
METHOD method1.
......
ENDMETHOD.
ENDCLASS.
Visibility sections in classes
Public Section - All of the components declared in the public section are accessible to all users of the class and to the methods of the
class and any classes that inherit from it. The public components of the class form the interface between the class and its users.
Protected Section - All of the components declared in the protected section are accessible to all methods of the class and of classes that
inherit from it. Protected components form a special interface between a class and its subclasses.
Private Section - Components that are declared in the private section are only visible in the methods of the same class. The private
components are not part of the external interface of the class.
CLASS classname DEFINITION.
PUBLIC SECTION.
...
PROTECTED SECTION.
...
PRIVATE SECTION.
...
ENDCLASS.
Comparison of popular RDBMSs
Feature | Oracle | SQL Server | MySQL | PostgreSQL
Indexed views | | Supported | Not possible to create an index on a view |
Indexes on functions/expressions | Supported | Allows indexes on computed columns | Supported from MySQL 8.0 | Supported
In-memory tables | | Memory-optimized tables | Through the MEMORY storage engine |
Locking | | Multi-granular locking | Row-level locking (InnoDB) |
Online schema changes | Online table redefinition through the DBMS_REDEFINITION package | | Supported for most ALTERs like creating/dropping indexes | Supported for some ALTERs like adding nullable columns, dropping columns
OS support | Unix/Linux/Windows/Mac | Windows only | Unix/Linux/Windows/Mac | Unix/Linux/Windows/Mac
Partitioning | | | Horizontal partitioning is supported but not vertical partitioning |
Replication | Oracle multimaster replication supports both synchronous and asynchronous modes | Both synchronous and asynchronous modes are supported by Always On availability groups (AG) | Asynchronous by default; also supports semi-synchronous replication through an interface implemented by plugins |
Back
Maintained by: VINCENT KANDASAMY, Database Architect/Administrator (kandasf@hotmail.com)
Last updated: Nov 25, 13:24