Database systems are used for a variety of use cases
Temporary hot data storage.
Long-lived cold storage.
Complex analytical operations
Access value by key
Time-series data storage
Storage large blob
There are different ways to group the databases, one of the popular way group databases into 3 categories
OLTP (Online Transaction Processing) databases: Handles a large number of user-facing requests and transactions. Queries are often short-lived.
OLAP (Online Analytical Processing) databases: Handles complex aggregations. Used for analytics queries.
HTAP (Hybrid Transactional & Analytical Processing) databases
Another way of categorization is
Document oriented stores
DBMS architecture is based on a client-server model
Nodes are server
Applications are client
It is responsible for
Communication across the nodes in the cluster.
Communication between client to the database server.
Query Parser: Parse and validate the query and perform an access control check.
Query Optimizer: Optimizes the query and generates the execution plan.
For remote execution send the query to the remote nodes and collect results.
For local execution send the execution plan to the storage engine to execute.
A storage engine is responsible for storing and access of the data.
It has the following components:
Transaction Manager: Manages the transaction and makes sure the database is always in a consistent state.
Lock Manager: Lock the database objects and make sure database integrity.
Access Method: Manage access and organization of the data on the disk.
Buffer management: Cache data pages in the memory
Recovery Manager: Responsible for recovering database from recovery from crash.
Memory vs Disk-based DBMS
Database system stores the data in memory as well as on the disk.
In-memory databases: Store data primarily in memory and use disk for logging and recovery.
Disk-based databases: Store data primarily on disk and use memory for caching purposes.
Durability in Datastores
Durability: Once data is stored in the database it shouldn’t be lost.
In-memory databases maintain the backup on the disk to provide durability guarantees and prevent the loss of the volatile data.
Whenever the database receives an operation to process, it first stores it in the log, eg. WAL logs in B-Tree based databases.
To avoid a complete replay of logs in case of failure or crash, the database also creates a backup copy.
Logs are applied to backup copy in batches. After the batch of logs is processed, the database marks this point as a snapshot, and logs up to this point can be discarded. This process is called checkpointing.
Column Vs Row oriented database
Most of the data systems store data as a set of records, which consists of rows and columns in a table.
The field is the intersection of the column.
A row represents an entity with a set of properties(Fields)
There are two ways of storing the table in the disk:
Horizontally: Store values belonging to the same row together.
Vertically: Store values belonging to the same column together.
Row oriented databases are MYSQL, PostgreSQL, SQLServer, etc.
Column-oriented databases are MonetDB and C-Store.
Row oriented data layout
Row oriented databases store data in rows.
This approach is better in scenarios where you want to access a row (entity) based on the keys. In such scenarios storing rows together improves spatial locality.
Column-oriented data layout
Column-oriented datastores partition data vertically (by column). Values for the same column are stored contiguously on disk.
Column-oriented databases are a good fit for analytical purposes.
Column-oriented datastores are Apache Kudu, ClickHouse, etc.
Data files and Index Files
The primary goal of the database is to store the data and allow to access it.
Why database don’t store the data in plain files?
Database store data in specialized files because
Storage Efficiency: The specialized files organizes data in an optimized way.
Access Efficiency: Data can be accessed at the smallest possible time.
Update efficiency: Update should be optimized to minimize the number of changes on the disk.
In the database, data is stored in the tables, and each table is stored in a separate file.
Lookup on data is done via a search key.
Databases use an additional data structure known as an index, which helps in locating data efficiently without scanning the entire table.
Data files store the data and Index files store the records metadata and use it to locate the record.
Files are partitioned into pages, which typically have the size of a single or multiple disk blocks.
Following are different ways to implement the data file:
Index organized tables
Heap organized files
Hash organized files
An index on the primary data file is called the primary index.
If the order of the data follows the order of the search key, then the index is known as the clustered index.
If the order of data doesn’t follow the order of keys, then the index is known as an unclustered index.
The Tech Platform