In the first part of our Cloud Computing for Beginners series, we talked a little about the history, the service models and the main platforms in the segment. In this second part, we will dig deeper into two other fundamental components of the cloud: databases and storage. Clearly, it wouldn’t be possible to talk about cloud computing without talking about databases and storage. And here, too, it is necessary to consider different types of database and storage resources.
Let’s start with the first one: the Relational Database. The term is old and dates back to 1970, at IBM, to the research paper [pdf] “A Relational Model of Data for Large Shared Data Banks”, written by the computer scientist E. F. Codd.
What is a Relational Database and an RDBMS?
This is the most common type of database. It is a system that organizes information into “clean”, well-defined structures. A Relational Database Management System (RDBMS) accommodates a large number of records, serves data to many users simultaneously, and acts as a central data repository for computer programs (often called “applications”).
A database simplifies the task of data management, making information more accessible, secure and useful. One of the advantages of the relational model is its simplicity: it is easier to implement and manage than other known data models. All operations in a relational database are expressed in a language called SQL (Structured Query Language), which defines the commands sent to the database and what they should accomplish.
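To make this more concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are illustrative, and a real application would typically connect to a server-based RDBMS such as PostgreSQL or MySQL.

```python
import sqlite3

# In-memory database just for illustration; a real application would
# connect to a server-based RDBMS such as PostgreSQL or MySQL.
conn = sqlite3.connect(":memory:")

# SQL defines the structure of the data (DDL)...
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT UNIQUE
    )
""")

# ...and the operations on the data (DML).
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Ada Lovelace", "ada@example.com"))
conn.commit()

# A query declares *what* we want; the RDBMS decides how to retrieve it.
for row in conn.execute("SELECT id, name FROM customers WHERE email LIKE ?",
                        ("%@example.com",)):
    print(row)
```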
Another advantage is “concurrency control”, a concept that is easy to understand with an example: imagine an airline reservation system. There is a flight with one remaining seat and two people are trying to book that seat at the same time. Both check the flight status and are told that a seat is still available. They enter their payment information and click the reservation button at the same moment. What should happen? If the system is working properly, only one person should get the seat and the other should be informed that there is no longer a seat available, right? Concurrency control is what makes that magic happen.
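As a rough illustration of the idea, the sketch below (again using Python’s sqlite3, with an invented flights table) relies on a conditional UPDATE so that only one of two competing bookings can claim the last seat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER PRIMARY KEY, seats_left INTEGER)")
conn.execute("INSERT INTO flights (id, seats_left) VALUES (1, 1)")  # one seat left
conn.commit()

def try_to_book(connection, flight_id):
    # The WHERE clause makes the decrement conditional: if another booking
    # already took the last seat, no row matches and rowcount is 0.
    cur = connection.execute(
        "UPDATE flights SET seats_left = seats_left - 1 "
        "WHERE id = ? AND seats_left > 0",
        (flight_id,),
    )
    connection.commit()
    return cur.rowcount == 1

print(try_to_book(conn, 1))  # True  -> the first passenger gets the seat
print(try_to_book(conn, 1))  # False -> the second passenger is told it is sold out
```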
Transactional Database
To understand what a Transactional Database is, let’s first understand what the term transaction means. In technical terms, a transaction is a sequence of actions treated as a single unit: each action is executed on its own, yet the outcome of the whole depends on every one of them succeeding. Yes, it sounds confusing, but we’ll clarify it below.
A transaction is considered complete only when all the actions that are part of it succeed. However, if one of the actions fails, the transaction as a whole is considered a failure and all of its actions need to be reversed or undone.
The classic example is a bank transfer: it is only considered successful when the exact amount debited from one account is successfully credited to the other. If the amount is debited but not credited, the entire transaction must be rolled back to its initial state.
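Here is a minimal sketch of that behavior, using Python’s sqlite3 with hypothetical account data: the two updates are either committed together or rolled back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(connection, source, target, amount):
    try:
        # Both updates belong to the same transaction: either both are
        # committed or neither is.
        connection.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, source))
        connection.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, target))
        connection.commit()
    except sqlite3.Error:
        # Any failure (for example, the CHECK constraint on a negative
        # balance) rolls the whole transaction back to its starting point.
        connection.rollback()
        raise

transfer(conn, 1, 2, 30)       # succeeds: balances become 70 and 80
try:
    transfer(conn, 1, 2, 999)  # fails: the balance would go negative, nothing changes
except sqlite3.Error as exc:
    print("transfer rolled back:", exc)
```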
Microsoft defines a transactional database as follows:
Each database transaction has a defined starting point, followed by steps to modify the data in the database. In the end, the database confirms the changes to make them permanent or reverts the changes to the starting point, when the transaction can be tried again.
In almost all cases, the technology behind transactional databases is a relational database, which provides the proper support for transactions.
NoSQL database
A NoSQL database (so named because it is a “non SQL”, or “non-relational”, database) provides a mechanism for storing and retrieving data that is modeled differently from the tabular relations used in Relational Databases. NoSQL databases come in several types, depending on their data model. The main types are:
Document
Key-value stores
Column-oriented
Graph
These data models differ from the ones used by default in relational (SQL) databases, which makes some operations faster in NoSQL (and others slower, depending on the workload). The data structures used by NoSQL databases are also seen as “more flexible” than relational database tables, and each type tends to fit a specific kind of problem.
Many NoSQL databases emerged to solve problems specific to the contexts in which they were created. DynamoDB, for example, emerged at Amazon to solve shopping cart problems and started as a key-value database, but later evolved to become a document-oriented database as well.
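As a rough sketch of what key-value access on DynamoDB looks like, the snippet below uses the boto3 library; it assumes AWS credentials are configured and that a table named carts (an invented name) already exists with customer_id as its key.

```python
import boto3

# Key-value access on DynamoDB: the whole shopping cart is stored as a
# single item, keyed by customer. Table and attribute names are illustrative.
dynamodb = boto3.resource("dynamodb")
carts = dynamodb.Table("carts")

# Write the cart as one item.
carts.put_item(Item={
    "customer_id": "c-123",
    "items": [{"sku": "book-42", "qty": 1}],
})

# Read it back with a single key lookup, no joins involved.
response = carts.get_item(Key={"customer_id": "c-123"})
print(response.get("Item"))
```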
Apache Cassandra emerged at Facebook to solve the problem of searching the message inbox. It is a column-oriented database, but it has characteristics of other NoSQL types as well.
Redis is currently the most popular key-value database in the world. It emerged in the context of analyzing web log files in real time and then grew rapidly when it was sponsored by Pivotal Software (a subsidiary of VMware). It was later adopted by GitHub and Instagram, among many other companies.
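Below is a minimal key-value sketch with the redis-py client; it assumes a Redis server is reachable on localhost, and the key names are illustrative.

```python
import redis

# Connect to a local Redis server; decode_responses returns str instead of bytes.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a value under a key, with an optional expiration in seconds.
r.set("session:42", "user-123", ex=3600)

# Read it back, and count page views with an atomic increment.
print(r.get("session:42"))
print(r.incr("pageviews:homepage"))
```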
NoSQL databases are increasingly used in real-time web applications and fit perfectly into many modern workloads, such as mobile, web and gaming applications, which require flexible, scalable, high-performance and highly functional databases to provide excellent user experiences.
The number of NoSQL databases has increased considerably, and today it is very common to find multiple databases of different types in use at the same company, each one serving a specific purpose according to demand. As more companies adopted NoSQL solutions, cloud platforms started to offer this type of database as managed services. This makes adoption easier and more efficient, leaving companies free to focus on their business problems.
We will certainly talk about databases of different types many times on our blog, and this is only the first article. It is worth mentioning that, after the huge success of some NoSQL solutions, relational databases have also evolved to support more flexible data models in addition to their traditional structures. The database market as a whole has evolved considerably over the past ten years.
The DB-Engines.com website uses public data sources to build a popularity ranking of databases, which is very useful for assessing the level of maturity and adoption of each solution. Currently, the top 10 most popular databases worldwide are:
- Oracle (originally relational, but now supports several models)
- MySQL (originally relational, but also supports a document model)
- Microsoft SQL Server (originally relational, but also supports document and graph models)
- PostgreSQL (originally relational, but also supports a document model)
- MongoDB (document-oriented)
- IBM DB2 (originally relational, but also supports a document model)
- Redis (originally key-value, but now supports several models)
- Elasticsearch (document-oriented)
- SQLite (relational)
- Cassandra (wide-column)
Block Storage
The most common type of storage we know is file storage, or traditional file storage, which we use in Google Drive, for example. In this case, the data is organized and represented as a hierarchy of files inside folders.
With Block Storage, things change a little. It is an approach to data storage in which each storage volume works as an individual hard disk configured by the storage administrator. In this model, data is saved on the storage media in fixed-size pieces called blocks. Each block is associated with a unique address, and that address is the only metadata assigned to the block.
To manage block storage, an independent software layer controls how blocks are allocated and organized on the storage units. This software also handles data retrieval, using the metadata to locate the desired blocks and then assembling the data into complete files.
Block storage is typically abstracted by a file system or database management system (DBMS) for use by applications and end users. It is the type of cloud storage most compatible with traditional applications that need a “hard drive” to handle files.
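To illustrate the idea of addressing data by block, here is a small sketch in Python that reads one fixed-size block from a volume image. The file name and block size are illustrative, and real applications normally go through a file system instead of reading blocks directly.

```python
BLOCK_SIZE = 4096  # a common block size; the actual size depends on the volume

def read_block(path, block_number, block_size=BLOCK_SIZE):
    """Read one fixed-size block by its address (the block number)."""
    with open(path, "rb") as volume:
        # The block address is all we need to locate the data:
        # offset = block number * block size.
        volume.seek(block_number * block_size)
        return volume.read(block_size)

# "disk.img" is an illustrative path; on a real cloud volume this would be a
# block device attached to the VM, normally accessed through a file system.
data = read_block("disk.img", 10)
print(len(data))
```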
Object Storage
The origin of object storage (also known as object-based storage) dates back to the late 1990s and to a famous company called Seagate, also responsible, in 1980, for launching the first hard drive designed for personal computers. Object storage is a data storage architecture that manages and manipulates data as discrete units, called objects. Unlike other storage architectures, such as file systems (which manage data as a hierarchy of files) and block storage (which manages data as blocks), these objects are kept in a single flat repository rather than nested in folders. Each object includes its own data, a variable amount of metadata and a globally unique identifier.
The importance of object storage lies in the fact that it can deliver peak performance for static data. Besides that, it enables richer data analysis through the use of metadata, faster retrieval due to the absence of a folder hierarchy, and better use of resources through a high degree of customization.
Object storage can be implemented at several levels, including the device level (object storage device), the system level and the interface level. In each case, object storage seeks to enable features not addressed by other storage architectures, such as interfaces that can be programmed directly by the application, a namespace (a set of identifiers used to name and refer to objects) that can span multiple instances of physical hardware, and data management functions such as replication and distribution at object-level granularity.
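As a rough sketch, the snippet below stores and retrieves an object with custom metadata using the S3 API via boto3; the bucket name, key and metadata are illustrative, and it assumes AWS credentials are configured.

```python
import boto3

# S3 is a popular object storage service; every object carries its data,
# custom metadata and a unique key within the bucket.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-example-bucket",
    Key="photos/2021/trip-001.jpg",
    Body=b"...binary image data...",        # placeholder for the actual file contents
    Metadata={"camera": "phone", "album": "trip"},
)

# Retrieval is done by key, in a flat namespace: the "folders" in the key
# are just a naming convention, not a real hierarchy.
obj = s3.get_object(Bucket="my-example-bucket", Key="photos/2021/trip-001.jpg")
print(obj["Metadata"])
```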
Object storage systems allow the retention of large volumes of unstructured data and are, therefore, used by large platforms for certain purposes, such as storing photos on Facebook, songs on Spotify or files on online collaboration services, such as Dropbox.
Log Management
Another important component of the cloud is the log centralizer, which manages all system logs and is widely used to identify the source of problems.
A log is basically a file with data that can include anything from the server name to the log type, application, tags, network address and so on. Logs are created by developers to anticipate and prevent errors in the operation of an application, or even to understand user behavior. At every moment, huge volumes of log data are generated and need to be analyzed and debugged in real time. For this reason, it is essential to have a scalable, centralized log management solution that automates the process, so that monitoring, alerting and reporting keep business operations running with as few problems as possible on servers and systems.
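As an illustration, here is a minimal sketch of an application emitting a structured (JSON) log line in Python; the field names are illustrative rather than a fixed standard.

```python
import json
import logging
import socket
from datetime import datetime, timezone

# Emit each log entry as a single JSON line, which is easy for a
# centralized log system to parse, index and search.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")

def log_event(level, message, **fields):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "application": "checkout",
        "level": level,
        "message": message,
        **fields,
    }
    logger.info(json.dumps(record))

log_event("ERROR", "payment gateway timeout", order_id="o-981", region="us-east-1")
```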
Before the growth of the cloud, log files lived inside each server, along with its other files. With cloud architectures, however, the average size of servers has decreased and their number has greatly increased: we now have dozens or even hundreds of servers working together to meet customer demand. At that scale, it becomes very difficult to access each server individually to consult its log files, which motivates the use of a centralized structure that allows all of the data to be queried in one place.
In addition to the number of servers, the cloud offers a very interesting feature: Elasticity. It makes it possible to run a variable number of servers according to the demand of the moment. On peak days such as Black Friday, it is possible to have hundreds of servers processing customers’ purchases; right after Black Friday, demand is likely to drop, and customers can be served with a much smaller number of servers, perhaps hundreds of times fewer.
Because of this Elasticity, the cloud runs many servers that may be active for a few days, while others live for only a few minutes. Log messages generated on these short-lived servers are important and cannot be discarded. Thus, it is essential to have a centralized structure that can receive, process and make available the logs produced on all servers. This is exactly the role of Log Management.
The most popular Log Management systems today are Elasticsearch and Graylog, both open source solutions. There are also well-known enterprise solutions such as Splunk.
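Below is a minimal sketch of shipping a log message to a centralized collector, assuming a Graylog GELF HTTP input is enabled; the host name and port are illustrative and depend on your setup. In practice, an agent such as Filebeat or Fluentd usually forwards the logs rather than the application itself.

```python
import requests

# A GELF message sent to a (hypothetical) Graylog HTTP input.
gelf_message = {
    "version": "1.1",
    "host": "web-042",                 # the short-lived server that produced the log
    "short_message": "order processed",
    "level": 6,                        # syslog severity: 6 = informational
    "_order_id": "o-981",              # custom fields are prefixed with "_"
}

# Host name and port are illustrative; adjust to your own Graylog input.
requests.post("http://graylog.internal:12201/gelf", json=gelf_message, timeout=5)
```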
We will talk more about these solutions in a future article on Monitoring and Observability (this is a concept that has been widely adopted).