What is a database and why would you use one? This article explores some of the most commonly asked questions and frequently overlooked aspects of databases.
I also include some of the best questions I’ve received in my 3rd and 4th year lectures on databases, to try to paint a quick and concise picture of some of the most widely used technologies of the 21st century so far.
- What is a database?
- Some of the big concepts in a database
- What is a data warehouse?
- Considerations when building your own platform
What is a database?
Databases in 30 seconds
If you need to store some data, for example a username and password, you need to store it in a file: non-volatile storage, so that if you turn the power off and back on again, the data is still there.
If you are working without a database you have to write the code to
- Open a file
- Close a file
- Create: add to the file, usually at the bottom but it could be anywhere inside
- Read: search the file for data
- Update: edit data in the file
- Delete: remove data from the file, from anywhere
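The bookkeeping above can be sketched in a few lines. This is a minimal illustration only, assuming a plain-text file of `username:password` records; the filename and record format are invented for the example:

```python
# Minimal file-based CRUD, assuming one "username:password" record per line.
# The file name and format here are invented for illustration.
from pathlib import Path

DB_FILE = Path("users.txt")

def create(username, password):
    with DB_FILE.open("a") as f:          # open, append at the bottom, close
        f.write(f"{username}:{password}\n")

def read(username):
    if not DB_FILE.exists():
        return None
    with DB_FILE.open() as f:             # search the whole file for the data
        for line in f:
            user, _, pw = line.rstrip("\n").partition(":")
            if user == username:
                return pw
    return None

def update(username, new_password):
    # Editing in place means rewriting the entire file.
    lines = DB_FILE.read_text().splitlines()
    out = [f"{username}:{new_password}" if l.split(":")[0] == username else l
           for l in lines]
    DB_FILE.write_text("\n".join(out) + "\n")

def delete(username):
    lines = DB_FILE.read_text().splitlines()
    out = [l for l in lines if l.split(":")[0] != username]
    DB_FILE.write_text("\n".join(out) + "\n" if out else "")

create("alice", "s3cret")
create("bob", "hunter2")
update("alice", "newpass")
delete("bob")
print(read("alice"))   # newpass
print(read("bob"))     # None
```

Notice that none of this handles two users writing at once, corruption, or backups: that is exactly the list of headaches that comes next.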
Now if you are working in a big environment with multiple users you have a few more headaches.
- Multiple users updating the same piece of information
- Multiple users searching whilst other users are updating data
- Fixing corrupt data
- Making backups
- Where high availability is required, making sure the file is identical on multiple machines
That is a lot of complicated code. Instead you could download and use a database management system (DBMS) which has all this code built into it and is quite often FREE.
Why do you use databases? Because they solve all of these problems for you: reusability at its finest.
Since Edgar F. Codd proposed the relational model in 1970 the concept has grown and grown. Regardless of specifics, a database has some standard components.
- There is an external program that wants to access the data
- There is a connector, such as ODBC (Open DataBase Connectivity), which allows the data to pass in and out of the database
- There is a DBMS (DataBase Management System), the computer program which does all the managing of the data. More often than not the DBMS comes with an administration program as well
- Then the data is stored in a specific format or file system. See Data Storage Engines below
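The same pipeline is visible in miniature with SQLite, which ships with Python. This is a sketch only: here the `sqlite3` module plays the role of the connector (a DB-API driver rather than ODBC, but the idea is the same), SQLite itself is the DBMS, and the storage format is a single file on disk. The table and file names are invented:

```python
import sqlite3

# The "external program" is this script; the sqlite3 module is the connector;
# SQLite is the DBMS; test.db on disk is the storage format.
conn = sqlite3.connect("test.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, pw TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "s3cret"))
conn.commit()
row = conn.execute("SELECT pw FROM users WHERE name = ?", ("alice",)).fetchone()
print(row[0])   # s3cret
conn.close()
```

Compare this with the hand-rolled file code earlier: the DBMS now owns the opening, closing, searching and locking.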
The people who set the options of the database are called DataBase Administrators, or DBAs. They will usually be experts in the administration program that comes with the database.
DBAs tend to become masters of their particular database, as there are usually a lot of settings to tune. It can take many years to master the finer settings of a database to suit the application’s needs.
OLTP vs. OLAP
OnLine Transaction Processing Vs. OnLine Analytical Processing
A system designed for high-volume, high-speed data storage is called an OLTP system. Think of the data in a till system: every time you hear a beep at the till, that is another record going into the database.
OLAP is designed for data-intensive reporting. When you have a lot of data to report on, you can’t just turn the till off and wait for the report to finish, so it’s often better to copy the data to another database, usually called a data warehouse.
There are good technological reasons for having the two different types. Most students learning to program will only ever play with OLTP solutions, but as a data scientist, processing large volumes of data requires different knowledge. The difference can be a report taking 2 hours to generate on an OLTP system versus 20 seconds on an OLAP system.
“If the only tool you have is a hammer, you tend to see every problem as a nail.” Abraham Maslow (1908-1970)
Pick the right tool for the job.
Some of the big concepts in a database
Data Storage Engines
Different programs have different ways of managing the data internally; these are called data storage engines. Some DBMSs can only use one, whilst others give the administrators a choice.
Over the years DBMSs have found different and better ways of storing different types of data, and knowing your options allows you to choose what’s best for you.
When databases first came out, non-volatile storage was very expensive and data storage was charged by the kilobit. In 1967, 1 megabyte of hard drive space cost around US$1M.
So databases were programs that managed a lot of cost, and for that reason they were very precise in their initial design.
Over time the underlying principle of precision has remained, yet people are a bit more generous with data storage as a result of the plummeting cost of non-volatile storage.
MySQL 8.0 ships with multiple storage engines, with InnoDB as the default. The user never notices any difference, but the performance of an engine varies greatly with the type of data, the volume of data and the database configuration settings.
So what are you storing in there?
Over the years, the types of data being stored have changed greatly, and as a result many storage engines have cropped up to manage them.
Initially it was simple data records, but over time that requirement changed incredibly.
Google wanted to store website information and couldn’t find any engine fast enough, so they went and created their own. https://en.wikipedia.org/wiki/Google_File_System
Digital images suddenly meant the storage and management of BLOBs (Binary Large OBjects) of data by the DBMS, stimulating the invention of solutions like MongoDB with GridFS to better handle the challenge.
There were also a myriad of other media challenges. Finding words in audio files and songs. Optical Character Recognition in scanned documents or in PDF documents. The list is extensive and research into many of them, as previously stated, can become a lifetime of learning.
The biggest challenge came with video files, which are quite simply enormous. YouTube arguably solved it best at global scale, and they use MySQL for their metadata.
Scanning through security footage for a particular face produces monstrous data sets. The data generated by a Boeing 787 flying from London to New York in 2013 was an estimated half a terabyte.
Indexes
Imagine you walk into a library wanting to find a book by a particular title or author. The Dewey Decimal System categorizes every book with a seemingly unreadable number printed at the bottom of the spine. What you really want is a simple card index with all the books in name order: you go to the right letter and instantly you have the data you want.
The index repeats data, because the names are already on all the books, yet it massively speeds up your searching: you sacrifice data storage space for a speed gain.
You can create indexes on almost anything in a database, so when you need indexes you create them, knowing you will lose some space.
The most common example I give is when a phone number or customer number field is often searched. If that’s what your business searches by, rather than by name, then it deserves an index.
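Sketching this with SQLite (the principle is the same in MySQL and others; the table and column names are invented): before the index the DBMS must scan every row, and after `CREATE INDEX` the query plan switches to an index search, at the cost of extra storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(f"cust{i}", f"555-{i:04d}") for i in range(10_000)])

query = "SELECT name FROM customers WHERE phone = ?"

# Without an index: the plan is a full scan, checking every row.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("555-1234",)).fetchone()
print(plan_before[-1])

# Trade storage space for speed: the business searches by phone, so index it.
conn.execute("CREATE INDEX idx_phone ON customers (phone)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("555-1234",)).fetchone()
print(plan_after[-1])   # the plan now reads "... USING INDEX idx_phone ..."
```

The exact wording of the plan varies by SQLite version, but the switch from a scan to an index search is the point.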
Rollback
Rollback is one of the database overheads that is both brilliant and massively challenging.
Imagine you transfer money from account 1 to account 2. It’s one transaction made of two operations.
- Add money to account 2
- Subtract money from account 1
What if the computer turns off precisely in the middle, due to a power failure? That wouldn’t be good. So we have to keep an audit trail:
- Start an audit log to say what we’re going to do and store it in case the power fails
- Add money to account 2
- Subtract money from account 1
- Go back and delete the audit log from the storage
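That audit-log machinery is exactly what a transaction gives you for free. A sketch with SQLite, with the account numbers and balances invented for the example: if anything fails mid-transfer, `rollback` restores both accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    # The DBMS writes its journal (the "audit log") before touching the data.
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    raise RuntimeError("power failure mid-transfer")   # simulated crash
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.commit()
except RuntimeError:
    conn.rollback()   # undo via the journal: both accounts left untouched

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)   # {1: 100, 2: 0}
```

Either both operations happen or neither does; the half-finished +50 never survives the simulated crash.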
Changing data in RAM is very fast; writing to and from the hard drive is very slow. This one capability of rolling back introduces massive operational overhead. What if your database is mostly reads? Instead of each operation being just “read the data”, it becomes “write an audit log, read, delete the audit log”.
Different systems have different options for turning this feature on and off.
If you’re a bank you can’t live without it.
If you are mostly a reading service, where transfers like the one above rarely happen, then you can reasonably dispense with rollback and enjoy a massive performance improvement.
Caching
Cache tweaking is a dark art in databases, because data is kept in RAM in a predictive way.
If you look up customer 1’s transaction, chances are you’re going to look at transactions 2 and 3 as well. So the DBMS will proactively load that data into RAM, the cache, so that when it’s needed it’s almost instantly available.
Space is always a consideration, so there are limits on how much information to cache and in which scenarios, depending on the hardware available. Cache choices, like rollback choices, can result in massive instant performance improvements.
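A toy sketch of that read-ahead idea, with the record IDs and the prefetch window size invented for the example: when one transaction is requested, its neighbours are pulled into an in-memory cache so follow-up lookups never touch the slow store.

```python
# Toy read-ahead cache: fetching one record pre-loads its neighbours.
# The record IDs and window size here are invented for the example.
SLOW_STORE = {i: f"transaction-{i}" for i in range(1, 101)}  # stands in for disk
disk_reads = 0

cache = {}
WINDOW = 3   # how many neighbouring records to prefetch on a miss

def fetch(txn_id):
    global disk_reads
    if txn_id in cache:
        return cache[txn_id]          # RAM hit: almost instant
    # Cache miss: go to "disk" and prefetch the neighbours too.
    for i in range(txn_id, txn_id + WINDOW):
        if i in SLOW_STORE:
            disk_reads += 1
            cache[i] = SLOW_STORE[i]
    return cache[txn_id]

fetch(1)              # miss: reads records 1, 2 and 3 from "disk"
fetch(2); fetch(3)    # hits: served from the cache
print(disk_reads)     # 3
```

A real DBMS cache is vastly more sophisticated (page-based, with eviction policies), but the space-for-latency trade is the same.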
Optimisation
Data is stored sequentially where possible (there are lots of scenarios where this isn’t possible, but let’s keep it simple).
Optimisation is the overhead process of reshuffling things to get them back into the best order for performance.
There are different optimisation processes which free up hard drive space and cache space, and they can greatly improve database performance. Forgetting to optimise your database regularly can cause it to grind to a halt over time.
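In SQLite this reshuffling is the `VACUUM` command (`OPTIMIZE TABLE` plays a similar role in MySQL). A sketch, with the file and table names invented: after deleting a table’s rows the file keeps its bloated size until a vacuum rewrites it and hands the space back.

```python
import os
import sqlite3

path = "bloated.db"
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE log (entry TEXT)")
conn.executemany("INSERT INTO log VALUES (?)",
                 [("x" * 100,) for _ in range(20_000)])
conn.commit()

conn.execute("DELETE FROM log")          # data gone, space not reclaimed
conn.commit()
before = os.path.getsize(path)

conn.execute("VACUUM")                   # rewrite the file in optimal order
after = os.path.getsize(path)
conn.close()
print(before > after)   # True: the vacuum reclaimed the dead space
os.remove(path)
```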
What is a data warehouse?
The point of it
As previously mentioned there is a big difference in OLTP and OLAP.
Imagine you’re doing a report in an Excel spreadsheet, counting up all the sales for a month. In the middle of adding up all the sums by hand, someone comes along and adds another record to your spreadsheet, and as a result your totals are now out.
To prevent this happening, you decide: “when I’m generating my report, I’m going to lock the data so I can finish my report.”
If you told a major supermarket chain “we have to lock all the tills until we have added up all the sales for the previous month for all the stores”, there would be very unhappy employees and very unhappy customers.
Instead, you ship your data out to a data warehouse, and the warehouse can lock itself up for long stretches generating reports.
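A minimal picture of that ship-out, using two SQLite databases as stand-ins for the till system and the warehouse (the store names and amounts are invented; amounts are in cents to keep the sums exact): the OLTP side is touched only long enough to copy, and the heavy report then runs against the warehouse copy.

```python
import sqlite3

# Two databases stand in for the till system (OLTP) and the warehouse (OLAP).
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")

oltp.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")  # cents
oltp.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("dublin", 999), ("dublin", 450), ("cork", 1200)])
oltp.commit()

# The "ship out": copy the day's rows across, then leave the tills alone.
olap.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
olap.executemany("INSERT INTO sales VALUES (?, ?)",
                 oltp.execute("SELECT store, amount FROM sales").fetchall())
olap.commit()

# The long-running report touches only the warehouse copy.
report = olap.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store").fetchall()
print(dict(report))   # {'cork': 1200, 'dublin': 1449}
```

Real warehouses copy incrementally and on a schedule, but the separation of duties is the same.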
Whilst the data is in the warehouse you can also do experimental reporting, using advanced maths in the form of algorithms. You throw an algorithm at the data and it finds patterns. This is the territory the expression “Big Data” comes from. These algorithms can lead to astonishing finds.
One of the most famous tales in this area is that of diapers and beer.
If you go into a shop to buy diapers, what is the most common thing also to be found in the shopping?
The most unlikely of answers was beer, as a result of fathers picking up a six-pack whilst sent out to buy diapers.
If the data just stays in the warehouse and isn’t analysed, it is known as “Dark Data”, as quite literally no one is shining a light on it to see what’s in there.
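The diapers-and-beer finding is, at its simplest, a co-occurrence count over shopping baskets. A toy sketch, with the baskets entirely invented: count how often each item appears alongside diapers.

```python
from collections import Counter

# Invented shopping baskets for the example.
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "milk"},
    {"diapers", "wipes"},
]

# Count what else shows up in baskets containing diapers.
alongside = Counter()
for basket in baskets:
    if "diapers" in basket:
        alongside.update(basket - {"diapers"})

print(alongside.most_common(1))   # [('beer', 3)]
```

Real market-basket analysis uses association-rule algorithms over millions of baskets, but the underlying question is exactly this one.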
Considerations when building your own platform
Whether you are building a college project, an experimental solution, or a real-time, million-user, high-availability solution, it doesn’t hurt to have a chat with people who’ve been there and done that.
The questions you need to ask are:
- Does the DBMS cost money or is there a free equivalent?
- How can I make the data highly available?
- How do I ensure backups and restoration?
- How is encryption implemented, to protect against theft?
- What storage engine do I need to use? This is usually dependent on the data you’re storing
- How can you report on slow queries and performance issues?
- What DBMS experience do you have? This will tend to influence your choice
- What analytics and reporting do you plan to do after it’s built and working?
I have been building, designing, tweaking and doing my master’s thesis on the MySQL platform for over a decade. I’ve connected solutions to over 10 warehouses of very different technologies. I am still amazed at how much can be done, optimised and tweaked to improve the performance of solutions.
If you’d like to chat I’d be very happy to discuss your ideas with you and happy to offer you solutions if you’re experiencing some headaches. Please feel free to get in touch here or on social media.