Definition
Today I am going to discuss the difference between SQL and NoSQL databases I'm going to start out by defining what SQL and NoSQL is followed by some general guidelines on how to pick one over the other when starting a new project so let's start out by defining what SQL is so SQL stands for structured query language and the language typically looks like what we have on the right here so SQL allows you to perform some create read update or delete also known as crud operations through a universal language that is pretty much consistent across multiple different underline relational database engines so as I said there's multiple different underlying relational database engines and these typically are things like my SQL Postgres or Microsoft SQL Server so there's four key components that I like to talk about about databases so there's structure storage scale and access so let's quickly go through these and talk about how it relates to SQL so starting the top left with structure you typically interact with tables in a relational database engine.
Example
As an example we have a accounts table here and a table consists of rows and columns the columns correspond to the types and the rows correspond to the individual entities that exist in the table so we have an account ID here we have a creation ID and a country ID and every row here is an entity or an element so a1 is the account ID at a creation date of September 30th and it's contrary is number one now in a SQL table you must have a primary key and a primary key corresponds to the unique identifier that identifies a specific row so in this Rose case the primary key is a 1 you have attributes that correspond to just general data about the record and then you can also have things called foreign keys which are links to other tables now SQL databases also enforce constraints so for instance down here we're seeing that the account ID column is only allowed to be a variable character or a string similarly the creation date is only allowed to be a date and the country ID is only allowed to be an integer these constraints make it such that if a user's trying to insert or update and the data violates the constraint that's imposed on the table the operation will fail so this is a great way to enforce consistency across the content of your tables thirdly possibly the most important is that we have relations and let's just bring up a second image here that has a relationship so in this table we have the accounts which was what we originally had previously and we saw that we had a country ID with this vague name here so country ID 1 and separately over here we also have a country table that exists on its own and in there are the identifiers x' for each of these country IDs so country ID one corresponds to USA two for Canada three frost area so on and so forth so what SQL databases allow us to do is build relationships so we can say country ID 1 corresponds to USA 2 to Canada and 3 to Australia what this allows us to do is perform some very flexible queries such as give me all the accounts where the country ID is equal to USA now in terms of storage the storage pattern is concentrated so in a relational database engine there's typically one node that contains an entire copy of your data it's typically not partitioned or segregated in any way unless you're using some advanced strategies now down to the bottom left in terms of scale there's two approaches here so the first one is vertical so if you have a machine that's hosting your database engine and you're suffering in the performance area the option here is to build a better machine so have more memory get a better CPU at a faster hard-disk these are ways to vertically scale your machine so that you can get better performance keep in mind here that there is a physical limitation in terms of how powerful a machine that you can make so there is an upper ceiling on vertical scaling the second approach is horizontal which just means adding more machines and when you add more machines in a horizontally skilled RDS environment you typically perform that by distributing your data across multiple nodes so in a horizontally scaled environment you have one master and then you have multiple read replicas so when traffic is coming in not only can that traffic read off of your master but can also read off of your read replicas so what this allows you to do is to offload some of the pressure from the initial master node on to read replicas so that you can ensure optimal performance on your database now in terms of access it's typically raw SQL so writing the raw create read update and delete syntax for your query you typically require a direct database connection to the endpoint of the database and typically these days people are using object relational mapper z-to construct their queries and what these basically are are abstractions that people use to add criteria to an object in a very programmatic way and then allow that to generate a SQL statement that can be run against the engine so we've basically covered the four key pillars of SQL here let's move on to talking about what is no SQL so although it probably doesn't need to be said it's anything that is non relational and there's many different implementations of no SQL so it's kind of an overloaded term there's implementations that use table structures similar to what we saw in the SQL approach some use document and some use graph and the basic idea here is that no SQL is built the scale with high performance but it also comes at a cost and that cost is that your queries are less flexible so let's go through the same exercise as we did before and talk about the structure storage scale and access so in terms of structure it is very implementation dependent so as I said there's table implementations there's document implementations that store JSON objects under the hood and there's also graph databases so say for instance something like a Facebook social network where you have friends and those friends also have friends a graph database is a perfect representation of that domain but the general theme among all these is that they rely on key value stores so generally in a no SQL database you need to know the key that you're looking for when you are performing your query in terms of storage it relies on hashing the input so if you have a key that you're looking for that key is given into a hashing function and the result of that hashing function is a value and that value is distributed on to one of multiple nodes so say for instance if we add key one to three the key 1 to 3 would be inserted into the hashing function the output would be a physical location on partition 1 where that data row is stored and if you think about what this looks like from the read perspective now the same kind of concept applies so the same hashing is run when the key is provided and the engine knows exactly which partition to find the record on in terms of scale it's a very simple process and that is to simply add more partitions so instead of having the two partitions like we had in the top right in this example we can just add four partitions and allow those individual partitions to scale individually so add more disk storage add more memory and more CPU typically in an SQL database engine all of this is managed for you so you don't have to worry about the underlying details on how to add more partitions now moving on to access there's two major flavors here the first one is REST API Zen you use REST API is to hit a specific endpoint that has a certain functionality associated with it optionally you can perform create read update or delete in vendor specific languages so if you use vendor like dynamo DB that will a very different way to perform a query as something like MongoDB which isn't another no SQL option so now that we know a little bit about SQL and no SQL let's move on to talking about when you should use what so here we have a SQL.
NoSQL table so let's firstly go through why
you would want to use SQL so when your access patterns aren't defined if you
don't know how your business use case is going to evolve and you're not sure if
the way you're storing your data is going to allow you to query it in an
effective way later then using SQL is definitely the preferred option secondly
when you want to perform flexible queries this kind of ties into the first
point if your access patterns aren't defined you won't know what queries you're
going to perform so it's important to stay flexible so you can adapt to
changing use cases third when you want to perform relational queries so this
really is the bread and butter of SQL engines if your domain is by definition
relational and you want to perform queries that can make multiple hops among
different tables then SQLis the choice for you fourth if you want to enforce
field constraints this allows you to normalize and keep your table consistent
so that malformed data cannot make its way into your table and lastly for SQL when
you want to use a well-documented access language SQL is generally universal
across all the relational database engines that are underlined so you can find
a ton of support and there's a ton of different communities that you can learn
from to help you solve your now moving on to why you would want to use no SQL so
when your access pattern is defined so when you know exactly how you're going
to be interacting with your database it's definitely a time to consider no SQL secondly
when your primary key is known this is part of the limitations of no SQL as we
were talking about before the input to the hashing function is your primary key
so if you don't know your primary key you're not necessarily going to be able
to find the data that you're looking for and thirdly when your data model fits
as I was alluding to before in a Facebook type domain where you have friends
and they have friends that's a very natural fit for a no SQL graph based
database so no SQL may be a choice for you and finally the most important one
when you need high performance and low latency this is where no SQL shines it
can scale out horizontally pretty much infinitely while maintaining a
consistent high performance and low latency so now that we've done a little bit
of a comparison here let's walk through some example scenarios of when you
would pick one or the other so how to pick so if you have a small project
that's low scale with an unknown access patterns you're better off using SQL so
if you're just starting out and you're not really sure if your product is going
to take off and you don't necessarily know how it's going to be used
SQL definitely makes sense and it's probably the best choice for you and secondly
maybe you have a large project that is high scale that requires relational
queries I'd highly suggest using SQL however using a managed solution with read
replicas so you can ensure a high performance and consistency for all your
customers now the main option here since this is an AWS Channel is to use
Amazon Arora which is a fully managed SQL database they also have an alternative
option called Amazon RDS that allows you to use a different underlying
relational database engine such as my SQL Postgres Oracle and others and still
gives you that managed functionality and allows you to have read replicas
monitoring and everything else that goes along with a managed database engine
and if you have a medium or a large project that is high scale and requires high
performance then no SQL is definitely the choice for you so there's typically
two engines in the AWS ecosystem there's DynamoDB which is by far the most
popular and most developed and there's also a DB and MongoDB as a canonical
name document DB in the AWS world so if you're looking for on AWS document DB
is the choice for you unfortunately I can't recommend NoSQL for small projects
I think no SQL is a better fit for those that have an established use case that
require ultra-fast performance and high scaling abilities.
check out my Fiverr: https://www.fiverr.com/share/qPPZWd
Comments
Post a Comment
If you have any queries please let me know.