SQL vs NoSQL

Definition

Today I am going to discuss the difference between SQL and NoSQL databases I'm going to start out by defining what SQL and NoSQL is followed by some general guidelines on how to pick one over the other when starting a new project so let's start out by defining what SQL is so SQL stands for structured query language and the language typically looks like what we have on the right here so SQL allows you to perform some create read update or delete also known as crud operations through a universal language that is pretty much consistent across multiple different underline relational database engines so as I said there's multiple different underlying relational database engines and these typically are things like my SQL Postgres or Microsoft SQL Server so there's four key components that I like to talk about about databases so there's structure storage scale and access so let's quickly go through these and talk about how it relates to SQL so starting the top left with structure you typically interact with tables in a relational database engine.

Example

As an example we have a accounts table here and a table consists of rows and columns the columns correspond to the types and the rows correspond to the individual entities that exist in the table so we have an account ID here we have a creation ID and a country ID and every row here is an entity or an element so a1 is the account ID at a creation date of September 30th and it's contrary is number one now in a SQL table you must have a primary key and a primary key corresponds to the unique identifier that identifies a specific row so in this Rose case the primary key is a 1 you have attributes that correspond to just general data about the record and then you can also have things called foreign keys which are links to other tables now SQL databases also enforce constraints so for instance down here we're seeing that the account ID column is only allowed to be a variable character or a string similarly the creation date is only allowed to be a date and the country ID is only allowed to be an integer these constraints make it such that if a user's trying to insert or update and the data violates the constraint that's imposed on the table the operation will fail so this is a great way to enforce consistency across the content of your tables thirdly possibly the most important is that we have relations and let's just bring up a second image here that has a relationship so in this table we have the accounts which was what we originally had previously and we saw that we had a country ID with this vague name here so country ID 1 and separately over here we also have a country table that exists on its own and in there are the identifiers x' for each of these country IDs so country ID one corresponds to USA two for Canada three frost area so on and so forth so what SQL databases allow us to do is build relationships so we can say country ID 1 corresponds to USA 2 to Canada and 3 to Australia what this allows us to do is perform some very flexible queries such as give me all the accounts where the country ID is equal to USA now in terms of storage the storage pattern is concentrated so in a relational database engine there's typically one node that contains an entire copy of your data it's typically not partitioned or segregated in any way unless you're using some advanced strategies now down to the bottom left in terms of scale there's two approaches here so the first one is vertical so if you have a machine that's hosting your database engine and you're suffering in the performance area the option here is to build a better machine so have more memory get a better CPU at a faster hard-disk these are ways to vertically scale your machine so that you can get better performance keep in mind here that there is a physical limitation in terms of how powerful a machine that you can make so there is an upper ceiling on vertical scaling the second approach is horizontal which just means adding more machines and when you add more machines in a horizontally skilled RDS environment you typically perform that by distributing your data across multiple nodes so in a horizontally scaled environment you have one master and then you have multiple read replicas so when traffic is coming in not only can that traffic read off of your master but can also read off of your read replicas so what this allows you to do is to offload some of the pressure from the initial master node on to read replicas so that you can ensure optimal performance on your database now in terms of access it's typically raw SQL so writing the raw create read update and delete syntax for your query you typically require a direct database connection to the endpoint of the database and typically these days people are using object relational mapper z-to construct their queries and what these basically are are abstractions that people use to add criteria to an object in a very programmatic way and then allow that to generate a SQL statement that can be run against the engine so we've basically covered the four key pillars of SQL here let's move on to talking about what is no SQL so although it probably doesn't need to be said it's anything that is non relational and there's many different implementations of no SQL so it's kind of an overloaded term there's implementations that use table structures similar to what we saw in the SQL approach some use document and some use graph and the basic idea here is that no SQL is built the scale with high performance but it also comes at a cost and that cost is that your queries are less flexible so let's go through the same exercise as we did before and talk about the structure storage scale and access so in terms of structure it is very implementation dependent so as I said there's table implementations there's document implementations that store JSON objects under the hood and there's also graph databases so say for instance something like a Facebook social network where you have friends and those friends also have friends a graph database is a perfect representation of that domain but the general theme among all these is that they rely on key value stores so generally in a no SQL database you need to know the key that you're looking for when you are performing your query in terms of storage it relies on hashing the input so if you have a key that you're looking for that key is given into a hashing function and the result of that hashing function is a value and that value is distributed on to one of multiple nodes so say for instance if we add key one to three the key 1 to 3 would be inserted into the hashing function the output would be a physical location on partition 1 where that data row is stored and if you think about what this looks like from the read perspective now the same kind of concept applies so the same hashing is run when the key is provided and the engine knows exactly which partition to find the record on in terms of scale it's a very simple process and that is to simply add more partitions so instead of having the two partitions like we had in the top right in this example we can just add four partitions and allow those individual partitions to scale individually so add more disk storage add more memory and more CPU typically in an SQL database engine all of this is managed for you so you don't have to worry about the underlying details on how to add more partitions now moving on to access there's two major flavors here the first one is REST API Zen you use REST API is to hit a specific endpoint that has a certain functionality associated with it optionally you can perform create read update or delete in vendor specific languages so if you use vendor like dynamo DB that will a very different way to perform a query as something like MongoDB which isn't another no SQL option so now that we know a little bit about SQL and no SQL let's move on to talking about when you should use what so here we have a SQL.

NoSQL table so let's firstly go through why you would want to use SQL so when your access patterns aren't defined if you don't know how your business use case is going to evolve and you're not sure if the way you're storing your data is going to allow you to query it in an effective way later then using SQL is definitely the preferred option secondly when you want to perform flexible queries this kind of ties into the first point if your access patterns aren't defined you won't know what queries you're going to perform so it's important to stay flexible so you can adapt to changing use cases third when you want to perform relational queries so this really is the bread and butter of SQL engines if your domain is by definition relational and you want to perform queries that can make multiple hops among different tables then SQLis the choice for you fourth if you want to enforce field constraints this allows you to normalize and keep your table consistent so that malformed data cannot make its way into your table and lastly for SQL when you want to use a well-documented access language SQL is generally universal across all the relational database engines that are underlined so you can find a ton of support and there's a ton of different communities that you can learn from to help you solve your now moving on to why you would want to use no SQL so when your access pattern is defined so when you know exactly how you're going to be interacting with your database it's definitely a time to consider no SQL secondly when your primary key is known this is part of the limitations of no SQL as we were talking about before the input to the hashing function is your primary key so if you don't know your primary key you're not necessarily going to be able to find the data that you're looking for and thirdly when your data model fits as I was alluding to before in a Facebook type domain where you have friends and they have friends that's a very natural fit for a no SQL graph based database so no SQL may be a choice for you and finally the most important one when you need high performance and low latency this is where no SQL shines it can scale out horizontally pretty much infinitely while maintaining a consistent high performance and low latency so now that we've done a little bit of a comparison here let's walk through some example scenarios of when you would pick one or the other so how to pick so if you have a small project that's low scale with an unknown access patterns you're better off using SQL so if you're just starting out and you're not really sure if your product is going to take off and you don't necessarily know how it's going to be used SQL definitely makes sense and it's probably the best choice for you and secondly maybe you have a large project that is high scale that requires relational queries I'd highly suggest using SQL however using a managed solution with read replicas so you can ensure a high performance and consistency for all your customers now the main option here since this is an AWS Channel is to use Amazon Arora which is a fully managed SQL database they also have an alternative option called Amazon RDS that allows you to use a different underlying relational database engine such as my SQL Postgres Oracle and others and still gives you that managed functionality and allows you to have read replicas monitoring and everything else that goes along with a managed database engine and if you have a medium or a large project that is high scale and requires high performance then no SQL is definitely the choice for you so there's typically two engines in the AWS ecosystem there's DynamoDB which is by far the most popular and most developed and there's also a DB and MongoDB as a canonical name document DB in the AWS world so if you're looking for on AWS document DB is the choice for you unfortunately I can't recommend NoSQL for small projects I think no SQL is a better fit for those that have an established use case that require ultra-fast performance and high scaling abilities.

check out my Fiverr: https://www.fiverr.com/share/qPPZWd

data science and data analysis for beginner 2022

Search This Blog

SQL vs NoSQL

Definition

Example

Labels

Comments

Post a Comment