Yesterday was DSNY12's first official day of training (hooray!!), and we covered many introductory data topics. In this blog post, I'll be covering common data sources and their use cases, in order of simplest to most abstract.
When one typically thinks of data, the first thing that comes to mind is probably a data table, consisting of fields (columns) and records (rows). Tables are the building blocks for the rest of the data sources we will soon explore, starting with...
... flat files! These are often used for importing/exporting data, especially between systems, and can take the form of CSVs, Excel sheets, JSON files, and more. They are also commonly used as intermediate storage between transformation and loading in ETL (Extract, Transform, Load) processes since they are easy to ingest, so long as the dataset is small- or medium-sized.
For our purposes as data analysts, flat files are the most straightforward source to work with since they're self-contained and (in the case of CSVs and JSON) plain text you can open in any editor. However, the larger the dataset, the harder it becomes to load, process, and search through.
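As a quick sketch of how simple flat files are to ingest, here's Python's built-in csv module reading a small CSV (the field names and values are made up for illustration, and the file contents are inlined as a string for self-containment):

```python
import csv
import io

# A small in-memory CSV standing in for a flat file on disk.
sample = """id,name,borough
1,Alice,Brooklyn
2,Bob,Queens
"""

# DictReader turns each record (row) into a dict keyed by the header fields (columns).
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
print(rows[0]["name"])  # Alice
```

In practice you'd pass `open("data.csv")` instead of the `StringIO` wrapper, but the parsing works the same way.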
APIs (Application Programming Interfaces) help different software applications share data with one another. They enable developers to fetch data in real-time and bridge websites, cloud platforms, and other software. Popular API clients include Postman and Insomnia, and APIs can be used for anything from online payments to weather updates.
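Most web APIs hand back data as JSON, which Python can parse with its standard json module. Here's a minimal sketch, using a made-up payload like one a weather API might return (in a real script you'd fetch it over HTTP first, e.g. with the requests library):

```python
import json

# A hypothetical API response body; the fields are invented for illustration.
response_body = '{"city": "New York", "temp_f": 68, "conditions": "cloudy"}'

# Deserialize the JSON string into a Python dict, then pull out fields.
data = json.loads(response_body)
print(f"{data['city']}: {data['temp_f']}F, {data['conditions']}")
```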
Databases are essentially collections of information. There are many types of databases, but for now I'll focus on relational and non-relational databases.
Relational databases store data in tables and typically use SQL. They are generally used for structured data and can be managed via software like MySQL, PostgreSQL, and Microsoft SQL Server. Relational databases are advantageous because they uphold data integrity through ACID properties (atomicity, consistency, isolation, durability): a transaction either completes fully or not at all, so an interruption mid-process can never leave the data half-updated. They also scale vertically, meaning you expand them by moving to a bigger server.
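Atomicity is easiest to see in action. Here's a small sketch using Python's built-in sqlite3 module (a real relational database, just file- or memory-based); the account names and amounts are invented. A simulated failure partway through a money transfer causes the whole transaction to roll back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory relational database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the 'with' block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        # This violates the PRIMARY KEY constraint, aborting the transaction mid-transfer.
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
except sqlite3.IntegrityError:
    pass  # the failed transaction was rolled back automatically

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100 -- the partial debit never took effect
```

That's atomicity: Alice's balance is untouched because the debit and the failed statement lived or died together.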
Non-relational databases (also known as NoSQL databases) can store many types of data, whether structured or unstructured. For this reason, they're used when flexibility is needed, data requirements change often, or datasets are large. In contrast to their relational counterparts, non-relational databases prioritize availability and performance over immediate consistency: updates propagate over time, so different copies of the data may briefly disagree before converging (the BASE model: Basically Available, Soft state, Eventually consistent). Examples include MongoDB and Cassandra, and these databases scale horizontally, expanding by adding more servers.
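To make the flexibility concrete, here's a sketch of what two "documents" in the same document-store collection might look like, modeled as Python dicts (the fields are invented; real stores like MongoDB hold these as JSON-like documents). Unlike rows in a relational table, they don't have to share the same fields:

```python
import json

# Two documents in the same hypothetical "users" collection.
users = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["analyst", "nyc"],
     "social": {"platform": "Mastodon", "handle": "@bob"}},
]

# Documents are typically stored and exchanged in a JSON-like format.
print(json.dumps(users[1], indent=2))

# Querying often means filtering on fields that may or may not exist.
with_tags = [u for u in users if "tags" in u]
print(len(with_tags))  # 1
```

No schema migration was needed to give Bob fields that Alice lacks, which is exactly the flexibility described above.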
Relational databases tend to be used in transactional systems, such as banking, since they require high data integrity. Non-relational databases are used for systems with unstructured data that changes and scales rapidly, which is common in real-time analytics and social media.
Data warehouses combine structured data from multiple sources. Unlike operational databases, which generally track real-time transactions, data warehouses are built for large-scale analysis: they hold historical data and prioritize data quality and query speed. This makes them very useful for data analysts like ourselves! (They actually have a lot more to them than this, which you are welcome to explore here.)
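The analytical workload looks different from the transactional one: instead of updating individual records, you aggregate across history. A minimal sketch, using sqlite3 and an invented "fact table" of sales standing in for warehouse data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2022, "East", 120), (2022, "West", 90),
     (2023, "East", 150), (2023, "West", 110)],
)

# A warehouse-style query: summarize revenue across all of history by year.
rows = conn.execute(
    "SELECT year, SUM(revenue) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(rows)  # [(2022, 210), (2023, 260)]
```

Real warehouses (Snowflake, BigQuery, Redshift, and so on) run queries like this over billions of rows, but the shape of the SQL is the same.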
Data lakes, on the other hand, can contain structured, unstructured, and raw data. This allows organizations to dump all their data in a single location without having to adhere to a rigid schema (which would be the case for data warehouses). Data lakes are very cost-effective and can be used for machine learning, real-time analytics, data archiving, and much more!
Finally, data lakehouses combine the advantages of data lakes with those of data warehouses, and their primary purpose is to enable both business intelligence and machine learning. To make this work, there are three layers involved in their architecture: the storage layer (raw data), the staging layer (metadata management), and the semantic layer (user-facing apps and tools). There are a lot of details I've omitted, but you can read more about them here.
Why does all this matter? As budding consultants, having a strong foundation in this area could help us understand how clients might store their data and why. Knowing the advantages and limitations of data sources could also help us identify opportunities for improvement.
That's all for today. Happy blogging! :-)
