Understanding Data in Data Science: Formats and Classifications Made Easy
Marya  

Understanding data in data science is the foundation of machine learning, analytics, and data-driven decision-making. It begins with knowing the types, formats, and classifications of data used in real-world applications.

What is Data?

Everything in the field of data science starts with data. Knowing what data is, and the many forms and formats it can take, is crucial whether you’re building dashboards, training machine learning models, or making business decisions.

To understand analytics, one must start by understanding data in data science. Data refers to raw facts, numbers, or symbols gathered through measurements, observations, or interactions. On its own, data carries little meaning. Once it has been processed and examined, however, it becomes information that can be used to make decisions and identify trends.

Example:
Raw data: “Maria, 28, Karachi”
Processed information: “Maria is a 28-year-old from Karachi.”
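
The same step can be sketched in a few lines of Python. This is only an illustration: the function name `describe_person` is hypothetical, and it assumes the raw record is always a comma-separated string in name, age, city order.

```python
def describe_person(raw_record: str) -> str:
    """Turn a raw 'name, age, city' string into a readable sentence."""
    # Split on commas and trim the whitespace around each field.
    name, age, city = [field.strip() for field in raw_record.split(",")]
    # int() raises ValueError if the age field is not a whole number.
    return f"{name} is a {int(age)}-year-old from {city}."


print(describe_person("Maria, 28, Karachi"))
# prints: Maria is a 28-year-old from Karachi.
```

The raw string only becomes useful once each field has been identified and given a role, which is the difference between data and information described above.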

Types of Data

A key part of understanding data in data science is knowing the difference between structured, semi-structured, and unstructured data. Data can be defined as raw facts or figures collected from different sources. These facts can be numbers, text, images, audio, or video. In its unprocessed form, data provides little value, but with the right techniques it becomes powerful information that drives decision-making.

There are three main types of data based on their structure and organization:

1. Structured Data

This data is arranged neatly into rows and columns according to a predetermined structure (like spreadsheets or SQL databases).

Examples:

  • Excel spreadsheets
  • SQL databases
  • CSV files with columns like Name, Age, City

Key Features:

  • Easily searchable
  • Stored in relational databases
  • Well suited to traditional analytics and querying

2. Unstructured Data

There is no set structure or format for this kind of data. It does not fit neatly into the rows and columns of conventional relational databases.

Examples:

  • Images and videos
  • Emails
  • Social media posts
  • Scanned documents
  • PDFs

Key Features:

  • Complex and harder to analyze
  • Requires techniques such as natural language processing (NLP) and image processing
  • Grows rapidly in today’s digital age (commonly cited industry estimates put the share of unstructured data at roughly 80 to 90%)
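
As a rough illustration of why unstructured data needs extra processing, the sketch below counts words in a piece of free text. The sample post is made up, and splitting on letters with a regular expression is a deliberately crude stand-in for real NLP tooling.

```python
import re
from collections import Counter

# Hypothetical free-text post; unstructured data has no predefined fields.
post = "Loving the new phone! The camera is great, the battery is great too."

# Lowercase the text and pull out runs of letters as rough word tokens.
words = re.findall(r"[a-z]+", post.lower())

# Counting tokens imposes a simple structure (word -> frequency) on the text.
word_counts = Counter(words)

print(word_counts.most_common(3))
```

Nothing in the original text says which parts matter; the structure has to be derived before any analysis can start.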

3. Semi-Structured Data

This falls between structured and unstructured data. It uses tags or markers to distinguish pieces of data rather than adhering strictly to a tabular format.

Examples:

  • JSON files
  • XML files
  • NoSQL document databases (like MongoDB)
  • HTML documents

Key Features:

  • Flexible and adaptable
  • Contains metadata
  • Can usually be converted into a structured format for processing
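
The conversion to a structured format can be sketched as follows. The records and field names here are hypothetical, and the approach (picking a fixed set of columns and filling gaps with `None`) is one simple option among many.

```python
import json

# Hypothetical semi-structured records: keys act as tags, and fields may be
# nested or missing from one record to the next.
raw = """
[
  {"name": "Maria", "age": 28, "address": {"city": "Karachi"}},
  {"name": "Ahsan", "address": {"city": "Lahore"}}
]
"""

records = json.loads(raw)

# Flatten each record into a fixed set of columns; missing values become None.
rows = [
    {
        "name": record.get("name"),
        "age": record.get("age"),
        "city": record.get("address", {}).get("city"),
    }
    for record in records
]

for row in rows:
    print(row)
```

After this step every row has the same three columns, so the data can be loaded into a table or spreadsheet like any other structured dataset.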

Common Data Formats

Let’s now look at popular formats used to store and share data in data science projects:

1. CSV (Comma-Separated Values)

  • Simple text format
  • Columns separated by commas
  • Great for tabular data
  • Easily opened in Excel or imported into Python with pandas

Example:

Name, Age, City  
Rafay, 28, Karachi  
Ahsan, 32, Lahore
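
A minimal sketch of reading this sample with Python's built-in `csv` module is shown below. The data is held in a string for illustration; a real project would open a file instead, and the variable names are assumptions.

```python
import csv
import io

# The sample above as an in-memory file; a real project would open a .csv file.
csv_text = "Name, Age, City\nRafay, 28, Karachi\nAhsan, 32, Lahore\n"

# skipinitialspace=True drops the space that follows each comma in this sample.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)

# CSV values are always read as strings, so convert Age explicitly.
people = [{**row, "Age": int(row["Age"])} for row in reader]

print(people)
```

CSV carries no type information, which is why the explicit `int()` conversion is needed before doing arithmetic on the Age column.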

2. JSON (JavaScript Object Notation)

  • Lightweight data-interchange format
  • Often used in web APIs
  • Supports nested data structures

Example:

{
  "name": "Maria",
  "age": 28,
  "city": "Karachi"
}
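
To show how nested JSON is read in practice, here is a short sketch using Python's built-in `json` module. The `skills` list and `contact` object are hypothetical additions to the record above, included only to demonstrate nesting.

```python
import json

# The record above, extended with a hypothetical nested list and object.
text = (
    '{"name": "Maria", "age": 28, "city": "Karachi", '
    '"skills": ["Python", "SQL"], '
    '"contact": {"email": "maria@example.com"}}'
)

person = json.loads(text)  # JSON object -> Python dict

print(person["name"])              # top-level field
print(person["skills"][0])         # first element of the nested list
print(person["contact"]["email"])  # field of the nested object

# Serialize back to JSON text, indented for readability.
print(json.dumps(person, indent=2))
```

Unlike CSV, JSON keeps basic types, so `age` comes back as an integer without any conversion.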

3. Excel (XLS/XLSX)

  • Microsoft Excel format
  • Can include formulas, charts, and multiple sheets
  • Popular among non-programmers

Pros:

  • User-friendly
  • Great for manual data entry and small-scale analysis

Cons:

  • Not ideal for automation or large datasets

4. SQL (Structured Query Language)

  • Not a file format, but a language used to interact with structured databases
  • Data stored in relational tables
  • Allows querying, updating, inserting, and deleting data

Example SQL query:

SELECT * FROM customers WHERE city = 'Karachi';
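
The query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `customers` table layout and the sample rows are invented for illustration, and an in-memory SQLite database stands in for whatever database a real project would use.

```python
import sqlite3

# In-memory database for illustration; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, age, city) VALUES (?, ?, ?)",
    [("Rafay", 28, "Karachi"), ("Ahsan", 32, "Lahore"), ("Maria", 28, "Karachi")],
)

# The query above, with the city passed as a parameter rather than written
# into the SQL string, and an ORDER BY added for a predictable row order.
rows = conn.execute(
    "SELECT * FROM customers WHERE city = ? ORDER BY name", ("Karachi",)
).fetchall()

print(rows)
conn.close()
```

Passing values through `?` placeholders is generally preferable to building the SQL string by hand, since it avoids quoting mistakes and SQL injection.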

Quick Comparison Table

Type | Examples | Ideal Use
Structured | SQL, CSV, Excel | Traditional analysis, queries
Unstructured | Images, Videos, PDFs | NLP, computer vision, storage
Semi-Structured | JSON, XML, HTML | APIs, flexible storage formats

Why It Matters

Without understanding data in data science, your models may rest on flawed assumptions and produce unreliable or misleading results. Knowing your data helps you:

  • Choose the right model
  • Clean and preprocess data properly
  • Avoid bias and errors
  • Interpret results accurately

Summary:

  • Understanding data in data science shows you how to organize and use data effectively.
  • Data is the backbone of data science—raw facts that become valuable when processed.
  • Types include structured, unstructured, and semi-structured data.
  • Key formats like CSV, JSON, Excel, and SQL help store and manage data in diverse ways.

Understanding these distinctions is the first step toward becoming a data expert.
