1 of 23

INTRODUCTION TO DATA SCIENCE

What is Data?

2026

– FARDINA FATHMIUL ALAM

(fardina@umd.edu)

CMSC 320: Introduction to Data Science

2 of 23

Topics we will cover

Chapter 2: https://ffalam.github.io/CMSC320TextBook/chapter2/Chapter_2_0.html

  1. Types of Data
    1. Broader Categories of Data
  2. Data Formats
  3. How to get Data?


How you structure and represent your data is crucial for good performance and accurate results.

3 of 23

Data

Data is raw information, facts, or statistics that can be in various forms such as numbers, text, images, or more.


4 of 23

Broad Category of Data


Structured Data, Unstructured Data

5 of 23

Types of Data

Structured

  • Tabular (rows/columns) → e.g., spreadsheets, SQL tables
  • Time-based / Time-series → e.g., stock prices, sensor data
  • Graph data → e.g., social networks, biological networks, knowledge graphs

Unstructured

  • Text → e.g., emails, reviews, documents
  • Images / Video → e.g., photos, medical imaging, surveillance video
  • Audio → e.g., speech, music, phone calls

Semi-Structured

  • JSON → common for APIs / web data
  • XML / HTML → web pages, documents
  • Logs / YAML → system logs, config files
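
As a rough illustration of the three categories above, the sketch below holds similar information in a structured, a semi-structured, and an unstructured form. The student records and field names are invented for illustration and do not come from the course material.

```python
import json

# Structured: every row has the same columns (like a spreadsheet or SQL table).
structured_rows = [
    {"student_id": 1, "name": "Ada", "score": 91},
    {"student_id": 2, "name": "Grace", "score": 88},
]

# Semi-structured: JSON carries its own keys, and records may nest or omit fields.
semi_structured = json.loads(
    '{"student_id": 1, "name": "Ada", "tags": ["honors"], "advisor": {"name": "Lin"}}'
)

# Unstructured: free text with no predefined fields.
unstructured_text = "Ada did very well on the midterm and asked about graph data."

# Structured data can be checked against a fixed set of columns.
columns = sorted(structured_rows[0].keys())
same_columns = all(sorted(row.keys()) == columns for row in structured_rows)

print(columns, same_columns)
print(semi_structured["advisor"]["name"])   # nested lookup by key
print(len(unstructured_text.split()), "words")  # text needs processing before analysis
```

The structured rows can be validated against a schema, the JSON record is navigated by key, and the free text has to be processed (here, just split into words) before it yields anything tabular.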


6 of 23

Data Formats in Data Science

Data formats determine how data is organized and how efficiently it can be read, written, and processed.


7 of 23

Data Formats

  • CSV / TSV
  • Image
    • .jpg
    • .png
  • Audio
    • .wav
    • .mp3
  • SQL Database
    • MySQL
    • PostgreSQL
    • etc…


  • NoSQL Database
    • Bigtable
    • Accumulo

  • JSON
  • XML / HTML

8 of 23

CSV / TSV

  • CSV (Comma-Separated Values)
  • TSV (Tab-Separated Values)
    • Both are common file formats used to store structured data in a simple tabular format, with rows and columns.
    • Easy to import and export across various data analysis tools and programming languages.


Any CSV reader worth anything can parse files with any delimiter, not just a comma (e.g., “TSV” for tab-separated values).

Delimiter: the separator character, such as the comma (,), the tab (\t), the colon (:), or the semicolon (;).
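
A minimal sketch of reading the same table with different delimiters, using Python's built-in csv module. The sample text and the read_rows helper are hypothetical, and io.StringIO stands in for a real file on disk.

```python
import csv
import io

# Hypothetical sample data; the same table written with three delimiters.
csv_text = "course,room\nCMSC 320,Room A\n"
tsv_text = "course\troom\nCMSC 320\tRoom A\n"
semicolon_text = "course;room\nCMSC 320;Room A\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")
semicolon_rows = read_rows(semicolon_text, ";")

print(csv_rows)
print(tsv_rows == csv_rows and semicolon_rows == csv_rows)
```

Only the delimiter argument changes between the three calls; the parsed rows are identical.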

9 of 23

CSV Files in Python

Don’t write your own CSV or JSON parser

(We’ll use pandas to do this much more easily and efficiently)

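import csv

# newline="" lets the csv module handle line endings itself
with open("schedule.csv", "r", newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    next(reader)  # skip the header row
    for row in reader:
        print(row)  # each row is a list of strings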

[Figure: the input file schedule.csv and the output printed by the code]

10 of 23

Databases


A database is an organized collection of structured information, or data, typically stored electronically in a computer system.

  • Handles more complex data relationships, often organized in tables.
  • Supports efficient retrieval, updating, and manipulation of information for various applications.

11 of 23

Relational Database

Organize data into tables (relations)

  • Tables contain rows (records) and columns (attributes)
  • Primary keys: uniquely identify each row
  • Foreign keys: reference keys in other tables to create relationships
  • Tables are linked through these relationships (joins)
  • Queried using SQL to retrieve, insert, update, and delete data
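
The ideas above (tables, primary keys, foreign keys, joins, SQL) can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are invented for illustration, and SQLite is used here only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # temporary in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Each table has a primary key that uniquely identifies a row.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE enrollments (
           id INTEGER PRIMARY KEY,
           student_id INTEGER NOT NULL REFERENCES students(id),  -- foreign key
           course TEXT NOT NULL
       )"""
)

conn.executemany(
    "INSERT INTO students (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
    [(1, "CMSC 320"), (2, "CMSC 320"), (1, "MATH 140")],
)

# A join links the two tables through the foreign-key relationship.
rows = conn.execute(
    """SELECT s.name, e.course
       FROM students AS s
       JOIN enrollments AS e ON e.student_id = s.id
       ORDER BY s.name, e.course"""
).fetchall()

print(rows)
conn.close()
```

With foreign keys enabled, inserting an enrollment for a student_id that does not exist in students would raise sqlite3.IntegrityError.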


12 of 23

DBMS (Database Management System)


Software that manages databases and provides an interface for interacting with the stored data.

13 of 23

JSON

JSON = JavaScript Object Notation (lightweight data interchange format)

  • Text-based format for storing & transmitting data (e.g., Web APIs, client–server communication)
  • Human-readable & machine-friendly
  • Represents data as key–value pairs in a hierarchical structure
  • Core building blocks:
    • Objects (enclosed in { }): collections of key–value pairs
    • Arrays (enclosed in [ ]): ordered lists of values
  • Supports common data types: string, number, boolean, null, array, object
  • Widely used in web APIs, mobile apps, config files, and modern data pipelines



14 of 23

JSON: Example
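
As an illustration, the small JSON document below combines an outer object, a nested object, arrays, and each of the basic value types. The record and its field names are invented for this sketch; Python's json module is used only to show how each piece is interpreted.

```python
import json

# Hypothetical JSON document: one object containing arrays and a nested object.
document = """
{
  "course": "CMSC 320",
  "enrolled": 120,
  "online": false,
  "room": null,
  "topics": ["types of data", "data formats"],
  "instructor": {"name": "Example Instructor", "office_hours": ["Mon", "Wed"]}
}
"""

record = json.loads(document)

print(type(record).__name__)             # JSON object  -> Python dict
print(type(record["topics"]).__name__)   # JSON array   -> Python list
print(record["online"], record["room"])  # false / null -> False / None
print(record["instructor"]["office_hours"][0])  # follow keys and indexes downward
```

Reading a value means following keys through objects and positions through arrays, which is what "hierarchical structure" refers to.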


15 of 23

JSON in Python

In Python, the json module is commonly used to work with JSON data.

  • Use json.dumps() to convert Python objects to a JSON string.
  • Use json.loads() to convert a JSON string back to Python objects.
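
A short round trip through the two functions, assuming a hypothetical student dictionary. The indent and sort_keys arguments are optional and are used here only to make the output easier to read.

```python
import json

# Hypothetical Python object to serialize.
student = {"name": "Ada", "scores": [91, 88.5], "graduated": False, "advisor": None}

# Python object -> JSON text
encoded = json.dumps(student, indent=2, sort_keys=True)

# JSON text -> Python object
decoded = json.loads(encoded)

print(encoded)
print(decoded == student)  # the round trip preserves this data
```

Python's False and None are written as JSON false and null, and are converted back on loading. Types that JSON lacks do not survive a round trip unchanged; a tuple, for example, comes back as a list.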


16 of 23

XML / HTML

HTML (Hypertext Markup Language): fixed tags, for web content

XML (eXtensible Markup Language): custom tags, for structured data

Both: markup languages using tags to structure information
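
To make the contrast concrete, the sketch below reads a small XML snippet with custom tags and a small HTML snippet with standard tags, using only Python's standard library. Both snippets and the TagCollector class are hypothetical examples.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# XML: the tag names (course, code, topic) are chosen by whoever defines the data.
xml_text = "<course><code>CMSC 320</code><topic>Data Formats</topic></course>"
root = ET.fromstring(xml_text)
xml_fields = {child.tag: child.text for child in root}


class TagCollector(HTMLParser):
    """Records the name of every opening tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# HTML: the tag names (html, body, h1, p) come from a fixed vocabulary for web content.
html_text = "<html><body><h1>CMSC 320</h1><p>Data Formats</p></body></html>"
collector = TagCollector()
collector.feed(html_text)

print(xml_fields)
print(collector.tags)
```

In the XML snippet the tags describe what the data means, while in the HTML snippet they describe how the content is presented on a page.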


[Figure: example HTML and XML documents]

17 of 23

Data Format: Images


Image data encompasses the visual content and properties of an image (visual representation), including details such as colors, shapes, patterns, and pixel values.

Images as Data

  • Images are grids of pixels
  • Each pixel stores color information
  • Color images use RGB channels (Red, Green, Blue)
  • Channels act like layers of color information
  • Channel values typically range 0–255 in 8-bit images (low → high intensity)
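
A minimal sketch of an image as a grid of RGB pixels, using plain Python lists so that it runs without extra libraries. The 2×2 image is invented, and the grayscale conversion is a simple channel average chosen for illustration; other weightings are common.

```python
# A hypothetical 2x2 color image: each pixel is (red, green, blue), each value 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],        # top row: red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],    # bottom row: blue pixel, white pixel
]

height = len(image)
width = len(image[0])

# Pull out one channel: a grid of the same shape holding only the red values.
red_channel = [[pixel[0] for pixel in row] for row in image]

# Simple grayscale: average the three channels of every pixel (integer division).
grayscale = [[sum(pixel) // 3 for pixel in row] for row in image]

print(height, width)
print(red_channel)
print(grayscale)
```

Libraries such as NumPy store the same idea as an array of shape (height, width, 3), one layer per channel.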

18 of 23

Data Format: Images


Images can be compressed to save storage space, using either lossy or lossless compression.

  • Lossy Compression: Sacrifices some data to reduce file size.
    • Suitable for photos where minor quality loss is acceptable (e.g., JPEG).
  • Lossless Compression: Reduces file size without any loss of quality.
    • Ideal for preserving image quality in medical imaging, digital art, or graphics (e.g., PNG, GIF).
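
The difference can be sketched on raw pixel values. The lossless half uses zlib, which implements DEFLATE, the algorithm PNG uses for compression. The lossy half is only a stand-in: it rounds pixel values down to multiples of 16, which is not how JPEG works internally but shows the same trade of detail for size. All pixel values are invented.

```python
import zlib

# Lossless: compress and decompress, then compare byte for byte.
pixels = bytes([10, 10, 10, 10, 200, 200, 200, 200] * 64)  # 512 hypothetical pixel values
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

print(len(pixels), len(compressed))
print(restored == pixels)  # every original value is recovered

# Lossy (illustrative stand-in): keep only coarse intensity levels.
original = [13, 14, 15, 200, 201, 203]
quantized = [(value // 16) * 16 for value in original]

print(quantized)
print(quantized == original)  # the fine differences are gone for good
```

After quantization there are fewer distinct values, so the data compresses better, but the original values cannot be reconstructed.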

→ Images also vary in their underlying resolution (level of detail and clarity)

19 of 23

Where Does Data Come From?


20 of 23

How to Get Data?

  • Given to you by your company
  • Gathered from databases
  • From the internet (for example: web scraping)
  • From a RESTful API


Web scraping involves extracting data directly from web pages. It doesn't rely on APIs; instead, it requests pages the way a web browser would and parses the HTML content that comes back.

Web scraping can be used to extract data when an API isn't available or when you need to collect information from web pages that aren't designed for programmatic access. It should be done with caution, considering legal, ethical, and access restrictions.

21 of 23

Beautiful Soup and Parsing HTML

Beautiful Soup is a Python library for parsing HTML and XML, which makes extracting data from web pages easier.


Note: Don't write your own parser. Install Beautiful Soup and use it to parse HTML content by creating a BeautifulSoup object.
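
A minimal sketch of that workflow, assuming Beautiful Soup has been installed (pip install beautifulsoup4). The HTML is a hypothetical page embedded as a string so that no network access is needed; in practice the string would come from a downloaded page.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical page content standing in for downloaded HTML.
html = """
<html><body>
  <h1>CMSC 320</h1>
  <ul>
    <li class="topic"><a href="/types">Types of Data</a></li>
    <li class="topic"><a href="/formats">Data Formats</a></li>
  </ul>
</body></html>
"""

# Create the BeautifulSoup object; "html.parser" is Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").get_text(strip=True)
topics = [li.get_text(strip=True) for li in soup.find_all("li", class_="topic")]
links = [a["href"] for a in soup.find_all("a")]

print(heading)
print(topics)
print(links)
```

find() returns the first matching tag and find_all() returns every match, so most of the work in scraping is deciding which tags and attributes identify the data you want.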

22 of 23

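RESTful APIs (Application Programming Interfaces)

import requests

# Send a GET request to the API endpoint
response = requests.get("http://api.open-notify.org/astros.json")
print(response)         # status of the request, e.g. <Response [200]> on success
print(response.json())  # the JSON body parsed into Python objects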


RESTful APIs

  • API = Application Programming Interface
  • Provide structured & documented access to data/services
  • Enable communication via requests and responses
  • Reliable way to obtain web data
  • API docs explain how to use endpoints & interpret returned data

“If you send me a specific request, I will return some information in a structured and documented format.”
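
The request-and-response contract can be sketched end to end with only the standard library. The code below starts a tiny local server with one hypothetical endpoint, /courses, and then acts as the client. The handler class, endpoint, and payload are all invented for illustration; a real API would be a remote service described by its own documentation.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CoursesHandler(BaseHTTPRequestHandler):
    """Hypothetical API: GET /courses returns a JSON document."""

    def do_GET(self):
        if self.path == "/courses":
            body = json.dumps({"courses": ["CMSC 320"]}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)  # unknown endpoint

    def log_message(self, format, *args):
        pass  # keep the example output free of request logs


# Server side: listen on a free local port in a background thread.
server = HTTPServer(("127.0.0.1", 0), CoursesHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a specific request, receive structured data back.
with urllib.request.urlopen(f"http://127.0.0.1:{port}/courses") as response:
    status = response.status
    data = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()

print(status, data)
```

The client needs to know only the endpoint and the shape of the returned JSON, which is the information API documentation provides.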

23 of 23

Summary

Key Concepts for Data Preparation

  • Data Types (tabular, text, graph, unstructured): shape how data is cleaned and processed
  • File Formats (CSV, JSON, etc.): affect ingestion and transformation
  • Databases: store and retrieve data for efficient management
  • Data Acquisition: REST APIs and web scraping gather data at the start of the lifecycle

Understanding these concepts is essential for moving through the data science lifecycle toward meaningful insights.
