INTRODUCTION TO DATA SCIENCE
What is Data?
2026
– FARDINA FATHMIUL ALAM
(fardina@umd.edu)
CMSC 320: Introduction to Data Science
Topics we will cover
Chapter 2: https://ffalam.github.io/CMSC320TextBook/chapter2/Chapter_2_0.html
How you structure and represent your data is crucial for good performance and accurate results.
Data
Data is raw information, facts, or statistics that can be in various forms such as numbers, text, images, or more.
Broad Categories of Data
Structured Data, Unstructured Data
Types of Data
Structured
Unstructured
Semi-Structured
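As a rough illustration of the three types listed above, the sketch below holds the same kind of information in each form. All names and values are invented for this example.

```python
import json

# Structured: every record has the same fixed fields, like rows in a table.
structured = [
    {"id": 1, "name": "Ada", "score": 91},
    {"id": 2, "name": "Grace", "score": 88},
]

# Semi-structured: self-describing keys, but records may nest or omit fields.
semi_structured = json.loads(
    '[{"id": 1, "name": "Ada", "tags": ["math", "code"]},'
    ' {"id": 2, "name": "Grace"}]'
)

# Unstructured: free text with no predefined fields.
unstructured = "Ada scored 91 on the quiz; Grace was close behind with 88."

# Fixed fields make column-style access straightforward.
average = sum(row["score"] for row in structured) / len(structured)
print(average)

# Optional fields need a default when they are missing.
print([record.get("tags", []) for record in semi_structured])

# Free text has to be searched or parsed before it yields values.
print("88" in unstructured)
```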
Data Formats in Data Science
Determine how data is organized and how efficiently it can be read, written, and processed.
Data Formats
CSV/ TSV
Any CSV reader worth anything can parse files with any delimiter, not just a comma (e.g., “TSV” for tab-separated values)
Delimiter: the separator character, such as the comma (,), tab (\t), colon (:), or semicolon (;).
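To make the delimiter idea concrete, here is a minimal sketch that parses the same made-up two-line table with three different separators. The helper name `parse_delimited` and the sample text are assumptions for illustration only.

```python
import csv
import io

def parse_delimited(text, delimiter):
    """Parse delimited text into a list of rows using the standard csv module."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# The same invented table written with three different separators.
comma_text = "course,day\nCMSC320,Tue\n"
tab_text = "course\tday\nCMSC320\tTue\n"
semicolon_text = "course;day\nCMSC320;Tue\n"

rows_comma = parse_delimited(comma_text, ",")
rows_tab = parse_delimited(tab_text, "\t")
rows_semicolon = parse_delimited(semicolon_text, ";")

print(rows_comma)
# Only the separator differs; the parsed rows are identical.
print(rows_comma == rows_tab == rows_semicolon)
```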
CSV Files in Python
Don’t write your own CSV or JSON parser
(We’ll use pandas to do this much more easily and efficiently)
import csv

# newline="" lets the csv module handle line endings itself
with open("schedule.csv", "r", newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    next(reader)  # skip the header row
    for row in reader:
        print(row)
Output:
Input file: schedule.csv
Databases
A database is an organized collection of structured information, or data, typically stored electronically in a computer system.
Relational Database
Organize data into tables (relations)
DBMS (Database Management System)
Software that manages databases and provides an interface for interacting with the stored data.
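One way to see a relational table and a DBMS working together is Python's built-in `sqlite3` module, sketched below. The `students` table and its rows are invented for this example; SQLite is used only because it ships with Python, not because the course prescribes it.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A relation (table) with named, typed columns.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students (name, major) VALUES (?, ?)",
    [("Ada", "CS"), ("Grace", "Math"), ("Alan", "CS")],
)
conn.commit()

# The DBMS answers a declarative query; we never touch the storage format.
cs_students = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM students WHERE major = ? ORDER BY name", ("CS",)
    )
]
print(cs_students)
conn.close()
```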
JSON
JSON = JavaScript Object Notation (lightweight data interchange format)
Structure
In this Example:
JSON: Example
JSON in Python
In Python, the json module is commonly used to work with JSON data.
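A minimal sketch of the two most common `json` calls follows: `json.loads` to read JSON text and `json.dumps` to write it. The sample record is made up for illustration.

```python
import json

# Invented JSON text; note the lowercase JSON literal false.
text = '{"course": "CMSC 320", "topics": ["CSV", "JSON"], "online": false}'

# json.loads turns JSON text into Python objects (dict, list, str, bool, ...).
record = json.loads(text)
print(record["topics"][1])
print(record["online"])

# json.dumps goes the other way and produces JSON text.
record["topics"].append("XML")
round_trip = json.dumps(record, indent=2)
print(round_trip)
```

For files rather than strings, the module also provides `json.load` and `json.dump`, which take an open file object.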
XML / HTML
HTML (Hypertext Markup Language): fixed tags, for web content
XML (eXtensible Markup Language): custom tags, for structured data
Both: markup languages using tags to structure information
HTML
XML
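To illustrate the "custom tags" side of this comparison, the sketch below parses a small XML snippet with the standard library's `xml.etree.ElementTree`. The tag names (`course`, `title`, `topic`) are invented here, which is exactly what XML allows; HTML would instead draw on a fixed set such as `h1`, `p`, and `li`.

```python
import xml.etree.ElementTree as ET

# XML: the tag and attribute names below are chosen by the author of the data.
xml_text = """
<course code="CMSC320">
  <title>Introduction to Data Science</title>
  <topic>CSV</topic>
  <topic>JSON</topic>
</course>
"""

root = ET.fromstring(xml_text.strip())

# Navigate the tree by tag name.
title = root.findtext("title")
topics = [node.text for node in root.findall("topic")]

print(root.tag, root.attrib["code"])
print(title)
print(topics)
```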
Data Format: Images
Image data encompasses the visual content and properties of an image (visual representation), including details such as colors, shapes, patterns, and pixel values.
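As a small sketch of what "pixel values" means, a grayscale image can be treated as a grid of numbers. The tiny image below is made up, and plain nested lists are used to keep the example dependency-free; real work would typically use an array library.

```python
# A 3x4 grayscale image: each number is one pixel's brightness
# (0 = black, 255 = white).
image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
]

# A color pixel is usually stored as several channels, e.g. (red, green, blue).
red_pixel = (255, 0, 0)

height = len(image)
width = len(image[0])
pixels = [value for row in image for value in row]

print(height, width)
print(min(pixels), max(pixels))
print(sum(pixels) / len(pixels))  # average brightness
```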
Images as Data
Data Format: Images
Images can be compressed to save storage space.
→Varying Underlying Resolutions (level of detail and clarity)
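The two ideas on this slide can be sketched in a few lines: lowering resolution by averaging blocks of pixels, and compressing raw pixel bytes. The `downsample` helper and both images are hypothetical, and `zlib` stands in for a generic lossless compressor rather than any particular image format.

```python
import zlib

def downsample(image, factor):
    """Lower the resolution by averaging each factor x factor block of pixels."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) // len(block))
        result.append(row)
    return result

# A made-up 4x4 grayscale image (values 0-255).
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [50, 50, 90, 90],
    [50, 50, 90, 90],
]

# Half the resolution: each 2x2 block becomes one pixel, so detail is lost.
small = downsample(image, 2)
print(small)

# Lossless compression: a uniform 100x100 gray image stored as raw bytes.
flat = bytes([128]) * 10_000
packed = zlib.compress(flat)
print(len(flat), len(packed))
print(zlib.decompress(packed) == flat)  # the original is fully recoverable
```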
Where Does Data Come From?
How to Get Data?
Web Scraping: extracting data directly from web pages. It doesn't rely on APIs; instead, it requests pages the way a web browser would and parses the returned HTML content.
Web scraping can be used to extract data when an API isn't available or when you need to collect information from web pages that aren't designed for programmatic access (should be done with caution, considering legal, ethical, and access restrictions).
Beautiful Soup and Parsing HTML
Beautiful Soup is a Python library for parsing HTML and XML, which makes it easier to extract data from scraped web pages.
Notes: Don't write your own parser. Install Beautiful Soup and use it to parse HTML content by creating a BeautifulSoup object.
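A minimal sketch of that workflow follows, assuming the package has been installed (for example with `pip install beautifulsoup4`). The HTML is an invented inline snippet so the example runs without fetching a real page.

```python
from bs4 import BeautifulSoup

# A small inline page so the sketch runs without any network access.
html = """
<html><body>
  <h1>Schedule</h1>
  <ul>
    <li><a href="/week1">Week 1</a></li>
    <li><a href="/week2">Week 2</a></li>
  </ul>
</body></html>
"""

# "html.parser" selects Python's built-in parser as the backend.
soup = BeautifulSoup(html, "html.parser")

# Pull out the heading text and every link's label and target.
heading = soup.find("h1").get_text(strip=True)
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```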
RESTful APIs (Application Programming Interfaces)
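import requests

# Send a GET request; printing the response object shows its status, e.g. <Response [200]>
response = requests.get("http://api.open-notify.org/astros.json")
print(response)
# The JSON body is available as Python objects
print(response.json())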
RESTful APIs
“If you send me a specific request, I will return some information in a structured and documented format.”
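Because the reply arrives in a structured format, it can be processed like any other JSON. The sketch below works on a made-up sample body instead of a live call; the field names (`message`, `number`, `people`, `name`, `craft`) are assumptions and should be checked against the API's documentation. The helper `people_by_craft` is hypothetical.

```python
import json

# Invented sample body in the shape assumed for this sketch; a real call would
# obtain the parsed payload from response.json() instead.
sample_body = (
    '{"message": "success", "number": 2, "people": ['
    '{"name": "Person A", "craft": "ISS"}, '
    '{"name": "Person B", "craft": "ISS"}]}'
)

def people_by_craft(payload):
    """Group names by craft from a payload that contains a 'people' list."""
    grouped = {}
    for person in payload.get("people", []):
        grouped.setdefault(person["craft"], []).append(person["name"])
    return grouped

payload = json.loads(sample_body)
print(payload["message"])
print(people_by_craft(payload))
```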
Summary
Key Concepts for Data Preparation
Understanding these concepts is essential for moving through the data science lifecycle toward meaningful insights.