Random Split Method

Course: Databases and Data Mining
University Lecture
Instructor: Jamolbek Mattiev

Learning Objectives

• Understand data splitting
• Learn Random Split mechanism
• Compare with K-Fold Cross Validation
• Interpret evaluation variance

Why Do We Split Data?

• Estimate generalization performance
• Detect overfitting
• Simulate unseen data
• Ensure objective evaluation

Basic Concept

• Randomly divide dataset
• Training set + Test set
• Common ratios: 70/30, 80/20
• Each sample is equally likely to be assigned to the test set

Mathematical Formulation

D = {x_1, x_2, ..., x_n}
D_train ∪ D_test = D
D_train ∩ D_test = ∅
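
The formulation above can be sketched in a few lines of standard-library Python. The function name `random_split` and its parameters `test_ratio` and `seed` are illustrative assumptions, not part of the lecture; this is a minimal sketch rather than a reference implementation.

```python
import random

def random_split(data, test_ratio=0.2, seed=0):
    """Shuffle a copy of `data` and cut it into (train, test)."""
    rng = random.Random(seed)      # local generator; global random state is untouched
    shuffled = list(data)          # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)          # every ordering is equally likely
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

D = list(range(10))                # toy dataset with n = 10 samples
D_train, D_test = random_split(D, test_ratio=0.2, seed=42)

# The two conditions of the formulation:
assert set(D_train) | set(D_test) == set(D)   # D_train ∪ D_test = D
assert set(D_train) & set(D_test) == set()    # D_train ∩ D_test = ∅
print(len(D_train), len(D_test))   # prints: 8 2
```

Shuffling once and cutting at a fixed position guarantees that the two sets are disjoint and together cover D.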

Advantages

• Simple
• Fast
• Computationally efficient
• Suitable for large datasets

Disadvantages

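• High variance: the score from a single split depends on which samples land in the test set
• Depends on random seed
• Unstable for small datasets, where the test set holds only a few samples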
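
To make the seed dependence concrete, the sketch below repeats a 70/30 split with 20 different seeds on a small and a larger synthetic dataset and reports the spread of test accuracy. Everything here is an assumption made for illustration: the synthetic data from `make_data`, the simple threshold "model", and the dataset sizes are not from the lecture.

```python
import random
import statistics

def make_data(n, seed=0):
    """Synthetic 1-D two-class data: class 0 centred at 0.0, class 1 at 1.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(float(label), 1.0), label))
    return data

def split_accuracy(data, test_ratio, seed):
    """Randomly split `data`, fit a threshold on train, return test accuracy."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Toy "model": threshold halfway between the two class means of the training set.
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)

for n in (40, 4000):
    data = make_data(n)
    accs = [split_accuracy(data, 0.3, seed) for seed in range(20)]
    # Columns: dataset size, lowest accuracy, highest accuracy, standard deviation
    print(n, round(min(accs), 3), round(max(accs), 3), round(statistics.stdev(accs), 3))
```

Only the seed changes between runs on the same dataset, so any spread in the printed accuracies comes from the split itself.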

When to Use?

• Large datasets
• Balanced classes
• Baseline experiments

Introduction to K-Fold

• Dataset split into K disjoint subsets (folds)
• Train K times, each time testing on a different fold
• Average performance over the K folds
• More robust evaluation
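
The K-Fold procedure can be sketched as an index generator plus an averaging loop. The names `k_fold_indices`, `cross_validate`, and `dummy_score` are hypothetical helpers introduced for this example; `dummy_score` only stands in for "fit on the training indices, evaluate on the test indices".

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(score_fn, n, k, seed=0):
    """Run score_fn once per fold and return (mean score, per-fold scores)."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / k, scores

def dummy_score(train, test):
    # Placeholder for a real train-and-evaluate step; returns the test share only.
    return len(test) / (len(train) + len(test))

mean_score, scores = cross_validate(dummy_score, n=10, k=5)
print(scores)
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training K - 1 times.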

Random Split vs K-Fold

Random Split: Single evaluation
K-Fold: Multiple evaluations
Random Split: Higher variance
K-Fold: Lower variance

Practical Recommendations

• Fix random_state so the split is reproducible
• Use stratification for classification to preserve class proportions in both sets
• Report split ratio in research

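
The first two recommendations can be illustrated with a seeded, per-class split written in standard-library Python. The helper `stratified_split` and the toy labels are assumptions made for this sketch. In scikit-learn, `train_test_split` exposes the same ideas through its `random_state` and `stratify` parameters.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx), applying the split ratio inside each class."""
    rng = random.Random(seed)               # fixed seed gives a reproducible split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):          # fixed class order keeps runs identical
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

labels = ["a"] * 90 + ["b"] * 10            # imbalanced toy labels (90% / 10%)
train_idx, test_idx = stratified_split(labels, test_ratio=0.2, seed=42)
print(sum(labels[i] == "b" for i in test_idx), "of", len(test_idx))   # prints: 2 of 20
```

Because the ratio is applied within each class, the minority class keeps its 10% share in the test set, which a plain random split does not guarantee.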
Visual Illustration: Random Split (80/20 Example)

Comparison Chart: Random Split vs K-Fold Variance

Summary

• Random Split is simple, but its estimate varies from split to split
• K-Fold provides a more stable estimate
• Choose method based on dataset size
• Always ensure reproducibility