1 of 14

Random Split Method

  • Course: Databases and Data Mining
  • University Lecture
  • Instructor: Jamolbek Mattiev

2 of 14

Learning Objectives

  • Understand data splitting
  • Learn the Random Split mechanism
  • Compare with K-Fold Cross Validation
  • Interpret evaluation variance

3 of 14

Why Do We Split Data?

  • Estimate generalization performance
  • Detect overfitting
  • Simulate unseen data
  • Ensure objective evaluation

4 of 14

Basic Concept

  • Randomly divide the dataset
  • Training set + Test set
  • Common ratios: 70/30, 80/20
  • Each sample has an equal probability of landing in the test set

5 of 14

Mathematical Formulation

  • D = {x1, x2, ..., xn}
  • D_train ∪ D_test = D
  • D_train ∩ D_test = ∅
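The two conditions above can be checked directly in code. The following is a minimal sketch using only the Python standard library; the function name random_split and its parameters test_ratio and seed are illustrative assumptions, not part of the lecture.

```python
import random

def random_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of `data` and cut it into (train, test)."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    shuffled = list(data)              # copy: the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

D = list(range(10))
D_train, D_test = random_split(D, test_ratio=0.2, seed=42)

assert set(D_train) | set(D_test) == set(D)   # D_train ∪ D_test = D
assert set(D_train) & set(D_test) == set()    # D_train ∩ D_test = ∅
print(len(D_train), len(D_test))              # 8 2
```

Calling the function again with the same seed returns the same split; changing the seed changes which samples land in the test set.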

6 of 14

Advantages

  • Simple
  • Fast
  • Computationally efficient (the model is trained only once)
  • Suitable for large datasets

7 of 14

Disadvantages

  • High variance: the estimate depends on which samples land in the test set
  • Result depends on the random seed
  • Unstable for small datasets

8 of 14

When to Use?

  • Large datasets
  • Balanced classes
  • Baseline experiments

9 of 14

Introduction to K-Fold

  • Dataset split into K subsets (folds)
  • Train K times, each time testing on a different fold
  • Average the K performance scores
  • More robust evaluation
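The K-Fold procedure above can be sketched as follows, again with the standard library only. The names k_fold_indices and cross_validate are hypothetical helpers, and score_fn stands for any user-supplied function that trains on the training indices and returns a score on the test indices.

```python
import random

def k_fold_indices(n, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # K roughly equal subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n, k, score_fn, seed=42):
    """Run score_fn on each of the K splits and average the results."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores), scores

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))       # 8 2 on each of the 5 lines
```

Every sample is used for testing exactly once and for training K-1 times, which is why the averaged score depends less on one particular split.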

10 of 14

Random Split vs K-Fold

  • Random Split: Single evaluation
  • K-Fold: K evaluations, averaged
  • Random Split: Higher variance of the estimate
  • K-Fold: Lower variance of the estimate, at K times the training cost
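The variance difference can be explored with a small experiment. The sketch below is illustrative only: the synthetic 1-D dataset, the nearest-class-mean classifier, and all function names are assumptions chosen to keep the example self-contained. It repeats both methods over 30 seeds and prints the spread of the resulting accuracy estimates; one would typically expect the K-Fold spread to be smaller, but the exact numbers depend on the data.

```python
import random
import statistics

def make_data(n=60, seed=0):
    """Synthetic 1-D data (assumed): class 0 centred at 0, class 1 at 1.5."""
    rng = random.Random(seed)
    return [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]

def accuracy(train, test):
    """Fit a nearest-class-mean classifier on train, score it on test."""
    means = {c: statistics.mean(x for x, y in train if y == c) for c in (0, 1)}
    hits = sum(min(means, key=lambda c: abs(x - means[c])) == y for x, y in test)
    return hits / len(test)

def random_split_score(data, seed, test_ratio=0.2):
    """One accuracy estimate from a single random split."""
    d = list(data)
    random.Random(seed).shuffle(d)
    n_test = round(len(d) * test_ratio)
    return accuracy(d[n_test:], d[:n_test])

def k_fold_score(data, seed, k=5):
    """One accuracy estimate from K-Fold: the mean over K held-out folds."""
    d = list(data)
    random.Random(seed).shuffle(d)
    folds = [d[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(accuracy(train, folds[i]))
    return statistics.mean(scores)

data = make_data()
seeds = range(30)
rs = [random_split_score(data, s) for s in seeds]
kf = [k_fold_score(data, s) for s in seeds]
print("random split: mean=%.3f sd=%.3f" % (statistics.mean(rs), statistics.stdev(rs)))
print("5-fold:       mean=%.3f sd=%.3f" % (statistics.mean(kf), statistics.stdev(kf)))
```

The standard deviation across seeds is the quantity the slide calls variance: it measures how much the reported accuracy changes when only the random split changes.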

11 of 14

Practical Recommendations

  • Fix random_state (the seed) so the split is reproducible
  • Use stratification for classification
  • Report the split ratio in research
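The first two recommendations can be illustrated with a stratified split that takes an explicit seed. This is a minimal standard-library sketch; stratified_split and its parameters are hypothetical names. In scikit-learn, train_test_split exposes the corresponding random_state and stratify parameters.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps about the same share in both sets."""
    rng = random.Random(seed)                  # fixed seed -> reproducible split
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():              # split each class separately
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

labels = [0] * 80 + [1] * 20                   # imbalanced: 80% class 0, 20% class 1
train_idx, test_idx = stratified_split(labels, test_ratio=0.2, seed=42)
n0 = sum(labels[i] == 0 for i in test_idx)
n1 = sum(labels[i] == 1 for i in test_idx)
print(n0, n1)                                  # 16 4
```

The test set keeps the 80/20 class ratio of the full dataset, which a plain random split only achieves on average.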

12 of 14

Visual Illustration: Random Split (80/20 Example)

13 of 14

Comparison Chart: Random Split vs K-Fold Variance

14 of 14

Summary

  • Random Split is simple, but its estimate varies from split to split
  • K-Fold provides a more stable estimate
  • Choose the method based on dataset size
  • Always ensure reproducibility