HDBSCAN() Results
https://zoom.us/j/98764352308?pwd=M0cya1IyNDBqTHRIYnlvOVBlZFVzZz09
df0.labels.value_counts().iloc[:15]
min_cluster_size(default)102030501020050100120801002010080
min_samples(default)1020105030502001008012020100130
-1 44536   -1 42373   -1 43735   -1 41228   -1 42760   -1 42829   -1 44973   -1 44334   -1 43118   -1 43510   -1 42324   -1 43808
0 41661 -1 9016 2 107 1 105
-1 41389
302 21483 451642 294524 544213 60562 61981 49411 46652 62041 570012 40588 46659 6442
133 212100 20437 85136 5084 3274 7772 2692 6165 5712 51810 14189 61613 1021
169 16239 19141 47445 21618 2050 5043 2446 5044 3595 43914 63013 50412 388
177 15958 17016 28017 19115 1603 3674 2145 3370 3023 3295 48412 3376 317
103 14045 13939 18420 1912 1391 2140 1260 2213 2144 2149 3905 2210 209
124 13246 13718 18132 17017 1365 1224 1071 1210 1794 25611 1077 195
361 11520 12232 16837 1653 1093 1057 24810 1055 149
22 11119 12124 12234 15110 10713 1842 921 121
8 1073 10711 11523 1399 1066 1817 908 117
9 1052 1057 10725 13711 1050 1463 892 113
259 764 10523 10616 1277 1058 1421 7710 110
138 7113 828 10527 1260 1053 1156 753 107
270 7132 7833 10111 1221 831 1074 5511 106
282 7150 7522 9510 1216 782 1050 484 105
There are 50k samples. A cluster with fewer than 5k samples (10%) does not feel like a core idea to us, so min_cluster_size is set to 5k.
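The 10% rule of thumb above can be written down as a small calculation. This is only a sketch: the helper name `min_cluster_size_from_percent` and its `percent` argument are hypothetical and not part of HDBSCAN or of the original sheet.

```python
def min_cluster_size_from_percent(n_samples: int, percent: int = 10) -> int:
    """Hypothetical helper: smallest cluster we would still call a 'core idea'.

    Integer arithmetic is used so the result is exact (no float rounding).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # HDBSCAN needs min_cluster_size >= 2, so clamp very small results.
    return max(2, n_samples * percent // 100)


n_samples = 50_000  # sample count stated in the note above
min_cluster_size = min_cluster_size_from_percent(n_samples)
print(min_cluster_size)  # 5000
```

The resulting value would then be passed as the `min_cluster_size` argument when constructing the clusterer.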
A larger min_samples makes the clustering more conservative: more points end up labeled as noise (-1).
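One way to compare how conservative each run is would be to tally the noise share and the largest clusters per run, similar in spirit to `value_counts().iloc[:15]`. The sketch below uses only the standard library; `summarize_labels` is a hypothetical helper and the label list is made-up illustration data, not results from the sheet.

```python
from collections import Counter


def summarize_labels(labels, top=15):
    """Hypothetical helper: summarize one clustering run's labels.

    Returns the share of noise points (label -1), the number of real
    clusters, and the `top` most frequent labels with their counts.
    """
    counts = Counter(labels)
    n = len(labels)
    noise = counts.get(-1, 0)
    n_clusters = sum(1 for label in counts if label != -1)
    return {
        "noise_fraction": noise / n if n else 0.0,
        "n_clusters": n_clusters,
        "top": counts.most_common(top),
    }


# Made-up labels for illustration only: 6 noise points, two small clusters.
labels = [-1] * 6 + [0] * 3 + [1] * 1
summary = summarize_labels(labels)
print(summary["noise_fraction"], summary["n_clusters"])
```

Running this once per (min_cluster_size, min_samples) pair would give one comparable summary per column of the table.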