For each conversation, we use tSNE to reduce the dimensionality of the sample features from 338 to 2. That way, we can visualize the conversation in 2D.
The tSNE reduction is governed by the parameters PERPLEXITY, LEARNING_RATE, and EARLY_EXAGGERATION. We explored many different combinations of those parameters, but below we show only two cases per conversation.
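As an illustration of such a parameter sweep, here is a minimal sketch that assumes scikit-learn's `TSNE` (the text does not say which implementation was used). The helper name `reduce_with_grid`, the synthetic data, and the small grid are hypothetical. The `error` stored for each run is the KL divergence that scikit-learn exposes as `kl_divergence_`.

```python
from itertools import product

import numpy as np
from sklearn.manifold import TSNE


def reduce_with_grid(features, perplexities, learning_rates, exaggerations, seed=0):
    """Run t-SNE once per parameter combination and collect the results."""
    results = []
    for perplexity, learning_rate, exaggeration in product(
        perplexities, learning_rates, exaggerations
    ):
        tsne = TSNE(
            n_components=2,
            perplexity=perplexity,
            learning_rate=learning_rate,
            early_exaggeration=exaggeration,
            random_state=seed,
        )
        embedding = tsne.fit_transform(features)
        results.append(
            {
                "perplexity": perplexity,
                "learning_rate": learning_rate,
                "early_exaggeration": exaggeration,
                "error": float(tsne.kl_divergence_),  # KL divergence after optimization
                "embedding": embedding,
            }
        )
    return results


# Synthetic stand-in for one conversation: 120 samples with 338 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 338))

results = reduce_with_grid(features, [30], [300, 500], [72])

# One possible automatic criterion: keep the run with the lowest KL divergence.
least_error = min(results, key=lambda r: r["error"])
```

Each entry in `results` keeps its 2D embedding, so the same runs can be plotted and compared by eye as well as by error.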
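For each conversation, we choose a reduction (that is, a set of parameter values) based on two criteria: the reduction whose 2D plot looked most separable to us (visually_best), and the reduction with the lowest error reported by tSNE (least_error).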
For each of those two cases, we use DBSCAN to cluster the reduced samples. To evaluate clustering quality, we use the Silhouette Score. It ranges over [-1, 1], and higher values are better. Values close to 0 indicate a lot of cluster overlap, which is what happens in our case.
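To make the [-1, 1] range concrete, here is a minimal pure-Python sketch of the Silhouette Score using Euclidean distance. It is meant only to illustrate the definition and is not the implementation used for the tables below. The function name and the toy points are hypothetical. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the point's score is `(b - a) / max(a, b)`.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so dividing by len - 1 is correct)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(point, q) for q in other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Two compact, well-separated clusters: the score is close to 1.
separated_score = silhouette_score([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])

# Two interleaved clusters on a line: the score drops below 0.
mixed_score = silhouette_score([(0,), (1,), (2,), (3,)], [0, 1, 0, 1])
```

When `a` and `b` are about equal, a point is as close to a neighbouring cluster as to its own, which is why scores near 0 are read as overlap.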
After each table, there is a plot of the Silhouette Score for each conversation and for each tSNE reduction chosen. In general the scores are not very high, which indicates cluster overlap. However, the reductions we chose visually, based on how separable they looked to us, generally achieve higher Silhouette Scores. This supports the idea that the KL divergence reported by tSNE isn't necessarily a good indicator of the quality of the dimensionality reduction. It also suggests that it would be difficult to build an unsupervised pipeline of dimensionality reduction -> clustering.
For the keynote, we also report the DBCV score of the clustering. It also ranges over [-1, 1], with higher values denoting better clustering, but in that case too the clustering doesn't score very well. We were not able to generate DBCV scores for GECO, since the computation takes an impractically long time.
GECO consists of 22 conversations, each with a different number of samples.
In the last table row, we show the parameter values for the tSNE dimensionality reduction performed on the samples of all the conversations together. We did not perform clustering of any kind in that case, so the silhouette score is missing.
dataset | conversation | samples | visually_best-perplexity | visually_best-learning_rate | visually_best-early_exaggeration | visually_best-error | visually_best-silhouette_score | least_error-perplexity | least_error-learning_rate | least_error-early_exaggeration | least_error-error | least_error-silhouette_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
geco | A-C | 7305 | 30 | 500 | 96 | | 0.12697070837 | 30 | 300 | 72 | 3.18630480766 | 0.17529582647 |
geco | A-K | 6082 | 30 | 500 | 60 | | 0.0958508899303 | 30 | 300 | 60 | 3.1814520359 | 0.094064527546 |
geco | B-A | 7399 | 30 | 500 | 84 | | 0.136919530881 | 30 | 300 | 72 | 1.92834278285 | 0.1228764784 |
geco | B-M | 6849 | 30 | 400 | 84 | | 0.0704248756905 | 30 | 300 | 72 | 3.38323879242 | 0.033136514687 |
geco | C-E | 7207 | 30 | 600 | 72 | | 0.180117599958 | 30 | 300 | 60 | 3.31098413467 | 0.139214125532 |
geco | C-F | 6785 | 30 | 600 | 84 | | 0.035997793531 | 30 | 300 | 60 | 3.02890968323 | 0.124499953377 |
geco | D-G | 7474 | 30 | 400 | 72 | | 0.199116441582 | 30 | 300 | 72 | 3.41262435913 | 0.149164377931 |
geco | D-L | 7132 | 30 | 400 | 72 | | 0.129629677843 | 30 | 300 | 60 | 3.27400302887 | 0.0990507352624 |
geco | E-I | 7600 | 30 | 600 | 96 | | 0.16424023702 | 30 | 300 | 60 | 3.29113650322 | 0.177465500015 |
geco | E-J | 6571 | 30 | 400 | 72 | | 0.201191280157 | 30 | 300 | 72 | 3.23993587494 | 0.139621174158 |
geco | F-E | 6398 | 30 | 500 | 84 | | 0.173880558222 | 30 | 300 | 84 | 3.31300210953 | 0.107269947021 |
geco | F-J | 7119 | 30 | 300 | 72 | | 0.0365933080195 | 30 | 300 | 72 | 3.26936674118 | 0.0365933080195 |
geco | G-B | 6359 | 30 | 500 | 84 | | 0.139788398018 | 30 | 300 | 72 | 3.08002882504 | 0.113393778441 |
geco | G-L | 6653 | 30 | 400 | 96 | | 0.100282666395 | 30 | 300 | 84 | 3.13517452804 | 0.0618217291316 |
geco | J-D | 7203 | 30 | 400 | 72 | | 0.113432025187 | 30 | 400 | 72 | 3.08036528811 | 0.113432025187 |
geco | J-I | 6959 | 30 | 300 | 96 | | 0.181400558249 | 30 | 400 | 72 | 3.12636015273 | 0.0918031080127 |
geco | K-C | 7029 | 30 | 300 | 96 | | 0.0655813024315 | 30 | 300 | 72 | 3.18346490322 | 0.127871875407 |
geco | K-F | 6784 | 30 | 600 | 72 | | -0.00341078136991 | 30 | 300 | 72 | 3.02144397585 | 0.0291772637315 |
geco | L-B | 7094 | 30 | 500 | 72 | | 0.167543907436 | 30 | 300 | 72 | 3.21274310312 | 0.121965833939 |
geco | L-M | 6820 | 30 | 500 | 72 | | 0.234679457854 | 30 | 300 | 72 | 3.2197365993 | 0.17567878998 |
geco | M-A | 7032 | 30 | 400 | 96 | | 0.103079439092 | 30 | 400 | 72 | 3.15585459005 | 0.156329180833 |
geco | M-K | 6646 | 30 | 400 | 96 | | 0.120475897323 | 30 | 300 | 72 | 3.1814139377 | 0.127679601312 |
geco | all-conversations | 152500 | 50 | 400 | 72 | | | 50 | 300 | 72 | 1.59171654445 | |
The keynote was analyzed as a whole.
dataset | conversation | samples | visually_best-perplexity | visually_best-learning_rate | visually_best-early_exaggeration | visually_best-error | visually_best-silhouette_score | least_error-perplexity | least_error-learning_rate | least_error-early_exaggeration | least_error-error | least_error-silhouette_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
keynote | keynote | 919 | 30 | 600 | 84 | | 0.0733729339157 | 30 | 500 | 60 | 1.45440489433 | 0.0935016939854 |