For each conversation, we use tSNE to reduce the dimensionality of the sample features from 338 to 2. That way, we can visualize the conversation in 2D.

The tSNE reduction is governed by the parameters PERPLEXITY, LEARNING_RATE, and EARLY_EXAGGERATION. We explore many different combinations of those parameters, but below we show only two cases per conversation.
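As a minimal sketch of a single reduction, the snippet below assumes scikit-learn's TSNE, which is a guess since the implementation is not named here. The feature matrix is random placeholder data, and the three parameter values are simply picked from the ranges that appear in the tables.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 300 samples with 338 features each (random, illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 338))

# Example values picked from the ranges that appear in the tables.
PERPLEXITY = 30.0
LEARNING_RATE = 300.0
EARLY_EXAGGERATION = 72.0

tsne = TSNE(
    n_components=2,
    perplexity=PERPLEXITY,
    learning_rate=LEARNING_RATE,
    early_exaggeration=EARLY_EXAGGERATION,
    random_state=0,
)
embedding = tsne.fit_transform(features)  # array of shape (300, 2)

# KL divergence at the end of the optimization (the "error" in the tables).
error = tsne.kl_divergence_
print(embedding.shape, error)
```

The 2D `embedding` can then be passed to any scatter-plot routine to visualize the conversation.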

For each conversation, we can choose the reduction (that is, its parameters) based on one of two criteria:

  • Least KL divergence (reported as "error" in the tables below)
  • A visual inspection, during which we look for clusters that are best separated (no KL error reported)
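The first criterion can be automated. The sketch below assumes scikit-learn's TSNE and its `kl_divergence_` attribute; the helper name `least_error_reduction`, the placeholder data, and the deliberately tiny grid are all hypothetical and only illustrate the selection step. The second criterion (visual inspection) has no code equivalent here.

```python
from itertools import product

import numpy as np
from sklearn.manifold import TSNE


def least_error_reduction(features, perplexities, learning_rates, exaggerations, seed=0):
    """Fit tSNE for every parameter combination and keep the lowest-KL result."""
    best = None
    for perplexity, learning_rate, exaggeration in product(
        perplexities, learning_rates, exaggerations
    ):
        tsne = TSNE(
            n_components=2,
            perplexity=perplexity,
            learning_rate=learning_rate,
            early_exaggeration=exaggeration,
            random_state=seed,
        )
        embedding = tsne.fit_transform(features)
        error = float(tsne.kl_divergence_)
        if best is None or error < best["error"]:
            best = {
                "perplexity": perplexity,
                "learning_rate": learning_rate,
                "early_exaggeration": exaggeration,
                "error": error,
                "embedding": embedding,
            }
    return best


# Placeholder data and a tiny grid (hypothetical values, illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(150, 338))
best = least_error_reduction(features, [30.0], [300.0, 400.0], [60.0, 72.0])
print(best["perplexity"], best["learning_rate"], best["early_exaggeration"], best["error"])
```

Note that KL values obtained at different perplexities are measured against different input distributions, so comparing them across perplexities is only a rough guide.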

For each of those cases, we use DBSCAN to perform clustering. To evaluate the clustering quality, we use the Silhouette Score. It ranges in [-1, 1], with higher values indicating better clustering. Values close to 0 indicate a lot of cluster overlap, which is what happens in our case.
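The clustering and scoring step could look roughly like the sketch below, assuming scikit-learn's DBSCAN and `silhouette_score`. The helper name `cluster_and_score` and the `eps` and `min_samples` values are hypothetical, since the DBSCAN settings are not stated here. Excluding DBSCAN noise points (label -1) before scoring is also an assumption of this sketch, not necessarily what was done for the tables.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def cluster_and_score(embedding, eps=0.5, min_samples=5):
    """Cluster a 2D embedding with DBSCAN; return (labels, silhouette or None)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)
    mask = labels != -1  # DBSCAN marks noise points with the label -1
    n_clusters = len(set(labels[mask].tolist()))
    # The Silhouette Score is only defined for 2 <= n_clusters <= n_samples - 1.
    if n_clusters < 2 or n_clusters >= int(mask.sum()):
        return labels, None
    # Assumption: noise points are left out of the score.
    return labels, float(silhouette_score(embedding[mask], labels[mask]))


# Toy embedding: two tight, well-separated blobs (illustration only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
embedding = np.vstack([blob_a, blob_b])

labels, score = cluster_and_score(embedding, eps=0.5, min_samples=5)
print(sorted(set(labels.tolist())), score)
```

On this toy data the two blobs do not overlap, so the score is close to 1. Overlapping clusters, as in our conversations, give scores near 0.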

After each table, there is a plot of the Silhouette Score for each conversation and for each chosen tSNE reduction. In general the scores are not very high, indicating cluster overlap. However, the reductions that we chose visually, based on how separable the clusters seemed to us, generally achieve higher Silhouette Scores (in 13 of the 22 GECO conversations, with 2 ties). This supports the idea that the KL divergence reported by tSNE isn't necessarily a good indication of the quality of the dimensionality reduction. It also suggests that it would be difficult to create an unsupervised pipeline of dimensionality reduction -> clustering.

For the keynote, we also report the DBCV score of the clustering. It also ranges in [-1, 1], with higher values denoting better clustering, but here too the clustering does not score very well. We were not able to generate the DBCV scores for GECO, since the computation takes a prohibitively long time.

GECO

GECO consists of 22 conversations, each with a different number of samples.

In the last table row, we show the parameter values for the tSNE dimensionality reduction performed on the samples of all the conversations together. We did not perform clustering of any kind in that case, and thus the silhouette score is missing.

dataset conversation samples visually_best-perplexity visually_best-learning_rate visually_best-early_exaggeration visually_best-error visually_best-silhouette_score least_error-perplexity least_error-learning_rate least_error-early_exaggeration least_error-error least_error-silhouette_score
geco A-C 7305 30 500 96 n/a 0.12697070837 30 300 72 3.18630480766 0.17529582647
geco A-K 6082 30 500 60 n/a 0.0958508899303 30 300 60 3.1814520359 0.094064527546
geco B-A 7399 30 500 84 n/a 0.136919530881 30 300 72 1.92834278285 0.1228764784
geco B-M 6849 30 400 84 n/a 0.0704248756905 30 300 72 3.38323879242 0.033136514687
geco C-E 7207 30 600 72 n/a 0.180117599958 30 300 60 3.31098413467 0.139214125532
geco C-F 6785 30 600 84 n/a 0.035997793531 30 300 60 3.02890968323 0.124499953377
geco D-G 7474 30 400 72 n/a 0.199116441582 30 300 72 3.41262435913 0.149164377931
geco D-L 7132 30 400 72 n/a 0.129629677843 30 300 60 3.27400302887 0.0990507352624
geco E-I 7600 30 600 96 n/a 0.16424023702 30 300 60 3.29113650322 0.177465500015
geco E-J 6571 30 400 72 n/a 0.201191280157 30 300 72 3.23993587494 0.139621174158
geco F-E 6398 30 500 84 n/a 0.173880558222 30 300 84 3.31300210953 0.107269947021
geco F-J 7119 30 300 72 n/a 0.0365933080195 30 300 72 3.26936674118 0.0365933080195
geco G-B 6359 30 500 84 n/a 0.139788398018 30 300 72 3.08002882504 0.113393778441
geco G-L 6653 30 400 96 n/a 0.100282666395 30 300 84 3.13517452804 0.0618217291316
geco J-D 7203 30 400 72 n/a 0.113432025187 30 400 72 3.08036528811 0.113432025187
geco J-I 6959 30 300 96 n/a 0.181400558249 30 400 72 3.12636015273 0.0918031080127
geco K-C 7029 30 300 96 n/a 0.0655813024315 30 300 72 3.18346490322 0.127871875407
geco K-F 6784 30 600 72 n/a -0.00341078136991 30 300 72 3.02144397585 0.0291772637315
geco L-B 7094 30 500 72 n/a 0.167543907436 30 300 72 3.21274310312 0.121965833939
geco L-M 6820 30 500 72 n/a 0.234679457854 30 300 72 3.2197365993 0.17567878998
geco M-A 7032 30 400 96 n/a 0.103079439092 30 400 72 3.15585459005 0.156329180833
geco M-K 6646 30 400 96 n/a 0.120475897323 30 300 72 3.1814139377 0.127679601312
geco all-conversations 152500 50 400 72 n/a n/a 50 300 72 1.59171654445 n/a
Silhouette scores for GECO

KEYNOTE

The keynote was analyzed as a whole.

dataset conversation samples visually_best-perplexity visually_best-learning_rate visually_best-early_exaggeration visually_best-error visually_best-silhouette_score least_error-perplexity least_error-learning_rate least_error-early_exaggeration least_error-error least_error-silhouette_score
keynote keynote 919 30 600 84 n/a 0.0733729339157 30 500 60 1.45440489433 0.0935016939854
Silhouette scores for keynote

DBCV scores for keynote