update results with full list of topics

jrouly · Dec 9, 2018 · c523950
1 parent 5cb01bd
commit c523950

Showing 2 changed files with 19 additions and 8 deletions.
diff --git a/paper/sections/discussion.tex b/paper/sections/discussion.tex
@@ -1,16 +1,21 @@
 \section{Discussion}
 
-Perhaps the most striking result is that only four topics of five hundred total can be considered strictly overlapping.
-An additional eighteen topics (for a total of twenty two) can be considered weakly overlapping for the same threshold of 1\%.
+Perhaps the most striking result is that only ten topics of five hundred total can be considered strictly overlapping.
+That is, there is only two percent overlap.
+An additional twenty-four topics (for a total of thirty-four) can be considered ``weakly overlapping'' by relaxing the $\theta$ constraint to only one $D_j$ rather than all.
+However, this still only gives a Jaccard similarity coefficient of 0.068.
 There are several ways to interpret this.
 
 The most direct reading of the data indicates, simply, that universities are not teaching to what the workforce expects.
 Of the five hundred extracted topics, some are expressed by curricular data, some are expressed by career data, but the degree of overlap is low.
-The Jaccard index of the two sets is only 0.02, indicating a fairly significant dissimilarity.
+The low Jaccard index of the two sets indicates a fairly significant dissimilarity.
 
 However, perhaps the most probable interpretation is that the number of extracted topics was simply significantly too low.
 Considering that almost fifty thousand documents were modeled across the two corpora, and considering that any given document can express many different topics, five hundred topics is simply too low a number.
 Increasing that count to several hundred thousand would most likely garner significantly higher levels of overlap.
 The LDA process is storage and memory bound.
 Increasing the number of topics significantly beyond five hundred would require significant computational resources.
-This is a future goal for this work.
+
+One future extension to this work is a systematic analysis of the hyperparameters:
+the number of inferred topics ($k$), relevance ($\rho$), and intersection ($\theta$).
+The author hypothesizes that the Jaccard index can be raised most directly by increasing $k$ beyond five hundred.
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
@@ -55,9 +55,9 @@ \subsection{Domain Overlap}
 From an LDA run with five hundred topics, ten topics were identified as ``strictly overlapping'' for certain thresholds.
 Namely, $|\Omega(\rho=0.02, \theta=0.055)| = 10$.
 For a set of five hundred total topics, this produces a Jaccard index of 0.02.
-Table~\ref{tab:overlap} displays some of the strictly overlapping topics for the highest possible overlap threshold.
-Notice that the topics in Table~\ref{tab:topics} are the same as in Table~\ref{tab:overlap}.
-Figure~\ref{fig:overlap} gives a visual reference of the complete strictly overlapping set of topics.
+Table~\ref{tab:overlap} displays the ten strictly overlapping topics in $\Omega(\rho=0.02, \theta=0.055)$.
+Notice that the topics in Table~\ref{tab:topics} are a subset of the topics rendered in Table~\ref{tab:overlap}.
+Figure~\ref{fig:overlap} gives an additional visual reference of the complete strictly overlapping set of topics.
 The closer the two bars are to equal for a given topic, the more similar that topic's distribution is within each data set.
 
 
@@ -70,8 +70,14 @@ \subsection{Domain Overlap}
     \hline
     16 & 9,274 & 23.43 & 568 & 9.77 \\
     273 & 8,873 & 22.42 & 583 & 10.02 \\
+    274 & 2,240 & 5.66 & 1,117 & 19.21 \\
+    345 & 2,676 & 6.76 & 543 & 9.34 \\
+    390 & 3,130 & 7.91 & 356 & 6.12 \\
     418 & 3,069 & 7.75 & 426 & 7.32 \\
-    424 & 5,561 & 14.05 & 499 & 8.58
+    419 & 2,590 & 6.54 & 553 & 9.51 \\
+    424 & 5,561 & 14.05 & 499 & 8.58 \\
+    425 & 11,807 & 29.83 & 388 & 6.67 \\
+    434 & 3,793 & 9.58 & 349 & 6.00
   \end{tabular}
   \caption{Strictly overlapping topics for $\rho=0.02$ and $\theta=0.055$}~\label{tab:overlap}
 \end{table}
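
A quick way to sanity-check the Jaccard figures quoted in this diff is sketched below. It assumes, as the results text suggests ("for a set of five hundred total topics, this produces a Jaccard index of 0.02"), that the union of the two domains' topics is the full set of five hundred, so the index reduces to the overlap count divided by the total. The function names are hypothetical and are not taken from the paper's code.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_from_counts(overlapping: int, total: int) -> float:
    """Jaccard index when only the overlap size and the union size are known.

    Assumption: `total` is the size of the union (here, all extracted topics).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= overlapping <= total:
        raise ValueError("overlapping must lie between 0 and total")
    return overlapping / total


if __name__ == "__main__":
    # Ten strictly overlapping topics out of five hundred.
    print(round(jaccard_from_counts(10, 500), 3))  # 0.02
    # Thirty-four weakly overlapping topics out of five hundred.
    print(round(jaccard_from_counts(34, 500), 3))  # 0.068
```

Under this convention the strict case reproduces the 0.02 reported in results.tex, and the thirty-four weakly overlapping topics would give 34/500 = 0.068.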