Commit

update results with full list of topics
jrouly committed Dec 9, 2018
1 parent 5cb01bd commit c523950
Showing 2 changed files with 19 additions and 8 deletions.
13 changes: 9 additions & 4 deletions paper/sections/discussion.tex
@@ -1,16 +1,21 @@
\section{Discussion}

- Perhaps the most striking result is that only four topics of five hundred total can be considered strictly overlapping.
- An additional eighteen topics (for a total of twenty-two) can be considered weakly overlapping for the same threshold of 1\%.
+ Perhaps the most striking result is that only ten topics of five hundred total can be considered strictly overlapping.
+ That is, there is only two percent overlap.
+ An additional twenty-four topics (for a total of thirty-four) can be considered ``weakly overlapping'' by relaxing the $\theta$ constraint to only one $D_j$ rather than all.
+ However, this still only gives a Jaccard similarity coefficient of 0.068.
+ There are several ways to interpret this.

The most direct reading of the data indicates, simply, that universities are not teaching to what the workforce expects.
Of the five hundred extracted topics, some are expressed by curricular data, some are expressed by career data, but the degree of overlap is low.
- The Jaccard index of the two sets is only 0.02, indicating a fairly significant dissimilarity.
+ The low Jaccard index of the two sets indicates a fairly significant dissimilarity.

However, perhaps the most probable interpretation is that the number of extracted topics was simply significantly too low.
Considering that almost fifty thousand documents were modeled across the two corpora, and considering that any given document can express many different topics, five hundred topics is simply too low a number.
- Increasing that count to several hundred thousand would most likely garner significantly higher levels of overlap.
+ The LDA process is storage and memory bound.
+ Increasing the number of topics significantly beyond five hundred would require significant computational resources.
+ This is a future goal for this work.

One future extension to this work is the systematic analysis of the hyperparameters.
Specifically: the number of inferred topics ($k$), relevance ($\rho$), and intersection ($\theta$).
The author hypothesizes that an optimal maximum Jaccard index can be obtained most directly by increasing $k$ beyond five hundred.
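
As a quick arithmetic check on the figures quoted in this section, the sketch below computes the ratio of overlapping topics to total topics, which is how the results section arrives at 0.02 for ten of five hundred topics. The same ratio is shown for thirty-four topics for comparison. The function name `jaccard_from_counts` is hypothetical, and treating all five hundred topics as the union of the two sets is an assumption carried over from the text.

```python
def jaccard_from_counts(overlapping: int, total: int) -> float:
    """Return |A ∩ B| / |A ∪ B| given the two set sizes.

    Assumption: `total` is the size of the union of both topic sets,
    here taken to be all topics extracted by the LDA run.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= overlapping <= total:
        raise ValueError("overlapping must lie between 0 and total")
    return overlapping / total


strict = jaccard_from_counts(10, 500)  # ten strictly overlapping topics
weak = jaccard_from_counts(34, 500)    # thirty-four weakly overlapping topics
print(f"strict: {strict:.3f}, weak: {weak:.3f}")  # strict: 0.020, weak: 0.068
```

Both values sit well below 0.1, which is in line with the section's reading that the overlap is low under either definition.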
14 changes: 10 additions & 4 deletions paper/sections/results.tex
@@ -55,9 +55,9 @@ \subsection{Domain Overlap}
From an LDA run with five hundred topics, ten topics were identified as ``strictly overlapping'' for certain thresholds.
Namely, $|\Omega(\rho=0.02, \theta=0.055)| = 10$.
For a set of five hundred total topics, this produces a Jaccard index of 0.02.
- Table~\ref{tab:overlap} displays some of the strictly overlapping topics for the highest possible overlap threshold.
- Notice that the topics in Table~\ref{tab:topics} are the same as in Table~\ref{tab:overlap}.
- Figure~\ref{fig:overlap} gives a visual reference of the complete strictly overlapping set of topics.
+ Table~\ref{tab:overlap} displays the ten strictly overlapping topics in $\Omega(\rho=0.02, \theta=0.055)$.
+ Notice that the topics in Table~\ref{tab:topics} are a subset of the topics rendered in Table~\ref{tab:overlap}.
+ Figure~\ref{fig:overlap} gives an additional visual reference of the complete strictly overlapping set of topics.
The closer the two bars are to equal for a given topic, the more similar that topic's distribution is within each data set.


@@ -70,8 +70,14 @@ \subsection{Domain Overlap}
\hline
16 & 9,274 & 23.43 & 568 & 9.77 \\
273 & 8,873 & 22.42 & 583 & 10.02 \\
+ 274 & 2,240 & 5.66 & 1,117 & 19.21 \\
+ 345 & 2,676 & 6.76 & 543 & 9.34 \\
+ 390 & 3,130 & 7.91 & 356 & 6.12 \\
  418 & 3,069 & 7.75 & 426 & 7.32 \\
- 424 & 5,561 & 14.05 & 499 & 8.58
+ 419 & 2,590 & 6.54 & 553 & 9.51 \\
+ 424 & 5,561 & 14.05 & 499 & 8.58 \\
+ 425 & 11,807 & 29.83 & 388 & 6.67 \\
+ 434 & 3,793 & 9.58 & 349 & 6.00
\end{tabular}
\caption{Strictly overlapping topics for $\rho=0.02$ and $\theta=0.055$}~\label{tab:overlap}
\end{table}
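
One way to read the set $\Omega(\rho, \theta)$ used in this hunk is sketched below. It is not the author's implementation, and the diff does not show the formal definition. The sketch assumes that $\rho$ is a minimum per-document topic weight and that $\theta$ is a minimum share of documents in a corpus expressing the topic. The second assumption is at least consistent with every percentage in the table being above 5.5. The names `Corpus`, `topic_share`, and `overlapping_topics` are hypothetical, as is the toy data.

```python
from typing import Dict, List, Set

# Assumed representation: a corpus is a list of documents, and a document
# maps topic ids to weights taken from its LDA topic distribution.
Corpus = List[Dict[int, float]]


def topic_share(corpus: Corpus, topic: int, rho: float) -> float:
    """Fraction of documents whose weight for `topic` is at least `rho`."""
    if not corpus:
        return 0.0
    hits = sum(1 for doc in corpus if doc.get(topic, 0.0) >= rho)
    return hits / len(corpus)


def overlapping_topics(
    corpora: List[Corpus],
    num_topics: int,
    rho: float,
    theta: float,
    strict: bool = True,
) -> Set[int]:
    """Topics whose document share reaches `theta` in the corpora.

    strict=True requires the threshold in every corpus; strict=False
    relaxes it to at least one corpus, following the wording of the
    discussion section.
    """
    combine = all if strict else any
    return {
        t
        for t in range(num_topics)
        if combine(topic_share(c, t, rho) >= theta for c in corpora)
    }


# Toy data only; these are not values from the paper.
curricular = [{0: 0.5, 1: 0.1}, {0: 0.3}, {1: 0.9}, {0: 0.05, 2: 0.4}]
career = [{0: 0.6}, {2: 0.7}, {0: 0.2, 2: 0.1}]

strict_set = overlapping_topics([curricular, career], 3, rho=0.1, theta=0.5)
weak_set = overlapping_topics([curricular, career], 3, rho=0.1, theta=0.5, strict=False)
print(sorted(strict_set), sorted(weak_set))  # [0] [0, 1, 2]
```

Under this reading, dividing the size of the returned set by the total number of topics gives the ratio that the text reports as the Jaccard index.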