update results with full list of topics
Showing 2 changed files with 19 additions and 8 deletions.
@@ -1,16 +1,21 @@
 \section{Discussion}

-Perhaps the most striking result is that only four topics of five hundred total can be considered strictly overlapping.
-An additional eighteen topics (for a total of twenty two) can be considered weakly overlapping for the same threshold of 1\%.
+Perhaps the most striking result is that only ten topics of five hundred total can be considered strictly overlapping.
+That is, there is only two percent overlap.
+An additional twenty-four topics (for a total of thirty-four) can be considered ``weakly overlapping'' by relaxing the $\theta$ constraint to only one $D_j$ rather than all.
+However, this still only gives a Jaccard similarity coefficient of 0.68.
 There are several ways to interpret this.

 The most direct reading of the data indicates, simply, that universities are not teaching to what the workforce expects.
 Of the five hundred extracted topics, some are expressed by curricular data, some are expressed by career data, but the degree of overlap is low.
-The Jaccard index of the two sets is only 0.02, indicating a fairly significant dissimilarity.
+The low Jaccard index of the two sets indicates a fairly significant dissimilarity.

 However, perhaps the most probable interpretation is that the number of extracted topics was simply significantly too low.
 Considering that almost fifty thousand documents were modeled across the two corpora, and considering that any given document can express many different topics, five hundred topics is simply too low a number.
-Increasing that count to several hundred thousand would most likely garner significantly higher levels of overlap.
 The LDA process is storage and memory bound.
 Increasing the number of topics significantly beyond five hundred would require significant computational resources.
 This is a future goal for this work.
+
+One future extension to this work is the systematic analysis of the hyperparameters.
+Specifically: the number of inferred topics ($k$), relevance ($\rho$), and intersection ($\theta$).
+The author hypothesizes that an optimal maximum Jaccard index can be obtained most directly by increasing $k$ beyond five hundred.
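
The overlap figures quoted in the hunk can be checked with a short calculation. The sketch below is illustrative only and is not taken from the repository: the helper names `jaccard` and `jaccard_from_counts` are hypothetical, and it assumes that the union of the curricular and career topic sets is the full set of five hundred extracted topics. Under that assumption the strict case gives 0.02, matching the "two percent overlap" line, while the weak case gives 0.068 rather than the 0.68 that appears in the added line, so either the assumption does not hold for the paper's definition or a digit was dropped.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_from_counts(overlapping: int, union_size: int) -> float:
    """Jaccard index from counts alone: shared topics over topics in the union.

    Assumption (not confirmed by the source): every extracted topic belongs
    to at least one of the two corpora, so union_size equals the topic count.
    """
    if union_size <= 0:
        raise ValueError("union_size must be positive")
    if not 0 <= overlapping <= union_size:
        raise ValueError("overlapping must lie between 0 and union_size")
    return overlapping / union_size


# Counts quoted in the Discussion section of the diff.
TOTAL_TOPICS = 500
strict = jaccard_from_counts(10, TOTAL_TOPICS)  # strictly overlapping topics
weak = jaccard_from_counts(34, TOTAL_TOPICS)    # strict plus weakly overlapping

print(f"strict: {strict:.3f}, weak: {weak:.3f}")
```

The set-based `jaccard` is the form to use if the per-corpus topic sets are available; the count-based variant only reproduces the arithmetic from the numbers stated in the text.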