Topic Modeling
Before starting topic modeling, the text was prepared by lemmatizing the words. Originally the Porter stemmer was used, but it was deemed too aggressive in cutting off word endings: the topic modeling results became hard to decipher because it could not be determined what a word fragment actually represented. The WordNet Lemmatizer (WNL) was used instead, and the results were more intelligible. Another preparatory step was to turn each paragraph of each speech into its own .txt file. Once these steps were completed, topic modeling could begin.
MALLET was the tool used for topic modeling, following the tutorial from The Programming Historian. MALLET was run from 5 to 25 topics (in increments of 1) to see which setting would produce the most useful results. At 5 or 6 topics, it is still possible to see what some of the topics are, but they seem too broad; some of the topics that were grouped together would probably have been split into separate topics if they had been tagged manually. At 25 topics, the topics became much more specific.
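For reference, a typical MALLET run of the kind described above (following the Programming Historian tutorial) looks something like the following; the directory and file names here are placeholders, not this project's actual paths:

```shell
# Import the directory of per-paragraph .txt files into MALLET's
# internal format, keeping token order and removing English stopwords.
bin/mallet import-dir --input paragraphs/ --output speeches.mallet \
    --keep-sequence --remove-stopwords

# Train a topic model; --num-topics was varied from 5 to 25 here.
bin/mallet train-topics --input speeches.mallet --num-topics 14 \
    --optimize-interval 20 \
    --output-topic-keys speeches_keys.txt \
    --output-doc-topics speeches_composition.txt
```

The `--output-topic-keys` file holds the key words per topic shown in the tables below, and `--output-doc-topics` gives the topic proportions per paragraph file.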
When looking at the topic keys, some of them seem somewhat similar, and it was not clear why they might be separate topics. For example, at 14 topics, topics 7 and 8 can both be interpreted as "money" topics. Here are the key words:
Topic # | Key Words |
---|---|
7 | year wa fiscal expenditure revenue june silver number treasury total increase pension government amount receipt day cent sum money |
8 | job american year tax business work ve family make time reform million home school economy tonight america back deficit |
Compare the following sections of text indicated to belong to topics 7 and 8:
Topic 7 | Topic 7 | Topic 8 |
---|---|---|
Washington Sotu (1796) P:31 | Lincoln SotU (1861) P:43 | Obama SotU (2013) P:10 |
I have directed an estimate of the appropriations necessary for the service of the ensuing year to be submitted from the proper Department, with a view of the public receipts and expenditures to the latest period to which an account can be prepared. | The revenue from all sources during the fiscal year ending June 30, 1861, including the annual permanent appropriation of $700,000 for the transportation of "free mail matter," was $9,049,296.40, being about 2 per cent less than the revenue for 1860. | Over the last few years, both parties have worked together to reduce the deficit by more than $2.5 trillion, mostly through spending cuts, but also by raising tax rates on the wealthiest 1 percent of Americans. As a result, we are more than halfway towards the goal of $4 trillion in deficit reduction that economists say we need to stabilize our finances. |
If these paragraphs had been tagged manually, they would perhaps all have been attributed to the same topic, such as "money" or "budget". But these three paragraphs were consistently separated into two categories regardless of the total number of topics. By looking at some surrounding paragraphs that were assigned to the same categories, it becomes clearer why they might have been categorized the way they were:
Topic 7 | Topic 8 |
---|---|
Lincoln SotU (1861) P:43, 44, 45, 46 | Obama SotU (2013) P:10, 12, 13 |
The revenue from all sources during the fiscal year ending June 30, 1861, including the annual permanent appropriation of $700,000 for the transportation of "free mail matter," was $9,049,296.40, being about 2 per cent less than the revenue for 1860. The expenditures were $13,606,759.11, showing a decrease of more than 8 per cent as compared with those of the previous year and leaving an excess of expenditure over the revenue for the last fiscal year of $4,557,462.71. The gross revenue for the year ending June 30, 1863, is estimated at an increase of 4 per cent on that of 1861, making $8,683,000, to which should be added the earnings of the Department in carrying free matter, viz, $700,000, making $9,383,000. The total expenditures for 1863 are estimated at $12,528,000, leaving an estimated deficiency of $3,145,000 to be supplied from the Treasury in addition to the permanent appropriation. | Over the last few years, both parties have worked together to reduce the deficit by more than $2.5 trillion, mostly through spending cuts, but also by raising tax rates on the wealthiest 1 percent of Americans. As a result, we are more than halfway towards the goal of $4 trillion in deficit reduction that economists say we need to stabilize our finances. In 2011, Congress passed a law saying that if both parties couldn't agree on a plan to reach our deficit goal, about a trillion dollars' worth of budget cuts would automatically go into effect this year. These sudden, harsh, arbitrary cuts would jeopardize our military readiness. They'd devastate priorities like education and energy and medical research. They would certainly slow our recovery and cost us hundreds of thousands of jobs. And that's why Democrats, Republicans, business leaders, and economists have already said that these cuts—known here in Washington as the sequester—are a really bad idea. Now, some in Congress have proposed preventing only the defense cuts by making even bigger cuts to things like education and job training, Medicare, and Social Security benefits. That idea is even worse. |
Topic 8 perhaps combines a "politics" component with the "money" or "budget" part. It might also be worth noting that topics 7 and 8 seem to come from two different time periods: topic 7 appears before FDR and topic 8 appears after. By looking at the topic modeling data and examining possible variables, comparisons can be made based on some of these variables.
Grouping Topics Together
After looking over the topic modeling results with various numbers of topics (from 5 to 25), it was decided to use the results for 14 topics. These topics were further grouped into four super-topics, plus one "other" category. The four super-topics can be roughly labeled "War/Peace", "Economy", "Law", and "Sovereignty". Below are the topic key words for each group:
War/Peace
Topic # | Key Words |
---|---|
1 | vessel gun completed ship navy inch board defense construction work cruiser made battle contract building year plan torpedo mortar |
3 | ha country national nation war military great state service time present made public citizen progress le success result improvement |
4 | nation world people peace ha war america great life time make year american thing men future freedom power purpose |
13 | war wa force enemy army men troop soldier year german naval officer british navy day number ha attack nurse |
Economy
Topic # | Key Words |
---|---|
7 | year wa fiscal expenditure revenue june silver number treasury total increase pension government amount receipt day cent sum money |
8 | job american year tax business work ve family make time reform million home school economy tonight america back deficit |
10 | energy market industry american america product country job clean farmer price natural production city good program science health enterprise |
Law
Topic # | Key Words |
---|---|
6 | people government public interest duty citizen justice free principle good party law individual american trust country advantage political constitution |
9 | congress government law ha department present report service condition general consideration subject attention time secretary made legislation provision act |
12 | law state court united case labor constitution supreme slave criminal capital class offense judge question chinese circuit violation number |
Sovereignty
Topic # | Key Words |
---|---|
0 | indian land reservation tribe civilization acre school peace portion made territory civilized savage effort treatment great turkish turkey area |
5 | state united government ha treaty wa country american citizen power foreign britain authority claim commerce relation question vessel made |
11 | state wa union insurgent territory island line spain part insurrection pacific large election resource cuba capital kentucky railway route |
The remaining topic appears to come mostly from the greetings and salutation sections of the speeches, so it was not included in the remaining calculations.
Topic # | Key Words |
---|---|
2 | god fellow president house mr representative voice country man men oath earth senate day destiny citizens woman member heart |
Dynamic Index with PHP
The speech pages include an Index panel with links to every speech. This index could be created simply by writing the HTML by hand, but that is a time-consuming task when the corpus includes hundreds of files, and it becomes an issue again whenever new documents are added to the corpus. One way to resolve this is to generate the index dynamically with PHP.
The method has two parts. The first is to create an array that stores the titles of each document. The second is to output the values stored in the array and wrap them in HTML. The .php file can be seen here. The first part of the PHP was originally written by Dr. Birnbaum. This section creates an array called "presidents", keyed by the presidents' names. The value of each key is another array, called "author", containing the title and date of each speech. The second part of the PHP reads the values of the array and displays them as a nested list: each key of the "presidents" array becomes a small header, and each value of the corresponding "author" array becomes a list item.
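The same two-part approach (data array first, HTML generation second) can be sketched in Python; the structure mirrors the description above, but the function name and sample entries are illustrative, not taken from the actual PHP source:

```python
# Part 1: the data structure. Each president's name maps to a list of
# speech titles with dates (hypothetical sample entries).
presidents = {
    "George Washington": ["State of the Union (1796)"],
    "Abraham Lincoln": ["State of the Union (1861)"],
    "Barack Obama": ["State of the Union (2013)"],
}

# Part 2: walk the structure and emit a nested HTML list, with each
# president as a small header and each speech as a list item.
def build_index(presidents):
    parts = ["<ul>"]
    for president, speeches in presidents.items():
        parts.append(f"<li><h4>{president}</h4><ul>")
        for speech in speeches:
            parts.append(f"<li>{speech}</li>")
        parts.append("</ul></li>")
    parts.append("</ul>")
    return "\n".join(parts)

print(build_index(presidents))
```

Because the index is generated from the data structure, adding a new speech to the corpus only requires adding one entry to the array.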
Stem or Lemmatize Text
In preparation for topic modeling, the speech texts were lemmatized. The goal is to reduce the different forms of each word to a base form.
Prerequisites:
- Have Python installed (Ex: Anaconda)
- Have NLTK installed (Anaconda includes NLTK package or from NLTK directly)
- Have your source text in a .txt format (I found this easier than trying to process from HTML)
Steps:
1. Open a command window (Terminal for Mac), and import nltk | import nltk from nltk import word_tokenize |
---|---|
2. Read the source text file. Replace the file path where I have "/Users/dingyi...." with your own local path. (Referencing a local file path in Python) | f=open('/Users/dingyi/Desktop/LemmaTestDoc.txt') raw=f.read() |
3. Tokenize and lowercase all text | tokens=word_tokenize(raw) words=[w.lower() for w in tokens] |
4. Use a Stemmer or Lemmatizer from NLTK | |
i. Porter | porter=nltk.PorterStemmer() [porter.stem(w) for w in words] |
ii. Lancaster | lancaster=nltk.LancasterStemmer() [lancaster.stem(w) for w in words] |
iii. WordNet | wnl=nltk.WordNetLemmatizer() [wnl.lemmatize(w) for w in words]
5. Output into a file | stemmed=[porter.stem(w) for w in words] output=open('outputTest.txt', 'w') for stem in stemmed: print(stem, file=output) output.close() |
Creating Pie Charts
The pie charts in the Analysis section were created using SVG. The easiest way found was to create a circle as the base and then draw progressively smaller slices of the pie on top of it; what remains visible of each layer represents one category. In this way, the 'path' element can be more or less copied for each category needed; the only part that needs to change is the final coordinates of the arc.
Some basic settings used:
- Centered at (0,0)
- Radius of 100
- Starting at the 3-o'clock position and sweeping the arc clockwise
Steps:
1. Create a circle centered on (0,0) | <circle r="100" cx="0" cy="0" fill="lightgray"/> |
---|---|
2. Create path elements | |
i. Use (0,0) as the starting point - "M0,0" | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
ii. Draw a line to the 3-o'clock position - "L100,0" | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
iii. Use "A100,100" for the radii | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
iv. Keep the next parameter (the x-axis rotation) at "0" | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
v. Determine how the arc should sweep: use "1,1" (large-arc and sweep flags) for arcs over 50%, "0,1" for arcs under 50% | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
vi. Give the coordinates for the end of the arc ** | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
vii. Use "Z" to close the path | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
viii. Style the path element as needed | <path d="M0,0 L100,0 A100,100 0 1,1 30.90,-95.11 Z" style="fill:#ff9900; stroke:black; stroke-width:1"/>
3. Move the pie chart to the viewable area and scale up to the desired sized "pie" | <g transform="translate(175, 175), scale(1.5, 1.5)"> |
** The coordinates for the end of the arc are the only part of the <path> element that needs to change. To calculate them, it might be helpful to read "The Maths" section of the Pie Are Square - Charting with SVG article. The basic steps are as follows:
- Determine the percentage of the pie. Note that in this method, progressively smaller pieces are laid on top of each other, so each percentage is the running total with one more category subtracted each time. For example, a pie chart with 5 categories of 20% each needs 1 <circle> and 4 <path> elements; the first path covers 80% of the pie, the next 60%, and so on.
- Determine the angle in radians: the percentage from step 1 times 2 pi. For example, at 80% the angle is 0.80 * 2 * pi ≈ 5.0265.
- Determine the x-coordinate using cos(angle in radians), then multiply by the radius. For the 80% example, it is cos(5.0265) * 100; this is why using radius = 100 keeps the arithmetic simple.
- Determine the y-coordinate using sin(angle in radians), again multiplied by the radius: sin(5.0265) * 100.
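The arithmetic above can be checked with a short Python snippet that computes the arc endpoint for each cumulative percentage and prints the corresponding <path> element (a sketch of the calculation, not the code used for the site):

```python
import math

def arc_endpoint(percentage, radius=100):
    """End coordinates of an arc that starts at the 3-o'clock position
    and covers the given fraction of the full circle."""
    angle = percentage * 2 * math.pi      # angle in radians
    x = math.cos(angle) * radius
    y = math.sin(angle) * radius
    return round(x, 2), round(y, 2)

# Five equal categories of 20% each: the layered slices cover
# cumulative fractions of 80%, 60%, 40%, and 20% of the pie.
for pct in (0.80, 0.60, 0.40, 0.20):
    x, y = arc_endpoint(pct)
    sweep = "1,1" if pct > 0.5 else "0,1"   # large-arc flag for >50%
    print(f'<path d="M0,0 L100,0 A100,100 0 {sweep} {x},{y} Z"/>')
```

For the 80% example this reproduces the (30.90, -95.11) endpoint used in the path elements above.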