Fall 2019 Internship: Smithsonian Institution Annual Reports

I am a History graduate student at George Mason University (GMU), where my master’s focus is American history and technology. I applied for an internship as part of GMU’s Graduate Certificate in Digital Public Humanities and was fortunate to be chosen for one at the Smithsonian Institution (SI). For the past few months I have interned with the Data Science Lab in the Smithsonian Institution’s Office of the Chief Information Officer. Under the supervision of Dr. Rebecca Dikow, I was given the source files of the SI Annual Reports, which range from the 1840s to 2009. This post covers the tools I used, how I analyzed and researched the data, what I learned about my own work habits, and the possibilities all of this opens for future work.

I decided to use the open-source tool Voyant to help visualize these files. I had used it before in my graduate studies and thought it could be useful for this project. When I first received the files, the sheer number of documents was overwhelming. My initial plan was to read every file from beginning to end. After starting with 3 or 4 of the Reports, I realized that reading even 30 of them this way would take a significant amount of time. I then decided to load the reports into Voyant, although even this required some planning. I organized how I loaded and analyzed the SI Reports by topic: I created a folder for each topic in the SI’s OneDrive and saved the corresponding Voyant links there. At first, I generated a separate Voyant result for each SI Report year. This was a useful start, but it was too granular to show the bigger picture of how words, topics, and opinions changed over time. The next step was to group the SI Reports by decade, which gave a broader view of the words used throughout each decade. Two other groupings were also useful to this project. One grouped the SI Reports by the years each Smithsonian Secretary held the position. The other took the SI Reports for the years surrounding the passage of the 19th Amendment in 1919.
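The decade grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the process I actually followed (I grouped the reports by hand in folders and Voyant), and the function names `decade_of` and `group_by_decade` are made up for this example.

```python
from collections import defaultdict


def decade_of(year: int) -> int:
    """Return the first year of the decade a report year falls in."""
    return year - year % 10


def group_by_decade(years):
    """Map each decade's starting year to a sorted list of report years."""
    groups = defaultdict(list)
    for year in sorted(years):
        groups[decade_of(year)].append(year)
    return dict(groups)
```

For example, calling `group_by_decade([1918, 1920, 1921])` puts 1918 under 1910 and both 1920 and 1921 under 1920. Each list of years could then be uploaded to Voyant as one corpus.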

Analyzing the data proved to be the hardest part of this project. I learned that it is not straightforward. At first I looked at the data unaltered, which was insightful in some instances, but I then realized that I needed to experiment with different graphs. This proved useful for seeing which words had the highest frequency within the data. For example, for the Women’s Right to Vote theme, I looked at the 1918, 1920, and 1921 SI Annual Reports. Some words had more historical significance than others. In Voyant’s Cirrus word cloud, one can see that words such as “work,” “mr,” “american,” “dr,” “new,” and “time” appeared with high frequency in that data set. This is just one example, but it demonstrates how words can take on different historical meanings depending on their time period and context.
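To show what a frequency view like Cirrus is doing underneath, here is a minimal sketch in Python using only the standard library. It is not how Voyant is implemented; the tiny `STOPWORDS` set and the `top_terms` function are assumptions for this example, and a real analysis would use a much fuller stopword list.

```python
import re
from collections import Counter

# A deliberately small, illustrative stopword list (an assumption, not Voyant's list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "by", "for", "was", "on"}


def top_terms(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent lowercase words in text, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)
```

Running `top_terms` on the text of the 1918, 1920, and 1921 reports would give a ranked list of words and counts that could be compared against the word cloud.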

The next part of my process was research. I focused mostly on the Women’s Right to Vote theme and began by identifying women who stood out within those three years (1918, 1920, and 1921). I would sometimes find a word that looked like a person’s first or last name, only to discover through an initial Google search that the person was a man. Because I was looking only for women, I initially thought this information was useless to my research. I was proved wrong. I realized that, although I had found men within these years, it could be useful to compare them against the women I did find in terms of family background, education, career, and so on. This comparison could reveal the historical nuances of the time and how they influenced women, particularly at the Smithsonian Institution. I also found women whose stories would be interesting to research further.

If this project continues, the next step is to find more primary and secondary sources on the women found in the SI Reports. There is also room for research and analysis on the historical circumstances women faced during this time and on how the SI Reports support or contradict that picture. I would also like to see what other tools are available in the field to help analyze and make sense of the data. Voyant is a great tool, but I would like to see whether other textual analysis tools produce the same or different results. One tool that I want to learn more about and explore is spaCy. I want to be able to look at the data from all of the years at once instead of just by theme or decade, and this tool might make that possible. I would also like to see the data put into a different kind of analysis tool, one that looks at more than words and their frequency. This work could be combined with other projects, such as women in STEM or the SI American Women’s History Project.

Over the course of the semester, my work has been to load the data into an open-source tool, figure out how best to group the data for analysis, and then follow that plan to analyze it. The continuation of this work would be to find more historical data on these women and see what results emerge. This internship, along with my mentor, Dr. Dikow, has taught me to look at a project from various angles and not to be afraid of what I may find. It will prove useful in some capacity. I’ve learned through this whole process that there is no single way to work on a project. I also gained new insight into how I work best on a project: breaking the work down into manageable pieces. It’s about experimenting with the data to see what you get. Every experiment and its results can tell a story. Every word, every phrase, and every person has a purpose.
