Digital corpora for conservation culturomics
Digital content for culturomics can be obtained from multiple sources, may include one or more data formats (e.g., text, images, and videos), and often varies in metadata availability (e.g., associated temporal and spatial data). These characteristics can pose challenges for researchers when selecting and compiling data for analyses. It is, therefore, useful to think of digital content for culturomics analyses in terms of collections of items, such as web pages, books, or social-network posts, that can be used to generate structured data sets for subsequent analysis. In the culturomics literature, such collections are often referred to as corpora. In the context of conservation culturomics, the broader definition of corpora (collections of knowledge or evidence) is best suited to account for collections of both textual and non-textual data types, such as images and videos. In other words, any set of texts, images, videos, songs, paintings, or other products of human culture from which a structured data set can be derived for analysis represents a potential corpus for conservation culturomics.
There are two key dimensions of digital corpora that are relevant to conservation culturomics. One refers to the content featured in the elements composing each corpus. This is the original scope of culturomics analyses and generally focuses on what is represented in the corpus and the context of such representation. The other dimension refers to engagement with the elements that compose the corpus and focuses on assessing interactions with elements of the corpus, including searches, views, comments, and shares.
More detailed information on digital corpora for conservation culturomics research can be found in Correia et al. 2021.
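The two dimensions described above can be illustrated with a minimal sketch in Python. It assumes a corpus has already been compiled into simple records; the `CorpusItem` class, its fields, the keyword-matching rule, and the example records are all hypothetical and are not taken from any particular study or platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CorpusItem:
    """One element of a corpus, e.g. a web page, post, or video record."""
    text: str    # content dimension: what the item says or depicts
    views: int   # engagement dimension: how often the item was viewed
    shares: int  # engagement dimension: how often the item was shared


def summarize(items, keywords):
    """Return, per keyword, the item count (content) and summed engagement."""
    summary = defaultdict(lambda: {"items": 0, "views": 0, "shares": 0})
    for item in items:
        lowered = item.text.lower()
        for keyword in keywords:
            # Simple case-insensitive substring match (an assumption).
            if keyword.lower() in lowered:
                row = summary[keyword]
                row["items"] += 1
                row["views"] += item.views
                row["shares"] += item.shares
    return dict(summary)


# Invented records, for illustration only.
corpus = [
    CorpusItem("Giant panda cub born at the reserve", views=1200, shares=80),
    CorpusItem("Survey finds pangolin trade persists", views=300, shares=45),
    CorpusItem("Why the giant panda became a conservation icon", views=900, shares=20),
]
table = summarize(corpus, ["giant panda", "pangolin"])
```

Here the `items` count reflects the content dimension (how often a topic is represented), whereas `views` and `shares` reflect the engagement dimension. A real analysis would likely need more careful matching than a substring test, for example to handle synonyms and scientific names.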
Web pages
Most digital content on the internet is available through web pages, so they can be considered the quintessential corpus for culturomics analyses. Web pages are text documents available on the internet, and they often also contain non-textual data, such as images, audio, and video. We focus on sets of web pages as the corpus used for analysis, but individual websites or platforms dedicated to specific content, such as Wikipedia or YouTube, can also be used as individual corpora. Search engines, such as Google, Yahoo, and Bing, specialize in crawling and indexing large volumes of web content and provide a good starting point to access data on web-page content and engagement.
The number and content of web pages can be used to quantify the cultural salience of species or places of conservation importance or to assess the overlap between societal and scientific interest in conservation topics. Web pages located on the deep web or the dark web may also represent useful content for culturomics analyses, although they are often more difficult to access. Engagement with web pages through web searches (e.g., Google Trends and Naver Trends) or web-page visitation (e.g., Google Analytics and Bing Webmaster Tools) can also be used in conservation research to explore public reactions to conservation interventions.
Book collections
Books have been used as a medium to record and transmit information for centuries. Their contents, including text and images, are increasingly being digitized and made available through the internet. This process has facilitated the computational analysis of book contents for a range of purposes and was the genesis of culturomics analyses. The Google Books project, for example, has digitized over 5 million books whose content is accessible to researchers in pre-analyzed format through Google Ngram Viewer. Other platforms, such as Project Gutenberg, may also be used for analysis.
Book contents can be used to assess historical trends of interest in environmental and conservation topics, trace the evolution of human connection and disconnection with nature, and identify popular species through time. Engagement with books has not yet been well explored for conservation purposes but holds great potential. Platforms such as Goodreads or WorldCat compile information about book reviews and book availability in libraries and may provide a basis for initial analyses in this area.
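As an illustration of how historical trends of interest might be derived from book text, the sketch below computes the yearly relative frequency of a single word in a small in-memory collection. It is a simplified stand-in for n-gram frequency analysis, not the method used by Google Ngram Viewer; the function name, the tokenization rule, and the example snippets are assumptions made for illustration.

```python
import re
from collections import Counter


def yearly_relative_frequency(documents, term):
    """Share of all word tokens per year that match `term` (case-insensitive).

    `documents` is an iterable of (year, text) pairs.
    """
    term = term.lower()
    hits, totals = Counter(), Counter()
    for year, text in documents:
        # Crude tokenization: runs of ASCII letters only (an assumption).
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Invented snippets standing in for digitized book text.
books = [
    (1900, "the wolf was hunted and the wolf was feared"),
    (1900, "a treatise on farming"),
    (2000, "the wolf returns to the valley"),
]
trend = yearly_relative_frequency(books, "wolf")
```

Dividing by the total number of tokens per year matters because the volume of digitized text typically differs between periods, so raw counts alone could reflect corpus size rather than interest in the term.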
Video-Sharing Platforms
Video repositories provide another fertile source of data for conservation culturomics. Both the number of video uploads and the time spent watching video content online have increased vastly in recent years and are likely to continue to grow. Video-sharing platforms, such as YouTube or Vimeo, allow researchers to explore aspects of digital (or digitized) video corpora for culturomics research, including video metadata and engagement based on views and likes.
The content of online videos can be used in conservation to explore illegal activities, assess how different recreation practices affect threatened species, and characterize human–wildlife conflict. Conservation research based on video corpora can also take advantage of data on video engagement, drawing from the social-networking capabilities of many video-sharing platforms to explore views, likes, and online comments.
News Media
News items are a particularly interesting source of data for culturomics because they are often produced in real time, unlike other cultural products, such as books, that lag behind real-world events. Ongoing efforts to digitize historical periodicals are making large amounts of news items, including text and image data, available for culturomics analysis. Meanwhile, news media have a growing presence online, where sound and video recordings are becoming increasingly prominent alongside text and images. News-aggregating platforms, such as GDELT and Webhose, provide compilations of recent news items from across the globe. News from specific media outlets, including The New York Times and The Guardian, may also be accessible using application programming interfaces (APIs) provided by these outlets.
Online news can be used in conservation research to understand the impact of how specific conservation actions are communicated in the media, assess how the media attention given to conservation compares with other topics, and evaluate changing perceptions of what constitutes newsworthy wildlife events over time. Engagement with online news can also be used to explore the role of news media in linking conservation research to social media and to evaluate the sentiment of responses to news reports of charismatic species.
Social Networks
Data from online social-networking platforms have been used widely in the scientific literature to explore aspects of human culture that relate to environmental and nature conservation topics. Social-networking data are usually available from dedicated social-media platforms. Twitter, Facebook, Instagram, TikTok, and Sina Weibo are among the most popular worldwide. However, platforms that specialize in other services (e.g., video and image sharing, news reporting, and blog hosting), such as Flickr, YouTube, and Blogger, can also provide social-networking features. Social-media data are usually composed of text, images, videos, or a combination of these and can be used for a wide range of potential applications in conservation that extend beyond culturomics research.
Data on both social-media content and engagement can serve many conservation purposes. These include analyzing species’ popularity and associated sentiment, monitoring wildlife trade online, studying the emergence of digital citizen-science communities, and assessing nature-based recreational preferences.
Digital Encyclopedias
Encyclopedias are reference works that aim to compile human knowledge and, as such, are prime material for exploring aspects of human culture. Although several digital encyclopedias have emerged since the World Wide Web became publicly available, Wikipedia is the most widely used. Wikipedia is a free online encyclopedia curated by volunteers and currently composed of over 50 million entries in approximately 300 languages. Each Wikipedia entry contains text data describing the topic being addressed and may also feature a combination of image, video, and audio data that are freely available to anyone (Arroyo-Machado et al. 2022). Data on public engagement with Wikipedia content, including information on page views and edits, are also openly available. Because of these characteristics, Wikipedia data have been used widely in scientific research. However, other digital encyclopedias, such as Encyclopaedia Britannica or Everipedia, also represent potential corpora for culturomics research.
Data from digital encyclopedias can be used to explore various conservation issues, including the popularity of threatened species, the effect of nature documentaries on public interest toward featured species, and seasonal dynamics of public interest in nature.
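A seasonal pattern in public interest can be sketched from page-view records as follows. The example assumes records with a `timestamp` string in `YYYYMMDDHH` form and an integer `views` field, which is how page-view items from the Wikimedia pageviews API are understood to be shaped; this layout should be checked against the API documentation before use. The function name and all values are invented for illustration, and no data are downloaded.

```python
from collections import defaultdict


def monthly_profile(items):
    """Average views per calendar month (1-12) across all years in `items`."""
    by_month = defaultdict(list)
    for item in items:
        # Characters 4-5 of an assumed YYYYMMDDHH timestamp hold the month.
        month = int(item["timestamp"][4:6])
        by_month[month].append(item["views"])
    return {m: sum(v) / len(v) for m, v in sorted(by_month.items())}


# Invented monthly records for one hypothetical species page.
records = [
    {"timestamp": "2020010100", "views": 100},
    {"timestamp": "2020070100", "views": 400},
    {"timestamp": "2021010100", "views": 140},
    {"timestamp": "2021070100", "views": 360},
]
profile = monthly_profile(records)
```

Averaging the same calendar month across several years is only one simple way to expose seasonality. It does not account for long-term growth or decline in overall page views, which a fuller analysis would probably need to remove first.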