Data Sources and Access

There are multiple ways to access digital data, and often more than one platform offers access to the same corpora. Hence, data may be compiled from a single source (e.g., a news aggregator) or from multiple sources (e.g., multiple individual news outlets). Data can be accessed through dedicated API (application programming interface) services (e.g., YouTube API and Twitter API) and online interfaces for data access (e.g., Google Trends and GDELT). Some sources offer free access but may provide data in a preanalyzed format (e.g., Google Trends and Google Ngram Viewer) or restrict the amount and type of data accessible (e.g., YouTube API and Twitter API). Other services may charge for access but in exchange provide wider access to data (e.g., Webhose, DataStreamer, and DiffBot). It may also be possible to obtain raw data by scraping it directly from the web page if permitted; it is crucial to consult the Terms of Service and the robots.txt file (which contains instructions on what sections of each website can be crawled) before scraping. Data collection often requires good knowledge of web architecture and programming languages, such as R or Python, which can represent an initial barrier for conservation researchers wanting to engage with culturomics.
Many books and online courses on programming languages and API architecture (e.g., RESTful APIs) can help overcome this initial barrier, but the development of data aggregation and access platforms with online user interfaces geared toward researchers might facilitate this process even further. Some insights on using APIs for data collection on social media can also be found in Lomborg & Bechmann 2014.
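The robots.txt check mentioned above can be illustrated with a minimal Python sketch based on the standard-library module urllib.robotparser. The robots.txt content, the example.org URLs, and the user-agent name are hypothetical and chosen only for illustration; in practice, the file would be downloaded from the target website, and the Terms of Service would still need to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
# A real project would fetch the file from the website being studied.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical name under which the research crawler identifies itself.
USER_AGENT = "culturomics-research-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Return a parser loaded with the rules from a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# Ask whether specific pages may be crawled by this user agent.
allowed = parser.can_fetch(USER_AGENT, "https://example.org/news/jaguar-sighting")
blocked = parser.can_fetch(USER_AGENT, "https://example.org/private/archive")

# Requested pause (in seconds) between requests, if the site declares one.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

A scraper would typically skip any URL for which can_fetch returns False and wait at least the declared crawl delay between requests.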
Collecting raw data can often lead to large and unstructured data sets, so researchers should also consider how to store data and whether to subdivide it before storage. For example, in text corpora, it may be possible to identify and filter out homonyms (e.g., instances of the word jaguar that refer to the car brand rather than the animal) that are not relevant to the research focus before analysis. Access restrictions, costs, and storage demands may force researchers to find a balance between their research budget, ease of data access and storage, and the type of data available when selecting which data to obtain. In extreme cases, the desired data may be inaccessible, access may cease, or the scale and content of the data may change during the project, requiring earlier decisions regarding the research design to be reassessed. A good example of this problem is the social-networking platform Instagram, which substantially restricted access to public data at the end of 2018, following the Cambridge Analytica scandal. Other common instances that may disrupt research include adjustments to APIs and changes to data indexing and preprocessing procedures.
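The homonym filtering described above (the jaguar example) can be sketched in Python as a simple keyword-context heuristic. This is only an illustrative assumption about how such a filter might work, not a method prescribed by the source: the context word lists, the function names, and the example sentences are all hypothetical, and real projects may need more robust approaches.

```python
import re

# Hypothetical context vocabularies; a real study would curate or learn these.
ANIMAL_CONTEXT = {"cat", "species", "habitat", "forest", "prey", "wildlife", "conservation"}
CAR_CONTEXT = {"car", "engine", "dealership", "sedan", "luxury", "vehicle"}


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_relevant(text: str, keyword: str = "jaguar") -> bool:
    """Keep a text only if it mentions the keyword and has more
    animal-related than car-related context words.
    Ties (including texts with no context words) are treated as not relevant."""
    tokens = tokenize(text)
    if keyword not in tokens:
        return False
    animal_hits = len(tokens & ANIMAL_CONTEXT)
    car_hits = len(tokens & CAR_CONTEXT)
    return animal_hits > car_hits


# Hypothetical example documents.
documents = [
    "A jaguar was photographed in its forest habitat by a wildlife camera trap.",
    "The new Jaguar sedan has a more efficient engine.",
    "Jaguar unveils luxury electric vehicle at dealership event.",
]

relevant = [doc for doc in documents if is_relevant(doc)]
print(relevant)
```

Because ambiguous texts are discarded, this heuristic favors precision over recall; whether that trade-off is acceptable depends on the research question and should be checked against a manually labeled sample.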
More detailed information on digital data sources for conservation culturomics research can be found in Correia et al. 2021.