Literature review

Aims and Scope

Data collection and cleaning procedure

We will be using Publish or Perish (PoP) to conduct all queries, which will access Google Scholar, Scopus and Web of Science. Each has its own specifications, so the keywords will be defined in a spreadsheet and then concatenated to generate usable strings. These databases impose character limits on search strings, so we will generate many small queries rather than lengthier strings that contain multiple keywords. This will generate a massive amount of data with lots of overlap, but there are ways of dealing with this (which is the purpose of this page). One benefit is that we will get a more granular sense of which concepts relate to each item in the query results, which may be very valuable during the analytical stage.
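
To illustrate the concatenation step, here is a minimal R sketch. It assumes a hypothetical keywords.csv with "keyword" and "journal" columns; the file name, column names, and the 100-character limit (Google Scholar's, per the notes below) are stand-ins for whatever we settle on:

  # Generate one short query per keyword/journal pair instead of one long boolean string
  kw <- read.csv("keywords.csv", stringsAsFactors = FALSE)

  queries <- expand.grid(keyword = unique(kw$keyword),
                         journal = unique(kw$journal),
                         stringsAsFactors = FALSE)
  queries$string <- sprintf('"%s" source:"%s"', queries$keyword, queries$journal)

  # Flag queries that exceed the strictest character limit
  queries$too_long <- nchar(queries$string) > 100
  write.csv(queries, "queries.csv", row.names = FALSE)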

Search parameters

Here are some notes pertaining to the search parameters that we will be using:

Google Scholar

  • Character limit is 100 per query
  • To specify a keyword in the publication title: source:<keyword>
  • 1000 results per query by default
  • Does not provide DOIs
  • Can specify to search titles only
  • No API, must use Publish or Perish

Scopus

  • 200 results per query by default
  • Implicit synonym-matching/autocorrect (archaeology/archeology are interchangeable, but archaeological/archeological may not be)
  • To specify keyword in publication title: SRCTITLE()
  • To specify keyword in title only: TITLE()
  • In PoP, journal keywords are entered in a separate field
  • Journal field (SRCTITLE()) does not allow wildcards or boolean terms (AND/OR/NOT/*)
  • Coverage of very recent publications is limited
  • Punctuation is ignored: heart-attack or heart attack return the same results
  • The hyphen is treated as punctuation and therefore ignored if it is not in an exact phrase
  • Wildcards must be attached to words; they cannot stand alone
  • When a hyphen is placed between a wildcard and a word, the wildcard will be dropped, e.g.:
    • title-abs-key (*-art) will be searched as title-abs-key(art)
    • abs(iwv-*) will be searched as abs(iwv)
  • To find documents that contain an exact phrase, enclose the phrase in braces: {oyster toadfish}
    • {heart-attack} and {heart attack} will return different results because the dash is included.
    • Wildcards are searched as actual characters, e.g. {health care?} returns results such as: Who pays for health care?
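
To make these rules concrete, here is a hedged sketch of a query built to the above specifications and run through the rscopus package (linked under API access in R below); the keyword, phrase, journal, and API key are placeholders:

  library(rscopus)
  set_api_key("YOUR-ELSEVIER-API-KEY")  # placeholder

  # Exact phrase in braces; wildcard attached to a word stem; no booleans/wildcards in SRCTITLE()
  q <- 'TITLE({data sharing} OR archaeolog*) AND SRCTITLE(Antiquity)'
  res <- scopus_search(query = q, count = 25)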

Web of Science

  • Can only search titles
  • Each search term in the query must be explicitly tagged with a field tag. Different fields must be connected with search operators.
  • Extraneous spaces are ignored by the search engine. For example, extra spaces around opening and closing parentheses ( ) and equal (=) signs are ignored.
  • The dollar sign ($) is useful for finding both the British and American spellings of the same word. For example, flavo$r finds flavor and flavour.
  • The search engine treats hyphens (-) and apostrophes (') in names as spaces. For example:
    • AU=O Brien returns the same number of results as AU=O'Brien.
  • More info: https://images.webofknowledge.com//WOKRS531NR4/help/WOS/hp_advanced_examples.html
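
For illustration, here is a hedged sketch of a tagged title query run through the wosr package (linked under API access in R below); the search terms and credentials are placeholders:

  library(wosr)
  sid <- auth(username = "your-username", password = "your-password")  # placeholders

  # TI= tags the field; behavio$r matches both behavior and behaviour
  res <- pull_wos('TI = (archaeolog* AND behavio$r)', sid = sid)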

CrossRef

API access in R

Scopus: https://github.com/muschellij2/rscopus

Web of Science: https://github.com/vt-arc/wosr

CrossRef: https://github.com/ropensci/rcrossref

Google Scholar: N/A

Data cleaning

Some of the keywords remain vague and will generate too many irrelevant results, so we need to devise a methodological approach to weed out irrelevant items in bulk. Costis had suggested that we remove a certain number of the items with the lowest PageRank when less than a certain threshold proportion of that subset is deemed relevant by a human reviewer. However, this remains somewhat unclear to me, and it would be very helpful if Costis could write it out in more detail.

Verify that the results are sorted by relevance, and then export them from PoP as BibTeX (.bib) files. We will then import them into Zotero as independent collections.

Zotero does not have batch import functionality, so I'm trying to figure out a workaround that would save us time and energy. Here's what I propose:

1. Use the Web API to create the collections (see the sketch after this list).

2. Go through the collections and import the contents of the .bib files via the clipboard (Control + Shift + Command + I on a Mac)

  • We can't import the actual bibliographic items using the API; it is limited to 50 write commands per write request.
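
Here is a minimal sketch of step 1, assuming the httr and jsonlite packages; the user ID, API key, and collection names are placeholders (the Zotero Web API, version 3, accepts up to 50 objects per write request, per the note above):

  library(httr)
  library(jsonlite)

  user_id <- "123456"               # placeholder Zotero userID
  api_key <- "YOUR-ZOTERO-API-KEY"  # placeholder

  # One object per collection to create, up to 50 per request
  collections <- lapply(c("scopus-queries", "wos-queries"), function(x) list(name = x))

  resp <- POST(
    url = sprintf("https://api.zotero.org/users/%s/collections", user_id),
    add_headers("Zotero-API-Key" = api_key, "Zotero-API-Version" = "3"),
    body = toJSON(collections, auto_unbox = TRUE),
    content_type_json()
  )
  status_code(resp)  # 200 on success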

We will use Zotero's merge function to combine items with matching metadata across all collections. Each merged item retains its association with every collection it appeared in, so a single item is shared across them. We will therefore be able to combine the different sets of metadata provided by each database for overlapping items. This will be crucial, since Google Scholar does not include DOIs in its results, but Scopus and Web of Science do. For the remainder, we will use the zotero-shortdoi plugin: https://github.com/bwiernik/zotero-shortdoi

After this is done, we will export a .bib file for each collection and pass them into an R script I wrote (https://gist.github.com/zackbatist/bfeaa66b64c7afe749a7f5c6f9e596c2) that uses each item's DOI to query the CrossRef database, obtains article abstracts when they are available, and exports a new .bib file with that metadata included. We will then re-import those .bib files into Zotero and begin weeding out irrelevant items based on their abstracts. If an abstract is not in the CrossRef database, we may need to look up the article and obtain it manually. This final stage of sorting and cleaning the data will generate a list of around 100-120 articles, whose full-text PDFs will be imported into MaxQDA for qualitative coding.
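
The gist linked above is the actual script; for orientation, the core CrossRef step might look like this hedged sketch using the rcrossref package, with a placeholder DOI:

  library(rcrossref)

  dois <- c("10.1234/placeholder.doi")  # placeholder; in practice, parsed from the .bib file

  # Fetch each abstract, falling back to NA when CrossRef has none
  abstracts <- sapply(dois, function(d) {
    tryCatch(cr_abstract(doi = d), error = function(e) NA_character_)
  })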