Title: | Barcelona Vocabulary Questionnaire Database and Helper Functions |
---|---|
Description: | Download, clean, and process the Barcelona Vocabulary Questionnaire (BVQ) data. BVQ is a vocabulary inventory developed for assessing the vocabulary of Catalan-Spanish bilingual infants from the Metropolitan Area of Barcelona (Spain). This package includes functions to download the data from formr servers and return the processed data in multiple formats. |
Authors: | Gonzalo Garcia-Castro [cre, aut] , Daniela S. Ávila-Varela [aut] , Nuria Sebastian-Galles [ctb] |
Maintainer: | Gonzalo Garcia-Castro <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.0 |
Built: | 2025-01-09 08:35:11 UTC |
Source: | https://github.com/gongcastro/bvq |
This function tries to log in to the formr API, either using the user-provided password (argument password) or retrieving it from the global environment (FORMR_PWD in .Renviron).
bvq_connect(google_email = NULL, password = NULL)
google_email |
E-mail used in Google Drive account. If |
password |
Character string with the password to formr ( |
Logical. TRUE if both Google and formr authentication were successful, FALSE if either of the two failed.
## Not run: bvq_connect() ## End(Not run)
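The fallback from an explicit password to the FORMR_PWD environment variable can be sketched in base R. This is only an illustration of the lookup order described above: resolve_formr_password() is a hypothetical helper, not a function exported by bvq, and the error raised when no password is found is an assumption.

```r
# Hypothetical helper (not part of bvq): resolve the formr password from an
# explicit argument or, failing that, from the FORMR_PWD environment variable.
resolve_formr_password <- function(password = NULL) {
  if (is.null(password)) {
    # Sys.getenv() returns "" when the variable is not set
    password <- Sys.getenv("FORMR_PWD", unset = "")
  }
  if (!nzchar(password)) {
    stop("No password supplied and FORMR_PWD is not set.")
  }
  password
}

# Placeholder value for illustration only
Sys.setenv(FORMR_PWD = "not-a-real-password")

resolve_formr_password()            # falls back to FORMR_PWD
resolve_formr_password("explicit")  # an explicit argument takes precedence
```

In practice the variable would be defined in .Renviron rather than set with Sys.setenv(), so that the password never appears in scripts.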
This function retrieves information about the items in a particular section in the BVQ questionnaire. This includes item names, item types, text, choices, settings, and other metadata.
bvq_items(section, version = "bvq-1.0.0")
section |
Name of the version of the questionnaire for which the items of a section will be retrieved. Check the output of |
A list of length 3, which includes:
survey: A tibble::tibble containing the items included in the questionnaire and several properties. Each row corresponds to a single item, and each column corresponds to a particular property:
type: a character string indicating the type of the item (see formr documentation).
name: a character string indicating the name of the item, as it appears in the output of bvq_responses().
label: a character string indicating the text shown to participants when filling out the questionnaire.
optional: a logical value indicating whether providing an answer to the item is mandatory for participants.
class: a character string indicating the CSS class of the item.
showif: a character string indicating R code that determines under what conditions the item is shown to participants.
value: default value of the item.
block_order: character string (a letter) indicating the order in which the block that the item belongs to appears in the survey.
item_order: integer indicating the order in which the item appears within the block it belongs to.
choices: A tibble::tibble containing the choices given to participants for some items. Each row corresponds to a choice, and each column corresponds to a particular choice property:
list_name: character string indicating the name of the choice list (which may repeat across different items).
name: character string indicating the name that a particular choice will be assigned in the code.
label: character string indicating the text that will be shown to participants for a particular choice.
settings: A tibble::tibble containing the settings for the survey. Each row corresponds to one setting, and each column indicates the setting names and values:
item: name of the setting.
value: value of the setting.
Gonzalo Garcia-Castro
## Not run: bvq_items("bvq_06_words_catalan", version = "bvq-1.0.0") ## End(Not run)
This function generates a data frame that contains participant-level information. Each row is a given participant's response and each column is a variable. The same participant will always be identified with the same id. The variable time indexes how many times a participant has been sent the questionnaire, independently of whether a response was obtained from them later.
bvq_logs( participants = bvq_participants(), responses = bvq_responses(participants), bilingual_threshold = 0.8, other_threshold = 0.1 )
participants |
Participants data frame, as generated by
|
responses |
Responses data frame, as generated by
|
bilingual_threshold |
Numeric scalar ranging from 0 to 1 indicating the minimum degree of exposure to Catalan or Spanish to consider a participant as Monolingual. |
other_threshold |
Numeric scalar ranging from 0 to 1 indicating the minimum degree of exposure to languages other than Catalan and Spanish to consider a participant as Other. |
A data frame (actually, a tibble::tibble) with participant-level information. Each row corresponds to a questionnaire response and each column represents a variable. The output includes the following variables:
child_id: a character string with five digits indicating a participant's identifier in the database from the Laboratori de Recerca en Infància at Universitat Pompeu Fabra. This value is always the same for each participant, so that different responses from the same participant share the same id.
response_id: a character string identifying a single response to the questionnaire. This value is always unique for each response to the questionnaire, even for responses from the same participant.
time: a numeric value indicating how many times a given participant has been sent the questionnaire, regardless of whether they completed it or not.
study: a character string indicating the study in which the participant was invited to fill in the questionnaire. Frequently, participants that filled in the questionnaire came to the lab to participate in a study, and were then invited to fill in the questionnaire later. This value indicates what study each participant was tested in before being sent the questionnaire.
version: a character string indicating what version of the questionnaire a given participant filled in. Different versions may contain a different subset of items, and the administration instructions might vary slightly (see formr questionnaire templates in the GitHub repository, https://github.com/gongcastro/multilex). Also, different versions were designed, implemented, and administered at different time points (e.g., before/during/after the COVID-related lockdown).
version_list: a character string indicating the specific list of items a participant was assigned to. Only applies in the case of short versions of BVQ, such as bvq-short, bvq-long, bvq-lockdown, or bvq-1.0.0, where the list of items was partitioned into several versions.
date_sent: a date value (see lubridate package) in yyyy/mm/dd format indicating the date on which the questionnaire was sent to participants.
days_from_sent: a numeric value indicating the number of days elapsed between the date participants were sent the questionnaire (as indicated by date_sent) and the date they completed it.
date_birth: a date value (see lubridate package) in yyyy/mm/dd format indicating participants' birth date.
age: a numeric value indicating the number of months elapsed since participants' birth date until they filled in the last item of their questionnaire response.
age_today: a numeric value indicating the number of months elapsed since participants' birth date until the present day, as indicated by lubridate::now().
months_from_last_response: a numeric value indicating the number of months elapsed since participants' last questionnaire response (as indicated by time_stamp) until the present day, as indicated by lubridate::now().
edu_parent1: a character string indicating the educational attainment of one of the parents/caregivers.
edu_parent2: a character string indicating the educational attainment of the other parent/caregiver, if any.
dominance: a character string indicating the language of highest exposure ("Catalan" or "Spanish"), as reported by parents. If exposure is identical for both languages, "Catalan" is assigned.
lp: a character string indicating participants' language profile, classified using parental reports of language exposure (see doe_spanish, doe_catalan, and doe_others), and the thresholds passed in the bilingual_threshold and other_threshold arguments.
doe_spanish: a numeric value ranging from 0 to 1 indicating participants' daily exposure to Spanish, as estimated by parents/caregivers. This value aggregates participants' exposure to any variant of Spanish (e.g., European and American Spanish).
doe_catalan: a numeric value ranging from 0 to 1 indicating participants' daily exposure to Catalan, as estimated by parents/caregivers. This value aggregates participants' exposure to any variant of Catalan (e.g., Catalan from Mallorca or Barcelona).
doe_others: a numeric value ranging from 0 to 1 indicating participants' daily exposure to languages other than Spanish or Catalan, as estimated by parents/caregivers, aggregating participants' exposure to all those other languages (e.g., Norwegian, Arabic, Swahili).
completed: a logical value that returns TRUE if progress is 1, and FALSE otherwise.
Gonzalo Garcia-Castro
## Not run:
responses <- bvq_responses()
logs <- bvq_logs(responses = responses)
## End(Not run)
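The way the two thresholds interact with the exposure variables can be sketched as follows. This is a minimal illustration based only on the argument descriptions above, not the package's actual implementation: classify_lp() and get_dominance() are hypothetical helpers, and the choice of strict versus non-strict comparisons at the threshold boundaries is an assumption.

```r
# Hypothetical helper (not part of bvq). Assumed rule:
# - "Other" if exposure to other languages exceeds other_threshold,
# - "Monolingual" if exposure to Catalan or Spanish reaches bilingual_threshold,
# - "Bilingual" otherwise.
classify_lp <- function(doe_catalan, doe_spanish, doe_others,
                        bilingual_threshold = 0.8, other_threshold = 0.1) {
  ifelse(
    doe_others > other_threshold, "Other",
    ifelse(pmax(doe_catalan, doe_spanish) >= bilingual_threshold,
           "Monolingual", "Bilingual")
  )
}

# Language of highest exposure; ties are resolved as "Catalan",
# as stated in the description of the dominance variable.
get_dominance <- function(doe_catalan, doe_spanish) {
  ifelse(doe_catalan >= doe_spanish, "Catalan", "Spanish")
}

# Toy data: three made-up participants
logs <- data.frame(
  doe_catalan = c(0.9, 0.5, 0.2),
  doe_spanish = c(0.1, 0.5, 0.4),
  doe_others  = c(0.0, 0.0, 0.4)
)
logs$lp <- classify_lp(logs$doe_catalan, logs$doe_spanish, logs$doe_others)
logs$dominance <- get_dominance(logs$doe_catalan, logs$doe_spanish)
logs
```

Under these assumptions the first participant is classified as "Monolingual", the second as "Bilingual" (with the tie in exposure resolved as "Catalan" dominance), and the third as "Other".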
This function generates a data frame with the estimated proportion of children that understand and/or produce some items for a selected age range and participant profiles. Estimated proportions and corresponding standard errors and confidence intervals are computed adjusting for zero- and one-inflation (see function prop_adj()).
bvq_norms( participants = bvq_participants(), responses = bvq_responses(participants), ..., te = NULL, item = NULL, age = c(0, Inf) )
participants |
Participants data frame, as generated by
|
responses |
Responses data frame, as generated by |
... |
< |
te |
Translation equivalent for which the norms should be computed.
|
item |
Character string indicating the item to compute norms for. If
left |
age |
Numeric vector of length two (min-max) indicating the age range of participants to compute norms for. |
A data frame (actually, a tibble::tibble) with the proportion of participants in the sample that understand or produce the items indicated in item, or the translation equivalents indicated in te. The output contains the following variables:
te: an integer identifying the translation equivalent (a.k.a., pair of cross-language synonyms, doublets) the item belongs to.
item: character string indicating the item identifier (e.g., spa_mesa). This value is unique for each item. Responses to the same item from different participants are linked by the same item value.
language: a character string indicating the language the item belongs to: "Catalan" if the item is in Catalan, "Spanish" if the item is in Spanish.
age: a numeric vector of length 1 or 2 indicating the age range of participants (in months) for which the estimates should be computed. If a non-integer is provided (e.g., 15.36), it is rounded downwards using floor().
type: a character string indicating the vocabulary type computed: "understands" if option 'Understands' was selected, and "produces" if option 'Understands & Says' was selected.
item_dominance: a character string that takes the value "L1" if the item belongs to participants' language of most exposure, and "L2" if the item belongs to participants' language of least exposure.
label: a character string indicating the text presented to participants in the questionnaire (replacing the item identifier).
.sum: a positive integer indicating the number of positive responses: a response counts as positive if it is 2 (Understands) or 3 (Understands & Says) for type = "understands", and only if it is 3 (Understands & Says) for type = "produces".
.n: a positive integer indicating the total number of responses (useful for computing proportions).
.prop: a numeric value ranging from 0 to 1 (both included) indicating the estimated proportion of participants that provided a positive response, adjusted following Gelman et al.'s method to account for zero- and one-inflation (see function prop_adj()).
Additionally, any variables specified in the .by argument are preserved as grouping variables.
Gonzalo Garcia-Castro
## Not run:
participants <- bvq_participants()
responses <- bvq_responses(participants)

bvq_norms(
  participants = participants,
  responses = responses,
  item = "cat_casa",
  age = c(22, 22),
  lp
)

my_items <- c("cat_gos", "cat_gat")
bvq_norms(
  participants = participants,
  responses = responses,
  item = my_items,
  te = TRUE,
  age = c(15, 16)
)
## End(Not run)
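How the response codes (1 = No, 2 = Understands, 3 = Understands & Says) translate into the .sum and .n columns can be sketched with toy data in base R. count_positive() is a hypothetical helper written for this illustration; it is not part of bvq, and the real function additionally handles age ranges, grouping variables, and the adjusted proportion.

```r
# Toy responses: three made-up participants answering two items
responses <- data.frame(
  item     = c("cat_gos", "cat_gos", "cat_gos", "cat_gat", "cat_gat", "cat_gat"),
  response = c(1L, 2L, 3L, 3L, 3L, 1L)
)

# Hypothetical helper (not part of bvq): count positive responses per item.
# For "understands", codes 2 and 3 are positive; for "produces", only code 3.
count_positive <- function(responses, type = c("understands", "produces")) {
  type <- match.arg(type)
  positive <- if (type == "understands") {
    responses$response >= 2
  } else {
    responses$response == 3
  }
  sums <- tapply(positive, responses$item, sum)
  ns <- tapply(positive, responses$item, length)
  data.frame(
    item = names(sums),
    type = type,
    .sum = as.integer(sums),
    .n   = as.integer(ns)
  )
}

count_positive(responses, "understands")
count_positive(responses, "produces")
```

Each item has three responses here, so .n is 3 in both cases; only .sum changes with the vocabulary type.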
This function generates a data frame with the information of all participants that have participated or are candidates to participate in any of the versions of BVQ.
bvq_participants(...)
... |
Unused. |
A data frame (actually, a tibble::tibble) with all participants that have participated or are candidates to participate in any of the versions of BVQ. Each row corresponds to a questionnaire response and each column represents a variable. The output includes the following variables:
child_id: a character string with five digits indicating a participant's identifier in the database from the Laboratori de Recerca en Infància at Universitat Pompeu Fabra. This value is always the same for each participant, so that different responses from the same participant share the same id.
response_id: a character string identifying a single response to the questionnaire. This value is always unique for each response to the questionnaire, even for responses from the same participant.
time: a numeric value indicating how many times a given participant has been sent the questionnaire, regardless of whether they completed it or not.
date_birth: a date value in yyyy/mm/dd format indicating participants' birth date.
age_now: a numeric value indicating the number of months elapsed since participants' birth date until the present day, as indicated by lubridate::now().
version: a character string indicating what version of the questionnaire a given participant filled in. Different versions may contain a different subset of items, and the administration instructions might vary slightly (see formr questionnaire templates in the GitHub repository). Also, different versions were designed, implemented, and administered at different time points (e.g., before/during/after the COVID-related lockdown).
version_list: a character string indicating the specific list of items a participant was assigned to. Only applies in the case of short versions of BVQ, such as bvq-short, bvq-long, bvq-lockdown, or bvq-1.0.0, where the list of items was partitioned into several versions.
date_test: a date value (see lubridate package) in yyyy/mm/dd format indicating the date on which the participant was tested in the associated study, if any.
date_sent: a date value (see lubridate package) in yyyy/mm/dd format indicating the date on which the participant was sent the questionnaire.
call: a character string indicating the status of the participant's response:
"successful": participant completed the questionnaire.
"sent": participant has been sent the email but has not completed the questionnaire yet.
"pending": participant is still to be sent the questionnaire.
"reminded": a week has elapsed since the participant was sent the questionnaire, and they have already been reminded of it.
"stop": participant has not completed the questionnaire two weeks after being sent it.
Gonzalo Garcia-Castro
## Not run: bvq_participants() ## End(Not run)
This function generates a data frame with participants' responses to each item, along with some session-specific metadata. It takes participants (the output of bvq_participants()) and runs (a character vector that can take zero, one, or multiple of the following values: "formr2", "formr-short", "formr-lockdown") as arguments.
bvq_responses(participants = bvq_participants())
participants |
Participants data frame, as generated by
|
A data frame (actually, a tibble::tibble) containing participants' responses to each item, along with some session-specific metadata. The output includes the following variables:
child_id: a character string with five digits indicating a participant's identifier in the database from the Laboratori de Recerca en Infància at Universitat Pompeu Fabra. This value is always the same for each participant, so that different responses from the same participant share the same child_id.
response_id: a character string identifying a single response to the questionnaire. This value is always unique for each response to the questionnaire, even for responses from the same participant.
time: a numeric value indicating how many times a given participant has been sent the questionnaire, regardless of whether they completed it or not.
version: a character string indicating what version of the questionnaire a given participant filled in. Different versions may contain a different subset of items, and the administration instructions might vary slightly (see formr questionnaire templates in the GitHub repository). Also, different versions were designed, implemented, and administrated at different time points (e.g., before/during/after the COVID-related lockdown).
version_list: a character string indicating the specific list of items a participant was assigned to. Only applies in the case of short versions of BVQ, such as bvq-short, bvq-long, bvq-lockdown, or bvq-1.0.0, where the list of items was partitioned into several versions.
item: character string indicating the item identifier (e.g., spa_mesa). This value is unique for each item. Responses to the same item from different participants are linked by the same item value.
response: integer indicating the participant's response to a given item: 1 if "No" (the participant does not understand or produce the word), 2 if "Understands" (the participant understands the word), or 3 if "Understands and Says" (the participant understands and produces the word).
date_birth: lubridate::Date indicating participants' birth date.
date_started: lubridate::Date indicating when participants logged in to the questionnaire for the first time.
date_finished: lubridate::Date indicating when participants logged in to the questionnaire for the last time.
sex: a character string indicating participants' biological sex, as reported by the parents.
doe_spanish: a numeric value ranging from 0 to 1 indicating participants' daily exposure to Spanish, as estimated by parents/caregivers. This value aggregates participants' exposure to any variant of Spanish (e.g., European and American Spanish).
doe_catalan: a numeric value ranging from 0 to 1 indicating participants' daily exposure to Catalan, as estimated by parents/caregivers. This value aggregates participants' exposure to any variant of Catalan (e.g., Catalan from Majorca or Barcelona).
doe_others: a numeric value ranging from 0 to 1 indicating participants' daily exposure to languages other than Spanish or Catalan, as estimated by parents/caretakers, aggregating participants' exposure to all those other languages (e.g., Norwegian, Arabic, Swahili).
edu_parent1: a character string indicating the educational attainment of one of the parents/caretakers.
edu_parent2: a character string indicating the educational attainment of the other parent/caretaker, if any.
Gonzalo Garcia-Castro
## Not run: bvq_responses() ## End(Not run)
This function generates a data frame with the vocabulary of each participant (keeping longitudinal data from the same participant in different rows). Comprehensive and productive vocabulary sizes are computed as raw counts (*_count) and as proportions (*_prop).
bvq_vocabulary( participants = bvq_participants(), responses = bvq_responses(participants), ..., .scale = "prop" )
participants |
Participants data frame, as generated by
|
responses |
Responses data frame, as generated by
|
... |
< |
.scale |
A character vector that takes the value |
A dataset (actually, a tibble::tibble) with each participant's comprehensive and/or productive vocabulary size in each language. This data frame contains the following variables:
child_id: a character string with five digits indicating a participant's identifier in the database from the Laboratori de Recerca en Infància at Universitat Pompeu Fabra. This value is always the same for each participant, so that different responses from the same participant share the same child_id.
response_id: a character string identifying a single response to the questionnaire. This value is always unique for each response to the questionnaire, even for responses from the same participant.
age: a numeric value indicating the number of months elapsed since participants' birth date until they filled in the last item of their questionnaire response.
type: a character string indicating the vocabulary type computed: "understands" if option "Understands" was selected, and "produces" if option "Understands & Says" was selected.
total_count: integer indicating the number of items selected as "Understands" or "Understands and Says" in both languages.
l1_count: positive integer indicating the number of items selected as "Understands" or "Understands and Says" in the dominant language (L1).
l2_count: positive integer indicating the number of items selected as "Understands" or "Understands and Says" in the non-dominant language (L2).
concept_count: positive integer indicating the number of translation equivalents (a.k.a. cross-language synonyms or doublets) in which at least one of the items was selected as "Understands" or "Understands and Says". This is a measure of the number of lexicalised concepts.
te_count: positive integer indicating the number of translation equivalents (out of the total number of items the participant answered to) in which both items were selected as "Understands" or "Understands and Says". This is a measure of the number of lexicalised concepts.
total_prop: numeric value ranging from 0 to 1 (both included) indicating the proportion of items selected as "Understands" or "Understands and Says" in both languages.
l1_prop: numeric value ranging from 0 to 1 (both included) indicating the proportion of items selected as "Understands" or "Understands and Says" in the dominant language (L1).
l2_prop: numeric value ranging from 0 to 1 (both included) indicating the proportion of items selected as "Understands" or "Understands and Says" in the non-dominant language (L2).
concept_prop: numeric value ranging from 0 to 1 (both included) indicating the proportion of translation equivalents (a.k.a. cross-language synonyms or doublets) in which at least one of the items was selected as "Understands" or "Understands and Says". This is a measure of the number of lexicalised concepts.
te_prop: numeric value ranging from 0 to 1 (both included) indicating the proportion of translation equivalents (a.k.a. cross-language synonyms or doublets) in which both items were selected as "Understands" or "Understands and Says". This is a measure of the number of lexicalised concepts.
The specific subset of columns returned by bvq_vocabulary() depends on the elements of ... and .scale.
contents: list containing the items marked as acquired.
Gonzalo Garcia-Castro
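The difference between total_count, concept_count, and te_count can be sketched for a single made-up participant in base R. This is an illustration of the definitions above, not the package's implementation; it assumes that each translation equivalent (te) contains exactly one Catalan and one Spanish item, and it uses the number of answered items as the denominator of the proportion.

```r
# Toy responses from one made-up participant: three translation equivalents,
# each with one Catalan and one Spanish item (an assumption of this sketch)
responses <- data.frame(
  te       = c(1L, 1L, 2L, 2L, 3L, 3L),
  language = rep(c("Catalan", "Spanish"), times = 3),
  response = c(3L, 2L, 2L, 1L, 1L, 1L)
)

# Comprehension: codes 2 (Understands) and 3 (Understands & Says) count as known
known <- responses$response >= 2

# Number of known items within each translation equivalent
n_known_per_te <- tapply(known, responses$te, sum)

total_count   <- sum(known)                # known items across both languages
concept_count <- sum(n_known_per_te >= 1)  # TEs with at least one known item
te_count      <- sum(n_known_per_te == 2)  # TEs with both items known
total_prop    <- total_count / nrow(responses)

c(total_count = total_count, concept_count = concept_count,
  te_count = te_count, total_prop = total_prop)
```

Here the participant knows three items in total, covering two concepts, of which only one is known in both languages.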
Summarise language profile
get_doe(...)
... |
Columns with the degree of exposures to be summed up
for (all others will be considered as |
A numeric vector with the row-wise sums of the columns specified in ....
Gonzalo Garcia-Castro
library(dplyr)

x <- data.frame(
  doe_cat_1 = seq(0, 1, 0.1),
  doe_cat_2 = c(0, rep(c(0.1, 0), each = 5)),
  doe_spa_1 = c(0, rep(c(0.1, 0), each = 5)),
  doe_spa_2 = c(1, 0.7, 0.6, 0.5, 0.3, 0.1, 0.4, 0.3, 0.2, 0.1, 0)
)

y <- mutate(x,
  doe_other = 1 - get_doe(matches("cat|spa")),
  doe_cat = get_doe(doe_cat_1, doe_cat_2),
  doe_spa = get_doe(matches("spa"))
)

(y)
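For readers who want to see the row-wise sum without the package, a rough base R equivalent is shown below. It is a sketch of the same idea using rowSums() and name matching with grepl(); it does not reproduce the tidy-select interface of get_doe(), and the toy data frame is made up.

```r
# Toy data: exposure split across several columns per language
x <- data.frame(
  doe_cat_1 = c(0.5, 0.2),
  doe_cat_2 = c(0.1, 0.0),
  doe_spa_1 = c(0.4, 0.8)
)

# Row-wise sums over the columns whose names match a pattern;
# drop = FALSE keeps a data frame even when a single column matches
doe_cat <- rowSums(x[, grepl("cat", names(x)), drop = FALSE])
doe_spa <- rowSums(x[, grepl("spa", names(x)), drop = FALSE])

# Exposure to other languages is whatever remains up to 1
doe_other <- 1 - rowSums(x[, grepl("cat|spa", names(x)), drop = FALSE])

data.frame(doe_cat, doe_spa, doe_other)
```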
Deal with repeated measures
get_longitudinal(x, longitudinal = "all")
x |
A data frame containing a column for participants (each participant
gets a unique ID), and a column for times (a numeric value indicating how
many times each participant appears in the data frame counting this one).
One participant may appear several times in the data frame, with each time
with a unique value of |
longitudinal |
A character string indicating what subset of the participants should be returned:
|
A subset of the data frame x with only the selected cases, according to longitudinal.
Gonzalo Garcia-Castro
child_id <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 9, 10, 10)
sums <- rle(sort(child_id))[["lengths"]]
dat <- data.frame(child_id, time = unlist(sapply(sums, function(x) seq(1, x))))
(dat)
get_longitudinal(dat, "first")
get_longitudinal(dat, "only")
Launch bvq Shiny App in a browser
launch_app()
The BVQ Shiny App provides a visual interface to the bvq R package to explore the database. Its GitHub repository contains the data, documentation, and R scripts needed to run the BVQ Shiny app.
https://github.com/gongcastro/bvq-app
A dataset containing candidate words to be included in the questionnaires, along with some lexical properties. Transcriptions were either (a) generated manually or (b) retrieved from Wiktionary. All transcriptions have been manually double-checked and fixed if necessary.
pool
A data frame with 1601 rows and 20 variables:
item: item label, as indicated in the formr survey spreadsheets, items are unique within and across questionnaires.
language: language the item belongs to.
te: index associated to translation equivalents across languages.
label: item label, as presented to participants in the front-end of the questionnaire, some labels are not unique within or across questionnaires.
xsampa: phonological transcription in X-SAMPA format.
n_lemmas: an integer indicating the number of different lemmas shown in the item label to participants. For instance, the Spanish item "spa_hierba" was shown in the questionnaire as "hierba / césped". Lemmas with similar roots were considered as one, such as the Spanish item "spa_tonto", presented as "tonto / tonta" in the questionnaire.
is_multiword: a logical indicating whether the item included a multi-word phrase as presented in the questionnaire. For instance, the Spanish item "spa_cepillodientes" was shown as "cepillo de dientes" in the questionnaire, which includes three words.
subtlex_lemma: word label, as included in the corresponding version of SUBTLEX.
wordbank_lemma: word label, as indexed in Wordbank.
childes_lemma: word label, as it appears in the CHILDES English corpora (based on wordbank_lemma).
semantic_category: semantic/functional category the item belongs to.
class: Functional category (verb, nouns, adjective, etc.).
version: what short version of the questionnaire does this item appear on?
include: should this item be included in analyses?
Proportion, adjusted for zero- and one-inflation
prop_adj(x, n)
x |
Number of successes |
n |
Number of tries |
It is very common that a large proportion of the participants know or do not know some word. Vocabulary sizes and word prevalence norms in this package are calculated using an estimate that adjusts for zero- and one-inflation so that, at the population level, such estimates are more likely to be accurate.
A numeric scalar.
prop_adj(4, 60)
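One common adjustment of this kind adds two successes and two failures to the observed counts before computing the proportion, which keeps the estimate away from exactly 0 or 1. The sketch below assumes that this is the adjustment meant here; the documentation above does not state the formula, so prop_adj_sketch() is a hypothetical stand-in and the package's prop_adj() may differ.

```r
# Hypothetical stand-in for prop_adj(), assuming the
# "add two successes and two failures" adjustment
prop_adj_sketch <- function(x, n) {
  (x + 2) / (n + 4)
}

prop_adj_sketch(4, 60)   # (4 + 2) / (60 + 4) = 0.09375, versus a raw 4/60
prop_adj_sketch(0, 60)   # 2/64: no longer exactly 0
prop_adj_sketch(60, 60)  # 62/64: no longer exactly 1
```

The adjustment matters most for items that nearly all or nearly no participants know, where the raw proportion would sit at the boundary.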
This function prints some informative messages about a participant's progress through the BVQ, and returns a vector of logical values indicating the surveys that the participant has completed.
track_progress(response_id, participants = NULL, ...)
participants |
Participants data frame, as generated by
|
... |
Arguments passed to download_surveys. |
response_id |
a character string identifying a single response to the questionnaire. This value is always unique for each response to the questionnaire, even for responses from the same participant. |
A logical vector indicating the surveys that the participant has completed.
Gonzalo Garcia-Castro
## Not run: track_progress("1911", participants, verbose = FALSE) ## End(Not run)