Vocabulary sizes

The bvq_vocabulary() function extracts vocabulary sizes for individual responses to any of the questionnaires. It takes the outputs of the bvq_participants() and bvq_responses() functions as arguments, and returns several measures of vocabulary size based on these datasets.

To compute vocabulary sizes, we first need the output of bvq_responses() (if it is not provided as an argument, bvq_responses() is run under the hood):

library(bvq)

# vocabularies will be computed from these datasets
participants <- bvq_participants()
responses <- bvq_responses(participants = participants)

bvq_vocabulary(participants, responses)
#> # A tibble: 34,465 × 16
#>    child_id response_id  time version      version_list date_birth date_started
#>    <chr>    <chr>       <dbl> <chr>        <chr>        <date>     <date>      
#>  1 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  2 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  3 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  4 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  5 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  6 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  7 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  8 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#>  9 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#> 10 58298    BL1879          2 bvq-lockdown C            2020-04-27 2022-10-08  
#> # ℹ 34,455 more rows
#> # ℹ 9 more variables: date_finished <date>, item <chr>, response <int>,
#> #   sex <chr>, doe_catalan <dbl>, doe_spanish <dbl>, doe_others <dbl>,
#> #   edu_parent1 <chr>, edu_parent2 <chr>

The bvq_vocabulary() function computes five measures of vocabulary size:

  • Total (total_*): total number of items the child was reported to know, summing both languages together.
  • L1 (l1_*): number of words the child was reported to know in their dominant language (e.g., Catalan words for a child whose language of most exposure is Catalan).
  • L2 (l2_*): number of words the child was reported to know in their non-dominant language (e.g., Spanish words for a child whose language of most exposure is Catalan).
  • Conceptual (concept_*): number of concepts the child knows at least one word for, regardless of the language the word belongs to.
  • TE (te_*): number of translation equivalents the child knows, i.e., the number of concepts for which the child knows one word in each language.
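
To make these definitions concrete, the following minimal sketch computes the five measures by hand for a single hypothetical child. The toy `items` data frame and its column names (`concept`, `language`, `known`) are assumptions made for illustration only. They are not the structure that bvq uses internally, and Catalan is assumed to be the child's dominant language.

```r
# Hypothetical item-level responses for one child (not real bvq data).
# Each concept has one Catalan word and one Spanish word.
items <- data.frame(
  concept  = c("dog", "dog", "cat", "cat", "table", "table", "water", "water"),
  language = rep(c("Catalan", "Spanish"), times = 4),
  known    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)
dominant <- "Catalan" # assumed language of most exposure

known_items <- items[items$known, ]

# Number of known words per concept (0, 1, or 2)
words_per_concept <- tapply(items$known, items$concept, sum)

vocab <- c(
  total   = nrow(known_items),                     # all known words
  l1      = sum(known_items$language == dominant), # dominant language
  l2      = sum(known_items$language != dominant), # non-dominant language
  concept = sum(words_per_concept >= 1),           # at least one word
  te      = sum(words_per_concept == 2)            # one word in each language
)
vocab
```

Note that the conceptual score can never exceed the total score, and the TE score can never exceed the smaller of the L1 and L2 scores.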

Vocabulary sizes can be computed on two different scales:

  • Proportion (*_prop): proportion of items the child was reported to know, out of the total number of items that were included in the questionnaire and that caregivers answered.
  • Counts (*_count): total number of items the child was reported to know.
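
The relationship between the two scales can be sketched with a toy vector of responses. This is only an illustration of the description above, assuming that unanswered items are coded as NA and excluded from the denominator; the `known` vector is hypothetical and the internal implementation in bvq may differ.

```r
# Hypothetical responses to six items; NA marks an item left unanswered
known <- c(TRUE, TRUE, FALSE, NA, TRUE, FALSE)

answered <- !is.na(known)

vocab_count <- sum(known, na.rm = TRUE)    # items reported as known
vocab_prop  <- vocab_count / sum(answered) # share of the answered items

vocab_count
vocab_prop
```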

The scale returned by bvq_vocabulary() can be modified with the .scale argument, which takes "prop" for proportions (default) and "count" for counts. Both can be computed using .scale = c("prop", "count"). For instance, we can get vocabulary sizes as proportions by running:

bvq_vocabulary(participants, responses, .scale = "prop")
#> # A tibble: 76 × 9
#>    child_id response_id type     total_prop l1_prop l2_prop concept_prop te_prop
#>    <chr>    <chr>       <chr>         <dbl>   <dbl>   <dbl>        <dbl>   <dbl>
#>  1 58298    BL1879      underst…    0.518    0.924   0.141        0.881   0.114 
#>  2 58298    BL1879      produces    0.430    0.784   0.103        0.743   0.0838
#>  3 58361    BL1863      underst…    0.703    0.805   0.601        0.826   0.574 
#>  4 58361    BL1863      produces    0.476    0.573   0.379        0.617   0.331 
#>  5 58298    BL1848      underst…    0.370    0.702   0.0623       0.662   0.0486
#>  6 58298    BL1848      produces    0.00985  0.0205  0            0.0189  0     
#>  7 58068    BL1833      underst…    0.737    0.850   0.633        0.846   0.563 
#>  8 58068    BL1833      produces    0.328    0.528   0.146        0.526   0.102 
#>  9 57177    BL1748      underst…    0.837    0.814   0.857        0.887   0.714 
#> 10 57177    BL1748      produces    0.568    0.678   0.466        0.757   0.329 
#> # ℹ 66 more rows
#> # ℹ 1 more variable: contents <list>

To get vocabulary sizes as counts, we can run this instead:

bvq_vocabulary(participants, responses, .scale = "count")
#> # A tibble: 76 × 9
#>    child_id response_id type        total_count l1_count l2_count concept_count
#>    <chr>    <chr>       <chr>             <int>    <int>    <int>         <int>
#>  1 58298    BL1879      understands         368      316       52           326
#>  2 58298    BL1879      produces            306      268       38           275
#>  3 58361    BL1863      understands         490      281      209           289
#>  4 58361    BL1863      produces            332      200      132           216
#>  5 58298    BL1848      understands         263      240       23           245
#>  6 58298    BL1848      produces              7        7        0             7
#>  7 58068    BL1833      understands         523      288      235           314
#>  8 58068    BL1833      produces            233      179       54           195
#>  9 57177    BL1748      understands         594      276      318           329
#> 10 57177    BL1748      produces            403      230      173           281
#> # ℹ 66 more rows
#> # ℹ 2 more variables: te_count <int>, contents <list>

Finally, two types of vocabulary sizes are computed:

  • Comprehension (understands): number of items the child understands.
  • Production (produces): number of items the child says.

These two measures are returned in the long format under the type column.
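
Because both types share the same columns, working with one of them means filtering on type, and comparing them side by side means reshaping to wide format. The sketch below uses a small hypothetical data frame that mimics the columns shown above (the values are made up) and base R's reshape(); any other reshaping tool would work as well.

```r
# Toy output in long format, mimicking the columns of bvq_vocabulary()
vocab <- data.frame(
  child_id    = c("A", "A", "B", "B"),
  response_id = c("R1", "R1", "R2", "R2"),
  type        = c("understands", "produces", "understands", "produces"),
  total_prop  = c(0.50, 0.25, 0.80, 0.40)
)

# Keep only the production scores
production <- vocab[vocab$type == "produces", ]

# One row per response, one score column per type
vocab_wide <- reshape(
  vocab,
  idvar     = c("child_id", "response_id"),
  timevar   = "type",
  direction = "wide"
)
vocab_wide
```

In the wide data frame, the score columns are named after the original column and the type, separated by a dot (total_prop.understands and total_prop.produces).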

Vocabulary contents

In addition to the vocabulary size scores, bvq_vocabulary() also returns the column contents. This column is a list containing the items marked as acquired for comprehension or production.
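
As a rough illustration of how a list column of this kind can be inspected, the sketch below builds a toy data frame by hand. The item identifiers are hypothetical placeholders, and the real contents column may hold different labels.

```r
# Toy data frame with a list column (hypothetical item identifiers)
vocab <- data.frame(
  child_id = c("A", "A"),
  type     = c("understands", "produces")
)
vocab$contents <- list(
  c("cat_gos", "spa_perro", "cat_taula"), # items understood
  c("cat_gos")                            # items produced
)

# Items listed in the first row
vocab$contents[[1]]

# Number of items listed in each row
lengths(vocab$contents)

# Was a given item marked as understood?
"spa_perro" %in% vocab$contents[[which(vocab$type == "understands")]]
```

Each element of the column is an ordinary character vector, so the usual vector tools (`%in%`, intersect(), setdiff()) apply once an element is extracted with `[[`.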

Conditional vocabulary size: the ... extra arguments

We can also compute vocabulary sizes conditional on variables at the item or participant level, such as semantic/functional category (semantic_category) or language profile (lp), using the ... argument. The available variables are those included in the data frame returned by bvq_logs() or in the pool dataset. For each participant, vocabulary sizes are computed for each level, or combination of levels, of the variables passed to .... You can also use this argument to preserve participant-level information in the output data frame. For instance, we can keep information about the language profile (lp) of the participant:

bvq_vocabulary(participants, responses, lp)

We can also preserve information about the items, like the semantic/functional category of the words (semantic_category). In this case, the vocabulary sizes will be computed for each level of the semantic_category variable:

bvq_vocabulary(participants, responses, semantic_category)
#> # A tibble: 1,970 × 10
#>    child_id response_id type        semantic_category total_prop l1_prop l2_prop
#>    <chr>    <chr>       <chr>       <chr>                  <dbl>   <dbl>   <dbl>
#>  1 58298    BL1879      understands Adventures             0.75    1      0.5   
#>  2 58298    BL1879      produces    Adventures             0.7     0.9    0.5   
#>  3 58298    BL1879      understands Animals                0.516   0.968  0.0645
#>  4 58298    BL1879      produces    Animals                0.452   0.839  0.0645
#>  5 58298    BL1879      understands Parts of animals       0.409   0.818  0     
#>  6 58298    BL1879      produces    Parts of animals       0.136   0.273  0     
#>  7 58298    BL1879      understands Parts of things        0.375   0.75   0     
#>  8 58298    BL1879      produces    Parts of things        0.125   0.25   0     
#>  9 58298    BL1879      understands Question words         0.5     1      0     
#> 10 58298    BL1879      produces    Question words         0.5     1      0     
#> # ℹ 1,960 more rows
#> # ℹ 3 more variables: concept_prop <dbl>, te_prop <dbl>, contents <list>
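
The grouping behind this kind of output can be sketched on toy data: the score is simply computed within each level of the conditioning variable. The `items` data frame below is hypothetical, and aggregate() is used here only as a stand-in for whatever grouping bvq performs internally.

```r
# Hypothetical item-level responses for one child
items <- data.frame(
  semantic_category = c("Animals", "Animals", "Animals", "Animals",
                        "Question words", "Question words"),
  known             = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Proportion of known items within each semantic category
by_category <- aggregate(known ~ semantic_category, data = items, FUN = mean)
names(by_category)[2] <- "total_prop"
by_category
```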

Finally, we can preserve more than one variable, including combinations of participant-level and item-level variables, such as age (age), language profile (lp) and semantic/functional category (semantic_category):

bvq_vocabulary(participants, responses, age, lp, semantic_category)
#> # A tibble: 1,970 × 12
#>    child_id response_id type      age lp    semantic_category total_prop l1_prop
#>    <chr>    <chr>       <chr>   <dbl> <chr> <chr>                  <dbl>   <dbl>
#>  1 58298    BL1879      unders…  29.4 Mono… Adventures             0.75    1    
#>  2 58298    BL1879      produc…  29.4 Mono… Adventures             0.7     0.9  
#>  3 58298    BL1879      unders…  29.4 Mono… Animals                0.516   0.968
#>  4 58298    BL1879      produc…  29.4 Mono… Animals                0.452   0.839
#>  5 58298    BL1879      unders…  29.4 Mono… Parts of animals       0.409   0.818
#>  6 58298    BL1879      produc…  29.4 Mono… Parts of animals       0.136   0.273
#>  7 58298    BL1879      unders…  29.4 Mono… Parts of things        0.375   0.75 
#>  8 58298    BL1879      produc…  29.4 Mono… Parts of things        0.125   0.25 
#>  9 58298    BL1879      unders…  29.4 Mono… Question words         0.5     1    
#> 10 58298    BL1879      produc…  29.4 Mono… Question words         0.5     1    
#> # ℹ 1,960 more rows
#> # ℹ 4 more variables: l2_prop <dbl>, concept_prop <dbl>, te_prop <dbl>,
#> #   contents <list>