Skip to contents

Introduction

#remotes::install_github("https://github.com/sachseka/eatPrep")
library(eatPrep)
#> Loading required package: readxl

The goal of eatPrep, short for Educational Assesment Tools: Preparation, is to automate the preparation of item response test data; especially but not exclusively for typical use cases at the Institute for Educational Quality Improvement (IQB). Therefore it works together with IQB-specific applications such as the ItemDB or ZKDaemon.

The main purpose of eatPrep is to read, score, aggregate and merge data from a regular SPSS (.sav) data file (via the function readSpss()) and metadata from databases (readDaemonXlsx()) and Excel spreadsheets (readMerkmalXlsx()). Additional functions include checking the data (checkData()), meta data (checkInputList()), and the design (checkDesign()), recoding of (missing) values (recodeData(), mnrCoding(), aggregateData(), scoreData(),collapseMissings()), working with rater data (meanAgree(), meanKappa()), evaluating item categories (catPbc(), evalPbc()) and exporting prepared or raw item data sets to different formats (writeSpss(), prep2GADS()). Most of this can be done with a single function call to automateDataPreparation().

In a prototypical scenario, we are constructing a test at the IQB, e.g. for educational purposes. For our pilot study, we have created several overlapping booklets and a variety of items with different item formats (constructed response, multiple choice).
Thereby, the data consists of several layers:

booklets containing blocks
blocks containing units (= items)
units containing subunits/subitems
subunits having values and
values including missings and recode values

Now we would like to prepare the data collected in this way for scaling. In doing so, we would like to know if the booklets are speeded (number of mnr). Additionally, we intend to use different missing treatments for the actual scaling, so we preserve missing values during data preparation and recode them right before scaling the data. eatPrep can help us with all of that.

As a reminder, here is an overview for different value types the IQB uses:

missing types
Code Label Abbr Explanation
-98 mir missing invalid response (1) Item was edited, and (2a) empty answer or (2b) invalid (joke) answer.
-99 mbo missing by omission Item wasn’t edited but seen, or wasn’t seen, but there are seen or edited subsequent Items.
-96 mnr missing not reached (1) Item wasn’t seen, and (2) all subsequent Items weren’t seen, either.
-97 mci missing coding impossible (1) Item should/could have been edited, and (2) answer can’t be analysed due to technical problems.
-94 mbd missing by design no answer, because Item wasn’t shown to the testperson by design.

Input

Information/metadata about items is stored in many different ways, e.g., databases, Excel sheets, sometimes even in item names. eatPrep requires this metadata to have a specific form, namely a series of relational tables. For eatPrep it is most convenient to store these tables in a list. When downloading the eatPrep package, you can find files with some example tables in the data folder.

Sheets have specific names, also the columns in those sheets have specific names.

inputMinimal contains the bare minimum for all functions in eatPrep to work, inputList contains additional metadata about the items (such as information about the task’s content).

# Example: inputMinimal
inputMinimal <- list(units = inputList$units[ -nrow(inputList$units), c("unit", "unitAggregateRule")],
                     subunits = inputList$subunits[, c("unit", "subunit","subunitRecoded")],
                     values = inputList$values[ , c("subunit", "value", "valueRecode", "valueType")],
                     unitRecodings = inputList$unitRecodings[ , c("unit", "value", "valueRecode", "valueType")],
                     blocks = inputList$blocks,
                     booklets = inputList$booklets,
                     rotation = inputList$rotation)

Input Tables

Let’s look at the tables from bottom to top, starting with items and their values. In the values sheet, we define values and recode values for each subitem.

Here we see 3 items (called units) from inputList bzw. inputMinimal. One mc (item 1), one short response (item 5), one item with 3 subitems short response (item 12)

items <- c("I01", "I05", "I12")
subitems <- c("I01", "I05", "I12a", "I12b", "I12c")

kable(inputMinimal$units[which(inputMinimal$units$unit %in% items), ])
unit unitAggregateRule
1 I01
5 I05
12 I12 SUM

The values sheet contains all possible values for each subitem and scoring information for each value (e.g., whether it is true, false, or missing). There are several types of missing values.

kable(inputMinimal$values[which(inputMinimal$values$subunit %in% subitems), ])
subunit value valueRecode valueType
1 I01 1 0 vc
2 I01 2 0 vc
3 I01 3 1 vc
4 I01 6 mnr mnr
5 I01 7 mbd mbd
6 I01 8 mir mir
7 I01 9 mbi mbi
30 I05 0 0 vc
31 I05 1 1 vc
32 I05 6 mnr mnr
33 I05 7 mbd mbd
34 I05 8 mir mir
35 I05 9 mbi mbi
80 I12a 0 0 vc
81 I12a 1 1 vc
82 I12a 6 mnr mnr
83 I12a 7 mbd mbd
84 I12a 8 mir mir
85 I12a 9 mbi mbi
86 I12b 0 0 vc
87 I12b 1 1 vc
88 I12b 6 mnr mnr
89 I12b 7 mbd mbd
90 I12b 8 mir mir
91 I12b 9 mbi mbi
92 I12c 1 0 vc
93 I12c 2 0 vc
94 I12c 3 0 vc
95 I12c 4 1 vc
96 I12c 6 mnr mnr
97 I12c 7 mbd mbd
98 I12c 8 mir mir
99 I12c 9 mbi mbi

The subunits sheet contains information about the relationship between subitems and items. This information is used to aggregate subitems to items. If recoded subitems should be named differently than the unrecoded variables in the original data, this can be specified via subunitRecoded. The only type of aggregation currently supported by eatPrep is to count the number of correct subitems per item.

kable(inputMinimal$subunits[which(inputMinimal$subunits$subunit %in% subitems), ])
unit subunit subunitRecoded
1 I01 I01 I01R
5 I05 I05 I05R
12 I12 I12a I12aR
13 I12 I12b I12bR
14 I12 I12c I12cR

unitrecodings contains the same information as the values sheet, but for aggregated items. Here, the value on an item is the aggregated number of correct subitems and the recode value gives a threshold of how many subitems need to be solved to obtain a credit for the item. In the example, I12 has 3 subitems, and one receives a credit if all subitems are solved correctly.

kable(inputMinimal$unitRecodings)
unit value valueRecode valueType
I12 0 0 vc
I12 1 0 vc
I12 2 0 vc
I12 3 1 vc
I12 mbd mbd mbd
I12 mir mir mir
I12 mbi mbi mbi

The minimal use case is that each subitem corresponds to an item. In this case, the value sheet contains all information needed for recoding the items (apart from possibly renaming the recoded items).

Input Data

We have several overlapping booklets with several blocks in each booklet. Moreover, there is a unique identifier for each person and some additional information about each student like their gender or their socioeconomic status. There is one data set per booklet. In order to prepare the data, we need to construct one large data set.

In order to do that we first need to read the data into R, check the data for invalid or or incorrect codes and then merge the data into one data set. In the following the different functions to do that are described. inputDat gives us a first idea on how the data is supposed to look like.

# looking at the data
str(inputDat)
#> List of 3
#>  $ booklet1:'data.frame':    100 obs. of  25 variables:
#>   ..$ ID   : chr [1:100] "person100" "person101" "person102" "person103" ...
#>   ..$ hisei: num [1:100] 49 NA 57 32 59 56 55 47 NA 50 ...
#>   ..$ I01  : chr [1:100] "1" "9" "3" "2" ...
#>   ..$ I02  : chr [1:100] "4" "4" "4" "4" ...
#>   ..$ I03  : chr [1:100] "1" "2" "2" "3" ...
#>   ..$ I04  : chr [1:100] "2" "2" "3" "1" ...
#>   ..$ I05  : chr [1:100] "0" "0" "0" "1" ...
#>   ..$ I06  : chr [1:100] "0" "9" "0" "1" ...
#>   ..$ I07  : chr [1:100] "2" "2" "2" "3" ...
#>   ..$ I08  : chr [1:100] "0" "1" "0" "0" ...
#>   ..$ I09  : chr [1:100] "2" "2" "2" "3" ...
#>   ..$ I10  : chr [1:100] "2" "1" "1" "4" ...
#>   ..$ I11  : chr [1:100] "4" "1" "3" "2" ...
#>   ..$ I12a : chr [1:100] "1" "0" "1" "0" ...
#>   ..$ I12b : chr [1:100] "0" "9" "0" "1" ...
#>   ..$ I12c : chr [1:100] "4" "1" "2" "4" ...
#>   ..$ I13  : chr [1:100] "9" "4" "9" "4" ...
#>   ..$ I14  : chr [1:100] "9" "4" "9" "1" ...
#>   ..$ I15  : chr [1:100] "9" "3" "9" "1" ...
#>   ..$ I16  : chr [1:100] "9" "2" "9" "2" ...
#>   ..$ I17  : chr [1:100] "9" "3" "9" "4" ...
#>   ..$ I18  : chr [1:100] "9" "1" "9" "1" ...
#>   ..$ I19  : chr [1:100] "9" "4" "9" "2" ...
#>   ..$ I20  : chr [1:100] "9" "1" "9" "1" ...
#>   ..$ I21  : chr [1:100] "9" "2" "9" "3" ...
#>  $ booklet2:'data.frame':    100 obs. of  25 variables:
#>   ..$ ID   : chr [1:100] "person200" "person201" "person202" "person203" ...
#>   ..$ hisei: num [1:100] 69 76 47 58 62 78 70 26 57 70 ...
#>   ..$ I08  : chr [1:100] "0" "0" "1" "0" ...
#>   ..$ I09  : chr [1:100] "2" "4" "1" "2" ...
#>   ..$ I10  : chr [1:100] "3" "1" "2" "2" ...
#>   ..$ I11  : chr [1:100] "4" "2" "1" "4" ...
#>   ..$ I12a : chr [1:100] "1" "0" "1" "1" ...
#>   ..$ I12b : chr [1:100] "1" "0" "1" "1" ...
#>   ..$ I12c : chr [1:100] "4" "3" "2" "2" ...
#>   ..$ I13  : chr [1:100] "4" "2" "1" "2" ...
#>   ..$ I14  : chr [1:100] "3" "4" "2" "3" ...
#>   ..$ I15  : chr [1:100] "4" "1" "4" "4" ...
#>   ..$ I16  : chr [1:100] "4" "3" "3" "2" ...
#>   ..$ I17  : chr [1:100] "4" "1" "2" "3" ...
#>   ..$ I18  : chr [1:100] "1" "4" "3" "2" ...
#>   ..$ I19  : chr [1:100] "2" "1" "4" "3" ...
#>   ..$ I20  : chr [1:100] "4" "9" "2" "2" ...
#>   ..$ I21  : chr [1:100] "2" "9" "2" "1" ...
#>   ..$ I22  : chr [1:100] "1" "9" "2" "1" ...
#>   ..$ I23  : chr [1:100] "9" "9" "2" "3" ...
#>   ..$ I24  : chr [1:100] "1" "9" "3" "3" ...
#>   ..$ I25  : chr [1:100] "1" "9" "0" "0" ...
#>   ..$ I26  : chr [1:100] "9" "9" "1" "0" ...
#>   ..$ I27  : chr [1:100] "0" "9" "1" "0" ...
#>   ..$ I28  : chr [1:100] "1" "9" "1" "0" ...
#>  $ booklet3:'data.frame':    100 obs. of  23 variables:
#>   ..$ ID   : chr [1:100] "person300" "person301" "person302" "person303" ...
#>   ..$ hisei: num [1:100] 49 NA 57 32 59 56 55 47 NA 50 ...
#>   ..$ I01  : chr [1:100] "2" "3" "2" "1" ...
#>   ..$ I02  : chr [1:100] "4" "3" "3" "3" ...
#>   ..$ I03  : chr [1:100] "3" "1" "1" "1" ...
#>   ..$ I04  : chr [1:100] "1" "3" "1" "1" ...
#>   ..$ I05  : chr [1:100] "0" "1" "1" "0" ...
#>   ..$ I06  : chr [1:100] "9" "1" "0" "1" ...
#>   ..$ I07  : chr [1:100] "8" "2" "2" "1" ...
#>   ..$ I15  : chr [1:100] "3" "9" "9" "9" ...
#>   ..$ I16  : chr [1:100] "1" "3" "3" "3" ...
#>   ..$ I17  : chr [1:100] "4" "4" "4" "4" ...
#>   ..$ I18  : chr [1:100] "4" "4" "4" "4" ...
#>   ..$ I19  : chr [1:100] "2" "9" "9" "9" ...
#>   ..$ I20  : chr [1:100] "3" "4" "2" "4" ...
#>   ..$ I21  : chr [1:100] "3" "2" "2" "2" ...
#>   ..$ I22  : chr [1:100] "3" "4" "4" "9" ...
#>   ..$ I23  : chr [1:100] "3" "1" "1" "9" ...
#>   ..$ I24  : chr [1:100] "1" "1" "1" "9" ...
#>   ..$ I25  : chr [1:100] "0" "1" "9" "9" ...
#>   ..$ I26  : chr [1:100] "1" "0" "9" "9" ...
#>   ..$ I27  : chr [1:100] "0" "9" "9" "9" ...
#>   ..$ I28  : chr [1:100] "0" "9" "9" "9" ...

Getting the Data

First, we need to get the information from our IQB data bank via ZKDaemon, or alternatively via SPSS.

Items via ZKDaemon

ZKDaemon is a program used by IQB that can be found in the IQB internal folders (i:). After installing you can get the meta data and the items from your specific study using information stored in the IQB Databases (DB2/DB3/DB4). Alternatively you can get the meta data from an SPSS file via ZKDaemon. Within the program you can set missing types and import a test design. Then you can produce Excel files, which are expected to have the following sheets: “units”, “subunits”, “values”, “unitrecoding”, “sav-files”, “params”, “aggregate-missings”, “itemproperties”, “propertylabels”, “booklets”, and “blocks”.

The function readDaemonXlsx() reads the Excel file into R and will produce a warning if any sheets are missing. It needs one character string (filename) containing path, name and extension of the Excel file (.xlsx) produced by ZKDaemon.

The function returns a list of data frames containing information that is required by the data preparation functions. inputList shows an example of this list.

filename <- system.file("extdata", "inputList.xlsx", package = "eatPrep")
inpustList <- readDaemonXlsx(filename)
#> Reading sheet 'units'.
#> Reading sheet 'subunits'.
#> Reading sheet 'values'.
#> Reading sheet 'unitrecoding'.
#> Reading sheet 'sav-files'.
#> Reading sheet 'params'.
#> Reading sheet 'aggregate-missings'.
#> Reading sheet 'booklets'.
#> Reading sheet 'blocks'.
str(inpustList)
#> List of 9
#>  $ units        :'data.frame':   29 obs. of  6 variables:
#>   ..$ unit             : chr [1:29] "I01" "I02" "I03" "I04" ...
#>   ..$ unitLabel        : chr [1:29] "Animals: Weight of a duck" "Animals: Weight of a horse" "Animals: Weight of a mouse" "Animals: Weight of a cat" ...
#>   ..$ unitDescription  : chr [1:29] NA NA NA NA ...
#>   ..$ unitType         : chr [1:29] "TI" "TI" "TI" "TI" ...
#>   ..$ unitAggregateRule: chr [1:29] NA NA NA NA ...
#>   ..$ unitScoreRule    : chr [1:29] NA NA NA NA ...
#>  $ subunits     :'data.frame':   30 obs. of  9 variables:
#>   ..$ unit               : chr [1:30] "I01" "I02" "I03" "I04" ...
#>   ..$ subunit            : chr [1:30] "I01" "I02" "I03" "I04" ...
#>   ..$ subunitType        : chr [1:30] "1" "1" "1" "1" ...
#>   ..$ subunitLabel       : chr [1:30] "Animals: Weight of a duck" "Animals: Weight of a horse" "Animals: Weight of a mouse" "Animals: Weight of a cat" ...
#>   ..$ subunitDescription : chr [1:30] NA NA NA NA ...
#>   ..$ subunitPosition    : chr [1:30] "a)" "b)" "c)" "d)" ...
#>   ..$ subunitTransniveau : chr [1:30] NA NA NA NA ...
#>   ..$ subunitRecoded     : chr [1:30] "I01R" "I02R" "I03R" "I04R" ...
#>   ..$ subunitLabelRecoded: chr [1:30] "Recoded Animals: Weight of a duck" "Recoded Animals: Weight of a horse" "Recoded Animals: Weight of a mouse" "Recoded Animals: Weight of a cat" ...
#>  $ values       :'data.frame':   220 obs. of  8 variables:
#>   ..$ subunit                : chr [1:220] "I01" "I01" "I01" "I01" ...
#>   ..$ value                  : chr [1:220] "1" "2" "3" "6" ...
#>   ..$ valueRecode            : chr [1:220] "0" "0" "1" "mnr" ...
#>   ..$ valueType              : chr [1:220] "vc" "vc" "vc" "mnr" ...
#>   ..$ valueLabel             : chr [1:220] "Response option 1 marked" "Response option 2 marked" "Response option 3 marked" "missing not reached" ...
#>   ..$ valueDescription       : chr [1:220] "Response option 1 marked" "Response option 2 marked" "Response option 3 marked" "missing not reached" ...
#>   ..$ valueLabelRecoded      : chr [1:220] "0" "0" "1" "mnr" ...
#>   ..$ valueDescriptionRecoded: chr [1:220] NA NA NA NA ...
#>  $ unitRecodings:'data.frame':   7 obs. of  7 variables:
#>   ..$ unit             : chr [1:7] "I12" "I12" "I12" "I12" ...
#>   ..$ value            : chr [1:7] "0" "1" "2" "3" ...
#>   ..$ valueRecode      : chr [1:7] "0" "0" "0" "1" ...
#>   ..$ valueType        : chr [1:7] "vc" "vc" "vc" "vc" ...
#>   ..$ valueLabel       : chr [1:7] NA NA NA NA ...
#>   ..$ valueDescription : chr [1:7] NA NA NA NA ...
#>   ..$ valueLabelRecoded: chr [1:7] NA NA NA NA ...
#>  $ savFiles     :'data.frame':   3 obs. of  3 variables:
#>   ..$ filename: chr [1:3] "booklet1.sav" "booklet2.sav" "booklet3.sav"
#>   ..$ case.id : chr [1:3] "ID" "ID" "ID"
#>   ..$ fullname: chr [1:3] NA NA NA
#>  $ newID        :'data.frame':   1 obs. of  2 variables:
#>   ..$ key  : chr "master-id"
#>   ..$ value: chr "ID"
#>  $ aggrMiss     :'data.frame':   7 obs. of  8 variables:
#>   ..$ nam: chr [1:7] "vc" "mvi" "mnr" "mci" ...
#>   ..$ vc : chr [1:7] "vc" "mvi" "vc" "mci" ...
#>   ..$ mvi: chr [1:7] "mvi" "mvi" "err" "mci" ...
#>   ..$ mnr: chr [1:7] "vc" "err" "mnr" "mci" ...
#>   ..$ mci: chr [1:7] "mci" "mci" "mci" "mci" ...
#>   ..$ mbd: chr [1:7] "err" "err" "err" "err" ...
#>   ..$ mir: chr [1:7] "vc" "err" "mir" "mci" ...
#>   ..$ mbi: chr [1:7] "vc" "err" "mnr" "mci" ...
#>  $ booklets     :'data.frame':   3 obs. of  4 variables:
#>   ..$ booklet: chr [1:3] "booklet1" "booklet2" "booklet3"
#>   ..$ block1 : chr [1:3] "bl1" "bl4" "bl3"
#>   ..$ block2 : chr [1:3] "bl2" "bl3" "bl1"
#>   ..$ block3 : chr [1:3] "bl3" "bl2" "bl4"
#>  $ blocks       :'data.frame':   30 obs. of  3 variables:
#>   ..$ subunit             : chr [1:30] "I01" "I02" "I03" "I04" ...
#>   ..$ block               : chr [1:30] "bl1" "bl1" "bl1" "bl1" ...
#>   ..$ subunitBlockPosition: chr [1:30] "1" "2" "3" "4" ...

Merkmalsauszug

The software IQB Item-DB produces Excel files named “Merkmalsauszug” using information stored in the IQB Databases. The file is expected to have the sheets: “Itemmerkmale”, “Aufgabenmerkmale”. The order doesn’t matter. The file contains information about the attributes of the items and exercises.

The function readMerkmalXlsx() read the Excel file into R and will produce a warning if any sheets are missing or wrongly specified. It needs one character string (filename) containing path, name and extension of the Excel file (.xlsx) produced by IQB Item-DB. Per default the Item-ID is created without converting numbers to lowercase letters (tolcl) and a merged data frame contaning both “Itemmerkmale” and “Aufgabenmerkmale” will be created (alleM).

The function returns a list of data frames containing Itemmerkmale, Aufgabenmerkmale and AlleMerkmale (optional).

filename <- system.file("extdata", "itemmerkmale.xlsx", package = "eatPrep")
readMerkmalXlsx(filename, tolcl = FALSE, alleM = TRUE)
#> Reading sheet 'Aufgabenmerkmale'.
#> Reading sheet 'Itemmerkmale'.
#> Data frame 'AlleMerkmale' has been created.
#> $Aufgabenmerkmale
#>                      Aufgabe Zeit.A                     AufgID AufgTitel
#> 1  Animals: Weight of a duck   0:00  Animals: Weight of a duck        NA
#> 2 Animals: Weight of a horse   0:00 Animals: Weight of a horse        NA
#> 
#> $Itemmerkmale
#>                      Aufgabe Teilaufgabe Item Zeit.I Anforderungsbereich.MaP
#> 1  Animals: Weight of a duck           A   01   0:00                       I
#> 2 Animals: Weight of a horse           A   01   0:00                       I
#> 3 Animals: Weight of a horse           B   02   0:00                      II
#>   Bildungsstandards.MaP.allgemeine                Itemtyp.5er
#> 1                               1a GA - Geschlossen Ankreuzen
#> 2                             <NA> GA - Geschlossen Ankreuzen
#> 3                             <NA> GA - Geschlossen Ankreuzen
#>                       AufgID AufgTitel                       ItemID
#> 1  Animals: Weight of a duck        NA  Animals: Weight of a duck01
#> 2 Animals: Weight of a horse        NA Animals: Weight of a horse01
#> 3 Animals: Weight of a horse        NA Animals: Weight of a horse02
#> 
#> $AlleMerkmale
#>                       AufgID                    Aufgabe Teilaufgabe Item
#> 1  Animals: Weight of a duck  Animals: Weight of a duck           A   01
#> 2 Animals: Weight of a horse Animals: Weight of a horse           A   01
#> 3 Animals: Weight of a horse Animals: Weight of a horse           B   02
#>                         ItemID Zeit.I Anforderungsbereich.MaP
#> 1  Animals: Weight of a duck01   0:00                       I
#> 2 Animals: Weight of a horse01   0:00                       I
#> 3 Animals: Weight of a horse02   0:00                      II
#>   Bildungsstandards.MaP.allgemeine                Itemtyp.5er Zeit.A
#> 1                               1a GA - Geschlossen Ankreuzen   0:00
#> 2                             <NA> GA - Geschlossen Ankreuzen   0:00
#> 3                             <NA> GA - Geschlossen Ankreuzen   0:00

SPSS Data

To get the input from SPSS data files, readSpss() reads the data into R and converts all variables to the class character. The function needs the name fo the SPSS data file and has the option to specify some more attributes. For more information use the help function ?readSpss.

readSpss(file)

Checking Data

After reading in all the data, we need to check, whether it is in the right format and all missing codes were assigned correctly.

checkData()

checkData() checks data frames for missing or duplicated entries in the ID variable, persons and/or variables without valid codes, and generally invalid codes. The results of that check will be written in the console.

The function needs a data frame to be checked (dat) and its name (datnam), especially if there are multpile data frames (e.g. a list of data frames). Furthermore it needs data frames with code, subunit and unit information (values, subunits and units), and a string for an ID column name (ID). You can turn of printing information to the console with verbose.

checkData(dat, datnam, values, subunits, units, ID = NULL, verbose = TRUE)

Examples of data frames for values, subunits and units can be found by typing inputList in the console.

checkDesign()

checkDesign() checks whether a data frame corresponds to a particular rotated block design, i.e. whether all persons have valid codes on all items they were presented with and one consistent missing code for all items they were not presented with.

The function also needs a data frame to be checked (dat), as well as data frames containing information on the number and column names of blocks in each booklet (booklets), on the names of subunits and their order within each block (blocks) and about which participant worked on which booklet (rotation). sysMis specifies the missing code for items that were not administered to a participant and id indicates the name of the participant identifier variable in dat.

subunits is an optional argument to identify the names of recoded subunits. And you can turn off printing information with verbose.

checkDesign(dat, booklets, blocks, rotation, sysMis="NA", id="ID", subunits = NULL, verbose = TRUE)

inputDat and inputList are examples on how these data frames are supposed to look like. When you copy and paste the following code in your console, you can look at the data frames.

inputDat
inputList

Merging Data

Now that we checked that the data frames meet our requirements, we can merge a list of data frames into a single data frame by using mergeData(). For that we need a list of data frames, like inputDat which contains data of three booklets. The function returns a data frame containing unique cases and unique variables. All cases and all variables from the original data sets will be kept and matched.

mergeData() provides detailed diagnostics about value mismatches. If two identically named columns in two data sets do not have identical values, NAs are replaced by valid codes stemming from the other data set(s) and if two different valid values are found, the first value will be kept and the other dropped, and the user will be informed about the mismatch. Additionally, NA resulting from the merge (e.g., in repeated block designs) can be replaced with a customed character missing to facilitate future data preparation of the merged data set. See collapseMissings() for details on supported character missings for other functions in the eatPrep package.

When merging data frames with this function you need to specify at least two arguments: newID and datList. newID has to be a character vector length one indicating the name of the identifier variable (ID) in the merged data set and/or the name of the ID in every data frame in datList, if not specified differently in oldIDs. datList is the list of data frames to be merged, e.g. inputDat.

mergedDataset <- mergeData(newID = "ID", datList = inputDat)
#> Start merging of dataset 1.
#> Start merging of dataset 2.
#> Start merging of dataset 3.
str(mergedDataset)
#> 'data.frame':    300 obs. of  32 variables:
#>  $ ID   : chr  "person100" "person101" "person102" "person103" ...
#>  $ hisei: num  49 NA 57 32 59 56 55 47 NA 50 ...
#>  $ I01  : chr  "1" "9" "3" "2" ...
#>  $ I02  : chr  "4" "4" "4" "4" ...
#>  $ I03  : chr  "1" "2" "2" "3" ...
#>  $ I04  : chr  "2" "2" "3" "1" ...
#>  $ I05  : chr  "0" "0" "0" "1" ...
#>  $ I06  : chr  "0" "9" "0" "1" ...
#>  $ I07  : chr  "2" "2" "2" "3" ...
#>  $ I08  : chr  "0" "1" "0" "0" ...
#>  $ I09  : chr  "2" "2" "2" "3" ...
#>  $ I10  : chr  "2" "1" "1" "4" ...
#>  $ I11  : chr  "4" "1" "3" "2" ...
#>  $ I12a : chr  "1" "0" "1" "0" ...
#>  $ I12b : chr  "0" "9" "0" "1" ...
#>  $ I12c : chr  "4" "1" "2" "4" ...
#>  $ I13  : chr  "9" "4" "9" "4" ...
#>  $ I14  : chr  "9" "4" "9" "1" ...
#>  $ I15  : chr  "9" "3" "9" "1" ...
#>  $ I16  : chr  "9" "2" "9" "2" ...
#>  $ I17  : chr  "9" "3" "9" "4" ...
#>  $ I18  : chr  "9" "1" "9" "1" ...
#>  $ I19  : chr  "9" "4" "9" "2" ...
#>  $ I20  : chr  "9" "1" "9" "1" ...
#>  $ I21  : chr  "9" "2" "9" "3" ...
#>  $ I22  : chr  NA NA NA NA ...
#>  $ I23  : chr  NA NA NA NA ...
#>  $ I24  : chr  NA NA NA NA ...
#>  $ I25  : chr  NA NA NA NA ...
#>  $ I26  : chr  NA NA NA NA ...
#>  $ I27  : chr  NA NA NA NA ...
#>  $ I28  : chr  NA NA NA NA ...

Furthermore, you can specify some more arguments, but they have default options, so you don’t need to. Here is an example where the IDs are changed via oldIDs and where NAs are replaced by “mbd” (missing by design). For more information use the help function ?mergeData.

mergedDataset2 <- mergeData(newID = "idstud", datList = inputDat, oldIDs = c("ID", "ID", "ID"), addMbd = TRUE)
#> Start merging of dataset 1.
#> Start merging of dataset 2.
#> Start merging of dataset 3.
#> Start adding mbd according to data pattern.
str(mergedDataset2)
#> 'data.frame':    300 obs. of  32 variables:
#>  $ idstud: chr  "person100" "person101" "person102" "person103" ...
#>  $ hisei : num  49 NA 57 32 59 56 55 47 NA 50 ...
#>  $ I01   : chr  "1" "9" "3" "2" ...
#>  $ I02   : chr  "4" "4" "4" "4" ...
#>  $ I03   : chr  "1" "2" "2" "3" ...
#>  $ I04   : chr  "2" "2" "3" "1" ...
#>  $ I05   : chr  "0" "0" "0" "1" ...
#>  $ I06   : chr  "0" "9" "0" "1" ...
#>  $ I07   : chr  "2" "2" "2" "3" ...
#>  $ I08   : chr  "0" "1" "0" "0" ...
#>  $ I09   : chr  "2" "2" "2" "3" ...
#>  $ I10   : chr  "2" "1" "1" "4" ...
#>  $ I11   : chr  "4" "1" "3" "2" ...
#>  $ I12a  : chr  "1" "0" "1" "0" ...
#>  $ I12b  : chr  "0" "9" "0" "1" ...
#>  $ I12c  : chr  "4" "1" "2" "4" ...
#>  $ I13   : chr  "9" "4" "9" "4" ...
#>  $ I14   : chr  "9" "4" "9" "1" ...
#>  $ I15   : chr  "9" "3" "9" "1" ...
#>  $ I16   : chr  "9" "2" "9" "2" ...
#>  $ I17   : chr  "9" "3" "9" "4" ...
#>  $ I18   : chr  "9" "1" "9" "1" ...
#>  $ I19   : chr  "9" "4" "9" "2" ...
#>  $ I20   : chr  "9" "1" "9" "1" ...
#>  $ I21   : chr  "9" "2" "9" "3" ...
#>  $ I22   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I23   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I24   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I25   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I26   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I27   : chr  "mbd" "mbd" "mbd" "mbd" ...
#>  $ I28   : chr  "mbd" "mbd" "mbd" "mbd" ...

Recoding the Data

After importing the data and making sure it has the right format, the next step is to adjust the missing values.

recodeData()

First, you recode data sets with several kinds of missings values. For that, you need recode data sets with special consideration of missing values. See collapseMissings() for supported types of missing values. recodeData() recodes the specified data frames and will give warnings, if missing or incomplete recode information is found. Values without recode information will not be recoded.

Examples of data frames values and subunits can be found when copy-pasting the following code in you console:

inputList$values
inputList$subunits

recodeData() uses the recode information from those two data frames and recodes the variables on dat accordingly. The columns will be named according to the specifications in subunits$subunitRecoded, if subunits is not provided, item names will not be changed for recoded items.

datRec <- recodeData(dat = inputDat[[1]], values = inputList$values, subunits = inputList$subunits, verbose = TRUE)
#> 
#> Found no recode information for variable(s): 
#> ID, hisei. 
#> This/These variable(s) will not be recoded.
#> 
#> Variables... I01, I02, I03, I04, I05, I06, I07, I08, I09, I10, I11, I12a, I12b, I12c, I13, I14, I15, I16, I17, I18, I19, I20, I21
#> ...have been recoded.
str(datRec)
#> 'data.frame':    100 obs. of  25 variables:
#>  $ ID   : chr  "person100" "person101" "person102" "person103" ...
#>  $ hisei: chr  "49" NA "57" "32" ...
#>  $ I01R : chr  "0" "mbi" "1" "0" ...
#>  $ I02R : chr  "0" "0" "0" "0" ...
#>  $ I03R : chr  "0" "1" "1" "0" ...
#>  $ I04R : chr  "1" "1" "0" "0" ...
#>  $ I05R : chr  "0" "0" "0" "1" ...
#>  $ I06R : chr  "0" "mbi" "0" "1" ...
#>  $ I07R : chr  "0" "0" "0" "0" ...
#>  $ I08R : chr  "0" "1" "0" "0" ...
#>  $ I09R : chr  "0" "0" "0" "0" ...
#>  $ I10R : chr  "1" "0" "0" "0" ...
#>  $ I11R : chr  "0" "0" "1" "0" ...
#>  $ I12aR: chr  "1" "0" "1" "0" ...
#>  $ I12bR: chr  "0" "mbi" "0" "1" ...
#>  $ I12cR: chr  "1" "0" "0" "1" ...
#>  $ I13R : chr  "mbi" "0" "mbi" "0" ...
#>  $ I14R : chr  "mbi" "0" "mbi" "0" ...
#>  $ I15R : chr  "mbi" "0" "mbi" "1" ...
#>  $ I16R : chr  "mbi" "0" "mbi" "0" ...
#>  $ I17R : chr  "mbi" "0" "mbi" "0" ...
#>  $ I18R : chr  "mbi" "1" "mbi" "1" ...
#>  $ I19R : chr  "mbi" "1" "mbi" "0" ...
#>  $ I20R : chr  "mbi" "0" "mbi" "0" ...
#>  $ I21R : chr  "mbi" "0" "mbi" "1" ...

mnrCoding()

Then you can convert missing responses coded as “missing by intention” (mbi) at the end of a block of items to “missing not reached” (mnr) via mnrCoding(). The function returns a data frame with “missing not reached” coded as mnr. For each person with at least one mnr in the returned data set the names of recoded variables are given as an attribute to dat.

For that to work you need to specify the following arguments, as they don’t have any default settings:

Argument Explanation
dat A data set. Missing by intention needs to be coded mbi
pid Name or column number of the identifier (ID) variable in dat
rotation.id A character vector of length 1 indicating the column name of the test booklet identifier in dat
blocks A data frame containing the sequence of subunits in each block in long format. The column names need to be subunit, block, subunitBlockPosition
booklets A data frame containing the sequence of blocks in each booklet in wide format. The column names need to be booklet, block1, block2, block3, …
breaks Number of blocks after which mbi shall be recoded to mnr, e.g., c(1,2) to specify breaks after the first and second block

There are more arguments with default values which to you can specify, but don’t have to.

nMbi, for instance, specifies the number of subunits at the end of a block that need to be coded mbi. The default is 2, i.e. if the last and second to last subitem in a block are coded mbi, both subunits, as well as the preceding subunits coded mbi, will be recoded to mnr. If nMbi is larger than the number of subunits in a given block, no subitem in this block will be recoded. If all subunits in a block are coded mbi, none of them will be recoded to mnr. `nMbi needs to be > 0.

subunits has the default NULL, but when you specify a data frame, recodeMbiToMnr expects to find the recoded subunits in dat.

Examples for the data frames booklets, blocks, rotation and subunits can be found via inputList. Here is an example use case of mnrCoding(). The first two functions (automateDataPretatration() and mergeData()) create the data frame dat for mnrCoding().

prepDat <- automateDataPreparation(inputList = inputList, 
    datList = inputDat, readSpss = FALSE, checkData=FALSE, 
    mergeData = TRUE, recodeData=TRUE, aggregateData=FALSE, 
    scoreData= FALSE, writeSpss=FALSE, verbose = TRUE)
prepDat2 <- mergeData("ID", list(prepDat, inputList$rotation))
mnrDat <- mnrCoding(dat = prepDat2, pid = "ID", 
    booklets = inputList$booklets, blocks = inputList$blocks, 
    rotation.id = "booklet", breaks = c(1, 2), 
    subunits = inputList$subunits, nMbi = 2, mbiCode = "mbi", 
    mnrCode = "mnr", invalidCodes = c("mbd", "mir", "mci"), 
    verbose = TRUE)
#> ...identifying items in data (reference is blocks$subunit)
#> Variables in data not recognized as items:
#> ID, hisei, booklet
#> If some of these excluded variables should have been identified as items (and thus be used for mnr coding) check 'blocks', 'subunits', 'dat'.
#> ...identifying items with no mbi-codes ('mbi'):
#> I04R, I08R
#> If you expect mbi-codes on these variables check your data and option 'mbiCode'
#> mnr statistics:
#>      mnr cells: 553
#>      unique cases with at least one mnr code: 89
#>      unique items with at least one mnr code: 16
#> unique cases ('ID') per booklet and booklet section (0s omitted):
#>    booklet booklet.section N.ID
#> 1 booklet1               2   11
#> 2 booklet1               3   28
#> 3 booklet2               1   28
#> 4 booklet2               2   11
#> 5 booklet2               3    1
#> 6 booklet3               3   31
#> start recoding (item-wise)
#> done
#> elapsed time: 0.1 secs

Type ?mnrCoding into your console to learn more.

Aggregating the Data

After recoding missing values, the subunits are combined one by one via aggregateData().

This function needs three data frames: one containing the data to be aggregated (dat), one containing the subunit information (subunits), and one containing the unit information (units).

Optionally, you can specify how missing values should be aggregated (aggregatemissings) like in the example, or which column names the returned data frame should have (recodedData), for instance. Type ?aggregateData into your console to learn more.

am <- matrix(c(
  "vc" , "mvi", "vc" , "mci", "err", "vc" , "mbi", "err",
  "mvi", "mvi", "err", "mci", "err", "err", "err", "err",
  "vc" , "err", "mnr", "mci", "err", "mir", "mnr", "err",
  "mci", "mci", "mci", "mci", "err", "mci", "mci", "err",
  "err", "err", "err", "err", "mbd", "err", "err", "err",
  "vc" , "err", "mir", "mci", "err", "mir", "mir", "err",
  "mbi", "err", "mnr", "mci", "err", "mir", "mbi", "err",
  "err", "err", "err", "err", "err", "err", "err", "err" ),
  nrow = 8, ncol = 8, byrow = TRUE)

dimnames(am) <-
  list(c("vc" ,"mvi", "mnr", "mci",  "mbd", "mir", "mbi", "err"),
       c("vc" ,"mvi", "mnr", "mci",  "mbd", "mir", "mbi", "err"))
# using datRec from the chapter "recodeData()"
datAggr <- aggregateData(datRec, inputList$subunits, inputList$units,
    aggregatemissings = am, rename = TRUE, recodedData = TRUE,
    suppressErr = TRUE, recodeErr = "mci", verbose = TRUE)
#> All aggregation rules will be defaulted to 'SUM', because no other type is currently supported.
#> Found 20 unit(s) with only one subunit in 'dat'. This/these subunit(s) will not be aggregated and renamed to their respective unit name(s).
#> 1 units were aggregated: I12.

Scoring the Data

The next step is to score data set with special consideration of missing values. The function scoreData() is very similar to recodeData(), but with a few defaults that are better suited for scoring. scoreData() will give warnings when incomplete scoring information is found. Values without scoring information will not be scored.

Again, you need to specify three data frames. One data frame that you get after completing the steps you did so far or by using automateDataPrepatation (dat), one with information about the scoring of units (unitrecodings), and one with subunit information (subunits). Examples for the last two data frames can be found via inputList.

prepDat <- automateDataPreparation (inputList = inputList, datList = inputDat,
    readSpss = FALSE, checkData=FALSE, mergeData = TRUE, recodeData=TRUE,
    aggregateData=TRUE, scoreData= FALSE, writeSpss=FALSE, verbose = TRUE)
datSco <- scoreData(prepDat, inputList$unitRecodings, inputList$subunits,
    verbose = TRUE)
#>  1 unit was scored: `I12`.

Collapsing the Data

The last step is to recode character missings to numerical types like 0 or NA with collapseMissings(). All variables need to be converted to numeric, as well, so you can pass the data set to eatModel afterwards.

You need the data frame that is returned by recodeData, which we called datRec earlier (dat).

missing.rule is a list containing information which character missings should be converted to 0or NA. It has default settings, but you can adapt them if needed. Type ?collapseMissings into your console to learn more.

datColMis <- collapseMissings(datRec)

After doing that you should have a data frame that you can now use for further processing with the eatModel package.