microData

Author

Gutama Girja Urago

Published

May 14, 2024

What?

I am developing the microData package to search, browse, and extract metadata from microdata provided by the World Bank (WB), Food and Agriculture Organization (FAO), International Household Survey Network (IHSN), United Nations High Commissioner for Refugees (UNHCR), and International Labour Organization (ILO) via the NADA API. Any researcher who has used microdata from these organizations knows how difficult and time-consuming it is to understand these data and import them and their variables into R. If you use microdata or plan to, this R package is meant to save you that effort.

Abstract

The purpose of microData is to simplify the extraction of complex metadata from data provided by various organizations, and so make data preparation more efficient. At the moment, it supports five international organizations: the World Bank, FAO, UNHCR, IHSN, and ILO. It can search, filter, and extract metadata, and do most other things you can do in the web catalogs, but it cannot download the data files themselves. To my knowledge, there is currently no documentation for doing so through the API; I suspect this is due to data licensing, since only a few datasets are accessible through the API. The package can also retrieve the names of the variables in a specific survey, along with their labels. It lets you select only the variables you are interested in and rename them, while assigning the variable descriptions as label attributes. You can also set custom names and labels for the dataset. Labels play a crucial role when exporting tables and graphs, because they save you from typing long names into manuscripts manually. This package aims to remove these difficulties.

Warning: Since this package is still under development, I don’t recommend using it in reproducible code yet, because functions and their arguments may change in the future.

Installation

You can install the development version of microData from GitHub with:

# install.packages("devtools")
devtools::install_github("GutUrago/microData")

Collection

All organizations supported by this package use the NADA API to publish microdata, so they share similar terminology. A collection is simply a group of related studies or datasets. To see all available collections, you can use the collections() function.

library(microData)

collections(org = "wb") |> 
  head() |> 
  kableExtra::kable()

id	repo_id	title
26	afrobarometer	Afrobarometer
2	datafirst	DataFirst , University of Cape Town, South Africa
22	dime	Development Impact Evaluation (DIME)
1	microdata_rg	Development Research Microdata
4	enterprise_surveys	Enterprise Surveys
30	fao	FAO - Food and Agriculture Microdata Catalog

Searching

This package gives you the same search flexibility as the web catalogs. For more details, see the documentation for search_catalog().

search_catalog(
  keyword = "food",
  org = "unhcr",
  from = 2015,
  to = 2024,
  country = "Ethiopia",
  sort_by = "year",
  sort_order = "desc", 
  results = 3) |> 
  kableExtra::kable()

idno	formid	title	country	authoring_entity	form_model	year_start	year_end	repositoryid	repo_title	created	changed	varcount	total_views	total_downloads	rank	type	id	var_found	url	iso3
UNHCR_ETH_2023_IKEA_v2.1	3	Baseline-Endline Panel Survey of Refugee Cooperative Members in UNHCR Sub-office Operational Area of Melkadida, Ethiopia, 2023	Ethiopia	UNHCR	licensed	2023	2023	EHA	East and Horn of Africa	2023-10-17T15:25:32+00:00	2024-05-30T11:36:57+00:00	287	2576	0	0.7357283	survey	1021	12	https://microdata.unhcr.org/index.php/catalog/1021	ETH
WBG_ETH_2020_HFPSR_v01_M	5	Monitoring COVID-19 Impact on Refugees in Ethiopia: High-Frequency Phone Survey of Refugees 2020	Ethiopia	World Bank-UNHCR Joint Data Center on Forced Displacement (JDC)	remote	2020	2020	EHA	East and Horn of Africa	2022-07-05T11:29:36+00:00	2022-07-05T11:29:53+00:00	392	3413	0	0.3826203	survey	704	19	https://microdata.unhcr.org/index.php/catalog/704	ETH
UNHCR_ETH_SENS_2018_v2.1	3	Standardized Expanded Nutrition Survey (SENS) in Melkadida Refugee Camps - 2018	Ethiopia	UNHCR	licensed	2018	2018	EHA	East and Horn of Africa	2019-07-15T13:21:23+00:00	2019-12-05T13:38:14+00:00	105	4480	212	0.6687105	survey	114	8	https://microdata.unhcr.org/index.php/catalog/114	ETH

There is also a handy function to check the latest entries published in a catalog.

latest_entries(org = "wb", limit = 3) |> 
  kableExtra::kable()

id	idno	title	country	created	changed	url
6269	CIV_2021_PEJEDEC-AFL_v01_M	Youth Employment and Skills Development Project - Apprenticeship Firm Listing 2021	Côte d’Ivoire	Jul-24-2024	Jul-24-2024	https://microdata.worldbank.org/index.php/catalog/6269
6268	CIV_2014-2016_PEJEDEC-AFS_v01_M	Youth Employment and Skills Development Project - Apprenticeship Firms Surveys 2014-2016	Côte d’Ivoire	Jul-24-2024	Jul-24-2024	https://microdata.worldbank.org/index.php/catalog/6268
6267	CIV_2014-2018_PEJEDEC-AYS_v01_M	Youth Employment and Skills Development Project - Apprenticeship Youth Surveys 2014-2018	Côte d’Ivoire	Jul-24-2024	Jul-24-2024	https://microdata.worldbank.org/index.php/catalog/6267
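The url column in both result tables above appears to follow a single pattern: the catalog host, then /index.php/catalog/, then the numeric study id. A minimal sketch, assuming that pattern holds for the other catalogs too; catalog_url() is a hypothetical helper written for this post, not a microData function.

```r
# Hypothetical helper (not part of microData): build a study page URL
# from a catalog host and a numeric study id, following the pattern
# visible in the `url` column of the results above.
catalog_url <- function(host, id) {
  # as.integer() + %d avoids scientific notation for large ids
  sprintf("https://%s/index.php/catalog/%d", host, as.integer(id))
}

# Hosts and ids taken from the output shown above
unhcr_url <- catalog_url("microdata.unhcr.org", 1021)
wb_urls <- catalog_url("microdata.worldbank.org", c(6269, 6268, 6267))

print(unhcr_url)
print(wb_urls)
```

This can be handy for opening a study page with utils::browseURL() when you only have its id.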

You can use data_files() to see the data files included in a study. Let’s look at one of the popular surveys in the WB catalog. Instead of the study name, we can also use the numeric ID of the study, which is 3110 (see the next code chunk).

data_files(id = "IND_2015_DHS_v01_M_v02_A_IPUMS", org = "wb") |> 
  kableExtra::kable()

	id	sid	file_id	file_name	description	case_count	var_count
B	114450	3110	B	IND2015-B.dat	Birth records	1315617	NULL
C	114451	3110	C	IND2015-C.dat	Child records	259627	NULL
H	114453	3110	H	IND2015-H.dat	Household member records	2869043	NULL
M	114452	3110	M	IND2015-M.dat	Man records	112122	NULL
W	114449	3110	W	IND2015-W.dat	Woman records	699686	NULL

How about variables included in the data file? Of course you can check them as well.

variables(id = 3110, file_id = "W") |> 
  head() |> 
  kableExtra::kable()

uid	sid	fid	vid	name	labl
2609913	3110	W	W_SAMPLE	W_SAMPLE	IPUMS-DHS sample identifier
2609914	3110	W	W_SAMPLESTR	W_SAMPLESTR	IPUMS-DHS sample identifier (string)
2609915	3110	W	W_COUNTRY	W_COUNTRY	Country
2609916	3110	W	W_YEAR	W_YEAR	Year of sample
2609917	3110	W	W_IDHSPID	W_IDHSPID	Unique cross-sample respondent identifier
2609918	3110	W	W_IDHSHID	W_IDHSHID	Unique cross-sample household identifier
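Because the result is an ordinary table with name and labl columns, the labels can be searched with base R to find variables of interest. The sketch below hand-copies the six rows shown above into a data frame so that it runs without an API call; the assumption that variables() returns a plain data frame is mine, based only on the printed output.

```r
# Stand-in for the variables() output above (name and labl columns only),
# copied by hand so the example runs offline.
vars <- data.frame(
  name = c("W_SAMPLE", "W_SAMPLESTR", "W_COUNTRY",
           "W_YEAR", "W_IDHSPID", "W_IDHSHID"),
  labl = c("IPUMS-DHS sample identifier",
           "IPUMS-DHS sample identifier (string)",
           "Country",
           "Year of sample",
           "Unique cross-sample respondent identifier",
           "Unique cross-sample household identifier"),
  stringsAsFactors = FALSE
)

# Keep only the variables whose label mentions "identifier"
hits <- vars[grepl("identifier", vars$labl, ignore.case = TRUE), ]
print(hits$name)
```

The same grepl() filter works on the full table, which for a DHS file can run to hundreds of rows.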

Setting Attributes

Variables in microdata are often given names that say nothing about their content and only reflect the order of the questions, like this:

v1	v2	v3
22	male	BE
23	female	DE
24	male	BE
25	female	DE

You can then prepare a second data frame that holds the metadata, like this. Vignettes will explain this in detail later.

vars	name	labs
v1	age	Age
v2	gender	Gender
v3	country	Country
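To follow along, the two tables above can be recreated in base R. The object names mdt and metadata are the ones passed to set_attributes() in this post; how they were originally created is not shown, so the construction below is an assumption.

```r
# Toy data mirroring the first table: columns named by question order
mdt <- data.frame(
  v1 = c(22L, 23L, 24L, 25L),
  v2 = c("male", "female", "male", "female"),
  v3 = c("BE", "DE", "BE", "DE"),
  stringsAsFactors = FALSE
)

# Lookup table mirroring the second table: one row per variable,
# with the current name, the desired name, and the label
metadata <- data.frame(
  vars = c("v1", "v2", "v3"),
  name = c("age", "gender", "country"),
  labs = c("Age", "Gender", "Country"),
  stringsAsFactors = FALSE
)
```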

You can use the set_attributes() function to rename these variables and set their labels.

my_data <- set_attributes(
  mdt, 
  metadata,
  old_name = "vars",
  new_name = "name",
  label = "labs")
kableExtra::kable(my_data)

age	gender	country
22	male	BE
23	female	DE
24	male	BE
25	female	DE

The labels are assigned as well, as str() shows:

str(my_data)

'data.frame':   4 obs. of  3 variables:
 $ age    : int  22 23 24 25
  ..- attr(*, "label")= chr "Age"
 $ gender : chr  "male" "female" "male" "female"
  ..- attr(*, "label")= chr "Gender"
 $ country: chr  "BE" "DE" "BE" "DE"
  ..- attr(*, "label")= chr "Country"
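For readers curious about what a "label" attribute is, the same end result can be reproduced in base R. This is only a sketch of the idea under my own assumptions, not the implementation of set_attributes(); rename_and_label() is a hypothetical helper, and the toy data are rebuilt here so the block runs on its own.

```r
# Toy inputs matching the tables in this section
mdt <- data.frame(
  v1 = c(22L, 23L, 24L, 25L),
  v2 = c("male", "female", "male", "female"),
  v3 = c("BE", "DE", "BE", "DE"),
  stringsAsFactors = FALSE
)
metadata <- data.frame(
  vars = c("v1", "v2", "v3"),
  name = c("age", "gender", "country"),
  labs = c("Age", "Gender", "Country"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of microData): rename columns and attach
# a "label" attribute to each one, using a lookup table.
rename_and_label <- function(data, meta, old_name, new_name, label) {
  # For each column, find its row in the lookup table (NA if absent)
  idx <- match(names(data), meta[[old_name]])
  for (j in which(!is.na(idx))) {
    attr(data[[j]], "label") <- meta[[label]][idx[j]]
    names(data)[j] <- meta[[new_name]][idx[j]]
  }
  data
}

labelled <- rename_and_label(mdt, metadata, "vars", "name", "labs")

# Collect the labels into a named character vector
labels <- vapply(labelled, function(x) attr(x, "label"), character(1))
print(labels)
```

Columns without a matching metadata row are left untouched in this sketch; whether set_attributes() keeps or drops such columns is not covered here.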

More coming soon!