Analyzing data on CRAN packages

There’s a handy new function in R 3.4.0 for anyone interested in data about CRAN packages. It’s not documented, but it’s pretty simple: calling tools::CRAN_package_db() returns a data frame with one row for every package on CRAN and 65 columns of data on those packages, as shown below.
> names(tools::CRAN_package_db())
 [1] "Package"                 "Version"                 "Priority"
 [4] "Depends"                 "Imports"                 "LinkingTo"
 [7] "Suggests"                "Enhances"                "License"
[10] "License_is_FOSS"         "License_restricts_use"   "OS_type"
[13] "Archs"                   "MD5sum"                  "NeedsCompilation"
[16] "Additional_repositories" "Author"                  "Authors@R"
[19] "Biarch"                  "BugReports"              "BuildKeepEmpty"
[22] "BuildManual"             "BuildResaveData"         "BuildVignettes"
[25] "Built"                   "ByteCompile"             "Classification/ACM"
[28] "Classification/ACM-2012" "Classification/JEL"      "Classification/MSC"
[31] "Classification/MSC-2010" "Collate"                 "Collate.unix"
[34] ""                        "Contact"                 "Copyright"
[37] "Date"                    "Description"             "Encoding"
[40] "KeepSource"              "Language"                "LazyData"
[43] "LazyDataCompression"     "LazyLoad"                "MailingList"
[46] "Maintainer"              "Note"                    "Packaged"
[49] "RdMacros"                "SysDataCompression"      "SystemRequirements"
[52] "Title"                   "Type"                    "URL"
[55] "VignetteBuilder"         "ZipData"                 "Published"
[58] "Path"                    "X-CRAN-Comment"          "Reverse depends"
[61] "Reverse imports"         "Reverse linking to"      "Reverse suggests"
[64] "Reverse enhances"        "MD5sum"
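
One practical wrinkle in the listing above is that "MD5sum" appears twice (positions 14 and 65), which can trip up tools that expect unique column names. The sketch below shows one way to handle that in base R. The helper name drop_duplicate_columns and the toy data frame are made up for illustration; only tools::CRAN_package_db() itself comes from R.

```r
# Hypothetical helper (not part of R): keep only the first column
# for each distinct column name, selecting by position.
drop_duplicate_columns <- function(db) {
  db[, !duplicated(names(db)), drop = FALSE]
}

# Toy stand-in for the real table: made-up rows with a repeated
# "MD5sum" column, mimicking the duplicate seen in the listing.
toy_db <- data.frame(
  Package = c("pkgA", "pkgB"),
  MD5sum  = c("aaa", "bbb"),
  MD5sum  = c("aaa", "bbb"),
  check.names = FALSE,        # allow the duplicate name to survive
  stringsAsFactors = FALSE
)

clean_db <- drop_duplicate_columns(toy_db)
print(names(clean_db))

# With R >= 3.4.0 and network access to a CRAN mirror, the same helper
# could be applied to the real data:
# db <- drop_duplicate_columns(tools::CRAN_package_db())
```

Dropping the repeat assumes the two columns carry the same information; if that is in doubt, make.unique(names(db)) renames the second one instead of discarding it.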

In a recent blog post, Julia Silge analyzes this database to uncover some interesting statistics about CRAN packages. For example, by mining the Description field of the packages, we can see which words (other than stopwords like “the” and “and”) are most commonly used. Unsurprisingly, most are associated with data and data analysis methods.
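
To give a feel for this kind of word counting, here is a minimal base R sketch. It is not Julia’s code: the function name count_description_words, the short stopword list, and the toy descriptions are all assumptions made for illustration, and a real analysis would want a fuller stopword list.

```r
# Hypothetical helper: count non-stopword words in a character vector
# of package descriptions, most frequent first.
count_description_words <- function(descriptions,
                                    stopwords = c("the", "and", "of", "a",
                                                  "to", "for", "in", "is")) {
  text  <- tolower(descriptions[!is.na(descriptions)])  # skip missing fields
  words <- unlist(strsplit(text, "[^a-z]+"))            # split on non-letters
  words <- words[nzchar(words) & !(words %in% stopwords)]
  sort(table(words), decreasing = TRUE)
}

# Made-up descriptions standing in for the real Description column.
toy_descriptions <- c(
  "Tools for the analysis of data.",
  "Functions to fit regression models to data.",
  NA
)

word_counts <- count_description_words(toy_descriptions)
print(head(word_counts, 3))

# With network access, the real column could be passed in instead:
# word_counts <- count_description_words(tools::CRAN_package_db()$Description)
```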

Julia also analyzed which packages include tests (via the testthat or RUnit packages), provide a link for reporting bugs, or provide vignettes (dynamically-created documentation). By this analysis, more than 60% of packages provide none of these.
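
A rough sketch of how such flags could be derived from the columns listed earlier follows. The choice of fields (Suggests for testthat/RUnit, BugReports, VignetteBuilder) is one plausible reading of the analysis rather than a copy of Julia’s code, and the helper name flag_practices and the toy rows are invented for illustration.

```r
# Hypothetical helper: derive three yes/no flags per package from
# DESCRIPTION-style fields.
flag_practices <- function(db) {
  data.frame(
    Package      = db$Package,
    # grepl() returns FALSE for NA, so packages with no Suggests count as "no tests"
    has_tests    = grepl("testthat|RUnit", db$Suggests),
    has_bug_link = !is.na(db$BugReports),
    has_vignette = !is.na(db$VignetteBuilder),
    stringsAsFactors = FALSE
  )
}

# Made-up rows standing in for the real table.
toy_pkgs <- data.frame(
  Package         = c("pkgA", "pkgB", "pkgC"),
  Suggests        = c("testthat, knitr", NA, "RUnit"),
  BugReports      = c("https://example.org/pkgA/issues", NA, NA),
  VignetteBuilder = c("knitr", NA, NA),
  stringsAsFactors = FALSE
)

flags <- flag_practices(toy_pkgs)

# Share of packages with none of the three practices.
has_none   <- !(flags$has_tests | flags$has_bug_link | flags$has_vignette)
share_none <- mean(has_none)
print(share_none)
```

Checking only Suggests will miss packages that list a test framework elsewhere or ship tests without one, which is the limitation discussed in the next paragraph.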

This probably understates how many packages include tests, though: many packages (especially older ones) have a tests folder that doesn’t rely on testthat or RUnit, and it’s not apparent how to identify those packages from the data. Tests can also be included in the help files for package functions. In particular, the CRAN maintainers will reject any package that includes no tests (except those that don’t require them, like data packages), so this estimate doesn’t look right to me. Nonetheless, the new CRAN_package_db function provides a useful data source for exploring the rich world of CRAN packages. You can see further examples in Julia’s blog post, linked below.
data science ish: Mining CRAN DESCRIPTION Files

from Revolutions

