Diabetes Data Analysis in R

Data collected from diabetes patients is widely studied in data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both are aggregated, labeled, and relatively straightforward to use for downstream machine learning tasks. In the real world, however, diabetes data are often collected from healthcare instruments attached to patients, and the raw data can be sporadic and messy; analyzing such data requires more preprocessing. In this blog, we explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics, an option to Oracle Database Enterprise Edition. Note that this data analysis is for machine learning study only. We are not medical researchers or physicians in the diabetes domain; our knowledge of this disease comes from the material included with the data set.

Data Overview

The data is from the UCI archive. It was collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file containing 3-4 months of glucose level measurements and insulin dosages, along with other special events (exercise, meal ingestion, etc.). First, we need to construct a single data frame from the 70 separate files. This can be readily accomplished in R as follows; had the data been provided as several database tables, the function rbind, overloaded by ORE to work on ore.frame objects, could have been used to union those tables.

dd.list <- vector("list", 70)
for(i in 1:70) {
  fileName <- sprintf("data-%02d", i)
  dd <- read.csv(fileName, header=FALSE, sep='\t')
  # combine the date and time columns into a single POSIXct timestamp
  datetime.vec <- paste(dd$V1, dd$V2)
  dd$datetime <- as.POSIXct(strptime(datetime.vec, "%m-%d-%Y %H:%M"))
  colnames(dd) <- c('DATE', 'TIME', 'CODE', 'VALUE', 'DATETIME')
  dd$CODE <- as.factor(dd$CODE)
  dd.list[[i]] <- data.frame(ID=i, dd)
}
dd.df <- do.call("rbind", dd.list)
dd.df <- subset(dd.df, !is.na(dd.df$DATETIME))  # drop rows with unparseable timestamps
dd.df$NO <- row.names(dd.df)                    # keep a unique row identifier
head(dd.df)
ID       DATE  TIME CODE VALUE            DATETIME
 1 04-21-1991  9:09   58   100 1991-04-21 09:09:00
 1 04-21-1991  9:09   33     9 1991-04-21 09:09:00
 1 04-21-1991  9:09   34    13 1991-04-21 09:09:00
 1 04-21-1991 17:08   62   119 1991-04-21 17:08:00
 1 04-21-1991 17:08   33     7 1991-04-21 17:08:00
 1 04-21-1991 22:51   48   123 1991-04-21 22:51:00

We can store the data frame in Oracle Database using the ORE function ore.create.

library(ORE)
ore.connect() # connect to Oracle Database
ore.drop(table="DD")
ore.create(dd.df, table="DD")

The column ID represents the patient ID, and DATETIME is the timestamp at which the event or measurement occurred. The field CODE indicates the particular type of measurement; the exact mapping can be found in the 'Data-Codes' file. Here is a sample of the codes.

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
62 = Pre-supper blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
69 = Typical exercise activity
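
For convenience, the mapping can also be kept in a named lookup vector and joined to the data when labeling plots or tables. This is a minimal sketch of our own; the vector code.desc and the CODE_DESC column are illustrative additions, not part of the original data set:

code.desc <- c('33' = 'Regular insulin dose',
               '34' = 'NPH insulin dose',
               '35' = 'UltraLente insulin dose',
               '48' = 'Unspecified blood glucose measurement',
               '58' = 'Pre-breakfast blood glucose measurement',
               '62' = 'Pre-supper blood glucose measurement',
               '65' = 'Hypoglycemic symptoms',
               '66' = 'Typical meal ingestion',
               '69' = 'Typical exercise activity')
dd.df$CODE_DESC <- code.desc[as.character(dd.df$CODE)]  # NA for codes not listed above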

What can we do with this type of data? In the raw files, data points are recorded in a 'transaction' style and the time intervals are irregular. The data set also lacks any clear label or indicator. This makes the data difficult to work with as-is, so we need to preprocess it before applying machine learning.
Next, we show how to leverage the data to carry out several analyses.

Clustering Analysis

Since the patients may differ in the severity of their symptoms and in their treatment (such as insulin dose), we first conduct a clustering analysis to see if there are underlying groups.
For now, we ignore the timestamps and aggregate at the patient level. We calculate the average VALUE for each CODE, so that each average can be used as a feature. Note that for an event CODE, the VALUE is always zero, since it only indicates that an event happened at that time. In that case, we instead calculate the average number of occurrences per day.
For each patient, we combine this information into a feature vector. Here, we need to transpose the data frame, turning the CODE values into separate columns. This is a 'pivot table' operation, which can be done in R with the reshape2 package.
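
As a toy illustration of the pivot (the miniature data frame below is made up for demonstration):

library(reshape2)
long <- data.frame(ID = c(1, 1, 2), CODE = c(33, 58, 33), MEASURE = c(7, 150, 9))
dcast(long, ID ~ CODE, value.var = "MEASURE")
#   ID 33  58
# 1  1  7 150
# 2  2  9  NA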
We can use reshape2 through ORE embedded R execution. See the code below.

aggregate.code <- function(DD, full_code_list, event_list) {
  # average measurement value per patient and CODE
  code_filtered <- subset(DD, CODE %in% full_code_list)
  code_agg <- aggregate(code_filtered$VALUE,
                        by=list(ID=code_filtered$ID, CODE=code_filtered$CODE),
                        FUN=mean,
                        na.rm=TRUE)
  colnames(code_agg)[3] <- 'MEASURE'

  # for event codes, VALUE is always zero, so count occurrences instead
  activity_filtered <- subset(DD, CODE %in% event_list)
  activity_count <- aggregate(activity_filtered$VALUE,
                              by=list(ID=activity_filtered$ID, CODE=activity_filtered$CODE),
                              FUN=length)
  colnames(activity_count)[3] <- 'MEASURE'

  # normalize event counts to an average number of occurrences per day
  total.date <- length(unique(DD$DATE))
  activity_count$MEASURE <- activity_count$MEASURE*1.0/total.date

  feature.agg <- rbind(code_agg, activity_count)
  feature.df <- aggregate(feature.agg$MEASURE,
                          by=list(ID=feature.agg$ID,
                                  CODE=feature.agg$CODE),
                          FUN=max)

  # pivot: one row per patient, one column per CODE
  library(reshape2)
  cast.df <- dcast(feature.df, ID~CODE)
  for (col in colnames(cast.df)) {
    cast.df[,col] <- ifelse(is.na(cast.df[,col]), 0, cast.df[,col])
  }
  ore.drop(table='PIVOTED')
  ore.create(cast.df, table='PIVOTED')
  TRUE
}
full_code_list <- c(33, 34, 35, 48, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
event_list <- c(65, 66, 67, 68, 69, 70, 71, 72)
ore.tableApply(DD,                  # an ore.frame referencing the prepared database table
               aggregate.code,      # function defined above to execute on the table
               ore.connect = TRUE,  # allows creating the resulting table inside the function
               full_code_list = full_code_list,  # data for the function arguments
               event_list = event_list)
ore.sync(table = "PIVOTED")         # make the new table accessible in the client session

We have now obtained an ORE data frame (ore.frame), PIVOTED, that contains the average value of each CODE for each patient ID. The entire operation is performed at the database server and the result is stored as a database table. Let us have a look at the resulting data set:

head(PIVOTED[,1:7])
ID 33 34 35 48 57 58
1 6.593750 16.892086 0.000000 150.1538 0.00 169.7185
2 10.060847 11.333333 0.000000 201.4022 0.00 207.8438
3 2.433333 8.000000 8.452055 0.0000 120.50 117.6750
4 2.304348 8.413793 8.444444 0.0000 142.75 141.5714
5 2.388889 8.500000 0.000000 0.0000 183.40 147.4324
6 6.084746 18.000000 0.000000 246.5556 0.00 213.5238

For ease of presentation, we omitted the other CODE values; in total there are 20 types of CODE. The next job is to cluster this data. Clustering on high-dimensional data is usually not ideal, since the distances between data points tend to be large. Instead, we can run a PCA and cluster on the principal components (PCs), which capture the directions of highest variation in the features.
In ORE, the following code carries out the PCA. Note that the function looks the same as in open source R; it is overloaded in ORE.

dd.pca <- prcomp(PIVOTED[, -1],  # use the ore.frame and exclude the ID column
                 center = TRUE,
                 scale. = TRUE)

We can convert the original data frame to the space of principal components (PCs) using the code below.

dd.pca.ore <- ore.predict(dd.pca, PIVOTED[, -1])

In the principal component space, we use a k-means clustering method to do the clustering over the first two PCs.

model.km <- ore.odmKMeans(~., dd.pca.ore[, c("PC1","PC2")], num.centers=5)
km.res <- predict(model.km, dd.pca.ore, type="class", supplemental.cols=c("PC1","PC2"))

To generate the plot, we use ggplot2, which can also be used through ORE embedded R execution. For simplicity here, however, we pull the data from the database and call ggplot2 on the client.

library(ggplot2)
km.res.df <- ore.pull(km.res)
ggplot(km.res.df, aes(PC1, PC2)) +
  geom_point(aes(color=factor(CLUSTER_ID)))

The clusters are illustrated in the plot.
[Figure: k-means clusters plotted in the PC1-PC2 plane, colored by CLUSTER_ID]
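
For reference, here is a hedged sketch of the embedded alternative, which runs ggplot2 inside the database R engine via ore.tableApply instead of pulling the data first. The anonymous plotting function is our own illustration; it assumes that, when embedded R execution is invoked through the ORE R interface, graphics generated in the database are rendered back on the client.

ore.tableApply(km.res,
               function(dat) {
                 library(ggplot2)
                 print(ggplot(dat, aes(PC1, PC2)) +
                       geom_point(aes(color=factor(CLUSTER_ID))))
               })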

A natural question is: what are the essential differences between the clusters? We plot boxplots of glucose level and insulin dose to get an idea of how the patients' characteristics are distributed across clusters.

pivoted <- ore.pull(PIVOTED)
library(gridExtra)
# prefix the code columns so they are syntactically valid, readable column names
colnames(pivoted) <- c('ID', paste("CODE", colnames(pivoted[,-1]), sep = '_'))
label.df <- cbind(pivoted, km.res.df$CLUSTER_ID)
colnames(label.df)[ncol(label.df)] <- 'CLUSTER_ID'
p1 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_35)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'UltraLente insulin dose') +
  ggtitle("Boxplot of Selected Features across Patient Groups") +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_65)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Hypoglycemic symptoms')
p3 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_69)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Typical exercise')
p4 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_62)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Pre-supper glucose')
grid.arrange(p1, p2, p3, p4, nrow=4)

[Figure: boxplots of selected features (insulin dose, hypoglycemic symptoms, exercise, pre-supper glucose) across patient clusters]

From these boxplots, we can see that the differences among the patient clusters involve insulin dose, frequency of hypoglycemic symptoms, frequency of exercise, and pre-supper glucose level. One interesting observation is that the patients with more exercise (cluster 8) have relatively lower glucose levels than groups with a similar insulin dose (cluster 5), but more hypoglycemic symptoms (which occur because of low glucose levels).
Although this clustering analysis sheds some light on differences among patients, it discards much of the information in the timestamped data. Next, we will see how to leverage the data without aggregation.

Regression Analysis

One important task for a diabetes patient is to estimate the glucose level in order to control the insulin dose. This topic has been widely studied and is an important direction in systems biology. Previous work typically had access to timestamped data at high frequency (every 15 minutes), whereas in this data set the intervals between data points are long and irregular.
We can still regress the glucose level on all factors within a time window. For a glucose measurement at a particular time, we consider all insulin doses and events in the 14 hours before that time. A regression model can then be built from the insulin doses of each type and the preceding events.
To do the regression, we need to create a data frame that contains the target glucose level and all features related to it. For that purpose, we go through each row of the original data frame, find all rows within the time window, and collect the features for that data point. This is done in the following function.

related.row <- function(row, dd.df) {
  # all rows for this patient within the 14 hours preceding the measurement;
  # compute time differences explicitly in seconds to avoid difftime unit ambiguity
  range.df <- subset(dd.df,
                     ID == row$ID &
                     as.numeric(difftime(row$DATETIME, DATETIME, units="secs")) < 14*3600 &
                     as.numeric(difftime(row$DATETIME, DATETIME, units="secs")) > 0)
  range.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, range.df$DATETIME, units="hours"))
  # most recent event (meal, exercise, etc.) in the window
  events.df <- subset(range.df, CODE %in% c(66, 67, 68, 69, 70, 71))
  events.df <- events.df[which.max(events.df$DATETIME), ]
  events.df$LAST_EVENT <- ifelse(events.df$CODE %in% c(67), 1, -1)
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35', 'LAST_EVENT',
                 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  # most recent glucose measurement in the window
  glu.df <- subset(range.df, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
  last.glu <- glu.df[which.max(glu.df$DATETIME), ]
  last.glu$TIME_DIFF <- as.numeric(difftime(row$DATETIME, last.glu$DATETIME, units="hours"))
  # most recent dose of each insulin type
  C33 <- subset(range.df, CODE == 33)
  C33 <- C33[which.max(C33$DATETIME), c('VALUE', 'TIME_DIFF')]
  C34 <- subset(range.df, CODE == 34)
  C34 <- C34[which.max(C34$DATETIME), c('VALUE', 'TIME_DIFF')]
  C35 <- subset(range.df, CODE == 35)
  C35 <- C35[which.max(C35$DATETIME), c('VALUE', 'TIME_DIFF')]
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF, C35$VALUE, C35$TIME_DIFF,
                   events.df$LAST_EVENT, events.df$TIME_DIFF, last.glu$VALUE, last.glu$TIME_DIFF,
                   row$VALUE)
  # replace missing features (no matching row in the window) with 0
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x) })
  new.df <- data.frame(row.result)
  colnames(new.df) <- col.names
  new.df[1,]
}

combine.row <- function(DD.glu, dd.df, related.row) {
  N.glu <- nrow(DD.glu)
  row <- DD.glu[1,]
  train.df <- related.row(row, dd.df)
  for(i in 2:N.glu) {
    row <- DD.glu[i,]
    new.row <- related.row(row, dd.df)
    if(all(new.row[1,]==0) != TRUE)  # skip measurements with no features in the window
      train.df <- rbind(train.df, new.row)
  }
  train.df <- train.df[-1,]          # drop the seed row used to initialize the frame
  ore.drop(table='TRAIN')
  ore.create(train.df, table='TRAIN')
  TRUE
}
DD.glu <- subset(DD, CODE %in% c(48, 57, 58, 60, 62))
row.names(DD.glu) <- DD.glu$NO
res <- ore.tableApply(DD.glu,
combine.row,
dd.df = DD, related.row = related.row,
ore.connect = TRUE)
ore.sync(table ='TRAIN')

We can have a look at the data frame obtained:

C33 T33 C34 T34 C35 T35 LAST_EVENT EVENT_TIME LAST_GLU LAST_TIME TARGET
1 12.483333 0 0.000000 8 12.48333 1 12.483333 220 12.516667 118
4 9.233333 8 9.233333 0 0.00000 1 9.233333 272 9.283333 213
0 0.000000 0 0.000000 0 0.00000 -1 0.850000 222 7.883333 71
0 0.000000 0 0.000000 0 0.00000 -1 6.100000 222 13.133333 193
4 1.883333 4 1.883333 0 0.00000 1 1.733333 70 5.383333 134
4 5.300000 4 5.300000 0 0.00000 1 5.150000 70 8.800000 281

On an i5-based laptop with 16 GB of memory, this normally takes about 6 minutes. The data set itself has only 29244 rows, so if this performance is not acceptable, we can further optimize the solution by including native SQL in our R code. For data scientists comfortable with SQL, being able to leverage SQL in conjunction with R and ore.frame objects is a significant advantage of ORE.
Let us review this process. The operation joins each glucose measurement row back to the original data set ('DD') and filters on timestamps within a certain time window. Doing this row by row in an R loop is expensive.
One solution is to use an Oracle SQL query to do the join and then perform the subsequent operations in parallel using ore.groupApply in ORE embedded R execution. The relational database optimizes table join performance, so we can take advantage of that with the following SQL query.

ore.exec("CREATE TABLE DD_GLU_AGG AS
SELECT ID, GLU_NO, GLUCOSE, CODE, VALUE,
EXTRACT(HOUR FROM GLU_TIME - DATETIME) + EXTRACT(MINUTE FROM GLU_TIME - DATETIME)/60 AS TIME_DIFF
FROM (
SELECT ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME,
row_number() OVER ( PARTITION BY ID, GLU_NO, GLUCOSE, GLU_TIME, CODE ORDER BY DATETIME NULLS LAST) AS RANK
FROM (
SELECT a.ID, GLU_NO, GLUCOSE, GLU_TIME,
CASE WHEN b.CODE IN (48, 57, 58, 59, 60, 61, 62, 63, 64) THEN '0' ELSE
(CASE WHEN b.CODE IN (66, 68, 69, 70, 71) THEN '-1' ELSE
(CASE WHEN b.CODE = 67 THEN '1' ELSE b.CODE END) END) END AS CODE,
b.VALUE, b.DATETIME
FROM (
SELECT ID, NO AS GLU_NO, VALUE AS GLUCOSE, DATETIME AS GLU_TIME
FROM DD
WHERE CODE IN (48, 57, 58, 60, 62)
) a
JOIN DD b
ON (GLU_TIME - b.DATETIME) < INTERVAL '14' HOUR
AND (GLU_TIME - b.DATETIME) > INTERVAL '0' HOUR
AND a.ID = b.ID) c
ORDER BY ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME) d
WHERE RANK = 1")
ore.sync(table='DD_GLU_AGG')

This query generates a table that looks like:

ID GLU_NO GLUCOSE CODE VALUE TIME_DIFF
1 105 282 0 183 13.13333
1 105 282 33 10 13.13333
1 105 282 34 14 13.13333
1 105 282 65 0 6.30000
1 107 91 0 282 9.40000
1 107 91 33 7 13.70000

For each (ID, GLU_NO) pair, the table keeps the latest occurrence of each CODE within the window, together with its time difference. Next, we can use ore.groupApply to do the rest of the job in parallel.

form.row <- function(DD_GLU_AGG) {
  # latest dose of each insulin type within the window
  C33 <- subset(DD_GLU_AGG, CODE == 33)
  C34 <- subset(DD_GLU_AGG, CODE == 34)
  C35 <- subset(DD_GLU_AGG, CODE == 35)
  # event codes were mapped to 1/-1 in the SQL query
  events.df <- subset(DD_GLU_AGG, CODE %in% c(1, -1))
  events.df$VALUE <- ifelse(events.df$CODE == -1, -1, 1)
  last.glu <- subset(DD_GLU_AGG, CODE == 0)
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF, C35$VALUE, C35$TIME_DIFF,
                   events.df$VALUE, events.df$TIME_DIFF, last.glu$VALUE, last.glu$TIME_DIFF,
                   DD_GLU_AGG$GLUCOSE[1])
  # replace missing features with 0
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x) })
  new.df <- data.frame(row.result)
  # C33 is the regular insulin dose; T33 is the time between that injection and the glucose measurement
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35', 'LAST_EVENT',
                 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  colnames(new.df) <- col.names
  new.df[1,]
}

Train <- ore.groupApply(DD_GLU_AGG,                       # ore.frame proxy for the database table
                        DD_GLU_AGG[, c("ID", "GLU_NO")],  # columns used to partition the data
                        form.row,
                        FUN.VALUE = data.frame(C33 = integer(0),  # define the resulting table structure
                                               T33 = numeric(0),
                                               C34 = integer(0),
                                               T34 = numeric(0),
                                               C35 = integer(0),
                                               T35 = numeric(0),
                                               LAST_EVENT = integer(0),
                                               EVENT_TIME = numeric(0),
                                               LAST_GLU = numeric(0),
                                               LAST_TIME = numeric(0),
                                               TARGET = numeric(0)),
                        parallel = 4)
# materialize the result as a local data frame
train.df <- ore.pull(Train)

Using this approach, the entire process takes around 1 minute, a 6x performance improvement. The ORE framework lets us run in-database queries and R analytics without moving the data off the database server.
With the data prepared, we can build regression models. For simplicity, we treat all data points as homogeneous; that is, we assume all patients respond to insulin in the same way. A regression model is fit to this data using ORE's parallel implementation of lm, ore.lm.

model.formula <- formula(TARGET ~ C33 + T33 + C34 + T34 + C35 + T35 + LAST_EVENT + EVENT_TIME + LAST_GLU + LAST_TIME)
model.lm <- ore.lm(model.formula, TRAIN)
summary(model.lm)
Call:
ore.lm(formula = model.formula, data = TRAIN)
Residuals:
Min 1Q Median 3Q Max
-488.69 -59.35 -7.54 46.96 370.25
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.316609 2.770239 47.764 < 2e-16 ***
C33 1.699390 0.143044 11.880 < 2e-16 ***
T33 2.339259 0.244046 9.585 < 2e-16 ***
C34 -0.384308 0.068708 -5.593 2.29e-08 ***
T34 -0.171930 0.102842 -1.672 0.09460 .
C35 0.278242 0.122206 2.277 0.02282 *
T35 -0.285301 0.114634 -2.489 0.01283 *
LAST_EVENT -0.150663 2.852225 -0.053 0.95787
EVENT_TIME -0.206862 0.302349 -0.684 0.49388
LAST_GLU 0.003885 0.009028 0.430 0.66696
LAST_TIME 0.580310 0.182167 3.186 0.00145 **

Residual standard error: 77.27 on 9952 degrees of freedom
Multiple R-squared: 0.04094, Adjusted R-squared: 0.03998
F-statistic: 42.48 on 10 and 9952 DF, p-value: < 2.2e-16

We can see that most of the features have a significant impact on the response, but the R-squared value is low. This is because we pooled all patients' data, and the variance of glucose level under the same conditions can be high.
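As a quick sanity check on the fit, we can score the model in the database and compute the in-sample root mean squared error. This is a minimal sketch, not part of the original analysis; it assumes the predict method for ore.lm models and the overloaded arithmetic on ore vectors.

pred <- predict(model.lm, newdata = TRAIN)     # in-database scoring; returns an ore vector
resid.sq <- ore.pull((TRAIN$TARGET - pred)^2)  # pull the squared residuals to the client
sqrt(mean(resid.sq))                           # in-sample RMSE, comparable to the residual standard error above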
Moreover, we can fit a decision tree to see the effect of insulin dose and injection time on the glucose level. For better visualization, we use the conditional inference tree from the party package.

train.df <- ore.pull(TRAIN)
library(party)
model.ct <- ctree(model.formula,
data=train.df)
plot(model.ct, main="Conditional Inference Tree for GLU")

[Figure: conditional inference tree for glucose level, with boxplots of glucose at each leaf]

The plot provides boxplots of glucose level in the different partitions. As expected, larger doses of UltraLente insulin (C35) are associated with lower glucose levels, and similarly for NPH insulin (C34). However, the partition for regular insulin (C33) indicates that patients with higher doses of regular insulin tend to have higher glucose levels, which seems paradoxical. One explanation is that patients with more severe symptoms tend to take more regular insulin. This reminds us that the data only provide evidence of correlation, not causality.

Statistical Test of Hypoglycemic Symptoms

One of the symptoms recorded in the data set is the hypoglycemic symptom, which is supposed to occur when the patient's glucose level is too low. A plot of one patient's glucose levels (blue dots), with the occurrences of hypoglycemic symptoms marked by red vertical lines, shows that most of the symptoms coincide with low glucose levels.
To verify this statistically, we can run a t-test to check whether there is a significant difference in glucose level between the two groups.
Here is the code for this analysis. The code goes through each hypoglycemic event, finds the nearest glucose measurement, and then compares the glucose levels associated with hypoglycemic events against those that are not.

rm(list=ls())
library(ORE)
options(ore.warn.order=FALSE)
ore.connect(…)
ore.ls()
DD.hypo <- subset(DD, CODE==65)
rownames(DD.hypo) <- DD.hypo$NO
rownames(DD) <- DD$NO
row.ahead <- function(row, dd.df) {
  # rows for this patient within 15 minutes on either side of the symptom;
  # compute time differences explicitly in seconds to avoid difftime unit ambiguity
  range.df <- subset(dd.df,
                     ID == row$ID &
                     as.numeric(difftime(row$DATETIME, DATETIME, units="secs")) < 0.25*3600 &
                     as.numeric(difftime(row$DATETIME, DATETIME, units="secs")) > -0.25*3600)
  glu.df <- subset(range.df, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
  glu.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, glu.df$DATETIME, units="hours"))
  events.df <- subset(range.df, CODE %in% c(66, 67, 68, 69, 70, 71))
  events.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, events.df$DATETIME, units="hours"))
  events.df <- events.df[which.max(events.df$DATETIME), ]
  events.df$LAST_EVENT <- ifelse(events.df$CODE %in% c(67), 1, -1)
  col.names <- c('LAST_EVENT', 'EVENT_TIME', 'LAST_NO', 'LAST_GLU', 'MEASURE_TIME', 'HYPO_TIME')
  if(nrow(glu.df) == 0) {
    empty.df <- data.frame(as.list(rep(0, length(col.names))))
    colnames(empty.df) <- col.names
    return(empty.df)
  }
  # glucose measurement closest in time to the symptom
  last.glu <- glu.df[which.min(abs(glu.df$TIME_DIFF)), ]
  row.list <- list(ifelse(nrow(events.df)==0, 0, events.df$LAST_EVENT),
                   ifelse(nrow(events.df)==0, 0, events.df$TIME_DIFF),
                   ifelse(nrow(last.glu)==0, 0, last.glu$NO),
                   ifelse(nrow(last.glu)==0, 0, last.glu$VALUE),
                   ifelse(nrow(last.glu)==0, 0, last.glu$TIME_DIFF),
                   row$DATETIME)
  row.result <- lapply(row.list, function(x) { ifelse(is.na(x), 0, x) })
  new.df <- data.frame(row.result)  # use the NA-cleaned list, not row.list
  colnames(new.df) <- col.names
  new.df[1,]
}
combine.rows <- function(DD.hypo, dd.df, row.ahead) {
  N.glu <- nrow(DD.hypo)
  row <- DD.hypo[1,]
  hypo.df <- row.ahead(row, dd.df)
  for(i in 2:N.glu) {
    row <- DD.hypo[i,]
    new.row <- row.ahead(row, dd.df)
    if(all(new.row[1,]==0) != TRUE)  # skip symptoms with no nearby measurement
      hypo.df <- rbind(hypo.df, new.row)
  }
  hypo.df <- hypo.df[-1,]            # drop the seed row used to initialize the frame
  ore.drop(table='HYPO')
  ore.create(hypo.df, table='HYPO')
  TRUE
}
res <- ore.tableApply(DD.hypo,
combine.rows,
dd.df = DD,
row.ahead = row.ahead,
ore.connect = TRUE)
ore.sync(table ='HYPO')
hypo.df <- ore.pull(HYPO)
DD.glu <- subset(DD, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
dd.glu.df <- ore.pull(DD.glu)
boxplot(hypo.df$LAST_GLU)
# flag glucose measurements that were matched to a hypoglycemic event
dd.glu.df$HYPO <- ifelse(dd.glu.df$NO %in% hypo.df$LAST_NO, 1, 0)

library(ggplot2)
dd.glu.df$HYPO <- as.factor(dd.glu.df$HYPO)
p <- ggplot(dd.glu.df, aes(HYPO, VALUE))
p + geom_boxplot()

[Figure: boxplots of glucose level for measurements with (1) and without (0) an associated hypoglycemic event]

t.test(dd.glu.df[dd.glu.df$HYPO==1,]$VALUE, dd.glu.df[dd.glu.df$HYPO==0,]$VALUE)
        Welch Two Sample t-test
data: dd.glu.df[dd.glu.df$HYPO == 1, ]$VALUE and dd.glu.df[dd.glu.df$HYPO == 0, ]$VALUE
t = -35.265, df = 251.52, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -106.35820  -95.10693
sample estimates:
mean of x mean of y
 61.18304 161.91560

The p-value is far below 0.05, so we reject the null hypothesis that the two groups have the same mean glucose level. This provides statistical evidence of the relationship between low glucose levels and hypoglycemic symptoms.
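
Since glucose measurements are skewed and the hypoglycemic group is small, a non-parametric test is a reasonable robustness check. The following line is our own addition, not part of the original analysis:

wilcox.test(dd.glu.df[dd.glu.df$HYPO==1,]$VALUE,
            dd.glu.df[dd.glu.df$HYPO==0,]$VALUE)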

Conclusion

In this blog, we demonstrated the data wrangling and analysis capabilities of R and Oracle R Enterprise on a diabetes data set. A workable data set was created from the raw files. From it, clustering and decision-tree analyses, together with visualization, provided insights into the data that can be useful for evaluating the effect of treatments for diabetes patients.
