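OK, this has me very confused and worried.
As part of a routine, I have been classifying the individual observations of a variable as TRUE/FALSE based on whether their value is above, or below/equal to, the median. However, I am getting behavior in R that is quite unexpected for such a simple test.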

So take this set of observations:

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166) 

To classify the values, I did:
data_med=median(data) 
quant_data=data 
quant_data[quant_data>data_med]="High" 
quant_data[quant_data<=data_med]="Low" 

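I know there are countless more efficient ways to do this, but what worries me is that the resulting output makes no sense. Since there are no NaNs in the set and the test is all-inclusive (> and <=), I should end up with nothing but a list of TRUE/FALSE values (labelled "High"/"Low" here), but I get: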
[1] "High"  "High"  "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "High"  "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "1e-04" 
[18] "Low"   "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "1e-04" "Low"   "High"  "Low"   "Low"   "High"  
[35] "High"  "Low"   "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "High"  "Low"   "Low"   "1e-04" "Low"   
[52] "1e-04" "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "Low"   "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"   
[69] "1e-04" "High"  "High"  "High"  "High"   

See the "1e-04"? Even stranger, let's pick value 69, which is one of those that came back with the odd value:
data[69] 
>1e-04 

If I test this value on its own, I get the result I expect:
data[69]<=data_med 
TRUE 

Can someone explain this behavior? It just seems very dangerous...

Please refer to the following approach:

Let's walk through what you did here.

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166) 
 
 
 
data_med=median(data)  ## 0.5 
quant_data=data        ## irrelevant 
quant_data[quant_data>data_med]="High" 

But by doing this, you have converted quant_data into a character vector:
str(quant_data) 
##  chr [1:73] "High" "High" "High" "High" "High" "High" "High" ... 
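The coercion is easy to reproduce on a small vector. This is only a minimal sketch: the values in `x` are made up for illustration and are not taken from the data above.

```r
# Hypothetical toy vector; 0.0001 is included because it prints in scientific notation
x <- c(0.75, 0.25, 0.0001)
y <- x

# Assigning a string into a numeric vector coerces the WHOLE vector to character
y[y > 0.5] <- "High"

class(y)   # "character"
y          # "High" "0.25" "1e-04"
```

Note that the untouched elements are no longer numbers either: 0.0001 is now the string "1e-04".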

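Comparing these character values with data_med now makes little sense, because data_med is also coerced to a character value and the comparison is done on strings rather than numbers. The value 0.0001 is stored as "1e-04", and since "1" sorts after "0" it never satisfies the second test:
"High" <= "0.5"   ## FALSE 
"1e-04" <= "0.5"  ## FALSE -- this is your problem. 
quant_data[quant_data<=data_med]="Low" 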
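To see why most of the low values still ended up as "Low" while 0.0001 did not, here is a small sketch of the string comparison. The variable names `med`, `ok_by_luck`, `missed`, and `numeric_ok` are made up for this example.

```r
med <- 0.5

# When one side is character, the number is converted to a string first
med_chr <- as.character(med)           # "0.5"

# Strings are compared character by character, not by numeric value
ok_by_luck <- "0.25"  <= med_chr       # TRUE: "2" sorts before "5"
missed     <- "1e-04" <= med_chr       # FALSE: "1" sorts after "0"

# Converting back to numeric restores the comparison you expected
numeric_ok <- as.numeric("1e-04") <= med   # TRUE
```

So the other small values were labelled correctly only because their decimal strings happen to sort the same way as the numbers do.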

What you presumably meant to do (and the reason for assigning quant_data=data in the first place) is:
quant_data[data>data_med]="High" 
quant_data[data<=data_med]="Low" 
table(quant_data) 
## High  Low  
##   35   38  

As @Arun pointed out in the comments above, quant_data <- ifelse(data>data_med,"High","Low") would also work. So would using cut() properly.
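A minimal sketch of both alternatives, using a made-up toy vector `x` rather than the full data set. The `breaks` and `labels` passed to cut() are one possible choice on my part, not something from the comment.

```r
x <- c(0.75, 0.25, 0.0001, 0.5, 0.9)   # hypothetical toy data
med <- median(x)                        # 0.5

# ifelse() builds a new character vector from a purely numeric comparison
lab_ifelse <- ifelse(x > med, "High", "Low")

# cut() bins on numeric breaks; right = TRUE gives intervals (a, b],
# so a value equal to the median falls into "Low"
lab_cut <- cut(x, breaks = c(-Inf, med, Inf),
               labels = c("Low", "High"), right = TRUE)

lab_ifelse              # "High" "Low" "Low" "Low" "High"
as.character(lab_cut)   # same labels; cut() itself returns a factor
```

Either way the comparison is always made against the original numeric values, so the coercion problem never comes up.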

