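OK, this has me very confused and worried.
As part of a routine, I have been classifying the individual observations of a variable as TRUE or FALSE based on whether their value is above, or below/equal to, the median. However, R is giving me behaviour that I find largely unexpected for such a simple test.
So take this set of observations: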
data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)
To classify these values, I did:
data_med=median(data)
quant_data=data
quant_data[quant_data>data_med]="High"
quant_data[quant_data<=data_med]="Low"
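I know there are countless ways to do this more efficiently, but what worries me is that the resulting output makes no sense. Since there are no NaNs in the set and the test is exhaustive (> or <=), I should end up with a vector containing only the two class labels ("High" and "Low", standing in for TRUE/FALSE), but instead I get: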
[1] "High" "High" "High" "High" "High" "High" "High" "High" "Low" "High" "Low" "High" "Low" "Low" "Low" "Low" "1e-04"
[18] "Low" "High" "High" "High" "Low" "Low" "Low" "High" "Low" "Low" "Low" "1e-04" "Low" "High" "Low" "Low" "High"
[35] "High" "Low" "High" "High" "High" "High" "High" "High" "Low" "Low" "Low" "High" "High" "Low" "Low" "1e-04" "Low"
[52] "1e-04" "Low" "Low" "High" "Low" "Low" "Low" "Low" "Low" "High" "High" "High" "High" "High" "Low" "Low" "Low"
[69] "1e-04" "High" "High" "High" "High"
See the "1e-04"? Stranger still, take element 69, which is one of those that came back with the odd value:
data[69]
[1] 1e-04
If I test this value on its own, I get the result I expect:
data[69]<=data_med
[1] TRUE
Can anyone explain this behaviour? It just seems very dangerous...
Answer:
Let's walk through what you did here.
data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)
data_med=median(data) ## 0.5
quant_data=data ## irrelevant
quant_data[quant_data>data_med]="High"
But by doing this you have converted quant_data into a character vector:
str(quant_data)
## chr [1:73] "High" "High" "High" "High" "High" "High" "High" ...
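The coercion is easy to reproduce on a smaller vector. Here is a minimal sketch; the three-element vector x is made up for illustration and is not part of the original data:

```r
# Hypothetical toy vector, chosen only to illustrate the coercion.
x <- c(0.0001, 0.25, 0.75)
class(x)              # "numeric"

# Assigning a string to any element converts the WHOLE vector to character,
# because an atomic vector can hold only one type.
x[x > 0.5] <- "High"
class(x)              # "character"
x                     # "1e-04" "0.25" "High"
```

Note that 0.0001 is stored as the string "1e-04" after the conversion, which is exactly the value that shows up in the question's output.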
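Comparing these character values with data_med now makes little sense, because data_med is coerced to a character value as well, and the comparison is done on strings:
"High" < "0.5" ## FALSE
"1e-04" <= "0.5" ## FALSE -- this is your problem.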
quant_data[quant_data<=data_med]="Low"
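To see why only the 0.0001 entries slip through, it may help to run the string comparisons one at a time. This is a small sketch in which med is a stand-in for data_med. String ordering depends on the locale, but digits sort before letters and in numeric order in the common ones:

```r
med <- 0.5                        # stands in for data_med
as.character(med)                 # "0.5"

# When one side is a string, the number is converted to a string and the
# two are compared character by character, not by numeric value.
"0.25"  <= med                    # TRUE  ("2" sorts before "5")
"1e-04" <= med                    # FALSE ("1" sorts after "0")
"High"  <= med                    # FALSE (letters sort after digits)

# Converting back to numeric restores the expected answer.
as.numeric("1e-04") <= med        # TRUE
```

So every value whose decimal representation starts with "0." is still labelled correctly, and only the values that R prints in scientific notation are left untouched by both assignments.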
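What you presumably meant to do (and the reason for assigning quant_data=data in the first place) is to index with the original numeric vector: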
quant_data[data>data_med]="High"
quant_data[data<=data_med]="Low"
table(quant_data)
## High Low
## 35 38
As @Arun pointed out in the comments above,
quant_data <- ifelse(data>data_med,"High","Low")
would also work. So would an appropriate use of cut().
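For completeness, one possible way to use cut() is sketched below. The toy vector x and the names med and grp are assumptions made for illustration. Note that cut() returns a factor rather than a character vector:

```r
# Hypothetical toy vector; its median is 0.5.
x <- c(0.0001, 0.25, 0.5, 0.75, 0.9999)
med <- median(x)

# With the default right = TRUE the intervals are (-Inf, med] and (med, Inf],
# so values equal to the median fall into "Low".
grp <- cut(x, breaks = c(-Inf, med, Inf), labels = c("Low", "High"))
as.character(grp)     # "Low" "Low" "Low" "High" "High"
```

Because the comparison is always made against the numeric x, this approach, like ifelse(), never compares strings with numbers.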