

_hellmaus_ in statisticians

Help needed!

As you can see, the data points on this scatterplot fall into two groups: one along a horizontal line with Y values between 4 and 8, and the other along a slanted line reaching Y=20.
 - Could you suggest any statistical criteria to confirm or reject that we really have two datasets here instead of one?
 - Are there any methods of separating these points into two datasets that are better than "by eye"?
 - Are any of the aforementioned methods implemented in StatSoft Statistica 8?

My best idea so far is to test the distribution of the dependent variable (Y) for the entire dataset and show that it is not normal, then test the two parts of the dataset and show that they are normally distributed.
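A rough sketch of that idea, for illustration only. It swaps in the Jarque-Bera test (based on skewness and kurtosis) because it is easy to write without any libraries; Shapiro-Wilk or another normality test would follow the same pattern. The Y values below are simulated, hypothetical numbers, not the data from the plot. Keep in mind that failing to reject normality for the two parts is not proof that they are normal, and the p-value is asymptotic, so it is only approximate for small samples.

```python
import math
import random

def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for H0: the sample is normal."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square survival function with 2 df
    return jb, p_value

# Hypothetical Y values: a low group and a high group, pooled into one sample.
random.seed(0)
low = [random.gauss(6.0, 1.0) for _ in range(60)]
high = [random.gauss(15.0, 1.5) for _ in range(40)]

for name, sample in (("whole", low + high), ("low part", low), ("high part", high)):
    jb, p = jarque_bera(sample)
    print(f"{name}: JB = {jb:.2f}, p = {p:.4g}")
```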


honestly? i *can't* see it in this pic. i'd suggest you go with different colors for the different data sets.

i have yet to come across a decent test for bimodality (i kind of made one up at one point, but it was a lot of math :) ). i suggest sorting your data into your circles and your squares, picking what you think is the minimum between them, and using those two axes to do a simple chi-square.
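One way to read that suggestion is as a 2x2 table: marker type (circle or square) against which side of the chosen minimum a point falls on. A minimal sketch of the chi-square for such a table follows; the counts are made up for illustration and the function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function with 1 df
    return chi2, p_value

# Hypothetical counts: rows = marker type (circle, square),
# columns = below / above the cut chosen as the minimum.
chi2, p = chi_square_2x2(30, 5, 8, 27)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```

If the cut is picked by looking at the same data, the resulting p-value will tend to look better than it should.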

good luck!
All these points were obtained in one experiment, so technically they are one dataset. However, a bimodal distribution is obvious, at least in the right part of the diagram. I want to test the significance of this bimodality.

How do I pick a minimum between these groups better than "by eye"? What kind of sorting criteria should I use?
i repeat: i know of no good tests for actual bimodality.

in terms of picking the minimum, **look** at your graph, and then write that this was the local minimum. it's just like picking the mode.

the test i made up involved calculating all permutations (your actual data, your data with two points switched in sq vs circleness, 4 points switched, 6 points switched, &c. until all switches have been made) and then seeing the probability of getting at least this many squares on one side of the minimum as opposed to the other. i do not recommend this. i think a chi-square is much more comprehensible and will give you a better sense of your data.
lyonesse has already given you a good answer, but to get some more I recommend asking this question at http://stats.stackexchange.com/.

I think this is a classical cluster analysis problem, isn't it?
And yes, I have a statistical criterion to confirm/reject in R^N. It's from an article in one of the Russian statistics journals; I'm not sure it's any use to you if you don't know Russian. However, I can explain the statistic if you like.
I know Russian better than English ;)
Oh, great. Where should I send the article?
Should have it in your mail.
You might want to check out the literature on mixture models (https://secure.wikimedia.org/wikipedia/en/wiki/Mixture_model). But the real question is: why do you care whether there are 2 subpopulations within your dataset? I assume you have no other covariates that may interact with "Sampling" to produce different 75apicomplexa values?
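For a feel of what a mixture model does here, below is a minimal sketch that fits a two-component Gaussian mixture to Y values alone by EM and compares it with a single Gaussian using BIC (lower is better). Everything in it is an assumption for illustration: the simulated data, the function names, the crude starting values, and the choice of BIC as the comparison. It ignores the X variable entirely.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def single_gaussian_loglik(ys):
    n = len(ys)
    mu = sum(ys) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in ys) / n)
    return sum(math.log(normal_pdf(y, mu, sigma)) for y in ys)

def fit_two_gaussians(ys, n_iter=200, min_sigma=1e-3):
    """EM for a two-component 1-D Gaussian mixture.
    Returns (weights, means, sigmas, log_likelihood)."""
    ys = sorted(ys)
    n = len(ys)
    if n < 4:
        raise ValueError("need at least 4 observations")
    half = n // 2
    # Crude start: means of the lower and upper halves of the sorted data.
    means = [sum(ys[:half]) / half, sum(ys[half:]) / (n - half)]
    start_sigma = max((ys[-1] - ys[0]) / 4.0, min_sigma)
    sigmas = [start_sigma, start_sigma]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each point.
        resp0 = []
        for y in ys:
            p0 = weights[0] * normal_pdf(y, means[0], sigmas[0])
            p1 = weights[1] * normal_pdf(y, means[1], sigmas[1])
            total = p0 + p1
            resp0.append(p0 / total if total > 0 else 0.5)
        # M-step: weighted means, standard deviations and mixing weights.
        for k, resp in enumerate((resp0, [1.0 - r for r in resp0])):
            mass = max(sum(resp), 1e-12)
            means[k] = sum(r * y for r, y in zip(resp, ys)) / mass
            var = sum(r * (y - means[k]) ** 2 for r, y in zip(resp, ys)) / mass
            sigmas[k] = max(math.sqrt(var), min_sigma)
            weights[k] = mass / n
    loglik = sum(
        math.log(max(weights[0] * normal_pdf(y, means[0], sigmas[0])
                     + weights[1] * normal_pdf(y, means[1], sigmas[1]), 1e-300))
        for y in ys
    )
    return weights, means, sigmas, loglik

def bic(loglik, n_params, n):
    return n_params * math.log(n) - 2.0 * loglik

# Hypothetical Y values drawn from two overlapping-free groups.
random.seed(0)
ys = [random.gauss(6.0, 1.0) for _ in range(60)] + [random.gauss(15.0, 1.5) for _ in range(40)]

weights, means, sigmas, loglik_two = fit_two_gaussians(ys)
bic_one = bic(single_gaussian_loglik(ys), 2, len(ys))  # mean and sigma
bic_two = bic(loglik_two, 5, len(ys))                  # 2 means, 2 sigmas, 1 weight
print("component means:", [round(m, 2) for m in means])
print("BIC, one Gaussian:", round(bic_one, 1), " BIC, two Gaussians:", round(bic_two, 1))
```

EM only finds a local optimum, so the result can depend on the starting values.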
Unfortunately, I have 80 other potential covariates...
80?!? That's one heck of an experimental design you've got there. If you have 80 covariates, I sure as heck wouldn't spend time trying to partition your 75apicomplexa data into 2 populations on the 75apicomplexa data alone. You still haven't told us the question you're actually trying to answer. You might want to check out classification and regression trees (CARTs). They're quite good at partitioning a single dependent variable based on (far too many) covariates.
Can't you do an analysis of covariance?

It looks like the 'top' set may have a different slope. Create a factor variable for whatever you think it is that is causing the two sets.

Then fit these regression models using least squares:


(same intercept, different slopes)



(different intercepts, different slopes)
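A minimal sketch of the two models named above, under these assumptions: x is the predictor, g is a 0/1 factor marking the 'top' set, "same intercept, different slopes" means y = b0 + b1*x + b2*(g*x), and "different intercepts, different slopes" means y = b0 + b1*x + b2*g + b3*(g*x). The data are simulated and the helper names are hypothetical.

```python
import random

def least_squares(rows, ys):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

def rss(rows, ys, coefs):
    """Residual sum of squares of the fitted model."""
    return sum((y - sum(b * v for b, v in zip(coefs, r))) ** 2 for r, y in zip(rows, ys))

# Hypothetical data: g = 0 is a flat group near Y = 6, g = 1 is a slanted group.
random.seed(1)
data = []
for x in range(1, 31):
    data.append((x, 0, 6.0 + random.gauss(0, 0.5)))
    data.append((x, 1, 2.0 + 0.5 * x + random.gauss(0, 0.5)))

ys = [y for _, _, y in data]
same_intercept = [[1.0, x, g * x] for x, g, _ in data]        # b0 + b1*x + b2*g*x
different_intercepts = [[1.0, x, g, g * x] for x, g, _ in data]  # b0 + b1*x + b2*g + b3*g*x

for name, rows in (("same intercept", same_intercept),
                   ("different intercepts", different_intercepts)):
    coefs = least_squares(rows, ys)
    print(name, [round(b, 3) for b in coefs], "RSS =", round(rss(rows, ys, coefs), 2))
```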

It's important to note, though, that you've got to identify the factor you think made them different. You can't just say, 'these look different, so I'm going to give them a different factor'. If you have no clue as to what made them different, you can definitely go with k-means clustering, or my personal favorite, fuzzy clustering, which gives a 'possibility' of each datapoint being included in a group.
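To show what those per-point 'possibilities' look like, here is a minimal sketch of fuzzy c-means, one common fuzzy clustering algorithm (not necessarily the one any particular package implements). The points, the function names and the farthest-point starting rule are all assumptions made for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100):
    """Return (centers, memberships); memberships[k][i] is the degree to which
    point k belongs to cluster i, and each row sums to 1."""
    dim = len(points[0])
    # Deterministic start: first point, then repeatedly the point farthest
    # from the centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < n_clusters:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_i is proportional to d_i^(-2/(m-1)),
        # written here in terms of squared distances.
        memberships = []
        for p in points:
            d2 = [max(sq_dist(p, c), 1e-12) for c in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of the points weighted by membership^m.
        for i in range(n_clusters):
            w = [u[i] ** m for u in memberships]
            centers[i] = [sum(wk * p[j] for wk, p in zip(w, points)) / sum(w)
                          for j in range(dim)]
    return centers, memberships

# Hypothetical (x, y) points: a flat group and a higher group.
points = [(20, 5), (25, 6), (30, 7), (35, 5), (25, 14), (30, 16), (35, 19), (33, 18)]
centers, memberships = fuzzy_c_means(points)
for p, u in zip(points, memberships):
    print(p, [round(v, 2) for v in u])
```

The outcome depends on the starting centers and on the fuzziness exponent m, so it is worth rerunning with different starts.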
K-means clustering. Fuzzy clustering. Thank you very much, I'll google it.
Identification of the factors that made the two groups different is planned; I think multiple regression analysis should be enough.
They've been around for years... R has superb writeups about what they are and how to use them. It used to be that the package was called 'fanny'; I have no idea what it is now.
Can you recommend some good reading on fuzzy clustering? I'm interested in modern algorithms for it and in their stability/convergence depending on starting points, etc.
I found this book superb:

It's a damn shame it's so expensive; when I got it in '03, it wasn't nearly that much.

November 2011
