Normalising a Histogram

gerry87 · 13-05-2009 02:20PM #1

Hi, wondering if anybody can help shed some light on something.

I've got a dataset with 130,000+ observations of 5-minute stock returns. I've taken the log return of these, ln(x_t / x_{t-1}), and normalised it by expressing each observation as a number of standard deviations from the mean. The data now ranges from roughly -21 to 25.
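For anyone following along outside Excel, the preparation step described here could be sketched roughly as below. The function name and the price list are made up for illustration, and using the population standard deviation (rather than the sample one) is an assumption, since the post doesn't say which was used.

```python
import math
import statistics

def standardised_log_returns(prices):
    """Return log returns ln(x_t / x_{t-1}) expressed as z-scores."""
    # Log return between each consecutive pair of prices
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    mean = statistics.fmean(log_returns)
    # Assumption: population SD; statistics.stdev (sample SD) is the alternative
    sd = statistics.pstdev(log_returns)
    # Number of standard deviations each return sits from the mean
    return [(r - mean) / sd for r in log_returns]

# Hypothetical prices, for illustration only
prices = [100.0, 100.5, 99.8, 100.2, 101.0, 100.7]
z = standardised_log_returns(prices)
```

After this step the values have mean 0 and standard deviation 1, but the shape of the distribution is unchanged.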

We've to make a histogram from this, and then fit a distribution function to the histogram. To make the histogram we took the max figure (25) minus the min figure (-21) and divided by the number of bins (1000). So say the bin width is around .046 (we were doing it in excel).

We then put a histogram together and now we're trying to normalise it so that the area under it is 1. The way we thought of doing this was to divide the number of observations in each bin by the total number of observations, so each bin is a % of the total and so sum to 100%.

However when we plotted this against the pdf function, as you can see in the picture, it doesn't seem to match well at all. The area under the blue line looks like it could be 1, but the green one doesn't.

So instead we divided the count in each bin by the number of observations multiplied by the bin width, and it came out much better (not perfect), but we can't work out whether this makes sense to do or not.
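The two scalings being compared can be checked numerically with a small sketch like the one below. It uses simulated normal data as a stand-in for the standardised returns, and the helper name `histogram_counts` and the choice of 50 bins are assumptions for illustration only.

```python
import random

def histogram_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning min..max."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would index one past the end, so clamp it into the last bin
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts, width

random.seed(0)
# Simulated stand-in for the standardised returns (not the real dataset)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
counts, width = histogram_counts(data, 50)
n = len(data)

fractions = [c / n for c in counts]            # first attempt: heights sum to 1
densities = [c / (n * width) for c in counts]  # second attempt: area is 1

area_fractions = sum(f * width for f in fractions)  # equals the bin width
area_densities = sum(d * width for d in densities)  # equals 1
```

Only the second scaling is directly comparable with a pdf curve. As far as I know, this is also what `numpy.histogram` does when called with `density=True`.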

Can anybody clear this up? Did we go wrong somewhere? Thanks.

LeixlipRed · 13-05-2009 04:46PM

It just looks like you're out by a factor of 100 somewhere. Are you perhaps confusing decimals and percentages at some stage? For example, picking a number out of thin air, mistakenly interpreting .0096 as .0096% or something?

gerry87 · 13-05-2009 07:11PM

LeixlipRed wrote: »

It just looks like you're out by a factor of 100 somewhere. Are you perhaps confusing decimals and percentages at some stage? For example, picking a number out of thin air, mistakenly interpreting .0096 as .0096% or something?

It does seem that way; we have 5 datasets and they're all out by roughly the same magnitude. We haven't used percentages anywhere, and all the data is correct up until we put the histogram together. It's likely we're making some small mistakes in constructing the histogram, like things falling into the wrong bins or a couple being left out, but that wouldn't really affect the magnitude.

We're dividing the number in each bin by the number of observations, which should give us the probability, right? Those should sum to 1. Then if we multiply by the bin width, this should give us the area under the histogram, so should this equal 1? Is that right? But the two of those can't really work together, unless we divide (or multiply) by an arbitrary normalising number to force the area equal to 1...

This is really confusing me!

LeixlipRed · 13-05-2009 07:30PM

The area under the histogram wouldn't necessarily be equal to 1.
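To spell out why, here is a short derivation for equal-width bins. The symbols are introduced here and are not from the thread: n_i is the count in bin i, N the total number of observations, K the number of bins and w the bin width.

```latex
\[
\text{Area} \;=\; \sum_{i=1}^{K} \frac{n_i}{N}\, w
\;=\; \frac{w}{N} \sum_{i=1}^{K} n_i
\;=\; \frac{w}{N}\, N
\;=\; w
\]
```

So when the bar heights are the fractions n_i / N, the area equals the bin width, which is about 0.046 for the bins described in the first post. Dividing every height by w gives bars of height n_i / (N w) and a total area of 1.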

gerry87 · 13-05-2009 07:46PM

LeixlipRed wrote: »

The area under the histogram wouldn't necessarily be equal to 1.

OK, I see. And if we wanted to normalise it to make the area equal to 1, so it's essentially an empirical probability density function, how would you do that? Is it possible?

Would we just pick a number to divide by that makes it equal to 1?

gerry87 · 13-05-2009 08:49PM

OK, I think I've got my head around it. If we multiply the number in each bin by the width of the bin, and divide by the sum of all these numbers, it should be a pdf that sums to 1... I think! Cheers

Edit: no... that's the same as the number in the bin divided by the number of observations... disregard...

MathsManiac · 13-05-2009 09:18PM

You said that dividing by the bin width makes it look right, but you couldn't see why. It is indeed correct to do such a division. Here's why:

The y-value in the graph of a pdf does NOT represent the probability, but the probability density (as the name implies). You have used lots of bins, so your histogram is close to an idealised pdf. Each bin behaves almost like an infinitesimal vertical slice of the graph. If the area of the slice is to represent the probability, then the appropriate height of the slice is found by dividing this area by the width of the slice.
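The "thin slice" argument can be illustrated with a quick sketch using Python's `statistics.NormalDist`. The standard normal distribution and the left edge of 1.0 are arbitrary choices for illustration; the width of 0.046 is roughly the bin width mentioned earlier in the thread.

```python
from statistics import NormalDist

dist = NormalDist(0.0, 1.0)  # standard normal, chosen purely for illustration
width = 0.046                # roughly the bin width quoted in the thread
left = 1.0                   # arbitrary left edge of one bin

# Probability of landing in the bin = area of the slice under the pdf
probability = dist.cdf(left + width) - dist.cdf(left)

# Height the bar needs so that height * width equals that probability
height = probability / width

# For a narrow bin this is very close to the pdf at the bin centre
pdf_at_centre = dist.pdf(left + width / 2)
```

The probability itself is a small number (it shrinks as the bins get narrower), while the height stays close to the pdf value regardless of how narrow the bin is.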

Does that make sense to you?

If not, try drawing this: suppose I have an infinite population of numbers, uniformly distributed on the interval 0 to 10. Suppose I'm only interested in whether the number is bigger or smaller than 5. I draw a two-bin histogram. I scale the x-axis from 0 to 10. I draw two rectangles, each of width 5. Each must represent a probability of 0.5. However, if I draw them to a height of 0.5, each has area 2.5 instead of 0.5. To get them to have the right area, I have to divide the probability by the width of the rectangle, so I draw each to a height of 0.1. Then each has area 0.5, and the total area is 1. (So, in this uniform case, the probability density is 0.1 on the entire interval, no matter what bins I choose.)
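The same arithmetic as a small sketch, with a second, unequal binning added (edges 0, 2, 10, which is my own choice and not part of the post) to show that the density does not depend on the bins:

```python
def density_heights(bin_edges, probabilities):
    """Bar heights such that each bar's area equals its bin's probability."""
    widths = [hi - lo for lo, hi in zip(bin_edges, bin_edges[1:])]
    return [p / w for p, w in zip(probabilities, widths)], widths

# Two equal bins on [0, 10], each holding probability 0.5
heights, widths = density_heights([0.0, 5.0, 10.0], [0.5, 0.5])

# Drawing the bars at height 0.5 gives each an area of 2.5, so 5 in total
wrong_area = sum(0.5 * w for w in widths)

# Drawing them at probability / width gives a total area of 1
total_area = sum(h * w for h, w in zip(heights, widths))

# Unequal bins (an added illustration): [0, 2] holds 0.2 and [2, 10] holds 0.8
heights_unequal, widths_unequal = density_heights([0.0, 2.0, 10.0], [0.2, 0.8])
```

Both binnings give a height of about 0.1 for every bar, matching the uniform density on the interval.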

By the way, when you said you "normalised" by expressing each observation as a number of standard deviations above the mean, I would have said "standardised". Sometimes these are used interchangeably, but in many contexts, normalising a data-set (such as a set of test scores) may involve changing its shape as well as its scale. In particular, if the data were not normally distributed in the first place, "normalising" involves making the distribution normal as well as standardising it.

gerry87 · 13-05-2009 09:58PM

Ahh I see, I just couldn't get my head around it. Thank you!
