The following is a post I originally wrote for Zillabyte, and was first published on their blog. The imghist tool I made to create these images is open source - free for anyone to use or modify.

Great data analysis is beautiful. Data is only useful once we understand it; it is critical to use tools and perspectives that fit the information you begin with and provide the information you want to extract.

Typical Color Histograms

Let's take a look at treating color as data. When working with a digital image, artists or photographers often view its color histogram to be aware of the image's overall brightness, saturation, and primary hues. A histogram is a graph showing the frequency of different items - higher points denote more frequent items. A typical (not beautiful) color histogram looks like this:

There are three overlapping graphs here, one each for red, green, and blue; where they overlap, the graph shows the combined color (for example, red and green light combine to give yellow). The color of every pixel in an image is described by three numbers, called channels, giving the proportions of red, green, and blue light to emit. This graph is created by building one histogram for each color channel. The left end of the graph denotes low light values, corresponding to pixels with very little of that color, while the right end denotes high values.

In the example histogram, there's a blue spike near the left side. This means the image in question has a large number of pixels with low blue values.
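To make the per-channel idea concrete, here is a minimal sketch in Python. The function name `channel_histograms` and the tiny four-pixel "image" are made up for illustration; a real tool would read the pixels from an image file.

```python
def channel_histograms(pixels, bins=256):
    """Return three lists of counts, one each for red, green, and blue.

    `pixels` is an iterable of (r, g, b) tuples with integer values 0..255.
    """
    hists = [[0] * bins for _ in range(3)]
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map the 0..255 value onto one of `bins` buckets.
            hists[channel][value * bins // 256] += 1
    return hists


# Hypothetical image: three reddish pixels with low blue values, one yellow.
pixels = [(200, 30, 10), (180, 40, 10), (190, 35, 12), (255, 255, 0)]
red, green, blue = channel_histograms(pixels)

# All of the blue counts land near the left (low) end of the blue histogram.
print(blue[:16])
```

With this input every blue value is 12 or lower, so the blue histogram has its counts bunched at the left end, which is the same kind of spike described above.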

New Perspective: Color Pies

A big trouble spot with typical color histograms is that they ignore correlations between the color channels. You get three separate buckets, one per color, without any idea whether the high green values sit in the same pixels as the high red values (giving yellow) or in different pixels (giving separate green and red areas).
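A small experiment shows the problem. The two four-pixel images below are invented for illustration: one is half yellow and half black, the other half red and half green. Counting each channel separately, as a typical histogram does, cannot tell them apart.

```python
from collections import Counter


def channel_counts(pixels):
    """Count values separately for each channel, as a typical histogram does."""
    return [Counter(p[c] for p in pixels) for c in range(3)]


# Two hypothetical 4-pixel images that look nothing alike.
yellow_and_black = [(255, 255, 0), (255, 255, 0), (0, 0, 0), (0, 0, 0)]
red_and_green = [(255, 0, 0), (255, 0, 0), (0, 255, 0), (0, 255, 0)]

same = channel_counts(yellow_and_black) == channel_counts(red_and_green)
print(same)  # True: the per-channel counts are identical
```

Both images have two pixels with full red, two with full green, and no blue at all, so their per-channel histograms match exactly even though the colors on screen are different.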

Let's do something about that.

We'll use an analysis tool called k-means clustering to find a very small representative set of colors from any image, which I'll call a color pie. K-means clustering works by choosing a small set of sample colors, clustering the image's pixels around the nearest sample color, and then adjusting each sample color to be in the middle of its cluster; these two steps repeat until the sample colors stop moving. The cool thing about this algorithm is that it often converges quickly, meaning we end up with sample colors that represent the pixels well, although the result depends on the starting colors and is not guaranteed to be the best possible set. (And this technique applies to a lot more data than just pixels!)
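Here is a rough sketch of those steps in plain Python. It is not the imghist implementation: the function name `kmeans_colors`, the choice of evenly spaced starting pixels, and the use of straight-line distance in RGB space are all assumptions made to keep the example short and repeatable.

```python
def kmeans_colors(pixels, k, iterations=20):
    """Return k representative colors and the size of each color's cluster."""
    # Assumption: start from k pixels spread evenly through the list.
    centers = [pixels[i * len(pixels) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # Put the pixel in the cluster of its nearest sample color
            # (squared straight-line distance in RGB space).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Move each sample color to the middle (mean) of its cluster;
        # an empty cluster keeps its old sample color.
        new_centers = [
            tuple(sum(ch) / len(c) for ch in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # nothing moved, so we have converged
        centers = new_centers
    # Sizes come from the last assignment step.
    return centers, [len(c) for c in clusters]


# Hypothetical image: three reddish pixels followed by three bluish ones.
pixels = [(250, 0, 0), (255, 10, 5), (245, 5, 0),
          (0, 0, 250), (5, 10, 255), (0, 5, 245)]
centers, sizes = kmeans_colors(pixels, k=2)
print(centers, sizes)
```

The cluster sizes could then set how large each color's slice of the pie is drawn, assuming the pie is meant to show proportions.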

Here are some example images along with their k-means color pies:

Hue Histograms

Color pies are great for visualizing the color theme of a complex image at a glance. However, they throw away a lot of information. Let's build a visualization that keeps most of the critical color-frequency information and is still easy to read.

Our graph will be organized so that each hue gets its own place along the horizontal axis. This is easier to understand with an example:

Every pixel in the image has a representative portion in the hue histogram, and vice versa. Every channel is taken into account, and the per-pixel channels are kept together in the histogram, addressing the weakness of a typical color histogram. This is achieved by giving each vertical stripe the hue that corresponds to its horizontal position (red at left, then greens, blues, etc., until we get back to red on the right), and then setting the stripe's saturation and lightness to the average of those values over all the pixels in the image with that particular hue. The result is a histogram that very obviously matches the image.
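The stripe calculation can be sketched with Python's standard `colorsys` module. This is a simplified guess at the idea, not the imghist code: the function name `hue_histogram`, the default of 12 hue bins, and the use of the pixel count as the stripe height are assumptions.

```python
import colorsys


def hue_histogram(pixels, bins=12):
    """Return one (count, stripe_rgb) pair per hue bin, starting at red."""
    counts = [0] * bins
    light = [0.0] * bins
    sat = [0.0] * bins
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        # Bins are centered on evenly spaced hues, so pure red falls in bin 0.
        i = int(h * bins + 0.5) % bins
        counts[i] += 1
        light[i] += l
        sat[i] += s
    stripes = []
    for i in range(bins):
        if counts[i] == 0:
            stripes.append((0, None))  # no pixels with this hue
            continue
        # The hue comes from the stripe's position; lightness and saturation
        # are averaged over the pixels that fell into this bin.
        rgb = colorsys.hls_to_rgb(i / bins, light[i] / counts[i], sat[i] / counts[i])
        stripes.append((counts[i], tuple(round(c * 255) for c in rgb)))
    return stripes


# Hypothetical image: two pure red pixels and one pure blue pixel.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255)]
stripes = hue_histogram(pixels)
print(stripes[0], stripes[8])
```

Each pair gives the number of pixels in that hue bin (assumed here to set the stripe's height) and the color to draw the stripe in, so the red and blue stay attached to the pixels that actually contain them.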

Scroll back up to the typical color histogram to compare. Which would you rather work with? Great data analysis is our passion at Zillabyte - inventing these two new ways to visualize your image data is just one small step toward our vision. What data visualizations would you like to see reinvented?

(Photo Credits: 1st row: 1 2 3 ; 2nd row: 1 2 3 ; Beach image )

Tyler Neylon