Dr. Dobb's | Tag Clouds: Usability and Math

Three Functions

A tag cloud (Figure 1) is a collection of tags, which is presented in such a way that the visual emphasis of each tag corresponds to the relative importance within the collection. (Depending on the context, tag clouds can also be called "text clouds," "topic clouds," or "word clouds.")

Tag clouds offer three functions in a single visual construction:

Users can easily search tags because the texts are sorted alphabetically.
Alternatively (and more spontaneously), users can let their navigation depend on any one of several options.
But even without navigation, font sizes carry an informative function in regards to the relative importance of different tags and content of the web site.

[Click image to view at full size]

Figure 1: A tag cloud.

After examining various approaches, I concluded that tag clouds are generally implemented for specific datasets and layouts, and consider alternatives for other types of data and other UI designs. Clearly, what is needed is a more generic approach. While I don't present a complete implementation of a generic solution here, you can still use my examples as a starting point to build something much better than anything my Internet searches have produced to date.

The design of a tag cloud consists of several functional layers. Most implementations I examined treat data access, business logic, layout, and functionality in one piece of code—sometimes even in one class. My proposal treats the individual functional layers individually, one at a time. To me, this makes it easier to implement alternatives within the architectures. My examples are in Visual Basic. NET because its syntax is easy to grasp (especially by myself).

Source Data

As input for a tag cloud, you need a dataset consisting of at least three columns:

Text (to display).
Weight (to determine the font size).
Identifier (something to support navigation.)

The weight in the dataset often represents some frequency—the number of times a text is used as a search term, or the number of items sold of a product. However, the weight is not always an integer value. You can also consider, say, election results consisting of political parties and their percentages, earthquakes and their intensity, or movie stars and their IQ. In fact, there is technically little difference between tag clouds, histograms, line graphs, and pie charts. (I wouldn't be surprised to find the tag cloud as just another type of standard chart in Excel 2010.)

While constructing the source data for a tag cloud, you can impose restrictions on the raw data in the system in three ways:

Similar to other graphs, there is a restriction to the density of the information in tag clouds. According to Wikipedia, tag clouds generally contain between 30 and 150 tags. Usability clearly sets an upper limit to the number of tags. Moreover, the page layout can impose a restriction to the available space for the tag cloud. It is therefore necessary to take into account an imposed maximum length for the dataset.
Some texts may not be interesting to users and should be omitted from the tag cloud. This is the case for articles and other small words that are considered to be "noise" by search algorithms. If there are tags such as these in your data, you might want to filter the results.
Many tag clouds present information calculated over a period of time, such as the number of times that search terms have been used in the last 24 hours. Depending on the data, your function may contain extra parameters with which you restrict the aggregation of data to a (progressive) subset.

Eventually, you will create one or more functions that resemble Listing One. Your architecture for data access is hopefully more sophisticated than this simple example. But if you separate the construction of source data from the remaining functional layers of the tag cloud, then you already have a better design than the average tag cloud example found on the Internet.

Public Function GetWriters(ByVal maxCount As Integer, _
        ByVal ignoreNoise As Boolean, ByVal fromDate As DateTime, _
        ByVal toDate As DateTime) As DataTable
    Dim query As String = String.Format( _
        "SELECT * FROM (SELECT TOP {0} ID, Text, " & _
        "Count FROM Writers ORDER BY Count DESC) sub " & _
        "ORDER BY Text ASC", maxCount)
    'TODO: also filter on ignoreNoise, fromDate and toDate
    Dim adapter As New SqlDataAdapter(query, _ConnectionString)
    Dim table As New DataTable
    adapter.Fill(table)
    Return table
End Function

Listing One

Linearization

For the purposes of illustration, I created a dataset of well-known authors in our field, with the number of hits these names score in a Google search. When I use the raw data to create a tag cloud, I get the result in Figure 2(a). The tag cloud presents most of the names in approximately the same size. Only some names jump out, and some are nearly illegible. The reason is that the weights are not distributed evenly over the range of the source data. Most of the authors on my bookshelf have (roughly) the same number of Google hits. Only some authors have either very many or very few hits. It appears you can recognize a normal distribution (or Gaussian distribution) here, of which you can see examples in Figure 3. To get a more evenly distributed range of font sizes in the tag cloud, it is necessary to "linearize" the original values. You get a better result when you use a linearized representation, as in Figure 2(b). Technically, linearization means that the weights become less accurate. Bust because the tags have differing word lengths, there is already no such thing as an accurate reflection of the weights. Here, we are interested in usability, not accuracy.

[Click image to view at full size]

Figure 2: Linearization of source data.

[Click image to view at full size]

Figure 3: Normal distributions

The Pareto distribution, or "80-20 rule" (see Figure 4) is also frequently encountered. In this distribution, 80 percent of the weights are in the lowest 20 percent of the range, while the other 20 percent fill the remaining 80 percent of the range, or the other way around. Well-known examples of this distribution include wealth among people, popularity of websites, and the frequency of words from the English language. You need to select the right algorithm for linearization of your dataset. In Figure 2(c), my dataset (which contains a normal distribution) is linearized as if it contained a Pareto distribution. The result can be weird when you select the wrong distribution model. Strangely enough, I've noticed several authors doing exactly the opposite—they linearized datasets that contained Pareto distributions assuming (unknowingly, I suppose) that they were normal distributions. Evidently, statistical knowledge itself is not distributed evenly among software developers.

[Click image to view at full size]

Figure 4: Pareto distributions.

You will need several functions when linearizing multiple types of distributions. Each function only needs one collection of weights as input, and it returns a new (linearized) version of the collection. I suggest you work with generic interfaces for collections so that you can apply the same functions to different types of data sources. It is necessary to specify explicit upper and lower boundaries to the desired range of output values. It also seems proper to work with decimal or real numbers, not integers. Rounding the values to integers should be left to the UI code, in my opinion.

Listing Two is my attempt at linearizing a normal distribution, which is partly based on some examples on the Internet. The function calculates the standard deviation (sd) and makes the statistically correct assumption that nearly all numbers will be in the range -2 * sd to + 2 * sd. For each number, a new weight is calculated on a straight line through that range. Listing Three presents an algorithm that linearizes a Pareto distribution. This function calculates a new weight for each number using a logarithm, with e as the base number. (Diehards among us will not be satisfied with this and can determine from their own source data which base number would render the best approximation.) The remainder of the function in this case also plots the new values on a fictitious linear line between the minimum and maximum values.

Public Shared Function FromBellCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)
    'First, calculate the mean weight.
    Dim meansum As Decimal = 0
    For Each w As Decimal In weights
        meansum += w
    Next
    Dim mean As Double = meansum / weights.Count
    'Second, calculate the standard deviation of the weights.
    Dim sdsum As Double = 0
    For Each w As Decimal In weights
        sdsum += (w - mean) ^ 2
    Next
    Dim sd As Double = ((1 / weights.Count) * sdsum) ^ 0.5
    'Now calculate the slope of a straight line from -2*sd to +2*sd.
    Dim slope As Double
    If sd > 0 Then
        slope = (maxSize - minSize) / (4 * sd)
    End If
    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2
    'Calculate the result for the given deviation from mean.
    Dim output As New List(Of Decimal)

    For Each w As Decimal In weights
        If (sd = 0) Then
            'With sd=0 all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from mean for this weight.
            Dim distance As Double = w - mean
            'Calculate the position on the slope for this distance.
            Dim result As Double = CDec(slope * distance + middle)
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function

Listing Two

Public Shared Function FromParetoCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)
    'Convert each weight to its log value.
    Const BASE As Double = Math.E
    Dim logweights As New List(Of Decimal)
    For Each w As Decimal In weights
        logweights.Add(CDec(Math.Log(w, BASE)))
    Next
    'First, find the min and max weight.
    Dim min As Decimal = Decimal.MaxValue
    Dim max As Decimal = Decimal.MinValue
    For Each w As Decimal In logweights
        If w < min Then min = w
        If w > max Then max = w
    Next
    'Now calculate the slope of a straight line, from min to max.
    Dim slope As Double
    If max > min Then
        slope = (maxSize - minSize) / (max - min)
    End If
    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2
    'Calculate the result for each of the weights.
    Dim output As New List(Of Decimal)
    For Each w As Decimal In logweights
        If (max <= min) Then
            'With max=min all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from the minimum for this weight.
            Dim distance As Double = w - min

            'Calculate the position on the slope for this distance.
            Dim result As Double = CDec(slope * distance + minSize)
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function

Listing Three

Presentation

For determining font sizes, I encountered four techniques (Listing Four). You can see that both absolute (url1) and relative (url2) font sizes are used. You can also translate the weights to CSS class names (url3) or to multiple instances of sizing tags (url4). Undoubtedly, you will think of more alternatives that you would want to support with your code.

I don't address here how you create UI controls or bind HTML tags to data. I assume you can take the ideas from my examples and turn them into working code. Just remember that front-end developers are at least as stubborn as other software engineers. If you decide to create your own UI control, try to facilitate your front-end colleagues by making your HTML generator compatible with several of the techniques mentioned. This also applies to the context in which hyperlinks are placed. Some solutions use HTML for enumerating the text tags; others use spaces to separate them from each other. A UI designer might want other properties of the hyperlinks (such as the intensity of the color) to vary with the weight of the tags. A good UI control will not stand in the way of any creative ideas!

One last bit of advice: Consider that any spaces in tags (like in names of authors) can be replaced by ; while rendering the HTML. This saves front-end developers the effort of taking measures against undesirable wrapping of the tags.

Dim url1 As String = "<a href='{0}' style='font-size:{1}px;'>{2}</a>"
Dim url2 As String = "<a href='{0}' style='font-size:{1}em;'>{2}</a>"
Dim url3 As String = "<a href='{0}' class='class{1}'>{2}</a>"
Dim url4 As String = "<em>...<em><a href='{0}'>{2}</a></em>...</em>"

Listing Four

Conclusion

To sum up, there are different types of data that can serve as data sources for tag clouds and you can apply different algorithms to distribute font sizes more evenly among tag clouds. Using this information, you can build a generic solution in your own favorite environment that is applicable in many different situations. Hopefully, I will encounter some of your results the next time I am searching for good solutions.