Graphical representation in Data Analysis
Carrying out a clinical-epidemiological study ultimately involves reporting quantifiable results from that study or experiment. The clarity of this presentation is vital for understanding and interpreting the results. Several publications offer guidance on how to present the results of a statistical analysis adequately. Although numerical results are usually best presented in tables, a diagram or graph can sometimes represent the data more efficiently.
This article addresses the graphical representation of the results of a study and shows its usefulness in statistical analysis and data presentation. It describes the different types of graphs that can be used and how they correspond to the different stages of the analysis process.
When data are available from a population, a first step, before undertaking more complex statistical analyses, is to present that information in a systematic and summarized way. The analyses of interest depend, in each case, on the type of variables being handled.
For categorical variables, such as sex, TNM stage, or profession, we want to know the frequency and the percentage of total cases that fall into each category. A very simple way to represent these results graphically is with bar charts or pie charts. In a pie chart (also known as a sector diagram), a circle is divided into as many sectors as there are classes of the variable, so that each class has an arc proportional to its absolute or relative frequency. An example is shown in Figure 1. As can be seen, each sector should show the number of cases in the category and the percentage of the total that they represent. If the number of categories is too large, the pie chart is no longer clear; it works best with around three categories, when the subgroups can be clearly distinguished.
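As a minimal sketch of the calculation behind a pie chart, the following Python snippet counts the cases in each category and converts the counts into percentages and sector angles. The `stages` list is hypothetical illustration data, not data from any study, and the function name `category_summary` is an assumption of this sketch.

```python
from collections import Counter

def category_summary(values):
    """Return {category: (count, percentage, sector angle in degrees)}."""
    counts = Counter(values)
    total = sum(counts.values())
    summary = {}
    for category, count in counts.items():
        share = count / total
        # The arc of each sector is proportional to the relative frequency.
        summary[category] = (count, 100 * share, 360 * share)
    return summary

# Hypothetical sample: TNM stage recorded for 10 patients (illustrative only).
stages = ["I", "II", "II", "III", "I", "II", "III", "II", "I", "II"]
for stage, (n, pct, angle) in sorted(category_summary(stages).items()):
    print(f"Stage {stage}: n={n}, {pct:.0f}% of cases, arc of {angle:.0f} degrees")
```

The same counts and percentages are what a bar chart would display as bar heights.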
Bar charts serve a similar purpose. As many bars are drawn as there are categories of the variable, and the height of each bar is proportional to the frequency or percentage of cases in that class (Figure 2). The same graphs can also be used to describe discrete numerical variables that take few values (number of children, number of relapses, etc.).
For continuous numerical variables, such as age, blood pressure, or body mass index, the most commonly used graph is the histogram. To construct one, the range of values of the variable is divided into intervals of equal width, and a rectangle is drawn on each interval. The height of each rectangle is chosen so that its area is proportional to the absolute (or relative) frequency of the data in that interval. As an example, Table I shows the frequency distribution of the age of 100 patients aged between 18 and 42 years. If this range is divided into two-year intervals, the first interval covers ages 18 and 19 and contains 4/100 = 4% of the patients, so the first bar has a height proportional to 4. Proceeding in the same way for the remaining intervals gives the histogram shown in Figure 3. Joining the midpoints of the tops of the histogram bars produces a frequency polygon, which shows in the simplest way the ranges in which most of the data lie. An example, using the same data, is presented in Figure 4.
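The binning step described above can be sketched in a few lines of Python. The `ages` list is hypothetical (Table I is not reproduced here), and `histogram_bins` is a name chosen for this illustration. Because all intervals have the same width, a bar height proportional to the percentage also keeps the area proportional to the frequency.

```python
def histogram_bins(values, start, width, n_bins):
    """Count values in n_bins equal-width intervals starting at `start`.

    Returns one tuple per interval:
    (lower limit, upper limit, midpoint, count, percentage of all values).
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_bins:
            raise ValueError(f"{v} lies outside the binned range")
        counts[k] += 1
    total = len(values)
    rows = []
    for k, count in enumerate(counts):
        lower = start + k * width
        midpoint = lower + width / 2  # x-coordinate of the frequency polygon vertex
        rows.append((lower, lower + width, midpoint, count, 100 * count / total))
    return rows

# Hypothetical ages of 20 patients between 18 and 42 years (illustrative only).
ages = [18, 19, 21, 22, 22, 23, 24, 24, 25, 25,
        25, 26, 27, 27, 28, 30, 31, 33, 36, 42]
for lower, upper, mid, count, pct in histogram_bins(ages, start=18, width=2, n_bins=13):
    print(f"{lower}-{upper - 1} years: {count} patients ({pct:.0f}%)")
```

Plotting the `midpoint` of each interval against its percentage and joining the points gives the frequency polygon.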
Another common and very useful way to summarize a numerical variable is through percentiles, using box plots. Figure 5 shows the box plot corresponding to the data in Table I. The central box indicates the range that contains the central 50% of the data; its ends are therefore the 1st and 3rd quartiles of the distribution. The line inside the box is the median, so if the distribution is symmetric this line lies at the center of the box. The ends of the “whiskers” that extend from the box are the values that delimit the central 95% of the data, although they sometimes coincide with the extreme values of the distribution. Observations that fall outside this range (outliers or extreme values) are usually also plotted, which is especially useful for checking graphically for possible errors in the data. In general, box plots are more appropriate for variables that deviate markedly from the normal distribution. As will be seen later, they are also of great help when data are available for different groups of subjects.
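The quantities drawn in a box plot can be computed directly from the data. The sketch below uses Python's standard `statistics.quantiles` function and follows the convention described in the text, with whiskers at the limits of the central 95% of the data. Many statistical packages instead place the whiskers at 1.5 times the interquartile range beyond the box, so results may differ from software output. The function name `box_plot_summary` and the sample values are assumptions made for illustration.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles plus the limits of the central 95% of the data."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    cuts = quantiles(data, n=40, method="inclusive")  # cut points every 2.5%
    low, high = cuts[0], cuts[-1]                     # 2.5th and 97.5th percentiles
    outside = [v for v in data if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": low, "whisker_high": high, "outside": outside}

# Hypothetical, right-skewed sample (illustrative only).
sample = [18, 19, 20, 20, 21, 21, 22, 22, 23, 24, 25, 26, 28, 31, 35, 42]
summary = box_plot_summary(sample)
print("Box:", summary["q1"], "to", summary["q3"], "with median", summary["median"])
print("Whiskers:", summary["whisker_low"], "to", summary["whisker_high"])
print("Plotted individually:", summary["outside"])
```

In a skewed sample such as this one, the median sits away from the center of the box, which is the visual cue for asymmetry mentioned above.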
Finally, with regard to describing the data, subsequent analyses often require checking the normality of some of the numerical variables. A box plot or a histogram allows a purely visual check of the symmetry and peakedness (kurtosis) of the distribution of a variable and, therefore, of its deviation from normality. There are also graphical methods designed specifically for this purpose, such as P-P and Q-Q plots. In a P-P plot, the cumulative proportions of the variable are compared with those of a normal distribution; if the variable follows the test distribution, the points cluster around a straight line. Q-Q plots are obtained analogously, this time plotting the quantiles of the distribution of the variable against the quantiles of the normal distribution. In Figure 6, the P-P plot for the data in Table I suggests, like the corresponding histogram and box plot, that the distribution of the variable departs from the normal.
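A rough sketch of how the points of a Q-Q (quantile-quantile) plot can be obtained is given below. It pairs each ordered observation with the quantile of a normal distribution that has the same mean and standard deviation. The plotting position (i - 0.5)/n is one common choice and is an assumption of this sketch, as are the function name `qq_points` and the sample data.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each ordered observation with the normal quantile expected at its rank."""
    data = sorted(values)
    n = len(data)
    reference = NormalDist(mean(data), stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n  # plotting position (an assumed, common choice)
        points.append((reference.inv_cdf(p), observed))
    return points

# Hypothetical, right-skewed sample (illustrative only).
sample = [18, 19, 20, 20, 21, 21, 22, 22, 23, 24, 25, 26, 28, 31, 35, 42]
for expected, observed in qq_points(sample):
    print(f"expected under normality: {expected:6.2f}   observed: {observed}")
```

If the variable were normally distributed, the two columns would be close to each other and the plotted points would lie near the diagonal; systematic departures at the ends indicate skewness or heavy tails.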
Comparison of two or more groups
When we want to compare observations taken in two or more groups of individuals, the statistical method to use and the appropriate graphs to visualize the relationship depend, once again, on the type of variables being handled.
When working with two qualitative variables, we can continue to use bar or pie charts. We may want to determine, for example, whether in a given sample coronary disease is more frequent among subjects who have a relative with a history of heart disease. From this sample we can draw, as in Figure 7, two groups of bars: one for subjects with a family history of heart disease and another for those without. In each group, two bars represent the percentage of patients who have or do not have coronary disease. When the sizes of the two groups differ, relative frequencies should be used, since otherwise the graph could be misleading.
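The point about relative frequencies can be illustrated with a short Python sketch that computes, within each group, the percentage of subjects in each outcome category. The records are hypothetical and were chosen so that the two groups have different sizes; `percentages_by_group` is a name invented for this example.

```python
from collections import Counter

def percentages_by_group(records):
    """records: list of (group, outcome) pairs. Returns {group: {outcome: percent}}."""
    counts = Counter(records)
    group_totals = Counter(group for group, _ in records)
    result = {}
    for (group, outcome), count in counts.items():
        result.setdefault(group, {})[outcome] = 100 * count / group_totals[group]
    return result

# Hypothetical sample: 20 subjects with a family history and 50 without.
records = (
    [("family history", "coronary disease")] * 12
    + [("family history", "no coronary disease")] * 8
    + [("no family history", "coronary disease")] * 10
    + [("no family history", "no coronary disease")] * 40
)
for group, outcomes in percentages_by_group(records).items():
    for outcome, pct in outcomes.items():
        print(f"{group}: {outcome} = {pct:.0f}%")
```

In this invented sample the absolute numbers of diseased subjects are similar in the two groups (12 and 10), while the percentages (60% and 20%) are very different, which is why bars based on absolute counts could mislead.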
On the other hand, continuous variables are usually compared across two or more groups in terms of their mean value, by means of Student’s t-test, analysis of variance, or equivalent non-parametric methods, and this should be reflected in the type of chart used. In this case, an error bar diagram is very useful, as in Figure 8, which compares the body mass index in a sample of men and women. For each group, the mean value is plotted together with its 95% confidence interval. It should be remembered that whether or not these intervals overlap is only an informal guide and does not replace a formal test of the difference between the two groups, but the diagram can help us to assess the magnitude of that difference. Likewise, two box plots, one for each group, can be used to visualize this type of association. They are especially useful here: they not only show whether there is a difference between the groups, but also allow us to check the normality and variability of each distribution. Normality and homoscedasticity are necessary conditions for applying some parametric procedures.
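The values plotted in an error bar diagram can be sketched as follows. This example uses a normal approximation for the 95% confidence interval of the mean, which is an assumption made to keep the code within the Python standard library; for small samples a Student's t quantile would normally be used instead. The body mass index values and the function name `mean_with_ci` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_with_ci(values, level=0.95):
    """Return (mean, lower, upper) using a normal approximation to the interval."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return centre, centre - z * standard_error, centre + z * standard_error

# Hypothetical body mass index values (kg/m^2), illustrative only.
bmi = {
    "men": [24.1, 26.3, 27.8, 25.0, 28.4, 26.9, 23.7, 27.2, 25.5, 26.1],
    "women": [22.4, 24.0, 23.1, 25.6, 21.9, 23.8, 24.7, 22.8, 23.5, 24.2],
}
for group, values in bmi.items():
    m, lower, upper = mean_with_ci(values)
    print(f"{group}: mean {m:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

Each group would be drawn as a point at the mean with a vertical bar from the lower to the upper limit.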
Finally, bar charts can also be used in this situation, with the height of each bar representing the mean value of the variable of interest. Line graphs can also be of particular interest, especially for studying trends over time (Figure 9). They are simply a series of points connected by lines, where each point can represent whatever is of interest in each case (the mean value of a variable, the percentage of cases in a category, the maximum value in each group, etc.).
Relationship between two numerical variables
When the aim is to study the relationship between two continuous variables, the appropriate method of analysis is the study of correlation. Correlation coefficients (Pearson, Spearman, etc.) assess the extent to which the value of one variable increases or decreases as the value of the other increases. When all the data are available, a simple way to check graphically for a high correlation is the scatter plot, in which the value of one variable is plotted on the horizontal axis and the value of the other on the vertical axis. A simple example of highly correlated variables is the weight and height of a subject. Starting from an arbitrary sample, we can construct the scatter plot in Figure 10, which clearly shows a direct relationship between the two variables and lets us assess to what extent this relationship can be modeled by the equation of a straight line. Graphs of this type are therefore especially useful at the variable selection stage when fitting a linear regression model.
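To complement the visual impression given by a scatter plot, the Pearson correlation coefficient and the least-squares line can be computed from the same pairs of values. The following is a minimal sketch; the heights and weights are hypothetical and `pearson_and_line` is a name chosen for this illustration.

```python
from math import sqrt

def pearson_and_line(x, y):
    """Return (r, slope, intercept) for paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)           # Pearson correlation coefficient
    slope = sxy / sxx                   # least-squares slope of y on x
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Hypothetical heights (cm) and weights (kg) of 8 subjects, illustrative only.
heights = [152, 158, 163, 167, 171, 175, 180, 186]
weights = [51, 56, 60, 63, 68, 72, 77, 85]
r, slope, intercept = pearson_and_line(heights, weights)
print(f"r = {r:.2f}; fitted line: weight = {intercept:.1f} + {slope:.2f} * height")
```

A value of r close to 1 corresponds to points that lie close to an increasing straight line in the scatter plot.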
The graphs shown so far are the simplest available, but they offer great possibilities for representing data and can be used in many situations, including to display the results of more complicated methods of analysis. We can use, for example, two superimposed line graphs to visualize the results of a two-way analysis of variance (Figure 11). A scatter plot is an appropriate way to assess the result of a logistic regression model (Figure 12). Some specific analyses are even based entirely on graphical representation. In particular, plotting ROC curves (Figure 13) and calculating the area under the curve is the most appropriate method for assessing the accuracy of a diagnostic test.
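As an illustration of the ROC approach, the sketch below builds the points of an ROC curve from test scores and true disease status, and computes the area under the curve with the trapezoidal rule. It assumes that higher scores indicate disease and that a case is called positive when its score is at or above the threshold. The data and the function names `roc_curve_points` and `area_under_curve` are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate), one per threshold.

    labels are 1 for diseased and 0 for healthy subjects; a case is called
    positive when its score is greater than or equal to the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical test results (illustrative only): score and true disease status.
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
points = roc_curve_points(scores, labels)
print(f"Area under the ROC curve: {area_under_curve(points):.2f}")
```

An area of 1 corresponds to a test that separates diseased and healthy subjects perfectly, while an area of 0.5 corresponds to a test that performs no better than chance.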
We have seen, therefore, how important and useful graphical representations can be in the data analysis process. Most statistical and epidemiological texts emphasize the different types of graphs that can be created as an essential tool in the presentation of results and in statistical analysis. However, it is difficult to say when a graph is more appropriate than a table; rather, the two can be regarded as different but complementary ways of viewing the same data. The increasing use of computer programs makes graphs especially easy to obtain, and most statistical packages (SPSS, STATGRAPHICS, S-PLUS, EGRET, …) offer great possibilities in this regard. In addition to the graphs described here, other graphs can be created, including three-dimensional ones.