Exploratory Data Analysis and Visualizations (STAT GR5702)

PART 1

Find one graph that you like and one that you don’t online or in print and write one paragraph for each, describing what you think is effective or not effective. Your evaluation does not need to be tied to any theory or principles discussed in class so far. Be sure to include images and/or links to the graphs, with sources cited. The graphs may be static or dynamic. Some submissions will be discussed in class. [10 points]

Good Graph Source: Blog Hub Spot 1

[1] Source: https://blog.hubspot.com/marketing/data-visualization-choosing-chart#sm.0000iw0h7xqvpfkjtrb1yn3ppel7m

Bubble charts are fun to explore, and this one is effective at showing the distribution of, and relationships among, several variables at once: hours spent online, age, gender, and population size. The semi-transparent bubbles also make overlapping points easy to read. The positive, roughly linear correlation shows that hours spent online increase as children get older. The dataset covers only children, so it would be interesting to see the same relationship for adults, including today’s millennial generation.


Bad Graph

Source: Flowing Data 2 [2] Source: https://blog.hubspot.com/marketing/data-visualization-choosing-chart#sm.0000iw0h7xqvpfkjtrb1yn3ppel7m

The graph above is not very effective because the bar heights conflict with the values they represent. The weight axis does not start at zero, so the baseline is poorly defined, and this distorts our perception of how the weight changes over time. That is, what looks like a large increase or decrease from day n to day n+1 is not proportional to the actual fluctuation in weight.
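To make the distortion concrete, here is a minimal sketch of the arithmetic in R. The two weights and the axis start of 150 are made-up values chosen for illustration; they are not read from the chart.

```r
# Hypothetical weights on two consecutive days (NOT taken from the chart)
weights  <- c(day1 = 152, day2 = 156)
baseline <- 150  # assumed value where the truncated axis starts

# True relative change between the two days
true_ratio <- weights[["day2"]] / weights[["day1"]]

# Ratio of the drawn bar heights when the axis starts at `baseline`
drawn_ratio <- (weights[["day2"]] - baseline) / (weights[["day1"]] - baseline)

round(true_ratio, 3)  # 1.026: the weight rose by about 2.6%
drawn_ratio           # 3: the second bar is drawn three times as tall
```

Under these assumed numbers, a change of under 3% is drawn as a threefold difference in bar height, which is the kind of mismatch described above.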


Fixing the Bad Graph

Source: Flowing Data 2 [2] Source: https://blog.hubspot.com/marketing/data-visualization-choosing-chart#sm.0000iw0h7xqvpfkjtrb1yn3ppel7m

To remove this distortion, we can use a line chart instead of bars and focus on the change in weight. A line chart does not need a zero baseline because values are encoded by position rather than by bar length. As a result, the visual encoding matches what the graph is meant to show, and viewers can see more clearly how the weight changes over time.
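A rough sketch of such a line chart in base R is shown here. The seven daily weights are invented for illustration, since the original chart's data are not available, and the axis label assumes pounds.

```r
# Hypothetical daily weights (illustrative only, not the chart's data)
days   <- 1:7
weight <- c(152, 156, 154, 153, 155, 151, 152)

# Line chart: position, not bar length, encodes the value,
# so the y-axis can be limited to the neighbourhood of the data
plot(days, weight, type = "b", pch = 19, las = 1,
     main = "Weight over time", xlab = "Day", ylab = "Weight (lb, assumed unit)",
     ylim = range(weight) + c(-1, 1))
```

Using `type = "b"` draws both points and connecting lines, so individual days stay visible while the trend is emphasized.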

PART 2

Chapter 3, pp. 50-51, from Graphical Data Analysis with R:
1) Galaxies: The dataset galaxies in the package MASS contains the velocities of 82 galaxies. [5 points]
(a) Draw a histogram, a boxplot, and a density estimate of the data. What information can you get from each plot? Use base graphics.

# Install MASS only if it is not already available, then load it
if (!require('MASS')) {
  install.packages('MASS')
  library(MASS)
}
Loading required package: MASS
# Histogram:
gal <- galaxies / 1000  # rescale from km/s to units of 1000 km/s
hist(gal, col = "lightblue", border = "black", main = "Histogram for Galaxies",
     xlab = "Velocity of Galaxy (1000 km/s)", las = 1, breaks = 10)
abline(v = mean(gal), col = "red")  # mark the mean with a red vertical line

The histogram shows the frequency distribution of the galaxies dataset, which contains the velocities of 82 galaxies. The distribution appears to be unimodal (one main peak) and slightly right-skewed (positively skewed). The mean, marked by the red line, lies slightly to the right of the highest peak. The overall distribution is asymmetric.
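A quick numeric check can support or qualify this visual reading of the histogram. The sketch below is an optional addition, not part of the exercise: it compares the mean with the median and lists the bin counts behind the plot, which helps reveal gaps or smaller clusters away from the main peak.

```r
library(MASS)            # provides the galaxies vector
gal <- galaxies / 1000   # same rescaling as above: units of 1000 km/s

# Compare centre measures; a mean above the median points towards right skew
c(mean = mean(gal), median = median(gal))

# Bin counts behind the histogram, computed without drawing it
h <- hist(gal, breaks = 10, plot = FALSE)
data.frame(lower = head(h$breaks, -1),
           upper = h$breaks[-1],
           count = h$counts)
```

Note that `breaks = 10` is only a suggestion to `hist()`, so the actual number of bins in the table may differ from 10.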


# Boxplot:
boxplot(gal, main = "Boxplot of Galaxies", xlab = "Velocity of Galaxy (1000 km/s)",
        col = "orange", horizontal = TRUE, varwidth = TRUE, ylim = c(5, 35))