Jan Aerts, KU Leuven, email@example.com
Student Team: YES
Did you use data from both mini-challenges? YES
R (ggplot2, dplyr, igraph, tidyr, RColorBrewer, lubridate, vegan)
Processing, to prototype
Approximately how many hours were spent working on this submission in total?
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
MC1.1 – Characterize the attendance at DinoFun World on this weekend. Describe up to twelve different types of groups at the park on this weekend.
a. How big is this type of group?
b. Where does this type of group like to go in the park?
c. How common is this type of group?
d. What are your other observations about this type of group?
e. What can you infer about this type of group?
f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?
Limit your response to no more than 12 images and 1000 words.
We first parsed the GPS data to infer trajectories of individual movements and stationary periods. We also used the provided map and the parsed data to infer the coordinates of each attraction in the park. (For more details on the data transformation, please see our video.) Using this trajectory data, we count the number of times each individual “checks in” at an attraction or appears at an attraction without a check-in system (such as the Beer Gardens). If two individuals have exactly the same pattern of attraction counts, we consider them members of the same “group”. Using this logic, we aggregate individuals by their attraction counts and draw histograms of the group sizes (Figure 1). From this figure, we define three types of groups (small, medium, and large). For instance, large groups consist of 29 to 43 people who have exactly the same attraction count pattern.
Figure 1. Histograms of group sizes per day. The blue text gives the count where the bar is hard to read.
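The grouping rule described above (identical attraction-count patterns define a group) can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the submission; the input layout, the attraction names, and the function name `group_by_count_pattern` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (visitor_id, attraction) pair per inferred visit.
visits = [
    (1, "Beer Garden"), (1, "Stage"), (1, "Stage"),
    (2, "Stage"), (2, "Beer Garden"), (2, "Stage"),
    (3, "Stage"),
]

def group_by_count_pattern(visits):
    """Map each distinct attraction-count pattern to the visitors sharing it."""
    counts = defaultdict(Counter)
    for visitor_id, attraction in visits:
        counts[visitor_id][attraction] += 1

    groups = defaultdict(list)
    for visitor_id, counter in counts.items():
        # A sorted tuple of (attraction, count) pairs is hashable and
        # independent of the order in which attractions were visited.
        pattern = tuple(sorted(counter.items()))
        groups[pattern].append(visitor_id)
    return dict(groups)

groups = group_by_count_pattern(visits)

# Histogram input: how many groups exist for each group size.
group_sizes = Counter(len(members) for members in groups.values())
```

Here visitors 1 and 2 share a pattern and form a group of size two, while visitor 3 forms a group of size one. Note that this rule ignores visit order and timing, so two unrelated visitors with the same counts would also be merged.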
We subset the visitors in large groups on Friday and visualize their behavior in the sequence view (Figure 2). In this view, each visitor is represented as a horizontal line, and the attractions they visit are color-coded by attraction type. For example, the “friday_5” group consists of 30 individuals and is the only group that does not go to the show at 15:00; the relatively large gray gaps between its attractions suggest that its members spend more time walking between attractions. This group leaves the park around 18:30. The representation also reveals variation within a large group. For instance, about half of the “friday_1” group goes shopping between a beer garden and a kiddie ride around 11:00. Since none of the groups use Information & Assistance, we hypothesize that these groups may have a tour guide or someone very familiar with the park. Examining each attraction type helps characterize a group further; for example, the “friday_3” group goes to a Beer Garden 7 times throughout the day.
Figure 2. Sequence view of large groups on Friday.
Figure 3 shows the sequence view of large groups on Saturday. On Saturday, every large group goes to the Grinosaurus Stage at 15:00. It also appears that groups that arrive early spend more time at the entrance. The park could perhaps reduce waiting times at the entrance to handle the arrival of large groups in the early morning. Another general trend is that people in large groups tend to shop at the end of the day.
Figure 3. Sequence view of large groups on Saturday.
Probably because of the partial closure on Sunday, the large groups on Sunday do not share a distinct common behavior (Figure 4).
Figure 4. Sequence view of large groups on Sunday.
We extend the analysis to the medium-size groups defined in Figure 1. Figure 5 shows the medium groups on Friday, sorted by arrival time. Although the movement patterns vary, many sets of groups arrive and leave at the same time. We hypothesize that each such set is one large group that moves around the park in smaller groups of 6 to 11. The gray gaps between attractions are much smaller for medium groups than for large groups, suggesting that medium groups are more efficient and spend more of their time at attractions. Many groups also appear to spend a few hours shopping at the end of their visit. Other notable groups are those that arrive around 9:00 and leave around 15:00, and those that arrive around 15:00 and leave around 22:00.
Figure 5. Sequence view of medium groups on Friday.
A similar pattern of sets of groups arriving and leaving at the same time is observed on Saturday (Figure 6). In contrast to the large groups on Saturday (Figure 3), the medium groups that arrive later tend to spend more time at the entrance.
Figure 6. Sequence view of medium groups on Saturday.
The sequence view of medium groups on Sunday (Figure 7) shows a similar pattern: groups that arrive later spend more time at the entrance. We detected one anomaly, in which one medium-size group appears to spend a very long time in the restroom. This group is marked with a black triangle in Figure 7.
Figure 7. Sequence view of medium groups on Sunday.
Another approach we took was to use the Morisita overlap index to measure how much the visitors of different attractions overlap. Using the derived table of attraction counts, we calculate the Morisita overlap index with the “vegan” R package and use the resulting dissimilarity matrix as input for hierarchical clustering with complete linkage. Figure 8 shows the resulting hierarchy as a dendrogram for the Friday data. The black triangle indicates the leaf node level where overlaps of attractions are observed. For instance, there is a group of people (1) who go to the Tyrannosaurus Restroom, the MaryAnning Beer Garden, and the Alverez Beer garden. Note that the Alverez Beer garden is in a different area of the park (Wet Land). Groups (2) and (3) are both overlaps involving rides from Kiddie Land. The index can also be read by comparing the three entrances: the West, East, and North entrances are well separated in the hierarchy because they do not overlap; in other words, people enter and exit through the same entrance.
Figure 8. A dendrogram showing the result of hierarchical clustering of attractions.
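To make the clustering step concrete, the sketch below computes a Morisita dissimilarity between attractions and feeds it to a naive complete-linkage agglomeration. It is written in Python for illustration and is not the R code used for Figure 8. It uses the textbook form of the Morisita index and clamps negative dissimilarities to zero, which may differ in edge-case handling from the “vegan” implementation. The table layout (one row of per-visitor counts per attraction), the attraction labels, and both function names are assumptions.

```python
def morisita_dissimilarity(x, y):
    """1 minus the Morisita overlap of two count vectors (textbook form)."""
    n_x, n_y = sum(x), sum(y)
    if n_x <= 1 or n_y <= 1:
        raise ValueError("each row needs a total count above 1")
    lambda_x = sum(v * (v - 1) for v in x) / (n_x * (n_x - 1))
    lambda_y = sum(v * (v - 1) for v in y) / (n_y * (n_y - 1))
    if lambda_x + lambda_y == 0:
        raise ValueError("index is undefined when no count exceeds 1")
    shared = sum(a * b for a, b in zip(x, y))
    overlap = 2 * shared / ((lambda_x + lambda_y) * n_x * n_y)
    # The overlap can exceed 1 for small samples, so clamp at zero.
    return max(0.0, 1.0 - overlap)

def complete_linkage(labels, dist):
    """Naive agglomerative clustering; returns the list of merges."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Complete linkage: cluster distance is the largest pairwise distance.
        candidates = [
            (max(dist[a][b] for a in left for b in right), i, j)
            for i, left in enumerate(clusters)
            for j, right in enumerate(clusters)
            if i < j
        ]
        height, i, j = min(candidates)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical table: rows are attractions, columns are visitors.
counts = {
    "beer_garden_a": [2, 2, 0, 0],
    "beer_garden_b": [2, 2, 0, 0],
    "kiddie_ride": [0, 0, 3, 1],
}
labels = list(counts)
dist = {a: {b: morisita_dissimilarity(counts[a], counts[b]) for b in labels}
        for a in labels}
merges = complete_linkage(labels, dist)
```

In this toy table the two beer gardens share all their visitors and merge first at height 0, while the kiddie ride shares none and joins last at height 1, mirroring how non-overlapping entrances end up far apart in the dendrogram.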
We can use the insights from clustering to study specific groups. For example, if we subset the individuals who go to the two beer gardens and the Tyrannosaurus restroom on Friday, we find 76 individuals, and this subset can be visualized in the sequence view (Figure 9).
Figure 9. Sequence view of the beer garden groups.
MC1.2 – Are there notable differences in the patterns of activity in the park across the three days? Please describe the notable difference you see.
Limit your response to no more than 3 images and 300 words.
Some notable differences in the behavior of large groups are mentioned in MC1.1. In addition, we compare the activity at each attraction across the three days using small multiples of histograms that show the distribution of check-in counts per attraction (Figure 10). We gained a few insights. First, the Creighton Pavilion and the Grinosaurus Stage close after 12:00 on Sunday. Second, there is a relatively high number of check-ins at the Leggement Fix-Me-Up around 14:00 on Friday. Third, the park appears busier on Saturday and Sunday than on Friday, with Sunday being the busiest.
Figure 10. Histograms of check-in counts per attraction, binned per hour.
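The hourly binning behind such histograms can be sketched as below. This is an illustrative Python version, not the R/ggplot2 code used for the figure; the record layout, timestamp format, attraction names, and the function name `hourly_checkins` are assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical check-in records: (timestamp, attraction name).
checkins = [
    ("2014-06-06 14:05:00", "Ride A"),
    ("2014-06-06 14:40:00", "Ride A"),
    ("2014-06-06 15:10:00", "Ride A"),
    ("2014-06-08 09:30:00", "Stage B"),
]

def hourly_checkins(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Count check-ins per (attraction, date, hour) bin."""
    bins = Counter()
    for timestamp, attraction in records:
        t = datetime.strptime(timestamp, fmt)
        bins[(attraction, t.date().isoformat(), t.hour)] += 1
    return bins

bins = hourly_checkins(checkins)
```

Each key corresponds to one bar in one panel of a small-multiples plot: the attraction selects the panel, the date selects the day, and the hour selects the bin.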
In Figure 11, we aggregate the attraction counts by area and by type of attraction. This plot allows us to compare distributions and trends in the context of geographic location and type of activity. For example, the North Entrance in the Entrance Corridor is the most used entrance. Shopping attractions get busier after 18:00, while rides for everyone and thrill rides quiet down. The anomaly at the Leggement Fix-Me-Up and the closure of the Creighton Pavilion and the Grinosaurus Stage can also be observed here.
Figure 11. Histograms of check-in counts, aggregated by areas and types of attraction, binned per hour.
Figure 12 compares the distributions of minutes spent at each attraction across the three days. The duration is estimated from the GPS record. The Wrightiraptor Mountain, TerrorSaur, Firefall and Flight of the Swingdon have longer waiting periods on Saturday and Sunday. The Auvilotops Express appears to have a longer waiting time only on Sunday.
Figure 12. Histograms of minutes spent at the attractions, binned per minute.
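One simple way to estimate time spent at an attraction from such records is sketched below. It assumes that a check-in record is followed by a gap until the visitor's next record, and treats that gap as the time spent at the attraction. This rule, the record layout, and the function name `minutes_at_checkins` are assumptions for illustration and not necessarily the estimation used for Figure 12.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical time-sorted records for one visitor:
# (timestamp, record type, attraction name or None).
records = [
    ("2014-06-06 10:00:00", "check-in", "Ride A"),
    ("2014-06-06 10:12:30", "movement", None),
    ("2014-06-06 10:12:40", "movement", None),
    ("2014-06-06 10:20:00", "check-in", "Ride B"),
    ("2014-06-06 10:26:00", "movement", None),
]

def minutes_at_checkins(records):
    """Return (attraction, minutes until the next record) per check-in."""
    durations = []
    for current, following in zip(records, records[1:]):
        timestamp, record_type, attraction = current
        if record_type == "check-in":
            stay = (datetime.strptime(following[0], FMT)
                    - datetime.strptime(timestamp, FMT))
            durations.append((attraction, stay.total_seconds() / 60))
    return durations

durations = minutes_at_checkins(records)
```

A check-in that is the visitor's last record has no following record and is skipped, so the final stay of the day is not measured under this rule.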
MC1.3 – What anomalies or unusual patterns do you see? Describe no more than 10 anomalies, and prioritize those unusual patterns that you think are most likely to be relevant to the crime.
Limit your response to no more than 10 images and 500 words.
Using the derived trajectory data and the sequence view, we visualize the subset of individuals who appear to be present in the Creighton Pavilion (GPS=32,33) and draw the inferred time periods they spend at this location. We color each line according to whether it was derived from check-in events or only from movement records. By comparing the plots for Friday (Figure 13), Saturday (Figure 14), and Sunday (Figure 15), we identify two suspicious groups on Sunday before the pavilion closed. The first group, “group 1”, consists of 37 individuals who appear to be at the pavilion during hours when it is usually closed and who do not check in. The second, “group 2”, consists of 3 visitors who appear to stay in or near the pavilion, also during hours when it is usually closed.
Figure 13. Sequence view at the Pavilion on Friday.
Figure 14. Sequence view at the Pavilion on Saturday.
Figure 15. Sequence view at the Pavilion on Sunday.
Using the GPS data and the derived trajectory data, we calculate the total count of “movement” GPS records and the distance traveled for each visitor. The result is shown as a scatter plot (Figure 16), in which we identify 7 outlying individuals. The right side of Figure 16 shows the hourly movement patterns of these individuals, overlaid. The movement patterns are highly synchronized and regular. These individuals check in only at the East Entrance, at 8:00 and 13:00, and move back and forth to the Grinosaurus Stage. Because of this regularity, we hypothesize that they are security staff working in the park.
Figure 16. Scatter plot of movement count and total distance traveled, and movement pattern of identified outliers.
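The two per-visitor quantities plotted in the scatter plot can be computed along the following lines. This is a hedged Python sketch rather than the code used for the submission; it assumes time-sorted records with grid coordinates, measures distance as straight-line (Euclidean) distance between consecutive positions, and uses a hypothetical function name `movement_summary`.

```python
from collections import defaultdict
from math import hypot

# Hypothetical time-sorted records: (visitor_id, record type, x, y).
records = [
    (1, "movement", 0, 0),
    (1, "movement", 3, 4),
    (1, "check-in", 3, 4),
    (2, "check-in", 1, 1),
]

def movement_summary(records):
    """Return {visitor_id: (movement record count, total distance)}."""
    last_position = {}
    movement_count = defaultdict(int)
    distance = defaultdict(float)
    for visitor_id, record_type, x, y in records:
        if visitor_id in last_position:
            px, py = last_position[visitor_id]
            # Straight-line distance between consecutive positions.
            distance[visitor_id] += hypot(x - px, y - py)
        last_position[visitor_id] = (x, y)
        if record_type == "movement":
            movement_count[visitor_id] += 1
    return {v: (movement_count[v], distance[v]) for v in last_position}

summary = movement_summary(records)
```

Plotting the two values of each tuple against each other gives one point per visitor, and visitors with unusually many movement records or unusually long paths stand out from the main cloud.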
If we subset the individuals who check in twice, only at the East Entrance, we find one additional individual (1787551). We then compare the movement patterns across the three days using the GPS and trajectory data. Each GPS trail or trajectory path is colored by time of day. Figure 17 shows the GPS trails and Figure 18 shows the trajectory paths. The GPS trails show very regular and synchronized patterns, while the trajectory paths reveal one anomaly (1080969), who appears to slow down or pause near the Pavilion on Sunday morning. We find this anomaly very suspicious and hypothesize that it is related to the crime.
Figure 17. GPS trails of security staff.
Figure 18. Trajectory paths of security staff.
Figure 19. Comparison of start and end positions per visitor. Colored circles indicate visitors who enter and exit through the same entrance, and circle size represents frequency. The arrows show anomalous visitors who do not end up where they started, and the numbers give the visitor IDs. For example, the last GPS record of visitor 657863 on Friday is near the Scholtz Express, and on Saturday his/her GPS trail starts there and goes to the North Entrance. This could be suspicious, but it could also be a case of a lost mobile phone.
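The start/end comparison can be sketched as a single pass over time-sorted records for one day. The Python example below is illustrative only; the record layout, the coordinates, the exact-match comparison of positions, and the function name `start_end_mismatches` are assumptions, and a real analysis might instead match positions to the nearest entrance with some tolerance.

```python
# Hypothetical time-sorted records for one day: (visitor_id, x, y).
records = [
    (10, 63, 99), (10, 50, 57), (10, 63, 99),   # returns to where it started
    (11, 63, 99), (11, 50, 57), (11, 0, 67),    # ends somewhere else
]

def start_end_mismatches(records):
    """Return {visitor_id: (first position, last position)} where they differ."""
    first, last = {}, {}
    for visitor_id, x, y in records:
        first.setdefault(visitor_id, (x, y))  # keep only the earliest position
        last[visitor_id] = (x, y)             # overwrite until the latest one
    return {v: (first[v], last[v]) for v in first if first[v] != last[v]}

mismatches = start_end_mismatches(records)
```

Visitors absent from the result correspond to the colored circles in the figure, and visitors present in it correspond to the arrows.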