Click here to get Transaction Data. Click here to get R Code.
[Figures: Basket Transaction Dataset, Item Frequency Plot]
The item frequency plot shows which items occur most frequently in the transactions. The most popular book formats are paperback and hardback, and the top book categories are Poetry-Drama and Teen-Young-Adult. In addition, the most common book ratings are 4, 4.5 and 3.5, in that order.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.
Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
(From Wikipedia)
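The level-wise search described in the quote can be sketched as follows. This is a simplified Python illustration of the Apriori idea under stated assumptions (toy data, a hypothetical function name `apriori_frequent_itemsets`, no performance tuning); it is not the implementation used for the results on this page.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent

# Hypothetical toy data, not the author's dataset.
toy = [
    {"Paperback", "Crime-Thriller", "4"},
    {"Paperback", "Romance", "4"},
    {"Hardback", "Food-Drink", "4.5"},
    {"Paperback", "Crime-Thriller", "4"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.5)
```

With a minimum support of 0.5, only Paperback, Crime-Thriller and 4 survive the first level, so larger item sets are built from those three items alone. That pruning is what keeps the search tractable.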
[Figures: Top 15 Rules of Support, Top 15 Rules of Confidence, Top 15 Rules of Lift]
Support is the probability that two events occur together. Among the top 15 rules by support, {4} => {Paperback} has the highest support, 0.4469, which indicates that 4 and Paperback occur together in 44.69% of transactions. In other words, 44.69% of the books are paperbacks with a rating of 4.
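As a small illustration of this definition, support can be computed by counting the transactions that contain all items of a rule. The Python sketch below uses made-up toy data and a hypothetical helper name `support`; it is not taken from the linked R code.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Hypothetical toy data, not the author's dataset.
transactions = [
    {"Paperback", "Crime-Thriller", "4"},
    {"Paperback", "Romance", "4"},
    {"Hardback", "Food-Drink", "4.5"},
    {"Paperback", "Poetry-Drama", "3.5"},
]

# Support of the rule {4} => {Paperback}: both items must be present.
rule_support = support({"4", "Paperback"}, transactions)
```

Support does not depend on the direction of the rule: {4} => {Paperback} and {Paperback} => {4} share the same value because both count the same transactions.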
Confidence is the probability that the right-hand side (RHS) of a rule occurs given that the left-hand side (LHS) occurs. Among the top 15 rules by confidence, {4, Crime-Thriller} => {Paperback} has the highest confidence, 0.9539, which indicates that Paperback occurs in 95.39% of the transactions that contain both 4 and Crime-Thriller. However, paperback is the most popular book format in the dataset, so Paperback occurs in a great number of transactions regardless of the other items. As the screenshot above shows, the RHS of every rule here is Paperback.
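A short Python sketch of the confidence calculation follows, again on hypothetical toy data with assumed helper names (`count_containing`, `confidence`). It also compares the result with the overall share of the RHS, which is the caveat raised above about Paperback being common everywhere.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)

def confidence(lhs, rhs, transactions):
    """Estimate P(rhs | lhs) as count(lhs and rhs) / count(lhs)."""
    lhs_count = count_containing(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    both = count_containing(set(lhs) | set(rhs), transactions)
    return both / lhs_count

# Hypothetical toy data, not the author's dataset.
transactions = [
    {"Paperback", "Crime-Thriller", "4"},
    {"Paperback", "Crime-Thriller", "4"},
    {"Hardback", "Crime-Thriller", "4"},
    {"Paperback", "Romance", "4.5"},
]

conf = confidence({"4", "Crime-Thriller"}, {"Paperback"}, transactions)

# Baseline: how often the RHS occurs overall, ignoring the LHS.
baseline = count_containing({"Paperback"}, transactions) / len(transactions)
```

In this toy data the confidence (2 of 3) is lower than the baseline share of Paperback (3 of 4), so a seemingly high confidence would say little about the LHS here. Comparing confidence against the baseline is exactly what lift formalizes.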
Lift measures how strongly two events depend on each other. If lift equals 1, the two events are independent; a lift greater than 1 means they occur together more often than independence would predict. Among the top 15 rules by lift, {Food-Drink} => {Hardback} has the highest lift, 2.84, which indicates that Food-Drink and Hardback are strongly associated: books in the Food-Drink category are 2.84 times as likely to be hardback as books overall. All lift values here are greater than 1, which means that the items in each of these rules are positively associated.
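Lift can be sketched as the ratio between the observed co-occurrence and the co-occurrence expected under independence. The Python example below is illustrative only; the toy transactions and the function name `lift` are assumptions, not part of the original analysis.

```python
def lift(lhs, rhs, transactions):
    """support(lhs and rhs) / (support(lhs) * support(rhs)), computed from counts."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    lhs_count = sum(lhs <= set(t) for t in transactions)
    rhs_count = sum(rhs <= set(t) for t in transactions)
    both_count = sum((lhs | rhs) <= set(t) for t in transactions)
    if lhs_count == 0 or rhs_count == 0:
        raise ValueError("LHS or RHS never occurs, so lift is undefined")
    # (both/n) / ((lhs/n) * (rhs/n)) simplifies to both * n / (lhs * rhs).
    return both_count * n / (lhs_count * rhs_count)

# Hypothetical toy data, not the author's dataset.
transactions = [
    {"Hardback", "Food-Drink", "4.5"},
    {"Hardback", "Food-Drink", "4"},
    {"Paperback", "Romance", "4"},
    {"Paperback", "Crime-Thriller", "4"},
]

food_hardback = lift({"Food-Drink"}, {"Hardback"}, transactions)
hardback_food = lift({"Hardback"}, {"Food-Drink"}, transactions)
```

Like support, lift is symmetric in the LHS and RHS, so both calls return the same value. In the toy data every Food-Drink book is a hardback while only half of all books are, which gives a lift of 2.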
In the scatter plot for 10 rules, most of the rules have high lift and low support, which indicates that the associations between the items are strong, but the probability that the items occur together is low.
In the grouped matrix for 10 rules, the largest and darkest circle shows that {Food-Drink} => {Hardback} has both the highest lift and the highest support. This indicates that Food-Drink and Hardback are strongly associated, which matches the top rules by lift. In addition, Food-Drink and Hardback have the highest probability of occurring together among these rules.
The interactive graph shows the relationships between vertices. The items fall mainly into two groups, one around Paperback and another around Hardback. In the Paperback group, Graphic-Novels-Anime-Manga and 4.5 are strongly associated with Paperback. In the Hardback group, Humour, Home-Garden and Food-Drink are strongly associated with Hardback, which identifies the three main categories of books in hardback format. Among them, Food-Drink has the strongest association with Hardback, which agrees with the top rules by lift and the grouped matrix.
The Network D3 graph shows two groups of items, which indicates that there are two sets of items that frequently occur together. As before, one group is centered on Paperback and the other on Hardback. The items connected to each format are the book categories and ratings that frequently occur together with it. For example, Humour, Transport, Food-Drink, Home-Garden and 4.5 frequently occur together with Hardback, while Romance, Crime-Thriller and 4 frequently occur together with Paperback.
Association rule mining is a useful way to explore the relationships between book formats, book categories and ratings. After using Apriori to generate rules, three measures (support, confidence and lift) are used to evaluate them. High support and confidence indicate that a rule is frequent and reliable, while lift must be greater than 1 for the rule to reflect a positive association rather than the overall popularity of the right-hand side.
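The evaluation step summarized above (keep rules with lift above 1, then rank them by one of the three measures) could look like the following Python sketch. The rule records, their values and the helper `top_rules` are hypothetical and made up for illustration; they are not results from this analysis.

```python
# Hypothetical rules with made-up measure values, for illustration only.
rules = [
    {"lhs": ("Item-A",), "rhs": ("Item-X",), "support": 0.40, "confidence": 0.90, "lift": 1.05},
    {"lhs": ("Item-B",), "rhs": ("Item-Y",), "support": 0.05, "confidence": 0.60, "lift": 2.50},
    {"lhs": ("Item-C",), "rhs": ("Item-X",), "support": 0.30, "confidence": 0.70, "lift": 0.95},
    {"lhs": ("Item-D",), "rhs": ("Item-Y",), "support": 0.10, "confidence": 0.80, "lift": 1.80},
]

def top_rules(rules, by, n=15, min_lift=1.0):
    """Keep rules with lift above `min_lift`, then return the top `n` ranked by `by`."""
    kept = [r for r in rules if r["lift"] > min_lift]
    return sorted(kept, key=lambda r: r[by], reverse=True)[:n]

by_lift = top_rules(rules, by="lift", n=2)
by_support = top_rules(rules, by="support", n=2)
```

Ranking by different measures surfaces different rules: the rule with the highest support in this toy list has a lift barely above 1, while the rule with the highest lift has the lowest support.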
From the analysis above, books on Amazon fall mainly into two formats, paperback and hardback. The categories strongly associated with paperback are Crime-Thriller, Romance and Graphic-Novels-Anime-Manga, which indicates that books chosen from these categories are more likely to be paperback. The categories strongly associated with hardback are Humour, Home-Garden and Food-Drink, which indicates that books chosen from these categories are more likely to be hardback. As for ratings, 4 is strongly associated with paperback and 4.5 is strongly associated with hardback, which indicates that paperback books mainly have a rating of 4 and hardback books mainly have a rating of 4.5. In other words, hardback books tend to be rated higher than paperback books.