The Apriori algorithm is a data mining technique for identifying important patterns in large databases. It is particularly effective at uncovering relationships among items in transactional data, such as retail shopping cart records. Built around the concept of “frequent itemsets”, Apriori helps analysts gain meaningful insight into consumer behaviour.
What is the Apriori Algorithm?
The Apriori algorithm is a core data mining approach for association rule learning: it discovers frequent itemsets and identifies relationships between items in large transactional databases.
Rakesh Agrawal and Ramakrishnan Srikant introduced the algorithm in 1994. It relies on the “Apriori property” to shrink the search space: every subset of a frequent itemset must itself be frequent.
Apriori is central to association rule mining because it uncovers rules of the form “a purchase of item A leads to a purchase of item B”, evaluated with support, confidence, and lift metrics.
The method is widely applied in market basket analysis, recommendation systems, and other domains that require identifying patterns in how items occur together.
Key Concepts and Terminology
1. Itemset:
An itemset is a collection of one or more items treated as a single unit. In an e-commerce context, an example itemset would be {laptop, wireless mouse}.
2. Support:
Support is the proportion of transactions in the dataset that contain the itemset. For example, if the itemset {laptop, wireless mouse} occurs in 30 out of 1,000 transactions, its support is 3%.
3. Confidence:
Confidence measures how often itemset Y appears in transactions that contain itemset X. The rule “laptop ⇒ wireless mouse” has a confidence of 60% if 3% of transactions contain both items and 5% contain a laptop (0.03 / 0.05 = 0.6).
4. Lift:
Lift measures how much more likely item Y is to be purchased when item X is purchased, compared with what would be expected by chance. A lift of 1.5 for the rule “laptop ⇒ wireless mouse” means that shoppers who buy a laptop are 50% more likely to buy a wireless mouse than random chance would suggest.
5. Apriori Property:
The Apriori property states that every subset of a frequent itemset must also be frequent. The algorithm exploits this property to prune candidate itemsets that contain any infrequent subset, dramatically reducing the search space.
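To make these metrics concrete, here is a minimal Python sketch that computes support, confidence, and lift from a list of transactions. The four transactions are purely illustrative and are not part of the worked examples later in this article.

```python
# Minimal sketch of the three metrics defined above; the transactions are illustrative.
transactions = [
    {"laptop", "wireless mouse", "keyboard"},
    {"laptop", "wireless mouse"},
    {"keyboard", "wireless mouse"},
    {"laptop"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by the support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"laptop", "wireless mouse"}, transactions))       # 0.5
print(confidence({"laptop"}, {"wireless mouse"}, transactions))  # ≈ 0.67
print(lift({"laptop"}, {"wireless mouse"}, transactions))        # ≈ 0.89
```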
How the Apriori Algorithm Works: Step-by-Step Breakdown
The Apriori algorithm follows a systematic approach to discover frequent itemsets and generate association rules from transactional data. Here’s an overview of how the algorithm works:
1. Set Parameters: Minimum Support and Confidence
Before starting, the algorithm requires the user to define two key thresholds:
- Minimum Support: This is the frequency with which an itemset must appear in the dataset to be considered frequent.
- Minimum Confidence: This defines the likelihood that a rule will hold true, based on the frequent itemsets.
2. Identify Frequent Itemsets (1-itemsets)
The algorithm first scans the entire dataset to find individual items (1-itemsets) that meet the minimum support threshold. These items are considered frequent.
Example:
In a retail store dataset, let’s say we have the following transactions:
- Transaction 1: {laptop, wireless mouse, keyboard}
- Transaction 2: {laptop, wireless mouse}
- Transaction 3: {keyboard, wireless mouse}
If the minimum support threshold is 50%, we check how often each item appears. In this case, {laptop}, {wireless mouse}, and {keyboard} are frequent 1-itemsets.
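This first pass can be sketched in a few lines of Python. The code below counts each item across the three transactions above and keeps those meeting the 50% support threshold; it is a simplified illustration rather than a full implementation.

```python
from collections import Counter

# The three example transactions listed above.
transactions = [
    {"laptop", "wireless mouse", "keyboard"},
    {"laptop", "wireless mouse"},
    {"keyboard", "wireless mouse"},
]
min_support = 0.5

# First database scan: count every individual item.
item_counts = Counter(item for t in transactions for item in t)

# Keep the items whose support meets the threshold (the frequent 1-itemsets).
frequent_1 = {item: count / len(transactions)
              for item, count in item_counts.items()
              if count / len(transactions) >= min_support}
print(frequent_1)  # laptop ≈ 0.67, wireless mouse = 1.0, keyboard ≈ 0.67
```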
3. Generate Candidate Itemsets (k-itemsets)
The next step involves forming pairs of items (2-itemsets) by combining frequent 1-itemsets. This process is repeated to generate 3-itemsets, 4-itemsets, and so on, until no more frequent itemsets can be formed. The candidate itemsets are then validated against the support threshold.
Example:
- For 2-itemsets, the algorithm evaluates the candidate pairs {laptop, wireless mouse}, {laptop, keyboard}, and {wireless mouse, keyboard}.
- Each pair is checked against the minimum support threshold; for example, {laptop, wireless mouse} appears in 2 out of 3 transactions (support ≈ 66.7%), so it is frequent.
4. Prune Non-Frequent Itemsets Using the Apriori Property
As per the Apriori property, an itemset can only be frequent if all of its subsets are frequent. During pruning, the algorithm therefore discards any candidate itemset that contains an infrequent subset.
Example:
If {laptop, keyboard} were a candidate itemset and its subset {keyboard} were not frequent, then {laptop, keyboard} would be discarded without counting its support, reducing unnecessary checks.
5. Repeat for Higher-Order Itemsets
The process is repeated for higher-order itemsets (3-itemsets, 4-itemsets, etc.). The algorithm continues generating combinations and checking their support until no further frequent itemsets can be found.
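One level of this loop can be sketched as follows. The function names are illustrative, and the candidate step simply combines the surviving items rather than using the exact prefix-join from the original paper, but the Apriori pruning check is the same idea.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Form k-item candidates from the items in the frequent (k-1)-itemsets,
    keeping only those whose (k-1)-subsets are all frequent (Apriori pruning)."""
    items = sorted({item for itemset in frequent_prev for item in itemset})
    return [frozenset(combo) for combo in combinations(items, k)
            if all(frozenset(sub) in frequent_prev for sub in combinations(combo, k - 1))]

def frequent_at_level(candidates, transactions, min_support):
    """Count each candidate and keep those meeting the support threshold."""
    n = len(transactions)
    supports = {c: sum(c <= t for t in transactions) / n for c in candidates}
    return {c: s for c, s in supports.items() if s >= min_support}

transactions = [
    {"laptop", "wireless mouse", "keyboard"},
    {"laptop", "wireless mouse"},
    {"keyboard", "wireless mouse"},
]
frequent_1 = {frozenset({"laptop"}), frozenset({"wireless mouse"}), frozenset({"keyboard"})}
print(frequent_at_level(generate_candidates(frequent_1, 2), transactions, 0.5))
# Only {laptop, wireless mouse} and {wireless mouse, keyboard} survive, as in the example above.
```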
6. Generate Association Rules
Once all frequent itemsets have been found, the next step is to generate association rules. These rules take the form:
{Itemset A} ⇒ {Itemset B}
The algorithm calculates the confidence for each rule to determine how likely item B is to be purchased when item A is purchased.
Example:
- Rule: If a customer buys a laptop, they will also buy a wireless mouse.
- Confidence calculation: In the dataset above, 2 transactions contain a laptop, and both of them also contain a wireless mouse, so the confidence of this rule is 2/2 = 100%.
7. Filter by Confidence Threshold
Finally, the algorithm filters out any rules that don’t meet the minimum confidence threshold.
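The rule-generation and confidence-filtering steps can be sketched as below. For simplicity, this version only produces rules with a single item on the right-hand side; a full implementation would consider every way of splitting each frequent itemset.

```python
def generate_rules(frequent_itemsets, transactions, min_confidence):
    """Split each frequent itemset into antecedent => consequent and keep the
    rules whose confidence meets the threshold (single-item consequents only)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - frozenset([item])
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), {item}, round(conf, 3)))
    return rules

transactions = [
    {"laptop", "wireless mouse", "keyboard"},
    {"laptop", "wireless mouse"},
    {"keyboard", "wireless mouse"},
]
print(generate_rules([frozenset({"laptop", "wireless mouse"})], transactions, 0.6))
# Both {laptop} => {wireless mouse} (confidence 1.0) and
# {wireless mouse} => {laptop} (confidence ≈ 0.667) pass the 60% threshold.
```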
Example:
Let’s say we have a retail store with the following transactions:
| Transaction | Items |
| --- | --- |
| 1 | {laptop, wireless mouse, keyboard} |
| 2 | {laptop, wireless mouse} |
| 3 | {keyboard, wireless mouse} |
Step 1: Identify frequent 1-itemsets:
- {laptop}: Appears in 2 out of 3 transactions (support = 66.7%).
- {wireless mouse}: Appears in 3 out of 3 transactions (support = 100%).
- {keyboard}: Appears in 2 out of 3 transactions (support = 66.7%).
Step 2: Generate candidate 2-itemsets:
- {laptop, wireless mouse}: Appears in 2 out of 3 transactions (support = 66.7%).
- {laptop, keyboard}: Appears in 1 out of 3 transactions (support = 33.3%)—discarded.
- {wireless mouse, keyboard}: Appears in 2 out of 3 transactions (support = 66.7%).
Step 3: Generate candidate 3-itemsets:
- {laptop, wireless mouse, keyboard}: Appears in 1 out of 3 transactions (support = 33.3%)—discarded.
Step 4: Generate association rules:
- {laptop} ⇒ {wireless mouse}: Confidence = 2/2 = 100%.
- {wireless mouse} ⇒ {laptop}: Confidence = 2/3 = 66.7%.
In this example, the frequent itemsets are {laptop}, {wireless mouse}, {keyboard}, {laptop, wireless mouse}, and {wireless mouse, keyboard}. The association rule {laptop} ⇒ {wireless mouse} has high confidence, indicating a strong relationship between the two items.
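If you would rather not implement the algorithm by hand, libraries can reproduce this example in a few lines. The sketch below assumes the open-source mlxtend package (along with pandas) is installed; it is one common choice for running Apriori in Python, not the only one.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The three retail transactions from the example above.
transactions = [
    ["laptop", "wireless mouse", "keyboard"],
    ["laptop", "wireless mouse"],
    ["keyboard", "wireless mouse"],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
one_hot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                       columns=encoder.columns_)

# Frequent itemsets at 50% support, then rules filtered at 60% confidence.
frequent_itemsets = apriori(one_hot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(frequent_itemsets)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```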
Want to dive deeper into how the Apriori algorithm uncovers valuable patterns in data? Explore this free Apriori Algorithm course by Great Learning and get hands-on experience with real-world applications!
Apriori Algorithm Example: Music Dataset
Sample Transaction Dataset:
Each transaction consists of a list of albums purchased by a customer.
| Transaction ID | Items Purchased |
| --- | --- |
| T1 | {Rock, Jazz} |
| T2 | {Jazz, Pop, Classical} |
| T3 | {Rock, Pop} |
| T4 | {Jazz, Rock, Pop, Classical} |
| T5 | {Classical} |
| T6 | {Rock, Jazz, Pop} |
| T7 | {Jazz} |
| T8 | {Rock, Pop, Jazz} |
Step 1: Calculate Support for 1-itemsets
Support for an itemset is the fraction of transactions that contain the itemset.
Support Formula: Support(X) = (Number of transactions containing X) / (Total number of transactions)
Let’s calculate the support for each individual album (1-itemset).
| Item | Count | Support |
| --- | --- | --- |
| Rock | 5 | 5/8 = 0.625 |
| Jazz | 6 | 6/8 = 0.75 |
| Pop | 5 | 5/8 = 0.625 |
| Classical | 3 | 3/8 = 0.375 |
Assume the minimum support threshold is 0.5. Based on this threshold, Classical will be filtered out, as its support is less than 0.5.
Frequent 1-itemsets:
- Rock, Jazz, Pop
Step 2: Generate 2-itemset Candidates
Now, we will form all possible pairs of frequent 1-itemsets and calculate their support. For example, the pairings are (Rock, Jazz), (Rock, Pop), and so on.
Support Calculation for 2-itemsets:
| Itemset | Count | Support |
| --- | --- | --- |
| {Rock, Jazz} | 4 | 4/8 = 0.5 |
| {Rock, Pop} | 4 | 4/8 = 0.5 |
| {Jazz, Pop} | 4 | 4/8 = 0.5 |
Frequent 2-itemsets:
- {Rock, Jazz}, {Rock, Pop}, {Jazz, Pop}
Step 3: Prune and Repeat
Now, we generate 3-itemsets from the frequent 2-itemsets. We can only generate combinations from those 2-itemsets that have appeared frequently enough (those that pass the support threshold).
Let’s generate the 3-itemsets from the frequent 2-itemsets and calculate their support.
| Itemset | Count | Support |
| --- | --- | --- |
| {Rock, Jazz, Pop} | 3 | 3/8 = 0.375 |
Since the support for {Rock, Jazz, Pop} is less than the minimum support of 0.5, we prune this itemset.
There are no more frequent 3-itemsets that meet the minimum support threshold.
Frequent Itemsets:
- {Rock}, {Jazz}, {Pop}
- {Rock, Jazz}, {Rock, Pop}, {Jazz, Pop}
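As a quick check, the brute-force sketch below recomputes the support of every 1-, 2-, and 3-item combination in the eight transactions above and keeps those meeting the 0.5 threshold. It skips the level-wise candidate generation for brevity, so it is a verification aid rather than an efficient implementation.

```python
from itertools import combinations

# The eight music transactions from the table above.
transactions = [
    {"Rock", "Jazz"}, {"Jazz", "Pop", "Classical"}, {"Rock", "Pop"},
    {"Rock", "Jazz", "Pop", "Classical"}, {"Classical"}, {"Rock", "Jazz", "Pop"},
    {"Jazz"}, {"Rock", "Jazz", "Pop"},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

items = sorted({item for t in transactions for item in t})
for k in (1, 2, 3):
    frequent_k = {combo: support(frozenset(combo))
                  for combo in combinations(items, k)
                  if support(frozenset(combo)) >= min_support}
    print(k, frequent_k)
# k=1: Rock, Jazz, Pop; k=2: {Rock, Jazz}, {Rock, Pop}, {Jazz, Pop}; k=3: none.
```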
Step 4: Generate Association Rules
Now we generate association rules from the frequent itemsets. For example, let’s create rules from the frequent 2-itemsets.
Rule Generation: We will focus on one rule: {Rock} → {Jazz}.
To calculate the confidence and lift of the rule:
- Confidence Formula: Confidence(A → B) = Support(A ∪ B) / Support(A)
- Lift Formula: Lift(A → B) = Confidence(A → B) / Support(B)
Let’s calculate the confidence and lift for the rule {Rock} → {Jazz}.
- Support({Rock, Jazz}) = 4/8 = 0.5
- Support({Rock}) = 5/8 = 0.625
- Support({Jazz}) = 6/8 = 0.75
Confidence({Rock} → {Jazz}):
Confidence = Support({Rock, Jazz}) / Support({Rock}) = 0.5 / 0.625 = 0.8
Lift({Rock} → {Jazz}):
Lift = Confidence({Rock} → {Jazz}) / Support({Jazz}) = 0.8 / 0.75 ≈ 1.07
This rule suggests that when a customer purchases Rock, there is an 80% chance they will also purchase Jazz, and the lift of 1.07 indicates that Rock and Jazz are slightly more likely to be bought together than by chance.
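The same numbers can be checked with a couple of lines of Python, using only the supports computed above.

```python
# Verify the confidence and lift of the rule {Rock} -> {Jazz}.
support_rock_jazz = 4 / 8   # Support({Rock, Jazz}) = 0.5
support_rock = 5 / 8        # Support({Rock}) = 0.625
support_jazz = 6 / 8        # Support({Jazz}) = 0.75

confidence = support_rock_jazz / support_rock   # 0.8
lift = confidence / support_jazz                # ≈ 1.07
print(confidence, round(lift, 2))
```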
Visualization of the Process
| Step | Itemsets Generated | Pruning / Filtering Result |
| --- | --- | --- |
| Step 1 | {Rock}, {Jazz}, {Pop}, {Classical} | Prune {Classical} (Support < 0.5) |
| Step 2 | {Rock, Jazz}, {Rock, Pop}, {Jazz, Pop} | No pruning needed |
| Step 3 | {Rock, Jazz, Pop} | Prune {Rock, Jazz, Pop} (Support < 0.5) |
| Step 4 (Rule) | {Rock} → {Jazz} | Confidence = 0.8, Lift = 1.07 |
This example highlights how to use the Apriori algorithm for association rule mining in a music dataset. Frequent itemsets are identified through support calculations, and rules are generated using confidence and lift to reveal associations between albums that are often bought together.
Apriori Algorithm: Advantages vs Disadvantages
| Advantages | Disadvantages |
| --- | --- |
| 1. Easy to Understand and Implement: Follows a clear, step-by-step approach based on set theory. | 1. High Computational Cost: Needs multiple database scans, making it slow for large datasets. |
| 2. Intuitive Logic: Based on a natural concept of frequent itemsets and association rules. | 2. Generates Numerous Candidates: Can lead to an explosion of itemsets and memory usage. |
| 3. Interpretable Results: Produces human-readable rules that are easy to apply in decision-making. | 3. Requires Manual Thresholds: Setting appropriate support/confidence values can be tricky. |
| 4. Ideal for Market Basket Analysis: Widely used in retail, e-commerce, and recommendation systems. | 4. Poor Performance on Dense/High-Dimensional Data: Struggles when the data has many frequent patterns. |
| 5. Strong Foundation for Learning Concepts: Useful for teaching core data mining principles. | 5. Scalability Issues: Not suitable for real-time or massive-scale applications without optimisation. |
Tips for Efficient Use of the Apriori Algorithm
- Set Proper Levels For Support And Confidence Parameters
Choose minimum support and confidence thresholds carefully: they determine which patterns are considered significant and directly control the amount of processing required.
- Leverage the Apriori Property for Pruning
Because every subset of a frequent itemset must itself be frequent, candidate itemsets containing an infrequent subset can be eliminated early, avoiding unnecessary computation.
- Employ Hash-Based Techniques
Hash-based techniques count itemset occurrences in buckets, which helps reduce the number of candidate itemsets that must be checked against the support threshold.
- Apply Transaction Reduction Strategies
After each pass, eliminate transactions that contain no frequent itemsets, shrinking the dataset for the following rounds of processing.
- Consider Alternative Algorithms for Large Datasets
When working with large or complex datasets, consider FP-Growth, which avoids explicit candidate generation and typically performs better (a brief library-based sketch follows this list).
- Utilize Parallel Processing
Parallel computing can significantly reduce execution time when the algorithm is applied to large datasets, for example by distributing the support-counting work across partitions of the data.
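As an illustration of the FP-Growth suggestion above, the sketch below assumes the mlxtend package is installed; its fpgrowth function exposes the same interface as its apriori function, so it can serve as a drop-in replacement on larger datasets.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Same toy retail transactions as earlier; FP-Growth avoids explicit candidate
# generation, which is what makes it attractive for larger or denser datasets.
transactions = [
    ["laptop", "wireless mouse", "keyboard"],
    ["laptop", "wireless mouse"],
    ["keyboard", "wireless mouse"],
]
encoder = TransactionEncoder()
one_hot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                       columns=encoder.columns_)
print(fpgrowth(one_hot, min_support=0.5, use_colnames=True))
```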
Conclusion
The Apriori algorithm is an essential data mining method for finding frequent itemsets and association rules in large datasets, most notably in market basket analysis.
By revealing relationships between items, it helps organisations develop better strategies and make more informed business decisions.
To deepen your understanding of such techniques and enhance your career in data science, consider exploring the PG Program in Data Science and Business Analytics at Great Learning.