BigBasket Products analysis

Loading database

library(tidyverse)

# loading database without index and desc column
df <- read.csv("BigBasket Products.csv") %>% select(-index,-description)

# factor distinct columns
df <- df %>% mutate(category=as_factor(category),
              sub_category=as_factor(sub_category),
              brand=as_factor(brand),
              type=as_factor(type))
# adding discount column
df <- df %>% mutate(discount=round((market_price-sale_price)/market_price, digits = 3))

Summary the data

summary(df)

##    product                              category   
##  Length:27555       Beauty & Hygiene        :7867  
##  Class :character   Gourmet & World Food    :4690  
##  Mode  :character   Kitchen, Garden & Pets  :3580  
##                     Snacks & Branded Foods  :2814  
##                     Foodgrains, Oil & Masala:2676  
##                     Cleaning & Household    :2675  
##                     (Other)                 :3253  
##                 sub_category                brand         sale_price      
##  Skin Care            : 2294   Fresho          :  638   Min.   :    2.45  
##  Health & Medicine    : 1133   bb Royal        :  539   1st Qu.:   95.00  
##  Hair Care            : 1028   BB Home         :  428   Median :  190.00  
##  Storage & Accessories: 1015   DP              :  250   Mean   :  322.51  
##  Fragrances & Deos    : 1000   Fresho Signature:  171   3rd Qu.:  359.00  
##  Bath & Hand Wash     :  996   bb Combo        :  168   Max.   :12500.00  
##  (Other)              :20089   (Other)         :25361                     
##   market_price                        type           rating     
##  Min.   :    3.0   Face Care            : 1508   Min.   :1.000  
##  1st Qu.:  100.0   Ayurveda             :  538   1st Qu.:3.700  
##  Median :  220.0   Men's Deodorants     :  500   Median :4.100  
##  Mean   :  382.1   Shampoo & Conditioner:  461   Mean   :3.943  
##  3rd Qu.:  425.0   Containers Sets      :  415   3rd Qu.:4.300  
##  Max.   :12500.0   Glassware            :  415   Max.   :5.000  
##                    (Other)              :23718   NA's   :8626   
##     discount     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0500  
##  Mean   :0.1182  
##  3rd Qu.:0.2000  
##  Max.   :0.8370  
##

% of NA’s in rating category

df %>% summarise(perc_of_na_rating = sum(is.na(rating))/n()*100)

##   perc_of_na_rating
## 1          31.30466

About 31% of rating data don’t have value.

Data structure

str(df)

## 'data.frame':    27555 obs. of  9 variables:
##  $ product     : chr  "Garlic Oil - Vegetarian Capsule 500 mg" "Water Bottle - Orange" "Brass Angle Deep - Plain, No.2" "Cereal Flip Lid Container/Storage Jar - Assorted Colour" ...
##  $ category    : Factor w/ 11 levels "Beauty & Hygiene",..: 1 2 3 3 1 3 1 1 1 3 ...
##  $ sub_category: Factor w/ 90 levels "Hair Care","Storage & Accessories",..: 1 2 3 4 5 6 7 5 1 8 ...
##  $ brand       : Factor w/ 2314 levels "Sri Sri Ayurveda ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ sale_price  : num  220 180 119 149 162 ...
##  $ market_price: num  220 180 250 176 162 ...
##  $ type        : Factor w/ 426 levels "Hair Oil & Serum",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ rating      : num  4.1 2.3 3.4 3.7 4.4 3.3 3.6 4 3.5 4.3 ...
##  $ discount    : num  0 0 0.524 0.153 0 0.151 0 0 0 0 ...

The most popular rating is about 4.

df %>% ggplot(aes(x=rating, fill=..x..))+
  geom_histogram(show.legend = F)+
  scale_fill_gradient(low="red", high="green")

What is the share of each category

df %>% 
  group_by(category) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count)) %>% 
  arrange(count)%>% 
  ggplot(aes(y=fct_reorder(category, count), x=perc*100, fill=category))+
  geom_col(show.legend = F)+
  labs(x="Percent", y="", title = "Percentage of a category")+
  theme_minimal()

About 30% items are from Beauty category.

Which brands have the best rating?

df2 <- df %>% group_by(brand) %>% 
  filter(!is.na(rating)) %>% 
  summarise(mean_rating = mean(rating), n_element = n()) %>% 
  arrange(desc(mean_rating))
# top rating brands which have more than 4 products
df2 %>% filter(n_element>4) %>% head(10) %>% select(-n_element)

## # A tibble: 10 x 2
##    brand                 mean_rating
##    <fct>                       <dbl>
##  1 Sangam Sweets                4.83
##  2 Kaadoo                       4.81
##  3 Bobs Red Mill                4.79
##  4 Nimyle                       4.78
##  5 Continent Spice Khadi        4.76
##  6 Pine Sol                     4.76
##  7 The Whole Truth              4.75
##  8 SOVI                         4.66
##  9 Pyrex                        4.63
## 10 Tabasco                      4.62

# top rating brands which have more than 9 products
df2 %>% filter(n_element>9) %>% head(10) %>% select(-n_element)

## # A tibble: 10 x 2
##    brand               mean_rating
##    <fct>                     <dbl>
##  1 "Hi-Luxe"                  4.57
##  2 "VAHDAM"                   4.49
##  3 "Morpheme Remedies"        4.46
##  4 "Papergrid"                4.43
##  5 "Whiskas"                  4.42
##  6 "Parachute "               4.41
##  7 "Tide"                     4.4 
##  8 "BB Home Herbal"           4.39
##  9 "Laopala Diva"             4.38
## 10 "Selpak"                   4.38

Discounts for each category?

df %>% ggplot(aes(y=category, x=discount, color=category)) +
  geom_boxplot(show.legend=F)

The vast majority of discounts are from 0 to 20%.

We can see that there are suspiciously many values around 0.2 in the Fruits & Vegetables category.

df %>% filter(category=="Fruits & Vegetables") %>% 
  summarise(discout_equal_0.2 =sum(discount==0.2)/n())

##   discout_equal_0.2
## 1         0.8743268

87% of products from the Fruits & Vegetables category have 20% discount.

Does discounting affect product rating?

df %>% na.omit(rating) %>% 
ggplot(aes(discount, rating))+
geom_point()+
geom_smooth()+
facet_wrap(vars(category))

BigBasket Products analysis

Patryk Tokarski

2022-07-05

Loading database

Summary the data

% of NA’s in rating category

About 31% of rating data don’t have value.

Data structure

The most popular rating is about 4.

Which brands have the best rating?

Discounts for each category?

The vast majority of discounts are from 0 to 20%.

We can see that there are suspiciously many values around 0.2 in the Fruits & Vegetables category.

87% of products from the Fruits & Vegetables category have 20% discount.

Does discounting affect product rating?

There is huge spread and it seems that discounting doesn’t affect rating. For each category rating is about 4.