I have been interested for awhile now in selling goods through FBA. The idea is simple: Buy cheap products from Alibaba, have them shipped to an Amazon warehouse, create a listing on Amazon for the product, and profit. There are many articles and blogs on the internet on how to do this successfully. There are also data mining software packages that allow you to see what is being sold on Amazon. I thought is would be interesting to see what has been sold in the electronics department over the last month ending on 7/16/2018. I gathered this data from Jungle Scout and to my knowledge is data has not been studied before. The objective is two fold:
1. Is it worth it to sell electronics on Amazon?
2. Can data science be used to determine what the best product is to be sold on Amazon?
One of the main goals of a data scientist is to look at a dataset and determine what value can be extracted from it. This is done though conducting exploratory data analysis and looking for correlations in the data.
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import numpy as np
import nltk
#from nltk.corpus import stopwords
%matplotlib inline
currdir = os.listdir()
path = 'Jungle Scout CSV Export - Mon Jul 16 2018 18_08_38 GMT-0500 (Central Daylight Time).csv'
df = pd.read_csv(path, skiprows=2, index_col=False)
df.head()
We have several csv files that need to be combined into one file so we will do a list comprehension to append the files into a single file:
df = []
for i in glob.glob("*.csv"):
data = pd.read_csv(i, skiprows=2, index_col=False)
df.append(data)
df = pd.concat(df)
df.describe()
The describe function looks at all continuous columns in the dataframe and gives statistics about the columns. The estimated monthly revenue can gives a positive outlook on the sales for the last month. The average sales were \$13,895. Not bad for a side gig. However, when looking at the 50th percentile we can see that the average sales are \$3,104 per month which indicates that the top sellers get the vast majority of the sales. The average product has 8 sellers competing to sell the product.
The names for the columns are a bit off so let's correct them:
df = df.rename(index=str, columns={" Product Name": "Product_Name", " Brand": "Brand", " Seller": "Seller", " Category": "Category", " Price": "Price", " Fees": "Fees", " Net": "Net", " Weight (lbs)": "Weight_lbs", " Product Tier": "Product_Tier", " Reviews": "Reviews", " Avg. Rating": "Avg_Rating", " Rank": "Rank", " Est. Monthly Sales": "Est_Monthly_Sales", " Est. Monthly Revenue": "Est_Monthly_Revenue", " LQS": "LQS", " Number Sellers": "Number_Sellers"})
df.info()
Looking at the info for the data frame we can see that there are several objects that should be numerical values. This indicates that there are possible NA values that are not represented as nan values that need to be cleaned and corrected. Price, Fees, Net, Weight, Reviews, and Rating are all objects that should be numerical values.
df.columns
We should drop category since the category is electronics.
df = df.drop(['Category'], axis = 1)
The dataframe currently represents nan's as N.A. We need to change that to nan's.
df['Price'] = pd.to_numeric(df.Price, errors='coerce')
df['Price'] = df['Price'].astype('float64', errors = 'raise')
df['Fees'] = pd.to_numeric(df.Fees, errors='coerce')
df['Fees'] = df['Fees'].astype('float64', errors = 'raise')
df['Net'] = pd.to_numeric(df.Net, errors='coerce')
df['Net'] = df['Net'].astype('float64', errors = 'raise')
df['Weight_lbs'] = pd.to_numeric(df.Weight_lbs, errors='coerce')
df['Weight_lbs'] = df['Weight_lbs'].astype('float64', errors = 'raise')
df['Reviews'] = pd.to_numeric(df.Reviews, errors='coerce')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Avg_Rating'] = pd.to_numeric(df.Avg_Rating, errors='coerce')
df['Avg_Rating'] = df['Avg_Rating'].astype('float64', errors = 'raise')
df.info()
df.describe()
df.head()
Let's now look at the number of N.A's in the object columns:
ASIN_nan = len(df[df.ASIN.str.contains("N.A.") == True])
Product_Name_nan = len(df[df.Product_Name.str.contains("N.A.") == True])
Seller_nan = len(df[df.Seller.str.contains("N.A.") == True])
Brand_nan = len(df[df.Brand.str.contains("N.A.") == True])
Product_Tier_nan = len(df[df.Product_Tier.str.contains("N.A.") == True])
print("Nans: ASIN:", ASIN_nan, "Product Name:", Product_Name_nan, "Brand:", Brand_nan, "Product Tier:", Product_Tier_nan, "Seller:", Seller_nan)
I do not believe that the ASIN is important because I believe it is an arbitrary number chosen by Amazon. The product name is important since it is the descriptor of what is being sold. The brand is less important and can stay as N.A. and the product tier is either standard or oversized and it is not a critical variable so it can be estimated for the N.A.'s.
I would like to look at a histogram of the distribution of the sales on Amazon:
bins = np.linspace(1000, df.Est_Monthly_Revenue.max(), 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()
The histogram above, which at first glance looks incorrect, shows that there are a few products that gross 8 million dollars in monthly revenue which make the histogram very unbalanced. Let's shrink the histogram axis so we can see the majority of sellers.
bins = np.linspace(1000, 50000, 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()
Even after cutting down the x axis we can still see that selling electronics on Amazon favors heavily the top 1%.
df.describe(percentiles=(np.linspace(1, 99, 25)/100))
Let's look at the top 10 selling products:
df.sort_values(by=['Est_Monthly_Revenue']).tail(10)
And the highest grossing product is Fortinet, a web app firewall. Next, is a smart TV and then a watch made by Garmin.