What is selling on Amazon?

I have been interested for awhile now in selling goods through FBA. The idea is simple: Buy cheap products from Alibaba, have them shipped to an Amazon warehouse, create a listing on Amazon for the product, and profit. There are many articles and blogs on the internet on how to do this successfully. There are also data mining software packages that allow you to see what is being sold on Amazon. I thought is would be interesting to see what has been sold in the electronics department over the last month ending on 7/16/2018. I gathered this data from Jungle Scout and to my knowledge is data has not been studied before. The objective is two fold:

1. Is it worth it to sell electronics on Amazon?
2. Can data science be used to determine what the best product is to be sold on Amazon?

One of the main goals of a data scientist is to look at a dataset and determine what value can be extracted from it. This is done though conducting exploratory data analysis and looking for correlations in the data.

In [1]:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import numpy as np
import nltk
#from nltk.corpus import stopwords
%matplotlib inline
In [2]:
currdir = os.listdir()
In [3]:
path = 'Jungle Scout CSV Export - Mon Jul 16 2018 18_08_38 GMT-0500 (Central Daylight Time).csv'
df = pd.read_csv(path, skiprows=2, index_col=False)
df.head()
Out[3]:
ASIN Product Name Brand Seller Category Price Fees Net Weight (lbs) Product Tier Reviews Avg. Rating Rank Est. Monthly Sales Est. Monthly Revenue LQS Number Sellers
0 B0723599RQ Motorola DOCSIS 3.1 Gig-speed Cable Modem Mod... Motorola MTRLC LLC Electronics 158.00 27.9 130.1 2.0503 Standard (Large) 573 4.1 495 2539 401162.00 54 22
1 B01MSTB5KW Motorola MG7540 16x4 Cable Modem plus AC1600 D... Motorola MTRLC LLC Electronics 129.99 23.93 106.06 2.65 Standard (Large) 774 4.2 170 3855 501111.45 52 9
2 B01LXRSS36 Motorola MG7550 16x4 Cable Modem plus AC1900 D... Motorola MTRLC LLC Electronics 169.99 29.97 140.02 2.75 Standard (Large) 1149 4.3 3187 612 104033.88 53 1
3 B01JGT2JI6 Motorola MG7550 16x4 Cable Modem plus AC1900 D... Motorola Etech Galaxy Electronics 163.92 29.06 134.85999999999999 2.75 Standard (Large) 1150 4.3 1535 1280 209817.60 63 44
4 B07BRZ2KW5 Motorola MG7700 24X8 Cable Modem plus AC1900 D... Motorola MTRLC LLC Electronics 184.99 32.22 152.77 2.7492 Standard (Large) 17 4.7 2836 481 88980.19 53 2

We have several csv files that need to be combined into one file so we will do a list comprehension to append the files into a single file:

In [4]:
df = []
for i in glob.glob("*.csv"):
    data = pd.read_csv(i, skiprows=2,  index_col=False)
    df.append(data)
df = pd.concat(df)
In [5]:
df.describe()
Out[5]:
Rank Est. Monthly Sales Est. Monthly Revenue LQS Number Sellers
count 3.912200e+04 39122.000000 3.912200e+04 39122.000000 39087.000000
mean 9.151861e+04 219.173420 1.389535e+04 48.737795 8.062860
std 1.636627e+05 923.034722 8.544046e+04 13.742925 15.792671
min 1.000000e+00 1.000000 1.000000e+01 0.000000 1.000000
25% 7.230750e+03 10.000000 1.652802e+03 38.000000 1.000000
50% 2.493200e+04 42.000000 3.104305e+03 49.000000 2.000000
75% 8.108150e+04 137.000000 7.996000e+03 60.000000 7.000000
max 1.828111e+06 90709.920000 8.288314e+06 87.000000 245.000000

The describe function looks at all continuous columns in the dataframe and gives statistics about the columns. The estimated monthly revenue can gives a positive outlook on the sales for the last month. The average sales were \$13,895. Not bad for a side gig. However, when looking at the 50th percentile we can see that the average sales are \$3,104 per month which indicates that the top sellers get the vast majority of the sales. The average product has 8 sellers competing to sell the product.

The names for the columns are a bit off so let's correct them:

In [6]:
df = df.rename(index=str, columns={" Product Name": "Product_Name", " Brand": "Brand", " Seller": "Seller", " Category": "Category", " Price": "Price", " Fees": "Fees", " Net": "Net", " Weight (lbs)": "Weight_lbs", " Product Tier": "Product_Tier", " Reviews": "Reviews", " Avg. Rating": "Avg_Rating", " Rank": "Rank", " Est. Monthly Sales": "Est_Monthly_Sales", " Est. Monthly Revenue": "Est_Monthly_Revenue", " LQS": "LQS", " Number Sellers": "Number_Sellers"})
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 39122 entries, 0 to 199
Data columns (total 17 columns):
ASIN                   39122 non-null object
Product_Name           39122 non-null object
Brand                  39122 non-null object
Seller                 39122 non-null object
Category               39122 non-null object
Price                  39122 non-null object
Fees                   39122 non-null object
Net                    39122 non-null object
Weight_lbs             39122 non-null object
Product_Tier           39122 non-null object
Reviews                39122 non-null object
Avg_Rating             39122 non-null object
Rank                   39122 non-null int64
Est_Monthly_Sales      39122 non-null float64
Est_Monthly_Revenue    39122 non-null float64
LQS                    39122 non-null int64
Number_Sellers         39087 non-null float64
dtypes: float64(3), int64(2), object(12)
memory usage: 5.4+ MB

Looking at the info for the data frame we can see that there are several objects that should be numerical values. This indicates that there are possible NA values that are not represented as nan values that need to be cleaned and corrected. Price, Fees, Net, Weight, Reviews, and Rating are all objects that should be numerical values.

In [8]:
df.columns
Out[8]:
Index(['ASIN', 'Product_Name', 'Brand', 'Seller', 'Category', 'Price', 'Fees',
       'Net', 'Weight_lbs', 'Product_Tier', 'Reviews', 'Avg_Rating', 'Rank',
       'Est_Monthly_Sales', 'Est_Monthly_Revenue', 'LQS', 'Number_Sellers'],
      dtype='object')

We should drop category since the category is electronics.

In [9]:
df = df.drop(['Category'], axis = 1)

The dataframe currently represents nan's as N.A. We need to change that to nan's.

In [10]:
df['Price'] = pd.to_numeric(df.Price, errors='coerce')
df['Price'] = df['Price'].astype('float64', errors = 'raise')
df['Fees'] = pd.to_numeric(df.Fees, errors='coerce')
df['Fees'] = df['Fees'].astype('float64', errors = 'raise')
df['Net'] = pd.to_numeric(df.Net, errors='coerce')
df['Net'] = df['Net'].astype('float64', errors = 'raise')
df['Weight_lbs'] = pd.to_numeric(df.Weight_lbs, errors='coerce')
df['Weight_lbs'] = df['Weight_lbs'].astype('float64', errors = 'raise')
df['Reviews'] = pd.to_numeric(df.Reviews, errors='coerce')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Avg_Rating'] = pd.to_numeric(df.Avg_Rating, errors='coerce')
df['Avg_Rating'] = df['Avg_Rating'].astype('float64', errors = 'raise')
In [78]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 39122 entries, 0 to 199
Data columns (total 16 columns):
ASIN                   39122 non-null object
Product_Name           39122 non-null object
Brand                  39122 non-null object
Seller                 39122 non-null object
Price                  39116 non-null float64
Fees                   28077 non-null float64
Net                    28080 non-null float64
Weight_lbs             31475 non-null float64
Product_Tier           39122 non-null object
Reviews                37215 non-null float64
Avg_Rating             36592 non-null float64
Rank                   39122 non-null int64
Est_Monthly_Sales      39122 non-null float64
Est_Monthly_Revenue    39122 non-null float64
LQS                    39122 non-null int64
Number_Sellers         39087 non-null float64
dtypes: float64(9), int64(2), object(5)
memory usage: 6.3+ MB
In [25]:
df.describe()
Out[25]:
Price Fees Net Weight_lbs Reviews Avg_Rating Rank Est_Monthly_Sales Est_Monthly_Revenue LQS Number_Sellers
count 39116.000000 28077.000000 28080.000000 31475.000000 37215.000000 36592.000000 3.912200e+04 39122.000000 3.912200e+04 39122.000000 39087.000000
mean 412.743272 72.277335 334.669700 11.680945 442.601545 95.656061 9.151861e+04 219.173420 1.389535e+04 48.737795 8.062860
std 2173.996327 336.442840 1859.984647 44.787448 3112.666196 6177.057637 1.636627e+05 923.034722 8.544046e+04 13.742925 15.792671
min 1.000000 3.360000 -4516.890000 0.000000 0.000000 0.000000 1.000000e+00 1.000000 1.000000e+01 0.000000 1.000000
25% 30.480000 8.090000 22.470000 0.500000 6.000000 3.600000 7.230750e+03 10.000000 1.652802e+03 38.000000 1.000000
50% 93.240000 20.140000 73.500000 1.950000 37.000000 4.100000 2.493200e+04 42.000000 3.104305e+03 49.000000 2.000000
75% 349.990000 66.590000 297.182500 10.250000 205.000000 4.500000 8.108150e+04 137.000000 7.996000e+03 60.000000 7.000000
max 246464.820000 36972.710000 209492.110000 4960.400000 129960.000000 805192.000000 1.828111e+06 90709.920000 8.288314e+06 87.000000 245.000000
In [26]:
df.head()
Out[26]:
ASIN Product_Name Brand Seller Price Fees Net Weight_lbs Product_Tier Reviews Avg_Rating Rank Est_Monthly_Sales Est_Monthly_Revenue LQS Number_Sellers
0 B0723599RQ Motorola DOCSIS 3.1 Gig-speed Cable Modem Mod... Motorola MTRLC LLC 158.00 27.90 130.10 2.0503 Standard (Large) 573.0 4.1 495 2539.0 401162.00 54 22.0
1 B01MSTB5KW Motorola MG7540 16x4 Cable Modem plus AC1600 D... Motorola MTRLC LLC 129.99 23.93 106.06 2.6500 Standard (Large) 774.0 4.2 170 3855.0 501111.45 52 9.0
2 B01LXRSS36 Motorola MG7550 16x4 Cable Modem plus AC1900 D... Motorola MTRLC LLC 169.99 29.97 140.02 2.7500 Standard (Large) 1149.0 4.3 3187 612.0 104033.88 53 1.0
3 B01JGT2JI6 Motorola MG7550 16x4 Cable Modem plus AC1900 D... Motorola Etech Galaxy 163.92 29.06 134.86 2.7500 Standard (Large) 1150.0 4.3 1535 1280.0 209817.60 63 44.0
4 B07BRZ2KW5 Motorola MG7700 24X8 Cable Modem plus AC1900 D... Motorola MTRLC LLC 184.99 32.22 152.77 2.7492 Standard (Large) 17.0 4.7 2836 481.0 88980.19 53 2.0

Let's now look at the number of N.A's in the object columns:

In [27]:
ASIN_nan = len(df[df.ASIN.str.contains("N.A.") == True])
Product_Name_nan = len(df[df.Product_Name.str.contains("N.A.") == True])
Seller_nan = len(df[df.Seller.str.contains("N.A.") == True])
Brand_nan = len(df[df.Brand.str.contains("N.A.") == True])
Product_Tier_nan = len(df[df.Product_Tier.str.contains("N.A.") == True])
print("Nans: ASIN:", ASIN_nan, "Product Name:", Product_Name_nan, "Brand:", Brand_nan, "Product Tier:", Product_Tier_nan, "Seller:", Seller_nan)
Nans: ASIN: 137 Product Name: 348 Brand: 123 Product Tier: 5198 Seller: 1958

I do not believe that the ASIN is important because I believe it is an arbitrary number chosen by Amazon. The product name is important since it is the descriptor of what is being sold. The brand is less important and can stay as N.A. and the product tier is either standard or oversized and it is not a critical variable so it can be estimated for the N.A.'s.

I would like to look at a histogram of the distribution of the sales on Amazon:

In [79]:
bins = np.linspace(1000, df.Est_Monthly_Revenue.max(), 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()

The histogram above, which at first glance looks incorrect, shows that there are a few products that gross 8 million dollars in monthly revenue which make the histogram very unbalanced. Let's shrink the histogram axis so we can see the majority of sellers.

In [80]:
bins = np.linspace(1000, 50000, 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()

Even after cutting down the x axis we can still see that selling electronics on Amazon favors heavily the top 1%.

In [100]:
df.describe(percentiles=(np.linspace(1, 99, 25)/100))
Out[100]:
Price Fees Net Weight_lbs Reviews Avg_Rating Rank Est_Monthly_Sales Est_Monthly_Revenue LQS Number_Sellers
count 39116.000000 28077.000000 28080.000000 31475.000000 37215.000000 36592.000000 3.912200e+04 39122.000000 3.912200e+04 39122.000000 39087.000000
mean 412.743272 72.277335 334.669700 11.680945 442.601545 95.656061 9.151861e+04 219.173420 1.389535e+04 48.737795 8.062860
std 2173.996327 336.442840 1859.984647 44.787448 3112.666196 6177.057637 1.636627e+05 923.034722 8.544046e+04 13.742925 15.792671
min 1.000000 3.360000 -4516.890000 0.000000 0.000000 0.000000 1.000000e+00 1.000000 1.000000e+01 0.000000 1.000000
1% 6.690000 3.707600 2.950000 0.000000 0.000000 1.000000 1.050000e+02 1.000000 1.020768e+03 17.000000 1.000000
5.1% 11.753458 4.520000 6.860000 0.020000 1.000000 2.800000 7.706508e+02 1.000000 1.100000e+03 27.000000 1.000000
9.2% 14.990000 5.240000 9.750000 0.100000 1.000000 3.100000 1.735000e+03 2.000000 1.198000e+03 30.000000 1.000000
13.2% 18.990000 5.840000 12.300000 0.200000 2.000000 3.300000 2.864000e+03 3.000000 1.295341e+03 33.000000 1.000000
17.3% 22.990000 6.590000 15.360000 0.280000 3.000000 3.400000 4.230000e+03 5.000000 1.402349e+03 35.000000 1.000000
21.4% 26.990000 7.340000 18.760000 0.400000 4.000000 3.500000 5.736414e+03 8.000000 1.529550e+03 36.000000 1.000000
25.5% 31.970000 8.230000 22.500000 0.500000 6.000000 3.600000 7.438000e+03 11.000000 1.679520e+03 38.000000 1.000000
29.6% 36.990000 9.270000 26.750000 0.630000 9.000000 3.700000 9.476296e+03 15.000000 1.838894e+03 40.000000 1.000000
33.7% 44.720000 10.490000 32.222633 0.780099 12.000000 3.800000 1.158474e+04 19.000000 2.017345e+03 42.000000 1.000000
37.8% 54.799500 12.286900 39.500000 0.990000 16.000000 3.900000 1.396918e+04 24.000000 2.238481e+03 44.000000 1.000000
41.8% 64.980000 14.520000 48.387633 1.150000 21.000000 3.900000 1.680300e+04 30.000000 2.479829e+03 45.000000 1.000000
45.9% 79.880000 17.005633 61.079408 1.460000 28.000000 4.000000 2.047024e+04 35.000000 2.759400e+03 47.000000 2.000000
50% 93.240000 20.140000 73.500000 1.950000 37.000000 4.100000 2.493200e+04 42.000000 3.104305e+03 49.000000 2.000000
54.1% 113.990000 24.208733 92.330592 2.550000 48.000000 4.100000 3.052688e+04 51.000000 3.509723e+03 50.000000 2.000000
58.2% 139.990000 29.230000 116.000000 3.400000 63.000000 4.200000 3.656876e+04 60.000000 3.999000e+03 52.000000 3.000000
62.2% 169.990000 34.579300 144.430000 4.600000 81.000000 4.300000 4.373382e+04 75.000000 4.598132e+03 54.000000 3.000000
66.3% 209.000000 42.187467 177.448400 6.000000 107.000000 4.300000 5.204563e+04 90.000000 5.399903e+03 56.000000 4.000000
70.4% 269.954792 51.721833 228.838875 7.750000 147.000000 4.400000 6.295982e+04 109.000000 6.418930e+03 58.000000 5.000000
74.5% 349.000000 64.772400 289.258550 10.000000 199.000000 4.500000 7.869480e+04 133.145000 7.787769e+03 60.000000 7.000000
78.6% 449.990000 82.441700 375.098283 13.359550 271.000000 4.600000 1.030483e+05 170.000000 9.667457e+03 62.000000 9.000000
82.7% 599.000000 105.664800 492.959467 17.750000 377.000000 4.700000 1.421441e+05 223.000000 1.272912e+04 64.000000 12.000000
86.8% 839.990000 141.187900 670.370000 24.000000 576.000000 4.900000 2.237988e+05 308.000000 1.690182e+04 66.000000 17.000000
90.8% 1200.000000 201.493667 986.624000 34.000000 872.000000 5.000000 3.136701e+05 470.000000 2.426794e+04 67.000000 24.000000
94.9% 1784.449750 286.928200 1392.942283 51.303583 1552.288333 5.000000 4.726150e+05 839.000000 4.435974e+04 70.000000 40.000000
99% 3797.990000 622.082000 2946.613500 123.000000 5229.000000 5.000000 7.679839e+05 3017.900000 1.802676e+05 74.000000 77.000000
max 246464.820000 36972.710000 209492.110000 4960.400000 129960.000000 805192.000000 1.828111e+06 90709.920000 8.288314e+06 87.000000 245.000000

Let's look at the top 10 selling products:

In [106]:
df.sort_values(by=['Est_Monthly_Revenue']).tail(10)
Out[106]:
ASIN Product_Name Brand Seller Price Fees Net Weight_lbs Product_Tier Reviews Avg_Rating Rank Est_Monthly_Sales Est_Monthly_Revenue LQS Number_Sellers
89 B01D1W119M Garmin Fenix 3 Sapphire Performance Bundle Garmin Amazon.com 517.06 NaN NaN NaN true 400.0 4.2 139 5190.0 2683541.40 48 6.0
13 B01MYGISTO TCL 49S405 49-Inch 4K Ultra HD Roku Smart LED ... TCL Amazon.com 319.99 67.50 252.49 34.45 Oversize (Small) 2900.0 4.3 93 9264.0 2964387.36 56 8.0
93 B0161RJ81C Garmin Fenix 3 Sapphire White Rose/Gold Garmin Amazon.com 599.99 92.99 507.00 0.75 Standard (Large) 400.0 4.2 122 5221.0 3132547.79 33 1.0
87 B01A5HILB2 Garmin Fenix 3 HR Gray Performer Bundle Garmin SmartTechs 414.89 65.22 349.67 0.93 Standard (Large) 1098.0 4.4 68 8010.0 3323268.90 65 5.0
88 B00S7LZWHU Garmin Fenix 3 Sapphire Performance Bundle Garmin Amazon.com 517.06 81.74 435.32 1.30 Standard (Large) 400.0 4.2 92 6720.0 3474643.20 38 6.0
84 B00RY1YWSO Garmin fenix 3 GPS Watch Gray Garmin Amazon.com 499.99 77.99 422.00 0.65 Standard (Large) 1098.0 4.4 85 7050.0 3524929.50 36 21.0
90 B014SS9Q90 Garmin Fenix 3 Sapphire Performance Bundle Garmin Amazon.com 577.23 90.76 486.47 1.30 Standard (Large) 400.0 4.2 92 6129.0 3537842.67 59 6.0
94 B01MXZPG0U Garmin Fenix 3 Sapphire White Rose/Gold Garmin Amazon.com 599.99 92.99 507.00 0.80 Standard (Large) 400.0 4.2 90 6810.0 4085931.90 33 1.0
18 B01MTGM5I9 TCL 55S405 55-Inch 4K Ultra HD Roku Smart LED ... TCL Amazon.com 379.99 90.47 289.52 38.01 Oversize (Large) 2900.0 4.3 57 13004.0 4941389.96 56 12.0
59 B06XNLC63G FORTINET | FC-10-03702-871-02-36 | FortiGate-3... Fortinet SerenIT 202154.00 NaN NaN NaN N.A. 5.0 4.1 29957 41.0 8288314.00 64 1.0

And the highest grossing product is Fortinet, a web app firewall. Next, is a smart TV and then a watch made by Garmin.