What is selling on Amazon?

I have been interested for awhile now in selling goods through FBA. The idea is simple: Buy cheap products from Alibaba, have them shipped to an Amazon warehouse, create a listing on Amazon for the product, and profit. There are many articles and blogs on the internet on how to do this successfully. There are also data mining software packages that allow you to see what is being sold on Amazon. I thought is would be interesting to see what has been sold in the electronics department over the last month ending on 7/16/2018. I gathered this data from Jungle Scout and to my knowledge is data has not been studied before. The objective is two fold:

1. Is it worth it to sell electronics on Amazon?
2. Can data science be used to determine what the best product is to be sold on Amazon?

One of the main goals of a data scientist is to look at a dataset and determine what value can be extracted from it. This is done though conducting exploratory data analysis and looking for correlations in the data.

In [1]:

import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import numpy as np
import nltk
#from nltk.corpus import stopwords
%matplotlib inline

In [2]:

currdir = os.listdir()

In [3]:

path = 'Jungle Scout CSV Export - Mon Jul 16 2018 18_08_38 GMT-0500 (Central Daylight Time).csv'
df = pd.read_csv(path, skiprows=2, index_col=False)
df.head()

Out[3]:

	ASIN	Product Name	Brand	Seller	Category	Price	Fees	Net	Weight (lbs)	Product Tier	Reviews	Avg. Rating	Rank	Est. Monthly Sales	Est. Monthly Revenue	LQS	Number Sellers
0	B0723599RQ	Motorola DOCSIS 3.1 Gig-speed Cable Modem Mod...	Motorola	MTRLC LLC	Electronics	158.00	27.9	130.1	2.0503	Standard (Large)	573	4.1	495	2539	401162.00	54	22
1	B01MSTB5KW	Motorola MG7540 16x4 Cable Modem plus AC1600 D...	Motorola	MTRLC LLC	Electronics	129.99	23.93	106.06	2.65	Standard (Large)	774	4.2	170	3855	501111.45	52	9
2	B01LXRSS36	Motorola MG7550 16x4 Cable Modem plus AC1900 D...	Motorola	MTRLC LLC	Electronics	169.99	29.97	140.02	2.75	Standard (Large)	1149	4.3	3187	612	104033.88	53	1
3	B01JGT2JI6	Motorola MG7550 16x4 Cable Modem plus AC1900 D...	Motorola	Etech Galaxy	Electronics	163.92	29.06	134.85999999999999	2.75	Standard (Large)	1150	4.3	1535	1280	209817.60	63	44
4	B07BRZ2KW5	Motorola MG7700 24X8 Cable Modem plus AC1900 D...	Motorola	MTRLC LLC	Electronics	184.99	32.22	152.77	2.7492	Standard (Large)	17	4.7	2836	481	88980.19	53	2

We have several csv files that need to be combined into one file so we will do a list comprehension to append the files into a single file:

In [4]:

df = []
for i in glob.glob("*.csv"):
    data = pd.read_csv(i, skiprows=2,  index_col=False)
    df.append(data)
df = pd.concat(df)

In [5]:

df.describe()

Out[5]:

	Rank	Est. Monthly Sales	Est. Monthly Revenue	LQS	Number Sellers
count	3.912200e+04	39122.000000	3.912200e+04	39122.000000	39087.000000
mean	9.151861e+04	219.173420	1.389535e+04	48.737795	8.062860
std	1.636627e+05	923.034722	8.544046e+04	13.742925	15.792671
min	1.000000e+00	1.000000	1.000000e+01	0.000000	1.000000
25%	7.230750e+03	10.000000	1.652802e+03	38.000000	1.000000
50%	2.493200e+04	42.000000	3.104305e+03	49.000000	2.000000
75%	8.108150e+04	137.000000	7.996000e+03	60.000000	7.000000
max	1.828111e+06	90709.920000	8.288314e+06	87.000000	245.000000

The describe function looks at all continuous columns in the dataframe and gives statistics about the columns. The estimated monthly revenue can gives a positive outlook on the sales for the last month. The average sales were \$13,895. Not bad for a side gig. However, when looking at the 50th percentile we can see that the average sales are \$3,104 per month which indicates that the top sellers get the vast majority of the sales. The average product has 8 sellers competing to sell the product.

The names for the columns are a bit off so let's correct them:

In [6]:

df = df.rename(index=str, columns={" Product Name": "Product_Name", " Brand": "Brand", " Seller": "Seller", " Category": "Category", " Price": "Price", " Fees": "Fees", " Net": "Net", " Weight (lbs)": "Weight_lbs", " Product Tier": "Product_Tier", " Reviews": "Reviews", " Avg. Rating": "Avg_Rating", " Rank": "Rank", " Est. Monthly Sales": "Est_Monthly_Sales", " Est. Monthly Revenue": "Est_Monthly_Revenue", " LQS": "LQS", " Number Sellers": "Number_Sellers"})

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39122 entries, 0 to 199
Data columns (total 17 columns):
ASIN                   39122 non-null object
Product_Name           39122 non-null object
Brand                  39122 non-null object
Seller                 39122 non-null object
Category               39122 non-null object
Price                  39122 non-null object
Fees                   39122 non-null object
Net                    39122 non-null object
Weight_lbs             39122 non-null object
Product_Tier           39122 non-null object
Reviews                39122 non-null object
Avg_Rating             39122 non-null object
Rank                   39122 non-null int64
Est_Monthly_Sales      39122 non-null float64
Est_Monthly_Revenue    39122 non-null float64
LQS                    39122 non-null int64
Number_Sellers         39087 non-null float64
dtypes: float64(3), int64(2), object(12)
memory usage: 5.4+ MB

Looking at the info for the data frame we can see that there are several objects that should be numerical values. This indicates that there are possible NA values that are not represented as nan values that need to be cleaned and corrected. Price, Fees, Net, Weight, Reviews, and Rating are all objects that should be numerical values.

In [8]:

df.columns

Out[8]:

Index(['ASIN', 'Product_Name', 'Brand', 'Seller', 'Category', 'Price', 'Fees',
       'Net', 'Weight_lbs', 'Product_Tier', 'Reviews', 'Avg_Rating', 'Rank',
       'Est_Monthly_Sales', 'Est_Monthly_Revenue', 'LQS', 'Number_Sellers'],
      dtype='object')

We should drop category since the category is electronics.

In [9]:

df = df.drop(['Category'], axis = 1)

The dataframe currently represents nan's as N.A. We need to change that to nan's.

In [10]:

df['Price'] = pd.to_numeric(df.Price, errors='coerce')
df['Price'] = df['Price'].astype('float64', errors = 'raise')
df['Fees'] = pd.to_numeric(df.Fees, errors='coerce')
df['Fees'] = df['Fees'].astype('float64', errors = 'raise')
df['Net'] = pd.to_numeric(df.Net, errors='coerce')
df['Net'] = df['Net'].astype('float64', errors = 'raise')
df['Weight_lbs'] = pd.to_numeric(df.Weight_lbs, errors='coerce')
df['Weight_lbs'] = df['Weight_lbs'].astype('float64', errors = 'raise')
df['Reviews'] = pd.to_numeric(df.Reviews, errors='coerce')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Reviews'] = df['Reviews'].astype('float64', errors = 'raise')
df['Avg_Rating'] = pd.to_numeric(df.Avg_Rating, errors='coerce')
df['Avg_Rating'] = df['Avg_Rating'].astype('float64', errors = 'raise')

In [78]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39122 entries, 0 to 199
Data columns (total 16 columns):
ASIN                   39122 non-null object
Product_Name           39122 non-null object
Brand                  39122 non-null object
Seller                 39122 non-null object
Price                  39116 non-null float64
Fees                   28077 non-null float64
Net                    28080 non-null float64
Weight_lbs             31475 non-null float64
Product_Tier           39122 non-null object
Reviews                37215 non-null float64
Avg_Rating             36592 non-null float64
Rank                   39122 non-null int64
Est_Monthly_Sales      39122 non-null float64
Est_Monthly_Revenue    39122 non-null float64
LQS                    39122 non-null int64
Number_Sellers         39087 non-null float64
dtypes: float64(9), int64(2), object(5)
memory usage: 6.3+ MB

In [25]:

df.describe()

Out[25]:

	Price	Fees	Net	Weight_lbs	Reviews	Avg_Rating	Rank	Est_Monthly_Sales	Est_Monthly_Revenue	LQS	Number_Sellers
count	39116.000000	28077.000000	28080.000000	31475.000000	37215.000000	36592.000000	3.912200e+04	39122.000000	3.912200e+04	39122.000000	39087.000000
mean	412.743272	72.277335	334.669700	11.680945	442.601545	95.656061	9.151861e+04	219.173420	1.389535e+04	48.737795	8.062860
std	2173.996327	336.442840	1859.984647	44.787448	3112.666196	6177.057637	1.636627e+05	923.034722	8.544046e+04	13.742925	15.792671
min	1.000000	3.360000	-4516.890000	0.000000	0.000000	0.000000	1.000000e+00	1.000000	1.000000e+01	0.000000	1.000000
25%	30.480000	8.090000	22.470000	0.500000	6.000000	3.600000	7.230750e+03	10.000000	1.652802e+03	38.000000	1.000000
50%	93.240000	20.140000	73.500000	1.950000	37.000000	4.100000	2.493200e+04	42.000000	3.104305e+03	49.000000	2.000000
75%	349.990000	66.590000	297.182500	10.250000	205.000000	4.500000	8.108150e+04	137.000000	7.996000e+03	60.000000	7.000000
max	246464.820000	36972.710000	209492.110000	4960.400000	129960.000000	805192.000000	1.828111e+06	90709.920000	8.288314e+06	87.000000	245.000000

In [26]:

df.head()

Out[26]:

	ASIN	Product_Name	Brand	Seller	Price	Fees	Net	Weight_lbs	Product_Tier	Reviews	Avg_Rating	Rank	Est_Monthly_Sales	Est_Monthly_Revenue	LQS	Number_Sellers
0	B0723599RQ	Motorola DOCSIS 3.1 Gig-speed Cable Modem Mod...	Motorola	MTRLC LLC	158.00	27.90	130.10	2.0503	Standard (Large)	573.0	4.1	495	2539.0	401162.00	54	22.0
1	B01MSTB5KW	Motorola MG7540 16x4 Cable Modem plus AC1600 D...	Motorola	MTRLC LLC	129.99	23.93	106.06	2.6500	Standard (Large)	774.0	4.2	170	3855.0	501111.45	52	9.0
2	B01LXRSS36	Motorola MG7550 16x4 Cable Modem plus AC1900 D...	Motorola	MTRLC LLC	169.99	29.97	140.02	2.7500	Standard (Large)	1149.0	4.3	3187	612.0	104033.88	53	1.0
3	B01JGT2JI6	Motorola MG7550 16x4 Cable Modem plus AC1900 D...	Motorola	Etech Galaxy	163.92	29.06	134.86	2.7500	Standard (Large)	1150.0	4.3	1535	1280.0	209817.60	63	44.0
4	B07BRZ2KW5	Motorola MG7700 24X8 Cable Modem plus AC1900 D...	Motorola	MTRLC LLC	184.99	32.22	152.77	2.7492	Standard (Large)	17.0	4.7	2836	481.0	88980.19	53	2.0

Let's now look at the number of N.A's in the object columns:

In [27]:

ASIN_nan = len(df[df.ASIN.str.contains("N.A.") == True])
Product_Name_nan = len(df[df.Product_Name.str.contains("N.A.") == True])
Seller_nan = len(df[df.Seller.str.contains("N.A.") == True])
Brand_nan = len(df[df.Brand.str.contains("N.A.") == True])
Product_Tier_nan = len(df[df.Product_Tier.str.contains("N.A.") == True])
print("Nans: ASIN:", ASIN_nan, "Product Name:", Product_Name_nan, "Brand:", Brand_nan, "Product Tier:", Product_Tier_nan, "Seller:", Seller_nan)

Nans: ASIN: 137 Product Name: 348 Brand: 123 Product Tier: 5198 Seller: 1958

I do not believe that the ASIN is important because I believe it is an arbitrary number chosen by Amazon. The product name is important since it is the descriptor of what is being sold. The brand is less important and can stay as N.A. and the product tier is either standard or oversized and it is not a critical variable so it can be estimated for the N.A.'s.

I would like to look at a histogram of the distribution of the sales on Amazon:

In [79]:

bins = np.linspace(1000, df.Est_Monthly_Revenue.max(), 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()

The histogram above, which at first glance looks incorrect, shows that there are a few products that gross 8 million dollars in monthly revenue which make the histogram very unbalanced. Let's shrink the histogram axis so we can see the majority of sellers.

In [80]:

bins = np.linspace(1000, 50000, 20)
plt.figure( figsize=(15, 6))
plt.hist(df['Est_Monthly_Revenue'], bins= bins, histtype='bar')
plt.xlabel('Revenue ($)')
plt.ylabel('Number of Sellers')
plt.title('Distribution of Monthly Revenue')
plt.show()

Even after cutting down the x axis we can still see that selling electronics on Amazon favors heavily the top 1%.

In [100]:

df.describe(percentiles=(np.linspace(1, 99, 25)/100))

Out[100]:

	Price	Fees	Net	Weight_lbs	Reviews	Avg_Rating	Rank	Est_Monthly_Sales	Est_Monthly_Revenue	LQS	Number_Sellers
count	39116.000000	28077.000000	28080.000000	31475.000000	37215.000000	36592.000000	3.912200e+04	39122.000000	3.912200e+04	39122.000000	39087.000000
mean	412.743272	72.277335	334.669700	11.680945	442.601545	95.656061	9.151861e+04	219.173420	1.389535e+04	48.737795	8.062860
std	2173.996327	336.442840	1859.984647	44.787448	3112.666196	6177.057637	1.636627e+05	923.034722	8.544046e+04	13.742925	15.792671
min	1.000000	3.360000	-4516.890000	0.000000	0.000000	0.000000	1.000000e+00	1.000000	1.000000e+01	0.000000	1.000000
1%	6.690000	3.707600	2.950000	0.000000	0.000000	1.000000	1.050000e+02	1.000000	1.020768e+03	17.000000	1.000000
5.1%	11.753458	4.520000	6.860000	0.020000	1.000000	2.800000	7.706508e+02	1.000000	1.100000e+03	27.000000	1.000000
9.2%	14.990000	5.240000	9.750000	0.100000	1.000000	3.100000	1.735000e+03	2.000000	1.198000e+03	30.000000	1.000000
13.2%	18.990000	5.840000	12.300000	0.200000	2.000000	3.300000	2.864000e+03	3.000000	1.295341e+03	33.000000	1.000000
17.3%	22.990000	6.590000	15.360000	0.280000	3.000000	3.400000	4.230000e+03	5.000000	1.402349e+03	35.000000	1.000000
21.4%	26.990000	7.340000	18.760000	0.400000	4.000000	3.500000	5.736414e+03	8.000000	1.529550e+03	36.000000	1.000000
25.5%	31.970000	8.230000	22.500000	0.500000	6.000000	3.600000	7.438000e+03	11.000000	1.679520e+03	38.000000	1.000000
29.6%	36.990000	9.270000	26.750000	0.630000	9.000000	3.700000	9.476296e+03	15.000000	1.838894e+03	40.000000	1.000000
33.7%	44.720000	10.490000	32.222633	0.780099	12.000000	3.800000	1.158474e+04	19.000000	2.017345e+03	42.000000	1.000000
37.8%	54.799500	12.286900	39.500000	0.990000	16.000000	3.900000	1.396918e+04	24.000000	2.238481e+03	44.000000	1.000000
41.8%	64.980000	14.520000	48.387633	1.150000	21.000000	3.900000	1.680300e+04	30.000000	2.479829e+03	45.000000	1.000000
45.9%	79.880000	17.005633	61.079408	1.460000	28.000000	4.000000	2.047024e+04	35.000000	2.759400e+03	47.000000	2.000000
50%	93.240000	20.140000	73.500000	1.950000	37.000000	4.100000	2.493200e+04	42.000000	3.104305e+03	49.000000	2.000000
54.1%	113.990000	24.208733	92.330592	2.550000	48.000000	4.100000	3.052688e+04	51.000000	3.509723e+03	50.000000	2.000000
58.2%	139.990000	29.230000	116.000000	3.400000	63.000000	4.200000	3.656876e+04	60.000000	3.999000e+03	52.000000	3.000000
62.2%	169.990000	34.579300	144.430000	4.600000	81.000000	4.300000	4.373382e+04	75.000000	4.598132e+03	54.000000	3.000000
66.3%	209.000000	42.187467	177.448400	6.000000	107.000000	4.300000	5.204563e+04	90.000000	5.399903e+03	56.000000	4.000000
70.4%	269.954792	51.721833	228.838875	7.750000	147.000000	4.400000	6.295982e+04	109.000000	6.418930e+03	58.000000	5.000000
74.5%	349.000000	64.772400	289.258550	10.000000	199.000000	4.500000	7.869480e+04	133.145000	7.787769e+03	60.000000	7.000000
78.6%	449.990000	82.441700	375.098283	13.359550	271.000000	4.600000	1.030483e+05	170.000000	9.667457e+03	62.000000	9.000000
82.7%	599.000000	105.664800	492.959467	17.750000	377.000000	4.700000	1.421441e+05	223.000000	1.272912e+04	64.000000	12.000000
86.8%	839.990000	141.187900	670.370000	24.000000	576.000000	4.900000	2.237988e+05	308.000000	1.690182e+04	66.000000	17.000000
90.8%	1200.000000	201.493667	986.624000	34.000000	872.000000	5.000000	3.136701e+05	470.000000	2.426794e+04	67.000000	24.000000
94.9%	1784.449750	286.928200	1392.942283	51.303583	1552.288333	5.000000	4.726150e+05	839.000000	4.435974e+04	70.000000	40.000000
99%	3797.990000	622.082000	2946.613500	123.000000	5229.000000	5.000000	7.679839e+05	3017.900000	1.802676e+05	74.000000	77.000000
max	246464.820000	36972.710000	209492.110000	4960.400000	129960.000000	805192.000000	1.828111e+06	90709.920000	8.288314e+06	87.000000	245.000000

Let's look at the top 10 selling products:

In [106]:

df.sort_values(by=['Est_Monthly_Revenue']).tail(10)

Out[106]:

	ASIN	Product_Name	Brand	Seller	Price	Fees	Net	Weight_lbs	Product_Tier	Reviews	Avg_Rating	Rank	Est_Monthly_Sales	Est_Monthly_Revenue	LQS	Number_Sellers
89	B01D1W119M	Garmin Fenix 3 Sapphire Performance Bundle	Garmin	Amazon.com	517.06	NaN	NaN	NaN	true	400.0	4.2	139	5190.0	2683541.40	48	6.0
13	B01MYGISTO	TCL 49S405 49-Inch 4K Ultra HD Roku Smart LED ...	TCL	Amazon.com	319.99	67.50	252.49	34.45	Oversize (Small)	2900.0	4.3	93	9264.0	2964387.36	56	8.0
93	B0161RJ81C	Garmin Fenix 3 Sapphire White Rose/Gold	Garmin	Amazon.com	599.99	92.99	507.00	0.75	Standard (Large)	400.0	4.2	122	5221.0	3132547.79	33	1.0
87	B01A5HILB2	Garmin Fenix 3 HR Gray Performer Bundle	Garmin	SmartTechs	414.89	65.22	349.67	0.93	Standard (Large)	1098.0	4.4	68	8010.0	3323268.90	65	5.0
88	B00S7LZWHU	Garmin Fenix 3 Sapphire Performance Bundle	Garmin	Amazon.com	517.06	81.74	435.32	1.30	Standard (Large)	400.0	4.2	92	6720.0	3474643.20	38	6.0
84	B00RY1YWSO	Garmin fenix 3 GPS Watch Gray	Garmin	Amazon.com	499.99	77.99	422.00	0.65	Standard (Large)	1098.0	4.4	85	7050.0	3524929.50	36	21.0
90	B014SS9Q90	Garmin Fenix 3 Sapphire Performance Bundle	Garmin	Amazon.com	577.23	90.76	486.47	1.30	Standard (Large)	400.0	4.2	92	6129.0	3537842.67	59	6.0
94	B01MXZPG0U	Garmin Fenix 3 Sapphire White Rose/Gold	Garmin	Amazon.com	599.99	92.99	507.00	0.80	Standard (Large)	400.0	4.2	90	6810.0	4085931.90	33	1.0
18	B01MTGM5I9	TCL 55S405 55-Inch 4K Ultra HD Roku Smart LED ...	TCL	Amazon.com	379.99	90.47	289.52	38.01	Oversize (Large)	2900.0	4.3	57	13004.0	4941389.96	56	12.0
59	B06XNLC63G	FORTINET \| FC-10-03702-871-02-36 \| FortiGate-3...	Fortinet	SerenIT	202154.00	NaN	NaN	NaN	N.A.	5.0	4.1	29957	41.0	8288314.00	64	1.0

And the highest grossing product is Fortinet, a web app firewall. Next, is a smart TV and then a watch made by Garmin.