Commit

Update changes
rafaeldcsantos committed Jun 1, 2024
1 parent 573ac9c commit 78f3e5c
Showing 18 changed files with 59 additions and 63 deletions.
Binary file not shown.
120 changes: 58 additions & 62 deletions docs/23-Projects-Supermarket.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/search.json
@@ -229,7 +229,7 @@
"href": "23-Projects-Supermarket.html#are-there-patterns-on-the-amounts-paid-by-the-costumers",
"title": "Supermarket Basket Analysis",
"section": "Are there patterns on the amounts paid by the costumers?",
"text": "Are there patterns on the amounts paid by the costumers?\nHow many items our costumers buy in each transaction? Are there patterns or distributions on this amount? Let’s explore!\nFirst let’s see a histogram of the distribution of the number of items per transaction. In the way our data is organized, we need to infer this number since it is not explicit. To count the number of items per transaction we can group the dataframe by TransactionID and get the number of records in each group:\n\n# Group records by TransactionID, select TransactionID to extract a metric (size).\nitemsIT = supermDF.groupby('TransactionID')['TransactionID'].transform('size')\n# Add it as a new column.\nsupermDF['itemsInTransaction'] = itemsIT\n# We want only one record per transaction!\nonlyFirstItems = supermDF.loc[supermDF['itemOrder'] == 1].copy()\n\nThen we can do a histogram to show the distribution of items per transaction:\n\n# Plot the histogram of TotalDue with 50 bins.\nplt.hist(onlyFirstItems['itemsInTransaction'], bins=50);\n# Set some plot parameters.\nplt.xlabel('Number of Items')\nplt.ylabel('Frequency')\nplt.yscale('log') \nplt.title('Histogram of number of items in basket')\nplt.grid(True)\n# Correct the Y axis so values won't be displayed in scientific notation.\n# This solution was suggested by ChatGPT!\nax = plt.gca()\nax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '{:g}'.format(x)))\n# Display it.\nplt.show()\n\n\n\n\n\n\n\n\nAs expected, most of the transactions contain a few items.\n\n\n \n\nTo count the number of items per transaction we could also get the maximum value for each itemOrder with the dataframe grouped by TransactionID. Try it!\n\nWe expect a sort of correlation between the number of items in a transaction and its total cost (TotalPaid). 
Let’s do a quick XY (or scatter) plot to verify this:\n\n# Create the scatter plot with X=itemsInTransaction abnd Y=TotalPaid\nplt.scatter(onlyFirstItems['itemsInTransaction'],onlyFirstItems['TotalPaid'])\n# Set title and axes labels.\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nLet’s enhance the plot with a linear regression line and confidence interval. The seaborn (Waskom 2021) library has a simple method for that:\n\n# Create the X and Y series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = onlyFirstItems[\"TotalPaid\"]\n# Plot a regression plot with transparency in the data points.\nsns.regplot(x=X,y=Y,scatter_kws={\"color\":\"skyblue\",\"alpha\":0.5},line_kws={\"color\":\"red\"})\n# Set title and axes labels.\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nThere are some outliers but the total paid for each transaction seems correlated with the number of items in each cart.\n\n\n \n\nThe regression line may be heavily influenced by the transactions with only a few items or with a low paymend due. Redo the plot but using only transactions with total above a cutoff, or with transactions with more than a specific number of items, or both.\n\nAre those transactions paid by cash? Can we see any pattern or outlier on the payment method? 
Seaborn can use a discrete column as a color to help discriminate the points, let’s color the plot points using the payment method:\n\n# Set the number of figure in inches.\nplt.figure(figsize=(12, 8)) \n# Create the X, Y and points' color series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = onlyFirstItems[\"TotalPaid\"]\nPM = onlyFirstItems[\"PaymentMethod\"]\n# Plot a regression plot with transparency in the data points.\nsns.scatterplot(x=X,y=Y,hue=PM)\n# Set title and axes labels\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\n\n\n \n\nThere is a payment category without a title. See how many transactions are in this situation and propose a way to deal with this data.\n\nThat wasn’t very useful, since there are many options of payments and we would like to aggregate them. Let’s create a new column on our dataframe to represent a simplified payment method.\n\n# A simple function that simply the categories of payment method.\ndef simplifyPayment(paymethod):\n if paymethod.startswith(\"Dinheiro\"):\n return \"Dinheiro\"\n elif paymethod.startswith(\"Cartao\"):\n return \"Cartao\"\n elif paymethod.startswith(\"Cheque\"):\n return \"Cheque\"\n else:\n return \"Outros\"\n# Now we can add a new column based on the value of PaymentMethod.\nonlyFirstItems['SimplifiedPaymentMethod'] = \\\n onlyFirstItems['PaymentMethod'].apply(simplifyPayment) \n\nWith this new column we can plot a more informative scatter plot:\n\n# We need a palette - a mapping from names to colors.\ncolor_mapping = {\n 'Dinheiro': 'yellowgreen',\n 'Cartao' : 'slateblue',\n 'Cheque' : 'magenta',\n 'Outros' : 'lightgrey'\n}\n# Set the number of figure in inches.\nplt.figure(figsize=(12, 8)) \n# Create the X, Y and points' color series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = 
onlyFirstItems[\"TotalPaid\"]\ncol = onlyFirstItems[\"SimplifiedPaymentMethod\"]\n# Plot a scatter plot with the data points and the new palette.\nsns.scatterplot(x='itemsInTransaction', y='TotalPaid', \\\n hue='SimplifiedPaymentMethod',\\\n s=40,data=onlyFirstItems,palette=color_mapping)\n# Set title and axes labels\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Change the title of the legend and the marker scale.\nax = plt.gca()\nax.legend(title='Payment Method',markerscale=2)\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nSome transactions’ values were about 500 reais and paid by cash – should we take a closer look?\n\n\n \n\nCreate plots to explore the relationship between itemsInTransaction and TotalPaid but using other fields to set the color – e.g. are the payment patterns the same for each of the three stores?\n\n\n\n \n\nAre the payment patterns the same for different hours of the day? For different days on the week?",
"text": "Are there patterns on the amounts paid by the costumers?\nHow many items our costumers buy in each transaction? Are there patterns or distributions on this amount? Let’s explore!\nFirst let’s see a histogram of the distribution of the number of items per transaction. In the way our data is organized, we need to infer this number since it is not explicit. To count the number of items per transaction we can group the dataframe by TransactionID and get the number of records in each group:\n\n# Group records by TransactionID, select TransactionID to extract a metric (size).\nitemsIT = supermDF.groupby('TransactionID')['TransactionID'].transform('size')\n# Add it as a new column.\nsupermDF['itemsInTransaction'] = itemsIT\n# We want only one record per transaction!\nonlyFirstItems = supermDF.loc[supermDF['itemOrder'] == 1].copy()\n\nThen we can do a histogram to show the distribution of items per transaction:\n\n# Plot the histogram of TotalDue with 50 bins.\nplt.hist(onlyFirstItems['itemsInTransaction'], bins=50);\n# Set some plot parameters.\nplt.xlabel('Number of Items')\nplt.ylabel('Frequency')\nplt.yscale('log') \nplt.title('Histogram of number of items in basket')\nplt.grid(True)\n# Correct the Y axis so values won't be displayed in scientific notation.\n# This solution was suggested by ChatGPT!\nax = plt.gca()\nax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '{:g}'.format(x)))\n# Display it.\nplt.show()\n\n\n\n\n\n\n\n\nAs expected, most of the transactions contain a few items.\n\n\n \n\nTo count the number of items per transaction we could also get the maximum value for each itemOrder with the dataframe grouped by TransactionID. Try it!\n\nWe expect a sort of correlation between the number of items in a transaction and its total cost (TotalPaid). 
Let’s do a quick XY (or scatter) plot to verify this:\n\n# Create the scatter plot with X=itemsInTransaction abnd Y=TotalPaid\nplt.scatter(onlyFirstItems['itemsInTransaction'],onlyFirstItems['TotalPaid'])\n# Set title and axes labels.\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nLet’s enhance the plot with a linear regression line and confidence interval. The seaborn (Waskom 2021) library has a simple method for that:\n\n# Create the X and Y series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = onlyFirstItems[\"TotalPaid\"]\n# Plot a regression plot with transparency in the data points.\nsns.regplot(x=X,y=Y,scatter_kws={\"color\":\"skyblue\",\"alpha\":0.5},line_kws={\"color\":\"red\"})\n# Set title and axes labels.\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nThere are some outliers but the total paid for each transaction seems correlated with the number of items in each cart.\n\n\n \n\nThe regression line may be heavily influenced by the transactions with only a few items or with a low paymend due. Redo the plot but using only transactions with total above a cutoff, or with transactions with more than a specific number of items, or both.\n\nAre those transactions paid by cash? Can we see any pattern or outlier on the payment method? 
Seaborn can use a discrete column as a color to help discriminate the points, let’s color the plot points using the payment method:\n\n# Create the X, Y and points' color series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = onlyFirstItems[\"TotalPaid\"]\nPM = onlyFirstItems[\"PaymentMethod\"]\n# Plot a regression plot with transparency in the data points.\nsns.scatterplot(x=X,y=Y,hue=PM)\n# Set title and axes labels\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\n\n\n \n\nThere is a payment category without a title. See how many transactions are in this situation and propose a way to deal with this data.\n\nThat wasn’t very useful, since there are many options of payments and we would like to aggregate them. Let’s create a new column on our dataframe to represent a simplified payment method.\n\n# A simple function that simply the categories of payment method.\ndef simplifyPayment(paymethod):\n if paymethod.startswith(\"Dinheiro\"):\n return \"Dinheiro\"\n elif paymethod.startswith(\"Cartao\"):\n return \"Cartao\"\n elif paymethod.startswith(\"Cheque\"):\n return \"Cheque\"\n else:\n return \"Outros\"\n# Now we can add a new column based on the value of PaymentMethod.\nonlyFirstItems['SimplifiedPaymentMethod'] = \\\n onlyFirstItems['PaymentMethod'].apply(simplifyPayment) \n\nWith this new column we can plot a more informative scatter plot:\n\n# We need a palette - a mapping from names to colors.\ncolor_mapping = {\n 'Dinheiro': 'yellowgreen',\n 'Cartao' : 'slateblue',\n 'Cheque' : 'magenta',\n 'Outros' : 'lightgrey'\n}\n# Create the X, Y and points' color series (to make next commands shorter).\nX = onlyFirstItems[\"itemsInTransaction\"]\nY = onlyFirstItems[\"TotalPaid\"]\ncol = onlyFirstItems[\"SimplifiedPaymentMethod\"]\n# Plot a scatter plot with the data points and the new 
palette.\nsns.scatterplot(x='itemsInTransaction', y='TotalPaid', \\\n hue='SimplifiedPaymentMethod',\\\n s=40,data=onlyFirstItems,palette=color_mapping)\n# Set title and axes labels\nplt.title('Scatter Plot of Number of Items in Cart versus Total Paid')\nplt.xlabel('Number of Items in Cart')\nplt.ylabel('Total Paid')\n# Change the title of the legend and the marker scale.\nax = plt.gca()\nax.legend(title='Payment Method',markerscale=2)\n# Show the plot.\nplt.show()\n\n\n\n\n\n\n\n\nSome transactions’ values were about 500 reais and paid by cash – should we take a closer look?\n\n\n \n\nCreate plots to explore the relationship between itemsInTransaction and TotalPaid but using other fields to set the color – e.g. are the payment patterns the same for each of the three stores?\n\n\n\n \n\nAre the payment patterns the same for different hours of the day? For different days on the week?",
"crumbs": [
"Projects",
"Supermarket Basket Analysis"
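
The indexed text in this hunk suggests counting items per transaction either by group size or by the maximum `itemOrder` per `TransactionID`. The following is a minimal sketch of that comparison, not the author's implementation. The toy dataframe is hypothetical; only the column names `TransactionID`, `itemOrder` and `itemsInTransaction` come from the text, and the sketch assumes `itemOrder` runs from 1 to n without gaps inside each transaction.

```python
import pandas as pd

# Hypothetical toy data that mirrors the column names used in the indexed text.
supermDF = pd.DataFrame({
    "TransactionID": [101, 101, 101, 102, 103, 103],
    "itemOrder":     [1, 2, 3, 1, 1, 2],
})

# Approach shown in the text: number of records in each TransactionID group,
# broadcast back to every row of that group.
by_size = supermDF.groupby("TransactionID")["TransactionID"].transform("size")

# Alternative from the exercise: the largest itemOrder in each group.
# It matches the record count only if itemOrder has no gaps or repeats.
by_max = supermDF.groupby("TransactionID")["itemOrder"].transform("max")

supermDF["itemsInTransaction"] = by_size

# True when both approaches agree on every row.
same_counts = bool((by_size == by_max).all())
print(same_counts)
```

If the two series disagree on real data, that would point to missing or duplicated `itemOrder` values, which may be worth inspecting before trusting either count.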
