Amazon review data

Julian McAuley, UCSD

New!: See our updated (2018) version of the Amazon data here

See a variety of other datasets for recommender systems research on our lab's dataset webpage

Description

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 – July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Files

“Small” subsets for experimentation

If you're using this data for a class project (or similar), please consider using one of the smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me for access.

K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
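The k-core reduction described above can be sketched as an iterative filter. This is only an illustration, not the script used to build the files: the function name `k_core` and the representation of reviews as (user, item) pairs are assumptions.

```python
from collections import Counter

def k_core(pairs, k):
    """Return the (user, item) pairs that survive k-core filtering.

    Users and items with fewer than k reviews are dropped repeatedly,
    because removing one review can push another user or item below k.
    """
    pairs = set(pairs)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = {(u, i) for u, i in pairs
                if user_counts[u] >= k and item_counts[i] >= k}
        if len(kept) == len(pairs):  # nothing removed: fixed point reached
            return kept
        pairs = kept
```

The loop is needed because a single pass is not enough: dropping a sparse user can leave an item with too few reviews, and vice versa.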

Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.

Books 5-core (8,898,041 reviews) ratings only (22,507,155 ratings)
Electronics 5-core (1,689,188 reviews) ratings only (7,824,482 ratings)
Movies and TV 5-core (1,697,533 reviews) ratings only (4,607,047 ratings)
CDs and Vinyl 5-core (1,097,592 reviews) ratings only (3,749,004 ratings)
Clothing, Shoes and Jewelry 5-core (278,677 reviews) ratings only (5,748,920 ratings)
Home and Kitchen 5-core (551,682 reviews) ratings only (4,253,926 ratings)
Kindle Store 5-core (982,619 reviews) ratings only (3,205,467 ratings)
Sports and Outdoors 5-core (296,337 reviews) ratings only (3,268,695 ratings)
Cell Phones and Accessories 5-core (194,439 reviews) ratings only (3,447,249 ratings)
Health and Personal Care 5-core (346,355 reviews) ratings only (2,982,326 ratings)
Toys and Games 5-core (167,597 reviews) ratings only (2,252,771 ratings)
Video Games 5-core (231,780 reviews) ratings only (1,324,753 ratings)
Tools and Home Improvement 5-core (134,476 reviews) ratings only (1,926,047 ratings)
Beauty 5-core (198,502 reviews) ratings only (2,023,070 ratings)
Apps for Android 5-core (752,937 reviews) ratings only (2,638,172 ratings)
Office Products 5-core (53,258 reviews) ratings only (1,243,186 ratings)
Pet Supplies 5-core (157,836 reviews) ratings only (1,235,316 ratings)
Automotive 5-core (20,473 reviews) ratings only (1,373,768 ratings)
Grocery and Gourmet Food 5-core (151,254 reviews) ratings only (1,297,156 ratings)
Patio, Lawn and Garden 5-core (13,272 reviews) ratings only (993,490 ratings)
Baby 5-core (160,792 reviews) ratings only (915,446 ratings)
Digital Music 5-core (64,706 reviews) ratings only (836,006 ratings)
Musical Instruments 5-core (10,261 reviews) ratings only (500,176 ratings)
Amazon Instant Video 5-core (37,126 reviews) ratings only (583,933 ratings)
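As a rough illustration of working with the ratings-only tuples, the sketch below parses (user,item,rating,timestamp) rows with Python's `csv` module. It assumes the files have no header row and use exactly that column order; the helper name `read_ratings` and the in-memory sample row are hypothetical.

```python
import csv
import io

def read_ratings(lines):
    """Yield (user, item, rating, timestamp) tuples from CSV rows.

    Assumes four columns in this order and no header row.
    """
    for user, item, rating, timestamp in csv.reader(lines):
        yield user, item, float(rating), int(timestamp)

# Stand-in for an open ratings file; the row reuses IDs from the sample review.
sample = io.StringIO("A2SUAM1J3GNN3B,0000013714,5.0,1252800000\n")
rows = list(read_ratings(sample))
```

For a real file, an open file handle can be passed in place of `sample`.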

Complete review data

Please see the per-category files below, and only download these (large!) files if you really need them:

raw review data (20gb) – all 142.8 million reviews

The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:

user review data (18gb) – duplicate items removed (83.68 million reviews), sorted by user

product review data (18gb) – duplicate items removed, sorted by product

ratings only (3.2gb) – same as above, in csv form without reviews or metadata

5-core (9.9gb) – subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:

aggressively deduplicated data (18gb) – no duplicates whatsoever (82.83 million reviews)
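The exact criteria used to build the aggressively deduplicated file are not stated here. As a loose sketch of the idea, the snippet below keeps only the first review for each distinct review text, regardless of who wrote it. Matching on whitespace-trimmed, lower-cased text is an assumption, and `drop_duplicate_text` is a hypothetical helper.

```python
def drop_duplicate_text(reviews):
    """Yield reviews whose text has not been seen before.

    Duplicates are detected on normalized reviewText only, so copies
    posted under different reviewer IDs are also dropped (assumption).
    """
    seen = set()
    for review in reviews:
        key = review["reviewText"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        yield review
```

For a file with tens of millions of reviews, storing a hash of each text instead of the text itself would keep memory use lower.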

Format is one-review-per-line in (loose) json. See examples below for further help reading the data.

Sample review:

    {
      "reviewerID": "A2SUAM1J3GNN3B",
      "asin": "0000013714",
      "reviewerName": "J. McDonald",
      "helpful": [2, 3],
      "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
      "overall": 5.0,
      "summary": "Heavenly Highway Hymns",
      "unixReviewTime": 1252800000,
      "reviewTime": "09 13, 2009"
    }

where

  • reviewerID – ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin – ID of the product, e.g. 0000013714
  • reviewerName – name of the reviewer
  • helpful – helpfulness rating of the review, e.g. 2/3
  • reviewText – text of the review
  • overall – rating of the product
  • summary – summary of the review
  • unixReviewTime – time of the review (unix time)
  • reviewTime – time of the review (raw)
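The fields listed above can be turned into more convenient values with a few lines of Python. The sketch below assumes the helpfulness pair is [helpful votes, total votes], as the "2/3" example suggests, and that the unix time is in seconds; the helper names are hypothetical.

```python
from datetime import datetime, timezone

def helpful_fraction(review):
    """Return helpful votes / total votes, or None if nobody voted."""
    helpful_votes, total_votes = review["helpful"]
    return helpful_votes / total_votes if total_votes else None

def review_date(review):
    """Convert unixReviewTime (seconds since the epoch) to a UTC date."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()

# Values taken from the sample review.
review = {"helpful": [2, 3], "unixReviewTime": 1252800000}
fraction = helpful_fraction(review)
date = review_date(review)
```

For the sample review, the converted date agrees with the raw reviewTime value "09 13, 2009".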

Metadata

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (3.1gb) – metadata for 9.4 million products

Sample metadata:

    {
      "asin": "0000031852",
      "title": "Girls Ballet Tutu Zebra Hot Pink",
      "price": 3.17,
      "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
      "related": {
        "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"],
        "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"],
        "bought_together": ["B002BZX8Z6"]
      },
      "salesRank": {"Toys & Games": 211836},
      "brand": "Coxlures",
      "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
    }

where

  • asin – ID of the product, e.g. 0000031852
  • title – name of the product
  • price – price in US dollars (at time of crawl)
  • imUrl – url of the product image
  • related – related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank – sales rank information
  • brand – brand name
  • categories – list of categories the product belongs to
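The "related" field can be used to build the also-bought graph mentioned in the description. The following is a minimal sketch, assuming each metadata record is a Python dict shaped like the sample and that "related" or "also_bought" may be absent for some products; `also_bought_graph` is a hypothetical helper.

```python
from collections import defaultdict

def also_bought_graph(products):
    """Map each product's asin to the set of asins listed under also_bought."""
    graph = defaultdict(set)
    for product in products:
        # Fall back to empty containers when a product has no links (assumption).
        for other in product.get("related", {}).get("also_bought", []):
            graph[product["asin"]].add(other)
    return graph

# Two toy records; the first reuses IDs from the sample metadata.
products = [
    {"asin": "0000031852", "related": {"also_bought": ["B00JHONN1S", "B002BZX8Z6"]}},
    {"asin": "0000031909"},
]
graph = also_bought_graph(products)
```

The same pattern applies to the other link types by swapping "also_bought" for "also_viewed" or "bought_together".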

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
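To make the record layout concrete, the sketch below computes the size of one record and unpacks a single record from a byte buffer with `struct`. It assumes 4-byte floats in the machine's native byte order and an ASCII product ID; the constant and function names are hypothetical.

```python
import struct

ASIN_BYTES = 10    # product ID, one byte per character
N_FLOATS = 4096    # feature vector length
# Assumes 4-byte floats: 10 + 4096 * 4 = 16394 bytes per product.
RECORD_BYTES = ASIN_BYTES + N_FLOATS * struct.calcsize("=f")

def unpack_record(buf):
    """Split one fixed-size record into (asin, list of floats)."""
    asin = buf[:ASIN_BYTES].decode("ascii")
    features = struct.unpack(f"={N_FLOATS}f", buf[ASIN_BYTES:RECORD_BYTES])
    return asin, list(features)
```

Because every record has the same size, the n-th product starts at byte offset `n * RECORD_BYTES`, so a file can be sampled with `seek` rather than read from the start.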

visual features (141gb) – visual features for all products

The images themselves can be extracted from the imUrl field in the metadata files.

Per-category files

Below are files for individual product categories, which have already had duplicate item reviews removed.

Books reviews (22,507,155 reviews) metadata (2,370,585 products) image features
Electronics reviews (7,824,482 reviews) metadata (498,196 products) image features
Movies and TV reviews (4,607,047 reviews) metadata (208,321 products) image features
CDs and Vinyl reviews (3,749,004 reviews) metadata (492,799 products) image features
Clothing, Shoes and Jewelry reviews (5,748,920 reviews) metadata (1,503,384 products) image features
Home and Kitchen reviews (4,253,926 reviews) metadata (436,988 products) image features
Kindle Store reviews (3,205,467 reviews) metadata (434,702 products) image features
Sports and Outdoors reviews (3,268,695 reviews) metadata (532,197 products) image features
Cell Phones and Accessories reviews (3,447,249 reviews) metadata (346,793 products) image features
Health and Personal Care reviews (2,982,326 reviews) metadata (263,032 products) image features
Toys and Games reviews (2,252,771 reviews) metadata (336,072 products) image features
Video Games reviews (1,324,753 reviews) metadata (50,953 products) image features
Tools and Home Improvement reviews (1,926,047 reviews) metadata (269,120 products) image features
Beauty reviews (2,023,070 reviews) metadata (259,204 products) image features
Apps for Android reviews (2,638,173 reviews) metadata (61,551 products) image features
Office Products reviews (1,243,186 reviews) metadata (134,838 products) image features
Pet Supplies reviews (1,235,316 reviews) metadata (110,707 products) image features
Automotive reviews (1,373,768 reviews) metadata (331,090 products) image features
Grocery and Gourmet Food reviews (1,297,156 reviews) metadata (171,760 products) image features
Patio, Lawn and Garden reviews (993,490 reviews) metadata (109,094 products) image features
Baby reviews (915,446 reviews) metadata (71,317 products) image features
Digital Music reviews (836,006 reviews) metadata (279,899 products) image features
Musical Instruments reviews (500,176 reviews) metadata (84,901 products) image features
Amazon Instant Video reviews (583,933 reviews) metadata (30,648 products) image features

Citation

Please cite one or both of the following if you use the data in any way:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
pdf

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
pdf

Code

Reading the data

Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:

    import gzip

    def parse(path):
        # Each line is a python dictionary literal (loose json), hence eval.
        g = gzip.open(path, 'r')
        for l in g:
            yield eval(l)

Convert to 'strict' json

The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:

    import json
    import gzip

    def parse(path):
        g = gzip.open(path, 'r')
        for l in g:
            # Re-serialize each python dictionary as strict json.
            yield json.dumps(eval(l))

    f = open("output.strict", 'w')
    for l in parse("reviews_Video_Games.json.gz"):
        f.write(l + '\n')
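Once converted, a file such as output.strict holds one strict json object per line, so it can be read back with a standard json parser. A minimal Python sketch follows; the helper name `read_strict` and the in-memory sample line are hypothetical.

```python
import io
import json

def read_strict(lines):
    """Yield one dictionary per non-empty line of strict json."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for open("output.strict"); the values come from the sample review.
sample = io.StringIO('{"asin": "0000013714", "overall": 5.0}\n')
records = list(read_strict(sample))
```

Any language with a json library can process the converted file the same way, one line at a time.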

Pandas data frame

This code reads the data into a pandas data frame:

    import pandas as pd
    import gzip

    def parse(path):
        g = gzip.open(path, 'rb')
        for l in g:
            yield eval(l)

    def getDF(path):
        # Collect records keyed by row number, then build the frame in one step.
        i = 0
        df = {}
        for d in parse(path):
            df[i] = d
            i += 1
        return pd.DataFrame.from_dict(df, orient='index')

    df = getDF('reviews_Video_Games.json.gz')

Read image features

    import array

    def readImageFeatures(path):
        f = open(path, 'rb')
        while True:
            asin = f.read(10)  # 10-character product ID
            if not asin:  # empty read means end of file
                break
            a = array.array('f')
            a.fromfile(f, 4096)  # 4096 floats follow each ID
            yield asin, a.tolist()

Example: compute average rating

    ratings = []
    for review in parse("reviews_Video_Games.json.gz"):
        ratings.append(review['overall'])
    print(sum(ratings) / len(ratings))

Example: latent-factor model in mymedialite

Predicts ratings from a rating-only CSV file

    ./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
