Introduction
In this set of exercises we will work with the Wine Reviews dataset.
Run the following cell to load your data and some utility functions (including code to check your answers).
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.indexing_selecting_and_assigning import *
print("Setup complete.")
Setup complete.
Look at an overview of your data by running the following line.
reviews.head()
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
Exercises
1.
Select the description
column from reviews
and assign the result to the variable desc
.
# Your code here
desc = reviews.description
# Check your answer
q1.check()
<IPython.core.display.Javascript object>
Correct
Follow-up question: what type of object is desc
? If you're not sure, you can check by calling Python's type
function: type(desc)
.
q1.hint()
q1.solution()
<IPython.core.display.Javascript object>
Hint: As an example, say we would like to select the column column
from a DataFrame table
. Then we have two options: we can call either table.column
or table["column"]
.
<IPython.core.display.Javascript object>
Solution:
desc = reviews.description
or
desc = reviews["description"]
desc
is a pandas Series
object, with an index matching the reviews
DataFrame. In general, when we select a single column from a DataFrame, we'll get a Series.
2.
Select the first value from the description column of reviews
, assigning it to variable first_description
.
first_description = reviews.description.iloc[0]
# Check your answer
q2.check()
first_description
<IPython.core.display.Javascript object>
Correct:
first_description = reviews.description.iloc[0]
Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result, such as reviews.description.loc[0]
, reviews.description[0]
, and more!
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive,
offering unripened apple, citrus and dried sage alongside brisk acidity."
q2.hint()
q2.solution()
<IPython.core.display.Javascript object>
Hint: To obtain a specific entry (corresponding to column column
and row i
) in a DataFrame table
, we can call table.column.iloc[i]
. Remember that Python indexing starts at 0!
<IPython.core.display.Javascript object>
Solution:
first_description = reviews.description.iloc[0]
Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result, such as reviews.description.loc[0]
, reviews.description[0]
, and more!
3.
Select the first row of data (the first record) from reviews
, assigning it to the variable first_row
.
first_row = reviews.iloc[0,:]
# Check your answer
q3.check()
first_row
<IPython.core.display.Javascript object>
Correct
country Italy
description Aromas include tropical fruit, broom, brimston...
...
variety White Blend
winery Nicosia
Name: 0, Length: 13, dtype: object
q3.hint()
q3.solution()
<IPython.core.display.Javascript object>
Hint: To obtain a specific row of a DataFrame, we can use the iloc
operator. For more information, see the section on Index-based selection in the reference component.
<IPython.core.display.Javascript object>
Solution:
first_row = reviews.iloc[0]
4.
Select the first 10 values from the description
column in reviews
, assigning the result to variable first_descriptions
.
Hint: format your output as a pandas Series.
first_descriptions = reviews.description.iloc[:10]
# Check your answer
q4.check()
first_descriptions
<IPython.core.display.Javascript object>
Correct:
first_descriptions = reviews.description.iloc[:10]
Note that many other options will return a valid result, such as desc.head(10)
and reviews.loc[:9, "description"]
.
0 Aromas include tropical fruit, broom, brimston...
1 This is ripe and fruity, a wine that is smooth...
...
8 Savory dried thyme notes accent sunnier flavor...
9 This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object
q4.hint()
q4.solution()
<IPython.core.display.Javascript object>
Hint: We can use either the loc
or iloc
operator to solve this problem. For more information, see the sections on Index-based selection and Label-based selection in the reference component.
<IPython.core.display.Javascript object>
Solution:
first_descriptions = reviews.description.iloc[:10]
Note that many other options will return a valid result, such as desc.head(10)
and reviews.loc[:9, "description"]
.
5.
Select the records with index labels 1
, 2
, 3
, 5
, and 8
, assigning the result to the variable sample_reviews
.
In other words, generate the following DataFrame:
rows = [1,2,3,5,8]
sample_reviews = reviews.loc[rows]
# Check your answer
q5.check()
sample_reviews
<IPython.core.display.Javascript object>
Correct
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
5 | Spain | Blackberry and raspberry aromas show a typical... | Ars In Vitro | 87 | 15.0 | Northern Spain | Navarra | NaN | Michael Schachner | @wineschach | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... | Tempranillo-Merlot | Tandem |
8 | Germany | Savory dried thyme notes accent sunnier flavor... | Shine | 87 | 12.0 | Rheinhessen | NaN | NaN | Anna Lee C. Iijima | NaN | Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... | Gewürztraminer | Heinz Eifel |
q5.hint()
q5.solution()
<IPython.core.display.Javascript object>
Hint: Use either the loc
or iloc
operator to select rows of a DataFrame.
<IPython.core.display.Javascript object>
Solution:
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
6.
Create a variable df
containing the country
, province
, region_1
, and region_2
columns of the records with the index labels 0
, 1
, 10
, and 100
. In other words, generate the following DataFrame:
rows = [0,1,10,100]
columns_index = ["country","province","region_1","region_2"]
df = reviews.loc[rows,columns_index]
# Check your answer
q6.check()
df
<IPython.core.display.Javascript object>
Correct
country | province | region_1 | region_2 | |
---|---|---|---|---|
0 | Italy | Sicily & Sardinia | Etna | NaN |
1 | Portugal | Douro | NaN | NaN |
10 | US | California | Napa Valley | Napa |
100 | US | New York | Finger Lakes | Finger Lakes |
q6.hint()
q6.solution()
<IPython.core.display.Javascript object>
Hint: Use the loc
operator. (Note that it is also possible to solve this problem using the iloc
operator, but this would require extra effort to convert each column name to a corresponding integer-valued index.)
<IPython.core.display.Javascript object>
Solution:
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
7.
Create a variable df
containing the country
and variety
columns of the first 100 records.
Hint: you may use loc
or iloc
. When working on the answer this question and the several of the ones that follow, keep the following "gotcha" described in the tutorial:
iloc
uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded.loc
, meanwhile, indexes inclusively.
This is particularly confusing when the DataFrame index is a simple numerical list, e.g.
0,...,1000
. In this casedf.iloc[0:1000]
will return 1000 entries, whiledf.loc[0:1000]
return 1001 of them! To get 1000 elements usingloc
, you will need to go one lower and ask fordf.iloc[0:999]
.
df = reviews.loc[0:99,["country","variety"]]
# Check your answer
q7.check()
df
<IPython.core.display.Javascript object>
Correct:
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
or
cols_idx = [0, 11]
df = reviews.iloc[:100, cols_idx]
country | variety | |
---|---|---|
0 | Italy | White Blend |
1 | Portugal | Portuguese Red |
... | ... | ... |
98 | Italy | Sangiovese |
99 | US | Bordeaux-style Red Blend |
100 rows × 2 columns
q7.hint()
q7.solution()
<IPython.core.display.Javascript object>
Hint: It is most straightforward to solve this problem with the loc
operator. (However, if you decide to use iloc
, remember to first convert each column into a corresponding integer-valued index.)
<IPython.core.display.Javascript object>
Solution:
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
or
cols_idx = [0, 11]
df = reviews.iloc[:100, cols_idx]
8.
Create a DataFrame italian_wines
containing reviews of wines made in Italy
. Hint: reviews.country
equals what?
italian_wines = reviews[reviews.country == "Italy"]
# Check your answer
q8.check()
<IPython.core.display.Javascript object>
Correct
q8.hint()
q8.solution()
<IPython.core.display.Javascript object>
Hint: For more information, see the section on Conditional selection in the reference component.
<IPython.core.display.Javascript object>
Solution:
italian_wines = reviews[reviews.country == 'Italy']
9.
Create a DataFrame top_oceania_wines
containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.
top_oceania_wines = reviews.loc[reviews.country.isin(["Australia","New Zealand"])
& (reviews.points >= 95)]
# Check your answer
q9.check()
top_oceania_wines
<IPython.core.display.Javascript object>
Correct
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
345 | Australia | This wine contains some material over 100 year... | Rare | 100 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
346 | Australia | This deep brown wine smells like a damp, mossy... | Rare | 98 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscadelle... | Muscadelle | Chambers Rosewood Vineyards |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
122507 | New Zealand | This blend of Cabernet Sauvignon (62.5%), Merl... | SQM Gimblett Gravels Cabernets/Merlot | 95 | 79.0 | Hawke's Bay | NaN | NaN | Joe Czerwinski | @JoeCz | Squawking Magpie 2014 SQM Gimblett Gravels Cab... | Bordeaux-style Red Blend | Squawking Magpie |
122939 | Australia | Full-bodied and plush yet vibrant and imbued w... | The Factor | 98 | 125.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2013 The Factor Shiraz (Barossa Valley) | Shiraz | Torbreck |
49 rows × 13 columns
q9.hint()
q9.solution()
<IPython.core.display.Javascript object>
Hint: For more information, see the section on Conditional selection in the reference component.
<IPython.core.display.Javascript object>
Solution:
top_oceania_wines = reviews.loc[
(reviews.country.isin(['Australia', 'New Zealand']))
& (reviews.points >= 95)
]