Pandas Tutorial
Tutorial, Online, 2024
En este tutorial vamos trabajar con el paquete Pandas de Python. Pandas es ampliamente empleado en toda la temática de aprendizaje automático como por científicos de datos.
Pandas
Pandas documentación oficial, además podes acceder a tutoriales introductorios.
Estructuras básicas
- Dataframes
- Series
Datasets Online
Kaggle: Your Machine Learning and Data Science Community UCI Machine Learning Repository
Code Example
import pandas as pd
path = 'online_retail.csv'
retail_data = pd.read_csv(path)
#Excel
#data_excel = pd.read_excel(path)
#JSON
#data_json = pd.read_json(path)
#print(type(retail_data))
print(retail_data)
Salida al ejecutar el código anterior.
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
... ... ... ... ...
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3
InvoiceDate UnitPrice CustomerID Country
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
... ... ... ... ...
541904 2011-12-09 12:50:00 0.85 12680.0 France
541905 2011-12-09 12:50:00 2.10 12680.0 France
541906 2011-12-09 12:50:00 4.15 12680.0 France
541907 2011-12-09 12:50:00 4.15 12680.0 France
541908 2011-12-09 12:50:00 4.95 12680.0 France
[541909 rows x 8 columns]
Dataframes
Un dataframe es una estructura bidimensional similar a una tabla, donde puedes almacenar datos de diferentes tipos. Se pueden leer datos desde un archivo CSV, crear dataframes a partir de arrays, listas y diccionarios. También se puede descomponer dataframes en series.
Creando desde un array
import numpy as np
data = np.array([[1,2,3], [4,5,6], [7,8,9]])
dt_from_array = pd.DataFrame(data, columns=['A','B', 'c'])
print(dt_from_array)
A B c
0 1 2 3
1 4 5 6
2 7 8 9
Creando desde una lista
data = [[1,'Jhon', 22], [2,'Anna', 24]]
df_from_list = pd.DataFrame(data, columns = ['ID', 'Name', 'Age'])
print(df_from_list)
ID Name Age
0 1 Jhon 22
1 2 Anna 24
Creando desde una diccionario
data = [{'ID': 1,
'Name': 'Jhon',
'Age' : 22}]
dt_from_dict_list = pd.DataFrame(data)
print(dt_from_dict_list )
ID Name Age
0 1 Jhon 22
Creando usando Pandas Series
Una serie en Pandas es una estructura de datos unidimensional que puede contener datos de diferentes tipos (enteros, cadenas, fechas, etc.). Es similar a una columna en un DataFrame, y cada serie tiene un índice que permite acceder a sus elementos de manera eficiente. Las series son útiles para manipular y analizar datos, y se pueden crear a partir de listas, diccionarios o arrays de NumPy.
data = {'ID': pd.Series([1,2,3]),
'Name': pd.Series(['Jhon', 'Anna', 'Mike']),
'Age': pd.Series([22,24,21])}
df_from_series_dict = pd.DataFrame(data)
print(df_from_series_dict)
ID Name Age
0 1 Jhon 22
1 2 Anna 24
2 3 Mike 21
Ayuda por si empleamos Google Colab y Google Drive
from google.colab import drive
drive.mount('/content/drive')
file_path = "/content/drive/My Drive/online_retail.csv"
sales_data = pd.read_csv(file_path)
print(sales_data.head())
Estructuras de Datos y Funciones
Obteniendo el nombre de las columnas,
columns_names = retail_data.columns
print(columns_names)
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
Obteniendo dimensión del data frame
num_rows, num_columns = retail_data.shape
print('Número de filas', num_rows)
print('Número de columnas', num_columns)
Número de filas 541909
Número de columnas 8
Obtener una columna específica
daily_sales = retail_data['Quantity']
print(daily_sales)
print(daily_sales[2])
0 6
1 6
2 8
3 6
4 6
..
541904 12
541905 6
541906 4
541907 4
541908 3
Name: Quantity, Length: 541909, dtype: int64
8
Información estadística del DataFrame
summary = retail_data.describe()
print(summary)
Quantity UnitPrice CustomerID
count 541909.000000 541909.000000 406829.000000
mean 9.552250 4.611114 15287.690570
std 218.081158 96.759853 1713.600303
min -80995.000000 -11062.060000 12346.000000
25% 1.000000 1.250000 13953.000000
50% 3.000000 2.080000 15152.000000
75% 10.000000 4.130000 16791.000000
max 80995.000000 38970.000000 18287.000000
Información estadística de la columana seleccionada
daily_sales = retail_data['Quantity']
mean_value = daily_sales.mean()
print('La media es', mean_value)
median_value = daily_sales.median()
print('La mediana es', mean_value)
sum_value = daily_sales.sum()
print('La suma total es', mean_value)
count_values = daily_sales.count()
print(count_values)
La media es 9.55224954743324
La mediana es 9.55224954743324
La suma total es 9.55224954743324
541909
Creando un Pandas Series
daily_sales = pd.Series([10, 20, None, 40, 50])
# Suma de Quantity
total_sum = daily_sales.sum()
print("Suma de Quantity:", total_sum)
# Conteo de Quantity
count_values = daily_sales.count()
print("Conteo de Quantity:", count_values)
Suma de Quantity: 120.0
Conteo de Quantity: 4
Uso de iloc y loc
Iloc
Extra la información del dataframe especificando el índice.
first_row = retail_data.iloc[0]
print(first_row)
InvoiceNo 536365
StockCode 85123A
Description WHITE HANGING HEART T-LIGHT HOLDER
Quantity 6
InvoiceDate 2010-12-01 08:26:00
UnitPrice 2.55
CustomerID 17850.0
Country United Kingdom
Name: 0, dtype: object
first_five_row = retail_data.iloc[6:8]
print(first_five_row)
InvoiceNo StockCode Description Quantity \
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6
7 536366 22633 HAND WARMER UNION JACK 6
InvoiceDate UnitPrice CustomerID Country
6 2010-12-01 08:26:00 4.25 17850.0 United Kingdom
7 2010-12-01 08:28:00 1.85 17850.0 United Kingdom
Podemos emplear todas las formas de slicing, como si fuera un array de Numpy.
subset = retail_data.iloc[:3, :2]
print(subset)
InvoiceNo StockCode
0 536365 85123A
1 536365 71053
2 536365 84406B
Loc
Accede a la información del dataframe pero especificando la etiqueta.
row_index_3 = retail_data.loc[3]
print(row_index_3)
InvoiceNo 536365
StockCode 84029G
Description KNITTED UNION FLAG HOT WATER BOTTLE
Quantity 6
InvoiceDate 2010-12-01 08:26:00
UnitPrice 3.39
CustomerID 17850.0
Country United Kingdom
Name: 3, dtype: object
row_index_0_to_4 = retail_data.loc[0:4]
print(row_index_0_to_4)
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
InvoiceDate UnitPrice CustomerID Country
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
quantity_column = retail_data.loc[:,'Quantity']
print(quantity_column)
0 6
1 6
2 8
3 6
4 6
..
541904 12
541905 6
541906 4
541907 4
541908 3
Name: Quantity, Length: 541909, dtype: int64
quantity_unitprices_column = retail_data.loc[:,['Quantity','UnitPrice']]
print(quantity_unitprices_column)
Quantity UnitPrice
0 6 2.55
1 6 3.39
2 8 2.75
3 6 3.39
4 6 3.39
... ... ...
541904 12 0.85
541905 6 2.10
541906 4 4.15
541907 4 4.15
541908 3 4.95
[541909 rows x 2 columns]
Manejo de Datos Faltantes
Documentación en Pandas para el manejo de datos faltantes
Buscamos datos faltantes
missing_data = retail_data.isna()
print(missing_data)
De esta forma es díficil visualizarlos, entonces hacemos
missing_data_count = retail_data.isna().sum()
print('Conteo de datos faltantes por columna:\n',missing_data_count )
Conteo de datos faltantes por columna:
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
dtype: int64
Podemos observar los datos que no tiene la información. Ahora podemos eliminar todas las filas que no tienen datos o llenar estos valores con algún tipo de operación.
Borrando las filas
no_missing_rows = retail_data.dropna()
print('Datos sin filas con valores faltantes:\n', no_missing_rows)
Datos sin filas con valores faltantes:
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
... ... ... ... ...
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3
InvoiceDate UnitPrice CustomerID Country
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
... ... ... ... ...
541904 2011-12-09 12:50:00 0.85 12680.0 France
541905 2011-12-09 12:50:00 2.10 12680.0 France
541906 2011-12-09 12:50:00 4.15 12680.0 France
541907 2011-12-09 12:50:00 4.15 12680.0 France
541908 2011-12-09 12:50:00 4.95 12680.0 France
[406829 rows x 8 columns]
Borrando las columnas
no_missing_columns = retail_data.dropna(axis=1)
print('Datos sin columnas con valores faltantes:\n', no_missing_columns)
DAtos sin columnas con valores faltantes:
InvoiceNo StockCode Quantity InvoiceDate UnitPrice \
0 536365 85123A 6 2010-12-01 08:26:00 2.55
1 536365 71053 6 2010-12-01 08:26:00 3.39
2 536365 84406B 8 2010-12-01 08:26:00 2.75
3 536365 84029G 6 2010-12-01 08:26:00 3.39
4 536365 84029E 6 2010-12-01 08:26:00 3.39
... ... ... ... ... ...
541904 581587 22613 12 2011-12-09 12:50:00 0.85
541905 581587 22899 6 2011-12-09 12:50:00 2.10
541906 581587 23254 4 2011-12-09 12:50:00 4.15
541907 581587 23255 4 2011-12-09 12:50:00 4.15
541908 581587 22138 3 2011-12-09 12:50:00 4.95
Country
0 United Kingdom
1 United Kingdom
2 United Kingdom
3 United Kingdom
4 United Kingdom
... ...
541904 France
541905 France
541906 France
541907 France
541908 France
[541909 rows x 6 columns]
Rellenando datos faltantes con ceros
retail_data_filled_zeros = retail_data.fillna(0)
retail_data_filled_zeros_count = retail_data_filled_zeros.isna().sum()
print(retail_data_filled_zeros_count)
InvoiceNo 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0
dtype: int64
Rellenando datos faltantes con operaciones matemáticas
mean_unit_price = retail_data['UnitPrice'].mean()
retail_data_filled_mean = retail_data['UnitPrice'].fillna(mean_unit_price)
print(retail_data_filled_mean)
0 2.55
1 3.39
2 2.75
3 3.39
4 3.39
...
541904 0.85
541905 2.10
541906 4.15
541907 4.15
541908 4.95
Name: UnitPrice, Length: 541909, dtype: float64
Creación y Manipulación de Columnas
Como usar apply en DataFrame y Series
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print(df.head())
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
InvoiceDate UnitPrice CustomerID Country TotalPrice
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
df['HighValue'] = df['TotalPrice'] > 16
print(df['HighValue'].head(10))
0 False
1 True
2 True
3 True
4 True
5 False
6 True
7 False
8 False
9 True
Name: HighValue, dtype: bool
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
8 TotalPrice 541909 non-null float64
9 HighValue 541909 non-null bool
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 37.7+ MB
None
Cambiando a tipo de dato Fecha
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null datetime64[ns]
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
8 TotalPrice 541909 non-null float64
9 HighValue 541909 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 37.7+ MB
None
Aplicando una función lambda
df['DiscountedPrice'] = df['UnitPrice'].apply(lambda x: x* 0.9)
print(df.head(3))
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
InvoiceDate UnitPrice CustomerID Country TotalPrice \
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
HighValue DiscountedPrice
0 False 2.295
1 True 3.051
2 True 2.475
Aplicando una función específica
def categorize_price(price):
if price > 50:
return 'High'
elif price < 20:
return 'Medium'
else:
return 'Low'
df['PriceCategory'] = df['UnitPrice'].apply(categorize_price)
print(df.head(10))
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6
7 536366 22633 HAND WARMER UNION JACK 6
8 536366 22632 HAND WARMER RED POLKA DOT 6
9 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32
InvoiceDate UnitPrice CustomerID Country TotalPrice \
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
5 2010-12-01 08:26:00 7.65 17850.0 United Kingdom 15.30
6 2010-12-01 08:26:00 4.25 17850.0 United Kingdom 25.50
7 2010-12-01 08:28:00 1.85 17850.0 United Kingdom 11.10
8 2010-12-01 08:28:00 1.85 17850.0 United Kingdom 11.10
9 2010-12-01 08:34:00 1.69 13047.0 United Kingdom 54.08
HighValue DiscountedPrice PriceCategory
0 False 2.295 Medium
1 True 3.051 Medium
2 True 2.475 Medium
3 True 3.051 Medium
4 True 3.051 Medium
5 False 6.885 Medium
6 True 3.825 Medium
7 False 1.665 Medium
8 False 1.665 Medium
9 True 1.521 Medium
Agrupaciones con groupby
country_count = df['Country'].value_counts()
print(country_count)
Country
United Kingdom 495478
Germany 9495
France 8557
EIRE 8196
Spain 2533
Netherlands 2371
Belgium 2069
Switzerland 2002
Portugal 1519
Australia 1259
Norway 1086
Italy 803
Channel Islands 758
Finland 695
Cyprus 622
Sweden 462
Unspecified 446
Austria 401
Denmark 389
Japan 358
Poland 341
Israel 297
USA 291
Hong Kong 288
Singapore 229
Iceland 182
Canada 151
Greece 146
Malta 127
United Arab Emirates 68
European Community 61
RSA 58
Lebanon 45
Lithuania 35
Brazil 32
Czech Republic 30
Bahrain 19
Saudi Arabia 10
Name: count, dtype: int64
Agrupando tomando en cuenta una columna
country_group = df.groupby('Country')['Quantity'].sum()
print(country_group)
Country
Australia 83653
Austria 4827
Bahrain 260
Belgium 23152
Brazil 356
Canada 2763
Channel Islands 9479
Cyprus 6317
Czech Republic 592
Denmark 8188
EIRE 142637
European Community 497
Finland 10666
France 110480
Germany 117448
Greece 1556
Hong Kong 4769
Iceland 2458
Israel 4353
Italy 7999
Japan 25218
Lebanon 386
Lithuania 652
Malta 944
Netherlands 200128
Norway 19247
Poland 3653
Portugal 16180
RSA 352
Saudi Arabia 75
Singapore 5234
Spain 26824
Sweden 35637
Switzerland 30325
USA 1034
United Arab Emirates 982
United Kingdom 4263829
Unspecified 3300
Name: Quantity, dtype: int64
Agrupando para obtener alguna estadística
country_stats = df.groupby('Country')['UnitPrice'].agg(['mean', 'sum'])
print(country_stats)
mean sum
Country
Australia 3.220612 4054.750
Austria 4.243192 1701.520
Bahrain 4.556316 86.570
Belgium 3.644335 7540.130
Brazil 4.456250 142.600
Canada 6.030331 910.580
Channel Islands 4.932124 3738.550
Cyprus 6.302363 3920.070
Czech Republic 2.938333 88.150
Denmark 3.256941 1266.950
EIRE 5.911077 48447.190
European Community 4.820492 294.050
Finland 5.448705 3786.850
France 5.028864 43031.990
Germany 3.966930 37666.000
Greece 4.885548 713.290
Hong Kong 42.505208 12241.500
Iceland 2.644011 481.210
Israel 3.633131 1079.040
Italy 4.831121 3879.390
Japan 2.276145 814.860
Lebanon 5.387556 242.440
Lithuania 2.841143 99.440
Malta 5.244173 666.010
Netherlands 2.738317 6492.550
Norway 6.012026 6529.060
Poland 4.170880 1422.270
Portugal 8.582976 13037.540
RSA 4.277586 248.100
Saudi Arabia 2.411000 24.110
Singapore 109.645808 25108.890
Spain 4.987544 12633.450
Sweden 3.910887 1806.830
Switzerland 3.403442 6813.690
USA 2.216426 644.980
United Arab Emirates 3.380735 229.890
United Kingdom 4.532422 2245715.474
Unspecified 2.699574 1204.010
Agrupaciones múltiples
country_stock_group = df.groupby(['Country', 'StockCode'])['Quantity'].sum()
print(country_stock_group)
Country StockCode
Australia 15036 600
15056BL 3
16161P 400
16169E 25
20665 6
...
Unspecified 85180A 2
85180B 1
85212 12
85213 12
85227 10
Name: Quantity, Length: 19839, dtype: int64
Agrupando y aplicando una función específica
def total_revenue(group):
return (group['Quantity'] * group['UnitPrice']).sum()
revenue_per_country = df.groupby('Country').apply(total_revenue)
print(revenue_per_country)
Country
Australia 137077.270
Austria 10154.320
Bahrain 548.400
Belgium 40910.960
Brazil 1143.600
Canada 3666.380
Channel Islands 20086.290
Cyprus 12946.290
Czech Republic 707.720
Denmark 18768.140
EIRE 263276.820
European Community 1291.750
Finland 22326.740
France 197403.900
Germany 221698.210
Greece 4710.520
Hong Kong 10117.040
Iceland 4310.000
Israel 7907.820
Italy 16890.510
Japan 35340.620
Lebanon 1693.880
Lithuania 1661.060
Malta 2505.470
Netherlands 284661.540
Norway 35163.460
Poland 7213.140
Portugal 29367.020
RSA 1002.310
Saudi Arabia 131.170
Singapore 9120.390
Spain 54774.580
Sweden 36595.910
Switzerland 56385.350
USA 1730.920
United Arab Emirates 1902.280
United Kingdom 8187806.364
Unspecified 4749.790
dtype: float64
Ejercicio
most_revenue = revenue_per_country.sort_values(ascending=False, inplace=False).head(3)
print(most_revenue)
Country
United Kingdom 8187806.364
Netherlands 284661.540
EIRE 263276.820
dtype: float64
poor_revenue = revenue_per_country.sort_values(ascending=True, inplace=False).head(3)
print(poor_revenue)
Country
Saudi Arabia 131.17
Bahrain 548.40
Czech Republic 707.72
dtype: float64
Filtrado de datos con condicionales
El filtrado de datos es una técnica utilizada en el análisis de datos para extraer subconjuntos específicos de información de un conjunto de datos más grande. Permite enfocarse en registros que cumplen ciertas condiciones, como analizar ventas de un país específico o transacciones en un período concreto. En Pandas, esto se logra utilizando condiciones lógicas sobre las columnas del DataFrame, permitiendo así una mejor comprensión y visualización de los datos relevantes para el análisis.
Filtrando ventas por columna
#Filtrar ventas en el Reino Unido - UK
uk_sales = sales_data[sales_data['Country'] == 'United Kingdom']
print(uk_sales)
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
... ... ... ... ...
541889 581585 22466 FAIRY TALE COTTAGE NIGHT LIGHT 12
541890 581586 22061 LARGE CAKE STAND HANGING STRAWBERY 8
541891 581586 23275 SET OF 3 HANGING OWLS OLLIE BEAK 24
541892 581586 21217 RED RETROSPOT ROUND CAKE TINS 24
541893 581586 20685 DOORMAT RED RETROSPOT 10
InvoiceDate UnitPrice CustomerID Country TotalPrice
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
... ... ... ... ... ...
541889 2011-12-09 12:31:00 1.95 15804.0 United Kingdom 23.40
541890 2011-12-09 12:49:00 2.95 13113.0 United Kingdom 23.60
541891 2011-12-09 12:49:00 1.25 13113.0 United Kingdom 30.00
541892 2011-12-09 12:49:00 8.95 13113.0 United Kingdom 214.80
541893 2011-12-09 12:49:00 7.08 13113.0 United Kingdom 70.80
[361878 rows x 9 columns]
Filtrando respecto a un valor
high_quantity_sales = sales_data[sales_data['Quantity'] > 40]
print(high_quantity_sales)
InvoiceNo StockCode Description Quantity \
31 536370 10002 INFLATABLE POLITICAL GLOBE 48
46 536371 22086 PAPER CHAIN KIT 50'S CHRISTMAS 80
82 536376 22114 HOT WATER BOTTLE TEA AND SYMPATHY 48
83 536376 21733 RED HANGING HEART T-LIGHT HOLDER 64
96 536378 21212 PACK OF 72 RETROSPOT CAKE CASES 120
... ... ... ... ...
541746 581571 23167 SMALL CERAMIC TOP STORAGE JAR 96
541747 581571 21314 SMALL GLASS HEART TRINKET POT 48
541751 581572 23328 SET 6 SCHOOL MILK BOTTLES IN CRATE 48
541867 581584 20832 RED FLOCK LOVE HEART PHOTO FRAME 72
541868 581584 85038 6 CHOCOLATE LOVE HEART T-LIGHTS 48
InvoiceDate UnitPrice CustomerID Country TotalPrice
31 2010-12-01 08:45:00 0.85 12583.0 France 40.80
46 2010-12-01 09:00:00 2.55 13748.0 United Kingdom 204.00
82 2010-12-01 09:32:00 3.45 15291.0 United Kingdom 165.60
83 2010-12-01 09:32:00 2.55 15291.0 United Kingdom 163.20
96 2010-12-01 09:37:00 0.42 14688.0 United Kingdom 50.40
... ... ... ... ... ...
541746 2011-12-09 12:00:00 0.69 15311.0 United Kingdom 66.24
541747 2011-12-09 12:00:00 1.85 15311.0 United Kingdom 88.80
541751 2011-12-09 12:08:00 3.39 16705.0 United Kingdom 162.72
541867 2011-12-09 12:25:00 0.72 13777.0 United Kingdom 51.84
541868 2011-12-09 12:25:00 1.85 13777.0 United Kingdom 88.80
[18591 rows x 9 columns]
Varios filtros
uk_high_quantity_sales = sales_data[(sales_data['Country'] == 'United Kingdom') &
(sales_data['Quantity'] > 300)]
print(uk_high_quantity_sales)
InvoiceNo StockCode Description Quantity \
181 536387 22466 FAIRY TALE COTTAGE NIGHTLIGHT 432
182 536387 21731 RED TOADSTOOL LED NIGHT LIGHT 432
730 536437 17021 NAMASTE SWAGAT INCENSE 600
870 536477 21137 BLACK RECORD COVER FRAME 480
2364 536584 84029E RED WOOLLY HOTTIE WHITE HEART. 384
... ... ... ... ...
540060 581457 22189 CREAM HEART CARD HOLDER 324
540061 581457 23543 WALL ART KEEP CALM 698
540070 581458 22197 POPCORN HOLDER 1500
540071 581459 22197 POPCORN HOLDER 1200
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995
InvoiceDate UnitPrice CustomerID Country TotalPrice
181 2010-12-01 09:58:00 1.45 16029.0 United Kingdom 626.40
182 2010-12-01 09:58:00 1.25 16029.0 United Kingdom 540.00
730 2010-12-01 12:12:00 0.24 13694.0 United Kingdom 144.00
870 2010-12-01 12:27:00 3.39 16210.0 United Kingdom 1627.20
2364 2010-12-01 16:22:00 2.95 13777.0 United Kingdom 1132.80
... ... ... ... ... ...
540060 2011-12-08 18:43:00 3.39 18102.0 United Kingdom 1098.36
540061 2011-12-08 18:43:00 4.15 18102.0 United Kingdom 2896.70
540070 2011-12-08 18:45:00 0.72 17949.0 United Kingdom 1080.00
540071 2011-12-08 18:46:00 0.72 17949.0 United Kingdom 864.00
540421 2011-12-09 09:15:00 2.08 16446.0 United Kingdom 168469.60
[727 rows x 9 columns]
Filtrado de Fecha
sales_2011 = sales_data[sales_data['InvoiceDate'].dt.year == 2011]
print(sales_2011)
InvoiceNo StockCode Description Quantity \
42481 539993 22386 JUMBO BAG PINK POLKADOT 10
42482 539993 21499 BLUE POLKADOT WRAP 25
42483 539993 21498 RED RETROSPOT WRAP 25
42484 539993 22379 RECYCLING BAG RETROSPOT 5
42485 539993 20718 RED RETROSPOT SHOPPER BAG 10
... ... ... ... ...
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3
InvoiceDate UnitPrice CustomerID Country TotalPrice
42481 2011-01-04 10:00:00 1.95 13313.0 United Kingdom 19.50
42482 2011-01-04 10:00:00 0.42 13313.0 United Kingdom 10.50
42483 2011-01-04 10:00:00 0.42 13313.0 United Kingdom 10.50
42484 2011-01-04 10:00:00 2.10 13313.0 United Kingdom 10.50
42485 2011-01-04 10:00:00 1.25 13313.0 United Kingdom 12.50
... ... ... ... ... ...
541904 2011-12-09 12:50:00 0.85 12680.0 France 10.20
541905 2011-12-09 12:50:00 2.10 12680.0 France 12.60
541906 2011-12-09 12:50:00 4.15 12680.0 France 16.60
541907 2011-12-09 12:50:00 4.15 12680.0 France 16.60
541908 2011-12-09 12:50:00 4.95 12680.0 France 14.85
[379979 rows x 9 columns]
sales_dec_2010 = sales_data[(sales_data['InvoiceDate'].dt.year == 2010) &
(sales_data['InvoiceDate'].dt.month == 12)]
print(sales_dec_2010)
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
... ... ... ... ...
42448 539988 84380 SET OF 3 BUTTERFLY COOKIE CUTTERS 1
42449 539988 84849D HOT BATHS SOAP HOLDER 1
42450 539988 84849B FAIRY SOAP SOAP HOLDER 1
42451 539988 22854 CREAM SWEETHEART EGG HOLDER 2
42452 539988 47559B TEA TIME OVEN GLOVE 2
InvoiceDate UnitPrice CustomerID Country TotalPrice
0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
... ... ... ... ... ...
42448 2010-12-23 16:06:00 1.25 18116.0 United Kingdom 1.25
42449 2010-12-23 16:06:00 1.69 18116.0 United Kingdom 1.69
42450 2010-12-23 16:06:00 1.69 18116.0 United Kingdom 1.69
42451 2010-12-23 16:06:00 4.95 18116.0 United Kingdom 9.90
42452 2010-12-23 16:06:00 1.25 18116.0 United Kingdom 2.50
[26850 rows x 9 columns]
Restructuración de datos: Pivot y Reshape
La reestructuración de datos usando Pivot Tables en Pandas. Se pueden crear tablas dinámicas para resumir y reorganizar datos crudos en un DataFrame, permitiendo realizar cálculos estadísticos como sumas y promedios. Se pueden modificar índices usando columnas y cómo apilar y desapilar datos para una mejor visualización. El uso de estas herramientas permite descubrir patrones y obtener insights valiosos en el análisis de datos.
pivot_table = pd.pivot_table(sales_data, values = 'Quantity', index = 'Country',
columns='StockCode', aggfunc='sum')
print(pivot_table)
StockCode 10002 10080 10120 10123C 10124A 10124G 10125 \
Country
Australia NaN NaN NaN NaN NaN NaN NaN
Austria NaN NaN NaN NaN NaN NaN NaN
Bahrain NaN NaN NaN NaN NaN NaN NaN
Belgium NaN NaN NaN NaN NaN NaN NaN
Brazil NaN NaN NaN NaN NaN NaN NaN
Canada NaN NaN NaN NaN NaN NaN NaN
Channel Islands NaN NaN NaN NaN NaN NaN NaN
Cyprus NaN NaN NaN NaN NaN NaN NaN
Czech Republic NaN NaN NaN NaN NaN NaN NaN
Denmark NaN NaN NaN NaN NaN NaN NaN
EIRE 12.0 NaN NaN NaN NaN NaN NaN
European Community NaN NaN NaN NaN NaN NaN NaN
Finland NaN NaN NaN NaN NaN NaN NaN
France 372.0 NaN 10.0 NaN NaN NaN 470.0
Germany 1.0 NaN NaN NaN NaN NaN 120.0
Greece NaN NaN NaN NaN NaN NaN NaN
Iceland NaN NaN NaN NaN NaN NaN NaN
Israel NaN NaN NaN NaN NaN NaN NaN
Italy NaN NaN NaN NaN NaN NaN NaN
Japan 1.0 NaN NaN NaN NaN NaN NaN
Lebanon NaN NaN NaN NaN NaN NaN NaN
Lithuania NaN NaN NaN NaN NaN NaN NaN
Malta NaN NaN NaN NaN NaN NaN NaN
Netherlands NaN NaN NaN NaN NaN NaN NaN
Norway NaN NaN NaN NaN NaN NaN NaN
Poland NaN NaN NaN NaN NaN NaN NaN
Portugal NaN NaN NaN NaN NaN NaN NaN
RSA NaN NaN NaN NaN NaN NaN NaN
Saudi Arabia NaN NaN NaN NaN NaN NaN NaN
Singapore NaN NaN NaN NaN NaN NaN NaN
Spain 24.0 NaN NaN NaN NaN NaN NaN
Sweden NaN NaN NaN NaN NaN NaN NaN
Switzerland 12.0 NaN NaN NaN NaN NaN 20.0
USA NaN NaN NaN NaN NaN NaN NaN
United Arab Emirates NaN NaN NaN NaN NaN NaN NaN
United Kingdom 401.0 291.0 183.0 5.0 16.0 17.0 616.0
Unspecified NaN NaN NaN NaN NaN NaN NaN
StockCode 10133 10135 11001 ... 90214Y 90214Z \
Country ...
Australia NaN NaN NaN ... NaN NaN
Austria NaN NaN NaN ... NaN NaN
Bahrain NaN NaN NaN ... NaN NaN
Belgium NaN NaN NaN ... NaN NaN
Brazil NaN NaN NaN ... NaN NaN
Canada 40.0 20.0 NaN ... NaN NaN
Channel Islands NaN NaN NaN ... NaN NaN
Cyprus NaN NaN NaN ... NaN NaN
Czech Republic NaN NaN NaN ... NaN NaN
Denmark NaN NaN NaN ... NaN NaN
EIRE 40.0 60.0 48.0 ... NaN NaN
European Community NaN NaN NaN ... NaN NaN
Finland NaN NaN NaN ... NaN NaN
France NaN 10.0 16.0 ... NaN NaN
Germany NaN 200.0 32.0 ... NaN NaN
Greece NaN NaN NaN ... NaN NaN
Iceland NaN NaN NaN ... NaN NaN
Israel 20.0 NaN NaN ... NaN NaN
Italy NaN NaN NaN ... NaN NaN
Japan NaN NaN NaN ... NaN NaN
Lebanon NaN NaN NaN ... NaN NaN
Lithuania NaN NaN NaN ... NaN NaN
Malta NaN NaN NaN ... NaN NaN
Netherlands NaN NaN NaN ... NaN NaN
Norway NaN 10.0 16.0 ... NaN NaN
Poland NaN NaN NaN ... NaN NaN
Portugal 20.0 NaN 16.0 ... NaN NaN
RSA NaN NaN NaN ... NaN NaN
Saudi Arabia NaN NaN NaN ... NaN NaN
Singapore NaN NaN NaN ... NaN NaN
Spain NaN NaN 16.0 ... NaN NaN
Sweden NaN NaN NaN ... NaN NaN
Switzerland 10.0 NaN NaN ... NaN NaN
USA NaN NaN NaN ... NaN NaN
United Arab Emirates NaN NaN NaN ... NaN NaN
United Kingdom 2244.0 1637.0 923.0 ... 61.0 12.0
Unspecified NaN NaN NaN ... NaN NaN
StockCode BANK CHARGES C2 CRUK D DOT M PADS \
Country
Australia NaN NaN NaN NaN NaN NaN NaN
Austria NaN NaN NaN NaN NaN NaN NaN
Bahrain NaN NaN NaN NaN NaN NaN NaN
Belgium NaN NaN NaN NaN NaN NaN NaN
Brazil NaN NaN NaN NaN NaN NaN NaN
Canada NaN NaN NaN NaN NaN NaN NaN
Channel Islands NaN 3.0 NaN NaN NaN 0.0 NaN
Cyprus NaN NaN NaN NaN NaN 0.0 NaN
Czech Republic NaN NaN NaN NaN NaN NaN NaN
Denmark NaN NaN NaN NaN NaN NaN NaN
EIRE NaN 97.0 NaN -1.0 NaN -1.0 NaN
European Community NaN NaN NaN NaN NaN NaN NaN
Finland NaN NaN NaN NaN NaN 2.0 NaN
France NaN 3.0 NaN NaN NaN -16.0 NaN
Germany NaN NaN NaN NaN NaN 2.0 NaN
Greece NaN NaN NaN NaN NaN NaN NaN
Iceland NaN NaN NaN NaN NaN NaN NaN
Israel NaN NaN NaN NaN NaN NaN NaN
Italy NaN NaN NaN -1.0 NaN -1.0 NaN
Japan NaN NaN NaN NaN NaN -3.0 NaN
Lebanon NaN NaN NaN NaN NaN NaN NaN
Lithuania NaN NaN NaN NaN NaN NaN NaN
Malta NaN NaN NaN NaN NaN NaN NaN
Netherlands NaN NaN NaN -1.0 NaN -480.0 NaN
Norway NaN NaN NaN NaN NaN 0.0 NaN
Poland NaN NaN NaN NaN NaN NaN NaN
Portugal NaN NaN NaN NaN NaN 1.0 NaN
RSA NaN NaN NaN NaN NaN 1.0 NaN
Saudi Arabia NaN NaN NaN NaN NaN NaN NaN
Singapore NaN NaN NaN NaN NaN 0.0 NaN
Spain NaN NaN NaN NaN NaN -3.0 NaN
Sweden NaN NaN NaN NaN NaN 1.0 NaN
Switzerland NaN NaN NaN NaN NaN NaN NaN
USA NaN NaN NaN NaN NaN NaN NaN
United Arab Emirates NaN NaN NaN NaN NaN NaN NaN
United Kingdom 12.0 30.0 -16.0 -1191.0 16.0 3681.0 4.0
Unspecified NaN NaN NaN NaN NaN NaN NaN
StockCode POST
Country
Australia 0.0
Austria 37.0
Bahrain NaN
Belgium 272.0
Brazil NaN
Canada 1.0
Channel Islands NaN
Cyprus 1.0
Czech Republic 0.0
Denmark 41.0
EIRE NaN
European Community 9.0
Finland 89.0
France 803.0
Germany 1104.0
Greece 6.0
Iceland NaN
Israel NaN
Italy 44.0
Japan NaN
Lebanon NaN
Lithuania NaN
Malta 10.0
Netherlands 99.0
Norway 57.0
Poland 9.0
Portugal 97.0
RSA NaN
Saudi Arabia NaN
Singapore NaN
Spain 209.0
Sweden 37.0
Switzerland 97.0
USA NaN
United Arab Emirates 1.0
United Kingdom -21.0
Unspecified NaN
[37 rows x 3684 columns]
Añadiendo una nueva columna
pivot_table = pd.pivot_table(sales_data, values = 'Quantity', index = 'Country',
columns=['StockCode', 'UnitPrice'], aggfunc='sum')
print(pivot_table)
Reshape
df = pd.DataFrame({
'A': ['foo', 'bar', 'baz'],
'B': [1, 2, 3],
'C': [4, 5, 6]
})
print(df)
A B C
0 foo 1 4
1 bar 2 5
2 baz 3 6
df_stacked = df.stack()
print(df_stacked)
0 A foo
B 1
C 4
1 A bar
B 2
C 5
2 A baz
B 3
C 6
dtype: object
df_unstacked = df_stacked.unstack()
print(df_unstacked)
Fusión de DataFrames
Merge, join, concatenate and compare
df1 = pd.DataFrame({
'key': ['A', 'B', 'C'],
'value1': [1, 2, 3]
})
df2 = pd.DataFrame({
'key': ['B', 'C', 'D'],
'value2': [4, 5, 6]
})
Merge basado en las llaves
Solo se agregan aquellas que tengan igual llave
inner_merged = pd.merge(df1, df2, on='key', how='inner')
print(inner_merged)
key value1 value2
0 B 2 4
1 C 3 5
Ahora vamos a hacer un merge completo
outer_merged = pd.merge(df1, df2, on='key', how='outer')
print(outer_merged)
key value1 value2
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0
Merge a izquierda y derecha
El join izquierdo incluye todas las filas del DataFrame de la izquierda, y solo las filas coincidentes del DataFrame de la derecha. Si no hay coincidencia, se llenan con NaN.
El join derecho incluye todas las filas del DataFrame de la derecha y solamente las filas coincidentes del DataFrame de la izquierda. Al igual que en el join izquierdo, las filas sin coincidencia se llenan con NaN.
Ambos métodos permiten combinar información de diferentes fuentes, pero cada uno tiene un enfoque distinto en cuanto a qué filas se mantienen.
left_merged = pd.merge(df1, df2, on='key', how='left')
print(left_merged)
key value1 value2
0 A 1 NaN
1 B 2 4.0
2 C 3 5.0
right_merged = pd.merge(df1, df2, on='key', how='right')
print(right_merged)
key value1 value2
0 B 2.0 4
1 C 3.0 5
2 D NaN 6
Concatenando
df3 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df4 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
Verticalmente
vertical_concat = pd.concat([df3, df4], axis=0)
print(vertical_concat)
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
Horizontal
horizontal_concat = pd.concat([df3, df4], axis=1)
print(horizontal_concat)
A B A B
0 A0 B0 A3 B3
1 A1 B1 A4 B4
2 A2 B2 A5 B5
Manejo de Series Temporales
El formateo de fechas es una parte esencial de la visualización de datos, especialmente cuando se trabaja con series temporales.
Matplotlib, una de las bibliotecas más populares para la creación de gráficos en Python, ofrece herramientas flexibles para personalizar cómo se muestran las fechas en los gráficos.
Utilizando DateFormatter
y los códigos de formato de strftime
, puedes adaptar la presentación de las fechas para que se ajusten a las necesidades específicas de tu análisis o presentación.
A continuación, exploramos cómo se utiliza el formato ‘%b %Y’ y otras opciones comunes para formatear fechas.
[Series-strftime]
¿Qué hace ‘%b %Y’?
%b
: Representa el mes abreviado en texto (por ejemplo, ‘Jan’, ‘Feb’, ‘Mar’, etc.). Este formato es útil cuando se quiere mostrar el mes de manera compacta.%Y
: Representa el año en cuatro dígitos (por ejemplo, ‘2023’, ‘2024’). Esto proporciona una indicación clara y completa del año. Cuando se utiliza ‘%b %Y’ en el DateFormatter, cada etiqueta del eje x se formatea como Mes Año. Por ejemplo, una fecha que corresponde a enero de 2023 se mostrará como ‘Jan 2023’.
Otras Formas de Dar Formato a Fechas con %
:
Matplotlib, a través de DateFormatter
, permite una amplia personalización de cómo se muestran las fechas y horas utilizando códigos de formato de fecha y hora basados en el estándar de strftime
. Aquí algunos ejemplos comunes:
%d
: Día del mes como número decimal (por ejemplo, ‘01’, ‘15’).%B
: Nombre completo del mes (por ejemplo, ‘January’, ‘February’).%m
: Mes como número decimal con ceros a la izquierda (por ejemplo, ‘01’ para enero, ‘12’ para diciembre).%y
: Año como número de dos dígitos (por ejemplo, ‘23’ para 2023).%H
: Hora en formato de 24 horas (por ejemplo, ‘14’ para las 2 PM).%I
: Hora en formato de 12 horas (por ejemplo, ‘02’ para las 2 PM).%p
: AM o PM.%M
: Minuto con ceros a la izquierda (por ejemplo, ‘05’).%S
: Segundo con ceros a la izquierda (por ejemplo, ‘09’).%A
: Nombre completo del día de la semana (por ejemplo, ‘Monday’, ‘Tuesday’).%a
: Nombre abreviado del día de la semana (por ejemplo, ‘Mon’, ‘Tue’).
Al comprender y utilizar estos formatos, puedes mejorar la claridad y la estética de tus visualizaciones, asegurando que la información temporal se comunique de manera efectiva.
Creando Serie Temporal
data = {
'Timestamp': ['2024-07-22 10:00:00',
'2024-07-22 11:00:00',
'2024-07-22 12:00:00'],
'Value': [100, 150, 200]
}
df = pd.DataFrame(data)
print(df)
Timestamp Value
0 2024-07-22 10:00:00 100
1 2024-07-22 11:00:00 150
2 2024-07-22 12:00:00 200
Cargando
# Leer el dataset de ventas desde un archivo CSV
file_path = "online_retail.csv"
sales_data = pd.read_csv(file_path)
# Convertir la columna 'InvoiceDate' a tipo datetime
sales_data['InvoiceDate'] = pd.to_datetime(sales_data['InvoiceDate'])
# Eliminar filas con valores faltantes en las columnas críticas
sales_data.dropna(subset=['CustomerID', 'InvoiceDate'], inplace=True)
# Crear una nueva columna 'TotalPrice'
sales_data['TotalPrice'] = sales_data['Quantity'] * sales_data['UnitPrice']
print(sales_data.head())
Convirtiendo a datetime
#Convertir la columna a tipo datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
#Establecer Timestamp como índice
df.set_index('Timestamp', inplace= True)
print(df)
Value
Timestamp
2024-07-22 10:00:00 100
2024-07-22 11:00:00 150
2024-07-22 12:00:00 200
Estableciendo a datetime como index
sales_data['InvoiceDate'] = pd.to_datetime(sales_data['InvoiceDate'])
#Establecer InvoiceDate como índice
sales_data.set_index('InvoiceDate', inplace=True)
print(sales_data.head())
InvoiceNo StockCode Description \
InvoiceDate
2010-12-01 08:26:00 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER
2010-12-01 08:26:00 536365 71053 WHITE METAL LANTERN
2010-12-01 08:26:00 536365 84406B CREAM CUPID HEARTS COAT HANGER
2010-12-01 08:26:00 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE
2010-12-01 08:26:00 536365 84029E RED WOOLLY HOTTIE WHITE HEART.
Quantity UnitPrice CustomerID Country \
InvoiceDate
2010-12-01 08:26:00 6 2.55 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
2010-12-01 08:26:00 8 2.75 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
TotalPrice
InvoiceDate
2010-12-01 08:26:00 15.30
2010-12-01 08:26:00 20.34
2010-12-01 08:26:00 22.00
2010-12-01 08:26:00 20.34
2010-12-01 08:26:00 20.34
Crear columnas con mes, día y año
sales_data['Year'] = sales_data.index.year
sales_data['Month'] = sales_data.index.month
sales_data['Day'] = sales_data.index.day
print(sales_data)
InvoiceNo StockCode Description \
InvoiceDate
2010-12-01 08:26:00 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER
2010-12-01 08:26:00 536365 71053 WHITE METAL LANTERN
2010-12-01 08:26:00 536365 84406B CREAM CUPID HEARTS COAT HANGER
2010-12-01 08:26:00 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE
2010-12-01 08:26:00 536365 84029E RED WOOLLY HOTTIE WHITE HEART.
... ... ... ...
2011-12-09 12:50:00 581587 22613 PACK OF 20 SPACEBOY NAPKINS
2011-12-09 12:50:00 581587 22899 CHILDREN'S APRON DOLLY GIRL
2011-12-09 12:50:00 581587 23254 CHILDRENS CUTLERY DOLLY GIRL
2011-12-09 12:50:00 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE
2011-12-09 12:50:00 581587 22138 BAKING SET 9 PIECE RETROSPOT
Quantity UnitPrice CustomerID Country \
InvoiceDate
2010-12-01 08:26:00 6 2.55 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
2010-12-01 08:26:00 8 2.75 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
2010-12-01 08:26:00 6 3.39 17850.0 United Kingdom
... ... ... ... ...
2011-12-09 12:50:00 12 0.85 12680.0 France
2011-12-09 12:50:00 6 2.10 12680.0 France
2011-12-09 12:50:00 4 4.15 12680.0 France
2011-12-09 12:50:00 4 4.15 12680.0 France
2011-12-09 12:50:00 3 4.95 12680.0 France
TotalPrice Year Month Day
InvoiceDate
2010-12-01 08:26:00 15.30 2010 12 1
2010-12-01 08:26:00 20.34 2010 12 1
2010-12-01 08:26:00 22.00 2010 12 1
2010-12-01 08:26:00 20.34 2010 12 1
2010-12-01 08:26:00 20.34 2010 12 1
... ... ... ... ...
2011-12-09 12:50:00 10.20 2011 12 9
2011-12-09 12:50:00 12.60 2011 12 9
2011-12-09 12:50:00 16.60 2011 12 9
2011-12-09 12:50:00 16.60 2011 12 9
2011-12-09 12:50:00 14.85 2011 12 9
[406829 rows x 11 columns]
Aplicando resample
monthly_sales = sales_data['Quantity'].resample('ME').sum()
print(monthly_sales)
InvoiceDate
2010-12-31 296362
2011-01-31 269379
2011-02-28 262833
2011-03-31 344012
2011-04-30 278585
2011-05-31 367852
2011-06-30 356922
2011-07-31 363418
2011-08-31 386612
2011-09-30 537496
2011-10-31 569666
2011-11-30 669915
2011-12-31 203836
Freq: ME, Name: Quantity, dtype: int64
Obtienendo información luego de resample
max_sales_month = monthly_sales.idxmax()
max_sales_value = monthly_sales.max()
print(f"El mes con mayores ventas es {max_sales_month.strftime('%B %Y')} con un total de {max_sales_value} unidades vendidas.")
El mes con mayores ventas es November 2011 con un total de 669915 unidades vendidas.