Pandas Tutorial

Tutorial, Online, 2024

En este tutorial vamos trabajar con el paquete Pandas de Python. Pandas es ampliamente empleado en toda la temática de aprendizaje automático como por científicos de datos.

Pandas

Pandas documentación oficial, además podes acceder a tutoriales introductorios.

Estructuras básicas

  • Dataframes
  • Series

Datasets Online

Kaggle: Your Machine Learning and Data Science Community UCI Machine Learning Repository

Code Example

import pandas as pd

path = 'online_retail.csv'
retail_data = pd.read_csv(path)

#Excel
#data_excel = pd.read_excel(path)

#JSON
#data_json = pd.read_json(path)

#print(type(retail_data))
print(retail_data)

Salida al ejecutar el código anterior.

       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

                InvoiceDate  UnitPrice  CustomerID         Country  
0       2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2       2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
...                     ...        ...         ...             ...  
541904  2011-12-09 12:50:00       0.85     12680.0          France  
541905  2011-12-09 12:50:00       2.10     12680.0          France  
541906  2011-12-09 12:50:00       4.15     12680.0          France  
541907  2011-12-09 12:50:00       4.15     12680.0          France  
541908  2011-12-09 12:50:00       4.95     12680.0          France  

[541909 rows x 8 columns]

Dataframes

Un dataframe es una estructura bidimensional similar a una tabla, donde puedes almacenar datos de diferentes tipos. Se pueden leer datos desde un archivo CSV, crear dataframes a partir de arrays, listas y diccionarios. También se puede descomponer dataframes en series.

Pandas DataFrame

Creando desde un array

import numpy as np

data = np.array([[1,2,3], [4,5,6], [7,8,9]])

dt_from_array = pd.DataFrame(data, columns=['A','B', 'c'])
print(dt_from_array)
   A  B  c
0  1  2  3
1  4  5  6
2  7  8  9

Creando desde una lista

data = [[1,'Jhon', 22], [2,'Anna', 24]]

df_from_list = pd.DataFrame(data, columns = ['ID', 'Name', 'Age'])
print(df_from_list)
   ID  Name  Age
0   1  Jhon   22
1   2  Anna   24

Creando desde una diccionario

data = [{'ID': 1,
         'Name': 'Jhon',
         'Age' : 22}]
dt_from_dict_list = pd.DataFrame(data)
print(dt_from_dict_list )
   ID  Name  Age
0   1  Jhon   22

Creando usando Pandas Series

Una serie en Pandas es una estructura de datos unidimensional que puede contener datos de diferentes tipos (enteros, cadenas, fechas, etc.). Es similar a una columna en un DataFrame, y cada serie tiene un índice que permite acceder a sus elementos de manera eficiente. Las series son útiles para manipular y analizar datos, y se pueden crear a partir de listas, diccionarios o arrays de NumPy.

data = {'ID': pd.Series([1,2,3]),
        'Name': pd.Series(['Jhon', 'Anna', 'Mike']),
        'Age': pd.Series([22,24,21])}

df_from_series_dict = pd.DataFrame(data)
print(df_from_series_dict)
   ID  Name  Age
0   1  Jhon   22
1   2  Anna   24
2   3  Mike   21

Ayuda por si empleamos Google Colab y Google Drive

from google.colab import drive

drive.mount('/content/drive')
file_path = "/content/drive/My Drive/online_retail.csv"

sales_data = pd.read_csv(file_path)
print(sales_data.head())

Estructuras de Datos y Funciones

06_calculate_statistics

Obteniendo el nombre de las columnas,

columns_names = retail_data.columns
print(columns_names)
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

Obteniendo dimensión del data frame

num_rows, num_columns = retail_data.shape
print('Número de filas', num_rows)
print('Número de columnas', num_columns)
Número de filas 541909
Número de columnas 8

Obtener una columna específica

daily_sales = retail_data['Quantity']
print(daily_sales)
print(daily_sales[2])
0          6
1          6
2          8
3          6
4          6
          ..
541904    12
541905     6
541906     4
541907     4
541908     3
Name: Quantity, Length: 541909, dtype: int64
8

Información estadística del DataFrame

summary = retail_data.describe()
print(summary)
            Quantity      UnitPrice     CustomerID
count  541909.000000  541909.000000  406829.000000
mean        9.552250       4.611114   15287.690570
std       218.081158      96.759853    1713.600303
min    -80995.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13953.000000
50%         3.000000       2.080000   15152.000000
75%        10.000000       4.130000   16791.000000
max     80995.000000   38970.000000   18287.000000

Información estadística de la columana seleccionada

daily_sales = retail_data['Quantity']

mean_value = daily_sales.mean()
print('La media es', mean_value)

median_value = daily_sales.median()
print('La mediana es', mean_value)

sum_value = daily_sales.sum()
print('La suma total es', mean_value)

count_values = daily_sales.count()
print(count_values)
La media es 9.55224954743324
La mediana es 9.55224954743324
La suma total es 9.55224954743324
541909

Creando un Pandas Series

daily_sales = pd.Series([10, 20, None, 40, 50])

# Suma de Quantity
total_sum = daily_sales.sum()
print("Suma de Quantity:", total_sum)

# Conteo de Quantity
count_values = daily_sales.count()
print("Conteo de Quantity:", count_values)
Suma de Quantity: 120.0
Conteo de Quantity: 4

Uso de iloc y loc

Iloc

Extra la información del dataframe especificando el índice.

first_row = retail_data.iloc[0]
print(first_row)
InvoiceNo                                  536365
StockCode                                  85123A
Description    WHITE HANGING HEART T-LIGHT HOLDER
Quantity                                        6
InvoiceDate                   2010-12-01 08:26:00
UnitPrice                                    2.55
CustomerID                                17850.0
Country                            United Kingdom
Name: 0, dtype: object
first_five_row = retail_data.iloc[6:8]
print(first_five_row)
  InvoiceNo StockCode                        Description  Quantity  \
6    536365     21730  GLASS STAR FROSTED T-LIGHT HOLDER         6   
7    536366     22633             HAND WARMER UNION JACK         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
6  2010-12-01 08:26:00       4.25     17850.0  United Kingdom  
7  2010-12-01 08:28:00       1.85     17850.0  United Kingdom  

Podemos emplear todas las formas de slicing, como si fuera un array de Numpy.

subset = retail_data.iloc[:3, :2]
print(subset)
  InvoiceNo StockCode
0    536365    85123A
1    536365     71053
2    536365    84406B

Loc

Accede a la información del dataframe pero especificando la etiqueta.

row_index_3 = retail_data.loc[3]
print(row_index_3)
InvoiceNo                                   536365
StockCode                                   84029G
Description    KNITTED UNION FLAG HOT WATER BOTTLE
Quantity                                         6
InvoiceDate                    2010-12-01 08:26:00
UnitPrice                                     3.39
CustomerID                                 17850.0
Country                             United Kingdom
Name: 3, dtype: object
row_index_0_to_4 = retail_data.loc[0:4]
print(row_index_0_to_4)
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
quantity_column = retail_data.loc[:,'Quantity']
print(quantity_column)
0          6
1          6
2          8
3          6
4          6
          ..
541904    12
541905     6
541906     4
541907     4
541908     3
Name: Quantity, Length: 541909, dtype: int64
quantity_unitprices_column = retail_data.loc[:,['Quantity','UnitPrice']]
print(quantity_unitprices_column)
        Quantity  UnitPrice
0              6       2.55
1              6       3.39
2              8       2.75
3              6       3.39
4              6       3.39
...          ...        ...
541904        12       0.85
541905         6       2.10
541906         4       4.15
541907         4       4.15
541908         3       4.95

[541909 rows x 2 columns]

Manejo de Datos Faltantes

Documentación en Pandas para el manejo de datos faltantes

Buscamos datos faltantes

missing_data = retail_data.isna()
print(missing_data)

De esta forma es díficil visualizarlos, entonces hacemos

missing_data_count = retail_data.isna().sum()
print('Conteo de datos faltantes por columna:\n',missing_data_count )
Conteo de datos faltantes por columna:
 InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Podemos observar los datos que no tiene la información. Ahora podemos eliminar todas las filas que no tienen datos o llenar estos valores con algún tipo de operación.

Borrando las filas

no_missing_rows = retail_data.dropna()
print('Datos sin filas con valores faltantes:\n', no_missing_rows)
Datos sin filas con valores faltantes:
        InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

                InvoiceDate  UnitPrice  CustomerID         Country  
0       2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2       2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4       2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
...                     ...        ...         ...             ...  
541904  2011-12-09 12:50:00       0.85     12680.0          France  
541905  2011-12-09 12:50:00       2.10     12680.0          France  
541906  2011-12-09 12:50:00       4.15     12680.0          France  
541907  2011-12-09 12:50:00       4.15     12680.0          France  
541908  2011-12-09 12:50:00       4.95     12680.0          France  

[406829 rows x 8 columns]

Borrando las columnas

no_missing_columns = retail_data.dropna(axis=1)
print('Datos sin columnas con valores faltantes:\n', no_missing_columns)
DAtos sin columnas con valores faltantes:
        InvoiceNo StockCode  Quantity          InvoiceDate  UnitPrice  \
0         536365    85123A         6  2010-12-01 08:26:00       2.55   
1         536365     71053         6  2010-12-01 08:26:00       3.39   
2         536365    84406B         8  2010-12-01 08:26:00       2.75   
3         536365    84029G         6  2010-12-01 08:26:00       3.39   
4         536365    84029E         6  2010-12-01 08:26:00       3.39   
...          ...       ...       ...                  ...        ...   
541904    581587     22613        12  2011-12-09 12:50:00       0.85   
541905    581587     22899         6  2011-12-09 12:50:00       2.10   
541906    581587     23254         4  2011-12-09 12:50:00       4.15   
541907    581587     23255         4  2011-12-09 12:50:00       4.15   
541908    581587     22138         3  2011-12-09 12:50:00       4.95   

               Country  
0       United Kingdom  
1       United Kingdom  
2       United Kingdom  
3       United Kingdom  
4       United Kingdom  
...                ...  
541904          France  
541905          France  
541906          France  
541907          France  
541908          France  

[541909 rows x 6 columns]

Rellenando datos faltantes con ceros

retail_data_filled_zeros = retail_data.fillna(0)
retail_data_filled_zeros_count = retail_data_filled_zeros.isna().sum()
print(retail_data_filled_zeros_count)
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

Rellenando datos faltantes con operaciones matemáticas

mean_unit_price = retail_data['UnitPrice'].mean()
retail_data_filled_mean = retail_data['UnitPrice'].fillna(mean_unit_price)
print(retail_data_filled_mean)
0         2.55
1         3.39
2         2.75
3         3.39
4         3.39
          ... 
541904    0.85
541905    2.10
541906    4.15
541907    4.15
541908    4.95
Name: UnitPrice, Length: 541909, dtype: float64

Creación y Manipulación de Columnas

Como usar apply en DataFrame y Series

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print(df.head())
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
0  2010-12-01 08:26:00       2.55     17850.0  United Kingdom       15.30  
1  2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
2  2010-12-01 08:26:00       2.75     17850.0  United Kingdom       22.00  
3  2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
4  2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
df['HighValue'] = df['TotalPrice'] > 16
print(df['HighValue'].head(10))
0    False
1     True
2     True
3     True
4     True
5    False
6     True
7    False
8    False
9     True
Name: HighValue, dtype: bool
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
 8   TotalPrice   541909 non-null  float64
 9   HighValue    541909 non-null  bool   
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 37.7+ MB
None

Cambiando a tipo de dato Fecha

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
 8   TotalPrice   541909 non-null  float64       
 9   HighValue    541909 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 37.7+ MB
None

Aplicando una función lambda

df['DiscountedPrice'] = df['UnitPrice'].apply(lambda x: x* 0.9)
print(df.head(3))
  InvoiceNo StockCode                         Description  Quantity  \
0    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                 WHITE METAL LANTERN         6   
2    536365    84406B      CREAM CUPID HEARTS COAT HANGER         8   

          InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  \
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom       15.30   
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34   
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom       22.00   

   HighValue  DiscountedPrice  
0      False            2.295  
1       True            3.051  
2       True            2.475  

Aplicando una función específica

def categorize_price(price):
  if price > 50:
    return 'High'
  elif price < 20:
    return 'Medium'
  else:
    return 'Low'

df['PriceCategory'] = df['UnitPrice'].apply(categorize_price)
print(df.head(10))
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
5    536365     22752         SET 7 BABUSHKA NESTING BOXES         2   
6    536365     21730    GLASS STAR FROSTED T-LIGHT HOLDER         6   
7    536366     22633               HAND WARMER UNION JACK         6   
8    536366     22632            HAND WARMER RED POLKA DOT         6   
9    536367     84879        ASSORTED COLOUR BIRD ORNAMENT        32   

          InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  \
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom       15.30   
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34   
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom       22.00   
3 2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34   
4 2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34   
5 2010-12-01 08:26:00       7.65     17850.0  United Kingdom       15.30   
6 2010-12-01 08:26:00       4.25     17850.0  United Kingdom       25.50   
7 2010-12-01 08:28:00       1.85     17850.0  United Kingdom       11.10   
8 2010-12-01 08:28:00       1.85     17850.0  United Kingdom       11.10   
9 2010-12-01 08:34:00       1.69     13047.0  United Kingdom       54.08   

   HighValue  DiscountedPrice PriceCategory  
0      False            2.295        Medium  
1       True            3.051        Medium  
2       True            2.475        Medium  
3       True            3.051        Medium  
4       True            3.051        Medium  
5      False            6.885        Medium  
6       True            3.825        Medium  
7      False            1.665        Medium  
8      False            1.665        Medium  
9       True            1.521        Medium  

Agrupaciones con groupby

Info groupby

country_count = df['Country'].value_counts()
print(country_count)
Country
United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon                     45
Lithuania                   35
Brazil                      32
Czech Republic              30
Bahrain                     19
Saudi Arabia                10
Name: count, dtype: int64

Agrupando tomando en cuenta una columna

country_group = df.groupby('Country')['Quantity'].sum()
print(country_group)
Country
Australia                 83653
Austria                    4827
Bahrain                     260
Belgium                   23152
Brazil                      356
Canada                     2763
Channel Islands            9479
Cyprus                     6317
Czech Republic              592
Denmark                    8188
EIRE                     142637
European Community          497
Finland                   10666
France                   110480
Germany                  117448
Greece                     1556
Hong Kong                  4769
Iceland                    2458
Israel                     4353
Italy                      7999
Japan                     25218
Lebanon                     386
Lithuania                   652
Malta                       944
Netherlands              200128
Norway                    19247
Poland                     3653
Portugal                  16180
RSA                         352
Saudi Arabia                 75
Singapore                  5234
Spain                     26824
Sweden                    35637
Switzerland               30325
USA                        1034
United Arab Emirates        982
United Kingdom          4263829
Unspecified                3300
Name: Quantity, dtype: int64

Agrupando para obtener alguna estadística

country_stats = df.groupby('Country')['UnitPrice'].agg(['mean', 'sum'])
print(country_stats)
                            mean          sum
Country                                      
Australia               3.220612     4054.750
Austria                 4.243192     1701.520
Bahrain                 4.556316       86.570
Belgium                 3.644335     7540.130
Brazil                  4.456250      142.600
Canada                  6.030331      910.580
Channel Islands         4.932124     3738.550
Cyprus                  6.302363     3920.070
Czech Republic          2.938333       88.150
Denmark                 3.256941     1266.950
EIRE                    5.911077    48447.190
European Community      4.820492      294.050
Finland                 5.448705     3786.850
France                  5.028864    43031.990
Germany                 3.966930    37666.000
Greece                  4.885548      713.290
Hong Kong              42.505208    12241.500
Iceland                 2.644011      481.210
Israel                  3.633131     1079.040
Italy                   4.831121     3879.390
Japan                   2.276145      814.860
Lebanon                 5.387556      242.440
Lithuania               2.841143       99.440
Malta                   5.244173      666.010
Netherlands             2.738317     6492.550
Norway                  6.012026     6529.060
Poland                  4.170880     1422.270
Portugal                8.582976    13037.540
RSA                     4.277586      248.100
Saudi Arabia            2.411000       24.110
Singapore             109.645808    25108.890
Spain                   4.987544    12633.450
Sweden                  3.910887     1806.830
Switzerland             3.403442     6813.690
USA                     2.216426      644.980
United Arab Emirates    3.380735      229.890
United Kingdom          4.532422  2245715.474
Unspecified             2.699574     1204.010

Agrupaciones múltiples

country_stock_group = df.groupby(['Country', 'StockCode'])['Quantity'].sum()
print(country_stock_group)
Country      StockCode
Australia    15036        600
             15056BL        3
             16161P       400
             16169E        25
             20665          6
                         ... 
Unspecified  85180A         2
             85180B         1
             85212         12
             85213         12
             85227         10
Name: Quantity, Length: 19839, dtype: int64

Agrupando y aplicando una función específica

def total_revenue(group):
  return (group['Quantity'] * group['UnitPrice']).sum()

revenue_per_country = df.groupby('Country').apply(total_revenue)
print(revenue_per_country)
Country
Australia                137077.270
Austria                   10154.320
Bahrain                     548.400
Belgium                   40910.960
Brazil                     1143.600
Canada                     3666.380
Channel Islands           20086.290
Cyprus                    12946.290
Czech Republic              707.720
Denmark                   18768.140
EIRE                     263276.820
European Community         1291.750
Finland                   22326.740
France                   197403.900
Germany                  221698.210
Greece                     4710.520
Hong Kong                 10117.040
Iceland                    4310.000
Israel                     7907.820
Italy                     16890.510
Japan                     35340.620
Lebanon                    1693.880
Lithuania                  1661.060
Malta                      2505.470
Netherlands              284661.540
Norway                    35163.460
Poland                     7213.140
Portugal                  29367.020
RSA                        1002.310
Saudi Arabia                131.170
Singapore                  9120.390
Spain                     54774.580
Sweden                    36595.910
Switzerland               56385.350
USA                        1730.920
United Arab Emirates       1902.280
United Kingdom          8187806.364
Unspecified                4749.790
dtype: float64

Ejercicio

most_revenue = revenue_per_country.sort_values(ascending=False, inplace=False).head(3)
print(most_revenue)
Country
United Kingdom    8187806.364
Netherlands        284661.540
EIRE               263276.820
dtype: float64
poor_revenue = revenue_per_country.sort_values(ascending=True, inplace=False).head(3)
print(poor_revenue)
Country
Saudi Arabia      131.17
Bahrain           548.40
Czech Republic    707.72
dtype: float64

Filtrado de datos con condicionales

El filtrado de datos es una técnica utilizada en el análisis de datos para extraer subconjuntos específicos de información de un conjunto de datos más grande. Permite enfocarse en registros que cumplen ciertas condiciones, como analizar ventas de un país específico o transacciones en un período concreto. En Pandas, esto se logra utilizando condiciones lógicas sobre las columnas del DataFrame, permitiendo así una mejor comprensión y visualización de los datos relevantes para el análisis.

Filtrando ventas por columna

#Filtrar ventas en el Reino Unido - UK
uk_sales = sales_data[sales_data['Country'] == 'United Kingdom']
print(uk_sales)
       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541889    581585     22466       FAIRY TALE COTTAGE NIGHT LIGHT        12   
541890    581586     22061  LARGE CAKE STAND  HANGING STRAWBERY         8   
541891    581586     23275     SET OF 3 HANGING OWLS OLLIE BEAK        24   
541892    581586     21217        RED RETROSPOT ROUND CAKE TINS        24   
541893    581586     20685                DOORMAT RED RETROSPOT        10   

               InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
0      2010-12-01 08:26:00       2.55     17850.0  United Kingdom       15.30  
1      2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
2      2010-12-01 08:26:00       2.75     17850.0  United Kingdom       22.00  
3      2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
4      2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
...                    ...        ...         ...             ...         ...  
541889 2011-12-09 12:31:00       1.95     15804.0  United Kingdom       23.40  
541890 2011-12-09 12:49:00       2.95     13113.0  United Kingdom       23.60  
541891 2011-12-09 12:49:00       1.25     13113.0  United Kingdom       30.00  
541892 2011-12-09 12:49:00       8.95     13113.0  United Kingdom      214.80  
541893 2011-12-09 12:49:00       7.08     13113.0  United Kingdom       70.80  

[361878 rows x 9 columns]

Filtrando respecto a un valor

high_quantity_sales = sales_data[sales_data['Quantity'] > 40]
print(high_quantity_sales)
       InvoiceNo StockCode                         Description  Quantity  \
31        536370     10002         INFLATABLE POLITICAL GLOBE         48   
46        536371     22086     PAPER CHAIN KIT 50'S CHRISTMAS         80   
82        536376     22114   HOT WATER BOTTLE TEA AND SYMPATHY        48   
83        536376     21733    RED HANGING HEART T-LIGHT HOLDER        64   
96        536378     21212     PACK OF 72 RETROSPOT CAKE CASES       120   
...          ...       ...                                 ...       ...   
541746    581571     23167      SMALL CERAMIC TOP STORAGE JAR         96   
541747    581571     21314       SMALL GLASS HEART TRINKET POT        48   
541751    581572     23328  SET 6 SCHOOL MILK BOTTLES IN CRATE        48   
541867    581584     20832    RED FLOCK LOVE HEART PHOTO FRAME        72   
541868    581584     85038     6 CHOCOLATE LOVE HEART T-LIGHTS        48   

               InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
31     2010-12-01 08:45:00       0.85     12583.0          France       40.80  
46     2010-12-01 09:00:00       2.55     13748.0  United Kingdom      204.00  
82     2010-12-01 09:32:00       3.45     15291.0  United Kingdom      165.60  
83     2010-12-01 09:32:00       2.55     15291.0  United Kingdom      163.20  
96     2010-12-01 09:37:00       0.42     14688.0  United Kingdom       50.40  
...                    ...        ...         ...             ...         ...  
541746 2011-12-09 12:00:00       0.69     15311.0  United Kingdom       66.24  
541747 2011-12-09 12:00:00       1.85     15311.0  United Kingdom       88.80  
541751 2011-12-09 12:08:00       3.39     16705.0  United Kingdom      162.72  
541867 2011-12-09 12:25:00       0.72     13777.0  United Kingdom       51.84  
541868 2011-12-09 12:25:00       1.85     13777.0  United Kingdom       88.80  

[18591 rows x 9 columns]

Varios filtros

uk_high_quantity_sales = sales_data[(sales_data['Country'] == 'United Kingdom') &
                                    (sales_data['Quantity'] > 300)]
print(uk_high_quantity_sales)
       InvoiceNo StockCode                     Description  Quantity  \
181       536387     22466   FAIRY TALE COTTAGE NIGHTLIGHT       432   
182       536387     21731   RED TOADSTOOL LED NIGHT LIGHT       432   
730       536437     17021          NAMASTE SWAGAT INCENSE       600   
870       536477     21137        BLACK RECORD COVER FRAME       480   
2364      536584    84029E  RED WOOLLY HOTTIE WHITE HEART.       384   
...          ...       ...                             ...       ...   
540060    581457     22189         CREAM HEART CARD HOLDER       324   
540061    581457     23543             WALL ART KEEP CALM        698   
540070    581458     22197                  POPCORN HOLDER      1500   
540071    581459     22197                  POPCORN HOLDER      1200   
540421    581483     23843     PAPER CRAFT , LITTLE BIRDIE     80995   

               InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
181    2010-12-01 09:58:00       1.45     16029.0  United Kingdom      626.40  
182    2010-12-01 09:58:00       1.25     16029.0  United Kingdom      540.00  
730    2010-12-01 12:12:00       0.24     13694.0  United Kingdom      144.00  
870    2010-12-01 12:27:00       3.39     16210.0  United Kingdom     1627.20  
2364   2010-12-01 16:22:00       2.95     13777.0  United Kingdom     1132.80  
...                    ...        ...         ...             ...         ...  
540060 2011-12-08 18:43:00       3.39     18102.0  United Kingdom     1098.36  
540061 2011-12-08 18:43:00       4.15     18102.0  United Kingdom     2896.70  
540070 2011-12-08 18:45:00       0.72     17949.0  United Kingdom     1080.00  
540071 2011-12-08 18:46:00       0.72     17949.0  United Kingdom      864.00  
540421 2011-12-09 09:15:00       2.08     16446.0  United Kingdom   168469.60  

[727 rows x 9 columns]

Filtrado de Fecha

sales_2011 = sales_data[sales_data['InvoiceDate'].dt.year == 2011]
print(sales_2011)
       InvoiceNo StockCode                      Description  Quantity  \
42481     539993     22386          JUMBO BAG PINK POLKADOT        10   
42482     539993     21499               BLUE POLKADOT WRAP        25   
42483     539993     21498              RED RETROSPOT WRAP         25   
42484     539993     22379         RECYCLING BAG RETROSPOT          5   
42485     539993     20718        RED RETROSPOT SHOPPER BAG        10   
...          ...       ...                              ...       ...   
541904    581587     22613      PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899     CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254    CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255  CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138    BAKING SET 9 PIECE RETROSPOT          3   

               InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
42481  2011-01-04 10:00:00       1.95     13313.0  United Kingdom       19.50  
42482  2011-01-04 10:00:00       0.42     13313.0  United Kingdom       10.50  
42483  2011-01-04 10:00:00       0.42     13313.0  United Kingdom       10.50  
42484  2011-01-04 10:00:00       2.10     13313.0  United Kingdom       10.50  
42485  2011-01-04 10:00:00       1.25     13313.0  United Kingdom       12.50  
...                    ...        ...         ...             ...         ...  
541904 2011-12-09 12:50:00       0.85     12680.0          France       10.20  
541905 2011-12-09 12:50:00       2.10     12680.0          France       12.60  
541906 2011-12-09 12:50:00       4.15     12680.0          France       16.60  
541907 2011-12-09 12:50:00       4.15     12680.0          France       16.60  
541908 2011-12-09 12:50:00       4.95     12680.0          France       14.85  

[379979 rows x 9 columns]
sales_dec_2010 = sales_data[(sales_data['InvoiceDate'].dt.year == 2010) &
                            (sales_data['InvoiceDate'].dt.month == 12)]
print(sales_dec_2010)
      InvoiceNo StockCode                          Description  Quantity  \
0        536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1        536365     71053                  WHITE METAL LANTERN         6   
2        536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3        536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4        536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...         ...       ...                                  ...       ...   
42448    539988     84380    SET OF 3 BUTTERFLY COOKIE CUTTERS         1   
42449    539988    84849D                HOT BATHS SOAP HOLDER         1   
42450    539988    84849B               FAIRY SOAP SOAP HOLDER         1   
42451    539988     22854          CREAM SWEETHEART EGG HOLDER         2   
42452    539988    47559B                  TEA TIME OVEN GLOVE         2   

              InvoiceDate  UnitPrice  CustomerID         Country  TotalPrice  
0     2010-12-01 08:26:00       2.55     17850.0  United Kingdom       15.30  
1     2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
2     2010-12-01 08:26:00       2.75     17850.0  United Kingdom       22.00  
3     2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
4     2010-12-01 08:26:00       3.39     17850.0  United Kingdom       20.34  
...                   ...        ...         ...             ...         ...  
42448 2010-12-23 16:06:00       1.25     18116.0  United Kingdom        1.25  
42449 2010-12-23 16:06:00       1.69     18116.0  United Kingdom        1.69  
42450 2010-12-23 16:06:00       1.69     18116.0  United Kingdom        1.69  
42451 2010-12-23 16:06:00       4.95     18116.0  United Kingdom        9.90  
42452 2010-12-23 16:06:00       1.25     18116.0  United Kingdom        2.50  

[26850 rows x 9 columns]

Restructuración de datos: Pivot y Reshape

Reshaping and pivot tables

La reestructuración de datos usando Pivot Tables en Pandas. Se pueden crear tablas dinámicas para resumir y reorganizar datos crudos en un DataFrame, permitiendo realizar cálculos estadísticos como sumas y promedios. Se pueden modificar índices usando columnas y cómo apilar y desapilar datos para una mejor visualización. El uso de estas herramientas permite descubrir patrones y obtener insights valiosos en el análisis de datos.

pivot_table = pd.pivot_table(sales_data, values = 'Quantity', index = 'Country',
                             columns='StockCode', aggfunc='sum')
print(pivot_table)
StockCode             10002  10080  10120  10123C  10124A  10124G  10125  \
Country                                                                    
Australia               NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Austria                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Bahrain                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Belgium                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Brazil                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Canada                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Channel Islands         NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Cyprus                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Czech Republic          NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Denmark                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
EIRE                   12.0    NaN    NaN     NaN     NaN     NaN    NaN   
European Community      NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Finland                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
France                372.0    NaN   10.0     NaN     NaN     NaN  470.0   
Germany                 1.0    NaN    NaN     NaN     NaN     NaN  120.0   
Greece                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Iceland                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Israel                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Italy                   NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Japan                   1.0    NaN    NaN     NaN     NaN     NaN    NaN   
Lebanon                 NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Lithuania               NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Malta                   NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Netherlands             NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Norway                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Poland                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Portugal                NaN    NaN    NaN     NaN     NaN     NaN    NaN   
RSA                     NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Saudi Arabia            NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Singapore               NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Spain                  24.0    NaN    NaN     NaN     NaN     NaN    NaN   
Sweden                  NaN    NaN    NaN     NaN     NaN     NaN    NaN   
Switzerland            12.0    NaN    NaN     NaN     NaN     NaN   20.0   
USA                     NaN    NaN    NaN     NaN     NaN     NaN    NaN   
United Arab Emirates    NaN    NaN    NaN     NaN     NaN     NaN    NaN   
United Kingdom        401.0  291.0  183.0     5.0    16.0    17.0  616.0   
Unspecified             NaN    NaN    NaN     NaN     NaN     NaN    NaN   

StockCode              10133   10135  11001  ...  90214Y  90214Z  \
Country                                      ...                   
Australia                NaN     NaN    NaN  ...     NaN     NaN   
Austria                  NaN     NaN    NaN  ...     NaN     NaN   
Bahrain                  NaN     NaN    NaN  ...     NaN     NaN   
Belgium                  NaN     NaN    NaN  ...     NaN     NaN   
Brazil                   NaN     NaN    NaN  ...     NaN     NaN   
Canada                  40.0    20.0    NaN  ...     NaN     NaN   
Channel Islands          NaN     NaN    NaN  ...     NaN     NaN   
Cyprus                   NaN     NaN    NaN  ...     NaN     NaN   
Czech Republic           NaN     NaN    NaN  ...     NaN     NaN   
Denmark                  NaN     NaN    NaN  ...     NaN     NaN   
EIRE                    40.0    60.0   48.0  ...     NaN     NaN   
European Community       NaN     NaN    NaN  ...     NaN     NaN   
Finland                  NaN     NaN    NaN  ...     NaN     NaN   
France                   NaN    10.0   16.0  ...     NaN     NaN   
Germany                  NaN   200.0   32.0  ...     NaN     NaN   
Greece                   NaN     NaN    NaN  ...     NaN     NaN   
Iceland                  NaN     NaN    NaN  ...     NaN     NaN   
Israel                  20.0     NaN    NaN  ...     NaN     NaN   
Italy                    NaN     NaN    NaN  ...     NaN     NaN   
Japan                    NaN     NaN    NaN  ...     NaN     NaN   
Lebanon                  NaN     NaN    NaN  ...     NaN     NaN   
Lithuania                NaN     NaN    NaN  ...     NaN     NaN   
Malta                    NaN     NaN    NaN  ...     NaN     NaN   
Netherlands              NaN     NaN    NaN  ...     NaN     NaN   
Norway                   NaN    10.0   16.0  ...     NaN     NaN   
Poland                   NaN     NaN    NaN  ...     NaN     NaN   
Portugal                20.0     NaN   16.0  ...     NaN     NaN   
RSA                      NaN     NaN    NaN  ...     NaN     NaN   
Saudi Arabia             NaN     NaN    NaN  ...     NaN     NaN   
Singapore                NaN     NaN    NaN  ...     NaN     NaN   
Spain                    NaN     NaN   16.0  ...     NaN     NaN   
Sweden                   NaN     NaN    NaN  ...     NaN     NaN   
Switzerland             10.0     NaN    NaN  ...     NaN     NaN   
USA                      NaN     NaN    NaN  ...     NaN     NaN   
United Arab Emirates     NaN     NaN    NaN  ...     NaN     NaN   
United Kingdom        2244.0  1637.0  923.0  ...    61.0    12.0   
Unspecified              NaN     NaN    NaN  ...     NaN     NaN   

StockCode             BANK CHARGES    C2  CRUK       D   DOT       M  PADS  \
Country                                                                      
Australia                      NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Austria                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Bahrain                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Belgium                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Brazil                         NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Canada                         NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Channel Islands                NaN   3.0   NaN     NaN   NaN     0.0   NaN   
Cyprus                         NaN   NaN   NaN     NaN   NaN     0.0   NaN   
Czech Republic                 NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Denmark                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
EIRE                           NaN  97.0   NaN    -1.0   NaN    -1.0   NaN   
European Community             NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Finland                        NaN   NaN   NaN     NaN   NaN     2.0   NaN   
France                         NaN   3.0   NaN     NaN   NaN   -16.0   NaN   
Germany                        NaN   NaN   NaN     NaN   NaN     2.0   NaN   
Greece                         NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Iceland                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Israel                         NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Italy                          NaN   NaN   NaN    -1.0   NaN    -1.0   NaN   
Japan                          NaN   NaN   NaN     NaN   NaN    -3.0   NaN   
Lebanon                        NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Lithuania                      NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Malta                          NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Netherlands                    NaN   NaN   NaN    -1.0   NaN  -480.0   NaN   
Norway                         NaN   NaN   NaN     NaN   NaN     0.0   NaN   
Poland                         NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Portugal                       NaN   NaN   NaN     NaN   NaN     1.0   NaN   
RSA                            NaN   NaN   NaN     NaN   NaN     1.0   NaN   
Saudi Arabia                   NaN   NaN   NaN     NaN   NaN     NaN   NaN   
Singapore                      NaN   NaN   NaN     NaN   NaN     0.0   NaN   
Spain                          NaN   NaN   NaN     NaN   NaN    -3.0   NaN   
Sweden                         NaN   NaN   NaN     NaN   NaN     1.0   NaN   
Switzerland                    NaN   NaN   NaN     NaN   NaN     NaN   NaN   
USA                            NaN   NaN   NaN     NaN   NaN     NaN   NaN   
United Arab Emirates           NaN   NaN   NaN     NaN   NaN     NaN   NaN   
United Kingdom                12.0  30.0 -16.0 -1191.0  16.0  3681.0   4.0   
Unspecified                    NaN   NaN   NaN     NaN   NaN     NaN   NaN   

StockCode               POST  
Country                       
Australia                0.0  
Austria                 37.0  
Bahrain                  NaN  
Belgium                272.0  
Brazil                   NaN  
Canada                   1.0  
Channel Islands          NaN  
Cyprus                   1.0  
Czech Republic           0.0  
Denmark                 41.0  
EIRE                     NaN  
European Community       9.0  
Finland                 89.0  
France                 803.0  
Germany               1104.0  
Greece                   6.0  
Iceland                  NaN  
Israel                   NaN  
Italy                   44.0  
Japan                    NaN  
Lebanon                  NaN  
Lithuania                NaN  
Malta                   10.0  
Netherlands             99.0  
Norway                  57.0  
Poland                   9.0  
Portugal                97.0  
RSA                      NaN  
Saudi Arabia             NaN  
Singapore                NaN  
Spain                  209.0  
Sweden                  37.0  
Switzerland             97.0  
USA                      NaN  
United Arab Emirates     1.0  
United Kingdom         -21.0  
Unspecified              NaN  

[37 rows x 3684 columns]

Añadiendo una nueva columna

pivot_table = pd.pivot_table(sales_data, values = 'Quantity', index = 'Country',
                             columns=['StockCode', 'UnitPrice'], aggfunc='sum')
print(pivot_table)

Reshape

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'],
    'B': [1, 2, 3],
    'C': [4, 5, 6]
})
print(df)
     A  B  C
0  foo  1  4
1  bar  2  5
2  baz  3  6
df_stacked = df.stack()
print(df_stacked)
0  A    foo
   B      1
   C      4
1  A    bar
   B      2
   C      5
2  A    baz
   B      3
   C      6
dtype: object
df_unstacked = df_stacked.unstack()
print(df_unstacked)

Fusión de DataFrames

Merge, join, concatenate and compare

df1 = pd.DataFrame({
   'key': ['A', 'B', 'C'],
   'value1': [1, 2, 3]
})

df2 = pd.DataFrame({
   'key': ['B', 'C', 'D'],
   'value2': [4, 5, 6]
})

Merge basado en las llaves

Solo se agregan aquellas que tengan igual llave

inner_merged = pd.merge(df1, df2, on='key', how='inner')
print(inner_merged)
  key  value1  value2
0   B       2       4
1   C       3       5

Ahora vamos a hacer un merge completo

outer_merged = pd.merge(df1, df2, on='key', how='outer')
print(outer_merged)
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     4.0
2   C     3.0     5.0
3   D     NaN     6.0

Merge a izquierda y derecha

El join izquierdo incluye todas las filas del DataFrame de la izquierda, y solo las filas coincidentes del DataFrame de la derecha. Si no hay coincidencia, se llenan con NaN.

El join derecho incluye todas las filas del DataFrame de la derecha y solamente las filas coincidentes del DataFrame de la izquierda. Al igual que en el join izquierdo, las filas sin coincidencia se llenan con NaN.

Ambos métodos permiten combinar información de diferentes fuentes, pero cada uno tiene un enfoque distinto en cuanto a qué filas se mantienen.

left_merged = pd.merge(df1, df2, on='key', how='left')
print(left_merged)
  key  value1  value2
0   A       1     NaN
1   B       2     4.0
2   C       3     5.0
right_merged = pd.merge(df1, df2, on='key', how='right')
print(right_merged)
  key  value1  value2
0   B     2.0       4
1   C     3.0       5
2   D     NaN       6

Concatenando

df3 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df4 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

Verticalmente

vertical_concat = pd.concat([df3, df4], axis=0)
print(vertical_concat)
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

Horizontal

horizontal_concat = pd.concat([df3, df4], axis=1)
print(horizontal_concat)
    A   B   A   B
0  A0  B0  A3  B3
1  A1  B1  A4  B4
2  A2  B2  A5  B5

Manejo de Series Temporales

El formateo de fechas es una parte esencial de la visualización de datos, especialmente cuando se trabaja con series temporales.

Matplotlib, una de las bibliotecas más populares para la creación de gráficos en Python, ofrece herramientas flexibles para personalizar cómo se muestran las fechas en los gráficos.

Utilizando DateFormatter y los códigos de formato de strftime, puedes adaptar la presentación de las fechas para que se ajusten a las necesidades específicas de tu análisis o presentación.

A continuación, exploramos cómo se utiliza el formato ‘%b %Y’ y otras opciones comunes para formatear fechas.

[Series-strftime]

¿Qué hace ‘%b %Y’?

  • %b: Representa el mes abreviado en texto (por ejemplo, ‘Jan’, ‘Feb’, ‘Mar’, etc.). Este formato es útil cuando se quiere mostrar el mes de manera compacta.
  • %Y: Representa el año en cuatro dígitos (por ejemplo, ‘2023’, ‘2024’). Esto proporciona una indicación clara y completa del año. Cuando se utiliza ‘%b %Y’ en el DateFormatter, cada etiqueta del eje x se formatea como Mes Año. Por ejemplo, una fecha que corresponde a enero de 2023 se mostrará como ‘Jan 2023’.

Otras Formas de Dar Formato a Fechas con %:

Matplotlib, a través de DateFormatter, permite una amplia personalización de cómo se muestran las fechas y horas utilizando códigos de formato de fecha y hora basados en el estándar de strftime. Aquí algunos ejemplos comunes:

  • %d: Día del mes como número decimal (por ejemplo, ‘01’, ‘15’).
  • %B: Nombre completo del mes (por ejemplo, ‘January’, ‘February’).
  • %m: Mes como número decimal con ceros a la izquierda (por ejemplo, ‘01’ para enero, ‘12’ para diciembre).
  • %y: Año como número de dos dígitos (por ejemplo, ‘23’ para 2023).
  • %H: Hora en formato de 24 horas (por ejemplo, ‘14’ para las 2 PM).
  • %I: Hora en formato de 12 horas (por ejemplo, ‘02’ para las 2 PM).
  • %p: AM o PM.
  • %M: Minuto con ceros a la izquierda (por ejemplo, ‘05’).
  • %S: Segundo con ceros a la izquierda (por ejemplo, ‘09’).
  • %A: Nombre completo del día de la semana (por ejemplo, ‘Monday’, ‘Tuesday’).
  • %a: Nombre abreviado del día de la semana (por ejemplo, ‘Mon’, ‘Tue’).

Al comprender y utilizar estos formatos, puedes mejorar la claridad y la estética de tus visualizaciones, asegurando que la información temporal se comunique de manera efectiva.

Creando Serie Temporal

data = {
    'Timestamp': ['2024-07-22 10:00:00', 
    '2024-07-22 11:00:00', 
    '2024-07-22 12:00:00'],
    'Value': [100, 150, 200]
}
df = pd.DataFrame(data)

print(df)
             Timestamp  Value
0  2024-07-22 10:00:00    100
1  2024-07-22 11:00:00    150
2  2024-07-22 12:00:00    200

Cargando

# Leer el dataset de ventas desde un archivo CSV
file_path = "online_retail.csv"
sales_data = pd.read_csv(file_path)

# Convertir la columna 'InvoiceDate' a tipo datetime
sales_data['InvoiceDate'] = pd.to_datetime(sales_data['InvoiceDate'])

# Eliminar filas con valores faltantes en las columnas críticas
sales_data.dropna(subset=['CustomerID', 'InvoiceDate'], inplace=True)

# Crear una nueva columna 'TotalPrice'
sales_data['TotalPrice'] = sales_data['Quantity'] * sales_data['UnitPrice']

print(sales_data.head())

Convirtiendo a datetime

#Convertir la columna a tipo datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

#Establecer Timestamp como índice
df.set_index('Timestamp', inplace= True)
print(df)
                     Value
Timestamp                 
2024-07-22 10:00:00    100
2024-07-22 11:00:00    150
2024-07-22 12:00:00    200

Estableciendo a datetime como index

sales_data['InvoiceDate'] = pd.to_datetime(sales_data['InvoiceDate'])

#Establecer InvoiceDate como índice
sales_data.set_index('InvoiceDate', inplace=True)
print(sales_data.head())
                    InvoiceNo StockCode                          Description  \
InvoiceDate                                                                    
2010-12-01 08:26:00    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER   
2010-12-01 08:26:00    536365     71053                  WHITE METAL LANTERN   
2010-12-01 08:26:00    536365    84406B       CREAM CUPID HEARTS COAT HANGER   
2010-12-01 08:26:00    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE   
2010-12-01 08:26:00    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.   

                     Quantity  UnitPrice  CustomerID         Country  \
InvoiceDate                                                            
2010-12-01 08:26:00         6       2.55     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   
2010-12-01 08:26:00         8       2.75     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   

                     TotalPrice  
InvoiceDate                      
2010-12-01 08:26:00       15.30  
2010-12-01 08:26:00       20.34  
2010-12-01 08:26:00       22.00  
2010-12-01 08:26:00       20.34  
2010-12-01 08:26:00       20.34  

Crear columnas con mes, día y año

sales_data['Year'] = sales_data.index.year
sales_data['Month'] = sales_data.index.month
sales_data['Day'] = sales_data.index.day

print(sales_data)
                    InvoiceNo StockCode                          Description  \
InvoiceDate                                                                    
2010-12-01 08:26:00    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER   
2010-12-01 08:26:00    536365     71053                  WHITE METAL LANTERN   
2010-12-01 08:26:00    536365    84406B       CREAM CUPID HEARTS COAT HANGER   
2010-12-01 08:26:00    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE   
2010-12-01 08:26:00    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.   
...                       ...       ...                                  ...   
2011-12-09 12:50:00    581587     22613          PACK OF 20 SPACEBOY NAPKINS   
2011-12-09 12:50:00    581587     22899         CHILDREN'S APRON DOLLY GIRL    
2011-12-09 12:50:00    581587     23254        CHILDRENS CUTLERY DOLLY GIRL    
2011-12-09 12:50:00    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE   
2011-12-09 12:50:00    581587     22138        BAKING SET 9 PIECE RETROSPOT    

                     Quantity  UnitPrice  CustomerID         Country  \
InvoiceDate                                                            
2010-12-01 08:26:00         6       2.55     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   
2010-12-01 08:26:00         8       2.75     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   
2010-12-01 08:26:00         6       3.39     17850.0  United Kingdom   
...                       ...        ...         ...             ...   
2011-12-09 12:50:00        12       0.85     12680.0          France   
2011-12-09 12:50:00         6       2.10     12680.0          France   
2011-12-09 12:50:00         4       4.15     12680.0          France   
2011-12-09 12:50:00         4       4.15     12680.0          France   
2011-12-09 12:50:00         3       4.95     12680.0          France   

                     TotalPrice  Year  Month  Day  
InvoiceDate                                        
2010-12-01 08:26:00       15.30  2010     12    1  
2010-12-01 08:26:00       20.34  2010     12    1  
2010-12-01 08:26:00       22.00  2010     12    1  
2010-12-01 08:26:00       20.34  2010     12    1  
2010-12-01 08:26:00       20.34  2010     12    1  
...                         ...   ...    ...  ...  
2011-12-09 12:50:00       10.20  2011     12    9  
2011-12-09 12:50:00       12.60  2011     12    9  
2011-12-09 12:50:00       16.60  2011     12    9  
2011-12-09 12:50:00       16.60  2011     12    9  
2011-12-09 12:50:00       14.85  2011     12    9  

[406829 rows x 11 columns]

Aplicando resample

monthly_sales = sales_data['Quantity'].resample('ME').sum()
print(monthly_sales)
InvoiceDate
2010-12-31    296362
2011-01-31    269379
2011-02-28    262833
2011-03-31    344012
2011-04-30    278585
2011-05-31    367852
2011-06-30    356922
2011-07-31    363418
2011-08-31    386612
2011-09-30    537496
2011-10-31    569666
2011-11-30    669915
2011-12-31    203836
Freq: ME, Name: Quantity, dtype: int64

Obtienendo información luego de resample

max_sales_month = monthly_sales.idxmax()
max_sales_value = monthly_sales.max()

print(f"El mes con mayores ventas es {max_sales_month.strftime('%B %Y')} con un total de {max_sales_value} unidades vendidas.")
El mes con mayores ventas es November 2011 con un total de 669915 unidades vendidas.