2021-12-09

Tech Programing

程式人小天地

Splitting dataframe into multiple dataframes

2 min read


  • The method in the OP works, but isn’t efficient. It may have seemed to run forever, because the dataset was long.
  • Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
    • .groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
  • The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
  • The original question wanted a list of DataFrames, which can be done with a list-comprehension
    • df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns  # for test dataset

# load data for example
df = sns.load_dataset('planets')

# display(df.head())
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}

print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])

# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}

print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
  • df_dict['df1].head(3) or df_dict['Astrometry'].head(3)
  • There are only 2 in this group
         method  number  orbital_period  mass  distance  year
113  Astrometry       1          246.36   NaN     20.77  2013
537  Astrometry       1         1016.00   NaN     14.98  2010
  • df_dict['df2].head(3) or df_dict['Eclipse Timing Variations'].head(3)
                       method  number  orbital_period  mass  distance  year
32  Eclipse Timing Variations       1         10220.0  6.05       NaN  2009
37  Eclipse Timing Variations       2          5767.0   NaN    130.72  2008
38  Eclipse Timing Variations       2          3321.0   NaN    130.72  2008
  • df_dict['df3].head(3) or df_dict['Imaging'].head(3)
     method  number  orbital_period  mass  distance  year
29  Imaging       1             NaN   NaN     45.52  2005
30  Imaging       1             NaN   NaN    165.00  2007
31  Imaging       1             NaN   NaN    140.00  2004
  • For more information about the seaborn datasets

Alternatively

  • This is a manual method to create separate DataFrames using pandas: Boolean Indexing
  • This is similar to the accepted answer, but .loc is not required.
  • This is an acceptable method for creating a couple extra DataFrames.
  • The pythonic way to create multiple objects, is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']



Source link

資料來源:Stackoverflow

發佈留言

發佈留言必須填寫的電子郵件地址不會公開。 必填欄位標示為 *