{"cells":[{"cell_type":"markdown","metadata":{},"source":["
"]},{"cell_type":"markdown","metadata":{"id":"7Y1x-BtnOUan"},"source":["
"]},{"cell_type":"markdown","metadata":{},"source":["***"]},{"cell_type":"markdown","metadata":{"id":"FIWWH9MWOtfS"},"source":["## Combining data"]},{"cell_type":"markdown","metadata":{"id":"lNbjF_YVOsXg"},"source":["This exercise will walk you through how to combine dataframes. Often combining dataframes will let you add aditional information to each row, or add additional rows.\n","\n","Outline:\n","* Look at how to add more rows of data using **concat**\n","* Look at how to add more columns of data using **merge**\n"]},{"cell_type":"markdown","metadata":{"id":"-QsmnI5fPERa"},"source":["### Concatinating\n","\n"," Concatinating let's us combine two dataframes, that share similar columns, vertically. You can think of this as stacking one dataframe on top of another."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"icRcRdcyPujb"},"outputs":[],"source":["#import pandas \n","import pandas as pd\n","import numpy as np"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"4pYVhe_aOoYj"},"outputs":[],"source":["#create a simple dataframe\n","df_top = pd.DataFrame({'score':[12,22,13,24,15], 'class':['1','1',np.NaN,'2','2']})\n","\n","#take a look\n","df_top"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"kUkYHUI1QAvK"},"outputs":[],"source":["#create a simple dataframe\n","df_bottom = pd.DataFrame({'score':[16,17,18,19,20], 'class':['3','3','4','4','4']})\n","\n","#take a look\n","df_bottom"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"eB9X-jWXQDLC"},"outputs":[],"source":["#contaconate the dataframes together. Note: the square brackets are there to pass the dataframes to the concat method as a list of dataframes.\n","pd.concat([df_top,df_bottom])"]},{"cell_type":"markdown","metadata":{"id":"G0GMaQHvQUhI"},"source":["But what happends when there are extra or missing columns?"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mz6XC71HQYy-"},"outputs":[],"source":["#create a simple dataframe without the B column and a new column\n","df_messy = pd.DataFrame({'score':[11,21],'school':['1','2']})\n","\n","#take a look\n","df_messy"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"a_Gk0_8FQwNM"},"outputs":[],"source":["pd.concat([df_top,df_bottom,df_messy])"]},{"cell_type":"markdown","metadata":{"id":"Z3ICrDbkQ3y8"},"source":["We can see that when we contatinate the dataframes, pandas adds in missing values where there are no values."]},{"cell_type":"markdown","metadata":{"id":"OrampdmfMhdO"},"source":["
"]},{"cell_type":"markdown","metadata":{"id":"mifVGN7qoM0V"},"source":[" \n","Try to create two dataframes with the same column names, and then use concat to combine them."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"s_k8CDKUoU0C"},"outputs":[],"source":["#create first\n","df1 = pd.DataFrame({???})\n","\n","#create second\n","df2 = ?\n","\n","#combine the dataframes\n","pd.concat(?)"]},{"cell_type":"markdown","metadata":{"id":"WuB4r27hQXKo"},"source":["### Joining\n","\n","\n","Joining lets us combine two dataframes that share similar values horizontally. You can think of this as sliding two dataframes next to each other.\n","\n","When joining you often want to increase the amount of information in each row (i.e., adding more columns).\n","\n","Let's try a **left join**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"246yiB7ZQkww"},"outputs":[],"source":["#let's use as the dataframe of interest\n","df_top"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"6FQq72YSRfVH"},"outputs":[],"source":["#create some extra information\n","df_right = pd.DataFrame({'class':['1','2','3'],\n"," 'teacher':['Robby','Sarah','Alex']})\n","\n","#take a look\n","df_right"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"SxaOWjpIRZPm"},"outputs":[],"source":["#Let's add some extra information to each row\n","pd.merge(left=df_top,right=df_right,on='class', how=\"left\")\n"]},{"cell_type":"markdown","metadata":{"id":"P1mEBuH2U1yd"},"source":["Notice how the merge took the two teacher names and associated them with the right class. Here we end up with many rows with Robby and Sarah. This is an example of a many-to-one join. "]},{"cell_type":"markdown","metadata":{"id":"9J0zY0nWQ3Y-"},"source":["Right join"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"GU8FFU4aQ5qu"},"outputs":[],"source":["pd.merge(left=df_top,right=df_right,on='class', how=\"right\")"]},{"cell_type":"markdown","metadata":{"id":"cDuV7CsNQ6Fw"},"source":["Inner join"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"lJZMaDPSQ7wT"},"outputs":[],"source":["pd.merge(left=df_top,right=df_right,on='class', how=\"inner\")"]},{"cell_type":"markdown","metadata":{"id":"V_nAncFGQ7_f"},"source":["outer join"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"SLCrvSxPQ9Lz"},"outputs":[],"source":["pd.merge(left=df_top,right=df_right,on='class', how=\"outer\")"]},{"cell_type":"markdown","metadata":{"id":"w9DUyN9EQ9pT"},"source":["Generally when joining two dataframes the left join is the most common. In this case you have a main dataframe that you'd like to add information to."]},{"cell_type":"markdown","metadata":{"id":"DCtk3eheRQCj"},"source":["### Join airport data"]},{"cell_type":"markdown","metadata":{"id":"iBWcqtl7RSGn"},"source":["Let's load in the flight data again and add airport size information using joins!"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"rlxj0R70RaLH"},"outputs":[],"source":["#load flight data from google drive\n","df_flights = pd.read_csv(\"/content/drive/MyDrive/Colab Notebooks/DataScience/dataDSI/nyc_flight_data.csv\")\n","\n","#take a look\n","df_flights"]},{"cell_type":"markdown","metadata":{"id":"Ub1yQgpOiEk5"},"source":["Download the data with the destination [information](https://drive.google.com/file/d/1-ACJcTJkGlHG_lNqeNIjACh6Dt7NHxBZ/view?usp=sharing)."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"34_vv2LeZFG_"},"outputs":[],"source":["#place the file in your google drive folder, and load it into colab using pd.read_cvs()\n","df_dest = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataScience/dataDSI/destination_size.csv')\n","\n","#take a look\n","df_dest\n"]},{"cell_type":"markdown","metadata":{"id":"spO8A8QNpp-Y"},"source":["
"]},{"cell_type":"markdown","metadata":{"id":"0AwRFGZGZjd1"},"source":[" \n","Can you merge the two dataframes (df_flights and df_dest) so that you'll have all the destination airport sizes?\n","> This might be useful to compare arrival delays of individual flights to the size of the airport."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"4dGKh3HCZWVA"},"outputs":[],"source":["#join the two dataframes so that you'll have \n","pd.merge(?)"]},{"cell_type":"markdown","metadata":{"id":"8X0bL0Gpp-Ic"},"source":[" What kind of join did you use?"]},{"cell_type":"markdown","metadata":{"id":"lknGnsz9i0Ra"},"source":["By merging the two dataframes we now have the amount of delay along with the size of the destination airport. We could then look at the relationship between the two using a figure or a model."]},{"cell_type":"markdown","metadata":{"id":"6V4xVe65TyDm"},"source":["### Further reading\n","\n","https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html\n","> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroData5_CombineData.ipynb) version.\n"]}],"metadata":{"colab":{"authorship_tag":"ABX9TyPLByv+w27yWOzUHHrZ5hpt","collapsed_sections":[],"mount_file_id":"18tSIUjqJ9FUAgU_oCxF8fYs3e7gFQDLb","name":"IntroData5_CombineData.ipynb","provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":2}