Pandas in Python

Pandas is a Python library

Pandas code in Python

Pandas is a popular open-source library in Python for data manipulation and analysis. It provides easy-to-use data structures and functions to work with structured data, making it a fundamental tool for data scientists, analysts, and developers dealing with tabular or labeled data. Here are some key features and components of Pandas:

DataFrame: The core data structure in Pandas is the DataFrame, which is a two-dimensional, labeled table with columns of potentially different data types. It is similar to a spreadsheet or SQL table. DataFrames allow you to store and manipulate data in a tabular form, making it easy to perform operations on rows and columns.

Series: A Series is a one-dimensional array-like object in Pandas. It is essentially a single column from a DataFrame. Series objects have both data and index labels, allowing for easy alignment of data and efficient access.

Data Import and Export: Pandas supports reading and writing data from/to various file formats, including CSV, Excel, SQL databases, JSON, and more. It can also scrape data from websites and work with data from web APIs.
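For example, a quick round trip to CSV might look like the following sketch (the file name people.csv is just for illustration):

```python
import pandas as pd

# Create a small DataFrame and export it to CSV (without the row index)
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_csv('people.csv', index=False)

# Read the same file back into a new DataFrame
loaded = pd.read_csv('people.csv')
print(loaded)
```

The same pattern applies to the other formats: read_excel/to_excel, read_json/to_json, read_sql/to_sql, and so on.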

Data Cleaning and Transformation: Pandas provides powerful functions for data cleaning, such as handling missing values (NaN or None), data type conversion, and removing duplicates. You can also reshape and pivot data using methods like groupby, pivot, melt, and stack/unstack.
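A small sketch of these cleaning steps (the sales data here is invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'city': ['Ohio', 'Ohio', 'Texas', 'Texas'],
                   'sales': [100.0, np.nan, 150.0, 150.0]})

filled = df.fillna(0)                            # replace missing values with 0
deduped = filled.drop_duplicates()               # drop exact duplicate rows
totals = deduped.groupby('city')['sales'].sum()  # aggregate by group
print(totals)
```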

Data Indexing and Selection: Pandas allows you to select, filter, and slice data in various ways, including label-based indexing, integer-based indexing, boolean indexing, and using conditions.
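A minimal sketch of these selection styles, using invented data:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Texas'],
                   'pop': [11.8, 3.1, 29.1]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']        # label-based indexing
by_position = df.iloc[0]      # integer-based indexing
big = df[df['pop'] > 10]      # boolean indexing with a condition
print(big)
```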

Aggregation and Statistical Analysis: You can perform aggregation operations like mean, sum, count, and more using Pandas. It also provides a wide range of statistical functions for descriptive and inferential statistics.
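A few of these aggregations on a toy Series:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

print(ages.mean())      # 32.5
print(ages.sum())       # 130
print(ages.count())     # 4
print(ages.describe())  # count, mean, std, min, quartiles, max in one call
```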

Time Series Data: Pandas has excellent support for time series data. It includes date and time handling, resampling, and rolling window operations for time-based data analysis.
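A brief sketch of resampling and rolling windows, with made-up daily values:

```python
import pandas as pd

# One value per day for six days
idx = pd.date_range('2023-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample to 3-day totals
totals = ts.resample('3D').sum()
print(totals)

# 3-day rolling mean (first two entries are NaN: the window is incomplete)
rolling_mean = ts.rolling(window=3).mean()
print(rolling_mean)
```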

Merge and Join: Pandas can combine datasets using SQL-like operations, such as merging (joining) data based on common columns or indices. This is especially useful for combining data from multiple sources.
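A minimal merge sketch (the key values are invented), equivalent to a SQL INNER JOIN on a shared column:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'right_val': [4, 5, 6]})

# Keep only keys present in both frames
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
```

Changing how to 'left', 'right', or 'outer' gives the other familiar SQL join behaviors.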

Visualization: While Pandas itself doesn't provide visualization capabilities, it integrates seamlessly with data visualization libraries like Matplotlib and Seaborn, allowing you to create various plots and charts from your data.

Customization and Extensibility: You can customize and extend Pandas functionality by creating your own functions, aggregators, and custom data structures.
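For instance, a custom function can be applied element-wise with apply, and a custom aggregator can be passed to groupby(...).agg (the age-group rule below is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Ohio', 'Ohio', 'Texas'],
                   'age': [25, 35, 30]})

# A custom element-wise function applied with apply
def age_group(age):
    return 'young' if age < 30 else 'older'

df['group'] = df['age'].apply(age_group)

# A custom aggregator: the max-min spread of ages within each city
spread = df.groupby('city')['age'].agg(lambda s: s.max() - s.min())
print(df)
print(spread)
```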

Here's a simple example of how to use Pandas to work with data in a DataFrame:

import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}

df = pd.DataFrame(data)

# Select and filter data
filtered_df = df[df['Age'] > 30]

# Calculate statistics
mean_age = df['Age'].mean()

# Display the results
print(df)
print(filtered_df)
print("Mean Age:", mean_age)
Pandas simplifies data manipulation tasks and provides an efficient and flexible way to work with structured data in Python. It is an essential tool in the data analysis and data science toolbox, and it greatly facilitates tasks such as data cleaning, exploration, and preparation for further analysis or modeling.
Python Code

from pandas import Series, DataFrame

import pandas as pd

To get started with pandas, we will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications. A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
obj = Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
obj.values
array([ 4,  7, -5,  3])
obj.index
RangeIndex(start=0, stop=4, step=1)
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
obj2.values
array([ 4,  7, -5,  3])
obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
obj2 * 3
d    12
b    21
a   -15
c     9
dtype: int64
import numpy as np
np.exp(obj2)
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
'b' in obj2
True
'e' in obj2
False
If you have data stored in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
pdata = {'Rice': 1500, 'Wheat': 1800, 'Sugar': 4000}
obj3 = Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
I will use the terms “missing” or “NA” to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
obj4.name = 'population'
obj4.index.name = 'state'
A Series’s index can be altered in place by assignment:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj.index
Index(['Bob', 'Steve', 'Jeff', 'Ryan'], dtype='object')
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series (all sharing the same index). There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9], 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002]}
The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order:
frame = DataFrame(data)
DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
DataFrame(data, columns=['year', 'pop', 'state'])
   year  pop   state
0  2000  1.5    Ohio
1  2001  1.7    Ohio
2  2002  3.6    Ohio
3  2001  2.4  Nevada
4  2002  2.9  Nevada
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'],
dtype='object')
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val
two    -1.2
four   -1.5
five   -1.7
dtype: float64
frame2['debt'] = val
Another common form of data is a nested dict of dicts format: If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner keys as the row indices:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop
{'Nevada': {2001: 2.4, 2002: 2.9},
 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
frame3.T
        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pdata
{'Ohio': 2001    1.7
2002    3.6
Name: Ohio, dtype: float64, 'Nevada': 2001    2.4
2002    2.9
Name: Nevada, dtype: float64}
DataFrame(pdata)
      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
Reindexing

A critical method on pandas objects is reindex, which means creating a new object with the data conformed to a new index:

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

Data sources and pandas methods

Data sources for a data science project can be divided into the following categories:

Databases: Most CRM, ERP, and other enterprise applications archive their data in databases. Depending on the volume, velocity, and variety of the data, this may be a traditional relational database or a NoSQL database. To connect to most popular databases from Python, we need JDBC/ODBC drivers; fortunately, such drivers are available for all popular databases. Data processing in this case involves making a connection to these sources from Python, querying the data, and then manipulating it using pandas. We will look at an example of how to do this later in this chapter.
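As a minimal sketch of that workflow, here is Python's built-in sqlite3 module with an in-memory database standing in for a real enterprise source (the customers table and its contents are invented; for other databases you would swap in the appropriate driver and connection string):

```python
import sqlite3
import pandas as pd

# A throwaway in-memory database playing the role of an enterprise source
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE customers (name TEXT, revenue REAL)')
conn.executemany('INSERT INTO customers VALUES (?, ?)',
                 [('Acme', 1200.0), ('Globex', 800.0)])

# Query it straight into a DataFrame for further wrangling in pandas
df = pd.read_sql('SELECT * FROM customers WHERE revenue > 1000', conn)
print(df)
conn.close()
```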

Web services: Many business applications, especially Software as a Service (SaaS) tools, make their data accessible through Application Programming Interfaces (APIs) instead of a database. This reduces the cost of maintaining permanent data hosting infrastructure; instead, the data is made available as a service, on demand. An API call can be made from Python, which returns data in formats such as JSON or XML. That data is then parsed and processed using pandas for further use.
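A minimal sketch of that flow, with a hard-coded JSON string standing in for a real API response (no network call is made; the fields are invented):

```python
import json
import pandas as pd

# Pretend this string came back from an API call
payload = '[{"id": 1, "city": "Ohio", "pop": 1.5}, {"id": 2, "city": "Texas", "pop": 2.9}]'

records = json.loads(payload)   # parse the JSON text into Python objects
df = pd.DataFrame(records)      # tabulate the records for analysis
print(df)
```

In a real script, the payload would come from an HTTP client such as the requests library, but the parsing and tabulation steps are the same.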

Data files: A lot of the data for prototyping data science models comes as data files. One example is data from IoT sensors, which in most cases is stored in a flat file, a .txt file, or a .csv file. Another source is samples of existing data extracted from a database and stored in such files. The outputs of data extraction and machine learning algorithms are also often stored in files such as CSV, Excel, and .txt files. Another example is that the trained weight matrices of a deep learning neural network model can be saved as an HDF5 file.

Web and document scraping: Two other sources of data are tables and text on web pages. This data is harvested from those pages using Python packages such as BeautifulSoup and Scrapy and saved to a data file or database for further use. Tables and text contained in non-data file formats, such as PDFs or Word documents, are also a major source of data; these are extracted using Python packages such as Tesseract and Tabula-py.

For more pandas code, see the official pandas documentation.
