Big Data Reference: NumPy, Pandas, Matplotlib


Richer Chen
2023-05-22 (last updated 2023-06-05)

numpy

Import the numpy library under the name np (★☆☆)

(Hint: import … as …)

import numpy as np

Create a null (all-zero) vector of length 10 (★☆☆)

(Hint: np.zeros)

Z = np.zeros(10)
print(Z)
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Create an all-zero vector of length 10 whose fifth value is 1 (★☆☆)

(Hint: array[4])

Z = np.zeros(10)
Z[4] = 1
print(Z)
    [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]

Create a vector with values ranging from 10 to 49 (★☆☆)

(Hint: np.arange)

Z = np.arange(10,50)
print(Z)
    [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
     34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]

Reverse a vector (the first element becomes the last) (★☆☆)

(Hint: array[::-1])

Z = np.arange(50)
Z = Z[::-1]
print(Z)
    [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
     25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
      1  0]

Create a 3x3 matrix with values from 0 to 8 (★☆☆)

(Hint: reshape)

Z = np.arange(9).reshape(3,3)
print(Z)
    [[0 1 2]
     [3 4 5]
     [6 7 8]]

Find the indices of the non-zero elements of the array [1,2,0,0,4,0] (★☆☆)

(Hint: np.where)

nz = np.array([1,2,0,0,4,0])
nz = np.where(nz!=0)
print(nz)
    (array([0, 1, 4]),)

Create a 3x3x3 array of random values (★☆☆)

(Hint: np.random.random)

Z = np.random.random((3,3,3))
print(Z)
    [[[0.49540183 0.03833072 0.17015454]
      [0.53560863 0.00536714 0.76869732]
      [0.57771647 0.00343808 0.07679618]]
    
     [[0.51326329 0.34007645 0.31003736]
      [0.05885512 0.61487165 0.86874288]
      [0.37408803 0.24506961 0.50094522]]
    
     [[0.56903475 0.12505482 0.5400201 ]
      [0.46160486 0.00820837 0.56462576]
      [0.10545321 0.17982915 0.89136815]]]

Create a 10x10 random array and find its maximum and minimum values (★☆☆)

(Hint: min, max)

Z = np.random.random((10,10))
Zmin, Zmax = Z.min(), Z.max()
print(Zmin, Zmax)
   0.007735040835088136 0.9787372284134425

Create a random vector of length 30 and find its mean (★☆☆)

(Hint: mean)

Z = np.random.random(30)
m = Z.mean()
print(m)
    0.4802736181542377

Create a 2D array with 1 on the border and 0 everywhere else (★☆☆)

(Hint: array[1:-1, 1:-1])

Z = np.ones((10,10))
Z[1:-1,1:-1] = 0
print(Z)
    [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

What is the result of each of the following expressions? (★☆☆)

(Hint: NaN = not a number)

0 * np.nan
np.nan == np.nan
np.nan - np.nan
0.3 == 3 * 0.1
print(0 * np.nan)
    nan
print(np.nan == np.nan)
    False
print(np.nan - np.nan)
    nan
print(0.3 == 3 * 0.1)
    False
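
Because NaN never compares equal to itself and 0.1 has no exact binary representation, equality checks like the ones above are usually replaced with np.isnan and np.isclose. A minimal sketch (the array `values` is made up for illustration):

```python
import numpy as np

# Hypothetical sample data containing one missing value
values = np.array([1.0, np.nan, 3.0])

# NaN != NaN, so use np.isnan to locate NaN entries
nan_mask = np.isnan(values)

# Compare floats with a tolerance instead of ==
close = np.isclose(0.3, 3 * 0.1)

print(nan_mask)
print(close)
```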

Create an 8x8 matrix and fill it with a checkerboard pattern (★☆☆)

(Hint: array[::2])

Z = np.zeros((8,8),dtype=int)
Z[1::2,::2] = 1
Z[::2,1::2] = 1
print(Z)
    [[0 1 0 1 0 1 0 1]
     [1 0 1 0 1 0 1 0]
     [0 1 0 1 0 1 0 1]
     [1 0 1 0 1 0 1 0]
     [0 1 0 1 0 1 0 1]
     [1 0 1 0 1 0 1 0]
     [0 1 0 1 0 1 0 1]
     [1 0 1 0 1 0 1 0]]

Normalize a 5x5 random matrix (★☆☆)

(Hint: min-max and z-score)

Z = np.random.random((5,5))
Zmax, Zmin = Z.max(), Z.min()
Z = (Z - Zmin)/(Zmax - Zmin)
print(Z)
    [[0.28005346 0.76908806 0.94344695 0.64199284 0.12711646]
     [0.51537725 0.12372809 0.45086104 0.29787342 0.84346246]
     [0.71279596 0.60373318 0.04030923 0.         0.80699155]
     [1.         0.67174818 0.12411185 0.34138983 0.40129933]
     [0.57061635 0.40793513 0.1658807  0.62630389 0.62997557]]
Z = np.random.random((5,5))
Zmean, Zstd = Z.mean(), Z.std()
Z = (Z - Zmean)/Zstd
print(Z)
    [[-0.3881769   0.43273592 -1.1007492   0.11627415  1.20322183]
     [-1.88538572  0.3273551   1.05532266  0.78136778  0.14407397]
     [ 0.85746185 -1.55078272 -0.50247391  1.02981358 -0.99543472]
     [-0.42427793  0.02952572  0.41339576 -0.11907466  0.28742804]
     [ 1.1604349  -1.77222348  1.31790618 -1.68842404  1.27068583]]

Given a 1D array, negate all elements that are greater than 3 and at most 8 (★☆☆)

(Hint: >, <=)

Z = np.arange(11)
Z[(3 < Z) & (Z <= 8)] = Z[(3 < Z) & (Z <= 8)] * (-1)
print(Z)
    [ 0  1  2  3 -4 -5 -6 -7 -8  9 10]

What does the following script print? (★☆☆)

(Hint: np.sum)

print(sum(range(5),-1))
from numpy import *
print(sum(range(5),-1))
print(sum(range(5),-1))   # built-in sum: -1 is the start value
    9
from numpy import *
print(sum(range(5),-1))   # numpy.sum: -1 is the axis argument
    10

Consider an integer vector Z; which of the following expressions are legal? (★☆☆)

Z**Z
2 << Z >> 2
Z <- Z
Z/1/1
Z<Z>Z
Z = np.arange(5)
Z ** Z  # legal
    array([  1,   1,   4,  27, 256])
Z = np.arange(5)
2 << Z >> 2  # legal
    array([0, 1, 2, 4, 8])
Z = np.arange(5)
Z <- Z   # legal: parsed as Z < (-Z)
    array([False, False, False, False, False])
Z = np.arange(5)
Z/1/1   # legal
    array([0., 1., 2., 3., 4.])
Z = np.arange(5)
Z<Z>Z    # illegal: raises ValueError
    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    /var/folders/47/5yry4px511s8gxnw_31hzwlw0000gn/T/ipykernel_31933/1838066248.py in <module>
          1 Z = np.arange(5)
    ----> 2 Z<Z>Z    # illegal: raises ValueError


    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

What is the result of each of the following expressions? (★☆☆)

np.array(0) / np.array(0)
np.array(0) // np.array(0)
np.array([np.nan]).astype(int)
print(np.array(0) / np.array(0))
    nan
    /var/folders/47/5yry4px511s8gxnw_31hzwlw0000gn/T/ipykernel_31933/4120864939.py:1: RuntimeWarning: invalid value encountered in true_divide
      print(np.array(0) / np.array(0))
print(np.array(0) // np.array(0))
   0
    /var/folders/47/5yry4px511s8gxnw_31hzwlw0000gn/T/ipykernel_31933/4108562882.py:1: RuntimeWarning: divide by zero encountered in floor_divide
      print(np.array(0) // np.array(0))
print(np.array([np.nan]).astype(int))
    [-9223372036854775808]

Extract the integer part of a random array of positive numbers using three different methods (★★☆)

(Hint: %, np.floor, np.ceil)

Z = np.random.uniform(0,10,10)
print (Z - Z%1)
    [5. 3. 8. 7. 3. 8. 7. 3. 3. 5.]
print (np.floor(Z))
    [5. 3. 8. 7. 3. 8. 7. 3. 3. 5.]
print (np.ceil(Z)-1)
    [5. 3. 8. 7. 3. 8. 7. 3. 3. 5.]
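
The np.ceil(Z)-1 variant above only matches the other two when no element is already a whole number. As a hedged alternative, np.trunc and an integer cast also drop the fractional part; the small array below is a made-up example chosen to include an exact integer:

```python
import numpy as np

# Hypothetical input that contains an exact integer (2.0)
Z = np.array([0.5, 2.0, 7.25])

by_trunc = np.trunc(Z)       # drops the fractional part, keeps float dtype
by_cast = Z.astype(int)      # truncates toward zero, returns integers
by_ceil = np.ceil(Z) - 1     # off by one wherever Z is already an integer

print(by_trunc, by_cast, by_ceil)
```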

Create a 5x5 matrix in which every row holds the values 0 to 4 (★★☆)

(Hint: np.arange)

Z = np.zeros((5,5))
Z += np.arange(5)
print (Z)
    [[0. 1. 2. 3. 4.]
     [0. 1. 2. 3. 4.]
     [0. 1. 2. 3. 4.]
     [0. 1. 2. 3. 4.]
     [0. 1. 2. 3. 4.]]

Create a random vector of length 10 and sort it (★★☆)

(Hint: sort)

Z = np.random.random(10)
Z.sort()
print (Z)
    [0.05948668 0.06772389 0.18053073 0.3690779  0.43207858 0.59212272
     0.61474614 0.64012558 0.67395373 0.72028118]

Create a random vector of length 10 and replace its maximum value with 1 (★★☆)

(Hint: argmax)

Z = np.random.random(10)
Z[Z.argmax()] = 1
print (Z)
    [0.2322195  0.72417001 0.54942971 0.83360414 1.         0.05253964
     0.91218709 0.64805915 0.73789832 0.09189523]

Subtract the mean of each row of a matrix (★★☆)

(Hint: mean(axis=...))

X = np.random.rand(5, 10)
Y = X - X.mean(axis=1).reshape(5, 1)
print (Y)
    [[ 0.02477607 -0.44978471  0.35835226 -0.11303505  0.41311033 -0.04494334
       0.37069344 -0.33173567 -0.2418161   0.01438276]
     [-0.21299384  0.32298826  0.00319469 -0.24739807 -0.36345109 -0.28780506
       0.37067288 -0.08722747  0.57116507 -0.06914538]
     [ 0.32097414 -0.13742142  0.00140244 -0.45716547  0.47243945 -0.19557543
       0.29747343 -0.37215733 -0.02343571  0.0934659 ]
     [ 0.05670052  0.22122356  0.02275285 -0.11814782 -0.12763654 -0.22702062
      -0.2752823   0.21867024 -0.20739751  0.43613762]
     [ 0.50697231  0.19782196  0.01629516 -0.24005828 -0.27824507 -0.08727165
      -0.26113724 -0.29081586  0.44296733 -0.00652866]]
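
The reshape(5, 1) above hard-codes the number of rows. One possible variant, sketched here on a small made-up matrix, uses keepdims=True so the row means keep a broadcastable shape for any input size:

```python
import numpy as np

# Hypothetical 3x4 example matrix
X = np.arange(12, dtype=float).reshape(3, 4)

# keepdims=True returns shape (3, 1), which broadcasts against (3, 4)
Y = X - X.mean(axis=1, keepdims=True)

print(Y)
```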

How to sort an array by its n-th column? (★★☆)

(Hint: argsort)

Z = np.random.randint(0,10,(3,3))
print (Z)
print (Z[Z[:,1].argsort()])
    [[4 3 3]
     [1 0 1]
     [7 8 4]]
    [[1 0 1]
     [4 3 3]
     [7 8 4]]

Consider the vector [1,2,3,4,5]; how to build a new vector with 3 consecutive zeros between each pair of values? (★★★)

(Hint: array[::4])

Z = np.array([1,2,3,4,5])
nz = 3
Z0 = np.zeros(len(Z) + (len(Z)-1)*(nz))
Z0[::nz+1] = Z
print (Z0)
    [1. 0. 0. 0. 2. 0. 0. 0. 3. 0. 0. 0. 4. 0. 0. 0. 5.]

How to swap any two rows of an array? (★★★)

(Hint: array[[]] = array[[]])

A = np.arange(25).reshape(5,5)
A[[0,1]] = A[[1,0]]
print (A)
    [[ 5  6  7  8  9]
     [ 0  1  2  3  4]
     [10 11 12 13 14]
     [15 16 17 18 19]
     [20 21 22 23 24]]

How to compute the average of an array over a sliding window? (★★★)

(Hint: np.cumsum)

def moving_average(a, n=3) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
Z = np.arange(20)
print(moving_average(Z, n=3))
    [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18.]
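
As a cross-check on the cumsum approach, the same sliding-window mean can be sketched with np.convolve and a uniform kernel. The function name `moving_average_conv` is introduced here for illustration only:

```python
import numpy as np

def moving_average_conv(a, n=3):
    # Convolving with n weights of 1/n averages each window of n samples;
    # mode="valid" keeps only positions where the window fits entirely
    return np.convolve(a, np.ones(n) / n, mode="valid")

Z = np.arange(20)
result = moving_average_conv(Z, n=3)
print(result)
```

The cumsum version runs in time independent of the window size, so it is likely the better choice for large windows; the convolution form is mainly easier to read.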

How to get the n largest values of an array? (★★★)

(Hint: np.argsort)

Z = np.arange(10000)
n = 5
print (Z[np.argsort(Z)[-n:]])
    [9995 9996 9997 9998 9999]
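
Sorting the whole array is more work than needed when only the top n values matter. A hedged alternative uses np.argpartition, which only guarantees that the n selected elements are the largest, not that they are ordered (the shuffle and seed are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
Z = np.arange(10000)
rng.shuffle(Z)
n = 5

# Partition -Z so that its n smallest entries (the n largest of Z) come first
top_unsorted = Z[np.argpartition(-Z, n)[:n]]

# Sort just those n values if an ordered result is wanted
top_sorted = np.sort(top_unsorted)
print(top_sorted)
```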

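Consider a large vector Z; compute its cube using two different methods (★★★)

(Hint: np.power, *)

The snippet below uses a single random value x for brevity; both expressions work element-wise on a vector unchanged.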

x = np.random.rand()
np.power(x,3)
    0.6061196882414749
# Method 2
x*x*x
    0.6061196882414749

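Appendix 1. Consider two arrays A and B of shape (8,3) and (2,2). How to find the rows of A that contain elements of each row of B (regardless of the order of the elements within each row of B)?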
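
No solution is given for this exercise. One possible reading, assumed here, is: keep a row of A if it shares at least one value with every row of B. A sketch under that assumption (the helper name `rows_sharing_with_every_row` and the random inputs are illustrative):

```python
import numpy as np

def rows_sharing_with_every_row(A, B):
    # hits[i, j] is True when row i of A shares at least one value with row j of B
    hits = np.stack([np.isin(A, b).any(axis=1) for b in B], axis=1)
    # Keep the rows of A that match every row of B
    return np.where(hits.all(axis=1))[0]

rng = np.random.default_rng(0)           # arbitrary seed for the demo
A = rng.integers(0, 5, size=(8, 3))
B = rng.integers(0, 5, size=(2, 2))
rows = rows_sharing_with_every_row(A, B)
print(rows)
```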
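Appendix 2. Generate an arbitrary 2D array of shape (4,4), find its three largest elements and their coordinates, and output the result in the form {(row index, column index): value, ...}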
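
A possible solution sketch: argsort the flattened array, take the last three positions, and convert them back to (row, column) pairs with np.unravel_index. The helper name `top_k_with_coords` is hypothetical, and ties are broken by whatever order argsort returns:

```python
import numpy as np

def top_k_with_coords(arr, k=3):
    # Flat indices of the k largest values, largest first
    flat_idx = np.argsort(arr, axis=None)[-k:][::-1]
    # Convert flat indices back to (row, column) coordinates
    rows, cols = np.unravel_index(flat_idx, arr.shape)
    return {(int(r), int(c)): arr[r, c].item() for r, c in zip(rows, cols)}

rng = np.random.default_rng(0)           # arbitrary seed for the demo
Z = rng.integers(0, 100, size=(4, 4))
print(Z)
print(top_k_with_coords(Z))
```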
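Appendix 3. Randomly generate two 2D arrays. Which elements do the two arrays have in common? Print the common elements if there are any; otherwise print "no common elements".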
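
One way to approach this, sketched below, is np.intersect1d, which flattens both inputs and returns their sorted unique common values. The helper name `report_common` and the random inputs are assumptions for illustration:

```python
import numpy as np

def report_common(a, b):
    # intersect1d flattens its inputs and returns sorted unique common values
    common = np.intersect1d(a, b)
    return common.tolist() if common.size else "no common elements"

rng = np.random.default_rng(0)           # arbitrary seed for the demo
A = rng.integers(0, 10, size=(3, 3))
B = rng.integers(0, 10, size=(2, 4))
print(report_common(A, B))
```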
Appendix 4. A line passes through the two points P0(x1,y1) and P1(x2,y2). Compute the distance from the point p(x3,y3) to this line.
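
A sketch for the 2D case, assuming P0 and P1 are distinct: the distance equals the absolute value of the 2D cross product of (P1 - P0) and (p - P0), divided by the length of (P1 - P0). The function name `point_line_distance` is introduced for illustration:

```python
import numpy as np

def point_line_distance(p0, p1, p):
    # Assumes p0 != p1; otherwise the line is undefined and this divides by zero
    p0, p1, p = (np.asarray(v, dtype=float) for v in (p0, p1, p))
    d = p1 - p0                           # direction of the line
    w = p - p0                            # vector from P0 to the point
    cross = d[0] * w[1] - d[1] * w[0]     # twice the signed triangle area
    return abs(cross) / np.hypot(d[0], d[1])

print(point_line_distance((0, 0), (1, 0), (3, 4)))
```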

Pandas

Convert a list into a Pandas DataFrame

import pandas as pd
my_list=[('join',25,'male'),('lisa',30,'female'),('david',18,'male')]
df=pd.DataFrame(my_list,columns=['Name','age','gender'])
print(df)
        Name age  gender
    0   join  25    male
    1   lisa  30  female
    2  david  18    male

Read data from a CSV file into a Pandas DataFrame

df=pd.read_csv('file_path')  # replace with the path to your CSV file
print(df)

Check the number of rows and columns of a Pandas DataFrame

import pandas as pd
df=pd.DataFrame({'A':[1,2,3],"B":[4,5,6],"C":[7,8,9]})
print(df.shape)
    (3, 3)

View the column names of a Pandas DataFrame

import pandas as pd
data={"name":['alex','box','chery'],'age':[18,20,12]}
df=pd.DataFrame(data)
print(df.columns)
    Index(['name', 'age'], dtype='object')

View the index of a Pandas DataFrame

import pandas as pd
data={"name":['alex','box','chery'],'age':[18,20,12]}
df=pd.DataFrame(data)
print(df.index)
    RangeIndex(start=0, stop=3, step=1)

Read data from a CSV file and view the first few rows

import pandas as pd
df=pd.read_csv("file_path")  # replace with the path to your CSV file
df.head(3)

View the data types of a Pandas DataFrame

import pandas as pd
data={"name":['alex','bob','chery'],'age':[10,12,13]}
df=pd.DataFrame(data)
print(df.dtypes)
    name    object
    age      int64
    dtype: object

View summary statistics of a Pandas DataFrame

import pandas as pd
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[2.1,4.2,6.3,8.4,10.5],'C':['a','b','a','b','a']})
df.describe()
                  A          B
    count  5.000000   5.000000
    mean   3.000000   6.300000
    std    1.581139   3.320392
    min    1.000000   2.100000
    25%    2.000000   4.200000
    50%    3.000000   6.300000
    75%    4.000000   8.400000
    max    5.000000  10.500000

How to select rows of a Pandas DataFrame?

import pandas as pd
df=pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                 'Age': [25, 30, 35],
                 'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
first_row=df.loc[0]
first_row
    Name       Alice
    Age           25
    City    New York
    Name: 0, dtype: object
first_and_third=df.loc[[0,2],:]
first_and_third
Name Age City
0 Alice 25 New York
2 Charlie 35 London
sub=df.loc[[0,2],['Name','Age']]
sub
Name Age
0 Alice 25
2 Charlie 35

How to select columns of a Pandas DataFrame?

import pandas as pd
df=pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
df['Name']
    0      Alice
    1        Bob
    2    Charlie
    Name: Name, dtype: object
df[['Name','Age']]
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
df.iloc[:,0]
    0      Alice
    1        Bob
    2    Charlie
    Name: Name, dtype: object
df.iloc[:,1:3]
Age City
0 25 New York
1 30 Paris
2 35 London

How to select rows and columns of a Pandas DataFrame?

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
sub=df.loc[[0,2],['Name','Age']]
sub
Name Age
0 Alice 25
2 Charlie 35
sub1=df.iloc[[0,2],[0,1]]
sub1
Name Age
0 Alice 25
2 Charlie 35

How to filter the rows of a Pandas DataFrame?

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
bool_index=df["Age"]>25
bool_index
    0    False
    1     True
    2     True
    Name: Age, dtype: bool
filt=df[bool_index]
print(filt)
          Name  Age    City
    1      Bob   30   Paris
    2  Charlie   35  London

How to filter the rows and columns of a Pandas DataFrame?

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
# Select the rows where Age is greater than 25, keeping only the 'Name' and 'Age' columns
sub=df.loc[df['Age']>25,['Name','Age']]
print(sub)
          Name  Age
    1      Bob   30
    2  Charlie   35
sub=df.loc[df['Name']=='Bob',['Age','City']]
sub
Age City
1 30 Paris
sub1=df.iloc[[0,2],[0,1]]
sub1
Name Age
0 Alice 25
2 Charlie 35
sub1=df.iloc[1,1:]
sub1
    Age        30
    City    Paris
    Name: 1, dtype: object

How to sort a Pandas DataFrame by the values of a column?

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 10, 35],
                   'City': ['New York', 'Paris', 'London']})
df
Name Age City
0 Alice 25 New York
1 Bob 10 Paris
2 Charlie 35 London
df_sort=df.sort_values('Age')
df_sort
Name Age City
1 Bob 10 Paris
0 Alice 25 New York
2 Charlie 35 London
df_sorted=df.sort_values('Name',ascending=False)  # ascending must be a bool, not the string 'False'
df_sorted
          Name  Age      City
    2  Charlie   35    London
    1      Bob   10     Paris
    0    Alice   25  New York
df_sorted=df.sort_values(['Age','Name'])
print(df_sorted)
          Name  Age      City
    1      Bob   10     Paris
    0    Alice   25  New York
    2  Charlie   35    London

How to aggregate a Pandas DataFrame?

import pandas as pd

# Create a DataFrame of sales data
df = pd.DataFrame({'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'SalesDate': ['2022-01-01', '2022-01-01', '2022-01-01',
                                 '2022-01-02', '2022-01-02', '2022-01-02',
                                 '2022-01-03', '2022-01-03', '2022-01-03'],
                   'SalesAmount': [100, 200, 150, 50, 75, 125, 300, 250, 200]})
df
Product SalesDate SalesAmount
0 A 2022-01-01 100
1 B 2022-01-01 200
2 C 2022-01-01 150
3 A 2022-01-02 50
4 B 2022-01-02 75
5 C 2022-01-02 125
6 A 2022-01-03 300
7 B 2022-01-03 250
8 C 2022-01-03 200
agg_df=df.groupby('Product')['SalesAmount'].agg(['sum','mean','max'])
agg_df
             sum        mean  max
    Product                      
    A        450  150.000000  300
    B        525  175.000000  250
    C        475  158.333333  200

How to combine (concatenate) Pandas DataFrames

import pandas as pd
# Keep column names aligned when printing East Asian (wide) characters
pd.set_option('display.unicode.east_asian_width', True)
df1 = pd.DataFrame({'编号':['mr001','mr002','mr003'],
                    '语文':[110,105,109],
                    '数学':[105,88,120],
                    '英语':[99,115,130]})
print(df1)
        编号  语文  数学  英语
    0  mr001   110   105    99
    1  mr002   105    88   115
    2  mr003   109   120   130
df2 = pd.DataFrame({'编号':['mr002','mr001','mr003','mr004'],
                    '体育':[34.5,39.7,38,45]})
print(df2)
        编号  体育
    0  mr002  34.5
    1  mr001  39.7
    2  mr003  38.0
    3  mr004  45.0
cont_df=pd.concat([df1,df2],axis=0)
cont_df
编号 语文 数学 英语 体育
0 mr001 110.0 105.0 99.0 NaN
1 mr002 105.0 88.0 115.0 NaN
2 mr003 109.0 120.0 130.0 NaN
0 mr002 NaN NaN NaN 34.5
1 mr001 NaN NaN NaN 39.7
2 mr003 NaN NaN NaN 38.0
3 mr004 NaN NaN NaN 45.0
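
pd.concat with axis=0 stacks the two frames and leaves NaN wherever a column is missing. If the goal is instead to line rows up by the shared 编号 column, pd.merge is one option. A minimal sketch, reusing a subset of the columns above and assuming a left join is wanted:

```python
import pandas as pd

df1 = pd.DataFrame({'编号': ['mr001', 'mr002', 'mr003'],
                    '语文': [110, 105, 109]})
df2 = pd.DataFrame({'编号': ['mr002', 'mr001', 'mr003', 'mr004'],
                    '体育': [34.5, 39.7, 38, 45]})

# how='left' keeps every row of df1 and attaches matching rows from df2;
# mr004 exists only in df2, so it is dropped
merged = pd.merge(df1, df2, on='编号', how='left')
print(merged)
```

Passing how='outer' instead would keep mr004 as well, with NaN in the 语文 column.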

How to delete a column from a Pandas DataFrame?

import pandas as pd
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df=pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
df.drop('height',axis=1,inplace=True)
print(df)
        name  age
    0   Jack   24
    1  Sarah   30
    2   Mike   21
    3  David   29

How to add a row to a Pandas DataFrame?

import pandas as pd
data= {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df=pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
new_row={'name':'jeames','age':28,'height':181}
df.loc[len(df)]=new_row
print(df)
         name  age  height
    0    Jack   24     175
    1   Sarah   30     165
    2    Mike   21     180
    3   David   29     170
    4  jeames   28     181

How to delete a row from a Pandas DataFrame?

data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
df.drop(1,inplace=True)
print(df)
        name  age  height
    0   Jack   24     175
    2   Mike   21     180
    3  David   29     170

How to select a range of rows in a Pandas DataFrame by position?

import pandas as pd

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
new_df=df[1:4]
new_df
name age height
1 Sarah 30 165
2 Mike 21 180
3 David 29 170

How to select a range of rows in a Pandas DataFrame by index label?

data={
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
data
    {'name': ['Jack', 'Sarah', 'Mike', 'David'],
     'age': [24, 30, 21, 29],
     'height': [175, 165, 180, 170]}
df=pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
df.set_index('name',inplace=True)
new_df=df.loc['Sarah':'David']
print(new_df)
           age  height
    name              
    Sarah   30     165
    Mike    21     180
    David   29     170

How to select rows of a Pandas DataFrame by a condition?

data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
new_df=df[df['height']>170]
print(new_df)
       name  age  height
    0  Jack   24     175
    2  Mike   21     180

How to sort a Pandas DataFrame by a column?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
new_df=df.sort_values(by='age',ascending=True)
print(new_df)
        name  age  height
    2   Mike   21     180
    0   Jack   24     175
    3  David   29     170
    1  Sarah   30     165

How to compute the sum of a column in a Pandas DataFrame?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
total_age=df['age'].sum()
total_age
    104

How to compute the mean of a column in a Pandas DataFrame?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David'],
    'age': [24, 30, 21, 29],
    'height': [175, 165, 180, 170]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
avg_age=df["age"].mean()
avg_age
    26.0

How to compute the median of a column in a Pandas DataFrame?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David', 'Zoe'],
    'age': [24, 30, 21, 29, 28],
    'height': [175, 165, 180, 170, 172]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
4 Zoe 28 172
median_height=df["height"].median()
median_height
    172.0

How to compute the standard deviation of a column in a Pandas DataFrame?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David', 'Zoe'],
    'age': [24, 30, 21, 29, 28],
    'height': [175, 165, 180, 170, 172]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
4 Zoe 28 172
std_age=df["age"].std()
std_age
    3.7815340802378072

How to compute the variance of a column in a Pandas DataFrame?

import pandas as pd

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David', 'Zoe'],
    'age': [24, 30, 21, 29, 28],
    'height': [175, 165, 180, 170, 172]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
4 Zoe 28 172
var_age=df['age'].var()
print(var_age)
    14.299999999999999

How to find the maximum and minimum values in a Pandas DataFrame?

import pandas as pd

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David', 'Zoe'],
    'age': [24, 30, 21, 29, 28],
    'height': [175, 165, 180, 170, 172]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
4 Zoe 28 172
max_age=df['age'].max()
max_age
    30
min_age=df['age'].min()
min_age
    21

How to find the maximum and minimum values of a specific row in a Pandas DataFrame?

# Create sample data
data = {
    'name': ['Jack', 'Sarah', 'Mike', 'David', 'Zoe'],
    'age': [24, 30, 21, 29, 28],
    'height': [175, 165, 180, 170, 172]
}
df = pd.DataFrame(data)
df
name age height
0 Jack 24 175
1 Sarah 30 165
2 Mike 21 180
3 David 29 170
4 Zoe 28 172
row=df.loc[2,['age','height']]  # numeric values of the row with index 2
max_value=row.max()
min_value=row.min()
max_value
    180
min_value
    21

How to replace specific values in a Pandas DataFrame?

data = {'name': ['Jack', 'Sarah', 'Mike', 'David']}
df = pd.DataFrame(data)
df
name
0 Jack
1 Sarah
2 Mike
3 David
df["name"]=df["name"].replace(to_replace=r"ck",value="bb",regex=True)
print(df)
        name
    0   Jabb
    1  Sarah
    2   Mike
    3  David

How to replace specific values with missing values in a Pandas DataFrame?

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e'], 'C': [0, 1, 2, 3, 4]})
df
A B C
0 1 a 0
1 2 b 1
2 3 c 2
3 4 d 3
4 5 e 4
import numpy as np
df=df.replace(3,np.nan)
df
A B C
0 1.0 a 0.0
1 2.0 b 1.0
2 NaN c 2.0
3 4.0 d NaN
4 5.0 e 4.0

How to fill missing values in a Pandas DataFrame?

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df
A B
0 1.0 5.0
1 2.0 NaN
2 NaN 7.0
3 4.0 8.0
df.fillna(value=0,inplace=True)
print(df)
         A    B
    0  1.0  5.0
    1  2.0  0.0
    2  0.0  7.0
    3  4.0  8.0
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df
A B
0 1.0 5.0
1 2.0 NaN
2 NaN 7.0
3 4.0 8.0
df.fillna(method='ffill',inplace=True)
print(df)
         A    B
    0  1.0  5.0
    1  2.0  5.0
    2  2.0  7.0
    3  4.0  8.0
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df
A B
0 1.0 5.0
1 2.0 NaN
2 NaN 7.0
3 4.0 8.0
df.fillna(method='bfill',inplace=True)
print(df)
         A    B
    0  1.0  5.0
    1  2.0  7.0
    2  4.0  7.0
    3  4.0  8.0
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df
A B
0 1.0 5.0
1 2.0 NaN
2 NaN 7.0
3 4.0 8.0
df.fillna(value={'A':-1,'B':-2},inplace=True)
df
A B
0 1.0 5.0
1 2.0 -2.0
2 -1.0 7.0
3 4.0 8.0
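
The method='ffill' and method='bfill' arguments of fillna are deprecated in recent pandas releases. A hedged equivalent of the forward-fill and backward-fill examples above uses the dedicated DataFrame.ffill and DataFrame.bfill methods:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})

# Propagate the last valid value forward into each gap
forward = df.ffill()

# Pull the next valid value backward into each gap
backward = df.bfill()

print(forward)
print(backward)
```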

How to drop missing values from a Pandas DataFrame?

import pandas as pd
df=pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8],
                   'C': [9, 10, 11, None]})
df
A B C
0 1.0 5.0 9.0
1 2.0 NaN 10.0
2 NaN 7.0 11.0
3 4.0 8.0 NaN
df_dropna=df.dropna()
print(df_dropna)
         A    B    C
    0  1.0  5.0  9.0

How to use aggregate functions in Pandas?

import pandas as pd
data = {'Name':['Tom', 'Tom', 'Mary', 'Mary', 'Jack', 'Jack'],
        'Subject':['Math', 'English', 'Math', 'English', 'Math', 'English'],
        'Score':[80, 70, 85, 75, 90, 95]}
data
    {'Name': ['Tom', 'Tom', 'Mary', 'Mary', 'Jack', 'Jack'],
     'Subject': ['Math', 'English', 'Math', 'English', 'Math', 'English'],
     'Score': [80, 70, 85, 75, 90, 95]}
df=pd.DataFrame(data)
df
Name Subject Score
0 Tom Math 80
1 Tom English 70
2 Mary Math 85
3 Mary English 75
4 Jack Math 90
5 Jack English 95
grouped=df.groupby(['Name','Subject']).mean()
grouped
                  Score
    Name Subject       
    Jack English   95.0
         Math      90.0
    Mary English   75.0
         Math      85.0
    Tom  English   70.0
         Math      80.0

How to group and aggregate in Pandas?

data = {'Name':['Tom', 'Tom', 'Mary', 'Mary', 'Jack', 'Jack'],
        'Subject':['Math', 'English', 'Math', 'English', 'Math', 'English'],
        'Score':[80, 70, 85, 75, 90, 95]}
df = pd.DataFrame(data)
df
Name Subject Score
0 Tom Math 80
1 Tom English 70
2 Mary Math 85
3 Mary English 75
4 Jack Math 90
5 Jack English 95
grouped=df.groupby(['Name'])['Score'].agg(['mean','max','min','count'])
print(grouped)
          mean  max  min  count
    Name                       
    Jack  92.5   95   90      2
    Mary  80.0   85   75      2
    Tom   75.0   80   70      2

How to convert data types in Pandas?

data = {'Name':['Tom', 'Tom', 'Mary', 'Mary', 'Jack', 'Jack'],
        'Subject':['Math', 'English', 'Math', 'English', 'Math', 'English'],
        'Score':['80', '70', '85', '75', '90', '95']}
df = pd.DataFrame(data)
df
Name Subject Score
0 Tom Math 80
1 Tom English 70
2 Mary Math 85
3 Mary English 75
4 Jack Math 90
5 Jack English 95
print(df.dtypes)
    Name       object
    Subject    object
    Score      object
    dtype: object
df['Score']=df['Score'].astype(int)
print(df.dtypes)
    Name       object
    Subject    object
    Score       int64
    dtype: object

How to use one-hot encoding in Pandas?

import pandas as pd

# Create a DataFrame
data = {'Name':['Tom', 'Mary', 'Jack', 'Tom', 'Mary'],
        'Gender':['M', 'F', 'M', 'M', 'F']}
df = pd.DataFrame(data)

# One-hot encode the Gender column
gender_encoding = pd.get_dummies(df['Gender'], prefix='Gender')

# Append the encoded columns to the original DataFrame
df = pd.concat([df, gender_encoding], axis=1)

# Print the encoded result
print(df)
       Name Gender  Gender_F  Gender_M
    0   Tom      M         0         1
    1  Mary      F         1         0
    2  Jack      M         0         1
    3   Tom      M         0         1
    4  Mary      F         1         0
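
Newer pandas versions (2.0 and later) return boolean True/False columns from get_dummies by default rather than the 0/1 integers shown above. Assuming integer indicators are wanted, one option is to pass dtype=int; this sketch also uses the columns argument, which replaces the original column in one step:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Mary', 'Jack'],
                   'Gender': ['M', 'F', 'M']})

# columns=['Gender'] encodes that column in place of the original;
# dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(encoded)
```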

How to summarize data with groupby in Pandas?

import pandas as pd

# Create a DataFrame
data = {'Year': [2018, 2018, 2019, 2019, 2020, 2020],
        'Month': [1, 2, 1, 2, 1, 2],
        'Sales': [100, 200, 300, 400, 500, 600]}
df = pd.DataFrame(data)

# Create a grouped object with groupby
grouped = df.groupby(['Year', 'Month'])

# Aggregate the grouped object
result = grouped.sum()

print(result)
                Sales
    Year Month       
    2018 1        100
         2        200
    2019 1        300
         2        400
    2020 1        500
         2        600

How to set the index with set_index in Pandas?

import pandas as pd
df = pd.DataFrame(
    {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
     'C': [1, 2, 3, 4, 5, 6, 7, 8],
     'D': [8, 7, 6, 5, 4, 3, 2, 1]}
)
df
A B C D
0 foo one 1 8
1 bar one 2 7
2 foo two 3 6
3 bar three 4 5
4 foo two 5 4
5 bar two 6 3
6 foo one 7 2
7 foo three 8 1
indexed=df.set_index(['A','B'])
print(indexed)
               C  D
    A   B          
    foo one    1  8
    bar one    2  7
    foo two    3  6
    bar three  4  5
    foo two    5  4
    bar two    6  3
    foo one    7  2
        three  8  1

How to reset the index with reset_index in Pandas?

import pandas as pd
df = pd.DataFrame(
    {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
     'C': [1, 2, 3, 4, 5, 6, 7, 8],
     'D': [8, 7, 6, 5, 4, 3, 2, 1]}
)
df
A B C D
0 foo one 1 8
1 bar one 2 7
2 foo two 3 6
3 bar three 4 5
4 foo two 5 4
5 bar two 6 3
6 foo one 7 2
7 foo three 8 1
indexed=df.set_index(['A','B'])
reseted=indexed.reset_index()
indexed
               C  D
    A   B          
    foo one    1  8
    bar one    2  7
    foo two    3  6
    bar three  4  5
    foo two    5  4
    bar two    6  3
    foo one    7  2
        three  8  1
reseted
A B C D
0 foo one 1 8
1 bar one 2 7
2 foo two 3 6
3 bar three 4 5
4 foo two 5 4
5 bar two 6 3
6 foo one 7 2
7 foo three 8 1

How to group and aggregate with agg in Pandas?

import pandas as pd
# Create sample data
df = pd.DataFrame({
    'column': ['A', 'A', 'B', 'B'],
    'other_column': [1, 2, 3, 4]
})
df
column other_column
0 A 1
1 A 2
2 B 3
3 B 4
grouped=df.groupby('column')
result = grouped.agg({'other_column': 'sum'})
result
            other_column
    column              
    A                  3
    B                  7

How to clean data with dropna in Pandas?

import pandas as pd
import numpy as np
data = {'Name': ['Alice', np.nan, 'Charlie', 'Diana', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Email': ['alice@gmail.com', np.nan, 'charlie@hotmail.com', 'diana@gmail.com', 'emily@hotmail.com']}

df = pd.DataFrame(data)
df
Name Age Email
0 Alice 25 alice@gmail.com
1 NaN 30 NaN
2 Charlie 35 charlie@hotmail.com
3 Diana 40 diana@gmail.com
4 Emily 45 emily@hotmail.com
df.dropna(axis='index',how='any',inplace=False)
Name Age Email
0 Alice 25 alice@gmail.com
2 Charlie 35 charlie@hotmail.com
3 Diana 40 diana@gmail.com
4 Emily 45 emily@hotmail.com

How to write data to an Excel file with DataFrame.to_excel in Pandas?

import pandas as pd
data = {'Name': ['Tom', 'Jerry', 'Mickey', 'Donald'],
        'Age': [28, 23, 31, 25],
        'Gender': ['M', 'M', 'M', 'M']}
df=pd.DataFrame(data)
df
Name Age Gender
0 Tom 28 M
1 Jerry 23 M
2 Mickey 31 M
3 Donald 25 M
df.to_excel("file_path.xlsx",index=False)  # replace with your output path

How to group and aggregate the positive and negative values of a column separately in pandas?

import pandas as pd
df = pd.DataFrame({'values': [1, -2, 3, -4, 5]})
df['positive']=df['values']>0
df['negative']=df['values']<0
result=df.groupby(['positive','negative']).agg({'values':'sum'})
print(result)
                       values
    positive negative        
    False    True          -6
    True     False          9
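
The two boolean helper columns above carry the same information twice. A more compact sketch, assuming the column contains no zeros (a zero would land in the 'negative' group here), groups by a single sign label built with np.where:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [1, -2, 3, -4, 5]})

# One label per row; zeros (none in this data) would count as 'negative'
sign = np.where(df['values'] > 0, 'positive', 'negative')

# groupby accepts an array of labels with the same length as the frame
result = df.groupby(sign)['values'].sum()
print(result)
```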

How to concatenate strings in pandas?

import pandas as pd
df=pd.DataFrame({'A': ['hello', 'world'], 'B': ['pandas', 'numpy']})
df
A B
0 hello pandas
1 world numpy
df['C']=df['A']+df['B']
print(df)
           A       B            C
    0  hello  pandas  hellopandas
    1  world   numpy   worldnumpy

How to do fuzzy (substring) matching on strings in a DataFrame?

import pandas as pd
data = pd.DataFrame({'name': ['Alice', 'Bob', 'Cathy', 'Daniel', 'Emily'], 'score': [85, 73, 90, 82, 79]})
data
name score
0 Alice 85
1 Bob 73
2 Cathy 90
3 Daniel 82
4 Emily 79
matched_rows=data['name'].str.contains('a')
result=data[matched_rows]
result
name score
2 Cathy 90
3 Daniel 82
matched_rows = data['name'].str.contains('a|e')
result = data[matched_rows]
result
name score
0 Alice 85
2 Cathy 90
3 Daniel 82
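
str.contains is case-sensitive by default, which is why 'Alice' and 'Emily' are not matched by their capital first letters above. A small sketch of a case-insensitive match using the case=False argument on the same data:

```python
import pandas as pd

data = pd.DataFrame({'name': ['Alice', 'Bob', 'Cathy', 'Daniel', 'Emily'],
                     'score': [85, 73, 90, 82, 79]})

# case=False matches both 'a' and 'A'
mask = data['name'].str.contains('a', case=False)
result = data[mask]
print(result)
```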

How to find duplicate rows in a DataFrame?

import pandas as pd
data=pd.read_csv('file_path')  # replace with the path to your CSV file
data
data.duplicated()
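
Since the CSV file is not included, here is a self-contained sketch on a small made-up DataFrame showing what duplicated returns and how drop_duplicates removes the flagged rows:

```python
import pandas as pd

# Hypothetical data in which the third row repeats the first
data = pd.DataFrame({'name': ['Tom', 'Jerry', 'Tom'],
                     'score': [80, 90, 80]})

# True marks a row that repeats an earlier row in every column
dup_mask = data.duplicated()

# Keep only the first occurrence of each row
deduped = data.drop_duplicates()

print(dup_mask)
print(deduped)
```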

How to find missing values in a DataFrame?

import pandas as pd
data=pd.read_csv('file_path',encoding='gbk')  # replace with the path to your CSV file
data

data.isnull().sum()

How to group and aggregate with groupby and agg in pandas?

import pandas as pd
df=pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Tom', 'Jerry', 'Tom', 'Jerry'],
    'Gender': ['M', 'M', 'M', 'M', 'M', 'M'],
    'Score': [80, 90, 75, 85, 70, 95]
})
df
Name Gender Score
0 Tom M 80
1 Jerry M 90
2 Tom M 75
3 Jerry M 85
4 Tom M 70
5 Jerry M 95
result=df.groupby('Name').agg({'Score':['mean','max']})
result
          Score    
           mean max
    Name           
    Jerry  90.0  95
    Tom    75.0  80

How to compute cumulative sums with cumsum on a DataFrame

import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
df
A B C
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
4 5 50 500
cumulative_sum=df.cumsum(axis=0)
cumulative_sum
A B C
0 1 10 100
1 3 30 300
2 6 60 600
3 10 100 1000
4 15 150 1500

How to filter a DataFrame by the values of a column?

import pandas as pd
data = {"name": ["Alice", "Bob", "Charlie", "David", "Emily"],
        "score": [80, 90, 85, 95, 92],
        "gender": ["F", "M", "M", "M", "F"]}
df = pd.DataFrame(data)
df
name score gender
0 Alice 80 F
1 Bob 90 M
2 Charlie 85 M
3 David 95 M
4 Emily 92 F
df_filtered=df.loc[df['gender']=='M']
print(df_filtered)
          name  score gender
    1      Bob     90      M
    2  Charlie     85      M
    3    David     95      M

How to compute a new column from the values in a Pandas DataFrame

import pandas as pd
data = {"name": ["Alice", "Bob", "Charlie", "David", "Emily"],
        "score": [80, 90, 85, 95, 92]}
df = pd.DataFrame(data)
df
name score
0 Alice 80
1 Bob 90
2 Charlie 85
3 David 95
4 Emily 92
df['weighted_score']=0.4*df['score']+0.6*100
print(df)
          name  score  weighted_score
    0    Alice     80            92.0
    1      Bob     90            96.0
    2  Charlie     85            94.0
    3    David     95            98.0
    4    Emily     92            96.8

matplotlib

Import matplotlib's pyplot module under the name plt

import matplotlib.pyplot as plt

Draw a bar chart

x = [1,2,3,4,5,6,7,8]
y = [3,1,4,5,8,9,7,2]
label=['A','B','C','D','E','F','G','H']

plt.bar(x,y,tick_label = label)



Draw a horizontal bar chart

x = [1,2,3,4,5,6,7,8]
y = [3,1,4,5,8,9,7,2]
label=['A','B','C','D','E','F','G','H']

plt.barh(x,y,tick_label = label)



Plot sin(x) for x between 0 and 10

import numpy as np
x = np.arange(0,10,0.1)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
plt.ylim(-1.5,1.5)
plt.xlabel('variable x')
plt.ylabel('value y')
plt.title('Trigonometric function')
plt.grid()
plt.axhline(y=0.8,c='r')
plt.axvspan(xmin=4, xmax=6, facecolor='r', alpha=0.3) # vertical band, perpendicular to the x-axis
plt.axhspan(ymin=-0.2, ymax=0.2, facecolor='y', alpha=0.3)  # horizontal band, perpendicular to the y-axis
plt.text(3.2, 0, 'sin(x)', color='r')
plt.annotate('maximum',xy=(np.pi/2, 1),xytext=(np.pi/2+1, 1),
             color='r',
             arrowprops=dict(arrowstyle='->',  color='r'))
plt.legend()
    <matplotlib.legend.Legend at 0x7f8228a9e2e0>



Plot $\sin(x)$, $\sin(x+\pi/2)$ and $\sin(x+\pi)$, showing legend entries for the first two only

y1 = np.sin(x)
y2 = np.sin(x + np.pi * 0.5)
y3 = np.sin(x + np.pi)
plt.plot(x, y1, label = 'first')
plt.plot(x, y2, label = 'second')
plt.plot(x, y3)
plt.legend()
    <matplotlib.legend.Legend at 0x7f81f8173970>
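
The same figure can be sketched with Matplotlib's object-oriented interface. Labels that start with an underscore are skipped by the legend, so '_nolegend_' makes the omission of the third curve explicit. The Agg backend line is an assumption for running without a display and can be dropped in a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='first')
ax.plot(x, np.sin(x + np.pi * 0.5), label='second')
ax.plot(x, np.sin(x + np.pi), label='_nolegend_')  # excluded from the legend

legend = ax.legend()
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```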


