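Could a Python expert help a brother out

带带东百狗 posted on 2024-11-5 21:12:41

I've been teaching myself Python for a while and it feels completely useless. I described my request to ChatGPT, but the script it gave me can't sort out the data in my Excel file, as shown in the image below. My sequencing data has 4 samples, and each sample has its own reads. I need to align the sequences that are identical across the 4 samples onto the same row, together with their corresponding reads.

Below is the code GPT gave me. After running it, I found that the reads and sequences don't match.

Could an expert help a brother out?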

import pandas as pd

# Load the Excel file
file_path = 'a.xlsx'  # Replace with your actual file path
df = pd.read_excel(file_path)

# Create a new DataFrame that aligns unique sequences and their corresponding reads
aligned_data = pd.DataFrame({
    'Sequence': df['D2RPL23_EBOX_SEQUENCES'],  # Using sequences from column B as the identifier
    'D2RPL23_EBOX_read': df['D2RPL23_EBOX_READS'],  # Reads from column A
    'D2RPL23_ELF5_read': df['D2RPL23_ELF5_READS'],  # Reads from column C
    'D7RPL23_EBOX_read': df['D7RPL23_EBOX_READS'],  # Reads from column E
    'D7RPL23_ELF5_read': df['D7RPL23_ELF5_READS'],  # Reads from column G
}).dropna(subset=['Sequence'])  # Ensure rows are aligned by dropping rows without a sequence

# Remove duplicates by sequence to ensure each sequence appears only once
aligned_data = aligned_data.drop_duplicates(subset='Sequence')

# Save the aligned data to a new Excel file
aligned_data.to_excel("Aligned_Sequences_with_Reads.xlsx", index=False)

print("Aligned data has been saved to 'Aligned_Sequences_with_Reads.xlsx'.")

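别克君悦 posted on 2024-11-5 21:21:04

When the same sequences are repeated across your 4 samples, you can run into "cannot align" or "mismatched" problems. We can try to meet your requirement with more precise code that aligns identical sequences onto the same row and merges the corresponding reads values along with them. Here is an improved version of the code:

Example code

import pandas as pd

# Load the Excel file
file_path = 'a.xlsx'  # Make sure the path is correct
df = pd.read_excel(file_path)

# Check that the required columns exist
required_columns = [
    'D2RPL23_EBOX_SEQUENCES',
    'D2RPL23_EBOX_READS',
    'D2RPL23_ELF5_READS',
    'D7RPL23_EBOX_READS',
    'D7RPL23_ELF5_READS',
]
missing_cols = [col for col in required_columns if col not in df.columns]
if missing_cols:
    raise ValueError(f"The Excel file is missing the following columns: {', '.join(missing_cols)}")

# Group the data by the sequence column, then fill each sample's reads into the matching row
aligned_data = df.groupby('D2RPL23_EBOX_SEQUENCES', as_index=False).agg({
    'D2RPL23_EBOX_READS': 'first',  # For each unique sequence, take the first value
    'D2RPL23_ELF5_READS': 'first',
    'D7RPL23_EBOX_READS': 'first',
    'D7RPL23_ELF5_READS': 'first',
})

# Remove rows without a sequence, and deduplicate
aligned_data = aligned_data.dropna(subset=['D2RPL23_EBOX_SEQUENCES']).drop_duplicates(subset='D2RPL23_EBOX_SEQUENCES')

# Save the result to a new Excel file
aligned_data.to_excel("Aligned_Sequences_with_Reads.xlsx", index=False)

print("Aligned data has been saved to 'Aligned_Sequences_with_Reads.xlsx'.")

Explanation

1. The code groups by the sequence column (D2RPL23_EBOX_SEQUENCES) and aligns each sample's reads values by sequence.
2. We use .agg() to pick the reads value for each sequence, so that nothing gets interleaved or shifted during alignment.
3. The aligned data is finally saved to a new Excel file, Aligned_Sequences_with_Reads.xlsx.

Caveats

1. Make sure the D2RPL23_EBOX_SEQUENCES column really does contain unique sequences to align on.
2. If some reads still don't match after grouping, you may need to check whether the source data is formatted consistently.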
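For caveat 1, one quick way to see whether a sequence column really holds unique values is to count the repeats. This is only a sketch: `repeated_sequences` is a made-up helper name (not part of pandas), and the tiny `demo` frame stands in for the real sheet.

```python
import pandas as pd


def repeated_sequences(df: pd.DataFrame, seq_col: str) -> pd.Series:
    """Return how often each non-unique sequence occurs in seq_col."""
    seqs = df[seq_col].dropna()
    # keep=False marks every occurrence of a repeated value, not just the later ones
    return seqs[seqs.duplicated(keep=False)].value_counts()


# Stand-in data; with the real file this would be pd.read_excel('a.xlsx')
demo = pd.DataFrame({"D2RPL23_EBOX_SEQUENCES": ["AAA", "CCC", "AAA", None]})
repeated = repeated_sequences(demo, "D2RPL23_EBOX_SEQUENCES")
```

An empty result means the column is already unique; anything else lists the sequences that would be collapsed by the grouping step.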

刘培茄中校 posted on 2024-11-5 22:01:44

Can't follow it, my head hurts. Wish I could do this too.

今天NGA被封了吗 posted on 2024-11-6 00:13:31

Encode the 4 groups as a new categorical column (say A, B, C, D), so the table goes from the current 8 columns × n rows to 3 columns × 4n rows, then groupby on sequence and you're done.
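A rough sketch of that reshape, in case it helps. It assumes the other three sequence columns are named like the first one (`D2RPL23_ELF5_SEQUENCES` and so on), which the thread never confirms, and that reads for a sequence repeated within one sample should be summed. `to_long` and `align_by_sequence` are made-up helper names, and the sample prefix plays the role of the A/B/C/D label.

```python
import pandas as pd

# Assumed sample prefixes; only D2RPL23_EBOX_SEQUENCES is confirmed by the thread.
SAMPLES = ["D2RPL23_EBOX", "D2RPL23_ELF5", "D7RPL23_EBOX", "D7RPL23_ELF5"]


def to_long(df: pd.DataFrame, samples=SAMPLES) -> pd.DataFrame:
    """Stack the per-sample (sequence, reads) pairs into 3 columns."""
    parts = []
    for s in samples:
        seq_col, reads_col = f"{s}_SEQUENCES", f"{s}_READS"
        part = df[[seq_col, reads_col]].dropna(subset=[seq_col])
        part = part.rename(columns={seq_col: "sequence", reads_col: "reads"})
        parts.append(part.assign(sample=s))  # the new categorical column
    return pd.concat(parts, ignore_index=True)


def align_by_sequence(df: pd.DataFrame, samples=SAMPLES) -> pd.DataFrame:
    """One row per sequence, one reads column per sample, 0 where absent."""
    long_df = to_long(df, samples)
    wide = (
        long_df.groupby(["sequence", "sample"])["reads"]
        .sum()
        .unstack("sample", fill_value=0)
    )
    # Keep the sample order fixed, even if a sample had no rows at all
    wide = wide.reindex(columns=samples, fill_value=0)
    wide.columns.name = None
    return wide.reset_index()


# Stand-in for pd.read_excel('a.xlsx'): shorter samples are padded with None
demo = pd.DataFrame({
    "D2RPL23_EBOX_READS": [10, 5, 1],
    "D2RPL23_EBOX_SEQUENCES": ["AAA", "CCC", "GGG"],
    "D2RPL23_ELF5_READS": [7, 2, None],
    "D2RPL23_ELF5_SEQUENCES": ["CCC", "AAA", None],
    "D7RPL23_EBOX_READS": [4, None, None],
    "D7RPL23_EBOX_SEQUENCES": ["TTT", None, None],
    "D7RPL23_ELF5_READS": [3, 9, 6],
    "D7RPL23_ELF5_SEQUENCES": ["GGG", "TTT", "AAA"],
})
aligned = align_by_sequence(demo)
```

Filling absent sequences with 0 is a choice made here for illustration; leaving them as missing values may suit the analysis better.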

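粉色狂战斧 posted on 2024-11-6 08:07:08

As I understand it, your seq is basically an index, right? Just split the table into four, rename each seq column and use it as the index, then merge them and that should do it.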
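Something along these lines, as a sketch only. The three `*_SEQUENCES` column names other than `D2RPL23_EBOX_SEQUENCES` are guesses based on the reads columns, `split_tables` and `merge_tables` are made-up names, and summing repeated sequences within a sample is an assumption about what is wanted.

```python
from functools import reduce

import pandas as pd

# Assumed sample prefixes; only D2RPL23_EBOX_SEQUENCES is confirmed by the thread.
SAMPLES = ["D2RPL23_EBOX", "D2RPL23_ELF5", "D7RPL23_EBOX", "D7RPL23_ELF5"]


def split_tables(df: pd.DataFrame, samples=SAMPLES) -> list:
    """One small table per sample, indexed by sequence."""
    tables = []
    for s in samples:
        seq_col, reads_col = f"{s}_SEQUENCES", f"{s}_READS"
        t = df[[seq_col, reads_col]].dropna(subset=[seq_col])
        t = t.rename(columns={seq_col: "sequence"})
        # Summing makes the index unique, so the merge cannot multiply rows
        tables.append(t.groupby("sequence")[[reads_col]].sum())
    return tables


def merge_tables(tables: list) -> pd.DataFrame:
    """Outer-merge all tables on their sequence index."""
    merged = reduce(
        lambda left, right: left.merge(
            right, left_index=True, right_index=True, how="outer"
        ),
        tables,
    )
    merged = merged.fillna(0).rename_axis("sequence").sort_index()
    return merged.reset_index()


# Stand-in for pd.read_excel('a.xlsx'): shorter samples are padded with None
demo = pd.DataFrame({
    "D2RPL23_EBOX_READS": [10, 5, 1],
    "D2RPL23_EBOX_SEQUENCES": ["AAA", "CCC", "GGG"],
    "D2RPL23_ELF5_READS": [7, 2, None],
    "D2RPL23_ELF5_SEQUENCES": ["CCC", "AAA", None],
    "D7RPL23_EBOX_READS": [4, None, None],
    "D7RPL23_EBOX_SEQUENCES": ["TTT", None, None],
    "D7RPL23_ELF5_READS": [3, 9, 6],
    "D7RPL23_ELF5_SEQUENCES": ["GGG", "TTT", "AAA"],
})
merged = merge_tables(split_tables(demo))
```

The outer merge keeps sequences that appear in only some samples; switching `how` to `"inner"` would keep only sequences present in all four.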

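1747032911 posted on 2024-11-6 08:13:53

The 编程随想 (Program Think) forum

180 posted on 2024-11-6 08:26:04

The 编程随想 (Program Think) forum

fullout2020 posted on 2024-11-6 09:14:39

No idea, no idea

请叫我豆奶酱 posted on 2024-11-6 20:32:50

No idea, no idea

towenyu posted on 2024-11-7 00:49:29

Wish I could do this too

小李飞刀 posted on 2024-11-7 08:00:10

lecter posted on 2024-11-7 09:08:52

Wish I could do this too

刁迈乎 posted on 2024-11-7 09:54:02

The 编程随想 (Program Think) forum

lmljk posted on 2024-11-7 11:26:02

Foreign guest, quit grinding away at computer science and go tinker with that Nobel Prize of yours

乡下南瓜0 posted on 2024-11-7 12:27:43

Can't you get the answer out of ChatGPT?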

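qwertttass posted on 2024-11-8 08:16:53

Or just loop over it once with a two-level dictionary; the amount of computation is probably tiny. The pandas approach the other 哥谭 members mentioned would probably work too.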
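The dictionary version could look roughly like this, using nothing but the standard library. It assumes each row has 8 cells in the order reads, sequence, reads, sequence, ... (the column order implied by the comments in the first post's code), and that repeated sequences within a sample are summed. `align_with_dict` is a made-up name.

```python
from collections import defaultdict

# Sample labels, in the sheet's left-to-right order
SAMPLES = ["D2RPL23_EBOX", "D2RPL23_ELF5", "D7RPL23_EBOX", "D7RPL23_ELF5"]


def align_with_dict(rows, samples=SAMPLES):
    """Build {sequence: {sample: reads}} in a single pass over the rows."""
    table = defaultdict(lambda: dict.fromkeys(samples, 0))
    for row in rows:
        for i, sample in enumerate(samples):
            reads, seq = row[2 * i], row[2 * i + 1]
            # Skip empty cells: None, or NaN (the only value where x != x)
            if not isinstance(seq, str) or reads is None or reads != reads:
                continue
            table[seq][sample] += reads
    return dict(table)


# Stand-in rows; shorter samples are padded with None
demo_rows = [
    (10, "AAA", 7, "CCC", 4, "TTT", 3, "GGG"),
    (5, "CCC", 2, "AAA", None, None, 9, "TTT"),
    (1, "GGG", None, None, None, None, 6, "AAA"),
]
aligned = align_with_dict(demo_rows)
```

With the real sheet, the rows could come from something like `df.itertuples(index=False)` after `pd.read_excel`, provided the columns are in that reads/sequence order.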

Rogerzzx posted on 2024-11-8 09:49:00

The 编程随想 (Program Think) forum


哥谭's Archiver
