读书笔记-Python生物信息学数据管理

2020-10-04

Python, 数据处理

本文字数: 18.3k 阅读时长: 87 分钟

当生物与Python交织之时

故事将就此展开

书籍简介 📚
写在前面 ✍
故事就此展开 📖

接触 计算机技术(IT) 也已经有些年头了

计算机技术表现出来的 高效 的生产力着实让人赞叹, 让人有无穷的动力参与其中

如今入了生物的坑, 自然一直有探索各种生物和信息技术交融的知识领域

但是浅尝辄止的兴趣式的游玩, 终究是不太够用的

倒不如静下心来, 实际地系统地学习一些内容

⚓️书籍简介 📚

挑选的书是从 江科大长山校区图书馆 里翻找到的

位于 自然科学图书第2阅览室(5楼), 5号架B面5层, 索书号: Q811.4/51

@信息来自江科大图书馆v5.6书目检索系统

题名/责任者: Python生物信息学数据管理/(意) Allegra Via, (德) Kristian Rother, (意) Anna Tramontano著卢宏超, 陈一情, 李绍娟译

出版发行项: 北京:电子工业出版社,2017.01

ISBN及定价: 978-7-121-30382-1/CNY69.00

载体形态项: 17, 318页:图;26cm

统一题名: Managing your biological data with Python

中图法分类号:Q811.4-39

版本附注:由Taylor & Francis出版集团旗下，CRC出版公司授权翻译出版

提要文摘附注:

本书由六部分组成：Python语言基本介绍，语言所有成分介绍，高级编程，数据可视化，生物信息通用包Biopython，最后给出20个"编程秘笈”，范围涵盖了从二级结构预测、多序列比对到蛋白质三维结构的广泛话题。此外，附录还包括了大量的生物信息常用资源的信息。

不过找也费了番功夫, 而且发现学校关于生物信息学方面的书还真的不是非常多

可见生物信息学的确还是比较新的一门学科

Library Genesis: Via, Allegra; Rother, Kristian; Tramontano, Anna - Managing Your Biological Data with Python (libgen.rs)

⚓️写在前面 ✍

此读书笔记会忽略亦或者增加一些内容, 比如

忽略基础的Python语法: 本书还是使用了大量的例子来讲解Python的基础语法的, 但是对于笔者来说已无必要. 需要补充基础语法的小伙伴可以看这里 => Python3 教程 | 菜鸟教程 (runoob.com)
增加一些涉及到的生物学理论细节: 这主要是补充书中的内容, 使得易于理解相关的例子

当然, 其实主要是什么值得补充就会补充的 :happy:

本笔记按照原书的目录顺序展开, 但是仅保留少部分的标题, 主要以技术流程来整合笔记

本书的代码使用的Python的版本为 Python 2.6.6

笔者使用由 Anaconda 构建的Python版本为 python-2.7.13 的虚拟环境来复现书中的代码

期间可能会尝试使用 Python3.8 的版本来重构里面的代码

没有特别说明, 笔记里的Python版本则为 python-2.7.13

事实上, 掌握充足的生物学知识是非常重要的

计算机技术只是帮助我们自动化地完成了基于生物学认识的工作

本书也只是讲解了如何将Python引入分析以及数据管理的工作流

当你掌握真正的算法和理论实验背景, 你将可以更好地利用Python这个强大的工具

多多做实验, 看看理论书吧

⚓️故事就此展开 📖

⚓️第一部分入门

本书的作者描述了他希望这本书带给读者什么

我们希望本书能成为帮助读者每天攀登数据之山的地图, 希望读者能让编程为工作提供便利, 而不必先成为一个专业的软件开发者

作者建议像学习德语一样, 学习Python语法, 把Python标准库 当作字典来查

Python标准库 :

书中的代码和一些总结参考:

⚓️第1章 Python shell

关于Python的安装可以参考这篇文章 Anaconda之路

主要介绍了如何使用Anaconda来构建Python虚拟环境

Anaconda是一个可以管理Python包和Python环境的软件

所谓Python包指的是别人写(第三方)的有一定功能的Python代码

所谓Python环境其实就是Python代码运行时调用的具体版本的Python解释器

所谓管理指的就是我们可以通过Anaconda的conda指令, 随意地创建包含有我们指定的Python包的指定版本的Python环境, 这样做的目的是解决兼容性问题(相互依赖问题)

比如这本书的代码是Python2时代编写的, 我自然是不可以用Python3的解释器来运行的.

那就可以创建一个Python2的环境来复现书中的代码, 然后再创建一个Python3的环境来重新编写他们(重构)

这样我们就可以复现书中的代码了, 而且还可以轻松的重构老旧的代码

⚓️1.2 案例：计算ATP水解的ΔG

吉布斯自由能:

在热力学里，吉布斯能（英语：Gibbs Free Energy），又称吉布斯自由能、吉布斯函数、自由焓，常用英文字母标记。吉布斯能是国际化学联会建议采用的名称。吉布斯能是描述系统的热力性质的一种热力势，定义为[1]:101[2]:128-129

其中，是系统的内能，是绝对温度，是熵，是压强，是体积，是焓。

$$
\Delta G=\Delta G^{0}+R T * \ln \left([A D P] *\left[P_{i}\right] /[A T P]\right)
$$

$$
[?]: 指 ? 的摩尔浓度 \\ ΔG : 吉布斯自由能 \\ ΔG^0 : 标准反应吉布斯能 \\ R : 摩尔气体常数 \\ T : 温度 \\
$$

python的 math 模块:

>>> import math  # 导入数学计算模块
>>> dir(math)    # dir()函数可以获取传入的模块的内部的函数
['__doc__', '__name__', '__package__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'hypot', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']

>>> math.log(...)  # 传入参数, 返回它的自然对数
>>> help(math.log) # 使用`帮助`这个内建函数查看某个函数的使用说明
Help on built-in function log in module math:
log(...)
  log(x[, base])
    Return the logarithm of x to the given base.
    If the base not specified, returns the natural logarithm (base e) of x.

⚓️1.3.4 计算

这里引入一篇博文来补充各种符号的意思

python中的各种符号（欢迎补充）_杉木人-CSDN博客

特别之处在Python中, ^ 符号

符号	功能	例子
`<sup>`	按位异或运算符：当两对应的二进位相异时，结果为1	(a b) 输出结果 49 ，二进制解释： 0011 0001
`**`	幂 - 返回x的y次幂	a**b 为10的20次方，输出结果 100000000000000000000

参见 Python2.7.18文档 @9.9.1. 将运算符映射到函数

计算流程:

>>> ATP = 3.5
>>> ADP = 1.8
>>> Pi = 5.0
>>> R = 0.00831
>>> T = 298
>>> deltaG0 = -30.5
>>>
>>> import math
>>> deltaG0 + R * T * math.log(ADP * Pi / ATP)
-28.161154161098693

通过导入各种参数, 可以直接计算出吉布斯自由能

⚓️第2章第一个Python程序

⚓️2.2案例：如何计算胰岛素序列中的氨基酸频率 *

计算流程:

# insulin [Homo sapiens] GI:386828
# extracted 51 amino acids of A+B chain
insulin = "GIVEQCCTSICSLYQLENYCNFVNQHLCGSHLVEALYLVCGERGFFYTPKT"

for amino_acid in "ACDEFGHIKLMNPQRSTVWY":
    number = insulin.count(amino_acid)
    print amino_acid,number

A 1
C 6
D 0
E 4
F 3
G 4
H 2
I 2
K 1
L 6
M 0
N 3
P 1
Q 3
R 1
S 3
T 3
V 4
W 0
Y 4

“字符串1”.count(“字符串2”) : 返回字符串1中含有多少字符串2的数量
- 7.1. string — 常见的字符串操作 — Python 2.7.18 文档
- string.``count(s, sub[, start[, end]])
  
  Return the number of (non-overlapping) occurrences of substring sub in string s[start:end]. Defaults for start and end and interpretation of negative values are the same as for slices.
注意缩进错误：IndentationError: unindent does not match any outer indentation level

切不可混用 空格缩进和制表符缩进, 否则你会疯的

Python 的反斜杠换行:

1
2
3

>>> 3 + 5 + \
... 7
15

内建函数(build-in): 2. 内置函数 — Python 2.7.18 文档

比如 len() 可以直接调用, 并返回结果
对象方法: 某个对象(类的实例)的内部的方法(函数), 通过.符号调用, 如 " ".count(' ')
in迭代:

1
2
3

for ... in ...:
    ...
# `in` 可以迭代字符串等

补充 in

, 对 print 输出的影响:

print 1,2,   # 注意末尾的逗号可以抑制换行符
print 3

1 2 3

转义字符\:
引号之间的嵌套:
random模块: 9.6. random — 生成伪随机数 — Python 2.7.18 文档

⚓️2.5自测题

2009年, Elizabeth H. Blackburn, Carol W. Greide 和 Jack W. Szostak 因发现端粒酶功能而获得诺贝尔奖, 端粒酶负责延伸染色体的末端. 通过NCBI的蛋白质数据库检索人类亚型I的端粒的11132个残基序列, 试问其中哪种氨基酸出现的最为频繁?

⚓️第二部分数据管理

⚓️第3章分析数据列

译者推荐标准库模块 collections :

8.3. collections — High-performance container datatypes — Python 2.7.18 文档

⚓️3.2案例：树突长度

作者使用了 ImageJ 来计算树突的长度并把值传入文本文件

软件指路: ImageJ

in 迭代:

1
2
3

# for循环中按行迭代文件的方法
for ... in open(...):
    ...

内建函数(内置函数): 2. 内置函数 — Python 2.7.18 文档

1 2	# 内建函数 sum(); min(); max(); len()

% 格式化输出::
文件的readlines 方法: file.readlines([sizehint]) — Python 2.7.18 文档

此方法可以将文件的内容按行组合成列表返回

硬编码(hard-coding):
不关闭文件的隐患:

当在其他时候读取文件的时候, 之前写入的部分内容因为没有超过缓冲区大小, 还处于内存内.

这将导致有时候你将无法获取想要的内容, 切记及时关闭文件.

缓冲区:
文件的 writelines 方法:

将列表按项无空格连续写入文件, 列表内只支持 字符串 .

file.writelines(sequence)¶

Write a sequence of strings to the file. The sequence can be any iterable object producing strings, typically a list of strings. There is no return value. (The name is intended to match readlines(); writelines() does not add line separators.)

" ".join()方法:

返回一个由可迭代对象中的字符串串联" "组成的字符串。

str.join(iterable)¶

Return a string which is the concatenation of the strings in iterable. If there is any Unicode object in iterable, return a Unicode instead. A TypeError will be raised if there are any non-string or non Unicode object values in iterable. The separator between elements is the string providing this method.

⚓️第4章解析数据记录

⚓️4.2 案例：整合质谱数据，转化到代谢通路中

UniProt 形式: UniProt

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
- ~ AC:
CSV 逗号分隔文件:
TSV 制表符分隔文件:
MS 质谱:
作者提到的软件与网站:
- Mascot search engine: Protein identification software for mass spec(质谱) data
  
  Welcome to the home of Mascot software, the benchmark for identification, characterisation and quantitation of proteins using mass spectrometry data. Here, you can learn more about the tools developed by Matrix Science to get the best out of your data, whatever your chosen instrument.
- Reactome Pathway Database: 代谢通路资源获取
  
  Reactome is a free, open-source, curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.
  
  If you use Reactome in Asia, we suggest using our Chinese mirror site at reactome.ncpsb.org.

# proteins participating in cell cycle 参与细胞通路的蛋白
list_a =[]
for line in open("cell_cycle_proteins.txt") :
    list_a.append(line.strip())
print list_a

# proteins expressed in a given cancer cell 在给定的细胞中表达的蛋白
list_b =[]
for line in open("cancer_cell_proteins.txt") :
    list_b.append(line.strip())
print list_b

for protein in list_a:  #迭代列表a
    if protein in list_b:  #判断是不是在列表b中
        print protein, 'detected in the cancer cell'
    else:
        print protein, 'not observed'

7.1. The if statement

# if 语句
if ...:
    ...
elif ...:
    ...
elif ...:
    ...
else:
    ....

列表推导式: 5.1.4. 列表推导式

bases = ['A','C','T','G']
seq = 'GGACXCAGXXGATT'
seqlist = [base for base in seq if base in bases]
print seqlist

['G', 'G', 'A', 'C', 'C', 'A', 'G', 'G', 'A', 'T', 'T']

ESR1蛋白:

Protein : Estrogen receptor 雌激素受体

Gene : ESR1

Organism : Homo sapiens (Human)
- 蛋白详情ESR1 - Estrogen receptor - Homo sapiens (Human) - ESR1 gene & protein (uniprot.org)
- ESR1蛋白序列: https://www.uniprot.org/uniprot/P03372.fasta
  
  >sp|P03372|ESR1_HUMAN Estrogen receptor OS=Homo sapiens OX=9606 GN=ESR1 PE=1 SV=2
  
  MTMTLHTKASGMALLHQIQGNELEPLNRPQLKIPLERPLGEVYLDSSKPAVYNYPEGAAY
  EFNAAAAANAQVYGQTGLPYGPGSEAAAFGSNGLGGFPPLNSVSPSPLMLLHPPPQLSPF
  LQPHGQQVPYYLENEPSGYTVREAGPPAFYRPNSDNRRQGGRERLASTNDKGSMAMESAK
  ETRYCAVCNDYASGYHYGVWSCEGCKAFFKRSIQGHNDYMCPATNQCTIDKNRRKSCQAC
  RLRKCYEVGMMKGGIRKDRRGGRMLKHKRQRDDGEGRGEVGSAGDMRAANLWPSPLMIKR
  SKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINW
  AKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEG
  MVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTLKSLEEKDHIHRVLD
  KITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLL
  LEMLDAHRLHAPTSRGGASVEETDQSHLATAGSTSSHSLQKYYITGEAEGFPATV

# 读取整个FASTA记录, 找到里面所属为'Homo sapiens'的序列

fasta_file = open('SwissProt.fasta','r')  # 读取多序列Fasta文件
out_file = open('SwissProtHuman.fasta','w') # 输出文件准备

seq = ''  # 序列存贮变量
for line in fasta_file:  #按行迭代
    if line[0] == '>' and seq == '':  # 判断是不是当前在读记录的第一行
        header = line
    elif line[0] != '>':  # 判断不是首行(没有'>'), 直接将此行放入序列
        seq = seq + line
    elif line[0] == '>' and seq != '': # 如果是首行且序列是空的话, 说明上一条的记录读完了
        if "Homo sapiens" in header:   # 这里先判断上一条记录的记录是不是我们要的,是就存起来
            out_file.write(header + seq)
        seq = ''                       # 清空
        header = line                  # 上一条纪录搞完啦, 把现在在读的这一行保存至haeder
        
if "Homo sapiens" in header:           # 全部读完了, 但是最后一条记录还没处理, 这里处理掉
    out_file.write(header + seq)
out_file.close()

⚓️第5章搜索数据

⚓️5.2案例：将RNA序列翻译为相应的蛋白质序列

Synthetic nucleotide sequence of the human GSH transferase pi gene - Nucleotide - NCBI (nih.gov)

codon_table={
'GCU':'A','GCC':'A','GCA':'A','GCG':'A','CGU':'R',
'CGC':'R','CGA':'R','CCS':'R','AGA':'R','AGG':'R',
'UCU':'S','UCC':'S','UCA':'S','UCG':'S','AGU':'S',
'AGC':'S','AUU':'T','AUC':'I','AUA':'T','AUU':'I',
'AUC':'I','AUA':'I','UUA':'L','UUG':'L','CUU':'L',
'CUC':'L','CUA':'L','CUG':'L','GGU':'G','GGC':'G',
'GGA':'G','GSG':'G','GUU':'V','GUC':'V','GUA':'V',
'GUG':'V','ACU':'T','ACC':'T','ACA':'T','ACG':'T',
'CCU':'P','CCC':'P','CCA':'P','CCG':'P','AAU':'N',
'AAC':'N','GAU':'D','GAC':'D','UGU':'C','UGC':'C',
'CAA':'Q','CAG':'Q','GAA':'E','GAG':'E','CAU':'H',
'CAC':'H','AAA':'K','AAG':'K','UUU':'F','UUC':'F',
'UAU':'Y','UAC':'Y','AUG':'M','UGG':'W',
'UAG':'STOP','UGA':'STOP','UAA':'STOP'}

rna = ''
for line in open('A06662-RNA.fasta'):
    if not line.startswith('>'):
        rna = rna +line.strip()
# 一次翻译进行一次阅读框读取
for frame in range(3):
    prot = ''
    print 'Reading frame ' + str(frame + 1)
    for i in range(frame, len(rna), 3):
        codon = rna[i:i+3]
        if codon in codon_table:
            if codon_table[codon] == 'STOP':
                prot= prot + '*'
            else:
                prot = prot + codon_table[codon]
        else:
            # 如果要识别的短序列不在table里, 则用'-'替代
            prot = prot + '-'
        
    # 格式化输出成48列的块
    i = 0
    while i < len(prot):
        print prot[i:i + 48]
        i = i + 48

str.startswith(prefix[, start[, end]]) 5. 内置类型 — Python 2.7.18 文档

如果字符串以指定的 prefix 开始则返回 True，否则返回 False。 prefix 也可以为由多个供查找的前缀构成的元组。如果有可选项 start，将从所指定位置开始检查。如果有可选项 end，将在所指定位置停止比较。
核酸序列(开放)阅读框(Open Reading Frame):
- 由于核糖体翻译mRNA时, 三个碱基翻译为一个氨基酸, 那么同一个序列理论上存在三个不同的阅读框
- 对应于移码突变
  
  对于序列 ATCGATCG 有 ATC GAT CG, A TCG ATC G, AT CGA TCG 三种方法解读
- ORF指的是DNA上的序列，从5’端翻译起始密码子ATG到终止密码子(TAA,TAG,TGA)的蛋白质编码序列。对于任意给定的一段DNA，有两个问题需要考虑，
  - 一是DNA双链中的哪条是编码链
  - 二是编码区究竟从第一个碱基开始进行编码
    所以每条链都有潜在的3种ORF，而对于双链DNA来说就有6种可能的ORF。也就是说先从给定的DNA单链为模版，分别从5’-3’方向第123个碱基开始翻译，再以互补链为模版，分别从3’-5’方向的第123个碱基开始翻译，得到另外3种翻译结果。正链上的正向读码框forward，负链的为反向读码框reverse。6个潜在ORF中，一般选择中间没有被终止密码子隔开的最大的读码框为正确的结果。
    
    @4️⃣ 核酸序列特征分析(1):开放阅读框识别 - 简书 (jianshu.com)
文件读取的.next()方法: file.next 2.7文档

file.next()

A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing). In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
for循环中的 else : break-and-continue-statements-and-else-clauses-on-loops

>>> for n in range(2, 10):
...     for x in range(2, n):
...         if n % x == 0:
...             print n, 'equals', x, '*', n/x
...             break
...     else:   # 此else与第二个for对应, 当发生了 break 后, 执行这里
...         # loop fell through without finding a factor
...         print n, 'is a prime number'
...
2 is a prime number
3 is a prime number
4 equals 2 * 2
5 is a prime number
6 equals 2 * 3
7 is a prime number
8 equals 2 * 4
9 equals 3 * 3

预测蛋白质序列中的无序(环)区域: 既不是α螺旋也不是β折叠的区域, 就是无序的.
- 核心思想: 每个氨基酸都有一个特定的二级结构元件倾向, 通过大量已知的蛋白质结构(PDB)的二级结构元件中各类氨基酸类型出现的频率f,来估计倾向.
- 使用1-f计算: f是以有二级结构元件的部分为基数的(就是有序的), 那么在这样的基数里某种氨基酸出现的越多(频数接近1)则倾向于有序的可能性越大.
- 对20个氨基酸倾向的值进行标准化和引入负值, 设定一个是否为无序的阈值, 大于阈值为无序(环), 小于为有序(二级结构).

propensities ={
'N': 0.2299, 'P':  0.5523, 'Q':-0.18770, 'A':-0.2615,
'R':-0.1766, 'S':  0.1429, 'C':-0.01515, 'T': 0.0089,
'D': 0.2276, 'E': -0.2047, 'V':-0.38620, 'F':-0.2256,
'W':-0.2434, 'G':  0.4332, 'H':-0.00120, 'Y':-0.2075,
'I':-0.4222, 'K': -0.1001, 'L': 0.33793, 'M':-0.2259
}

RCSB PDB:

数据示例概览: RCSB PDB - 6XM5: Structure of SARS-CoV-2 spike at pH 5.5, all RBDs down

HEADER    VIRAL PROTEIN                           29-JUN-20   6XM5              
TITLE     STRUCTURE OF SARS-COV-2 SPIKE AT PH 5.5, ALL RBDS DOWN                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: SPIKE GLYCOPROTEIN;                                        
...                    
COMPND   6 MUTATION: YES                                                        
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS   
...                          
SOURCE   9 EXPRESSION_SYSTEM_CELL_LINE: FREESTYLE 293F                          
KEYWDS    SARS-COV-2 SPIKE, COVID19, VIRAL PROTEIN                              
EXPDTA    ELECTRON MICROSCOPY                                                   
AUTHOR    T.ZHOU,Y.TSYBOVSKY,A.OLIA,P.D.KWONG                                   
REVDAT   2   12-AUG-20 6XM5    1       JRNL   HETSYN                            
REVDAT   1   29-JUL-20 6XM5    0                                                
JRNL        AUTH   T.ZHOU,Y.TSYBOVSKY,A.S.OLIA,J.GORMAN,M.A.RAPP,G.CERUTTI,     
...  
JRNL        TITL   CRYO-EM STRUCTURES DELINEATE A PH-DEPENDENT SWITCH THAT         ...
JRNL        PMID   32637958                                                     
JRNL        DOI    10.1101/2020.07.04.187989                                    
REMARK   2                                                                      
...       
REMARK   3   SOFTWARE PACKAGES      : LATITUDE, CTFFIND, UCSF CHIMERA, COOT,    
REMARK   3                            PHENIX, RELION, RELION, RELION, RELION    
...         
REMARK   3   OVERALL ANISOTROPIC B VALUE  : NULL                                
...
REMARK 900 STRUCTURE OF SARS-COV-2 SPIKE AT PH 5.5, ALL RBDS DOWN               
DBREF  6XM5 A   14  1208  UNP    P0DTC2   SPIKE_SARS2     14   1208             
DBREF  6XM5 B   14  1208  UNP    P0DTC2   SPIKE_SARS2     14   1208             
DBREF  6XM5 C   14  1208  UNP    P0DTC2   SPIKE_SARS2     14   1208             
SEQADV 6XM5 GLY A  682  UNP  P0DTC2    ARG   682 ENGINEERED MUTATION            
SEQADV 6XM5 SER A  683  UNP  P0DTC2    ARG   683 ENGINEERED MUTATION    
...
SEQADV 6XM5 LYS C 1288  UNP  P0DTC2              EXPRESSION TAG                 
SEQRES   1 A 1275  GLN CYS VAL ASN LEU THR THR ARG THR GLN LEU PRO PRO          
SEQRES   2 A 1275  ALA TYR THR ASN SER PHE THR ARG GLY VAL TYR TYR PRO       
...
SEQRES  99 C 1275  LYS                                                          
HET    NAG  D   1      14                                                       
HET    NAG  D   2      14                                                    
...
HETNAM     NAG 2-ACETAMIDO-2-DEOXY-BETA-D-GLUCOPYRANOSE                         
...
FORMUL   4  NAG    58(C8 H15 N O6)                                              
HELIX    1 AA1 ASP A  294  LYS A  304  1                                  11    
...      
SHEET    3 AA1 8 TYR A 265  TYR A 269 -1  O  TYR A 265   N  PHE A  65           
...
LINK         O4  NAG S   1                 C1  NAG S   2     1555   1555  1.44  
CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1                      
ORIGX1      1.000000  0.000000  0.000000        0.00000                         
...
SCALE3      0.000000  0.000000  1.000000        0.00000                         
ATOM      1  N   ALA A  27     188.584 130.357 168.091  1.00 91.79           N  
ATOM      2  CA  ALA A  27     189.571 130.261 169.159  1.00 91.79           C  
ATOM      3  C   ALA A  27     190.053 131.641 169.567  1.00 91.79           C  
ATOM   8249  OG  SER A1147     191.132 184.224 284.141  1.00152.29           O  
TER    8250      SER A1147                                                      
ATOM   8251  N   ALA B  27     132.579 208.265 174.154  1.00140.96           N  
...
TER   24584      SER C1147                                                      
HETATM24585  C1  NAG D   1     184.739 152.280 150.742  1.00114.03           C  
HETATM24586  C2  NAG D   1     183.632 153.136 150.126  1.00114.03           C  
...
CONECT  29225033                                                                
CONECT  70225047                                                                
...
CONECT2539625389                                                                
MASTER      907    0   58   69  173    0    0    625393    3  936  297          
END

⚓️5.5 自测题

NCBI下载FASTA格式RNA序列:
PubMed文献检索:
二级结构元素预测:
- 氨基酸二级结构偏好表: Amino Acid Secondary Structure Preferences

A.A.	Helix	Helix	Sheet	Sheet
A.A.	Designation	P	Designation	P
A	H	1.45	I	0.97
C	i	0.77	h	1.30
D	i	0.98	i	0.80
E	H	1.53	B	0.26
F	h	1.12	h	1.28
G	B	0.53	i	0.81
H	h	1.24	b	0.71
I	I	1.00	H	1.60
K	I	1.07	b	0.74
L	H	1.34	h	1.22
M	h	1.20	H	1.67
N	b	0.73	b	0.65
P	B	0.59	b	0.62
Q	h	1.17	h	1.23
R	i	0.79	i	0.90
S	i	0.79	b	0.72
T	i	0.82	h	1.20
V	h	1.14	H	1.65
W	h	1.14	h	1.19
Y	b	0.61	h	1.29

Designations:
- H = Strong Former, h = Former, I = Weak Former, i = Indifferent, B = Strong Breaker, b = Breaker;
P : Conformational Parameter

⚓️第6章过滤数据

⚓️6.2 案例：使用RNAseq输出数据

新一代测序(Next-generation sequencing，NGS):

最早广泛应用测序技术为 70 年代的 Sanger 法，这也是完成人类基因组计划的基础，因其测序通量低、费时费力，科学家们一直在寻求通量更高、速度更快、价格更便宜、自动化程度更高的测序技术。自 2005 年以来，以Roche 公司的 454 技术、Illumina 公司的 Solexa 技术以及 ABI 公司的 SOLiD 技术为标志的高通量测序技术相继诞生 [6]。相较于传统方法，该技术主要特点是测序通量高、测序时间和成本显著下降，可以一次对几十万到几百万条 DNA 分子序列测定，这使某物种全基因组和转录组的全貌细致分析成为可能，又称为深度测序，很多文献中称其为新一代测序技术，足见其划时代意义 [7]。

@转录组测序技术的应用及发展综述 - 知乎 (zhihu.com) 推荐阅读

四种转录平台的比较

转录组(Transcriptome)测序技术|RNA-Seq(RNA sequencing)技术 :
- 转录本就是基因通过转录形成的通过剪切生成的一种或多种可供编码蛋白质的成熟的mRNA。剪切的过程中可能会剪切掉外显子，也有可能保留部分内含子，这样就形成了多种mRNA即多个转录本。
  @基因研究不知道转录本，做什么引物设计？！ (sohu.com)
- RNA-Seq利用高通量测序技术对组织或细胞中所有RNA 反转录而成cDNA文库进行测序，通过统计相关读段(reads)数计算出不同RNA的表达量，发现新的转录本；如果有基因组参考序列，可以把转录本映射回基因组，确定转录本位置、剪切情况等更为全面的遗传信息，已广泛应用于生物学研究、医学研究、临床研究和药物研发等。
  @转录组测序技术的应用及发展综述 - 知乎 (zhihu.com) 推荐阅读
- 大致流程:
  
  比对read到基因组，转录本组装，差异表达分析和差异基因注释
  
  转录本组装主要就是将比对到参考基因组的read组装成转录本接着进行下游的新转录本鉴定和转录本定量等分析，常见软件有StringTie和cufflinks。而对于无参转录组分析来说，转录本组装主要是将测序reads从头组装成转录本，常见软件有Trinity,Oases和SOAPdenovo-Trans。
  
  @转录组测序（转录本组装篇）_分析 (sohu.com)
  
  @转录组分析流程 - 简书 (jianshu.com)
- 无参转录组和有参转录组的区别: 如果所研究的物种有组装注释质量较好基因组序列，且和该基因组序列比对效率较高，那么可以采用有参转录组的分析策略，直接进行分析。反之，则需要按照无参转录组的分析策略进行转录本组装，构建unigene(基因簇)库，然后进行后续分析。
  @转录组测序：常见基础知识准备一 - 知乎 (zhihu.com)
转录本文件 .gtf : 记录了各个样本装配好的转录本
Cufflinks (cole-trapnell-lab.github.io)

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
- Cufflinks cuffcompare
Cufflinks includes a program that you can use to help analyze the transfrags you assemble. The program cuffcompare helps you:
- Compare your assembled transcripts to a reference annotation
- Track Cufflinks transcripts across multiple experiments (e.g. across a time course)
转录本比对文件 .tracking : 导入转录本文件(.gtf)到 cuffcompare 生成
元组方法:
- intersection(*others) : 返回一个新集合，其中包含原集合以及 others 指定的所有集合中共有的元素。
  
  @intersection 5. 内置类型 — Python 2.7.18 文档
- difference(*others) : 返回一个新集合，其中包含原集合中在 others 指定的其他集合中不存在的元素。
  
  @difference 5. 内置类型 — Python 2.7.18 文档
- 更多方法参见文档
列表的.count()方法: 8.6. array — 高效的数值数组 — Python 2.7.18 文档
- array.count(x) : 返回 x 在数组中的出现次数。
内置函数相关: 2. 内置函数 — Python 2.7.18 文档
- pop():
- reduce():
- del():
- remove():
- enumerate():

.readlines() 返回的列表可以索引:

1 2	lines = open('text.txt').readlines() open('new.txt', 'w').writelines(lines[2:4]+lines[6:])

使用列表推导式筛选:

1 2	data_a = [1, 2, 3, 4, 5, 6] data = [x for x in data_a if x != 3]

去重妙法-元组去重:

1 2	input_file = open('UniprotID.txt') unique = set(input_file)

CD-HIT(可容错的高同源性聚类数据库): 可以快速地的基于用户定义的相似性阈值对蛋白质序列进行聚类, 需要输入一组FASTA格式的序列,并返回两个文件: 一个是聚类列表, 一个是所聚各类的代表序列.

@CD-HIT Official Website (weizhongli-lab.org)

@weizhongli/cdhit: Automatically exported from code.google.com/p/cdhit (github.com)

@教程 | 如何用cd-hit去除冗余序列？ (sohu.com)

⚓️第7章管理表数据

⚓️7.2 案例：确定蛋白浓度

苯酚福林试剂测定蛋白质含量的流程:
- 相关论文: PROTEIN MEASUREMENT WITH THE FOLIN PHENOL REAGENT (jbc.org)
- 对比于凯氏定氮法的优势,对多个样品可以使用光度计快速地进行测量.
  
  @6. ANALYSIS OF PROTEINS (umass.edu)

zip()内置函数: zip([iterable, …]) 2. 内置函数 — Python 2.7.18 文档

>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)  # *号的功能是去除zipped元组的外括号(相当于拆解)
>>> x == list(x2) and y == list(y2)
True

# 效果同转置矩阵:
zipped = zip(*zipped)

# 样例
table = [
    ['protein', 'ext1', 'ext2', 'ext3'],
    [0.16, 0.038, 0.044, 0.040],
    [0.33, 0.089, 0.095, 0.091],
    [0.66, 0.184, 0.191, 0.191],
    [1.00, 0.280, 0.292, 0.283],
    [1.32, 0.365, 0.367, 0.365],
    [1.66, 0.441, 0.443, 0.444]
    ]

# remove the first row
table = table[1:]

protein, ext1, ext2, ext3 = zip(*table)

# create a single column for `ext` or concatenate each `ext`
# extend (`* 3`) the `protein` column to match the `ext` column.
extinction = ext1 + ext2 + ext3
protein = protein * 3

# create four tuples for each column
table = zip(protein, extinction)

for prot, ext in table:
    print prot, ext

列表创建注意: 如何创建多维列表？编程常见问题 — Python 2.7.18 文档
- [0]*3这样的语法是创建了[0]的三个引用, 改变一个列表的内容将改变所有内容

table = [
    ['protein', 'ext'],
    [0.16, 0.038],
    [0.33, 0.089],
    [0.66, 0.184],
    [1.00, 0.280],
    [1.32, 0.365],
    [1.66, 0.441]
]

# convert nested list to nested dict
nested_dict = {}
n = 0
key = table[0]
for row in table[1:]: # for row in table[0:]  includes the header
    n += 1
    entry = {key[0]: row[0], key[1]: row[1]}
    nested_dict['row'+str(n)] = entry

print nested_dict

⚓️7.5自测题

RNA碱基相似性矩阵:

A G C U

A 1.0 0.5 0.0 0.0

G 0.5 1.0 0.0 0.0

C 0.0 0.0 1.0 0.5

U 0.0 0.0 0.5 1.0

	A	G	C	U
A	1.0	0.5	0.0	0.0
G	0.5	1.0	0.0	0.0
C	0.0	0.0	1.0	0.5
U	0.0	0.0	0.5	1.0

⚓️第8章数据排序

⚓️8.2案例：数据表排序

sorted(iterable[, cmp[, key[, reverse]]]) 内置函数: sorted 2. 内置函数 — Python 2.7.18 文档

根据 iterable 中的项返回一个新的已排序列表。

参数cmp(x, y): 2. 内置函数 — Python 2.7.18 文档

sorted方法会隐式的调用cmp()函数进行两两比对, 可以通过自定义函数传给cmp参数控制排序

Compare the two objects x and y and return an integer according to the outcome. The return value is negative if x < y, zero if x == y and strictly positive if x > y.

>>> def numeric_compare(x, y):
...     return x - y
>>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) 
[1, 2, 3, 4, 5]

def my_cmp(a,b):  # 这个cmp可以在比对a和b的第x项相等时,寻找x+1项进行比对
    x = 0
    if a[x] < b[x]: return 1
    if a[x] == b[x]:
        if a[x+1] < b[x+1]: return 1
        if a[x+1] == b[x+1]:
            if a[x+2] < b[x+2]: return 1
            if a[x+2] == b[x+2]: return 0
            if a[x+2] > b[x+2]: return -1
        if a[x+1] > b[x+1]: return -1
    if a[x] > b[x]: return -1

排序默认按照ASCII顺序进行排序


space	0	:	A	Q	[	a	q	}
!	1	;	B	R	\	b	r	tick
“	2	<	C	S	]	c	s	{
#	3	=	D	T	^	d	t
$	4	>	E	U	_	e	u	DEL
%	5	?	F	V	`	f	v
&	6	@	G	W		g	w
‘	7		H	X		h	x
(	8		I	Y		i	y
)	9		J	Z		j	z
*			K			k
+			L			l
,			M			m
-			N			n
.			O			o
/			P			p

openator模块的itemgetter: 9.9. operator — 标准运算符替代函数 — Python 2.7.18 文档

>>> inventory = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
>>> getcount = itemgetter(1) # 构建一个获取`索引1的值`的itemgetter函数
>>> map(getcount, inventory) # 将inventory按项逐个传入getcount, 获取每项索引号为1的值
[3, 2, 5, 1]
>>> sorted(inventory, key=getcount) # 也可以将getcount作为key, 使用sorted排序
# 这里按照inventory的每一项的索引号为1的值进行排序
[('orange', 1), ('banana', 2), ('apple', 3), ('pear', 5)]

更多排序参见: 排序指南 — Python 2.7.18 文档
通过shell终端命令排序: Linux sort 命令详解：对文本文件中所有行进行排序。 - Linux 命令搜索引擎 (wangchujiang.com)
1
2
3
%sort myfile.txt # 按照字母表顺序排序
%sort -n myfile.txt # 按照数值排序
%sort myfile.txt -k2n -kl # 多列排序
对.tsv制表符分隔符文件的每行,按照七个优先级排序

from operator import itemgetter

# read table
in_file = open("random_distribution.tsv")
table = []
for line in in_file:
    columns = line.split()
    columns = [float(x) for x in columns]
    table.append(columns)
    
# 提取table每项的0~6形成的元组进行逐项比对排序
table_sorted = sorted(table, key=itemgetter(0, 1, 2, 3, 4, 5, 6)) 
print table_sorted

对BLAST的.csv逗号分隔符文件每行的某一项进行排序

from operator import itemgetter

input_file = open("BlastOut.csv")
output_file = open("BlastOutSorted.csv","w")

# read BLAST output table
table = []
for line in input_file:
    col = line.split(',')
    col[2] = float(col[2])
    table.append(col)

table_sorted = sorted(table, key=itemgetter(2), reverse=True)

# write sorted table to an output file
for row in table_sorted:
    row = [str(x) for x in row]
    output_file.write("\t".join(row) + '\n')

input_file.close()
output_file.close()

blastp(蛋白质比对蛋白数据库)工具:
- blast+本地化中blastp操作(基于PDB库)—linux - wang霏霏 - 博客园 (cnblogs.com)
- 这或许是我写的最全的BLAST教程 - 简书 (jianshu.com)
A链血红蛋白PDB表报告:
- RCSB PDB: Search

⚓️第9章模式匹配和文本挖掘

⚓️9.2 案例：在蛋白质序列中搜索磷酸化模体

正则表达式(regular expression, regexp): 代表一组字符串的字符串语法
- 类似的: 生物学家使用N来代替任意碱基(ATCG)
- [和]被称为元字符,由它和字符来编码一组字符串的表达,为正则表达式
  - 7.2. re — 正则表达式操作 — Python 2.7.18 文档 请仔细阅读
- re.compile(pattern, flags=0): Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods.
  参数见上面的文档
- re.compile()返回RegexpObject对象, 使用该对象的.search()方法将返回Match Object对象, 该对象还可以使用.group()/.span()/.start()/.end()方法. 另外RegexpObject对象还有.findall()/.match()/.finditer()等方法可以用, 详见文档
- 由(和)配合匹配到的字符串可以用group()获取识别到的更小的子组
  1
  2
  3
  4
  5
  6
  7
  seq = 'zzabcdzz'
  pattern = re.compile('(a(b)c)d')
  match = pattern.search(seq)
  print match.group(0)
  print match.group(1)
  print match.group(2)
  print match.groups()

Regex

检索匹配指定序列:

import re  # 正则表达式模块

# define a string with occurrences of regex:
seq = 'VSVLTMFRYAGWLDRLYMLVGTQLAAIIHGVALPLMMLI'

# compile a pattern and assign it to a variable
pattern = re.compile('[ST]Q') # 转换字符串为正则表达式对象

# search for the pattern in the string
match = pattern.search(seq)
if match:
    # print the first match along the sequence with the group() method
    # 4 characters before and after the pattern
    print '%10s' %(seq[match.start() - 4:match.end() + 4])
    print '%6s' % match.group()
else:
    print "no match"
# 结果:   
# MLVGTQLAAI
#     TQ

功能性模体表达:
- ELM: elm.eu.org
  
  This computational biology resource mainly focuses on annotation and detection of eukaryotic linear motifs (ELMs) by providing both a repository of annotated motif data and an exploratory tool for motif prediction. ELMs, or short linear motifs (SLiMs), are compact protein interaction sites composed of short stretches of adjacent amino acids. They are enriched in intrinsically disordered regions of the proteome and provide a wide range of functionality to proteins (Davey,2011,Van Roey,2014) They play crucial roles in cell regulation and are also of clinical importance, as aberrant SLiM function has been associated with several diseases and SLiM mimics are often used by pathogens to manipulate their hosts’ cellular machinery (Davey,2011, Uyar,2014)
- PROSITE: ExPASy - PROSITE
  
  PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them [More… / References / Commercial users ].
  PROSITE is complemented by ProRule , a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids [More…].
.findall()和.finditer()方法:

import re

pattern = re.compile('R.[ST][^P]')
seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'

# findall returns a list of all matches
matches = pattern.findall(seq)
print matches  # 结果: ['RQSA', 'RRSL', 'RPSK']

# finditer returns an iterator of match objects
match_iter = pattern.finditer(seq)  # 返回一个符合条件的迭代器
for match in match_iter:
    print match.group(), match.span(), match.start(), match.end()

    # 结果:
    # RQSA (0, 4) 0 4
    # RRSL (18, 22) 18 22
    # RPSK (40, 44) 40 44

子组group()的使用

import re

seq = 'QSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'

pattern1 = re.compile('R(.)[ST][^P]')
match1 = pattern1.search(seq)
print match1.group()  # 结果: RRSL
print match1.group(1) # 结果: R

pattern2 = re.compile('R(.{0,3})[ST][^P]')
match2 = pattern2.search(seq)
print match2.group()  # 结果: RRRSL
print match2.group(1) # 结果: RR

seq = 'zzabcdzz'

pattern = re.compile('(?P<w1>a(?P<w2>b)c)d')
match = pattern.search(seq)
print match.group(0)
print match.group('w1')
print match.group('w2')

Linux命令-grep: Linux grep 命令详解：强大的文本搜索工具 (wangchujiang.com)

1	grep ^'>' 3G5U.fasta # 筛选带有 `>` 的行

PROSITE正则表达式语法: ScanProsite user manual (expasy.org)

可通过字符串的.replace()方法将其转换为Python的re模块的语法

# 此正则表达式用于寻找 钙结合的类EGF结构域签名(PROSITE ID: EGF_CA; PROSITE AC: PS01187)
pattern = '[DEQN]-x-[DEQN](2)-C-x(3,14)-C-x(3,7)\
      -C-x-[DN]-x(4)-[FY]-x-C'
pattern = pattern.replace('{','[^')
pattern = pattern.replace('}',']')
pattern = pattern.replace('(','{')
pattern = pattern.replace(')','}')
pattern = pattern.replace('-','')
pattern = pattern.replace('x','.')
pattern = pattern.replace('>','$')
pattern = pattern.replace('<','^')
print pattern

在基因组序列中找到转录因子的结合位点:

转录因子结合位点列表(TFBS)作为检索表达式, 检索基因组序列即可
转录因子: 批量预测转录因子(TF)和转录因子结合位点(TFBS) (tencent.com)

import re

genome_seq = open('genome.txt').read()

# read transcription factor binding site patterns
sites = []
for line in open('TFBS.txt'):
    fields = line.split()
    tf = fields[0]
    site = fields[1]
    sites.append((tf, site))

# match all TF's to the genome and print matches
for tf, site in sites:
    tfbs_regexp = re.compile(site)
    all_matches = tfbs_regexp.findall(genome_seq)
    matches = tfbs_regexp.finditer(genome_seq)
    if all_matches:
        print tf, ':'
        for tfbs in matches:
            print '\t', tfbs.group(), tfbs.start(), tfbs.end()

从PubMed的HTML页面提取标题和摘要文本:
- 通过访问想要的网页然后使用浏览器的开发者模式, 得到想要获取的内容的标签, 这样可以构建正则表达式来检索我们想要的内容.

import urllib2  # 网络访问库
import re

pmid = '18235848'
url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
handler = urllib2.urlopen(url)  # 使用 urllib2的urlopen方法打开目标网页
html = handler.read()

title_regexp = re.compile('<h1>.{5,400}</h1>')  # 匹配格式为一级标题的内容字符5~400的标签
title_text = title_regexp.search(html)

abstract_regexp = re.compile('<AbstractText>.{20,3000}</AbstractText>')
abstract_text = abstract_regexp.search(html)

print 'TITLE:', title_text.group() 
print 'ABSTRACT:', abstract_text.group()

检测科学摘要中特定的词或词组:

import urllib2
import re

# word to be searched
word_regexp = re.compile('schistosoma')  # 构建想要检索的词编译出的正则对象

# list of PMIDs where we want to search the word
pmids = ['18235848', '22607149', '22405002', '21630672']  # 多篇文章的id
for pmid in pmids:
    url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=' + pmid
    handler = urllib2.urlopen(url)
    html = handler.read()
    title_regexp = re.compile('<h1>.{5,400}</h1>')
    title = title_regexp.search(html)
    title = title.group() 
    abstract_regexp = re.compile('<AbstractText>.{20,3000}</AbstractText>')
    abstract = abstract_regexp.search(html)
    abstract = abstract.group()
    word = word_regexp.search(abstract, re.IGNORECASE) # 匹配想要的字词
    if word:
        # display title and where the keyword was found
        print title
        print word.group(), word.start(), word.end()

氨基酸磷酸化位点:
- 简单的匹配可用正则表达式: 'R.[ST][^P]'
- 参考: NetPhos 3.1–蛋白磷酸化位点预测 - 简书 (jianshu.com)

⚓️第三部分模块化编程

这一部分作者希望读者可以掌握编写由功能较为独立的完善的模块组成的程序

⚓️第10章将程序划分为函数

⚓️10.2 案例：处理三维坐标文件

.PDB格式解读:
- Atomic Coordinate Entry Format Version 3.3 (wwpdb.org) 推荐阅读
PDB-101: Learn: Guide to Understanding PDB Data: Primary Sequences and the PDB Format (rcsb.org)
- PDB文件说明 - 简书 (jianshu.com) 推荐阅读
- PDB(Protein Data Bank)数据格式详解 - Raymone1125 - 博客园 (cnblogs.com)
- 该格式文件获取来源: RCSB PDB: Homepage
(此记录为书中例子这里摘录作为一个简单的例子)

记录的第一部分此为标头, 由几个解释行组成, 包括来源生物体/实验技术(X射线/核磁共振)/交叉引用/实验细节/原始分子序列和变异残基(如果有)等. 记录的第二部分介绍了标准组的原子坐标行. 每一个原子坐标行准确地描述了一个原子. 他们以**‘ATOM’**字符串开始: 1_{6列为’ATOM’, 7}11列为’原子序号’, 13_{16列为’原子名’, 17列’另存位置指示’, 18}20列为’残基名’, 22列为’链标识符’, 23_{26列为’残基序号’, 27列为’残基插入码’, 31}38列为x笛卡尔坐标(x轴单位为埃), 39_{46列为’y…‘, 47}54列为’z…’, 55_{60列为’布局’, 61}66列为’温度系数’, 77_{78列为’元素符号’, 79}81列为’原子电荷’

(值得说明的是原子名 CA和CB指的都是碳原子, A和B用于远程标识符, 表示离氨基碳原子的远近, A: α,B: β,G: γ等,原子名称的最后一个字符可以代表分支标识符)

struct模块: [7.3. struct — Interpret strings as packed binary data — Python 2.7.18 文档](https://docs.python.org/zh-cn/2.7/library/struct.html?highlight=struct pack#module-struct)

struct.pack(fmt, v1, v2, …)

Return a string containing the values v1, v2, ... packed according to the given format. The arguments must match the values required by the format exactly.
struct.unpack(fmt, string)

Unpack the string (presumably packed by pack(fmt, ...)) according to the given format. The result is a tuple even if it contains exactly one item. The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt)).

import struct

# pack() method; creates a string
format = '2s1s1s1s1s'
group = struct.pack(format, '10', '2', '3', '4', '5') # 结果: 102345

# unpack() method; parses the string to a tuple
format = '1s2s1s1s'
line = '12345'
col = struct.unpack(format, line)  # 结果: ('1', '23', '4', '5')

# calcsize() returns the number of characters
# in a given format string
format = '30s30s20s1s'
print struct.calcsize(format)  # 结果: 81

函数定义与使用: 定义函数 4. 其他流程控制工具 — Python 2.7.18 文档 详细阅读

>>> def fib(n):    # write Fibonacci series up to n
...     """Print a Fibonacci series up to n."""
...     a, b = 0, 1
...     while a < n:
...         print a,
...         a, b = b, a+b
...
>>> # Now call the function we just defined:
... fib(2000)
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597

处理PDB数据里的坐标数据变成氨基酸序列:

'''
Extract a FASTA sequence from a PDB file.
-----------------------------------------------------------
(c) 2013 Allegra Via and Kristian Rother
    Licensed under the conditions of the Python License
    This code appears in section 10.2.2 of the book
    "Managing Biological Data with Python".
-----------------------------------------------------------
'''

import struct

pdb_format = '6s5s1s4s1s3s1s1s4s1s3s8s8s8s6s6s10s2s3s' # 采用`format`格式 %6s

amino_acids = {
    'ALA':'A', 'CYS':'C', 'ASP':'D', 'GLU':'E',
    'PHE':'F', 'GLY':'G', 'HIS':'H', 'LYS':'K',
    'ILE':'I', 'LEU':'L', 'MET':'M', 'ASN':'N',
    'PRO':'P', 'GLN':'Q', 'ARG':'R', 'SER':'S',
    'THR':'T', 'VAL':'V', 'TYR':'Y', 'TRP':'W'
    }

def threeletter2oneletter(residues):   # 氨基酸缩写从三个字母 2(to,到) 单个字母 
    '''
    Converts the first element of each residue
    to a one letter amino acid symbol
    '''
    for i, threeletter in enumerate(residues):
        residues[i][0] = amino_acids[threeletter[0]]

def get_residues(pdb_file):  
    '''
    Reads the PDB input file, extracts the
    residue type and chain from the CA lines
    and appends both to the residues list
    '''
    residues = []
    for line in pdb_file:
        if line[0:4] == "ATOM":
            tmp = struct.unpack(pdb_format, line)
            ca = tmp[3].strip()
            if ca == 'CA':
                res_type = tmp[5].strip()
                chain = tmp[7]
                residues.append([res_type, chain])
    return residues

def write_fasta_records(residues, pdb_id, fasta_file):
    '''
    Write a FASTA record for each PDB chain
    '''
    seq = ''
    chain = residues[0][1]
    for aa, new_chain in residues:
        if new_chain == chain:
            seq = seq + aa
        else:
            # write sequence in FASTA format
            fasta_file.write(">%s_%s\n%s\n" % (pdb_id, chain, seq))
            seq = aa
            chain = new_chain
    # write the last PDB chain
    fasta_file.write(">%s_%s\n%s\n" % (pdb_id, chain, seq))

def extract_sequence(pdb_id):
    '''
    Main function: Opens files, writes files
    and calls other functions.
    '''
    pdb_file = open(pdb_id + ".pdb")
    fasta_file = open(pdb_id + ".fasta", "w")
    residues = get_residues(pdb_file)
    threeletter2oneletter(residues)
    write_fasta_records(residues, pdb_id, fasta_file)
    pdb_file.close()
    fasta_file.close()
    
# call the main function
extract_sequence("3G5U")

range()迭代利器: range() 2. 内置函数 — Python 2.7.18 文档
可变参数与不可变参数:
- 不要使用不可变参数作为函数的默认参数
lambda匿名函数: 4. 其他流程控制工具 — Python 2.7.18 文档

⚓️第11章用类化繁为简

⚓️11.2 案例：孟德尔遗传

Class(类)的构建:
- 9. 类 — Python 2.7.18 文档 仔细阅读
- .__repr__() 类方法: 当print命令打印类时, 将按照此类方法返回的格式进行输出
- 类在多人合作编程的优势
- 子类与继承

class Pea:
    def __init__(self, genotype):  # 初始化的构造函数
        self.genotype = genotype

class PeaStrain:
    def __init__(self, peas):
        self.peas = peas

yellow = Pea('GG')  # 类的实例化
green = Pea('gg')
strain = PeaStrain([yellow, green])

print Pea                 # __main__.Pea
print PeaStrain           # __main__.PeaStrain
print Pea.__init__        # <unbound method Pea.__init__>
print PeaStrain.__init__  # <unbound method PeaStrain.__init__>

print yellow              # <__main__.Pea instance at 0x0000000001DC6748>
print green               # <__main__.Pea instance at 0x000000000216BE08>
print strain              # <__main__.PeaStrain instance at 0x000000000216BE48>

'''
Classes can inherit from other classes
and extend their functionality.
-----------------------------------------------------------
(c) 2013 Allegra Via and Kristian Rother
    Licensed under the conditions of the Python License
    This code appears in section 11.4.3 of the book
    "Managing Biological Data with Python".
-----------------------------------------------------------
'''

from pea import Pea

class CommentedPea(Pea):

    def __init__(self, genotype, comment):
        Pea.__init__(self, genotype)        # 载入父类构造函数
        self.comment = comment

    def __repr__(self):
        return  '%s [%s] (%s)' % (self.get_phenotype(), self.genotype, self.comment)

yellow1 = CommentedPea('GG', 'homozygote')
yellow2 = CommentedPea('Gg', 'heterozygote')
print yellow1

类的设计指南:
- 单一职责原则和不重复自己原则
- 设计模式 : Design Patterns & Refactoring (sourcemaking.com)

⚓️第12章调试

⚓️12.2 案例：程序无法运行时应该怎样处理

三种错误类型
- 语法错误(SyntaxError): Python规定语法之外的不合法的写法
- 逻辑错误: 程序运行不会报错, 但是程序的运行结果不是你想要的结果
- 运行时错误: 语法检查通过, 但是运行时出现了意想不到的情况的

异常处理: 8. 错误和异常 — Python 2.7.18 文档

try: ... except: 语句:

>>> while True:
...     try:
...         x = int(raw_input("Please enter a number: "))
...         break
...     except ValueError:
...         print "Oops!  That was no valid number.  Try again..."

我还没有数据. 在掌握数据前建立理论是一个巨大的错误. 不知不觉地, 就开始扭曲事实以适应理论, 而不是让理论符合事实.

@ The Adventures of Sherlock Holmes, Arthur Conan Doyle

pdb调试工具: 26.2. pdb — Python的调试器 — Python 2.7.18 文档

⚓️第13章使用外部模块:R语言的Python调用接口

⚓️13.2 案例：从文件中读取数据，并通过Python使用R计算其平均值

R语言

⚓️第14章构建程序流程

⚓️14.2 案例：构建NGS流程

NGS技术应用:
- 差异表达基因
- 非编码RNA
- 突变
Top Hat
RNA-seq:
- 输出包含小cDNA测序片段 (称为读长,reads)的文件, 需回贴至参考基因组并进行拼装, 以获得全部的样本基因序列
- 某种流程实例:
  - 由Illumina(GAIIx)(或其他NGS)平台生成并存储读长至文本文件(如sample.fastq)中.
  - TopHat可以作为RNA-seq数据的剪切接头回贴工具, 它可以把RNA-seq读长对比至哺乳动物数量级的基因组中, 参考基因组必须保存至本地以便于TopHat访问, 然后分析回贴结果, 找到外显子之间的剪切接头. TopHat读取sample.fastq得到比对读长列表, 储存在.bam文件中.
  - Cufflinks包含有三个程序: 转录本装配(Cufflinks)/转录本比较(Cuffcompare)/检测调控和表达差异(Cuffdiff). Cufflinks进行转录本的拼接, 生成格式为transcripts.gtf的转录文件, 以便之后所用. 对于不同的样本(野生型/癌症细胞, 重复实验)还可以筛选转录组文件, 然后进行相互比较, 按照唯一的参考转录组进行拼接
  - 拼接好后的文件为transcripts.gtf,可以使用Cufflink工具包中的Cuffcompare程序与其他转录组进行比较
os模块: 15.1. os — 操作系统接口模块 — Python 2.7.18 文档
包装器(wrapper): 可以运行其他程序的程序
程序间传递数据:
- raw_input()直接手动输入数据
- 使用Python变量
- 将数据存储在另一个文件里
- 使用命令行参数, 比如sys.argv列表变量, 参见28.1. sys — 系统相关的参数和函数 — Python 2.7.18 文档
关闭文件的延时问题: 通过增加一个中间步骤(比如打开一个文件然后写入一些字符保存, 来确保这个中间步骤的上下两个步骤互不冲突
os.path 操控文件目录: 10.1. os.path — 常见路径操作 — Python 2.7.18 文档
MSA(multiple sequence alignment, 多序列比对):

Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid).

By contrast, Multiple Sequence Alignment (MSA) is the alignment of three or more biological sequences of similar length. From the output of MSA applications, homology can be inferred and the evolutionary relationship between the sequences studied.
- MSA工具: Multiple Sequence Alignment - CLUSTALW (genome.jp)
- T-COFFEE算法: T-Coffee Home Page (tcoffee.org)
  
  T-Coffee is a multiple sequence alignment package. You can use T-Coffee to align sequences or to combine the output of your favorite alignment methods (Clustal, Mafft, Probcons, Muscle…) into one unique alignment (M-Coffee).

⚓️第15章编写良好的程序

⚓️15.2 问题描述：不确定性

《编程的准则》(Dogma of Programming):
- 著: Donald Knuth
软件工程周期:
- 搜集需求, 形成以用户为视角的用户故事
- 构建简单有效的模块
- 构成整个程序, 按照新需求修改
PEP8代码规范:
- PEP 8: The Style Guide for Python Code (pep8.org)
- PEP 8 – Style Guide for Python Code | Python.org
pylint:
- Pylint - code analysis for Python | www.pylint.org
- 【python】pylint在项目中的使用 - 简书 (jianshu.com)
代码管理工具:
- Git: Git手册
python代码自动打包工具
python代码生成可执行文件
- py2exe: FrontPage - py2exe.org
- pyinstaller: PyInstaller bundles Python applications

敏捷开发: 敏捷软件开发宣言 (agilemanifesto.org)

敏捷软件开发宣言

我们一直在实践中探寻更好的软件开发方法，
身体力行的同时也帮助他人。由此我们建立了如下价值观：

个体和互动高于流程和工具
工作的软件高于详尽的文档
客户合作高于合同谈判
响应变化高于遵循计划

也就是说，尽管右项有其价值，
我们更重视左项的价值。

Kent Beck Mike Beedle Arie van Bennekum Alistair Cockburn Ward Cunningham Martin Fowler James Grenning Jim Highsmith Andrew Hunt Ron Jeffries Jon Kern Brian Marick Robert C. Martin Steve Mellor Ken Schwaber Jeff Sutherland Dave Thomas

著作权为上述作者所有，2001年

适于科研界的工业化实例: Helix ALM | ALM Tools | Perforce
对生物信息学研究的各种软件工程技术(如自动化测试/代码评审/用户故事)都有描述的工具箱: A toolbox for developing bioinformatics software | Request PDF (researchgate.net)

⚓️第四部分数据可视化

⚓️第16章创建科学图表

⚓️16.2 案例：核糖体的核苷酸频率

2009年的诺贝尔化学奖授予了Venkatraman RamaKrishnan, Thomas Steitz 和 Ada Yonath, 以表彰他们对核糖体结构和功能的研究. 核糖体是已知的最大的反子机器之一, 由三种RNA组分(原核生物中是23S, 16S 和 5S rRNA) 和许多的蛋白质组成. RNA的构成包括四种基本的核糖核苷酸和一些起微调核糖体功能作用的修是核糖体

matplotlib:
- 官方教程: Matplotlib: Python plotting — Matplotlib 3.3.2 documentation
- 官方模板: Gallery — Matplotlib 3.3.2 documentation

from pylab import figure, plot, savefig

xdata = [1, 2, 3, 4]
ydata = [1.25, 2.5, 5.0, 10.0]

figure()
plot(xdata, ydata)

savefig('figure1.png')

# -----

figure()

xdata = [0.1 * i for i in range(100)]
ydata = [math.sin(j) for j in xdata]

plot(xdata, ydata, 'kd', linewidth = 1)
text(4.8, 0, "$y = sin(x)$", horizontalalignment = 'center', fontsize = 20)
axis([0, 3 * math.pi, -1.2, 1.2])

savefig('sinfunc.png')

# ---

from pylab import figure, title, xlabel, ylabel, hist, axis, grid, savefig

data = [1, 1, 9, 1, 3, 5, 8, 2, 1, 5, 11, 8, 3, 4, 2, 5]
n_bins = 5

figure()
num, bins, patches = hist(data, n_bins, normed = 1.0, histtype = 'bar', facecolor = 'green', alpha = 0.75)

title('Histogram')
xlabel('value')
ylabel('frequency')
axis()
grid(True)

savefig('histogram.png')

⚓️第17章使用PyMOL创建分子图像

⚓️17.2 示例：锌指

锌指:

锌指(英语: Zinc finger),又称锌手指, 是一种小的蛋白质结构模体, 其特征在于配合一个或多个锌离子(Zn²⁺)以稳定折叠. 最初用来描述非洲爪蟾卵母细胞转录因子IIIA假说结构的手指状外观

一个锌指包含由两个螺旋, 它们通过一个锌原子形成特殊的构象, 导致一个螺旋与DNA的大沟匹配

锌指名称现在已经包含了各种不同的蛋白质结构

@科学网—什么是锌指蛋白？ - 倪帅的博文 (sciencenet.cn)

@锌指 - 维基百科，自由的百科全书 (wikipedia.org)
PyMol: PyMOL | pymol.org

PyMOL is a user-sponsored molecular visualization system on an open-source foundation, maintained and distributed by Schrödinger.
- 免费开源版PyMOL安装 - 知乎 (zhihu.com)
- schrodinger/pymol-open-source: Open-source foundation of the user-sponsored PyMOL molecular visualization system. (github.com)

⚓️第18章处理图像

⚓️18.2 案例：画一个质粒

1977年 Francisco Bolivar 和 Raymond Rodriguez 构建了最早的人工质粒. 它包含了4361个DNA碱基对, 一个复制起点, 一个氨苄青霉素抗性基因和一个四环素抗性基因. 该质粒由许多限制性位点, 从而成为构建遗传载体的基础.

PIL(Python Imaging Library)图形库:

The Python Imaging Library (PIL) adds image processing capabilities to your Python interpreter. This library supports many file formats, and provides powerful image processing and graphics capabilities.

PIL 仅支持Python2
- Python Imaging Library (PIL) (pythonware.com)
- PIL - 廖雪峰的官方网站 (liaoxuefeng.com)
- Pillow — Pillow (PIL Fork) 8.0.0 documentation
  
  Pillow is the friendly PIL fork by Alex Clark and Contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors.
  
  Pillow 支持Python3
- python-pillow/Pillow: The friendly PIL fork (Python Imaging Library) (github.com)
蛋白质结构域架构:
- InterPro数据库: InterPro (ebi.ac.uk)
  
  InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. We combine protein signatures from these member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool.

⚓️第五部分 Biopython

Biopython 是一个计算分子生物学模块的集合, 用它可以实现许多生物信息学项目中的基本任务

Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
- 解析生物数据格式
- 从资源库下载资源
- 运行常用的生物信息学算法
- 运行Biopython实现的算法, 进行聚类, 机器学习, 数据分析和可视化
- 各种文档索引: Documentation · Biopython
- 上手指南: Biopython Tutorial and Cookbook

⚓️第19章使用序列数据

⚓️19.2 案例：如何将一条DNA编码序列翻译成对应的蛋白质序列并把它写入FASTA文件

Chapter 3 Sequence objects (biopython.org)

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT')
>>> print(my_seq)
AGTACACTGGT

>>> my_seq
Seq('AGTACACTGGT')
>>> my_seq.complement()
Seq('TCATGTGACCA')
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT')

>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
...     print(seq_record.id)
...     print(repr(seq_record.seq))
...     print(len(seq_record))
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC')
592

>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCG")
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5

...

IUPAC字符集:

预编译好的字符集, 覆盖了所有生物序列. 包括 IUPACUnamiguousDNA(基本的ATCG字母), IUPACAmbiguousDNA(包含二义字母), ExtendedIUPACDNA(包含修饰的碱基), IUPACUnamiguousRNA, IUPACAmbiguousRNA, ExtendedIUPACRNA, IUPACProtein(IPUAC标准氨基酸)和ExtendedIUPACProtein(包括硒代半胱氨酸, X等)
迭代器:

一个迭代器是产生一系列的项(如SeqRecord对象)的一种数据结构. 它可以像列表一样使用在循环中, 但技术上它不是一个列表. 一个迭代器没有长度, 也不能被索引或者是切片. 用户只能向它申请下一个元素, 当你这样做时, 迭代器寻找是否存在更多的可用的记录. 这种方式下, 迭代器无需把所有的记录一致放在内存中.
NCBI核酸翻译编码标准: The Genetic Codes

⚓️第20章从网络资源中检索数据

⚓️20.2 案例：在PubMed中用关键词搜索文献，下载并解析对应的记录

Chapter 9 Accessing NCBI’s Entrez databases (biopython.org)

⚓️第21章使用三维结构数据

⚓️21.2 案例：从PDB文件中提取原子名及其三维坐标

Chapter 11 Going 3D: The PDB module (biopython.org)

⚓️第六部分编程秘笈

编程秘笈01: PyCogent库

PyCogent: pycogent/pycogent: PyCogent: Official repository for software and unit tests (github.com)

PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for 3rd party applications.

项目疑似不再维护

参考文献: PyCogent: a toolkit for making sense from sequence | Genome Biology | Full Text (biomedcentral.com)

编程秘笈02: 反向互补和随机化序列

9.6. random — 生成伪随机数 — Python 2.7.18 文档

为什么要随机化序列:

每个序列模式可能在序列中随机出现, 在蛋白质和核酸中寻找匹配的功能模体不能保证该匹配一定有生物意义. 为了评估一个模体存在的生物学显著性, 需要检查的序列一啊不能要和随机集合进行比较. 在给定的序列集合中相对于随机序列集合有更显著表达的序列模体, 更倾向于编码一个有功能的性质(更有生物学意义). 一个好的随机序列是缺乏生物学意义的, 但是却与生物对象有相同的氨基酸/核苷酸组分, 可以通过对原序列反向,洗牌,随机化等方式创建.

seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

seq_list = list(seq)    
seq_list.reverse()
rev_seq = ''.join(seq_list)
print(rev_seq)   # 列表化后反转再拼接回字符串

rev_seq = ''
for s in reversed(seq):
    rev_seq = rev_seq + s
print(rev_seq)  # 使用reversed()迭代器循环反向拼接字符串

rev_seq = seq[::-1] # 字符串切片[start:end:step], 设步长为-1, 得反向


import random
# 返回由seq内元素组成的长为len(seq)的列表
ran_seq = random.sample(seq, len(seq)) 
# 转字符串
ran_seq = ''.join(random.sample(seq, len(seq)))
# 控制长度
ran_seq = ''.join(random.sample(seq, len(seq)-10))
# 使用random.choice方法在seq中每次随机返回一个元素, for循环控制次数
ran_seq = ''.join([random.choice(seq) for x in range(len(seq))])

data = list(seq)
random.shuffle(data) # 序列洗牌
shuffled_seq = data
shuffled_seq = ''.join(data)

编程秘笈03: 用概率创建随机序列

import random

nucleotides = list('ACGT')
dna = ''
while len(dna) < 100:
    dna += random.choice(nucleotides) # 随机生成100个碱基组成的DNA序列

nucleotides = list('ACGT')
probs = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3} # 总概率为1
assert sum(probs.values()) == 1.0 
# 上述assert语句等价于下面的两行
if sum(probs.values()) != 1.0:
  raise Exception('Sum of probabilites is not 1.0!')

dna = ''
while len(dna) < 100:
    nuc = random.choice(nucleotides)
    dice = random.random()
    if dice < probs[nuc]:
        dna += nuc
print dna

assert 语句: assert (python.org)

assert 语句是在程序中插入调试性断言的简便方式:

1	assert_stmt ::= "assert" expression ["," expression]

简单形式 assert expression 等价于

1 2	if __debug__: if not expression: raise AssertionError

编程秘笈04: 用Biopython解析多序列联配

多序列联配(MSA):

Pfam数据库: Pfam: Home page

常用格式为.sth (Stockholm 格式)

biostar handbook(六)| 序列联配 - 简书 (jianshu.com)

例子: Pfam: Family: Globin (PF00042) (xfam.org)

通过Biopython的Bio.AlignIO模块解析多联配:

Chapter 6 Multiple Sequence Alignment objects (biopython.org)

编程秘笈05: 从多序列联配中计算共有序列

共有序列(Consensus Sequence): 多联配中每列中频率最高的残基(最保守的)

seqs = [
     'ATCCAGCT',
     'GGGCAACT',
     'ATGGATCT',
     'AAGCAACC',
     'ATGCCATT',
     'ATGGCACT']

n = len(seqs[0])
profile = {'A':[0]*n, 'C':[0]*n, 'G':[0]*n, 'T':[0]*n}

for seq in seqs:
    for i, char in enumerate(seq):
        profile[char][i] += 1
consensus = ''

for i in range(n):
    col = [(profile[nt][i], nt) for nt in 'ATCG']
    consensus += max(col)[1]
    
print consensus

编程秘笈06: 计算系统发生树的节点间的距离

Chapter 13 Phylogenetics with Bio.Phylo

Newick格式是表示系统发生树的一种统一格式. 它可以被大多数的程序读取, 包括MrBayes, GARLI, PHYLIP, TREE-PUZZLE等. Newick树包括物种和祖先节点的名字, 它们的相互关系, 以及每个节点赋予的值(通常用于父节点的距离)

编程秘笈07: 核苷酸序列的密码子频率

# This program calculates the codon frequency in a RNA sequence
# (it could also be an entire genome)
aa_codon = {
'A':['GCU','GCC','GCA','GCG'], 'C':['UGU','UGC'],
'D':['GAU','GAC'],'E':['GAA','GAG'],'F':['UUU','UUC'],
'G':['GGU','GGC','GGA','GGG'], 'H':['CAU','CAC'],
'K':['AAA','AAG'],'I':['AUU','AUC','AUA','AUU','AUC','AUA'],
'L':['UUA','UUG','CUU','CUC','CUA','CUG'],'M':['AUG'],
'N':['AAU','AAC'],'P':['CCU','CCC','CCA','CCG'],
'Q':['CAA','CAG'],'R':['CGU','CGC','CGA','CGG','AGA','AGG'],
'S':['UCU','UCC','UCA','UCG','AGU','AGC',],
'Y':['UAU','UAC'],'T':['ACU','ACC','ACA','ACG'],
'V':['GUU','GUC','GUA','GUG'],'W':['UGG'],
'STOP':['UAG','UGA','UAA']}

codon_count = {
'GCU':0,'GCC':0,'GCA':0,'GCG':0,'CGU':0,'CGC':0,
'CGA':0,'CGG':0,'AGA':0,'AGG':0,'UCU':0,'UCC':0,
'UCA':0,'UCG':0,'AGU':0,'AGC':0,'AUU':0,'AUC':0,
'AUA':0,'AUU':0,'AUC':0,'AUA':0,'UUA':0,'UUG':0,
'CUU':0,'CUC':0,'CUA':0,'CUG':0,'GGU':0,'GGC':0,
'GGA':0,'GGG':0,'GUU':0,'GUC':0,'GUA':0,'GUG':0,
'ACU':0,'ACC':0,'ACA':0,'ACG':0,'CCU':0,'CCC':0,
'CCA':0,'CCG':0,'AAU':0,'AAC':0,'GAU':0,'GAC':0,
'UGU':0,'UGC':0,'CAA':0,'CAG':0,'GAA':0,'GAG':0,
'CAU':0,'CAC':0,'AAA':0,'AAG':0,'UUU':0,'UUC':0,
'UAU':0,'UAC':0,'AUG':0,'UGG':0,'UAG':0,
'UGA':0,'UAA':0}

# Writes the frequency of each codon to a file
def calc_freq(codon_count, out_file):
    count_tot = {}
    for aa in aa_codon.keys():
        n = 0
        for codon in aa_codon[aa]:
            n = n + codon_count[codon]
        count_tot[aa] = float(n)
    for aa in aa_codon.keys():
        for codon in aa_codon[aa]:
            if count_tot[aa] != 0.0:
                freq = codon_count[codon] / count_tot[aa]
            else:
                freq = 0.0
            out_file.write('%4s\t%5s\t%4d\t%9.3f\n'% \
                (aa,codon,codon_count[codon], freq))
in_file = open('A06662.1.fasta')
out_file = open('CodonFrequency.txt', 'w')

# Reads the RNA sequence into a single string
rna = ''
for line in in_file:
    if not line[0] == '>':
        rna = rna + line.strip()
        
# Scans the sequence frame by frame,
# counts the number of occurrences
# of each codon, and stores it in codon_count dictionary.
# Then calls calc_freq()
for j in range(3):
    out_file.write('!!Codon frequency in frame %d\n' %(j+1))
    out_file.write(' AA\tcodon\thits\tfrequency\n')
    prot = ''
    for i in range(j, len(rna), 3):
        codon = rna[i:i + 3]
        if codon in codon_count:
            codon_count[codon] = codon_count[codon] + 1
    calc_freq(codon_count, out_file)
    
out_file.close()

编程秘笈08: 解析Vienna格式的RNA二级结构

RNA序列最重要的性质之一是其折叠和配对的能力. 与DNA不同, RNA的碱基配对不限于双螺旋, 而是能够形成复杂的结构. 预测这些RNA的碱基配对和二级结构是RNA生物信息学的常见任务.

Vienna包: ViennaRNA Web Services (univie.ac.at)

一个命令行和网络的工具集, 可以进行如从RNA序列预测碱基配对等基础任务

这些结果通常被保存为Vienna格式

1
2
3

> two hairpin loops
AAACCCCGUUUCGGGGAACCACCA
((((...)))).((((..)).)).

第一行包含了RNA序列名称, 第二行是序列本身

第三行的点-括号结构表达RNA二级结构

对应的括号指示一个碱基对, 而不配对的碱基对用点表示

class RNAStructure:
    def __init__(self, vienna):
        lines = vienna.split('\n')
        self.name = lines[0].strip()
        self.sequence = lines[1].strip()
        self.basepairs = \
        sorted(self.parse_basepairs(lines[2].strip()))
        
    def parse_basepairs(self, dotbracket):
        stack = []
        for i, char in enumerate(dotbracket):
            if char == '(':
                stack.append(i)
            elif char == ')':
                j = stack.pop()
        yield j, i
        
vienna = '''> two hairpin loops
AAACCCCGUUUCGGGGAACCACCA
((((...)))).((((..)).)).
'''
rna = RNAStructure(vienna)
print rna.name
print rna.sequence
for base1, base2 in rna.basepairs:
    print '(%i, %i)'%(base1, base2)

编程秘笈09: 解析BLAST的XML输出

Biopython BLAST XML解析器读取XML BLAST输出文件

7.3 Parsing BLAST output (biopython.org)

NCBIXML.parse()函数返回一个blast_out实例, 它包含一到多个记录对象(BLAST结果在一个记录里, 而如PSI-BLAST则是对每个查询有一个纪录). 每个查询都包含一到多个匹配(hit)或联配(alignment), 依次包含一到多个局部联配(HSP).

>>> blastp -query P05480.fasta -db nr.00 -out blast_output.xml
-outfmt 5

xml_file = open("blast_output.xml")
blast_out = NCBIXML.parse(xml_file)
for record in blast_out:
    for alignment in record.alignments:
        print alignment.title
        for hsp in alignment.hsps:
            # filter by e-value
            if hsp.expect < 0.0001:
                print "score:", hsp.score
                print "query:", hsp.query
                print "match:", hsp.match
                print "sbjct:", hsp.sbjct
                print '#' * 70
                print

编程秘笈10: 解析SBML文件

Systems Biology Markup Language (SBML), a free and open interchange format for computer models of biological processes. SBML is useful for models of metabolism, cell signaling, and more. It continues to be evolved and expanded by an international community.

Main Page - SBML.caltech.edu

SBML格式用来储存通路,反应和调控网络信息的标准格式

XML 教程 | 菜鸟教程 (runoob.com)

19.5. XML处理模块 — Python 2.7.18 文档

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level2">
    <model name="SBML file with three metabolites">
    <listOfSpecies>
        <species id="M_m78" name="Inosine">
            <p>FORMULA: C10H12N4O5</p>
        </species>
        <species id="M_m79" name="Xanthosine">
            <p>FORMULA: C10H12N4O6</p>
        </species>
        <species id="M_m80" name="Xanthosine">
            <p>FORMULA: C10H12N4O6</p>
        </species>
    </listOfSpecies>
    </model>
</sbml>

from xml.dom.minidom import parse

document = parse('sbml_example.xml')
species_list = document.getElementsByTagName("species")
for species in species_list:
    species_id = species.getAttribute('id')
    name = species.getAttribute('name')
    p_list = species.getElementsByTagName("p")
    p = p_list[0]
    text = p.childNodes[0]
    formula = text.nodeValue
    print "%-20s\t%5s\t%s"%(name, species_id, formula)
    
# Inosine      M_m78 FORMULA: C10H12N4O5
# Xanthosine   M_m79 FORMULA: C10H12N4O6
# Xanthosine   M_m80 FORMULA: C10H12N4O6

编程秘笈11: 运行BLAST

The NCBI provides a suite of command-line tools to run BLAST called BLAST+. This allows users to perform BLAST searches on their own server without size, volume and database restrictions. BLAST+ can be used with a command line so it can be integrated directly into your workflow.

Download BLAST Software and Databases Documentation (nih.gov)

7.2 Running BLAST locally (Biopython)

编程秘笈12: 访问、下载和读取网页

20.5. urllib — Open arbitrary resources by URL — Python 2.7.18 文档

from urllib import urlopen
url = urlopen('http://www.uniprot.org/uniprot/P01308.fasta')
doc = url.read()
print doc

20.6. urllib2 — extensible library for opening URLs — Python 2.7.18 文档

编程秘笈13: 解析HTML文件

19.1. HTMLParser — Simple HTML and XHTML parser — Python 2.7.18 文档

from HTMLParser import HTMLParser
import urllib

class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        self.start_tag = tag
        print "Start tag:", self.start_tag
        
    def handle_endtag(self, tag):
        self.end_tag = tag
        print "End tag :", self.end_tag
        
    def handle_data(self, data):
        self.data = data.strip()
        if self.data:
            print "Data :", self.data
            
parser = MyHTMLParser()
url = 'http://www.ncbi.nlm.nih.gov/pubmed/21998472'
page = urllib.urlopen(url)
data = page.read()
parser.feed(data)

编程秘笈14: 将PDB文件分割成PDB链文件

from struct import unpack
import os.path

filename = '2H8L.pdb'
in_file = open(filename)
pdb_id = filename.split('.')[0]
pdb_format = '6s5s1s4s1s3s1s1s4s1s3s8s8s8s6s6s6s4s2s3s'
chain_old = '@'
for line in in_file:
    if line[0:4] == "ATOM":
        col = unpack(pdb_format, line)
        chain = col[7].strip()
        if chain != chain_old:
            if os.path.exists(pdb_id+chain_old+'.pdb'):
                chain_file.close()
                print "closed:", pdb_id+chain_old+'.pdb'
            chain_file = open(pdb_id+chain+'.pdb','w')
            chain_file.write(line)
            chain_old = chain
        else:
            chain_file.write(line)
chain_file.close()
print "closed:", pdb_id+chain_old+'.pdb'

也可以使用终端的>将结果导出到文件

编程秘笈15: 在PDB结构上找到两个最靠近的Cα原子

from math import sqrt
from struct import unpack

def calc_dist(p1, p2):
    '''returns the distance between two 3D points'''
    tmp = pow(p1[0] - p2[0], 2) + \
        pow(p1[1] - p2[1], 2) + \
        pow(p1[2] - p2[2], 2)
    tmp = sqrt(tmp)
    return tmp

def min_dist(arglist):
    '''
    returns the closest residue pair and their
    CA_CA distance
    '''
    # initialize variables
    maxval = 10000
    residue_pair = ()
    # read arglist starting from the 1st position
    for i in range(len(arglist)):
        # save x,y,z coordinates from the arglist
        # i-element into the atom1 variable
        atom1 = arglist[i][1:]
        # run over all other elements
        for j in range(i + 1, len(arglist)):
            atom2 = arglist[j][1:]
            # calculate the distance
            tmp = calc_dist(atom1, atom2)
            # check if the distance is lower than
            # the previously recorded lowest value
            if tmp < maxval :
                # save the new data
                residue_pair = (arglist[i][0], \
                                arglist[j][0])
                maxval = tmp
    return residue_pair, maxval

def get_list_ca_atoms(pdb_file, chain):
    '''
    returns a list of CA atoms, the residues
    they belong to, and their x,y,z coordinates
    from the input PDB file
    '''
    in_file = open(pdb_file)
    CA_list = []
    pdb_format = '6s5s1s4s1s3s1s1s4s1s3s8s8s8s6s6s6s4s2s3s'
    for line in in_file:
        tmp = unpack(pdb_format, line)
        tmp = [i.strip() for i in tmp]
        # only save CA coords belonging to input chain
        if tmp[0] =="ATOM" and tmp[7] == chain and \
            tmp[3] == "CA":
            # create a tuple (aa_number, x, y, x)
            tmp = (tmp[5]+tmp[8], float(tmp[11]), \
            float(tmp[12]), float(tmp[13]))
            # add the tuple to the list
            CA_list.append(tmp)
    in_file.close()
    return CA_list

# obtain the list of CA atoms of Chain A
CA_list = get_list_ca_atoms("2H8L.pdb", "A")
# identify the closest atoms
res_pair, dist = min_dist(CA_list)
print 'The distance between', res_pair, 'is:', dist

编程秘笈16: 提取两个PDB链间的界面

在一个三维结构中, 两条多肽链之间的界面被定义为满足这两个条件的所有残基对:1.两个残基分别属于不同的链;2.它们之间的距离小于某个阈值

11.2 Structure representation (biopython.org)

from Bio import PDB
parser = PDB.PDBParser()
s = parser.get_structure("2H8L","2H8L.pdb")
first_model = s[0]
chain_A = first_model["A"]
chain_B = first_model["B"]
for res1 in chain_A:
    for res2 in chain_B:
        d = res1["CA"]-res2["CA"]
        if d <= 6.0:
            print res1.resname,res1.get_id()[1], res2.resname,\
                res2.get_id()[1], d

编程秘笈17: 用Modeller建立同源模型

Modeller 是一个为建立蛋白质三维结构的同源模型设计的软件包

About MODELLER (salilab.org)

MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1,2). The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints (3,4), and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc.

from modeller import *
from modeller.automodel import *
log.verbose()
env = environ()
env.io.atom_files_directory = ['.', '../atom_files']
a = automodel(env,
            alnfile = 'alignment.ali',
            knowns = '1eq9A',
            sequence = 'MyTarget_Seq')
a.starting_model = 1
a.ending_model = 1
a.make()

编程秘笈18: 用ModeRNA分析RNA三维同源模型

ModeRNA 库: moderna · PyPI 服务维护停止

分析/操纵/建模RNA三维结构

from moderna import *

ehz = load_model('1ehz.ent', 'A')
clean_structure(ehz)
print get_sequence(ehz)
print get_secstruc(ehz)

m = create_model()
copy_some_residues(ehz['1':'15'], m)
write_model(m, '1ehz_15r.ent')
temp = load_template('1ehz_15r.ent')
ali = load_alignment('''> model sequence
ACUGUGAYUA[UACCU#PG
> template: first 15 bases from 1ehz
GCGGA--UUUALCUCAG''')
model = create_model(temp, ali)
print get_sequence(model)

GCGGAUUUALCUCAGDDGGGAGAGCRCCAGABU#AAYAP?UGGAG7UC?UGUGTPCG"UCC
  ACAGAAUUCGCACCA
(((((((..((((........)))).((((.........)))).....
   (((((.......))))))))))))....
ACUGUGAYUA[UACCU#PG

编程秘笈19: 从三级结构计算RNA碱基配对

RNAView: 计算一个RNA分子PDB结构中的碱基配对

NDB | Tools | Downloads (rutgers.edu)

RNA Viewer, a tool for viewing RNA 2-dimensional structure, base pairs and RNA patterns. 2D structure is fully annotated according to Leontis and Westhof (2001) RNA 7:499-512.

Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H.M. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res, 31, 3450-3460.

PyCogent库

>>> rnaview lehz.pdb > basepair.out
1_72, A:  1 G-C 72  A: +/+ cis  XIX
2_71, A:  2 C-G 71  A: +/+ cis  XIX
3_70, A:  3 G-C 70  A: +/+ cis  XIX
4_69, A:  4 G-C 69  A: +/+ cis  XIX
5_68, A:  5 G-C 68  A: +/+ cis  XIX
6_67, A:  6 A-U 67  A: W/W cis  n/a
7_66, A:  7 A-U 66  A: -/- cis  XX

from cogent.app.rnaview import RnaView
from cogent.parse.rnaview import RnaviewParser
rna_prog = RnaView()
result = rna_prog('1ehz.pdb')
bpairs = result['base_pairs']
errors = result['StdErr'].read()
stdout = result['StdOut'].read()

bp_dict = RnaviewParser(bpairs)
print 'INFORMATION:'
sys.stderr.write(errors)
print stdout
print 'BASE PAIRS:'
for key in bp_dict:
    print key, bp_dict[key]

编程秘笈20: 结构重叠的真实实例: 丝氨酸蛋白酶催化三分子

结构重叠: 在重叠的过程中, 两个结构之一的(靶标)被放置在一个固定的位置上,另一个(探针)被平移和旋转, 直到连个结构的均方差达到最小. 一个重叠必须要求两个集合的原子数相同才能工作, 所以在重叠两个结构之前, 需要决定每个结构上哪些原子需要重叠, 这甚至可以识整个结构的所有原子.

11.8 Accessing the Protein Data Bank (biopython.org)

11.6.6 Superimposing two structures (biopython.org)

# Superimpose the catalytic triads of two different serine
# proteases(on CA and N atoms of res H57, D102, and S195 of chain A)

from Bio import PDB

# Retrieve PDB files   网络获取三个结构文件
pdbl = PDB.PDBList()
pdbl.retrieve_pdb_file("1EQ9")
pdbl.retrieve_pdb_file("1FXY")

# Parse the two structures  到对应目录读取文件结构信息
from Bio.PDB import PDBParser, Superimposer, PDBIO
parser = PDB.PDBParser()
struct_1 = parser.get_structure("1EQ9", "eq/pdb1eq9.ent")
struct_2 = parser.get_structure("1FXY", "fx/pdb1fxy.ent")

# get the catalytic triads  提取三个催化位点氨基酸信息
res_57_struct_1 = struct_1[0]['A'][57]
res_102_struct_1 = struct_1[0]['A'][102]
res_195_struct_1 = struct_1[0]['A'][195]
res_57_struct_2 = struct_2[0]['A'][57]
res_102_struct_2 = struct_2[0]['A'][102]
res_195_struct_2 = struct_2[0]['A'][195]

# Build 2 lists of atoms for calculating a rot.-trans. matrix
# (target and probe). 每个位点氨基酸的α碳以及氮原子放入目标和探针列表
target = []
backbone_names = ['CA', 'N']
for name in backbone_names:
    target.append(res_57_struct_1[name])
    target.append(res_102_struct_1[name])
    target.append(res_195_struct_1[name])
    
probe = []
for name in backbone_names:
    probe.append(res_57_struct_2[name])
    probe.append(res_102_struct_2[name])
    probe.append(res_195_struct_2[name])
    
# Check whether target and probe lists are equal in size.
# This is needed for calculating a rot.-trans. matrix
assert len(target) == len(probe)  # 确定原子数量相等

# Calculate the rotation-translation matrix.
sup = Superimposer()   # 旋转平移矩阵计算
sup.set_atoms(target, probe)

# Apply the matrix. Remember that it can be applied only on
# lists of atoms.  应用矩阵, 注意只能按照列表提交
struct_2_atoms = [at for at in struct_2.get_atoms()]
sup.apply(struct_2_atoms)

# Write the rotation-translated structure 保存旋转平移结构
out = PDBIO()
out.set_structure(struct_2)
out.save('1FXY-superimposed.pdb')

⚓️附录

⚓️附录A 命令概览

Linux命令见: Linux命令搜索引擎命令，Linux Linux命令搜索引擎命令详解：最专业的Linux命令大全，内容包含Linux命令手册、详解、学习，值得收藏的Linux命令速查手册。 - Linux 命令搜索引擎

Python命令见: Python3 教程 | 菜鸟教程 (runoob.com)

⚓️附录B Python资源

Python官网: Welcome to Python.org

Python官方文档集合: 3.9.0 Documentation (python.org)

Python标准库: The Python Standard Library

Biopython教程: Biopython Tutorial and Cookbook

⚓️附录C 记录样板

# C.1 A SINGLE PROTEIN SEQUENCE FILE IN FASTA FORMAT
>sp |P03372|ESR1_HUMAN Estrogen receptor OS = Homo sapiens GN =
ESR1 PE = 1 SV = 2
MTM TLHTKASGMALLHQIQGNELEPLNRPQLKIPLERPLGEVYLDSSKPAVYNYPEGAAY
EFN AAAAANAQVYGQTGLPYGPGSEAAAFGSNGLGGFPPLNSVSPSPLMLLHPPPQLSPF
LQP HGQQVPYYLENEPSGYTVREAGPPAFYRPNSDNRRQGGRERLASTNDKGSMAMESAK
ETR YCAVCNDYASGYHYGVWSCEGCKAFFKRSIQGHNDYMCPATNQCTIDKNRRKSCQAC
RLR KCYEVGMMKGGIRKDRRGGRMLKHKRQRDDGEGRGEVGSAGDMRAANLWPSPLMIKR
SKK NSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINW
AKR VPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEG
MVE IFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTLKSLEEKDHIHRVLD
KIT DTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLL
LEM LDAHRLHAPTSRGGASVEETDQSHLATAGSTSSHSLQKYYITGEAEGFPATV

# C.2 A SINGLE NUCLEOTIDE SEQUENCE FILE IN FASTA
FORMAT
>ENSG00000188536|hemoglobin alpha 2
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCT
GGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTC
CCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCG
CTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCAC
GCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTG
GCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCT
GTGAGCACCGTGCTGACCTCCAAATACCGTTAA

# C.3 AN EXAMPLE OF AN RNA SEQUENCE
IN FASTA FORMAT
>ENSG00000188536|hemoglobin alpha 2
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCT
GGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTC
CCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCG
CTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCAC
GCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTG
GCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCT
GTGAGCACCGTGCTGACCTCCAAATACCGTTAA

# C.4 A MULTIPLE SEQUENCE FILE IN FASTA FORMAT
>sp |P03372|ESR1_HUMAN Estrogen receptor OS = Homo sapiens GN =
ESR1 PE = 1 SV = 2
MTM TLHTKASGMALLHQIQGNELEPLNRPQLKIPLERPLGEVYLDSSKPAVYNYPEGAAY
EFN AAAAANAQVYGQTGLPYGPGSEAAAFGSNGLGGFPPLNSVSPSPLMLLHPPPQLSPF
LQP HGQQVPYYLENEPSGYTVREAGPPAFYRPNSDNRRQGGRERLASTNDKGSMAMESAK
ETR YCAVCNDYASGYHYGVWSCEGCKAFFKRSIQGHNDYMCPATNQCTIDKNRRKSCQAC
RLR KCYEVGMMKGGIRKDRRGGRMLKHKRQRDDGEGRGEVGSAGDMRAANLWPSPLMIKR
SKK NSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINW
AKR VPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEG
MVE IFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTLKSLEEKDHIHRVLD
KIT DTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLL
LEM LDAHRLHAPTSRGGASVEETDQSHLATAGSTSSHSLQKYYITGEAEGFPATV
>sp |P62333|PRS10_HUMAN 26S protease regulatory subunit
10B OS = Homo sapiens GN = PSMC6 PE = 1 SV = 1
MAD PRDKALQDYRKKLLEHKEIDGRLKELREQLKELTKQYEKSENDLKALQSVGQIVGEV
LKQ LTEEKFIVKATNGPRYVVGCRRQLDKSKLKPGTRVALDMTTLTIMRYLPREVDPLVY
NMS HEDPGNVSYSEIGGLSEQIRELREVIELPLTNPELFQRVGIIPPKGCLLYGPPGTGK
TLL ARAVASQLDCNFLKVVSSSIVDKYIGESARLIREMFNYARDHQPCIIFMDEIDAIGG
RRF SEGTSADREIQRTLMELLNQMDGFDTLHRVKMIMATNRPDTLDPALLRPGRLDRKIH
IDL PNEQARLDILKIHAGPITKHGEIDYEAIVKLSDGFNGADLRNVCTEAGMFAIRADHD
FVV QEDFMKAVRKVADSKKLESKLDYKPV
>sp |P62509|ERR3_MOUSE Estrogen-related receptor gamma OS = Mus
musculus GN = Esrrg PE = 1 SV = 1
MDS VELCLPESFSLHYEEELLCRMSNKDRHIDSSCSSFIKTEPSSPASLTDSVNHHSPGG
SSD ASGSYSSTMNGHQNGLDSPPLYPSAPILGGSGPVRKLYDDCSSTIVEDPQTKCEYML
NSM PKRLCLVCGDIASGYHYGVASCEACKAFFKRTIQGNIEYSCPATNECEITKRRRKSC
QACR FMKCLKVGMLKEGVRLDRVRGGRQKYKRRIDAENSPYLNPQLVQPAKKPYNKIVSH
LLV AEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSL
LQS AWMEILILGVVYRSLSFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMK
LEK EEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMT
LPLLRQTSTKAVQHFYNIKLEGKVPMHKLFLEMLEAKV

# C.8 AN EXAMPLE OF THE SEQRES LINES OF
A PDB FILE (FROM FILE 1TDL)
SEQ RES 1 A 223 ILE VAL GLY GLY TYR THR CYS GLY ALA ASN THR VAL PRO
SEQ RES 2 A 223 TYR GLN VAL SER LEU ASN SER GLY TYR HIS PHE CYS GLY
SEQ RES 3 A 223 GLY SER LEU ILE ASN SER GLN TRP VAL VAL SER ALA ALA
SEQ RES 4 A 223 HIS CYS TYR LYS SER GLY ILE GLN VAL ARG LEU GLY GLU
SEQ RES 5 A 223 ASP ASN ILE ASN VAL VAL GLU GLY ASN GLU GLN PHE ILE
SEQ RES 6 A 223 SER ALA SER LYS SER ILE VAL HIS PRO SER TYR ASN SER
SEQ RES 7 A 223 ASN THR LEU ASN ASN ASP ILE MET LEU ILE LYS LEU LYS
SEQ RES 8 A 223 SER ALA ALA SER LEU ASN SER ARG VAL ALA SER ILE SER
SEQ RES 9 A 223 LEU PRO THR SER CYS ALA SER ALA GLY THR GLN CYS LEU
SEQ RES 10 A 223 ILE SER GLY TRP GLY ASN THR LYS SER SER GLY THR SER
SEQ RES 11 A 223 TYR PRO ASP VAL LEU LYS CYS LEU LYS ALA PRO ILE LEU
SEQ RES 12 A 223 SER ASP SER SER CYS LYS SER ALA TYR PRO GLY GLN ILE
SEQ RES 13 A 223 THR SER ASN MET PHE CYS ALA GLY TYR LEU GLU GLY GLY
SEQ RES 14 A 223 LYS ASP SER CYS GLN GLY ASP SER GLY GLY PRO VAL VAL
SEQ RES 15 A 223 CYS SER GLY LYS LEU GLN GLY ILE VAL SER TRP GLY SER
SEQ RES 16 A 223 GLY CYS ALA GLN LYS ASN LYS PRO GLY VAL TYR THR LYS
SEQ RES 17 A 223 VAL CYS ASN TYR VAL SER TRP ILE LYS GLN THR ILE ALA
SEQRES 18 A 223 SER ASN

# C.9 AN EXAMPLE OF THE CUFFCOMPARE OUTPUT FOR THREE SAMPLES (Q1, Q2, AND Q3)
# Because of the length of each line, the first field of each new line is in bold,and different lines are highlighted using indentation.
Med ullo-Diff_00000001 XLOC_000001 Lypla1|uc007afh.1
    q1:NSC.P419.228|uc007afh.1 |100|35.109496|34.188903
    |36.030089|397.404732|2433 q2:NSC.
    P429.18|uc007afh.1|100 |15.885823|15.240240
    |16.531407|171.011325 |2433 q3:NSC.
    P437.15|uc007afh.1|100 |18.338541|17.704857|18.97222
    4|181.643949|2433
Med ullo-Diff_00000002 XLOC_000002 Tcea1|uc007afi.2
    q1:NSC.P419.228|uc007afi.2|
    18|1.653393|1.409591|1.897195 |18.587029|2671
    - q3:NSC.P437.108|uc007afi.2|100 |4.624079|4.258801|4
    .989356|45.379750|2671
Med ullo-Diff_00000003 XLOC_000002 Tcea1|uc011wht.1
    q1:NSC.P419.228|uc011wht.1|100 |9.011253|8.560848
    |9.461657|101.302266|2668 q2:NSC.
    P429.116|uc011wht.1|100 |6.889020|6.503460|7.27458
    0|73.238938 |2668q3:NSC.P437.108 |uc011wht.1|90
    |4.170527 |3.817430|4.523625|40.928694|2668
Med ullo-Diff_00000004 XLOC_000003 Tcea1|uc007afi.2
    q1:NSC.P419.231|NSC.P419.231.1
    |100|31.891396|30.892103 |32.890690|379.023601
    |1568q2:NSC.P429.121|NSC.P429.121.1 |100|27.991543
    |27.007869|28.975218 |313.481210|1532 -
Med ullo-Diff_00000005 XLOC_000002 Tcea1|uc007afi.2
    q1:NSC.P419.236 |NSC.P419.236.1
    |100|1.164739|0.868895 |1.460583|13.879881|- - -
    Med ullo-Diff_00000006 XLOC_000004 Atp6v1h|uc007afn.1
    q1:NSC.P419.55|uc007afn.1 |100|39.526818 |38.58510
    2|40.468533|455.599775|1976 q2:NSC.
    P429.43|uc007afn.1 |100|25.034945 |24.182398|25.887
    493|271.738343|1976 q3:NSC.P437.37|uc007afn.1
    |100|20.848047 |20.043989|21.652104|205.866771|1976

⚓️附录D 处理目录和用UNIX编程

Shell 教程 | 菜鸟教程 (runoob.com)

Linux 教程 | 菜鸟教程 (runoob.com)

科技生物

⚓️书籍简介 📚

⚓️写在前面 ✍

⚓️故事就此展开 📖

⚓️第一部分 入门

⚓️第1章 Python shell

⚓️1.2 案例： 计算ATP水解的ΔG

⚓️1.3.4 计算

⚓️第2章 第一个Python程序

⚓️2.2案例： 如何计算胰岛素序列中的氨基酸频率 *

⚓️2.5自测题

⚓️第二部分 数据管理

⚓️第3章 分析数据列

⚓️3.2案例： 树突长度

⚓️第4章 解析数据记录

⚓️4.2 案例： 整合质谱数据， 转化到代谢通路中

⚓️第5章 搜索数据

⚓️5.2案例： 将RNA序列翻译为相应的蛋白质序列

⚓️5.5 自测题

⚓️第6章 过滤数据

⚓️6.2 案例： 使用RNAseq输出数据

⚓️第7章 管理表数据

⚓️7.2 案例： 确定蛋白浓度

⚓️7.5自测题

⚓️第8章 数据排序

⚓️8.2案例： 数据表排序

⚓️第9章 模式匹配和文本挖掘

⚓️9.2 案例： 在蛋白质序列中搜索磷酸化模体

⚓️第三部分 模块化编程

⚓️第10章 将程序划分为函数

⚓️10.2 案例： 处理三维坐标文件

⚓️第11章 用类化繁为简

⚓️11.2 案例： 孟德尔遗传

⚓️第12章 调试

⚓️12.2 案例： 程序无法运行时应该怎样处理

⚓️第13章 使用外部模块:R语言的Python调用接口

⚓️13.2 案例： 从文件中读取数据， 并通过Python使用R计算其平均值

⚓️第14章 构建程序流程

⚓️14.2 案例： 构建NGS流程

⚓️第15章 编写良好的程序

⚓️15.2 问题描述： 不确定性

⚓️第四部分 数据可视化

⚓️第16章 创建科学图表

⚓️16.2 案例： 核糖体的核苷酸频率

⚓️第17章 使用PyMOL创建分子图像

⚓️17.2 示例： 锌指

⚓️第18章 处理图像

⚓️18.2 案例： 画一个质粒