跳到主要内容

09 Methods 撰写:审稿人最看的细节

Methods 是审稿人最仔细看的部分。他们不会逐行检查你的 Results,但会一字一句地看 Methods——因为方法决定了结果的可信度。写得好,审稿人觉得你专业可信;写得差,审稿人会怀疑你的整个分析流程。这一章用 TCGA-LIHC 项目为例,讲如何把你做过的每一步分析写成清晰、完整、可复现的 Methods 段。

Methods 段的标准结构

一篇典型的组学文章,Methods 包含以下子段,按照分析流程排列:

子段核心内容审稿人关注点
Data acquisition数据来源、样本量、纳入/排除标准数据是否公开可获取、样本选择是否有偏
RNA-seq data processing预处理流程、过滤标准过滤阈值是否合理、是否有质控步骤
Differential expression analysis工具、design、阈值、shrinkage是否用了多重检验校正、阈值选择依据
Functional enrichment工具、基因集来源、背景基因集背景基因是全基因组还是检测到的基因
Survival analysis分组方式、校正变量、验证策略分组阈值是否预定义、是否有外部验证
Mutation analysis突变 caller、过滤标准是否用了 somatic 而非 germline 变体
Statistical analysis整体统计框架、显著性定义P 值校正方法、effect size 报告
Data availability数据存放位置、代码仓库是否可复现

每个子段应该做到"读完之后别人能重复你的分析"这个标准。不需要写代码,但需要写清楚:用了什么工具(含版本号)、关键参数是什么、为什么做这个选择。

一个好的检验标准:让你的同事只读 Methods,不看你的代码,能否从头把分析跑出来?如果不能,说明 Methods 写得不够详细。

版本号与参数记录

审稿人问的第一个问题往往是:"你用的什么版本?"这不是挑刺,而是因为不同版本的算法可能给出不同结果。DESeq2 v1.38 和 v1.42 的默认 shrinkage 方法就不一样。

需要记录的信息:

软件层面:

  • R 版本(如 R 4.3.1)
  • Bioconductor 版本(如 BiocManager 3.18)
  • 每个关键包的版本号

参数层面:

  • 预过滤的具体标准(如"至少在 10 个样本中 count ≥ 10")
  • design formula 的完整写法
  • 显著性阈值(如 |log2FC| > 1 且 padj < 0.05)
  • 随机种子(如 set.seed(42))
  • 交叉验证的 fold 数

实际操作中,分析开始前就应该用 sessionInfo()renv::snapshot() 锁定环境。投稿时在 Methods 中写关键包的版本,完整的 session info 放 Supplementary。

以下是一个推荐的版本记录流程:

# 在分析脚本的开头记录环境
cat("=== Session Info ===\n")
cat(sprintf("Date: %s\n", Sys.Date()))
cat(sprintf("R version: %s\n", R.version.string))
cat(sprintf("Platform: %s\n", R.version$platform))

# 关键包版本
key_packages <- c("DESeq2", "clusterProfiler", "survival",
"glmnet", "maftools", "survminer")
for (pkg in key_packages) {
cat(sprintf("%s: %s\n", pkg, packageVersion(pkg)))
}

# 或者用 renv 锁定整个环境
# renv::init() # 项目初始化时
# renv::snapshot() # 每次安装新包后

在 Methods 中的写法示例:

All analyses were performed using R (version 4.3.1) with Bioconductor (version 3.18). Differential expression analysis was performed using DESeq2 (version 1.42.0) with apeglm shrinkage (version 1.24.0). Functional enrichment was performed using clusterProfiler (version 4.10.0). Survival analysis utilized the survival (version 3.5-7) and glmnet (version 4.1-8) packages.

统计学描述的规范写法

Methods 中的统计学描述有固定的写法模式。以下给出三个常见分析场景的模板:

差异表达分析:

Differential gene expression between tumor and normal samples was analyzed using DESeq2 (v1.48.2). Raw count data were filtered to retain genes with ≥10 counts in at least 10 samples. The design formula was specified as ~ condition, where condition indicates tumor or normal status. Log2 fold change estimates were shrunken using the apeglm method. Genes with |log2FC| > 1 and Benjamini-Hochberg adjusted P-value < 0.05 were considered differentially expressed.

关键要素:工具名 + 版本、过滤标准、design formula、shrinkage 方法、多重检验校正方法、阈值。

生存分析:

Univariate Cox proportional hazards regression was performed for each candidate gene using overall survival (OS) as the endpoint. Genes with P < 0.05 in univariate analysis were subjected to LASSO-penalized Cox regression (glmnet v4.1-8, alpha = 1) with 10-fold cross-validation to select the optimal regularization parameter (lambda.min). The risk score for each patient was calculated as the linear combination of selected gene expression values weighted by their LASSO coefficients. Patients were stratified into high-risk and low-risk groups using the median risk score as the cutoff. Kaplan-Meier survival curves were compared using the log-rank test. Multivariate Cox regression was performed to assess whether the risk score was an independent prognostic factor after adjusting for age, pathologic stage, and tumor grade.

关键要素:endpoint 定义、筛选标准、正则化方法和参数、分组依据、校正变量。

富集分析:

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed using clusterProfiler (v4.16.0). The background gene set consisted of all genes that passed the expression filter (n = 15,234). Over-representation analysis (ORA) was performed with a Benjamini-Hochberg adjusted P-value cutoff of 0.05 and a gene ratio cutoff of 0.2. Additionally, Gene Set Enrichment Analysis (GSEA) was performed using the log2 fold change-ranked gene list against KEGG gene sets (minimum gene set size = 15, maximum = 500).

关键要素:数据库版本/来源、背景基因集定义、校正方法、阈值。

数据可用性声明

几乎所有生物医学期刊现在都要求 Data Availability Statement (DAS)。即使用的是公开数据,也需要明确告诉读者数据从哪里来、代码在哪里。

使用公开数据(如 TCGA)的模板:

Data Availability Statement

The TCGA-LIHC RNA-seq count data (HTSeq-Counts), somatic mutation data (MuTect2 MAF), and clinical information were downloaded from the Genomic Data Commons (GDC) portal (https://portal.gdc.cancer.gov/, data release 37.0, accessed on 2024-01-15). The ICGC-LIRI validation dataset was obtained from the International Cancer Genome Consortium data portal (https://dcc.icgc.org/, release 28). All analysis code is available at https://github.com/username/TCGA-LIHC-lipid-signature (DOI: 10.5281/zenodo.XXXXXXX). Processed data and intermediate results are provided as Supplementary Data files.

使用自己产出数据的模板:

Raw sequencing data have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE###### and will be made publicly available upon publication. Processed count matrices are available as Supplementary Data. Analysis code is available at [GitHub URL] with DOI [Zenodo DOI].

注意几个要点:

  • 写明具体的数据库版本或 release 号
  • 写明访问日期(因为数据库会更新)
  • 代码仓库要附 DOI(通过 Zenodo 或 Figshare 生成),不要只给 GitHub URL(URL 可能失效)
  • 如果涉及受控访问数据(如 dbGaP),要说明访问方式

TCGA-LIHC 项目 Methods 完整示例

以下是一份完整的 Methods 段,覆盖 TCGA-LIHC 脂代谢预后 signature 项目的所有分析步骤。可以作为你自己项目的起点进行修改:

Data acquisition

RNA-seq gene expression data (HTSeq raw counts), somatic mutation data (MuTect2 variant calls), and clinical information for hepatocellular carcinoma (HCC) were obtained from The Cancer Genome Atlas (TCGA-LIHC) via the Genomic Data Commons portal (data release 37.0). A total of 371 primary tumor samples and 50 adjacent normal tissue samples were included. Samples with missing overall survival information or zero survival time were excluded, resulting in 365 tumor samples for survival analysis. For external validation, RNA-seq expression data and clinical information from the ICGC Liver Cancer - RIKEN, Japan (LIRI-JP) project were downloaded from the ICGC data portal (release 28), comprising 232 HCC patients with available survival data.

RNA-seq data processing

Raw count data were filtered to retain genes with ≥10 counts in at least 10 samples, yielding 15,234 genes for downstream analysis. Gene identifiers were mapped from Ensembl IDs to gene symbols using the org.Hs.eg.db annotation package (v3.18.0).

Differential expression analysis

Differential expression between tumor (n = 371) and normal (n = 50) samples was assessed using DESeq2 (v1.48.2) with the design formula ~ condition. Log2 fold change estimates were shrunken using the apeglm method. Genes with |log2 fold change| > 1 and Benjamini-Hochberg adjusted P-value < 0.05 were defined as differentially expressed genes (DEGs).

Functional enrichment analysis

Gene Ontology Biological Process (GO-BP) and KEGG pathway enrichment analyses were performed on DEGs using clusterProfiler (v4.16.0) with the set of expressed genes (n = 15,234) as background. Redundant GO terms were removed using semantic similarity-based simplification (cutoff = 0.7). Gene Set Enrichment Analysis (GSEA) was additionally performed using the log2 fold change-ranked full gene list.

Construction of prognostic signature

Among DEGs, lipid metabolism-related genes were identified by intersection with the "lipid metabolic process" GO term (GO:0006629) and KEGG lipid metabolism pathways. Univariate Cox proportional hazards regression was performed for each lipid metabolism-related DEG using overall survival (OS) as the endpoint. Genes with P < 0.05 were entered into LASSO-penalized Cox regression (glmnet v4.1-8, family = "cox", alpha = 1) with 10-fold cross-validation (set.seed = 42). The optimal regularization parameter (lambda.min) was selected to minimize the partial likelihood deviance. The risk score was calculated as the sum of expression values multiplied by their respective LASSO coefficients.

Survival analysis and validation

Patients in the TCGA-LIHC training cohort were divided into high-risk and low-risk groups using the median risk score as the cutoff. Kaplan-Meier curves were generated and compared using the log-rank test. Multivariate Cox regression was performed to assess the independent prognostic value of the risk score after adjusting for age, pathologic stage, and histological grade. Time-dependent ROC analysis (timeROC v0.4) was used to evaluate the predictive accuracy at 1, 3, and 5 years. For external validation, the same gene coefficients derived from the TCGA training set were applied to ICGC-LIRI expression data, and patients were stratified using the median risk score.

Mutation analysis

Somatic mutation data (MuTect2 MAF format) were analyzed using maftools (v2.18.0). The associations between high-frequency mutations (TP53, CTNNB1) and expression levels of signature genes were assessed using the Wilcoxon rank-sum test.

Statistical analysis

All statistical analyses were performed using R (v4.3.1). Two-group comparisons were performed using the Wilcoxon rank-sum test (continuous variables) or Fisher's exact test (categorical variables). Multiple testing correction was performed using the Benjamini-Hochberg method unless otherwise specified. A two-sided P-value < 0.05 was considered statistically significant. All figures were generated using ggplot2 (v3.4.4) and related visualization packages.

不同期刊的 Methods 要求

不同期刊对 Methods 的详细程度要求不同:

期刊类型Methods 位置详细程度备注
Nature/ScienceOnline Methods极详细,可达 3000 词需含所有可复现信息
Cell 系列STAR Methods结构化格式,分 KEY RESOURCES TABLE + METHOD DETAILS有固定模板
Bioinformatics正文 Methods简洁,800-1500 词重点写方法创新
2-5 分期刊正文 Methods中等详细度通常无字数限制
Supplementary附加材料可放超长版含代码片段和完整参数

写作策略:先写一份"最详细版"的 Methods(类似上面的完整示例),投稿时根据目标期刊的要求做删减。正文放核心信息,细节放 Supplementary Methods。

审稿人常见 Methods 质疑

以下是审稿人在 Methods 上最常提的问题,以及你应该在初稿中就预防的对策:

审稿人质疑预防措施
"为什么用 DESeq2 而不是 edgeR?"在 Methods 中简述选择理由,或在 Supplementary 中提供多工具对比
"分组阈值(median)是怎么确定的?"明确写"prespecified median cutoff",或提供 sensitivity analysis
"有没有校正 batch effect?"说明数据来源统一(同一 data release),或描述 batch 校正步骤
"sample size 够不够做 LASSO?"报告 events per variable (EPV) 比值
"随机种子是什么?"在 Methods 中写明 set.seed 的值

常见坑

  • 版本号缺失:投稿半年后审稿人要求补充版本号,但你已经忘了当时用的什么版本,包也更新了好几次。解决方法:分析一开始就用 renv::snapshot()sessionInfo() 锁定环境,把输出保存在项目目录下。

  • 参数描述不完整:只写了"用 DESeq2 做差异分析",没写 design formula、shrinkage 方法、阈值。审稿人不知道你具体做了什么。解决方法:对每个分析步骤,至少写明工具+版本+关键参数+阈值。

  • 统计方法和实际操作不一致:Methods 写的是"BH 校正 FDR < 0.05",但代码里实际用的是 p.adjust(method = "bonferroni")。解决方法:Methods 写完后对照代码逐一核对,确保描述和操作一致。

  • 数据可用性声明缺失或不完整:被编辑退回要求补充数据访问方式,耽误 2-4 周。解决方法:初稿就写好 DAS,确认所有 URL 有效、accession number 正确。

  • Methods 太简略导致反复 revision:审稿人说"请补充 X 的细节",一来一回两个月。解决方法:宁可写多不要写少。正文放不下的内容放 Supplementary Methods,主动提供完整信息比被动补充高效得多。

下一步

接着深入: Methods 写完之后,最后一步是把数据和代码归档,确保可复现。10 数据与代码归档:可复现交付清单 会讲如何组织一个"clone 下来就能跑"的项目仓库。

横向延伸: 如果你对 RNA-seq 分析的更多方法学细节感兴趣,可以看 Bulk RNA-seq 专栏,那里对 DESeq2、edgeR、limma 的统计原理有更深入的解释。

参考资源

  • Greenland S et al. (2016) "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31:337-350. 统计学描述的规范写法参考。
  • Wilkinson MD et al. (2016) "The FAIR Guiding Principles for scientific data management and stewardship." Scientific Data 3:160018. FAIR 数据原则,理解数据可用性声明的意义。
  • Love MI, Huber W, Anders S (2014) "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biology 15:550. DESeq2 论文,Methods 写作时需要引用的核心参考。
  • Conesa A et al. (2016) "A survey of best practices for RNA-seq data analysis." Genome Biology 17:13. RNA-seq 分析最佳实践,Methods 写作的参考标准。
  • ICMJE (2023) "Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals." 期刊投稿的通用规范。
AI 陪学

让 AI 陪我学这一篇

AI 会读这篇文章后给你 3-5 步学习计划, 逐步陪你学完,最后出 1-3 道题验证你掌握得怎么样。 登录后 AI 才能记住你的进度。

静态文件

离线资料下载

手册 HTML / PDF 已在后台预生成,点击后直接下载网站静态资源。