cellmaps_ppidownloader package

Submodules

cellmaps_ppidownloader.cellmaps_ppidownloadercmd module

cellmaps_ppidownloader.cellmaps_ppidownloadercmd.main(args)[source]

Main entry point for program

Parameters:

args (list) – arguments passed to command line, usually sys.argv[1:]

Returns:

return value of cellmaps_ppidownloader.runner.CellmapsPPIDownloader.run() or 2 if an exception is raised

Return type:

int
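
The return-value contract described above (the runner's exit code on success, 2 when an exception is raised) can be sketched as follows. This is a minimal illustration and not the package's source: fake_run, main_sketch and the --fail flag are hypothetical stand-ins introduced only for this example.

```python
import sys


def fake_run(should_fail=False):
    """Hypothetical stand-in for CellmapsPPIDownloader.run()."""
    if should_fail:
        raise RuntimeError('simulated failure')
    return 0


def main_sketch(args):
    """Return the runner's exit code, or 2 if any exception is raised."""
    try:
        # '--fail' is a made-up flag used only to trigger the error path
        return fake_run(should_fail='--fail' in args)
    except Exception as e:
        sys.stderr.write('Caught exception: ' + str(e) + '\n')
        return 2
```

A caller would typically pass the result to sys.exit(), so that the shell sees 0 on success and a non-zero code otherwise.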

cellmaps_ppidownloader.exceptions module

exception cellmaps_ppidownloader.exceptions.CellMapsPPIDownloaderError[source]

Bases: Exception

Base exception for CellMapsPPIDownloader

cellmaps_ppidownloader.gene module

class cellmaps_ppidownloader.gene.APMSGeneNodeAttributeGenerator(apms_edgelist=None, apms_baitlist=None, genequery=<cellmaps_ppidownloader.gene.GeneQuery object>)[source]

Bases: GeneNodeAttributeGenerator

Creates APMS Gene Node Attributes table

Constructor

Parameters:
  • apms_edgelist (list) –

    list of dict elements where each dict is of format:

    {'GeneID1': VAL,
     'Symbol1': VAL,
     'GeneID2': VAL,
     'Symbol2': VAL}
    

  • apms_baitlist (list) –

    list of dict elements where each dict is of format:

    { 'GeneSymbol': VAL,
      'GeneID': VAL,
      'NumIteractors': VAL }
    

  • genequery

BAITLIST_GENE_ID = 'GeneID'
BAITLIST_GENE_SYMBOL = 'GeneSymbol'
BAITLIST_NUM_INTERACTORS = '# Interactors'
GENEID_COL1 = 'GeneID1'
GENEID_COL2 = 'GeneID2'
SYMBOL_COL1 = 'Symbol1'
SYMBOL_COL2 = 'Symbol2'
static get_apms_baitlist_from_tsvfile(tsvfile=None, symbol_col='GeneSymbol', geneid_col='GeneID', numinteractors_col='# Interactors')[source]

Generates a list of dicts by parsing the TSV file specified by tsvfile, which is expected to have the following header columns and corresponding values:

GeneSymbol  GeneID  # Interactors
Parameters:

tsvfile (str) – Path to TSV file with above format

Returns:

list of dicts, with each dict of format:

{ 'GeneSymbol': VAL,
  'GeneID': VAL,
  'NumIteractors': VAL }

Return type:

list
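
The parsing described above can be approximated with the standard csv module. This is a sketch under stated assumptions rather than the package's implementation: read_baitlist is a hypothetical helper, the sample rows are invented, and the output keys are copied as spelled in the documentation above.

```python
import csv
import io

# Invented sample data; only the header row follows the documented format
SAMPLE_TSV = ('GeneSymbol\tGeneID\t# Interactors\n'
              'BAIT1\t1001\t12\n'
              'BAIT2\t1002\t7\n')


def read_baitlist(handle, symbol_col='GeneSymbol', geneid_col='GeneID',
                  numinteractors_col='# Interactors'):
    """Parse a tab-delimited bait list into a list of dicts."""
    reader = csv.DictReader(handle, delimiter='\t')
    # Output keys are spelled as in the documentation above;
    # all values are left as strings, exactly as read from the file
    return [{'GeneSymbol': row[symbol_col],
             'GeneID': row[geneid_col],
             'NumIteractors': row[numinteractors_col]}
            for row in reader]


baits = read_baitlist(io.StringIO(SAMPLE_TSV))
```

The column-name parameters mirror the documented defaults, so a file with differently named columns can be read by overriding them.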

get_apms_edgelist()[source]

Gets apms edgelist passed in via constructor

Returns:

Return type:

list

static get_apms_edgelist_from_tsvfile(tsvfile=None, geneid_one_col='GeneID1', symbol_one_col='Symbol1', geneid_two_col='GeneID2', symbol_two_col='Symbol2')[source]

Generates a list of dicts by parsing the TSV file specified by tsvfile, which is expected to have the following header columns and corresponding values:

GeneID1     Symbol1 GeneID2 Symbol2
Parameters:

tsvfile (str) – Path to TSV file with above format

Returns:

list of dicts, with each dict of format:

{'GeneID1': VAL,
 'Symbol1': VAL,
 'GeneID2': VAL,
 'Symbol2': VAL}

Return type:

list

get_gene_node_attributes()[source]

Gets gene node attributes, which are output as a list of dicts in this format:

{ 'GENEID': { 'name': 'GENESYMBOL',
              'represents': 'ensemble:ENSEMBLID1;ENSEMBLID2..',
              'ambiguous': 'ALTERNATE GENEs' }
}
Returns:

(list of dicts containing gene node attributes, list of str describing any errors encountered)

Return type:

tuple
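
To illustrate how the documented attribute format might be consumed, the sketch below splits a 'represents' value into its prefix and individual IDs. The helper split_represents and the sample gene are hypothetical; only the dictionary shape is taken from the description above.

```python
def split_represents(represents):
    """Split 'PREFIX:ID1;ID2' into (prefix, [ID1, ID2])."""
    prefix, _, ids = represents.partition(':')
    return prefix, [i for i in ids.split(';') if i]


# Invented sample in the documented shape
sample_attrs = {'1001': {'name': 'BAIT1',
                         'represents': 'ensemble:ENSG0001;ENSG0002',
                         'ambiguous': ''}}

for geneid, attrs in sample_attrs.items():
    prefix, ensembl_ids = split_represents(attrs['represents'])
```

Because the method returns a tuple, callers would normally unpack it into the attributes and the list of error strings before processing either one.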

class cellmaps_ppidownloader.gene.CM4AIGeneNodeAttributeGenerator(apms_edgelist=None, genequery=<cellmaps_ppidownloader.gene.GeneQuery object>)[source]

Bases: GeneNodeAttributeGenerator

Creates APMS Gene Node Attributes table from CM4AI data

Constructor

Parameters:
  • apms_edgelist (list) –

    list of dict elements where each dict is of format:

    {'Bait': VAL,
     'Prey': VAL,
     'logOddsScore': VAL,
     'FoldChange.x': VAL,
     'BFDR.x': VAL}
    

  • genequery

get_apms_edgelist()[source]

Gets apms edgelist

Returns:

Return type:

list

static get_apms_edgelist_from_tsvfile(tsvfile=None, bait_col='Bait', prey_col='Prey', bfdr_col=None, foldchange_col=None, foldchange_cutoff=0.0, bfdr_maxcutoff=0.05)[source]

Generates a list of dicts by parsing the TSV file specified by tsvfile, which is expected to have the following header columns and corresponding values:

Bait        Prey    BFDR.x  FoldChange.x

Note

If the BFDR.x column does not exist, no BFDR filtering will occur. The same applies to FoldChange filtering if the FoldChange.x column does not exist.

Parameters:
  • tsvfile (str) – Path to TSV file with above format

  • bait_col (str) – Name of bait column

  • prey_col (str) – Name of prey column

  • bfdr_col (str) – Name of BFDR (false discovery rate) column. If None, no BFDR filtering will occur

  • foldchange_col (str) – Name of FoldChange column. If None, no FoldChange filtering will occur

  • foldchange_cutoff (float) – FoldChange cutoff. Only keep rows with values greater than this value. If this value is None, no filtering will occur

  • bfdr_maxcutoff (float) – BFDR cutoff. Only keep rows with BFDR less than or equal to this value. If this value is None, no filtering will occur

Returns:

list of dicts, with each dict of format:

{'Bait': VAL,
 'Prey': VAL}

Return type:

list
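
The two cutoffs described in the parameters above can be sketched as follows. This is an illustrative reimplementation, not the package's code: read_filtered_edges is a hypothetical helper and the sample rows are invented. It keeps a row only if its BFDR is less than or equal to bfdr_maxcutoff and its fold change is strictly greater than foldchange_cutoff, and it skips a filter when the corresponding column name or cutoff is None.

```python
import csv
import io

# Invented sample rows using the documented header
SAMPLE_TSV = ('Bait\tPrey\tBFDR.x\tFoldChange.x\n'
              'A\tB\t0.01\t2.5\n'
              'A\tC\t0.20\t3.0\n'
              'A\tD\t0.05\t0.0\n')


def read_filtered_edges(handle, bait_col='Bait', prey_col='Prey',
                        bfdr_col=None, foldchange_col=None,
                        foldchange_cutoff=0.0, bfdr_maxcutoff=0.05):
    """Parse a tab-delimited edge list, applying the optional cutoffs."""
    edges = []
    for row in csv.DictReader(handle, delimiter='\t'):
        # Drop rows whose BFDR exceeds the maximum cutoff
        if bfdr_col is not None and bfdr_maxcutoff is not None:
            if float(row[bfdr_col]) > bfdr_maxcutoff:
                continue
        # Drop rows whose fold change is not strictly above the cutoff
        if foldchange_col is not None and foldchange_cutoff is not None:
            if float(row[foldchange_col]) <= foldchange_cutoff:
                continue
        edges.append({'Bait': row[bait_col], 'Prey': row[prey_col]})
    return edges


unfiltered = read_filtered_edges(io.StringIO(SAMPLE_TSV))
filtered = read_filtered_edges(io.StringIO(SAMPLE_TSV),
                               bfdr_col='BFDR.x',
                               foldchange_col='FoldChange.x')
```

With the sample data, the second row fails the BFDR test and the third fails the fold-change test, so only the first edge survives filtering.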

get_gene_node_attributes()[source]

Gets gene node attributes, which are output as a list of dicts in this format:

{ 'GENEID': { 'name': 'GENESYMBOL',
              'represents': 'ensemble:ENSEMBLID1;ENSEMBLID2..',
              'ambiguous': 'ALTERNATE GENEs',
              'bait': True or False}
}
Returns:

(list of dicts containing gene node attributes, list of str describing any errors encountered)

Return type:

tuple

class cellmaps_ppidownloader.gene.GeneNodeAttributeGenerator[source]

Bases: object

Base class for GeneNodeAttribute Generator

Constructor

static add_geneids_to_set(gene_set=None, ambiguous_gene_dict=None, geneid=None)[source]

Examines the geneid passed in; if the value contains a comma, it is split on commas and assumed to hold multiple genes. Adds those genes to gene_set and adds an entry to ambiguous_gene_dict for each, with the key set to the gene name and the value set to the original geneid value

Parameters:
  • gene_set (set) – unique set of genes

  • geneid (str) – name of gene or comma delimited string of genes

Returns:

genes found in geneid or None if gene_set or geneid is None

Return type:

list
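
The comma-splitting behavior described above can be sketched like this. It is a hedged approximation rather than the package's source: the function name carries a _sketch suffix to mark it as hypothetical, and recording ambiguous entries only when more than one gene is found is an assumption.

```python
def add_geneids_to_set_sketch(gene_set=None, ambiguous_gene_dict=None,
                              geneid=None):
    """Add the gene(s) named in geneid to gene_set and return them."""
    if gene_set is None or geneid is None:
        return None
    # A value without a comma yields a single-element list
    genes = geneid.split(',')
    # Assumption: only comma-delimited values count as ambiguous
    if len(genes) > 1 and ambiguous_gene_dict is not None:
        for gene in genes:
            ambiguous_gene_dict[gene] = geneid
    gene_set.update(genes)
    return genes


genes_seen = set()
ambiguous = {}
first = add_geneids_to_set_sketch(genes_seen, ambiguous, 'A,B')
second = add_geneids_to_set_sketch(genes_seen, ambiguous, 'C')
```

Both gene_set and ambiguous_gene_dict are modified in place, which lets a caller accumulate results across many rows of an edge list.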

get_gene_node_attributes()[source]

Should be implemented by subclasses

Raises:

NotImplementedError – Always
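
The base-class contract above (the base method always raises, subclasses override it) follows a common Python pattern, sketched below. BaseGenerator and StaticGenerator are hypothetical names used only for illustration; the (attributes, errors) return shape is borrowed from the subclasses documented earlier.

```python
class BaseGenerator(object):
    """Hypothetical stand-in for a generator base class."""

    def get_gene_node_attributes(self):
        raise NotImplementedError('Subclasses should implement')


class StaticGenerator(BaseGenerator):
    """Hypothetical subclass that returns pre-computed attributes."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_gene_node_attributes(self):
        # Mirrors the documented (attributes, errors) tuple shape
        return self._attributes, []


try:
    BaseGenerator().get_gene_node_attributes()
    base_raised = False
except NotImplementedError:
    base_raised = True

attrs, errors = StaticGenerator({'1001': {'name': 'BAIT1'}}).get_gene_node_attributes()
```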

class cellmaps_ppidownloader.gene.GeneQuery(mygeneinfo=<mygene.MyGeneInfo object>)[source]

Bases: object

Gets information about genes from mygene

Constructor

get_symbols_for_genes(genelist=None, scopes='_id')[source]

Queries for genes via GeneQuery() object passed in via constructor

Parameters:
  • genelist (list) – genes to query for valid symbols and ensembl ids

  • scopes (str) – field to query on: _id for gene id, ensembl.gene for Ensembl IDs

Returns:

result from mygene which is a list of dict objects where each dict is of format:

{ 'query': 'ID',
  '_id': 'ID', '_score': #.##,
  'ensembl': { 'gene': 'ENSEMBLEID' },
  'symbol': 'GENESYMBOL' }

Return type:

list
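
Results in the format shown above could be condensed into a lookup table as in the sketch below. The helper summarize_hits and the sample hits are hypothetical. The sketch also assumes that the 'ensembl' value may arrive as a list of dicts when a gene maps to several Ensembl IDs, and that a hit may lack 'symbol' or 'ensembl' entirely; neither case is stated in the documentation above.

```python
def summarize_hits(hits):
    """Map each query to its symbol and list of Ensembl gene IDs."""
    summary = {}
    for hit in hits:
        ensembl = hit.get('ensembl', [])
        # Normalize the single-dict form to a one-element list
        if isinstance(ensembl, dict):
            ensembl = [ensembl]
        summary[hit['query']] = {
            'symbol': hit.get('symbol'),
            'ensembl_ids': [e['gene'] for e in ensembl if 'gene' in e]}
    return summary


# Invented sample hits in the documented shape
sample_hits = [
    {'query': '1001', '_id': '1001', '_score': 1.0,
     'ensembl': {'gene': 'ENSG0001'}, 'symbol': 'BAIT1'},
    {'query': '1002', '_id': '1002', '_score': 1.0,
     'ensembl': [{'gene': 'ENSG0002'}, {'gene': 'ENSG0003'}],
     'symbol': 'BAIT2'},
    {'query': '9999'},
]
summary = summarize_hits(sample_hits)
```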

querymany(queries, species=None, scopes=None, fields=None)[source]

Simple wrapper that calls MyGene querymany returning the results

Parameters:
  • queries (list) – list of gene ids/symbols to query

  • species (str)

  • scopes (str)

  • fields (list)

Returns:

dict from MyGene usually in format of

Return type:

list

cellmaps_ppidownloader.runner module

class cellmaps_ppidownloader.runner.CellmapsPPIDownloader(outdir=None, imgsuffix='.jpg', apmsgen=None, skip_logging=True, provenance=None, input_data_dict=None, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>, skip_failed=False)[source]

Bases: object

Downloads AP-MS protein-protein interaction data, and registers datasets for provenance tracking in FAIRSCAPE.

Constructor

Parameters:
  • outdir (str) – directory where output files will be written

  • apmsgen (APMSGeneNodeAttributeGenerator) – gene node attribute generator for APMS data

  • skip_logging (bool) – If True skip logging, if None or False do NOT skip logging

  • provenance (dict) –

    Provenance information about input files as dictionary.

    Example:

    {
         'name': 'Example input dataset',
         'organization-name': 'CM4AI',
         'project-name': 'Example',
         'edgelist': {
             'name': 'sample edgelist',
             'author': 'Krogan Lab',
             'version': '1.0',
             'date-published': '07-31-2023',
             'description': 'AP-MS Protein interactions on HSC2 cell line, example dataset',
             'data-format': 'tsv'
         },
         'baitlist': {
             'name': 'sample baitlist',
             'author': 'Krogan Lab',
             'version': '1.0',
             'date-published': '07-31-2023',
             'description': 'AP-MS Baits used for Protein interactions on HSC2 cell line',
             'data-format': 'tsv'
         }
     }
    

  • input_data_dict (dict) –

    All attributes and their corresponding values of the input data e.g.

    {'outdir': 'test', 'baitlist': 'path/to/file/with/baitlist'}
    

  • imgsuffix (str) –

    Unused parameter.

    Deprecated since version 0.2.2.

    The imgsuffix parameter is deprecated and will be removed in a future release.
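
Before passing a provenance dictionary to the constructor, it can be useful to check that it has the same keys as the example above. The sketch below is one way to do that. The helper missing_provenance_keys is hypothetical, and treating every key from the example as required is an assumption, since the documentation does not say which fields are mandatory.

```python
# Key names copied from the provenance example above
TOP_LEVEL_KEYS = ('name', 'organization-name', 'project-name')
DATASET_KEYS = ('name', 'author', 'version', 'date-published',
                'description', 'data-format')


def missing_provenance_keys(provenance, sections=('edgelist', 'baitlist')):
    """Return a list of keys absent from a provenance dict."""
    missing = [k for k in TOP_LEVEL_KEYS if k not in provenance]
    for section in sections:
        entry = provenance.get(section)
        if entry is None:
            missing.append(section)
            continue
        # Report nested keys as 'section.key'
        missing.extend(section + '.' + k
                       for k in DATASET_KEYS if k not in entry)
    return missing


partial = {'name': 'Example input dataset',
           'organization-name': 'CM4AI',
           'edgelist': {'name': 'sample edgelist',
                        'author': 'Krogan Lab',
                        'version': '1.0',
                        'date-published': '07-31-2023',
                        'description': 'example dataset'}}
problems = missing_provenance_keys(partial)
```

For the partial dictionary above, the helper reports the absent project name, the absent data-format under edgelist, and the absent baitlist section.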

BAITLIST_FILEKEY = 'baitlist'
CM4AI_ROCRATE = 'cm4ai_rocrate'
EDGELIST_FILEKEY = 'edgelist'
generate_readme()[source]
static get_example_provenance(requiredonly=True, with_ids=False)[source]

Gets a dict of provenance parameters needed to add/register a dataset with FAIRSCAPE

Parameters:
  • requiredonly (bool) – If True only output required fields, otherwise output all fields. This value is ignored if with_ids is True

  • with_ids (bool) – If True only output the fields to set dataset guids and ignore value of requiredonly parameter.

Returns:

get_ppi_edgelist_file()[source]
Returns:

get_ppi_gene_node_attributes_file()[source]

Gets full path to ppi gene node attribute file under output directory created when invoking run()

Returns:

Path to file

Return type:

str

get_ppi_gene_node_errors_file()[source]

Gets full path to ppi gene node attribute errors file under output directory created when invoking run()

Returns:

Path to file

Return type:

str

run()[source]

Downloads ppi data to output directory specified in constructor

Raises:

CellMapsPPIDownloaderError – If there is an error

Returns:

0 upon success, otherwise failure

Module contents

Top-level package for cellmaps_ppidownloader.