Paper DV03
Using Sankey Diagram to Analyze Drug Pipeline
Tanmay Khole, Bristol-Myers Squibb, Berkeley Heights NJ, USA
ABSTRACT
Sankey diagrams are a specific type of flow diagram in which the width of each arrow is proportional to the flow quantity. They put a visual emphasis on the major transfers or flows within a system and help locate the dominant contributions to an overall flow. This paper focuses on the drug pipeline of a sponsor and uses data from clinicaltrials.gov to analyze the number of clinical trials a sponsor has with respect to conditions, interventions, and phases. The results are visualized as a Sankey diagram that shows the weight a sponsor has given to a drug or a condition across the phases of clinical trials. A drug pipeline gives an idea of the future of a company, and this paper takes a closer look at some of its aspects using Sankey diagrams.
INTRODUCTION
This paper analyzes data from clinicaltrials.gov for a few selected clinical trial sponsors and uses it to create Sankey diagrams. A Sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankey diagrams are best used to show a many-to-many mapping between two domains or multiple paths through a set of stages. Data from clinicaltrials.gov suits this well: it lets us analyze a sponsor’s drug pipeline and see which clinical conditions or interventions the sponsor focuses on at each stage of clinical trials. Data mapping, data analysis, and data visualization are used to create the Sankey diagrams displayed in this paper. Phase 1 clinical trials are excluded from the analysis and visualization so that the flow of trials in Phases 2-4 is easier to follow. Data is obtained in CSV format from clinicaltrials.gov using the advanced search option and searching only on the sponsor field. Analysis is performed on trials with status "Active, not recruiting", "Available", "Enrolling by invitation", "Not yet recruiting", or "Recruiting".

SANKEY DIAGRAM FOR CLINICALTRIALS.GOV DATA

Data obtained from clinicaltrials.gov in CSV format has one record per trial (see Figure 1). Before it can be used for a Sankey diagram, it needs to be processed in the following steps:

• Data Mapping
• Data Analysis
• Data Visualization

Figure 1: Data obtained from clinicaltrials.gov and imported into SAS® dataset.

The sponsors listed in Table 1 are considered in this paper for data analysis and for creating Sankey diagrams of each sponsor’s on-going clinical trials. Clinical trials with status "Active, not recruiting", "Available", "Enrolling by invitation", "Not yet recruiting", or "Recruiting" are considered on-going. Only clinical trials in which the sponsor is the lead sponsor are selected.

Sponsor                            Distinct On-going Clinical Trials Count   Data Extraction Date
Sponsor 1   Bristol-Myers Squibb   250                                       22NOV2019
Sponsor 2   Janssen                126
Sponsor 3   Merck & Co.            173
Sponsor 4   Amgen                  56
Sponsor 5   Bayer                  56

Table 1: List of Sponsors

22JAN

The mapping rules below are applied before the data analysis step. They are designed to identify the sponsor’s focus with regard to clinical conditions and interventions.

• Clinical trials with multiple phases are mapped to the higher phase
• Clinical trials with multiple clinical conditions are mapped to each condition
• Clinical trials with multiple interventions are mapped to each intervention of the respective sponsor

Example 1: Clinical trial NCT03331198, titled “Study Evaluating Safety and Efficacy of JCAR017 in Subjects With Relapsed or Refractory Chronic Lymphocytic Leukemia (CLL) or Small Lymphocytic Lymphoma (SLL)”, has a trial design covering Phase 1 and Phase 2. As per the mapping rules, it is mapped to Phase 2 only. This trial also lists multiple clinical conditions (Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma) and is mapped to each of them.
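
The first two mapping rules, applied to Example 1, can be sketched in a SAS data step as follows. This is only an illustration, not the author's mapping code: the dataset and variable names (trials, phases, conditions, mapped) are hypothetical, and the sketch assumes that the export lists multiple phases and conditions in a single field separated by "|".

```sas
/* Hypothetical input: one record per trial, with pipe-delimited lists (assumed layout) */
data trials;
  length nct_number $11 phases $40 conditions $200;
  nct_number = 'NCT03331198';
  phases     = 'Phase 1|Phase 2';
  conditions = 'Chronic Lymphocytic Leukemia|Small Lymphocytic Lymphoma';
run;

data mapped;
  set trials;
  length phase $10 condition $100;

  /* Rule 1: keep only the highest phase number listed for the trial */
  phase_num = .;
  do i = 1 to countw(phases, '|');
    /* compress(..., , 'kd') keeps only the digits, e.g. 'Phase 2' -> '2' */
    phase_num = max(phase_num, input(compress(scan(phases, i, '|'), , 'kd'), 8.));
  end;
  phase = catx(' ', 'Phase', phase_num);

  /* Rule 2: write one output row per listed condition */
  do i = 1 to countw(conditions, '|');
    condition = strip(scan(conditions, i, '|'));
    output;
  end;

  drop i phases conditions;
run;
```

Under these assumptions the trial yields two rows in the output dataset, one per condition, each carrying Phase 2.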

Example 2: Clinical trial NCT04088500, titled “A Study of Combination Nivolumab and Ipilimumab Retreatment in Patients With Advanced Renal Cell Carcinoma”, has multiple interventions: nivolumab and ipilimumab. As per the mapping rules, this trial is mapped to each intervention listed.

Example 3: Clinical trial NCT03036098, titled “Study of Nivolumab in Combination With Ipilimumab or Standard of Care Chemotherapy Compared to the Standard of Care Chemotherapy Alone in Treatment of Patients With Untreated Inoperable or Metastatic Urothelial Cancer”, has multiple interventions: nivolumab, ipilimumab, gemcitabine, cisplatin, and carboplatin. Only the first two are the sponsor’s compounds, so this trial is mapped to two interventions: nivolumab and ipilimumab.

Data mapping for this paper is performed by creating flags/identifiers for each condition and intervention listed in the respective sponsor’s clinical trials data. Each sponsor listed in Table 1 has unique compounds, so each compound/intervention has to be mapped by closely inspecting the data.

Data obtained from clinicaltrials.gov has one record per trial (horizontal data format). It needs to be transformed into a vertical data format, as shown in Figure 5, by using the flags created for each condition category and intervention.

Figure 5 : Horizontal data mapped and transformed into vertical data format
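
The flag-based mapping and the horizontal-to-vertical transformation described above could look roughly like the sketch below, which uses the trial from Example 3. The flag names (fl_nivolumab, fl_ipilimumab), the dataset names, and the "Drug: name|Drug: name" layout of the interventions field are assumptions made for illustration; the paper does not show the author's actual flag definitions.

```sas
/* Hypothetical horizontal record: all interventions of a trial in one field */
data trials;
  length nct_number $11 interventions $200;
  nct_number    = 'NCT03036098';
  interventions = 'Drug: Nivolumab|Drug: Ipilimumab|Drug: Gemcitabine|Drug: Cisplatin|Drug: Carboplatin';
run;

/* Step 1: one flag per sponsor compound (1 = listed for the trial, 0 = not listed) */
data flagged;
  set trials;
  /* findw looks for the whole word, case-insensitively, using blank, | and : as delimiters */
  fl_nivolumab  = (findw(interventions, 'Nivolumab',  ' |:', 'i') > 0);
  fl_ipilimumab = (findw(interventions, 'Ipilimumab', ' |:', 'i') > 0);
run;

/* Step 2: vertical format, one row per trial and flagged intervention */
data vertical;
  set flagged;
  length intervention $40;
  if fl_nivolumab then do;
    intervention = 'Nivolumab';
    output;
  end;
  if fl_ipilimumab then do;
    intervention = 'Ipilimumab';
    output;
  end;
  keep nct_number intervention;
run;
```

Because only the sponsor's own compounds have flags, the chemotherapy agents in the list produce no rows, which matches the outcome described in Example 3.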

DATA ANALYSIS

Data analysis is performed by counting the objects in each category that needs to be displayed in the Sankey diagram. The categories are used as nodes, and the counts determine the width of the links between the selected categories.

In this paper, data analysis is performed by counting clinical trials with respect to sponsor, conditions, interventions, and phases. This step is performed after data mapping to ensure that links and nodes are connected correctly. The SAS® macro %sankey_nodes is used for data analysis; reference code can be found in the appendix.

%sankey_nodes(inds  = ct_gov
             ,outds = sankey_out
             ,nodes = %str(sponsor|conditions|interventions|phases)
             ,cond  =
             );

%sankey_nodes calculates the number of objects, in this case the number of clinical trials. The mapped data is passed in the “inds” macro parameter. The nodes (categories) to be displayed in the Sankey diagram are listed in the “nodes” macro parameter, separated by “|”, and an optional filter condition can be given in the “cond” macro parameter. The macro creates an output dataset (named in the “outds” macro parameter) and a global macro variable &sankeydata. that holds the link data for the Sankey diagram. Both are used in the data visualization step to create the Sankey diagram.
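
To make the counting step concrete, the sketch below does by hand, for a single pair of nodes, roughly what %sankey_nodes does for every adjacent pair. It is a simplified illustration rather than the macro itself: the toy data, the NCT numbers, the drug names, and the names links and linkdata are made up, and unlike the appendix macro it does not group by the preceding nodes.

```sas
/* Hypothetical mapped data in vertical format: one row per trial, condition, and intervention */
data ct_gov;
  length nct_number $11 conditions $40 interventions $40;
  input nct_number $ conditions $ interventions $;
  datalines;
NCT0000001 Lymphoma DrugA
NCT0000002 Lymphoma DrugA
NCT0000003 Lymphoma DrugB
NCT0000004 Melanoma DrugB
;
run;

proc sql;
  /* Count rows per (source, target) pair; the count drives the link width */
  create table links as
    select conditions    as source length=100,
           interventions as target length=100,
           count(*)      as value
    from ct_gov
    group by conditions, interventions;

  /* Join the per-link strings into one macro variable, similar to &sankeydata. */
  select cats("{'source':'", source, "','target':'", target, "','value':", value, "},")
    into :linkdata separated by ' '
    from links;
quit;

/* Write the assembled link data to the log */
%put &=linkdata;
```

With this toy input the Lymphoma to DrugA link gets a value of 2 and the other two links a value of 1, so the first link would be drawn twice as wide.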

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses.

Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline but are based on the data acquired from clinicaltrials.gov.

SANKEY DIAGRAM 1

Sponsor: Bristol-Myers Squibb

SANKEY DIAGRAM 2

Sponsor: Janssen

SANKEY DIAGRAM 4

Sponsor: Amgen

SANKEY DIAGRAM 5

Sponsor: Bayer

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author

at:

Author Name: Tanmay Khole

Company: Bristol-Myers Squibb

Address: 300 Connell Drive, Berkeley Heights

City / Postcode: NJ 07922

Email: tkhole@celgene.com

Brand and product names are trademarks of their respective companies.

APPENDIX

%macro sankey_nodes(inds=, outds=, nodes=, cond=);
  %let cnt = %eval(%sysfunc(countc(&nodes.,"|")) + 1);
  %put &cnt.;

  data inds;
    set &inds.;
  run;

  %do i = 1 %to &cnt;
    %let single&i. = %scan(&nodes, &i, '|');
    %put single&i. = &&single&i;
  %end;

  proc sql;
    %do i = 1 %to %eval(&cnt. - 1);
      create table &&single&i.._wt_chk as
        select distinct
          %do k = 1 %to &i.;
            &&single&k.,
          %end;
          %superq(single%eval(&i. + 1)),
          &&single&i. as SOURCE length=100,
          %superq(single%eval(&i. + 1)) as TARGET length=100,
          count(&&single&i.) as VALUE,
          " {'source':'"||strip(&&single&i.)||"','target':'"||strip(%superq(single%eval(&i. + 1)))||"','value':"||strip(put(count(&&single&i.), 5.0))||"}," as final length=1000
        from inds
        %if &cond. ne %then %do;
          where &cond.
        %end;
        group by
          %do k = 1 %to &i.;
            &&single&k.,
          %end;
          %superq(single%eval(&i. + 1))
      ;
    %end;
  quit;

  data &outds.;
    set
      %do i = 1 %to %eval(&cnt. - 1);
        &&single&i.._wt_chk
      %end;
    ;
  run;

  options linesize=max;
  %global sankeydata;
  proc sql noprint;
    select final into: sankeydata separated by " "
      from &outds.;
  quit;
  %put &sankeydata.;
%mend sankey_nodes;

%macro sankey2html(indata=, outfl=, width=, height=, flow_num=);
  data _null_;
    file "&outfl.";
    put '';
    put '';
    put '';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' <script';
    put ' type="text/javascript"';
    put ' src="https://d3js.org/d3.v3.js"';
    put ' ';
    put ' >';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put ' ';
    put '