Datasets

class torchxrayvision.datasets.Dataset

The datasets in this library aim to fit a simple interface in which the imgpath and csvpath are specified. Some datasets require more than one metadata file, and for others the metadata files are packaged with the library, so only the imgpath needs to be specified.

pathologies: List[str]: A list of strings identifying the pathologies contained in this dataset. This list corresponds to the columns of the .labels matrix. Although it is called pathologies, the contents do not have to be pathologies and may simply be attributes of the patient.

labels: ndarray: A NumPy array which contains a 1, 0, or NaN for each pathology. Each column is a pathology and each row corresponds to an item in the dataset. A 1 represents that the pathology is present, 0 represents the pathology is absent, and NaN represents no information.
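As an illustration of this encoding, the sketch below builds a small hand-made labels matrix (the pathology names and values are made up, not taken from any dataset) and computes, for each column, how many items have a known label and what fraction of the known labels are positive. It uses only NumPy and does not call the library.

```python
import numpy as np

# Hypothetical pathology names and labels, for illustration only.
pathologies = ["Atelectasis", "Cardiomegaly", "Effusion"]
labels = np.array([
    [1.0, 0.0, np.nan],
    [0.0, np.nan, 1.0],
    [1.0, 0.0, 0.0],
    [np.nan, 1.0, 1.0],
])

# True where a label is known (1 or 0), False where it is NaN.
known = ~np.isnan(labels)
known_counts = known.sum(axis=0)

# Fraction of positive labels among the known ones, ignoring NaN entries.
prevalence = np.nanmean(labels, axis=0)

for name, n, p in zip(pathologies, known_counts, prevalence):
    print(f"{name}: {n} known labels, {p:.2f} positive")
```

Treating NaN as "no information" rather than as 0 keeps unlabeled items from being counted as negatives.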

csv: DataFrame: A Pandas DataFrame of the metadata .csv file that is included with the data. For some datasets multiple metadata files have been merged together. It is largely a “catch-all” for associated data and the referenced publication should explain each field. Each row aligns with the elements of the dataset so indexing using .iloc will work. Alignment between the DataFrame and the dataset items will be maintained when using tools from this library.
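The row alignment can be used to select items by their metadata. The sketch below uses a stand-in DataFrame and labels array (the column names `patientid` and `view` and all values are hypothetical) to show that the same positional indices select matching rows in both structures.

```python
import numpy as np
import pandas as pd

# Stand-ins for dataset.csv and dataset.labels; all values are made up.
csv = pd.DataFrame({
    "patientid": [10, 10, 11, 12],
    "view": ["PA", "AP", "PA", "PA"],
})
labels = np.array([
    [1.0, 0.0],
    [0.0, np.nan],
    [np.nan, 1.0],
    [0.0, 0.0],
])

# Positional indices of the rows whose metadata matches a condition.
idx = np.flatnonzero((csv["view"] == "PA").to_numpy())

# Because row i of csv describes item i, the same positions select
# the matching metadata rows (via .iloc) and the matching label rows.
pa_meta = csv.iloc[idx]
pa_labels = labels[idx]

print(pa_meta)
print(pa_labels)
```

This sketch only slices copies of the two structures; it does not modify a dataset object.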

totals() → Dict[str, Dict[str, int]]

Compute counts of pathologies.

Returns: A dict mapping each pathology name to a dict of label value -> count.
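To make the return structure concrete, here is a minimal sketch of how such counts could be derived from a labels matrix with NumPy. It is not the library's implementation: the function name `count_labels` and the sample data are hypothetical, and skipping NaN entries is an assumption of this sketch.

```python
import numpy as np

def count_labels(pathologies, labels):
    """Return {pathology: {label_value: count}}, skipping NaN entries."""
    totals = {}
    for name, column in zip(pathologies, labels.T):
        known = column[~np.isnan(column)]
        values, counts = np.unique(known, return_counts=True)
        totals[name] = {float(v): int(c) for v, c in zip(values, counts)}
    return totals

# Hypothetical data for illustration only.
pathologies = ["Atelectasis", "Cardiomegaly"]
labels = np.array([
    [1.0, 0.0],
    [0.0, np.nan],
    [1.0, 0.0],
])

print(count_labels(pathologies, labels))
```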

__repr__() → str

Returns the name and a description of the dataset such as:

CheX_Dataset num_samples=191010 views=['PA', 'AP']

In a Jupyter notebook, it also prints the pathology counts returned by .totals():

{'Atelectasis': {0.0: 17621, 1.0: 29718},
 'Cardiomegaly': {0.0: 22645, 1.0: 23384},
 'Consolidation': {0.0: 30463, 1.0: 12982},
 ...}

class torchxrayvision.datasets.NIH_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', bbox_list_path='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, seed=0, unique_patients=True, pathology_masks=False)

NIH ChestX-ray14 dataset

The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with fourteen text-mined disease labels (each image can have multiple labels), mined from the associated radiological reports using natural language processing. The fourteen common thoracic pathologies are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, an extension of the 8 common disease patterns listed in the dataset authors' CVPR 2017 paper. Note that the original radiology reports (associated with these chest X-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. More details and benchmark performance of trained models based on the 14 disease labels are in the arXiv paper: https://arxiv.org/abs/1705.02315

Dataset release website: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

Download full size images here: https://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a

Download resized (224x224) images here: https://academictorrents.com/details/e615d3aebce373f1dc8bd9d11064da55bdadede0

class torchxrayvision.datasets.RSNA_Pneumonia_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', dicomcsvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, pathology_masks=False, extension='.jpg')

RSNA Pneumonia Detection Challenge

Citation:

Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Shih, George, Wu, Carol C., Halabi, Safwan S., Kohli, Marc D., Prevedello, Luciano M., Cook, Tessa S., Sharma, Arjun, Amorosa, Judith K., Arteaga, Veronica, Galperin-Aizenberg, Maya, Gill, Ritu R., Godoy, Myrna C.B., Hobbs, Stephen, Jeudy, Jean, Laroia, Archana, Shah, Palmi N., Vummidi, Dharshan, Yaddanapudi, Kavitha, and Stein, Anouk. Radiology: Artificial Intelligence, 1 2019. doi: 10.1148/ryai.2019180041.

More info: https://www.rsna.org/en/education/ai-resources-and-training/ai-image-challenge/RSNA-Pneumonia-Detection-Challenge-2018

Challenge site: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

JPG files stored here: https://academictorrents.com/details/95588a735c9ae4d123f3ca408e56570409bcf2a9

class torchxrayvision.datasets.NIH_Google_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, unique_patients=True)

A relabelling of a subset of images from the NIH dataset. The data tables should be applied against an NIH download. Test and validation splits are provided in the original release. They are combined here, but either one can be used on its own by passing the original csv to the csvpath argument.

Citation:

Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Anna Majkowska, Sid Mittal, David F. Steiner, Joshua J. Reicher, Scott Mayer McKinney, Gavin E. Duggan, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Sreenivasa Raju Kalidindi, Alexander Ding, Greg S. Corrado, Daniel Tse, and Shravya Shetty. Radiology, 2020.

https://pubs.rsna.org/doi/10.1148/radiol.2019191293

NIH data can be downloaded here: https://academictorrents.com/details/e615d3aebce373f1dc8bd9d11064da55bdadede0

class torchxrayvision.datasets.PC_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, flat_dir=True, seed=0, unique_patients=True)

PadChest dataset from the Hospital San Juan de Alicante - University of Alicante

Note that images with null labels (as opposed to normal), and images that cannot be properly loaded (listed as ‘missing’ in the code) are excluded, which makes the total number of available images slightly less than the total number of image files.
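The effect of these exclusions can be pictured with a small pandas sketch. The column names `ImageID` and `Labels`, the file names, and the `missing` list below are illustrative assumptions, and the code is not the library's actual filtering logic.

```python
import pandas as pd

# Hypothetical metadata: one row per image file.
csv = pd.DataFrame({
    "ImageID": ["a.png", "b.png", "c.png", "d.png"],
    "Labels": ["['normal']", None, "['pneumonia']", "['normal']"],
})

# Hypothetical list of files that cannot be loaded properly.
missing = ["d.png"]

# Drop rows with null labels; rows labelled 'normal' are kept.
filtered = csv[csv["Labels"].notnull()]

# Drop rows whose image is listed as unloadable.
filtered = filtered[~filtered["ImageID"].isin(missing)]

print(len(csv), "image files,", len(filtered), "usable items")
```

A row labelled normal carries information (no findings), whereas a null label carries none, which is why only the latter is dropped.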

Citation:

PadChest: A large chest x-ray image dataset with multi-label annotated reports. Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. arXiv preprint, 2019. https://arxiv.org/abs/1901.07441

Dataset website: http://bimcv.cipf.es/bimcv-projects/padchest/

Download full size images here: https://academictorrents.com/details/dec12db21d57e158f78621f06dcbe78248d14850

Download resized (224x224) images here (recropped): https://academictorrents.com/details/96ebb4f92b85929eadfb16761f310a6d04105797

class torchxrayvision.datasets.CheX_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, flat_dir=True, seed=0, unique_patients=True)

CheXpert Dataset

Citation:

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. Jeremy Irvin *, Pranav Rajpurkar *, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng. https://arxiv.org/abs/1901.07031

Dataset website here: https://stanfordmlgroup.github.io/competitions/chexpert/

A small validation set is also provided with the data, but it is so small that it is not included here.

class torchxrayvision.datasets.MIMIC_Dataset(imgpath, csvpath, metacsvpath, views=['PA'], transform=None, data_aug=None, seed=0, unique_patients=True)

MIMIC-CXR Dataset

Citation:

Johnson AE, Pollard TJ, Berkowitz S, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019 Jan 21.

https://arxiv.org/abs/1901.07042

Dataset website here: https://physionet.org/content/mimic-cxr-jpg/2.0.0/

class torchxrayvision.datasets.Openi_Dataset(imgpath, xmlpath='USE_INCLUDED_FILE', dicomcsv_path='USE_INCLUDED_FILE', tsnepacsv_path='USE_INCLUDED_FILE', use_tsne_derived_view=False, views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, unique_patients=True)

OpenI Dataset

Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 2016. doi: 10.1093/jamia/ocv080.

Views have been determined by projection using t-SNE. To use the t-SNE derived view rather than the view defined by the record, set use_tsne_derived_view to True.

Dataset website: https://openi.nlm.nih.gov/faq

Download images: https://academictorrents.com/details/5a3a439df24931f410fac269b87b050203d9467d

class torchxrayvision.datasets.COVID19_Dataset(imgpath: str, csvpath: str, views=['PA', 'AP'], transform=None, data_aug=None, seed: int = 0, semantic_masks=False)

COVID-19 Image Data Collection

This dataset currently contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19. It was manually aggregated from publication figures as well as various web-based repositories into a machine learning (ML) friendly format with accompanying dataloader code. We collected frontal and lateral view imagery and metadata such as the time since first symptoms, intensive care unit (ICU) status, survival status, intubation status, or hospital location. We present multiple possible use cases for the data such as predicting the need for the ICU, predicting patient survival, and understanding a patient’s trajectory during treatment.

Citations:

COVID-19 Image Data Collection: Prospective Predictions Are the Future. Joseph Paul Cohen, Paul Morrison, Lan Dao, Karsten Roth, Tim Q Duong, and Marzyeh Ghassemi. arXiv:2006.11988, 2020.

COVID-19 image data collection. Joseph Paul Cohen, Paul Morrison, and Lan Dao. arXiv:2003.11597, 2020.

Dataset: https://github.com/ieee8023/covid-chestxray-dataset

Paper: https://arxiv.org/abs/2003.11597

class torchxrayvision.datasets.NLMTB_Dataset(imgpath, transform=None, data_aug=None, seed=0, views=['PA'])

National Library of Medicine Tuberculosis Datasets

https://lhncbc.nlm.nih.gov/publication/pub9931

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256233/

Note that each dataset should be loaded separately by this class (they may be merged afterwards). All images are of view PA.

Jaeger S, Candemir S, Antani S, Wang YX, Lu PX, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg. 2014 Dec;4(6):475-7. doi: 10.3978/j.issn.2223-4292.2014.11.20. PMID: 25525580; PMCID: PMC4256233.

Download links:

Montgomery County: https://academictorrents.com/details/ac786f74878a5775c81d490b23842fd4736bfe33 or http://openi.nlm.nih.gov/imgs/collections/NLM-MontgomeryCXRSet.zip

Shenzhen: https://academictorrents.com/details/462728e890bd37c05e9439c885df7afc36209cc8 or http://openi.nlm.nih.gov/imgs/collections/ChinaSet_AllFiles.zip

class torchxrayvision.datasets.SIIM_Pneumothorax_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', transform=None, data_aug=None, seed=0, unique_patients=True, pathology_masks=False)

SIIM Pneumothorax Dataset

https://academictorrents.com/details/6ef7c6d039e85152c4d0f31d83fa70edc4aba088

https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation

“The data is comprised of images in DICOM format and annotations in the form of image IDs and run-length-encoded (RLE) masks. Some of the images contain instances of pneumothorax (collapsed lung), which are indicated by encoded binary masks in the annotations. Some training images have multiple annotations. Images without pneumothorax have a mask value of -1.”
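A run-length-encoded mask can be expanded into a binary array along the following lines. This sketch assumes a relative encoding in which the string holds (gap, length) pairs, with each gap counted from the end of the previous run over the flattened image, and it assumes column-major flattening. Both conventions, and the function name `decode_rle`, are assumptions that should be checked against the competition's data description.

```python
import numpy as np

def decode_rle(rle, height, width):
    """Decode a relative RLE string into a (height, width) binary mask.

    Assumes pairs of (gap since the end of the previous run, run length)
    over a column-major flattened image. A value of "-1" means no mask.
    """
    flat = np.zeros(height * width, dtype=np.uint8)
    if rle.strip() == "-1":
        return flat.reshape((height, width), order="F")

    numbers = [int(tok) for tok in rle.split()]
    position = 0
    for gap, length in zip(numbers[0::2], numbers[1::2]):
        position += gap                        # skip background pixels
        flat[position:position + length] = 1   # mark the run
        position += length                     # move to the end of the run
    return flat.reshape((height, width), order="F")

# Tiny 3x4 example: a run of 2 after a gap of 1, then a run of 3 after a gap of 4.
mask = decode_rle("1 2 4 3", height=3, width=4)
print(mask)
```

For an image with several annotations, one possible approach is to decode each annotation separately and combine the masks with a logical OR.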

class torchxrayvision.datasets.VinBrain_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=None, transform=None, data_aug=None, seed=0, pathology_masks=False)

VinBrain Dataset

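import torchxrayvision as xrv

# Replace the ".../" prefixes with the location of the downloaded data.
d_vin = xrv.datasets.VinBrain_Dataset(
    imgpath=".../train",
    csvpath=".../train.csv"
)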

Nguyen, H. Q., Lam, K., Le, L. T., Pham, H. H., Tran, D. Q., Nguyen, D. B., Le, D. D., Pham, C. M., Tong, H. T. T., Dinh, D. H., Do, C. D., Doan, L. T., Nguyen, C. N., Nguyen, B. T., Nguyen, Q. V., Hoang, A. D., Phan, H. N., Nguyen, A. T., Ho, P. H., … Vu, V. (2020). VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. http://arxiv.org/abs/2012.15029

https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection

class torchxrayvision.datasets.StonyBrookCOVID_Dataset(imgpath, csvpath, transform=None, data_aug=None, views=['AP'], seed=0)

Stony Brook Radiographic Assessment of Lung Opacity Score Dataset

https://doi.org/10.5281/zenodo.4633999

Citation will be set soon.

class torchxrayvision.datasets.ObjectCXR_Dataset(imgzippath, csvpath, transform=None, data_aug=None, seed=0)

ObjectCXR Dataset

“We provide a large dataset of chest X-rays with strong annotations of foreign objects, and the competition for automatic detection of foreign objects. Specifically, 5000 frontal chest X-ray images with foreign objects presented and 5000 frontal chest X-ray images without foreign objects are provided. All the chest X-ray images were filmed in township hospitals in China and collected through our telemedicine platform. Foreign objects within the lung field of each chest X-ray are annotated with bounding boxes, ellipses or masks depending on the shape of the objects.”

Challenge dataset from MIDL 2020

https://jfhealthcare.github.io/object-CXR/

https://academictorrents.com/details/fdc91f11d7010f7259a05403fc9d00079a09f5d5