Datasets
- class torchxrayvision.datasets.Dataset
The datasets in this library aim to fit a simple interface where the imgpath and csvpath are specified. Some datasets require more than one metadata file, and for some the metadata files are packaged with the library, so only the imgpath needs to be specified.
- pathologies: List[str]
A list of strings identifying the pathologies contained in this dataset. This list corresponds to the columns of the .labels matrix. Although it is called pathologies, the contents do not have to be pathologies and may simply be attributes of the patient.
- labels: ndarray
A NumPy array which contains a 1, 0, or NaN for each pathology. Each column is a pathology and each row corresponds to an item in the dataset. A 1 represents that the pathology is present, 0 represents the pathology is absent, and NaN represents no information.
- csv: DataFrame
A Pandas DataFrame of the metadata .csv file that is included with the data. For some datasets multiple metadata files have been merged together. It is largely a “catch-all” for associated data and the referenced publication should explain each field. Each row aligns with the elements of the dataset so indexing using .iloc will work. Alignment between the DataFrame and the dataset items will be maintained when using tools from this library.
- totals() → Dict[str, Dict[str, int]]
Compute counts of pathologies.
Returns: A dict mapping each pathology name to a dict of label value -> count.
- __repr__() → str
Returns the name and a description of the dataset such as:
CheX_Dataset num_samples=191010 views=['PA', 'AP']
If running in a Jupyter notebook, it also prints the pathology counts returned by .totals():
{'Atelectasis': {0.0: 17621, 1.0: 29718}, 'Cardiomegaly': {0.0: 22645, 1.0: 23384}, 'Consolidation': {0.0: 30463, 1.0: 12982}, ...}
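The relationship between .pathologies, .labels, and the output of .totals() can be sketched in plain Python. This is only an illustration, not the library's implementation: nested lists stand in for the NumPy array so the sketch runs without the library or any data, the toy values and the helper name count_labels are hypothetical, and it assumes NaN entries are skipped when counting, as the sample output above suggests.

```python
import math
from collections import Counter

# Hypothetical toy data; a real dataset exposes these as .pathologies and .labels.
pathologies = ["Atelectasis", "Cardiomegaly"]
nan = math.nan
labels = [
    [1.0, 0.0],  # item 0: Atelectasis present, Cardiomegaly absent
    [0.0, nan],  # item 1: no information about Cardiomegaly
    [1.0, 1.0],
    [nan, 0.0],
]

def count_labels(pathologies, labels):
    """Count 0/1 labels per column, skipping NaN (no information)."""
    totals = {}
    for col, name in enumerate(pathologies):
        known = [row[col] for row in labels if not math.isnan(row[col])]
        totals[name] = dict(Counter(known))
    return totals

totals = count_labels(pathologies, labels)
print(totals)
```

Each column of the matrix corresponds to one entry of pathologies and each row to one dataset item, so the same row index can be used with .csv.iloc to look up the metadata for that item.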
- class torchxrayvision.datasets.NIH_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', bbox_list_path='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, seed=0, unique_patients=True, pathology_masks=False)
NIH ChestX-ray14 dataset
ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with the text-mined fourteen disease image labels (where each image can have multi-labels), mined from the associated radiological reports using natural language processing. Fourteen common thoracic pathologies include Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, which is an extension of the 8 common disease patterns listed in our CVPR2017 paper. Note that original radiology reports (associated with these chest x-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. Please find more details and benchmark performance of trained models based on 14 disease labels in our arXiv paper: https://arxiv.org/abs/1705.02315
Dataset release website: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
Download full size images here: https://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a
Download resized (224x224) images here: https://academictorrents.com/details/e615d3aebce373f1dc8bd9d11064da55bdadede0
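The multi-label structure described above can be illustrated with a small sketch. It assumes that the metadata stores each image's findings as a single pipe-separated string (for example "Cardiomegaly|Effusion") and marks normal images as "No Finding". That layout, the helper name encode_findings, and the example strings are assumptions for illustration, not part of the library's API. The pathology names are copied from the description above, so their spelling may need adjusting to match the actual metadata file.

```python
# Pathology names as listed in the dataset description above.
PATHOLOGIES = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
    "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
    "Pleural_thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia",
]

def encode_findings(finding_labels):
    """Turn a pipe-separated findings string into a 14-element 0/1 row."""
    present = set(finding_labels.split("|"))
    return [1 if name in present else 0 for name in PATHOLOGIES]

# One image can carry several labels at once.
row = encode_findings("Cardiomegaly|Effusion")
# A string with no listed pathology yields an all-zero row.
normal = encode_findings("No Finding")
print(row)
print(normal)
```

Stacking one such row per image gives a matrix with the same shape as the .labels attribute: one column per pathology and one row per item.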
- class torchxrayvision.datasets.RSNA_Pneumonia_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', dicomcsvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, pathology_masks=False, extension='.jpg')
RSNA Pneumonia Detection Challenge
Citation:
Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Shih, George, Wu, Carol C., Halabi, Safwan S., Kohli, Marc D., Prevedello, Luciano M., Cook, Tessa S., Sharma, Arjun, Amorosa, Judith K., Arteaga, Veronica, Galperin-Aizenberg, Maya, Gill, Ritu R., Godoy, Myrna C.B., Hobbs, Stephen, Jeudy, Jean, Laroia, Archana, Shah, Palmi N., Vummidi, Dharshan, Yaddanapudi, Kavitha, and Stein, Anouk. Radiology: Artificial Intelligence, 1 2019. doi: 10.1148/ryai.2019180041.
Challenge site: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
JPG files stored here: https://academictorrents.com/details/95588a735c9ae4d123f3ca408e56570409bcf2a9
- class torchxrayvision.datasets.NIH_Google_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, unique_patients=True)
A relabelling of a subset of images from the NIH dataset. The data tables should be applied against an NIH download. The original release provides separate test and validation splits. They are combined here, but either split can be used on its own by passing the original CSV file as the csvpath argument.
Citation:
Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Anna Majkowska, Sid Mittal, David F. Steiner, Joshua J. Reicher, Scott Mayer McKinney, Gavin E. Duggan, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Sreenivasa Raju Kalidindi, Alexander Ding, Greg S. Corrado, Daniel Tse, and Shravya Shetty. Radiology, 2020.
https://pubs.rsna.org/doi/10.1148/radiol.2019191293
NIH data can be downloaded here: https://academictorrents.com/details/e615d3aebce373f1dc8bd9d11064da55bdadede0
- class torchxrayvision.datasets.PC_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, flat_dir=True, seed=0, unique_patients=True)
PadChest dataset from the Hospital San Juan de Alicante - University of Alicante
Note that images with null labels (as opposed to normal) and images that cannot be properly loaded (listed as ‘missing’ in the code) are excluded, so the number of available images is slightly lower than the number of image files.
Citation:
PadChest: A large chest x-ray image dataset with multi-label annotated reports. Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. arXiv preprint, 2019. https://arxiv.org/abs/1901.07441
Dataset website: http://bimcv.cipf.es/bimcv-projects/padchest/
Download full size images here: https://academictorrents.com/details/dec12db21d57e158f78621f06dcbe78248d14850
Download resized (224x224) images here (recropped): https://academictorrents.com/details/96ebb4f92b85929eadfb16761f310a6d04105797
- class torchxrayvision.datasets.CheX_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=['PA'], transform=None, data_aug=None, flat_dir=True, seed=0, unique_patients=True)
CheXpert Dataset
Citation:
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. Jeremy Irvin *, Pranav Rajpurkar *, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng. https://arxiv.org/abs/1901.07031
Dataset website here: https://stanfordmlgroup.github.io/competitions/chexpert/
A small validation set is also provided with the data, but it is so small that it is not included here.
- class torchxrayvision.datasets.MIMIC_Dataset(imgpath, csvpath, metacsvpath, views=['PA'], transform=None, data_aug=None, seed=0, unique_patients=True)
MIMIC-CXR Dataset
Citation:
Johnson AE, Pollard TJ, Berkowitz S, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019 Jan 21.
https://arxiv.org/abs/1901.07042
Dataset website here: https://physionet.org/content/mimic-cxr-jpg/2.0.0/
- class torchxrayvision.datasets.Openi_Dataset(imgpath, xmlpath='USE_INCLUDED_FILE', dicomcsv_path='USE_INCLUDED_FILE', tsnepacsv_path='USE_INCLUDED_FILE', use_tsne_derived_view=False, views=['PA'], transform=None, data_aug=None, nrows=None, seed=0, unique_patients=True)
OpenI Dataset
Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 2016. doi: 10.1093/jamia/ocv080.
Views have been determined by projection using T-SNE. To use the T-SNE view rather than the view defined by the record, set use_tsne_derived_view to True.
Dataset website: https://openi.nlm.nih.gov/faq
Download images: https://academictorrents.com/details/5a3a439df24931f410fac269b87b050203d9467d
- class torchxrayvision.datasets.COVID19_Dataset(imgpath: str, csvpath: str, views=['PA', 'AP'], transform=None, data_aug=None, seed: int = 0, semantic_masks=False)
COVID-19 Image Data Collection
This dataset currently contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19. It was manually aggregated from publication figures as well as various web based repositories into a machine learning (ML) friendly format with accompanying dataloader code. We collected frontal and lateral view imagery and metadata such as the time since first symptoms, intensive care unit (ICU) status, survival status, intubation status, or hospital location. We present multiple possible use cases for the data such as predicting the need for the ICU, predicting patient survival, and understanding a patient’s trajectory during treatment.
Citations:
COVID-19 Image Data Collection: Prospective Predictions Are the Future. Joseph Paul Cohen, Paul Morrison, Lan Dao, Karsten Roth, Tim Q Duong, and Marzyeh Ghassemi. arXiv:2006.11988, 2020.
COVID-19 image data collection. Joseph Paul Cohen, Paul Morrison, and Lan Dao. arXiv:2003.11597, 2020.
Dataset: https://github.com/ieee8023/covid-chestxray-dataset
- class torchxrayvision.datasets.NLMTB_Dataset(imgpath, transform=None, data_aug=None, seed=0, views=['PA'])
National Library of Medicine Tuberculosis Datasets
https://lhncbc.nlm.nih.gov/publication/pub9931 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256233/
Note that each dataset should be loaded separately by this class (they may be merged afterwards). All images are of view PA.
Jaeger S, Candemir S, Antani S, Wang YX, Lu PX, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg. 2014 Dec;4(6):475-7. doi: 10.3978/j.issn.2223-4292.2014.11.20. PMID: 25525580; PMCID: PMC4256233.
Download links (Montgomery County): https://academictorrents.com/details/ac786f74878a5775c81d490b23842fd4736bfe33 or http://openi.nlm.nih.gov/imgs/collections/NLM-MontgomeryCXRSet.zip
Download links (Shenzhen): https://academictorrents.com/details/462728e890bd37c05e9439c885df7afc36209cc8 or http://openi.nlm.nih.gov/imgs/collections/ChinaSet_AllFiles.zip
- class torchxrayvision.datasets.SIIM_Pneumothorax_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', transform=None, data_aug=None, seed=0, unique_patients=True, pathology_masks=False)
SIIM Pneumothorax Dataset
https://academictorrents.com/details/6ef7c6d039e85152c4d0f31d83fa70edc4aba088 https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation
“The data is comprised of images in DICOM format and annotations in the form of image IDs and run-length-encoded (RLE) masks. Some of the images contain instances of pneumothorax (collapsed lung), which are indicated by encoded binary masks in the annotations. Some training images have multiple annotations. Images without pneumothorax have a mask value of -1.”
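The run-length-encoded masks mentioned in the quote can be decoded with a few lines of Python. The following is a minimal sketch under stated assumptions, not the library's loader: it assumes each pair of numbers is (offset from the end of the previous run, run length), which the quoted description does not specify, so check the challenge documentation before relying on it. The helper name decode_rle and the toy inputs are hypothetical, and the result is left as a flat list because reshaping to the image dimensions depends on the pixel ordering used by the challenge.

```python
def decode_rle(rle, num_pixels):
    """Decode a relative run-length string into a flat 0/1 mask."""
    mask = [0] * num_pixels
    # A value of -1 means the image has no pneumothorax: return an empty mask.
    if rle.strip() == "-1":
        return mask
    values = [int(v) for v in rle.split()]
    position = 0
    # Values alternate: offset, length, offset, length, ...
    for offset, length in zip(values[0::2], values[1::2]):
        position += offset  # skip the gap since the previous run ended
        for i in range(position, position + length):
            mask[i] = 1
        position += length  # the next offset is measured from here
    return mask

example = decode_rle("1 2 3 1", 8)
empty = decode_rle("-1", 4)
print(example)
print(empty)
```

When an image has multiple annotations, each one can be decoded separately and the resulting masks combined element-wise.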
- class torchxrayvision.datasets.VinBrain_Dataset(imgpath, csvpath='USE_INCLUDED_FILE', views=None, transform=None, data_aug=None, seed=0, pathology_masks=False)
VinBrain Dataset
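Example:
    import torchxrayvision as xrv

    # Point imgpath and csvpath at the downloaded training data.
    d_vin = xrv.datasets.VinBrain_Dataset(
        imgpath=".../train",
        csvpath=".../train.csv",
    )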
Nguyen, H. Q., Lam, K., Le, L. T., Pham, H. H., Tran, D. Q., Nguyen, D. B., Le, D. D., Pham, C. M., Tong, H. T. T., Dinh, D. H., Do, C. D., Doan, L. T., Nguyen, C. N., Nguyen, B. T., Nguyen, Q. V., Hoang, A. D., Phan, H. N., Nguyen, A. T., Ho, P. H., … Vu, V. (2020). VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. http://arxiv.org/abs/2012.15029
https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection
- class torchxrayvision.datasets.StonyBrookCOVID_Dataset(imgpath, csvpath, transform=None, data_aug=None, views=['AP'], seed=0)
Stony Brook Radiographic Assessment of Lung Opacity Score Dataset
https://doi.org/10.5281/zenodo.4633999
Citation will be set soon.
- class torchxrayvision.datasets.ObjectCXR_Dataset(imgzippath, csvpath, transform=None, data_aug=None, seed=0)
ObjectCXR Dataset
“We provide a large dataset of chest X-rays with strong annotations of foreign objects, and the competition for automatic detection of foreign objects. Specifically, 5000 frontal chest X-ray images with foreign objects presented and 5000 frontal chest X-ray images without foreign objects are provided. All the chest X-ray images were filmed in township hospitals in China and collected through our telemedicine platform. Foreign objects within the lung field of each chest X-ray are annotated with bounding boxes, ellipses or masks depending on the shape of the objects.”
Challenge dataset from MIDL2020
https://jfhealthcare.github.io/object-CXR/
https://academictorrents.com/details/fdc91f11d7010f7259a05403fc9d00079a09f5d5