Author | KIDGINBROOK    Updated by | 潘麗晨
The previous installment described how all nodes establish the bootstrap network connections; this installment looks at topology analysis.
GPU machine architectures vary widely: one machine may contain several NICs and several GPUs, and the links between the cards also differ, so NCCL has to analyze the interconnect topology of the devices inside the machine in order to get performance as good as possible on every kind of topology.
Continuing from last time, let's look at initTransportsRank.
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  // We use 3 AllGathers
  // 1. { peerInfo, comm }
  // 2. ConnectTransport[nranks], ConnectValue[nranks]
  // 3. { nThreads, nrings, compCap, prev[MAXCHANNELS], next[MAXCHANNELS] }
  int rank = comm->rank;
  int nranks = comm->nRanks;
  uint64_t commHash = getHash(commId->internal, NCCL_UNIQUE_ID_BYTES);
  TRACE(NCCL_INIT, "comm %p, commHash %lx, rank %d nranks %d - BEGIN", comm, commHash, rank, nranks);
  NCCLCHECK(bootstrapInit(commId, rank, nranks, &comm->bootstrap));

  // AllGather1 - begin
  struct {
    struct ncclPeerInfo peerInfo;
    struct ncclComm* comm;
  } *allGather1Data;
  NCCLCHECK(ncclCalloc(&allGather1Data, nranks));
  allGather1Data[rank].comm = comm;
  struct ncclPeerInfo* myInfo = &allGather1Data[rank].peerInfo;
  NCCLCHECK(fillInfo(comm, myInfo, commHash));
  ...
}
This allocates nranks allGather1Data entries and then fills in the current rank's peerInfo via fillInfo. ncclPeerInfo holds a rank's basic information, such as its rank number and which process on which machine it lives in.
struct ncclPeerInfo {
  int rank;
  int cudaDev;
  int gdrSupport;
  uint64_t hostHash;
  uint64_t pidHash;
  dev_t shmDev;
  int64_t busId;
};

static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
  info->rank = comm->rank;
  CUDACHECK(cudaGetDevice(&info->cudaDev));
  info->hostHash=getHostHash()+commHash;
  info->pidHash=getPidHash()+commHash;

  // Get the device MAJOR:MINOR of /dev/shm so we can use that
  // information to decide whether we can use SHM for inter-process
  // communication in a container environment
  struct stat statbuf;
  SYSCHECK(stat("/dev/shm", &statbuf), "stat");
  info->shmDev = statbuf.st_dev;

  info->busId = comm->busId;

  NCCLCHECK(ncclGpuGdrSupport(&info->gdrSupport));
  return ncclSuccess;
}
fillInfo records the current card's rank, PCIe busId and the device number of /dev/shm in ncclPeerInfo, and then calls ncclGpuGdrSupport to check whether GDR is supported. Before communicating, RDMA requires registering a region of memory so that the NIC knows the virtual-to-physical address mapping; if every transfer first had to copy data from device memory to host memory, efficiency would be low. IB provides a peer memory interface that lets the IB NIC access other PCIe address spaces, and NVIDIA implemented its own driver on top of peer memory so that RDMA can register device memory directly. Communication can then avoid host/device memory copies, because the IB NIC can DMA device memory directly, which is GDR (GPUDirect RDMA).
static ncclResult_t ncclGpuGdrSupport(int* gdrSupport) {
  int netDevs;
  NCCLCHECK(ncclNetDevices(&netDevs));
  *gdrSupport = 0;
  for (int dev=0; dev<netDevs; dev++) {
    ...
  }
  return ncclSuccess;
}
Here every NIC is iterated over and its properties are fetched; from Part 1 we know that ncclNet here is ncclNetIb.
ncclResult_t ncclIbGdrSupport(int ibDev) {
  static int moduleLoaded = -1;
  if (moduleLoaded == -1) {
    moduleLoaded = (access("/sys/kernel/mm/memory_peers/nv_mem/version", F_OK) == -1) ? 0 : 1;
  }
  if (moduleLoaded == 0) return ncclSystemError;
  return ncclSuccess;
}

ncclResult_t ncclIbGetProperties(int dev, ncclNetProperties_t* props) {
  props->name = ncclIbDevs[dev].devName;
  props->pciPath = ncclIbDevs[dev].pciPath;
  props->guid = ncclIbDevs[dev].guid;
  props->ptrSupport = NCCL_PTR_HOST;
  if (ncclIbGdrSupport(dev) != ncclSuccess) {
    INFO(NCCL_NET,"NET/IB : GPU Direct RDMA Disabled for HCA %d '%s' (no module)", dev, ncclIbDevs[dev].devName);
  } else {
    props->ptrSupport |= NCCL_PTR_CUDA;
  }
  props->speed = ncclIbDevs[dev].speed;
  props->port = ncclIbDevs[dev].port + ncclIbDevs[dev].realPort;
  props->maxComms = ncclIbDevs[dev].maxQp;
  return ncclSuccess;
}
This mainly collects the NIC name, PCIe path, GUID and so on, and then checks for the existence of /sys/kernel/mm/memory_peers/nv_mem/version to determine whether nv_peermem, NVIDIA's peer-memory driver, is installed. If it is, props->ptrSupport |= NCCL_PTR_CUDA is set, meaning device memory can be registered.
It then attempts to actually register device memory; if the registration succeeds, gdrSupport is set to 1. This step actually creates an RDMA connection, which will be covered separately later, so it is skipped for now.
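Conceptually, "trying to register device memory" boils down to calling ibv_reg_mr on a pointer returned by cudaMalloc: with nv_peermem loaded the registration succeeds and the HCA can DMA the GPU memory directly. Below is a conceptual, standalone sketch of that idea using the plain verbs and CUDA runtime APIs; it is not NCCL's ncclIbRegMr and omits the listen/connect plumbing NCCL sets up.

// Conceptual sketch (not NCCL code) of GDR memory registration.
// Build with: nvcc gdr_reg.cu -libverbs
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int num = 0;
  struct ibv_device** devs = ibv_get_device_list(&num);
  if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

  struct ibv_context* ctx = ibv_open_device(devs[0]);
  struct ibv_pd* pd = ctx ? ibv_alloc_pd(ctx) : NULL;
  if (!pd) { fprintf(stderr, "could not open device / alloc PD\n"); return 1; }

  void* gpuPtr = nullptr;
  size_t size = 1 << 20;                      // 1 MiB of device memory
  cudaMalloc(&gpuPtr, size);

  // If nv_peermem is present, this registration succeeds and the HCA can
  // DMA the GPU memory directly (GDR); otherwise it fails.
  struct ibv_mr* mr = ibv_reg_mr(pd, gpuPtr, size,
                                 IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);
  printf("GPU memory registration %s\n", mr ? "succeeded (GDR usable)" : "failed");

  if (mr) ibv_dereg_mr(mr);
  cudaFree(gpuPtr);
  ibv_dealloc_pd(pd);
  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}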
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  ...
  NCCLCHECK(bootstrapAllGather(comm->bootstrap, allGather1Data, sizeof(*allGather1Data)));

  NCCLCHECK(ncclCalloc(&comm->peerInfo, nranks+1)); // Extra rank to represent CollNet root
  for (int i = 0; i < nranks; i++) {
    memcpy(comm->peerInfo+i, &allGather1Data[i].peerInfo, sizeof(struct ncclPeerInfo));
    if ((i != rank) && (comm->peerInfo[i].hostHash == myInfo->hostHash) && (comm->peerInfo[i].busId == myInfo->busId)) {
      WARN("Duplicate GPU detected : rank %d and rank %d both on CUDA device %x", rank, i, myInfo->busId);
      return ncclInvalidUsage;
    }
  }
  // AllGather1 data is used again below
  // AllGather1 - end

  // Topo detection / System graph creation
  NCCLCHECK(ncclTopoGetSystem(comm, &comm->topo));
  ...
}
bootstrapAllGather then exchanges allGather1Data among all ranks, and the peerInfo obtained from the other ranks is copied into comm.
在看具體拓?fù)浞治隽鞒讨跋群?jiǎn)單了解一下PCIe的一些概念,一個(gè)簡(jiǎn)單的PCIe系統(tǒng)示例如下。
Every CPU has its own root complex, abbreviated RC below. The RC handles communication between the CPU and the rest of the system, such as memory and the PCIe hierarchy: when the CPU issues a physical address that falls into PCIe space, the RC converts it into PCIe requests.
A switch extends the PCIe ports; devices or further switches can be attached below it, and requests arriving from upstream are forwarded by it. A PCIe device can be attached either to the RC or to a switch. The inside of a switch looks like the following.
Internally there is a PCIe bus, from which several bridges fan out into several ports; the top one is called the upstream port and the others are downstream ports.
As mentioned earlier, a very common variable name in NCCL is busId, used for GPUs and IB NICs alike. Note that NCCL's busId is not the bus number; it is the ID used to locate a PCIe device, i.e. the BDF (bus + device + function). A bus carries multiple devices and a device exposes multiple functions, so a BDF pinpoints one device. After the machine finishes PCIe configuration at boot, the relevant information is exposed to user space through sysfs, and sysfs is what NCCL uses for topology detection.
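To make the BDF notation concrete, here is a tiny standalone sketch (not NCCL code; NCCL keeps busId as an int64 and converts with its own helpers) that parses a sysfs-style busId string such as "0000:17:00.0" into its domain/bus/device/function fields.

// Sketch: parse a sysfs-style busId "DDDD:BB:dd.f" into BDF components.
#include <cstdio>

struct PciBdf { unsigned int domain, bus, device, function; };

static bool parseBusId(const char* busId, PciBdf* out) {
  // e.g. "0000:17:00.0" -> domain 0x0000, bus 0x17, device 0x00, function 0x0
  return sscanf(busId, "%x:%x:%x.%x",
                &out->domain, &out->bus, &out->device, &out->function) == 4;
}

int main() {
  PciBdf bdf;
  if (parseBusId("0000:17:00.0", &bdf))
    printf("domain=%04x bus=%02x dev=%02x fn=%x\n",
           bdf.domain, bdf.bus, bdf.device, bdf.function);
  return 0;
}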
Now let's look at ncclTopoGetSystem, the focus of this installment. It builds the PCI tree for the current rank in two steps: first the whole PCI tree is represented as xml, then the xml is converted into ncclTopoNode. The xml definitions are shown below; one ncclXmlNode represents one node of the PCI tree.
struct ncclXmlNode {
  char name[MAX_STR_LEN];
  struct {
    char key[MAX_STR_LEN];
    char value[MAX_STR_LEN];
  } attrs[MAX_ATTR_COUNT+1]; // Need an extra one to consume extra params
  int nAttrs;
  int type;
  struct ncclXmlNode* parent;
  struct ncclXmlNode* subs[MAX_SUBS];
  int nSubs;
};

struct ncclXml {
  struct ncclXmlNode nodes[MAX_NODES];
  int maxIndex;
};
An ncclXmlNode represents one node and records its parent and all of its children; a node has a name and attributes, and attributes are set via xmlSetAttr.
ncclXml pre-allocates all the nodes, and maxIndex records how far the allocation has progressed. Next, a quick look at a few xml-related APIs.
static ncclResult_t xmlAddNode(struct ncclXml* xml, struct ncclXmlNode* parent, const char* subName, struct ncclXmlNode** sub);
xmlAddNode allocates a node: it requests a new node sub from xml, sets sub's name to subName, and sets its parent to parent.
static ncclResult_t xmlFindTagKv(struct ncclXml* xml, const char* tagName, struct ncclXmlNode** node, const char* attrName, const char* attrValue)
xmlFindTagKv walks the nodes already allocated in xml, finds the node n whose name equals tagName, and then checks whether n["attrName"] equals attrValue; if they are equal, node is set to n.
static ncclResult_t xmlGetAttrIndex(struct ncclXmlNode* node, const char* attrName, int* index)
xmlGetAttrIndex looks up which attribute slot of node the attribute attrName occupies.
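Pieced together from the calls that appear later in this article, a hypothetical usage fragment of these helpers looks roughly like the following (error handling via NCCLCHECK as in NCCL; this is an illustrative fragment, not a standalone program).

// Build a tiny tree: <system version=...> <pci busid="0000:17:00.0"/> </system>
struct ncclXml* xml;
NCCLCHECK(ncclCalloc(&xml, 1));

struct ncclXmlNode* top;
NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));
NCCLCHECK(xmlSetAttrInt(top, "version", NCCL_TOPO_XML_VERSION));

struct ncclXmlNode* pci;
NCCLCHECK(xmlAddNode(xml, top, "pci", &pci));
NCCLCHECK(xmlSetAttr(pci, "busid", "0000:17:00.0"));

// Later, look the node up again by tag name + attribute value.
struct ncclXmlNode* found = NULL;
NCCLCHECK(xmlFindTagKv(xml, "pci", &found, "busid", "0000:17:00.0"));

// Check whether an attribute has been set yet (-1 means absent).
int index = -1;
NCCLCHECK(xmlGetAttrIndex(pci, "busid", &index));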
然后開始看拓?fù)浞治龅倪^(guò)程。
ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** system) {
  struct ncclXml* xml;
  NCCLCHECK(ncclCalloc(&xml, 1));
  char* xmlTopoFile = getenv("NCCL_TOPO_FILE");
  if (xmlTopoFile) {
    INFO(NCCL_ENV, "NCCL_TOPO_FILE set by environment to %s", xmlTopoFile);
    NCCLCHECK(ncclTopoGetXmlFromFile(xmlTopoFile, xml));
  }
  if (xml->maxIndex == 0) {
    // Create top tag
    struct ncclXmlNode* top;
    NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));
    NCCLCHECK(xmlSetAttrInt(top, "version", NCCL_TOPO_XML_VERSION));
  }

  // Auto-detect GPUs if needed
  for (int r=0; r<comm->nRanks; r++) {
    ...
  }
  ...
}
First, xmlAddNode creates the root node "system" (double quotes denote xml tree nodes from here on) and the root attribute "system"["version"] = NCCL_TOPO_XML_VERSION is set. Then every rank's hostHash is examined; an equal hostHash means the rank is on the same machine, and for those ranks ncclTopoFillGpu is executed to add the GPU to the xml tree.
ncclResult_t ncclTopoFillGpu(struct ncclXml* xml, const char* busId, struct ncclXmlNode** gpuNode) {
  struct ncclXmlNode* node;
  NCCLCHECK(ncclTopoGetPciNode(xml, busId, &node));
  NCCLCHECK(ncclTopoGetXmlFromSys(node, xml));
  ...
}
ncclResult_t ncclTopoGetPciNode(struct ncclXml* xml, const char* busId, struct ncclXmlNode** pciNode) {
  NCCLCHECK(xmlFindTagKv(xml, "pci", pciNode, "busid", busId));
  if (*pciNode == NULL) {
    NCCLCHECK(xmlAddNode(xml, NULL, "pci", pciNode));
  }
  NCCLCHECK(xmlSetAttr(*pciNode, "busid", busId));
  return ncclSuccess;
}
ncclTopoGetPciNode checks whether an xml node for the current card has already been created in the xml. At this point there is none, so a new xml node named "pci" is created to represent the current GPU, and "pci"["busid"] = busId is set.
Then ncclTopoGetXmlFromSys is executed. Its main job is to obtain from sysfs the path from the GPU node up to the CPU, convert that path into the xml tree, and read the relevant attributes along the path into the xml.
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  const char* busId;
  NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
  char* path = NULL;
  int index;
  NCCLCHECK(xmlGetAttrIndex(pciNode, "class", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    NCCLCHECK(ncclTopoSetAttrFromSys(pciNode, path, "class", "class"));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_speed", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char deviceSpeedStr[MAX_STR_LEN];
    float deviceSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_speed", deviceSpeedStr));
    sscanf(deviceSpeedStr, "%f GT/s", &deviceSpeed);
    char portSpeedStr[MAX_STR_LEN];
    float portSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_speed", portSpeedStr));
    sscanf(portSpeedStr, "%f GT/s", &portSpeed);
    NCCLCHECK(xmlSetAttr(pciNode, "link_speed", portSpeed < deviceSpeed ? portSpeedStr : deviceSpeedStr));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_width", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char strValue[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_width", strValue));
    int deviceWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_width", strValue));
    int portWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(xmlSetAttrInt(pciNode, "link_width", std::min(deviceWidth,portWidth)));
  }
  ...
}
It first fills in the various attributes of pciNode, using getPciPath to obtain the sysfs path corresponding to busId; this path is in fact the path from the root of the PCI tree to the leaf node.
static ncclResult_t getPciPath(const char* busId, char** path) {
  char busPath[] = "/sys/class/pci_bus/0000:00/../../0000:00:00.0";
  memcpylower(busPath+sizeof("/sys/class/pci_bus/")-1, busId, BUSID_REDUCED_SIZE-1);
  memcpylower(busPath+sizeof("/sys/class/pci_bus/0000:00/../../")-1, busId, BUSID_SIZE-1);
  *path = realpath(busPath, NULL);
  if (*path == NULL) {
    WARN("Could not find real path of %s", busPath);
    return ncclSystemError;
  }
  return ncclSuccess;
}
As an example, path might be /sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:00.0/0000:13:00.0/0000:14:00.0/0000:15:00.0/0000:16:00.0/0000:17:00.0, where the GPU's busId is 0000:17:00.0. This path corresponds to the figure below; note that the figure omits the switch corresponding to 15:00.0.
It then reads the attributes under path, such as class (the PCI device type), link_speed and link_width, and sets them on the xml pciNode. ncclTopoGetStrFromSys simply reads the kernel file under path into strValue.
ncclResult_t ncclTopoGetStrFromSys(const char* path, const char* fileName, char* strValue) {
  char filePath[PATH_MAX];
  sprintf(filePath, "%s/%s", path, fileName);
  int offset = 0;
  FILE* file;
  if ((file = fopen(filePath, "r")) != NULL) {
    while (feof(file) == 0 && ferror(file) == 0 && offset < MAX_STR_LEN) {
      int len = fread(strValue+offset, 1, MAX_STR_LEN-offset, file);
      offset += len;
    }
    fclose(file);
  }
  if (offset == 0) {
    strValue[0] = '\0';
    INFO(NCCL_GRAPH, "Topology detection : could not read %s, ignoring", filePath);
  } else {
    strValue[offset-1] = '\0';
  }
  return ncclSuccess;
}
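For a concrete feel of what these sysfs files contain, here is a small standalone sketch (the busId in the path is only an example taken from the figure above and will differ on your machine) that reads max_link_speed and parses the "<float> GT/s" string the same way the link_speed handling above does.

// Sketch: read a PCI device's max_link_speed from sysfs and parse it.
#include <cstdio>

int main() {
  const char* path = "/sys/bus/pci/devices/0000:17:00.0/max_link_speed";
  char buf[64] = {0};
  FILE* f = fopen(path, "r");
  if (f == NULL) { fprintf(stderr, "could not read %s\n", path); return 1; }
  fgets(buf, sizeof(buf), f);
  fclose(f);

  float speed = 0.0f;
  sscanf(buf, "%f GT/s", &speed);       // e.g. "16.0 GT/s" -> 16.0
  printf("max_link_speed = %.1f GT/s\n", speed);
  return 0;
}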
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  ...
  struct ncclXmlNode* parent = pciNode->parent;
  if (parent == NULL) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));

    // Save that for later in case next step is a CPU
    char numaIdStr[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "numa_node", numaIdStr));

    // Go up one level in the PCI tree. Rewind two "/" and follow the upper PCI
    // switch, or stop if we reach a CPU root complex.
    int slashCount = 0;
    int parentOffset;
    for (parentOffset = strlen(path)-1; parentOffset>0; parentOffset--) {
      if (path[parentOffset] == '/') {
        slashCount++;
        path[parentOffset] = '\0';
        int start = parentOffset - 1;
        while (start>0 && path[start] != '/') start--;
        // Check whether the parent path looks like "BBBB:BB:DD.F" or not.
        if (checkBDFFormat(path+start+1) == 0) {
          // This a CPU root complex. Create a CPU tag and stop there.
          struct ncclXmlNode* topNode;
          NCCLCHECK(xmlFindTag(xml, "system", &topNode));
          NCCLCHECK(xmlGetSubKv(topNode, "cpu", &parent, "numaid", numaIdStr));
          if (parent == NULL) {
            NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
            NCCLCHECK(xmlSetAttr(parent, "numaid", numaIdStr));
          }
        } else if (slashCount == 2) {
          // Continue on the upper PCI switch
          for (int i = strlen(path)-1; i>0; i--) {
            if (path[i] == '/') {
              NCCLCHECK(xmlFindTagKv(xml, "pci", &parent, "busid", path+i+1));
              if (parent == NULL) {
                NCCLCHECK(xmlAddNode(xml, NULL, "pci", &parent));
                NCCLCHECK(xmlSetAttr(parent, "busid", path+i+1));
              }
              break;
            }
          }
        }
      }
      if (parent) break;
    }
    pciNode->parent = parent;
    parent->subs[parent->nSubs++] = pciNode;
  }
  if (strcmp(parent->name, "pci") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
  } else if (strcmp(parent->name, "cpu") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
  }
  free(path);
  return ncclSuccess;
}
Then it walks upward starting from pciNode. A switch's upstream port and downstream port each correspond to a bridge, and NCCL uses the busId of the upstream-port bridge to represent the switch, so two hops up are needed before an xml node is created for the switch; while walking up the path, each "/" crossed increments slashCount. When slashCount == 2, a switch upstream port has been reached: a new xml pci node parent is created to represent this switch, and the current pciNode is linked under parent. Since parent is still an xml pci node, ncclTopoGetXmlFromSys keeps recursing until it reaches the RC, at which point a "cpu" child node is created under "system" and the recursion stops; ncclTopoGetXmlFromCpu then sets the various "cpu" attributes, such as arch (e.g. x86 or arm), affinity (which CPU cores belong to this CPU's NUMA node) and numaid.
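The "rewind two '/'" step can be illustrated with a small standalone sketch (not NCCL code, and using a shortened example path): strip the last two components from the sysfs path, then check whether the remaining tail is still a BDF-formatted component (a switch upstream port) or not (the CPU root complex).

// Sketch of deriving the parent switch's busId from a sysfs device path.
#include <cstdio>
#include <string>

static bool looksLikeBdf(const std::string& s) {
  // "BBBB:BB:DD.F", e.g. "0000:16:00.0"
  unsigned d, b, dev, fn;
  return sscanf(s.c_str(), "%4x:%2x:%2x.%1x", &d, &b, &dev, &fn) == 4 && s.size() == 12;
}

int main() {
  std::string path =
      "/sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:16:00.0/0000:17:00.0";
  for (int hop = 0; hop < 2; hop++)            // rewind two path components
    path.erase(path.find_last_of('/'));

  std::string last = path.substr(path.find_last_of('/') + 1);
  if (looksLikeBdf(last))
    printf("parent switch (upstream bridge) busId: %s\n", last.c_str()); // 0000:11:00.0
  else
    printf("reached the CPU root complex: %s\n", last.c_str());
  return 0;
}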
That completes ncclTopoGetXmlFromSys; back to ncclTopoFillGpu.
ncclResult_t ncclTopoFillGpu(struct ncclXml* xml, const char* busId, struct ncclXmlNode** gpuNode) {
  ...
  NCCLCHECK(wrapNvmlSymbols());
  NCCLCHECK(wrapNvmlInit());
  nvmlDevice_t nvmlDev;
  if (wrapNvmlDeviceGetHandleByPciBusId(busId, &nvmlDev) != ncclSuccess) nvmlDev = NULL;
  NCCLCHECK(ncclTopoGetXmlFromGpu(node, nvmlDev, xml, gpuNode));
  return ncclSuccess;
}
wrapNvmlSymbols then loads the dynamic library libnvidia-ml.so.1, which is used to query GPU information.
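As a rough illustration of what dynamic loading looks like (this is the general dlopen/dlsym pattern, not NCCL's actual wrapNvmlSymbols implementation), a couple of NVML entry points can be resolved from libnvidia-ml.so.1 at runtime so that the binary does not need to link against NVML at build time.

// Sketch: load NVML dynamically and resolve two real NVML symbols.
#include <dlfcn.h>
#include <cstdio>

// In real code these types/prototypes come from nvml.h; simplified here.
typedef int nvmlReturn_t;           // 0 == NVML_SUCCESS
typedef void* nvmlDevice_t;
typedef nvmlReturn_t (*nvmlInit_fn)(void);
typedef nvmlReturn_t (*nvmlDeviceGetHandleByPciBusId_fn)(const char*, nvmlDevice_t*);

int main() {
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW);
  if (!handle) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

  nvmlInit_fn nvmlInit =
      (nvmlInit_fn)dlsym(handle, "nvmlInit_v2");
  nvmlDeviceGetHandleByPciBusId_fn getByBusId =
      (nvmlDeviceGetHandleByPciBusId_fn)dlsym(handle, "nvmlDeviceGetHandleByPciBusId_v2");
  if (!nvmlInit || !getByBusId) { fprintf(stderr, "dlsym failed\n"); return 1; }

  if (nvmlInit() != 0) { fprintf(stderr, "nvmlInit failed\n"); return 1; }
  nvmlDevice_t dev = nullptr;
  nvmlReturn_t ret = getByBusId("0000:17:00.0", &dev);  // busId from the earlier example
  printf("lookup returned %d\n", ret);

  dlclose(handle);
  return 0;
}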
ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvmlDev, struct ncclXml* xml, struct ncclXmlNode** gpuNodeRet) {
  struct ncclXmlNode* gpuNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "gpu", &gpuNode));
  if (gpuNode == NULL) NCCLCHECK(xmlAddNode(xml, pciNode, "gpu", &gpuNode));

  int index = -1;

  int dev = -1;
  NCCLCHECK(xmlGetAttrIndex(gpuNode, "dev", &index));
  if (index == -1) {
    if (nvmlDev == NULL) {
      WARN("No NVML, trying to use CUDA instead");
      const char* busId;
      NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
      if (busId == NULL || cudaDeviceGetByPCIBusId(&dev, busId) != cudaSuccess) dev = -1;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetIndex(nvmlDev, (unsigned int*)&dev));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "dev", dev));
  }
  NCCLCHECK(xmlGetAttrInt(gpuNode, "dev", &dev));
  if (dev == -1) { *gpuNodeRet = NULL; return ncclSuccess; }

  NCCLCHECK(xmlGetAttrIndex(gpuNode, "sm", &index));
  if (index == -1) {
    int cudaMajor, cudaMinor;
    if (nvmlDev == NULL) {
      cudaDeviceProp devProp;
      CUDACHECK(cudaGetDeviceProperties(&devProp, dev));
      cudaMajor = devProp.major; cudaMinor = devProp.minor;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetCudaComputeCapability(nvmlDev, &cudaMajor, &cudaMinor));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "sm", cudaMajor*10+cudaMinor));
  }
  int sm;
  NCCLCHECK(xmlGetAttrInt(gpuNode, "sm", &sm));

  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // NVML NVLink detection
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : 12;

    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.\n");
      maxNvLinks = 0;
    }

    for (int l=0; l<maxNvLinks; ++l) {
      ...
    }
  }
  ...
}
First a "gpu" node is created under the GPU's xml "pci" node, and its attributes are set, such as dev and the compute capability sm. Then NVLink information is queried: all possible NVLinks are iterated and nvmlDeviceGetNvLinkCapability is used to query each link. If a link is enabled, a "nvlink" node is created under the "gpu" node, with a "target" attribute recording the PCIe busId of the link's remote end; "nvlink" nodes with the same "target" are merged into one, with "count" recording how many links connect the two endpoints, and the "tclass" attribute recording what kind of PCI device the "target" is.
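The same NVLink information can be queried with the plain NVML API. The following standalone sketch (not NCCL's wrappers) walks the links of GPU 0, checks whether each link is active, and prints the busId of the remote end, which corresponds to what NCCL stores in the "target" attribute. Build with: g++ nvlink_probe.cpp -lnvidia-ml

// Sketch: probe NVLink state, P2P capability and remote PCI info via NVML.
#include <nvml.h>
#include <cstdio>

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) { fprintf(stderr, "nvmlInit failed\n"); return 1; }

  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

  // Upper bound on links; NCCL derives its own bound from the SM version.
  for (unsigned int l = 0; l < NVML_NVLINK_MAX_LINKS; l++) {
    nvmlEnableState_t isActive;
    if (nvmlDeviceGetNvLinkState(dev, l, &isActive) != NVML_SUCCESS ||
        isActive != NVML_FEATURE_ENABLED)
      continue;                                  // link not present / not enabled

    unsigned int p2p = 0;
    nvmlDeviceGetNvLinkCapability(dev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &p2p);

    nvmlPciInfo_t remote;
    if (nvmlDeviceGetNvLinkRemotePciInfo_v2(dev, l, &remote) == NVML_SUCCESS)
      // remote.busId is the analogue of the "target" attribute on a "nvlink" node
      printf("link %u: p2p=%u target=%s\n", l, p2p, remote.busId);
  }

  nvmlShutdown();
  return 0;
}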
That completes ncclTopoFillGpu. The xml now looks like the figure below (only a single card is shown); note that "gpu" and its parent node actually refer to the same GPU.
Back in ncclTopoGetSystem, the "gpu" node's rank and gdr attributes are then set.
Then, for all the NICs, a process similar to the GPU case above builds up the xml tree via ncclTopoGetXmlFromSys, as shown below (again only one NIC is shown), where "net", "nic" and the parent node of "nic" all represent the same NIC.
To summarize, this installment walked through NCCL's topology-analysis process: using sysfs, the PCI tree structures of the GPUs and NICs are built up into the xml tree.
(Original article: https://blog.csdn.net/KIDGIN7439/article/details/126990961)