0018-9545 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2019.2951501, IEEE Transactions on Vehicular Technology

Hybrid Precoding for Multi-User Millimeter Wave Massive MIMO Systems: A Deep Learning Approach

Ahmet M. Elbir and Anastasios Papazafeiropoulos, Senior Member, IEEE

Abstract: In multi-user millimeter wave (mmWave) multiple-input-multiple-output (MIMO) systems, hybrid precoding is crucial for lowering complexity and cost while achieving a sufficient sum-rate. Previous works on hybrid precoding were usually based on optimization or greedy approaches. These methods either incur high complexity or deliver sub-optimal performance. Moreover, their performance relies heavily on the quality of the channel data. In this work, we propose a deep learning (DL) framework that improves performance and requires less computation time than conventional techniques. Specifically, we design a convolutional neural network for MIMO (CNN-MIMO) that accepts an imperfect channel matrix as input and yields the analog precoder and combiners at the output. The procedure includes two main stages. First, we develop an exhaustive search algorithm that selects, from a predefined codebook, the analog precoder and combiners maximizing the achievable sum-rate. Then, the selected precoder and combiners are used as output labels in the training stage of CNN-MIMO, where the input-output pairs are obtained. We evaluate the proposed method through extensive simulations and show that the proposed DL framework outperforms conventional techniques.
Overall, CNN-MIMO provides a robust hybrid precoding scheme in the presence of imperfections in the channel matrix. In addition, the proposed approach requires less computation time than the optimization-based and codebook-based approaches.

Index Terms: Hybrid precoding, mmWave systems, multi-user MIMO transmission, deep learning, convolutional neural networks.

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
A. M. Elbir is with the Department of Electrical and Electronics Engineering, Duzce University, Duzce, Turkey (e-mail: ahmetelbir@duzce.edu.tr).
A. Papazafeiropoulos is with the Communications and Intelligent Systems Research Group, University of Hertfordshire, Hatfield AL10 9AB, U.K., and also with SnT (http://www.securityandtrust.lu), University of Luxembourg, L-1855 Luxembourg City, Luxembourg (e-mail: tapapazaf@gmail.com).

I. INTRODUCTION

Millimeter wave (mmWave) communication systems provide a higher data rate and wider bandwidth at high frequencies (in the range of 30–300 GHz) [1]. Consequently, mmWave communication has become a leading candidate for fifth-generation (5G) wireless networks [2]. However, in mmWave bands, the propagation loss is higher than in conventional systems operating at lower frequencies [1], [2]. To overcome the high propagation path-loss and to provide beamforming power gain, large numbers of antennas are used at both the transmitter and the receiver, yielding a massive multiple-input-multiple-output (MIMO) structure that enhances the signal-to-noise ratio (SNR) of the received signal [3].

Signal processing in conventional systems with frequencies lower than 3 GHz is performed digitally, where both the amplitude and the phase are processed in the baseband.
For this reason, dedicated radio-frequency (RF) hardware is required for each antenna element [4]. Unfortunately, in mmWave MIMO systems implemented with a large number of antennas, fully digital processing is not cost-efficient, since it entails high hardware cost and significant complexity. To reduce the cost while providing sufficient performance, hybrid precoding architectures have been proposed, where the signal is processed by both analog and digital precoders [5]–[8]. In the analog part of hybrid systems, phase shifters with constant modulus are usually used. The phase shifters introduce discrete phases to the transmitted/received signal to steer the beam and thus increase the gain [8].

In recent years, several techniques have been proposed to design hybrid precoders for mmWave MIMO systems. Initial works focused on the single-user scenario [6], in which the user is assumed to be equipped with multiple antennas. Although the single-user case is the baseline, multi-user systems are of greater practical interest, and in that setting the interference from other users must be taken into account when designing the precoders [7]–[10]. In [8], the performance of low-resolution analog-to-digital converters (ADCs) is investigated when a single RF chain is used at the mobile users. In [9], simultaneous channel estimation is considered for multi-user systems, while, in [10], antenna selection in mmWave MIMO is considered together with hybrid precoder estimation. The authors in [7] also consider the multi-user scenario, but the hybrid precoders are obtained by a greedy-like approach as in [6], where a simultaneous orthogonal matching pursuit (SOMP) algorithm is proposed.
It is worth mentioning that all of the above methods assume perfect channel state information and the availability of the array response sets, namely, $\mathcal{F}$ and $\mathcal{W}$ for the precoder and combiner design, respectively. These sets are composed of the transmit and receive steering vectors with respect to the directions-of-arrival/departure (DOA/DODs) of the user locations. Since these array responses are directly related to the singular value matrix of the channel through a linear transformation, they become the best candidates for the precoder design problem [5]–[7].

As a class of machine learning techniques, DL has recently gained much interest for the solution of many challenging problems such as speech recognition, visual object recognition, and language processing [11], [12]. DL has several advantages, such as low computational complexity when solving optimization-based or combinatorial search problems and the ability to extrapolate new features from a limited set of features contained in a training set [11]. Very recently, DL-based techniques have received a great deal of attention in radar [13] and in fundamental communication theory topics [14]–[24] such as channel estimation [16], DOA estimation [17], and analog beam selection [18]. In the physical layer of wireless communications in particular, DL has been applied to signal detection [19], channel estimation [21], [25], [26], and dynamic multi-channel access problems [20].
In this direction, an end-to-end communication scenario is modeled in [21] and [22] by using auto-encoders, where single-input-single-output (SISO) systems are considered. The authors in [23] have also used auto-encoders for the channel state information (CSI) feedback problem. Interestingly, [24] studies physical layer structures without channel models via DL.

The hybrid precoding problem has also been investigated in the context of DL [27]–[31]. Building on dense fully connected layers, deep multilayer perceptrons (MLPs) have been proposed in [27]–[29]. Specifically, in [27] and [28], an MLP is employed only for the precoder design and only for the single-user scenario. In [29], an MLP architecture is considered for coordinated beam training, where perfect CSI is assumed to be known. Moreover, in [30], a convolutional neural network (CNN)-based approach is proposed for the joint precoder and combiner design problem, but again for the single-user setting. Also, in [31], quantized and unquantized CNNs are used for hybrid precoding in a single-user MIMO system. The performance of DL-based approaches such as [27]–[29] relies strongly on the accuracy of the channel matrix, whereas [30] and [31] propose DL approaches that are robust against imperfections in the channel data but are developed only for the single-user scenario.

A. Motivation

Although there are optimization-based approaches that directly estimate the precoders, they suffer from high computational complexity and local-minimum problems due to random initialization [32]. Also, the design of hybrid precoders for the common multi-user MIMO scenario, which is of high practical importance, has not been considered in the context of DL.
Thus, driven by the advantages of DL, such as its low computational complexity, we develop a method that can handle the hybrid precoding design for multi-user MIMO transmission in the mmWave band when only corrupted channel feedback data is available.

B. Contribution

In this paper, we propose a CNN-based DL framework for mmWave hybrid precoding design, henceforth called CNN-MIMO. In our DL framework, the channel matrix of the users is the input of CNN-MIMO, and the output labels are the hybrid precoder weights. In the training stage, which is an offline process (see Fig. 2), we generate several channel realizations of multiple users and obtain the corresponding hybrid precoders via an exhaustive search algorithm. This process requires knowledge of the feasible sets of array responses $\mathcal{F}$ and $\mathcal{W}$, which are not used in the prediction stage. Once the network is trained, CNN-MIMO predicts the hybrid precoders by simply feeding the network with the channel matrix of the users. The proposed DL framework provides a nonlinear mapping between the channel matrix and the hybrid beamformers. Hence, the proposed method achieves more robust performance than the competing algorithms, since the deep network can handle imperfections and corruptions in the input channel data whereas the other algorithms do not have such a capability. The proposed approach also has superior sum-rate performance due to the use of the "best" hybrid beamformers, which are obtained via an exhaustive search in the training process.

The main contributions of this work are as follows.
• A DL-based approach is proposed for hybrid precoding in multi-user massive MIMO mmWave systems. We leverage DL to estimate the precoder and combiner weights so that CNN-MIMO is more robust against deviations in the channel matrix.
Hence, the proposed DL framework has superior performance compared with the conventional greedy and codebook-based techniques [6]–[8], whose performance strongly relies on the quality of the channel.
• In most of the previous works, such as [6], [7], the codebooks formed by the feasible sets of array responses $\mathcal{F}$ and $\mathcal{W}$ are assumed to be known. The analog precoding design problem then reduces to selecting the best candidates in $\mathcal{F}$ and $\mathcal{W}$ to maximize the sum-rate. In this work, we only need $\mathcal{F}$ and $\mathcal{W}$ in the training stage to obtain the network labels. The proposed DL technique does not require such information in the prediction stage, where the DL network itself obtains the analog precoder weights by learning the features hidden in the input data.
• To train the network, a very large training dataset (almost half a million samples) is generated. Hence, robust performance against imperfect channels and deviations in the channel data is achieved.
• The proposed approach also requires less computation time for hybrid precoding design. While the conventional techniques require an optimization process or greedy searches, our CNN approach estimates the precoders by simply feeding the network with the corrupted channel matrix.

C. Notation

Vectors and matrices are denoted by boldface lower and upper case symbols, respectively. For a vector $\mathbf{a}$, $[\mathbf{a}]_i$ represents its $i$th element. For a matrix $\mathbf{A}$, $[\mathbf{A}]_{:,i}$ and $[\mathbf{A}]_{i,j}$ denote the $i$th column and the $(i,j)$th entry, respectively. $\mathbf{I}_N$ is the identity matrix of size $N \times N$, $\mathbb{E}\{\cdot\}$ denotes the statistical expectation, and $\|\cdot\|_F$ is the Frobenius norm. The notation $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse and $\angle\{\cdot\}$ denotes the angle of a complex scalar/vector. A convolutional layer with $N$ filters of size $D \times D$ is written as $N@D \times D$. For a complex scalar $a = e^{\mathrm{j}\phi}$ with continuous phase $\phi$, $Q(a) = e^{\mathrm{j}\phi_B}$ denotes the quantization operator, where $\phi_B$ is the quantized angle in $[0, 2\pi]$ sampled with $2^B$ points.

II. SYSTEM MODEL

We consider a multi-user mmWave MIMO system as shown in Fig. 1. The base station (BS), which serves $K$ users each having $N_{\mathrm{R}}$ antennas, is equipped with $N_{\mathrm{T}}$ antennas and $N_{\mathrm{T}}^{\mathrm{RF}}$ RF chains. To allow for cheaper hardware and hence lower power consumption at each user, we assume that the BS communicates with each user via a single stream, i.e., $N_{\mathrm{S}} = 1$ [7]. Hence, only analog combining is applied at the receiver. Another assumption is that $N_{\mathrm{T}}^{\mathrm{RF}} \geq K$, i.e., the number of simultaneously served users cannot exceed the number of BS RF chains.

Fig. 1. A multi-user MIMO system with hybrid (analog and baseband) precoding on the BS and analog-only combining at K users.

In the downlink, the BS applies baseband precoding $\mathbf{F}_{\mathrm{BB}} = [\mathbf{f}_{\mathrm{BB}_1}, \mathbf{f}_{\mathrm{BB}_2}, \dots, \mathbf{f}_{\mathrm{BB}_K}] \in \mathbb{C}^{N_{\mathrm{T}}^{\mathrm{RF}} \times K}$ to the transmit signal $\mathbf{s} = [s_1, s_2, \dots, s_K]^T \in \mathbb{C}^{K}$, which obeys $\mathbb{E}\{\mathbf{s}\mathbf{s}^H\} = \frac{P}{K}\mathbf{I}_K$ under equal power allocation among the users. Note that $P$ denotes the average power. The RF precoders $\mathbf{F}_{\mathrm{RF}} \in \mathbb{C}^{N_{\mathrm{T}} \times N_{\mathrm{T}}^{\mathrm{RF}}}$, which are constructed by phase shifters, are used to convey the signal to the $N_{\mathrm{T}}$ transmit antennas. Since $\mathbf{F}_{\mathrm{RF}}$ consists of analog phase shifters, we assume that the RF precoder has constant equal-norm elements, i.e., $|[\mathbf{F}_{\mathrm{RF}}]_{i,j}|^2 = 1/N_{\mathrm{T}}$. In addition, we have the power constraint $\|\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}\|_F^2 = K$, which is enforced by the normalization of $\mathbf{F}_{\mathrm{BB}}$. Thus, the $N_{\mathrm{T}} \times 1$ transmitted signal is written as

$$\mathbf{x} = \mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}\mathbf{s}. \tag{1}$$

We can write the received signal of the $k$th user for a narrowband block-fading channel as [33]

$$\tilde{\mathbf{y}}_k = \mathbf{H}_k \sum_{n=1}^{K} \mathbf{F}_{\mathrm{RF}}\mathbf{f}_{\mathrm{BB}_n} s_n + \mathbf{n}_k, \tag{2}$$

where $\mathbf{H}_k \in \mathbb{C}^{N_{\mathrm{R}} \times N_{\mathrm{T}}}$ is the channel matrix between the BS and the $k$th user with $\|\mathbf{H}_k\|_F = N_{\mathrm{R}}N_{\mathrm{T}}$. The vector $\mathbf{n}_k \in \mathbb{C}^{N_{\mathrm{R}}}$ denotes the complex additive white Gaussian noise (AWGN) with $\mathbf{n}_k \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I}_{N_{\mathrm{R}}})$. Once the transmitted signal is received at the $k$th user, it is processed by the combiner $\mathbf{w}_{\mathrm{RF}_k} \in \mathbb{C}^{N_{\mathrm{R}}}$ as $y_k = \mathbf{w}_{\mathrm{RF}_k}^H \tilde{\mathbf{y}}_k$, i.e.,

$$y_k = \mathbf{w}_{\mathrm{RF}_k}^H \mathbf{H}_k \sum_{n=1}^{K} \mathbf{F}_{\mathrm{RF}}\mathbf{f}_{\mathrm{BB}_n} s_n + \mathbf{w}_{\mathrm{RF}_k}^H \mathbf{n}_k, \tag{3}$$

where the RF combiners are constructed by means of phase shifters with the normalization constraint $|[\mathbf{w}_{\mathrm{RF}_k}]_i|^2 = 1/N_{\mathrm{R}}$.

A. Channel Model

In mmWave transmission, the channel can be represented by a geometric model with limited scattering [34]–[36]. Hence, we assume that the channel matrix $\mathbf{H}_k$ includes the contributions of $L$ scattering paths. Considering a 2-D uniform planar array (UPA), the channel matrix corresponding to the $k$th user is given by

$$\mathbf{H}_k = \gamma \sum_{l=1}^{L} \alpha_{l,k}\, g_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})\, g_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})\, \mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})\, \mathbf{a}_{\mathrm{T}}^H(\Theta_{\mathrm{T}}^{(l,k)}),$$

where $\Theta_{\mathrm{R}}^{(l,k)} = (\phi_{\mathrm{R}}^{(l,k)}, \theta_{\mathrm{R}}^{(l,k)})$ and $\Theta_{\mathrm{T}}^{(l,k)} = (\phi_{\mathrm{T}}^{(l,k)}, \theta_{\mathrm{T}}^{(l,k)})$ denote the angles of arrival and departure, respectively. Note that the angular parameters $\phi$ and $\theta \in [0, 2\pi]$ correspond to the azimuth and the elevation angles, respectively. The scalar $\gamma = \sqrt{N_{\mathrm{T}}N_{\mathrm{R}}/L}$ is the normalization factor and $\alpha_{l,k}$ is the complex channel gain associated with the $k$th user and the $l$th path, $l = 1, \dots, L$. Also, $g_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$ and $g_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})$ are the antenna element gains for the antennas in the arrays, while $\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$ and $\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})$ are the $N_{\mathrm{R}} \times 1$ and $N_{\mathrm{T}} \times 1$ steering vectors representing the array responses at the $k$th user and the BS, respectively. The $n$th element of the steering vector $\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$ is given as

$$[\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})]_n = \exp\left\{ -\mathrm{j}\frac{2\pi}{\lambda}\, \mathbf{p}_n^T \mathbf{r}(\Theta_{\mathrm{R}}^{(l,k)}) \right\}, \tag{4}$$

where $\lambda$ is the wavelength and $\mathbf{p}_n = [x_n, y_n, z_n]^T$ is the position of the $n$th antenna in the Cartesian coordinate system.
The direction vector is given by

$$\mathbf{r}(\Theta_{\mathrm{R}}^{(l,k)}) = [\sin(\phi_{\mathrm{R}}^{(l,k)})\cos(\theta_{\mathrm{R}}^{(l,k)}),\ \sin(\phi_{\mathrm{R}}^{(l,k)})\sin(\theta_{\mathrm{R}}^{(l,k)}),\ \cos(\theta_{\mathrm{R}}^{(l,k)})]^T. \tag{5}$$

The transmitter-side steering vector $\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})$ is defined in the same way as $\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$.

By assuming that Gaussian symbols are transmitted through the mmWave channel under study, the achievable rate for the $k$th user is written as [5], [7]

$$R_k = \log_2 \left| 1 + \frac{\frac{P}{K} |\mathbf{w}_{\mathrm{RF}_k}^H \mathbf{H}_k \mathbf{F}_{\mathrm{RF}} \mathbf{f}_{\mathrm{BB}_k}|^2}{\frac{P}{K} \sum_{n \neq k} |\mathbf{w}_{\mathrm{RF}_k}^H \mathbf{H}_k \mathbf{F}_{\mathrm{RF}} \mathbf{f}_{\mathrm{BB}_n}|^2 + \sigma^2} \right| \tag{6}$$

and the achievable sum-rate of the system is given by $\bar{R} = \sum_{k=1}^{K} R_k$.

III. PROBLEM FORMULATION

The principal aim of this work is to design the hybrid precoder and combiners $\mathbf{F}_{\mathrm{BB}}$, $\mathbf{F}_{\mathrm{RF}}$ and $\{\mathbf{w}_{\mathrm{RF}_k}\}_{k=1}^{K}$ in the presence of imperfect channel data by maximizing the sum-rate. Specifically, we first develop an algorithm to compute the hybrid precoders that maximize the sum-rate, and then design a deep network such that the hybrid precoders are predicted by feeding the network with imperfect CSI. In a nutshell, the proposed DL framework provides a nonlinear mapping from the channel matrix $\mathbf{H}$ to the analog beamformers $\mathbf{F}_{\mathrm{RF}}$ and $\{\mathbf{w}_{\mathrm{RF}_k}\}_{k=1}^{K}$. The label generation process depends on the channel model, whereas updating the network parameters in the training stage does not require it. Hence, CNN-MIMO can also be used for various channel models in mmWave systems [37].
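To make the dependence of the labels on the channel model concrete, the geometric model of Section II-A can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: unit antenna element gains, path gains drawn from $\mathcal{CN}(0,1)$, angles drawn uniformly from $[0, 2\pi]$, a planar array placed in the x-y plane, and the usual imaginary unit in the steering-vector exponent. The function names are hypothetical.

```python
import numpy as np

def upa_positions(n_x, n_y, spacing):
    """Positions p_n = [x_n, y_n, z_n]^T of an n_x-by-n_y planar array (assumed in the x-y plane)."""
    xs, ys = np.meshgrid(np.arange(n_x), np.arange(n_y), indexing="ij")
    zs = np.zeros(n_x * n_y)
    return spacing * np.stack([xs.ravel(), ys.ravel(), zs], axis=1)

def direction(phi, theta):
    """Direction vector r(Theta) as written in (5)."""
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(theta)])

def steering(positions, phi, theta, wavelength):
    """Steering vector of (4): n-th entry exp(-j * 2*pi/lambda * p_n^T r)."""
    return np.exp(-1j * 2 * np.pi / wavelength * positions @ direction(phi, theta))

def channel(pos_r, pos_t, n_paths, wavelength, rng):
    """One user's channel: gamma * sum_l alpha_l * a_R * a_T^H (unit element gains assumed)."""
    n_r, n_t = len(pos_r), len(pos_t)
    gamma = np.sqrt(n_t * n_r / n_paths)
    h = np.zeros((n_r, n_t), dtype=complex)
    for _ in range(n_paths):
        # Assumption: complex Gaussian path gain and uniformly distributed angles.
        alpha = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        phi_r, theta_r, phi_t, theta_t = rng.uniform(0, 2 * np.pi, 4)
        a_r = steering(pos_r, phi_r, theta_r, wavelength)
        a_t = steering(pos_t, phi_t, theta_t, wavelength)
        h += alpha * np.outer(a_r, a_t.conj())
    return gamma * h
```

With half-wavelength spacing, a call such as `channel(upa_positions(2, 2, 0.5), upa_positions(4, 4, 0.5), 3, 1.0, np.random.default_rng(0))` returns a $4 \times 16$ complex matrix for one user.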
Given that our main focus is hybrid beamforming, in this work we use the block-fading channel model because of the simple structure of its channel matrix and rate computation [25], [26], [38]. The application of DL to other channel models is the topic of ongoing research.

Estimating the channel matrix of the users is a challenging task, especially with the large number of antennas used in massive MIMO systems [39], [40]. In addition, since the coherence time of the channel is very short in the mmWave massive MIMO scenario, the parameters related to the channel characteristics change greatly in a short time [41]. To obtain robust precoding performance, we feed the deep network with several channel realizations that are corrupted by synthetic noise in the training stage, which is an offline process. Hence, in the testing stage, when the network predicts the precoder weights, the network does not necessarily require perfect CSI [30]. We show, through simulations, that the proposed approach can handle a corrupted channel matrix and exhibits satisfactory performance in terms of the achievable sum-rate.

The main stages of the proposed DL framework are label generation, training, and prediction. In the following section, we first discuss how the labels are obtained from the channel data. Then, in Section V, we present the details of the training and the prediction stages.

IV. HYBRID PRECODING DESIGN IN MULTI-USER MIMO SYSTEMS

In order to design the network and the training data, we first need to solve the hybrid precoding problem and obtain the labels of the training data samples. For this reason, we first develop an exhaustive search algorithm that visits all precoder and combiner combinations in the feasible sets $\mathcal{F}$ and $\mathcal{W}$ such that the sum-rate in (6) is maximized. Then, we solve the exhaustive search problem offline to obtain the training data inputs and labels.
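The advantage of using a DL approach is that it reduces the computation time of the hybrid precoding design problem while retaining the near-optimum performance of the exhaustive search.

We start by formulating the optimization problem for hybrid precoding in the multi-user scenario as

$$\{\hat{\mathbf{F}}_{\mathrm{BB}}, \hat{\mathbf{F}}_{\mathrm{RF}}, \hat{\mathbf{W}}_{\mathrm{RF}}\} = \arg\max_{\mathbf{F}_{\mathrm{BB}}, \mathbf{F}_{\mathrm{RF}}, \mathbf{W}_{\mathrm{RF}}} \bar{R} \quad \text{subject to: } \mathbf{F}_{\mathrm{RF}} \in \mathcal{F},\ \mathbf{W}_{\mathrm{RF}} \in \mathcal{W},\ \|\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}\|_F^2 = K, \tag{7}$$

where $\mathbf{W}_{\mathrm{RF}} = [\mathbf{w}_{\mathrm{RF}_1}, \mathbf{w}_{\mathrm{RF}_2}, \dots, \mathbf{w}_{\mathrm{RF}_K}]$ denotes the analog combiners of all users, while $\mathcal{F}$ and $\mathcal{W}$ are the feasible sets of the precoder and combiners. In practice, $\mathcal{F}$ and $\mathcal{W}$ are composed of the steering vectors $\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})$ and $\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$, $\forall l, k$, respectively, with quantized phases. Specifically, the array response sets are selected as

$$\mathcal{F} = \{Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(1,1)})), \dots, Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(L,K)}))\}, \tag{8}$$

and

$$\mathcal{W} = \{Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(1,1)})), \dots, Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(L,K)}))\}, \tag{9}$$

where $Q(\cdot)$ denotes the phase quantization operator introduced in Section I-C.

In the exhaustive search algorithm, it is desired to visit all possible combinations of the elements in the feasible sets $\mathcal{F}$ and $\mathcal{W}$ to achieve near-optimum performance. For this reason, we redefine the feasible sets $\mathcal{F}$ and $\mathcal{W}$ so that they include all precoder and combiner combinations. The search algorithm visits all the nodes in the direction set

$$\mathcal{D} = \left[0, \frac{2\pi}{\bar{L}}, \frac{4\pi}{\bar{L}}, \dots, (\bar{L}-1)\frac{2\pi}{\bar{L}}\right], \tag{10}$$

where $|\mathcal{D}| = \bar{L}$. By assuming that the BS receives $\bar{L}$ paths from each user, the $k$th column of $\mathbf{F}_{\mathrm{RF}}$ can take $\bar{L}$ different values, i.e., $\{Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)}))\}_{l=1}^{\bar{L}}$. Generalizing to all users, we have $Q_F = \bar{L}^K$ possible candidates to design $\mathbf{F}_{\mathrm{RF}}$. Thus, we define a new set as

$$\mathcal{F} = \{\mathbf{F}_1, \mathbf{F}_2, \dots, \mathbf{F}_{Q_F}\}, \tag{11}$$

where $\mathbf{F}_{q_F} \in \mathbb{C}^{N_{\mathrm{T}} \times K}$ is given by

$$\mathbf{F}_{q_F} = [Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l_1,1)})), Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l_2,2)})), \dots, Q(\mathbf{a}_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l_K,K)}))]$$

with the indices for each user given by $l_1, l_2, \dots, l_K = 1, \dots, \bar{L}$. Hence, $q_F = 1, \dots, \bar{L}^K$ indexes the precoder candidates for the $K$ users. In a similar way, the set for the analog combiners is defined as $\mathcal{W} = \{\mathbf{W}_1, \mathbf{W}_2, \dots, \mathbf{W}_{Q_W}\}$, where $\mathbf{W}_{q_W} \in \mathbb{C}^{N_{\mathrm{R}} \times K}$ is given by

$$\mathbf{W}_{q_W} = [Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l_1,1)})), Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l_2,2)})), \dots, Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l_K,K)}))]$$

with $\mathbf{w}_{\mathrm{RF}_k}$ selected as the $k$th column of $\mathbf{W}_{q_W}$, i.e., $Q(\mathbf{a}_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l_k,k)}))$. Once the analog precoder and combiners are selected from the sets $\mathcal{F}$ and $\mathcal{W}$, the effective channel $\mathbf{H}^{\mathrm{eff}}_{q_F,q_W} \in \mathbb{C}^{K \times N_{\mathrm{T}}^{\mathrm{RF}}}$ is given by

$$\mathbf{H}^{\mathrm{eff}}_{q_F,q_W} = \begin{bmatrix} \mathbf{h}^{\mathrm{eff}}_{q_F,q_W,1} \\ \mathbf{h}^{\mathrm{eff}}_{q_F,q_W,2} \\ \vdots \\ \mathbf{h}^{\mathrm{eff}}_{q_F,q_W,K} \end{bmatrix}, \tag{12}$$

where the corresponding effective channel for each user can be calculated as

$$\mathbf{h}^{\mathrm{eff}}_{q_F,q_W,k} = [\mathbf{W}_{q_W}]_{:,k}^H \mathbf{H}_k \mathbf{F}_{q_F}. \tag{13}$$

The baseband precoder is given by $\mathbf{F}_{\mathrm{BB},q_F,q_W} = (\mathbf{H}^{\mathrm{eff}}_{q_F,q_W})^{\dagger}$ and it is normalized as $\mathbf{f}^{(q_F,q_W)}_{\mathrm{BB}_k} = \mathbf{f}^{(q_F,q_W)}_{\mathrm{BB}_k} / \|\mathbf{F}_{q_F}\mathbf{f}^{(q_F,q_W)}_{\mathrm{BB}_k}\|_F$ [7]. Thus, the achievable sum-rate can be written as

$$\bar{R}_{q_F,q_W} = \log_2 \left| \mathbf{I}_K + \frac{P}{K\sigma^2} \mathbf{H}^{\mathrm{eff}}_{q_F,q_W} \mathbf{F}_{\mathrm{BB},q_F,q_W} \mathbf{F}^{H}_{\mathrm{BB},q_F,q_W} \mathbf{H}^{\mathrm{eff}\,H}_{q_F,q_W} \right|. \tag{14}$$

Algorithm 1 Hybrid precoding for multi-user MIMO
Input: $\{\mathbf{H}_k\}_{k=1}^{K}$, $\mathcal{F}$, $\mathcal{W}$, $\mathcal{D}$.
Output: $\hat{\mathbf{F}}_{\mathrm{RF}}$, $\hat{\mathbf{W}}_{\mathrm{RF}}$.
1: for $1 \leq q_F \leq Q_F$ do
2:   $\mathbf{F}_{\mathrm{RF}} = \mathbf{F}_{q_F}$,
3:   for $1 \leq q_W \leq Q_W$ do
4:     $\mathbf{w}_{\mathrm{RF}_k} = [\mathbf{W}_{q_W}]_{:,k}$,
5:     $\mathbf{h}^{\mathrm{eff}}_k = \mathbf{w}_{\mathrm{RF}_k}^H \mathbf{H}_k \mathbf{F}_{\mathrm{RF}}$,
6:     $\mathbf{F}_{\mathrm{BB}} = (\mathbf{H}^{\mathrm{eff}})^{\dagger}$,
7:     $\mathbf{f}_{\mathrm{BB}_k} = \mathbf{f}_{\mathrm{BB}_k} / \|\mathbf{F}_{\mathrm{RF}}\mathbf{f}_{\mathrm{BB}_k}\|_F$,
8:     Compute $\bar{R}_{q_F,q_W}$ as in (14).
9:   end for $q_W$,
10: end for $q_F$,
11: $\{\bar{q}_F, \bar{q}_W\} = \arg\max_{q_F,q_W} \bar{R}_{q_F,q_W}$.
12: $\hat{\mathbf{F}}_{\mathrm{RF}} = \mathbf{F}_{\bar{q}_F}$ and $\hat{\mathbf{W}}_{\mathrm{RF}} = \mathbf{W}_{\bar{q}_W}$.

Using the sets $\mathcal{F}$ and $\mathcal{W}$, the optimization problem in (7) can be rewritten as

$$\{\bar{q}_F, \bar{q}_W\} = \arg\max_{q_F,q_W} \bar{R}_{q_F,q_W} \quad \text{subject to: } \mathbf{F}_{\mathrm{RF}} = \mathbf{F}_{q_F},\ \mathbf{w}_{\mathrm{RF}_k} = [\mathbf{W}_{q_W}]_{:,k},\ \mathbf{h}^{\mathrm{eff}}_k = \mathbf{w}_{\mathrm{RF}_k}^H \mathbf{H}_k \mathbf{F}_{\mathrm{RF}},\ \mathbf{F}_{\mathrm{BB}} = (\mathbf{H}^{\mathrm{eff}})^{\dagger},\ \mathbf{f}_{\mathrm{BB}_k} = \mathbf{f}_{\mathrm{BB}_k} / \|\mathbf{F}_{\mathrm{RF}}\mathbf{f}_{\mathrm{BB}_k}\|_F, \tag{15}$$

where $\bar{q}_F$ and $\bar{q}_W$ denote the indices providing the maximum sum-rate.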
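The search in (15) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the helper names are hypothetical, the candidate steering vectors are assumed to be supplied per user, $K$ RF chains are assumed (so the effective channel is $K \times K$), and the log-determinant in (14) is evaluated with `np.linalg.slogdet` for numerical stability.

```python
import itertools
import numpy as np

def quantize_phase(a, bits):
    """Q(.): snap the phase of each entry to a grid of 2**bits points, keeping the modulus."""
    step = 2 * np.pi / 2 ** bits
    return np.abs(a) * np.exp(1j * step * np.round(np.angle(a) / step))

def build_candidates(per_user_vectors):
    """All combinations with one vector per user, each stacked into an (N x K) matrix."""
    return [np.stack(combo, axis=1) for combo in itertools.product(*per_user_vectors)]

def sum_rate(H, F_rf, W_rf, power, sigma2):
    """Sum-rate (14) for one (precoder, combiner) candidate pair."""
    K = len(H)
    # Effective channel (12)-(13): row k is w_k^H H_k F_RF.
    H_eff = np.stack([W_rf[:, k].conj() @ H[k] @ F_rf for k in range(K)])
    F_bb = np.linalg.pinv(H_eff)                                   # baseband precoder
    F_bb = F_bb / np.linalg.norm(F_rf @ F_bb, axis=0, keepdims=True)  # per-column normalization
    G = H_eff @ F_bb
    M = np.eye(K) + power / (K * sigma2) * G @ G.conj().T
    _, logdet = np.linalg.slogdet(M)
    return logdet / np.log(2)

def exhaustive_search(H, F_cands, W_cands, power=1.0, sigma2=1.0):
    """Visit every (q_F, q_W) pair and return (best rate, q_F index, q_W index)."""
    best = (-np.inf, None, None)
    for qf, F_rf in enumerate(F_cands):
        for qw, W_rf in enumerate(W_cands):
            rate = sum_rate(H, F_rf, W_rf, power, sigma2)
            if rate > best[0]:
                best = (rate, qf, qw)
    return best
```

The nested loops make the $Q_F Q_W$ cost of the search explicit: with $\bar{L}$ candidates per user, both candidate lists grow as $\bar{L}^K$.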
We summarize the algorithmic steps of the proposed approach in Algorithm 1. Note that the proposed hybrid precoding optimization in (15) differs from the one in [7], which does not consider all possible combinations of the analog precoders as is done in this work. In Section VI, we show that (15) yields better results than [7]. The problem in (15) requires visiting $Q_F Q_W$ nodes to estimate the hybrid precoders. To reduce the complexity and the need for the array responses, in the following section we propose a DL-based approach and elaborate on the details of the training data generation and the network architecture.

V. LEARNING-BASED HYBRID PRECODING

In this part, we present our DL framework for hybrid precoding design. The proposed network architecture is illustrated in Fig. 2. The CNN-MIMO architecture consists of ten layers; it accepts input data of size $N_{\mathrm{R}} \times N_{\mathrm{T}} \times 3$ and yields a $K(N_{\mathrm{T}} + N_{\mathrm{R}}) \times 1$ vector at the output. The overall network architecture of CNN-MIMO can be represented by the function $\Pi(\cdot) : \mathbb{R}^{N_{\mathrm{R}} \times N_{\mathrm{T}} \times 3} \to \mathbb{R}^{K(N_{\mathrm{R}} + N_{\mathrm{T}})}$. Denoting the arithmetic operation of the $i$th layer in the network by $f^{(i)}(\cdot)$, the overall network can be represented as

$$\Pi(\mathbf{X}) = f^{(10)}\left( f^{(9)}(\cdots f^{(1)}(\mathbf{X}) \cdots) \right) = \mathbf{z}, \tag{16}$$

where each layer has a specific task, described in Section V-B. In the sequel, we explicitly show the arithmetic operations for the fully connected and convolutional layers.

Let $\bar{\mathbf{W}} \in \mathbb{R}^{C_x \times C_y}$ be the weights of a fully connected layer in the network with input $\bar{\mathbf{x}} \in \mathbb{R}^{C_x}$ and output $\bar{\mathbf{y}} \in \mathbb{R}^{C_y}$. The $c_y$th element of the output of the layer is given by the inner product

$$\bar{y}_{c_y} = \langle \bar{\mathbf{W}}_{c_y}, \bar{\mathbf{x}} \rangle = \sum_{i} [\bar{\mathbf{W}}^T]_{c_y,i}\, \bar{x}_i, \tag{17}$$

for $c_y = 1, \dots, C_y$, where $\bar{\mathbf{W}}_{c_y}$ is the $c_y$th column vector of $\bar{\mathbf{W}}$. For a convolutional layer, define $\bar{\mathbf{X}} \in \mathbb{R}^{d_x \times d_x \times C_x}$ and $\bar{\mathbf{Y}} \in \mathbb{R}^{d_y \times d_y \times C_y}$ as the input feature maps and the output of the layer, respectively.
Let us also define $d_x \times d_y$ as the size of the convolutional kernel, and $C_x \times C_y$ as the size of the response of the convolutional layer for each feature map. Then, the response of a convolutional layer becomes

$$\bar{Y}_{p_y,c_y} = \sum_{p_k,p_x} \langle \bar{\mathbf{W}}_{c_y,p_k}, \bar{\mathbf{X}}_{p_x} \rangle, \tag{18}$$

where $\bar{Y}_{p_y,c_y}$ is the response for the 2-D spatial region $p_y$ in the $c_y$th channel of the feature maps, $\bar{\mathbf{W}}_{c_y,p_k} \in \mathbb{R}^{C_x}$ denotes the weights of the $c_y$th convolutional kernel, and $\bar{\mathbf{X}}_{p_x} \in \mathbb{R}^{C_x}$ is the input feature map at spatial position $p_x$. Here, $p_x$ and $p_k$ are the 2-D spatial positions in the feature maps and the convolutional kernels, respectively [42].

A. Training Data Generation

In order to train the network, we prepare a training dataset from several channel realizations. We generate $N$ different channel realizations for $K$ users. Next, each of these channel matrices is corrupted by synthetic noise for $G$ realizations. The noise is added to each entry of the channel matrix, and we define the SNR for the training data generation as

$$\mathrm{SNR}_{\mathrm{TRAIN}} = 20 \log_{10}\left( \frac{|[\mathbf{H}_k^{(n,g)}]_{i,j}|^2}{\sigma^2_{\mathrm{TRAIN}}} \right),$$

where $\sigma^2_{\mathrm{TRAIN}}$ is the variance of the synthetic noise. Note that $[\mathbf{H}_k^{(n,g)}]_{i,j}$ denotes the $(i,j)$th entry of the $k$th channel matrix for the $(n,g)$th realization with $n = 1, \dots, N$ and $g = 1, \dots, G$.

The input of the network consists of three channels. The first channel contains the absolute values of the entries of the channel matrix. The second and the third channels contain the real and imaginary parts of the channel matrix, respectively. This approach provides good features for the solution of the problem [31]. Specifically, let $\mathbf{X} \in \mathbb{R}^{N_{\mathrm{R}} \times N_{\mathrm{T}} \times 3}$ be the input of the network. Then, for a channel matrix $\mathbf{H} \in \mathbb{C}^{N_{\mathrm{R}} \times N_{\mathrm{T}}}$, the first channel of the input is given by $[[\mathbf{X}]_{:,:,1}]_{i,j} = |[\mathbf{H}]_{i,j}|$. The second and the third channels are given by $[[\mathbf{X}]_{:,:,2}]_{i,j} = \mathrm{Re}\{[\mathbf{H}]_{i,j}\}$ and $[[\mathbf{X}]_{:,:,3}]_{i,j} = \mathrm{Im}\{[\mathbf{H}]_{i,j}\}$, respectively. The output of the network is composed of the analog precoder and combiners.
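The three-channel input described above is straightforward to build. The following sketch shows one way to do it, together with the entry-wise noise corruption. The function names are hypothetical, and the noise is assumed to be circularly symmetric complex Gaussian with variance $\sigma^2_{\mathrm{TRAIN}}$ per entry.

```python
import numpy as np

def add_synthetic_noise(H, sigma2, rng):
    """Corrupt every entry of H with complex Gaussian noise of variance sigma2 (assumed circular)."""
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(H.shape)
                                   + 1j * rng.standard_normal(H.shape))
    return H + noise

def to_network_input(H):
    """Stack |H|, Re{H}, Im{H} along a third axis: output shape (N_R, N_T, 3)."""
    return np.stack([np.abs(H), H.real, H.imag], axis=-1)
```

For example, the single entry $3 + 4\mathrm{j}$ is mapped to the three features $(5, 3, 4)$, so the magnitude channel carries information that is only implicit in the real and imaginary parts.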
Let $\mathbf{z} \in \mathbb{R}^{N_{\mathrm{T}}K + N_{\mathrm{R}}K}$ be a real-valued vector. Then we design the output as

$$\mathbf{z} = [\angle\{\mathrm{vec}(\mathbf{F}_{\mathrm{RF}})^T\}, \angle\{\mathrm{vec}(\mathbf{W}_{\mathrm{RF}})^T\}]^T, \tag{19}$$

where $\mathbf{F}_{\mathrm{RF}} \in \mathbb{C}^{N_{\mathrm{T}} \times K}$ and $\mathbf{W}_{\mathrm{RF}} \in \mathbb{C}^{N_{\mathrm{R}} \times K}$. Hence, the input-output pair of the network is $(\mathbf{X}, \mathbf{z})$. We summarize the data generation process in Algorithm 2. The total number of inputs is $T = NGK$ for $K$ users. Note that the input data is composed of each user's channel information, as in lines 7−12 of Algorithm 2, and we record the analog precoder and combiner associated with each user channel. Note also that the same analog precoders are used for all noisy channel realizations. This introduces synthetic noise in the input dataset to make the network robust against corrupted channel data [13], [31].

Fig. 2. (Top) The proposed network architecture. The input is the channel matrix of any user in the network and the output is the corresponding analog precoder and combiners. (Bottom) The diagram for the training and prediction stage of the proposed DL framework.

Algorithm 2 Training data generation for CNN-MIMO.
Input: $N$, $G$, $K$, $\mathrm{SNR}_{\mathrm{TRAIN}}$.
Output: Training data $\mathcal{D}_{\mathrm{TRAIN}}$.
1: Generate $N$ different realizations of the multi-user MIMO scenario with channel matrices $\{\mathbf{H}_k^{(n)}\}_{n=1}^{N}$ and corresponding feasible sets $\{\mathcal{F}^{(n)}\}_{n=1}^{N}$, $\{\mathcal{W}^{(n)}\}_{n=1}^{N}$ $\forall k$.
2: Initialize with $t = 1$, where the dataset length is $T = NGK$.
3: for $1 \leq n \leq N$ do
4:   for $1 \leq g \leq G$ do
5:     $[\mathbf{H}_k^{(n,g)}]_{i,j} \sim \mathcal{CN}([\mathbf{H}_k^{(n)}]_{i,j}, \sigma^2_{\mathrm{TRAIN}})$.
6:     Using $\mathbf{H}_k^{(n,g)}$, $\mathcal{F}^{(n)}$, $\mathcal{W}^{(n)}$ in Algorithm 1, find $\hat{\mathbf{F}}_{\mathrm{RF}}^{(n,g)}$ and $\hat{\mathbf{W}}_{\mathrm{RF}}^{(n,g)}$ using $\bar{q}_F^{(n,g)}$ and $\bar{q}_W^{(n,g)}$.
7:     for $1 \leq k \leq K$ do
8:       $[[\mathbf{X}^{(t)}]_{:,:,1}]_{i,j} = |[\mathbf{H}_k^{(n,g)}]_{i,j}|$.
9:       $[[\mathbf{X}^{(t)}]_{:,:,2}]_{i,j} = \mathrm{Re}\{[\mathbf{H}_k^{(n,g)}]_{i,j}\}$.
10:       $[[\mathbf{X}^{(t)}]_{:,:,3}]_{i,j} = \mathrm{Im}\{[\mathbf{H}_k^{(n,g)}]_{i,j}\}$ $\forall i, j$.
11:       $\mathbf{z}^{(t)} = [\angle\{\mathrm{vec}(\hat{\mathbf{F}}_{\mathrm{RF}}^{(n,g)})^T\}, \angle\{\mathrm{vec}(\hat{\mathbf{W}}_{\mathrm{RF}}^{(n,g)})^T\}]^T$.
12:       Construct the input-output pair $(\mathbf{X}^{(t)}, \mathbf{z}^{(t)})$.
13:       $t = t + 1$.
14:     end for $k$,
15:   end for $g$,
16: end for $n$,
17: Training data for CNN-MIMO is obtained from the collection of the input-output pairs as $\mathcal{D}_{\mathrm{TRAIN}} = \big( (\mathbf{X}^{(1)}, \mathbf{z}^{(1)}), (\mathbf{X}^{(2)}, \mathbf{z}^{(2)}), \dots, (\mathbf{X}^{(T)}, \mathbf{z}^{(T)}) \big)$.

B. Network Architecture

The proposed network shown in Fig. 2 is composed of ten layers. The first layer is the input layer, accepting the channel matrix data of size $N_{\mathrm{R}} \times N_{\mathrm{T}} \times 3$, i.e., 3 "channels", each of size $N_{\mathrm{R}} \times N_{\mathrm{T}}$. The second and the fourth layers are convolutional layers with 256 filters of size $2 \times 2$ that extract the features hidden in the input data. We feed the network with the real and imaginary parts of the channel data, which provides a large number of features [13], [30] and helps the network learn the mapping from the input data to the corresponding labels. After each convolutional layer, there is a normalization layer to normalize the output and provide better convergence. The sixth and eighth layers are fully connected layers with 2048 units each. There are dropout layers with a 50% probability after the fully connected layers (the seventh and ninth layers). The dropout layers reduce the dependence of the network on the initial weights. The output layer is the regression layer with $K(N_{\mathrm{R}} + N_{\mathrm{T}})$ units, which contain the phase information of the analog precoders. To obtain the network parameters, such as the number of layers, the number of filters, and the kernel sizes, we have conducted a hyperparameter tuning process to achieve sufficiently good network accuracy and sum-rate performance [11], [13], [30].
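The ten-layer stack described above can be written down as a plain configuration list, which makes the layer order and the output size easy to check. This is only a bookkeeping sketch: the layer names and dictionary keys are hypothetical, and details not stated in the text (activation functions, padding, normalization type) are deliberately left out.

```python
def cnn_mimo_layers(n_r, n_t, k):
    """Layer-by-layer description of CNN-MIMO as given in the text (names are illustrative)."""
    return [
        ("input", {"shape": (n_r, n_t, 3)}),                  # layer 1
        ("conv", {"filters": 256, "kernel": (2, 2)}),         # layer 2
        ("normalization", {}),                                # layer 3
        ("conv", {"filters": 256, "kernel": (2, 2)}),         # layer 4
        ("normalization", {}),                                # layer 5
        ("fully_connected", {"units": 2048}),                 # layer 6
        ("dropout", {"rate": 0.5}),                           # layer 7
        ("fully_connected", {"units": 2048}),                 # layer 8
        ("dropout", {"rate": 0.5}),                           # layer 9
        ("regression_output", {"units": k * (n_r + n_t)}),    # layer 10
    ]
```

For instance, with hypothetical sizes $N_{\mathrm{R}} = 4$, $N_{\mathrm{T}} = 16$ and $K = 3$, the output layer has $3 \times (4 + 16) = 60$ units, one per phase of the analog precoder and combiners.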
The current network architecture with a $2 \times 2$ kernel is one possible solution to the considered problem, and network structures with different kernel sizes yield similar performance. In other words, although different kernel sizes can also be used for this problem, in this work we selected the configuration through a hyperparameter tuning process that provides sufficient performance for the considered scenario with less computational complexity [11], [13], [30].

The computational workload of a CNN results from the intensive use of arithmetic operations in its layers. Most of these operations occur in the convolutional part of the network; hence, convolutional layers are responsible for more than 90% of the execution time during inference [43]. In contrast, most of the CNN weights reside in the fully connected layers, which require approximately 90% of the memory due to their large number of weights [43]. Hence, the complexity of a CNN is directly proportional to the number of parameters and the number of layers. The layers of the proposed CNN structure are described above, and the number of parameters can be calculated as $C^{2}\left(2N_{\mathrm{cv}}(wh + 1) + \left([N_{\mathrm{fc1}} + 1] + [N_{\mathrm{fc2}} + 1]\right) \cdot \frac{50}{100}\right)$ [43]. Here, $C = 3$ corresponds to the number of channels, $w = h = 2$ is the filter size, and $N_{\mathrm{cv}} = 256$ is the number of filters in both convolutional layers. The variables $N_{\mathrm{fc1}} = N_{\mathrm{fc2}} = 2048$ describe the number of units in the fully connected layers for 50% dropout probability. Hence, the CNN-MIMO structure in Fig.
2 has 41481 parameters.

C. Training

The CNN structure in Fig. 2 is realized and trained in MATLAB on a PC with a single GPU and a 768-core processor. We have used the stochastic gradient descent algorithm with momentum 0.9 [44] and updated the network parameters with a learning rate of 0.005 and a mini-batch size of 500 samples for 100 epochs. As a loss function, we used the MSE given by $L = \frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{z}^{(t)} - f(\mathbf{X}^{(t)})\right)^{2}$, where $f(\mathbf{X})$ is a function of the input data $\mathbf{X}$ that represents the nonlinear transformation achieved by the network [11]. To train the proposed CNN structure, $N = 500$ different multi-user scenarios are realized with $K = 3$ users (1500 channel realizations in total) as in Algorithm 2. For each channel matrix, AWGN is added at the powers $\mathrm{SNR}_{\mathrm{TRAIN}} \in \{15, 20, 25\}$ dB with $G = 100$ to account for different channel characteristics. The use of multiple $\mathrm{SNR}_{\mathrm{TRAIN}}$ levels provides a wide range of corrupted data in the training, which improves the learning and robustness of the network. Hence, the total size of the training data is $N_{\mathrm{R}} \times N_{\mathrm{T}} \times 3 \times 450000$. In the training process, 80% and 20% of all generated data are selected as the training and validation datasets, respectively. The validation aids in hyperparameter tuning during the training phase to avoid the network simply memorizing the training data rather than learning general features for accurate prediction with new data. The validation data is used to test the performance of the network in the simulations for $J_{T} = 100$ Monte Carlo trials. In order to prevent similarity between the test data and the training data, we also add synthetic noise to the test data, where the SNR during testing is defined similarly to $\mathrm{SNR}_{\mathrm{TRAIN}}$ as $\mathrm{SNR}_{\mathrm{TEST}} = 20\log_{10}\left(\frac{|[\mathbf{H}]_{i,j}|^{2}}{\sigma^{2}_{\mathrm{TEST}}}\right)$. The number of grid points is selected as $\bar{L} = 60$ for azimuth and $\bar{L} = 20$ for elevation angular sectors in Algorithm 1.
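The parameter count quoted for the network and the dataset size quoted above can be checked with a few lines of arithmetic. The sketch below simply evaluates the parameter expression from Section V-B and the product of the training settings; the function and variable names are hypothetical, and the dataset size assumes that $G = 100$ noisy copies are drawn at each of the three $\mathrm{SNR}_{\mathrm{TRAIN}}$ levels.

```python
def cnn_mimo_param_count(C=3, w=2, h=2, N_cv=256, N_fc1=2048, N_fc2=2048,
                         dropout_percent=50):
    """Evaluate C^2 * (2*N_cv*(w*h + 1) + ([N_fc1 + 1] + [N_fc2 + 1]) * 50/100)."""
    conv_term = 2 * N_cv * (w * h + 1)
    fc_term = ((N_fc1 + 1) + (N_fc2 + 1)) * dropout_percent // 100
    return C ** 2 * (conv_term + fc_term)

params = cnn_mimo_param_count()

# Training set size: N scenarios, K users, G noisy copies per SNR level.
N, G, K = 500, 100, 3
snr_train_db = (15, 20, 25)
T = N * G * K * len(snr_train_db)

# 80% / 20% split into training and validation sets.
n_train = T * 80 // 100
n_val = T - n_train

print(params, T, n_train, n_val)
```

With the values stated in the text this reproduces the 41481 parameters and the 450000 samples, of which 360000 go to training and 90000 to validation.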
In addition, the propagation environment is modeled with $L = 10$ paths from the users, and all the user directions, i.e., all the azimuth and elevation angles, are selected uniformly at random from the intervals $\phi \in [-30^{\circ}, 30^{\circ}]$ and $\theta \in [-20^{\circ}, 20^{\circ}]$, respectively [6]. We use a sectorized angular range by setting the antenna gains $g_{\mathrm{R}}(\Theta_{\mathrm{R}}^{(l,k)})$ and $g_{\mathrm{T}}(\Theta_{\mathrm{T}}^{(l,k)})$ to unity within these angular ranges and to zero otherwise, which increases the beamforming gain and reduces interference [5]. Hence, the training data includes a large number of scenarios where the users are randomly located. For each scenario, the corresponding precoder and combiners are obtained by Algorithm 1.

The training stage takes about 5 hours for $T = 450000$ samples. This process includes both the labeling and the input data generation. Note that the training stage is performed only once. Then, in the prediction stage, it takes only milliseconds to estimate the hybrid precoders, as demonstrated in the simulations (please see Table I). Hence, the proposed approach is attractive for 5G systems, where both high data rate and low latency are required. The trained network can work for different parameters such as the number of users$^{1}$ $K$, the number of paths $L$, $\mathrm{SNR}_{\mathrm{TEST}}$ and $\mathrm{SNR}_{\mathrm{TRAIN}}$, which motivates the practical implementation of the proposed DL framework. The proposed CNN structure must be retrained if there is a change in parameters like $N_{\mathrm{T}}$, $N_{\mathrm{R}}$, $N_{\mathrm{RF}}^{\mathrm{T}}$, which directly dictate the input and output dimensions of the deep network. The performance of the network also depends on the angular interval selected in D when designing the feasible sets $\mathcal{F}$ and $\mathcal{W}$, as well as on the antenna gains that define the sectorized angular intervals.

D. Prediction

Once the CNN-MIMO is trained offline as demonstrated in Fig. 2, it can be used for the prediction of the hybrid beamformers. In order to generate the test data in the prediction stage, we have picked users randomly from the validation data, and synthetic noise is also added to the test data with $\mathrm{SNR}_{\mathrm{TEST}}$ to eliminate the similarity between the test and training datasets. The corrupted channel data of each user is fed to the network and the analog precoders are predicted from the output layer of the network. Then, their phases are quantized in $[0, 2\pi]$ with $2^{B}$ discrete points. Specifically, the values of the quantized phases belong to the set $\left\{\frac{2\pi b}{2^{B}}\right\}_{b=1}^{2^{B}}$ to allow the realization of the analog precoder and combiners in a hardware-efficient manner.

VI. NUMERICAL SIMULATIONS

In this section, we present the performance of the proposed method, CNN-MIMO, via several experiments where we train the network with the parameters described in Section V-C, such as $N = 500$, $K = 3$, $G = 100$, $\mathrm{SNR}_{\mathrm{TRAIN}} = \{15, 20, 25\}$ dB, learning rate 0.005, batch size 500 and 100 epochs. We compare the performance of CNN-MIMO with state-of-the-art hybrid precoding techniques such as manifold optimization (MO) [45], low-resolution hybrid beamforming (LRHB) [8], SOMP [6] and the two-stage hybrid beamforming (TS-HB) algorithm [7]. While manifold optimization and SOMP were proposed for a single-user scenario, we adapt the algorithms for the multi-user case by using the same strategy for interference cancellation as in [7]. CNN-MIMO is also compared with the DL-based approach MLP proposed in [27]. MLP is designed as described in [27] but adapted for the multi-user scenario with the same training data used for CNN-MIMO. As another benchmark, denoted as "No interference" in the simulations, we present the performance of fully-digital beamforming and combining where the interference is completely eliminated. In addition, the performance of the precoders used in the test data (obtained from Algorithm 1) is indicated as "Algorithm 1" in the experiments. In Fig.
3, we present the achievable sum-rate performance of the algorithms with respect to different SNR levels. The design parameters of CNN-MIMO are given in Section IV-B. Moreover, we select the number of antennas per BS and per user as $N_{\mathrm{T}} = 36$ and $N_{\mathrm{R}} = 9$, respectively.

$^{1}$When the network is trained for $K_{\mathrm{TRAIN}}$ users, the output of the network is $\mathbf{z} \in \mathbb{R}^{N_{\mathrm{T}}K_{\mathrm{TRAIN}} + N_{\mathrm{R}}K_{\mathrm{TRAIN}}}$. Then we can use the trained network for hybrid beamforming when there are $K \le K_{\mathrm{TRAIN}}$ users by using the part of the network output of size $(N_{\mathrm{T}}K + N_{\mathrm{R}}K) \times 1$ corresponding to those $K$ users.

Fig. 3. Sum-rate versus SNR ($N_{\mathrm{T}} = 36$, $N_{\mathrm{R}} = 9$, $K = 3$, $B = 3$ and $\mathrm{SNR}_{\mathrm{TEST}} = 20$ dB). [Plot: sum-rate in bits/s/Hz versus SNR in dB; curves: No Interference, Manifold Optimization, Algorithm 1, CNN-MIMO, LRHB, MLP, TS-HB, SOMP.]

Synthetic noise is added to both the channel matrices and the array responses with $\mathrm{SNR}_{\mathrm{TEST}} = 20$ dB, and $B = 3$ quantization bits are used. The number of users is $K = 3$ and there are $L = 10$ paths for each user. As benchmarks, we use fully digital beamforming and the MO algorithm, which has the best performance among the hybrid designs since it obtains near-optimum analog and baseband precoders. Our CNN approach follows the performance of the MO algorithm. In fact, CNN-MIMO provides the highest sum-rate as compared to the other algorithms.
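As an illustration of the $B$-bit phase quantization applied to the predicted analog precoders (Section V-D, with $B = 3$ in this experiment), the sketch below maps a phase to a point of the set $\{2\pi b / 2^{B}\}_{b=1}^{2^{B}}$. The paper does not state the rounding rule, so nearest-point rounding on the circle is an assumption here, and `quantize_phase` is a hypothetical name.

```python
import math

def quantize_phase(phase, B):
    """Map a phase in radians to the nearest point of {2*pi*b / 2**B : b = 1..2**B}.

    Nearest-point (circular) rounding is an assumption, not taken from the paper.
    """
    levels = 2 ** B
    step = 2 * math.pi / levels
    # Wrap the phase into [0, 2*pi) and round to the closest grid index.
    b = round((phase % (2 * math.pi)) / step) % levels
    if b == 0:
        b = levels  # the set starts at b = 1, so a zero phase is stored as 2*pi
    return b * step

# Quantize a few example phases with B = 3 bits (8 levels, spacing pi/4).
examples = [0.0, 0.8, -0.1, math.pi]
quantized = [quantize_phase(p, B=3) for p in examples]
```

Because the grid starts at $b = 1$, a phase close to zero is represented by $2\pi$, which is the same point on the unit circle.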
Notably, although LRHB is the state-of-the-art technique based on phase extraction and is regarded as the technique with the best performance in the literature [8], we observe that CNN-MIMO outperforms it. MLP has poorer performance due to the lack of the feature extraction that is achieved by the convolutional layers in CNN-MIMO. In particular, the effectiveness of CNN-MIMO can be attributed to the maximization of the sum-rate by visiting all possible combinations for the analog parts at both the receiver and transmitter side through an exhaustive search, together with a well-trained deep network. The ultimate performance of CNN-MIMO is obtained if CNN-MIMO yields an output exactly the same as the labels obtained in Algorithm 1. Hence, the performance of CNN-MIMO is limited by the performance of Algorithm 1. We observe that the performance of CNN-MIMO is close to that of Algorithm 1, where the gap between the two is due to the corruption in the input data. SOMP and TS-HB have poorer performance as compared to CNN-MIMO. In particular, while SOMP was initially proposed for the single-user case, we have adapted the algorithm for the multi-user scenario, where the analog precoders are designed based on the similarity between the optimum precoder and the analog precoders. As a result, SOMP does not always find the optimum weights maximizing the sum-rate [31]. The TS-HB algorithm has better performance than SOMP since it is based on the maximization of the sum-rate, and its performance converges to that of SOMP when there is a single path from each user.
Fig. 4. Performance comparison for corrupted channel data. In (a), sum-rate versus $\mathrm{SNR}_{\mathrm{TEST}}$ is given, whereas the RMSE for precoder $\mathbf{F}_{\mathrm{RF}}$ and combiner $\mathbf{W}_{\mathrm{RF}}$ are shown in (b) and (c), respectively ($N_{\mathrm{T}} = 36$, $N_{\mathrm{R}} = 9$, $K = 3$, $B = 3$ and SNR $= 0$ dB). [Plots: (a) sum-rate in bits/s/Hz, (b) and (c) RMSE, each versus $\mathrm{SNR}_{\mathrm{TEST}}$ in dB; curves: No Interference, Manifold Optimization, Algorithm 1, CNN-MIMO, LRHB, MLP, TS-HB, SOMP.]

Fig. 5. Sum-rate versus angular resolution of the analog precoders ($N_{\mathrm{T}} = 36$, $N_{\mathrm{R}} = 9$, SNR $= 0$ dB, $\mathrm{SNR}_{\mathrm{TEST}} = 20$ dB). [Plot: sum-rate in bits/s/Hz versus number of quantization bits; same curves as in Fig. 4(a).]

The feedback data, namely, the channel matrix $\{\mathbf{H}_k\}_{k=1}^{K}$ and the feasible array response sets $\mathcal{F}$ and $\mathcal{W}$, may not always be perfectly available. In order to evaluate the robustness of the algorithms against corrupted feedback, we simulate the performance of the algorithms for different $\mathrm{SNR}_{\mathrm{TEST}}$ levels for the same setting as in the previous simulation. In this case, complex AWGN was added to both channel and array response data to resemble the deviations in the feedback data.
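The corruption step described above can be sketched as follows for the channel matrix. The noise variance is obtained by inverting the definition $\mathrm{SNR}_{\mathrm{TEST}} = 20\log_{10}(|[\mathbf{H}]_{i,j}|^{2}/\sigma^{2}_{\mathrm{TEST}})$ from Section V-C. Applying that definition separately to every entry is an assumption made here for illustration, and the helper names `noise_variance` and `add_awgn` are hypothetical.

```python
import math
import random

def noise_variance(h, snr_db):
    """sigma^2 obtained by inverting SNR = 20*log10(|h|^2 / sigma^2)."""
    return abs(h) ** 2 / 10 ** (snr_db / 20)

def add_awgn(H, snr_db, rng):
    """Return a noisy copy of H with [H_noisy]_ij ~ CN([H]_ij, sigma_ij^2)."""
    noisy = []
    for row in H:
        noisy_row = []
        for h in row:
            # A circularly symmetric complex Gaussian with variance sigma^2
            # has variance sigma^2 / 2 in each of its real and imaginary parts.
            std = math.sqrt(noise_variance(h, snr_db) / 2)
            noisy_row.append(h + complex(rng.gauss(0.0, std), rng.gauss(0.0, std)))
        noisy.append(noisy_row)
    return noisy

rng = random.Random(0)  # fixed seed so the sketch is repeatable
H = [[1 + 1j, 0.5 - 2j], [-1j, 2 + 0j]]  # toy 2 x 2 channel matrix
H_noisy = add_awgn(H, snr_db=20, rng=rng)
```

The same routine could be applied to the array response data; only the channel matrix is shown to keep the sketch short.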
The results are presented in Fig. 4, where the achievable sum-rate is shown in Fig. 4(a) while the RMS errors on precoder $\mathbf{F}_{\mathrm{RF}}$ and combiner $\mathbf{W}_{\mathrm{RF}}$ are shown in Figs. 4(b) and 4(c), respectively. Note that Algorithm 1 is fed with perfect CSI to demonstrate the best achievable performance. As can be seen from Fig. 4, CNN-MIMO is more robust against the corruption in the channel data as compared to the other methods. Note that manifold optimization, LRHB, MLP, and CNN-MIMO are only affected by the corruption in the channel data since they automatically estimate the analog precoders, unlike SOMP and TS-HB, which require the feasible sets $\mathcal{F}$ and $\mathcal{W}$ as input. As a result, the performance of TS-HB and SOMP relies heavily on the accuracy of both the channel matrix and the array response sets. Moreover, for CNN-MIMO, the knowledge of the channel data and the feasible sets $\mathcal{F}$ and $\mathcal{W}$ is only needed in the training stage of the network to obtain the labels, and it is not used in the prediction stage. In contrast, algorithms like SOMP and TS-HB require this information to solve the hybrid precoding problem. Overall, these results show the robustness of the proposed CNN-MIMO.

The analog precoders are designed with discrete phase shifters with constant modulus to steer the beam in spatial precoding. To assess the effect of the phase resolution in the phase shifters, we present the sum-rate of the algorithms for different quantization resolutions, where the phases of the analog precoder and combiners are quantized with $B = \{1, \ldots, 8\}$ bits. The results are depicted in Fig. 5, where we observe that the other algorithms converge after 4 bits while, remarkably, the proposed CNN approach achieves a higher sum-rate starting from one-bit quantization.

In Fig. 6(a), the performance is evaluated for a varying number of users, namely, $K \in \{2, \ldots, 8\}$, where $L = 10$ is fixed. Notably, CNN-MIMO performs better than the other algorithms.
In particular, the gap between "No interference" and CNN-MIMO becomes larger as $K$ increases. We observe that the performance of MLP becomes better than that of LRHB for $K \ge 5$ and that MLP exhibits robust performance like CNN-MIMO with a certain performance loss. The main reason is the use of training data prepared with Algorithm 1, which provides more accurate beamformers than the other algorithms. We also see that CNN-MIMO closely follows the performance of Algorithm 1. However, the gap to the "No interference" benchmark appears due to the insufficient performance of interference cancellation. Hence, it is suggested to develop more effective algorithms to handle the interference among the users. In Fig. 6(b), we evaluate the performance of CNN-MIMO when the number of paths for each user is not fixed. Hence, we train the network with the same parameters except that $L$ is selected uniformly at random from the interval $[1, 10]$. Using varying $L$ values for different users reduces the similarity between the channel data of users, and we obtain satisfactory performance of CNN-MIMO similar to the observations made when $L$ is fixed.

Fig. 6. Sum-rate versus number of users. The number of paths is fixed as $L = 10$ in (a), and $L$ is selected uniformly at random in the interval $[1, 10]$ in (b) ($N_{\mathrm{T}} = 100$, $N_{\mathrm{R}} = 9$, SNR $= 0$ dB and $\mathrm{SNR}_{\mathrm{TEST}} = 20$ dB). [Plots: (a) sum-rate and (b) spectral efficiency in bits/s/Hz versus number of users; curves: No Interference, Manifold Optimization, Algorithm 1, CNN-MIMO, LRHB, MLP, TS-HB, SOMP.]

In Fig. 7, we illustrate the performance for a varying number of BS antennas. As can be seen, similar observations can be made. Specifically, CNN-MIMO performs better than the other algorithms. Furthermore, we present the computation times of the algorithms for different numbers of BS antennas in Table I in seconds.
Fig. 7. Sum-rate versus number of BS antennas ($K = 3$, $N_{\mathrm{R}} = 9$, SNR $= 0$ dB and $\mathrm{SNR}_{\mathrm{TEST}} = 20$ dB). [Plot: sum-rate in bits/s/Hz versus number of BS antennas; curves: No Interference, Manifold Optimization, Algorithm 1, CNN-MIMO, LRHB, MLP, TS-HB, SOMP.]

TABLE I
COMPUTATION TIMES (IN SECONDS).
N_T    Algorithm 1   CNN-MIMO   MLP      LRHB     TS-HB    SOMP
4      0.1061        0.0039     0.0034   0.0059   0.0093   0.0122
16     0.1164        0.0043     0.0038   0.0113   0.0103   0.0139
64     0.1175        0.0049     0.0045   0.0159   0.0108   0.0216
100    0.1242        0.0052     0.0049   0.0318   0.0125   0.0282

While the complexity of Algorithm 1 is the highest due to the exhaustive search, the DL-based approaches, i.e., CNN-MIMO and MLP, have the least computation time as compared to LRHB and the rest. MLP has slightly lower complexity than CNN-MIMO due to its less complex structure; however, it has poorer performance, as was shown in the previous experiments. The complexity of TS-HB and SOMP depends on the number of elements in the feasible sets $\mathcal{F}$ and $\mathcal{W}$. It is observed that TS-HB has less computation time than SOMP since it does not run an OMP stage to obtain the precoders but selects the ones with the highest channel gain from the codebook [7]. It is also worthwhile to mention the trade-off between the computation time and the performance of CNN-MIMO. While the MO algorithm has slightly better performance than CNN-MIMO, the proposed DL framework provides a significantly faster computation of the hybrid beamformers than the MO algorithm.
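The entries of Table I can also be read as ratios relative to CNN-MIMO. The short sketch below only recomputes such ratios from the tabulated times; the dictionary layout and the helper name `speedup` are hypothetical, and no timing data beyond Table I is used (in particular, MO is not tabulated).

```python
# Computation times in seconds, copied from Table I (keyed by N_T).
times = {
    4:   {"Algorithm 1": 0.1061, "CNN-MIMO": 0.0039, "MLP": 0.0034,
          "LRHB": 0.0059, "TS-HB": 0.0093, "SOMP": 0.0122},
    16:  {"Algorithm 1": 0.1164, "CNN-MIMO": 0.0043, "MLP": 0.0038,
          "LRHB": 0.0113, "TS-HB": 0.0103, "SOMP": 0.0139},
    64:  {"Algorithm 1": 0.1175, "CNN-MIMO": 0.0049, "MLP": 0.0045,
          "LRHB": 0.0159, "TS-HB": 0.0108, "SOMP": 0.0216},
    100: {"Algorithm 1": 0.1242, "CNN-MIMO": 0.0052, "MLP": 0.0049,
          "LRHB": 0.0318, "TS-HB": 0.0125, "SOMP": 0.0282},
}

def speedup(n_t, method, reference="CNN-MIMO"):
    """Ratio of the run time of `method` to that of `reference` for N_T = n_t."""
    return times[n_t][method] / times[n_t][reference]

for n_t in sorted(times):
    ratios = {m: round(speedup(n_t, m), 2) for m in times[n_t] if m != "CNN-MIMO"}
    print(n_t, ratios)
```

For $N_{\mathrm{T}} = 100$, for instance, the exhaustive search of Algorithm 1 takes roughly 24 times as long as a CNN-MIMO prediction, while MLP remains slightly faster than CNN-MIMO at every tabulated $N_{\mathrm{T}}$.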
The complexity of MO also increases at a higher rate than that of CNN-MIMO. This observation demonstrates that CNN-MIMO is more useful in terms of computational complexity even for a very large number of antennas, which is the case in 5G systems. Hence, we believe the proposed approach can be a promising technique for mmWave systems where low complexity and robust performance are required. The run times of CNN-MIMO can be further reduced by implementing the network in general-purpose hardware such as FPGA. For example, domain-specific architectures have been implemented in [46] for AlexNet [43] and VGG-16 for real-time image classification with 194 GOP/s (billions of fixed-point operations per second) while consuming only 300 mW. These promising results encourage us to develop more energy-efficient DL approaches for the problems in communications systems.

VII. CONCLUSIONS

We proposed a DL framework for hybrid precoding design in multi-user mmWave MIMO systems. The proposed network architecture is a CNN which accepts as input the channel matrix of users and gives at the output the analog precoder and combiners. The proposed technique was compared with both optimization- and greedy-based approaches as well as DL-based techniques such as MLP. The effectiveness of the proposed CNN approach was evaluated through several experiments, and it is shown that CNN-MIMO achieves better performance than the state-of-the-art hybrid precoding approaches as well as less computation time. The effectiveness of CNN-MIMO can be attributed to the use of exhaustive search to obtain the best analog precoders and combiners in the training stage. In order to train the network, a large training dataset of nearly half a million samples was used. Notably, large training data provides robust performance against the deviations in the channel data.
Moreover, we showed that CNN-MIMO achieves more robust results in the presence of imperfections regarding the channel matrix and array responses.

REFERENCES
[1] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, “An Overview of Signal Processing Techniques for Millimeter Wave MIMO Systems,” IEEE J. Sel. Topics Signal Process., vol. 10, pp. 436–453, April 2016.
[2] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, “What Will 5G Be?,” IEEE J. Sel. Areas Commun., vol. 32, pp. 1065–1082, June 2014.
[3] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling Up MIMO: Opportunities and Challenges with Very Large Arrays,” IEEE Signal Process. Mag., vol. 30, pp. 40–60, Jan 2013.
[4] L. Wei, R. Q. Hu, Y. Qian, and G. Wu, “Key elements to enable millimeter wave communications for 5G wireless systems,” IEEE Wireless Communications, vol. 21, pp. 136–143, December 2014.
[5] A. Alkhateeb, O. E. Ayach, G. Leus, and R. W. Heath, “Hybrid precoding for millimeter wave cellular systems with partial channel knowledge,” in 2013 Information Theory and Applications Workshop (ITA), pp. 1–5, Feb 2013.
[6] O. E. Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, “Spatially Sparse Precoding in Millimeter Wave MIMO Systems,” IEEE Trans. Wireless Commun., vol. 13, pp. 1499–1513, March 2014.
[7] A. Alkhateeb, G. Leus, and R. W. Heath, “Limited feedback hybrid precoding for Multi-User millimeter wave systems,” IEEE Trans. Wireless Commun., vol. 14, pp. 6481–6494, Nov. 2015.
[8] Z. Wang, M. Li, Q. Liu, and A. L. Swindlehurst, “Hybrid Precoder and Combiner Design With Low-Resolution Phase Shifters in mmWave MIMO Systems,” IEEE J. Sel. Topics Signal Process., vol. 12, pp. 256–269, May 2018.
[9] M. Kokshoorn, H. Chen, Y. Li, and B. Vucetic, “Beam-On-Graph: Simultaneous Channel Estimation for mmWave MIMO Systems With Multiple Users,” IEEE Trans. Commun., vol. 66, pp.
2931–2946, July 2018.
[10] X. Zhai, Y. Cai, Q. Shi, M. Zhao, G. Y. Li, and B. Champagne, “Joint Transceiver Design With Antenna Selection for Large-Scale MU-MIMO mmWave Systems,” IEEE J. Sel. Areas Commun., vol. 35, pp. 2085–2096, Sep. 2017.
[11] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[12] D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory dsp],” IEEE Signal Process. Mag., vol. 28, pp. 145–154, Jan 2011.
[13] A. M. Elbir, K. V. Mishra, and Y. C. Eldar, “Cognitive radar antenna selection via deep learning,” IET Radar, Sonar & Navigation, vol. 13, pp. 871–880(9), June 2019.
[14] Z. Jiang, S. Chen, A. F. Molisch, R. Vannithamby, S. Zhou, and Z. Niu, “Exploiting Wireless Channel State Information Structures Beyond Linear Correlations: A Deep Learning Approach,” IEEE Commun. Mag., vol. 57, pp. 28–34, March 2019.
[15] M. Feng and S. Mao, “Dealing with Limited Backhaul Capacity in Millimeter-Wave Systems: A Deep Reinforcement Learning Approach,” IEEE Commun. Mag., vol. 57, pp. 50–55, March 2019.
[16] H. Ye, G. Y. Li, and B. Juang, “Power of Deep Learning for Channel Estimation and Signal Detection in OFDM Systems,” IEEE Wireless Communications Letters, vol. 7, pp. 114–117, Feb 2018.
[17] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, “Deep Learning for Super-Resolution Channel Estimation and DOA Estimation Based Massive MIMO System,” IEEE Trans. Veh. Technol., vol. 67, pp. 8549–8560, Sep. 2018.
[18] Y. Long, Z. Chen, J. Fang, and C. Tellambura, “Data-Driven-Based Analog Beam Selection for Hybrid Beamforming Under mm-Wave Channels,” IEEE J. Sel. Topics Signal Process., vol. 12, pp. 340–352, May 2018.
[19] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, July 2017.
[20] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, pp. 257–265, June 2018.
[21] S. Dörner, S. Cammerer, J. Hoydis, and S. t. Brink, “Deep Learning Based Communication Over the Air,” IEEE J. Sel. Topics Signal Process., vol. 12, pp. 132–143, Feb 2018.
[22] V. Raj and S. Kalyani, “Backpropagating Through the Air: Deep Learning at Physical Layer Without Channel Models,” IEEE Commun. Lett., vol. 22, pp. 2278–2281, Nov 2018.
[23] C. Wen, W. Shih, and S. Jin, “Deep Learning for Massive MIMO CSI Feedback,” IEEE Wireless Communications Letters, vol. 7, pp. 748–751, Oct 2018.
[24] V. Raj and S. Kalyani, “Backpropagating through the air: Deep learning at physical layer without channel models,” IEEE Commun. Lett., vol. 22, pp. 2278–2281, Nov. 2018.
[25] P. Dong, H. Zhang, G. Y. Li, N. Naderializadeh, and I. Gaspar, “Deep CNN based channel estimation for mmWave massive MIMO systems,” ArXiv, vol. abs/1904.06761, 2019.
[26] H. He, C. Wen, S. Jin, and G. Y. Li, “Deep Learning-Based Channel Estimation for Beamspace mmWave Massive MIMO Systems,” IEEE Wireless Communications Letters, vol. 7, pp. 852–855, Oct 2018.
[27] H. Huang, Y. Song, J. Yang, G. Gui, and F. Adachi, “Deep-Learning-based Millimeter-Wave Massive MIMO for Hybrid Precoding,” IEEE Trans. Veh. Technol., pp. 1–1, 2019.
[28] T. Lin and Y. Zhu, “Beamforming Design for Large-Scale Antenna Arrays Using Deep Learning,” arXiv e-prints, p. arXiv:1904.03657, Apr 2019.
[29] A. Alkhateeb, S. P. Alex, P. Varkey, Y. Li, Q. Z. Qu, and D. Tujkovic, “Deep Learning Coordinated Beamforming for Highly-Mobile Millimeter Wave Systems,” IEEE Access, vol. 6, pp. 37328–37348, 2018.
[30] A. M. Elbir, “CNN-based precoder and combiner design in mmWave MIMO systems,” IEEE Commun. Lett., vol. 23, no. 7, pp. 1240–1243, 2019.
[31] A. M. Elbir and K. V. Mishra, “Joint Antenna Selection and Hybrid Beamformer Design using Unquantized and Quantized Deep Learning Networks,” arXiv e-prints, p. arXiv:1905.03107, May 2019.
[32] X. Yu, J. Shen, J. Zhang, and K. B. Letaief, “Alternating minimization algorithms for hybrid precoding in millimeter wave MIMO systems,” IEEE J. Sel. Top. Signal Process., vol. 10, pp. 485–500, Apr. 2016.
[33] E. Torkildson, C. Sheldon, U. Madhow, and M. Rodwell, “Millimeter-Wave Spatial Multiplexing in an Indoor Environment,” in 2009 IEEE Globecom Workshops, pp. 1–6, Nov 2009.
[34] R. Méndez-Rial, C. Rusu, A. Alkhateeb, N. González-Prelcic, and R. W. Heath, “Channel estimation and hybrid combining for mmWave: Phase shifters or switches?,” in 2015 Information Theory and Applications Workshop (ITA), pp. 90–97, Feb 2015.
[35] V. Raghavan and A. M. Sayeed, “Multi-antenna capacity of sparse multipath channels,” IEEE Trans. Inform. Theory, 2006.
[36] T. S. Rappaport, F. Gutierrez, E. Ben-Dor, J. N. Murdock, Y. Qiao, and J. I. Tamir, “Broadband Millimeter-Wave Propagation Measurements and Models Using Adaptive-Beam Antennas for Outdoor Urban Cellular Communications,” IEEE Trans. Antennas Propag., vol. 61, pp. 1850–1859, April 2013.
[37] I. A. Hemadeh, K. Satyanarayana, M. El-Hajjar, and L. Hanzo, “Millimeter-Wave Communications: Physical Channel Models, Design Considerations, Antenna Constructions, and Link-Budget,” IEEE Commun. Surveys Tuts., vol. 20, pp. 870–913, Secondquarter 2018.
[38] H. Huang, J. Yang, H. Huang, Y. Song, and G.
Gui, “Deep learning for super-resolution channel estimation and DOA estimation based massive MIMO system,” IEEE Trans. Veh. Technol., vol. 67, pp. 8549–8560, Sept 2018.
[39] Z. Marzi, D. Ramasamy, and U. Madhow, “Compressive Channel Estimation and Tracking for Large Arrays in mm-Wave Picocells,” IEEE J. Sel. Topics Signal Process., vol. 10, pp. 514–527, April 2016.
[40] J. Wang, Z. Lan, C. woo Pyo, T. Baykas, C. sean Sum, M. A. Rahman, J. Gao, R. Funada, F. Kojima, H. Harada, and S. Kato, “Beam codebook based beamforming protocol for multi-Gbps millimeter-wave WPAN systems,” IEEE J. Sel. Areas Commun., vol. 27, pp. 1390–1399, October 2009.
[41] E. Björnson, L. Van der Perre, S. Buzzi, and E. G. Larsson, “Massive MIMO in Sub-6 GHz and mmWave: Physical, Practical, and Use-Case Differences,” arXiv e-prints, p. arXiv:1803.11023, Mar 2018.
[42] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, “Quantized CNN: A unified approach to accelerate and compress convolutional networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4730–4743, 2018.
[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[44] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, New York, 2006.
[45] X. Yu, J. Shen, J. Zhang, and K. B. Letaief, “Alternating Minimization Algorithms for Hybrid Precoding in Millimeter Wave MIMO Systems,” IEEE J. Sel. Topics Signal Process., vol. 10, pp. 485–500, April 2016.
[46] B. Sun, L. Yang, P. Dong, W. Zhang, J. Dong, and C. Young, “Ultra Power-Efficient CNN Domain Specific Accelerator with 9.3TOPS/Watt for Mobile and Embedded Applications,” arXiv e-prints, p. arXiv:1805.00361, Apr 2018.

Ahmet M. Elbir received the B.S. degree with Honors from Firat University in 2009 and the Ph.D. degree from Middle East Technical University (METU) in 2016, both in electrical engineering.
He is the recipient of the 2016 METU best Ph.D. thesis award for his doctoral studies. He has served as an Associate Editor for IEEE Access since 2018. Currently, he continues his studies at the Dept. of Electrical and Electronics Engineering, Duzce University, Turkey. His research interests include array signal processing, sparsity-driven convex optimization, signal processing for communications and deep learning for array signal processing.

Anastasios Papazafeiropoulos [S’06-M’10-SM’19] received the B.Sc. degree (Hons.) in physics, the M.Sc. degree (Hons.) in electronics and computers science, and the Ph.D. degree from the University of Patras, Greece, in 2003, 2005, and 2010, respectively. From 2011 to 2012 and from 2016 to 2017, he was with the Institute for Digital Communications at The University of Edinburgh, U.K., as a Post-Doctoral Research Fellow. From 2012 to 2014, he was a Research Fellow with Imperial College London, U.K., awarded with a Marie Curie fellowship (IEF-IAWICOM). He is currently a Vice-Chancellor Fellow at the University of Hertfordshire, U.K. He is also a Visiting Research Fellow at SnT, University of Luxembourg, Luxembourg. He has been involved in several EPSRC and EU FP7 projects such as HIATUS and HARP. His research interests span machine learning for wireless communications, massive MIMO, heterogeneous networks, 5G wireless networks, full-duplex radio, mm-wave communications, random matrix theory, hardware-constrained communications, and performance analysis of fading channels.