Received: from mnm [127.0.0.1]
	by localhost with POP3 (fetchmail-5.9.0)
	for akpm@localhost (single-drop); Fri, 27 Jun 2003 23:17:43 -0700 (PDT)
Received: by mangalore (mbox akpm)
 (with Cubic Circle's cucipop (v1.31 1998/05/13) Sat Jun 28 16:17:30 2003)
X-From_: haveblue@us.ibm.com  Sat Jun 28 16:14:22 2003
Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.104])
	by mangalore.zipworld.com.au (8.12.3/8.12.3/Debian-6.4) with ESMTP id h5S6EJNN005239
	for <akpm@zip.com.au>; Sat, 28 Jun 2003 16:14:20 +1000
Received: from northrelay04.pok.ibm.com (northrelay04.pok.ibm.com [9.56.224.206])
	by e4.ny.us.ibm.com (8.12.9/8.12.2) with ESMTP id h5S6EGi8154664;
	Sat, 28 Jun 2003 02:14:16 -0400
Received: from nighthawk.sr71.net (d01av02.pok.ibm.com [9.56.224.216])
	by northrelay04.pok.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h5S6EEQV030176;
	Sat, 28 Jun 2003 02:14:14 -0400
Subject: Re: 2.5.73-mm2
From: Dave Hansen <haveblue@us.ibm.com>
To: akpm@zip.com.au
Cc: "Martin J. Bligh" <mbligh@aracnet.com>
In-Reply-To: <20030627202130.066c183b.akpm@digeo.com>
References: <20030627202130.066c183b.akpm@digeo.com>
Content-Type: multipart/mixed; boundary="=-0X7S2SAZNKi17LAVnQ0Z"
Organization: 
Message-Id: <1056780851.19849.235.camel@nighthawk>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.4 
Date: 27 Jun 2003 23:14:11 -0700
X-Spam-Status: No, hits=-20.0 required=6.0
	tests=BAYES_01,IN_REP_TO,REFERENCES,USER_AGENT_XIMIAN
	autolearn=ham version=2.53
X-Spam-Level: 
X-Spam-Checker-Version: SpamAssassin 2.53 (1.174.2.15-2003-03-30-exp)


--=-0X7S2SAZNKi17LAVnQ0Z
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

Damn.  Something in the digeo firewall does *not* like it when I send
patches to you as attachments.

> +numa-memory-reporting-fix.patch
> 
>  Fix NUMA memory reporting (needs more work)

I sent Martin a newer version of this patch, which is a lot more
comprehensive.  You'll need to back out the previous one to apply this
one.  I fear that the current one has broken NUMA architectures other
than i386.


The current NUMA meminfo code exports pgdat->node_size (via sysfs) as
totalram.  This variable is consistently used elsewhere to mean "the
number of physical pages that this particular node spans".  That is
_not_ what we want meminfo to report, which is: "how much actual
memory does this node have?"

The following patch removes pgdat->node_size and replaces it with
->node_spanned_pages, so that it cannot be confused with a new
variable, node_present_pages, which is the _actual_ value that we want
to export in meminfo.  Most of the patch is a simple
s/node_size/node_spanned_pages/.  The node_size() macro is also removed
and replaced with new node_{spanned,present}_pages() macros.
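
For illustration only (this is not part of the patch): a minimal
userspace sketch of the accounting that calculate_zone_totalpages()
does after this change.  The names struct node_totals,
sketch_node_totals() and SKETCH_MAX_NR_ZONES are made up for the
example, and the zone count of 3 is an assumption; only the two field
names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NR_ZONES 3	/* assumed zone count, for this sketch only */

/* Hypothetical stand-in for the two new pg_data_t fields. */
struct node_totals {
	unsigned long node_spanned_pages;	/* page range, including holes */
	unsigned long node_present_pages;	/* pages actually backed by memory */
};

/*
 * Spanned is the sum of the zone sizes; present is spanned minus the
 * holes.  zholes_size may be NULL, meaning the node has no holes.
 */
static struct node_totals
sketch_node_totals(const unsigned long *zones_size,
		   const unsigned long *zholes_size)
{
	struct node_totals t = { 0, 0 };
	size_t i;

	for (i = 0; i < SKETCH_MAX_NR_ZONES; i++)
		t.node_spanned_pages += zones_size[i];

	t.node_present_pages = t.node_spanned_pages;
	if (zholes_size)
		for (i = 0; i < SKETCH_MAX_NR_ZONES; i++) {
			/* a hole cannot be larger than its zone */
			assert(zholes_size[i] <= zones_size[i]);
			t.node_present_pages -= zholes_size[i];
		}

	return t;
}
```

With made-up zone sizes that sum to 262144 pages and a single
8192-page hole, the sketch gives spanned = 262144 and present = 253952.
The second number is the one meminfo should report as totalram.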

We were bitten by this problem in this bug:
http://bugme.osdl.org/show_bug.cgi?id=818

Compiled and tested on NUMA-Q.  
-- 
Dave Hansen
haveblue@us.ibm.com

--=-0X7S2SAZNKi17LAVnQ0Z
Content-Disposition: attachment; filename=node_spanned_pages-2.5.73-mm1-0.patch
Content-Type: text/x-patch; NAME=node_spanned_pages-2.5.73-mm1-0.patch; CHARSET=ANSI_X3.4-1968
Content-Transfer-Encoding: 7bit

diff -pur linux-2.5.73-mm1/arch/alpha/mm/numa.c linux-2.5.73-mm1-work/arch/alpha/mm/numa.c
--- linux-2.5.73-mm1/arch/alpha/mm/numa.c	Thu Jun 26 16:04:18 2003
+++ linux-2.5.73-mm1-work/arch/alpha/mm/numa.c	Thu Jun 26 16:38:11 2003
@@ -338,7 +338,7 @@ void __init mem_init(void)
 
 		lmem_map = node_mem_map(nid);
 		pfn = NODE_DATA(nid)->node_start_pfn;
-		for (i = 0; i < node_size(nid); i++, pfn++)
+		for (i = 0; i < node_spanned_pages(nid); i++, pfn++)
 			if (page_is_ram(pfn) && PageReserved(lmem_map+i))
 				reservedpages++;
 	}
@@ -372,7 +372,7 @@ show_mem(void)
 	printk("Free swap:       %6dkB\n",nr_swap_pages<<(PAGE_SHIFT-10));
 	for (nid = 0; nid < numnodes; nid++) {
 		struct page * lmem_map = node_mem_map(nid);
-		i = node_size(nid);
+		i = node_spanned_pages(nid);
 		while (i-- > 0) {
 			total++;
 			if (PageReserved(lmem_map+i))
diff -pur linux-2.5.73-mm1/arch/arm/mm/init.c linux-2.5.73-mm1-work/arch/arm/mm/init.c
--- linux-2.5.73-mm1/arch/arm/mm/init.c	Thu Jun 26 16:04:23 2003
+++ linux-2.5.73-mm1-work/arch/arm/mm/init.c	Thu Jun 26 17:13:47 2003
@@ -79,7 +79,7 @@ void show_mem(void)
 		struct page *page, *end;
 
 		page = NODE_MEM_MAP(node);
-		end  = page + NODE_DATA(node)->node_size;
+		end  = page + NODE_DATA(node)->node_spanned_pages;
 
 		do {
 			total++;
@@ -576,7 +576,7 @@ void __init mem_init(void)
 	for (node = 0; node < numnodes; node++) {
 		pg_data_t *pgdat = NODE_DATA(node);
 
-		if (pgdat->node_size != 0)
+		if (pgdat->node_spanned_pages != 0)
 			totalram_pages += free_all_bootmem_node(pgdat);
 	}
 
diff -pur linux-2.5.73-mm1/arch/arm26/mm/init.c linux-2.5.73-mm1-work/arch/arm26/mm/init.c
--- linux-2.5.73-mm1/arch/arm26/mm/init.c	Thu Jun 26 16:04:13 2003
+++ linux-2.5.73-mm1-work/arch/arm26/mm/init.c	Thu Jun 26 16:39:05 2003
@@ -68,7 +68,7 @@ void show_mem(void)
 
 
 	page = NODE_MEM_MAP(0);
-	end  = page + NODE_DATA(0)->node_size;
+	end  = page + NODE_DATA(0)->node_spanned_pages;
 
 	do {
 		total++;
@@ -353,7 +353,7 @@ void __init mem_init(void)
 	max_mapnr   = virt_to_page(high_memory) - mem_map;
 
 	/* this will put all unused low memory onto the freelists */
-	if (pgdat->node_size != 0)
+	if (pgdat->node_spanned_pages != 0)
 		totalram_pages += free_all_bootmem_node(pgdat);
 
 	printk(KERN_INFO "Memory:");
diff -pur linux-2.5.73-mm1/arch/i386/mm/pgtable.c linux-2.5.73-mm1-work/arch/i386/mm/pgtable.c
--- linux-2.5.73-mm1/arch/i386/mm/pgtable.c	Thu Jun 26 16:07:19 2003
+++ linux-2.5.73-mm1-work/arch/i386/mm/pgtable.c	Thu Jun 26 17:09:22 2003
@@ -35,7 +35,7 @@ void show_mem(void)
 	show_free_areas();
 	printk("Free swap:       %6dkB\n",nr_swap_pages<<(PAGE_SHIFT-10));
 	for_each_pgdat(pgdat) {
-		for (i = 0; i < pgdat->node_size; ++i) {
+		for (i = 0; i < pgdat->node_spanned_pages; ++i) {
 			page = pgdat->node_mem_map + i;
 			total++;
 			if (PageHighMem(page))
diff -pur linux-2.5.73-mm1/arch/ia64/mm/init.c linux-2.5.73-mm1-work/arch/ia64/mm/init.c
--- linux-2.5.73-mm1/arch/ia64/mm/init.c	Thu Jun 26 16:04:17 2003
+++ linux-2.5.73-mm1-work/arch/ia64/mm/init.c	Thu Jun 26 16:39:51 2003
@@ -232,7 +232,7 @@ show_mem(void)
 		printk("Free swap:       %6dkB\n", nr_swap_pages<<(PAGE_SHIFT-10));
 		for_each_pgdat(pgdat) {
 			printk("Node ID: %d\n", pgdat->node_id);
-			for(i = 0; i < pgdat->node_size; i++) {
+			for(i = 0; i < pgdat->node_spanned_pages; i++) {
 				if (PageReserved(pgdat->node_mem_map+i))
 					reserved++;
 				else if (PageSwapCache(pgdat->node_mem_map+i))
@@ -240,7 +240,7 @@ show_mem(void)
 				else if (page_count(pgdat->node_mem_map + i))
 					shared += page_count(pgdat->node_mem_map + i) - 1;
 			}
-			printk("\t%d pages of RAM\n", pgdat->node_size);
+			printk("\t%d pages of RAM\n", pgdat->node_spanned_pages);
 			printk("\t%d reserved pages\n", reserved);
 			printk("\t%d pages shared\n", shared);
 			printk("\t%d pages swap cached\n", cached);
diff -pur linux-2.5.73-mm1/arch/ppc64/mm/init.c linux-2.5.73-mm1-work/arch/ppc64/mm/init.c
--- linux-2.5.73-mm1/arch/ppc64/mm/init.c	Thu Jun 26 16:04:22 2003
+++ linux-2.5.73-mm1-work/arch/ppc64/mm/init.c	Thu Jun 26 16:40:10 2003
@@ -109,7 +109,7 @@ void show_mem(void)
 	show_free_areas();
 	printk("Free swap:       %6dkB\n",nr_swap_pages<<(PAGE_SHIFT-10));
 	for_each_pgdat(pgdat) {
-		for (i = 0; i < pgdat->node_size; i++) {
+		for (i = 0; i < pgdat->node_spanned_pages; i++) {
 			page = pgdat->node_mem_map + i;
 			total++;
 			if (PageReserved(page))
@@ -564,7 +564,7 @@ void __init mem_init(void)
 	int nid;
 
         for (nid = 0; nid < numnodes; nid++) {
-		if (node_data[nid].node_size != 0) {
+		if (node_data[nid].node_spanned_pages != 0) {
 			printk("freeing bootmem node %x\n", nid);
 			totalram_pages +=
 				free_all_bootmem_node(NODE_DATA(nid));
diff -pur linux-2.5.73-mm1/arch/ppc64/mm/numa.c linux-2.5.73-mm1-work/arch/ppc64/mm/numa.c
--- linux-2.5.73-mm1/arch/ppc64/mm/numa.c	Thu Jun 26 16:04:22 2003
+++ linux-2.5.73-mm1-work/arch/ppc64/mm/numa.c	Thu Jun 26 16:41:27 2003
@@ -160,21 +160,21 @@ new_range:
 		 * this simple case and complain if there is a gap in
 		 * memory
 		 */
-		if (node_data[numa_domain].node_size) {
+		if (node_data[numa_domain].node_spanned_pages) {
 			unsigned long shouldstart =
 				node_data[numa_domain].node_start_pfn + 
-				node_data[numa_domain].node_size;
+				node_data[numa_domain].node_spanned_pages;
 			if (shouldstart != (start / PAGE_SIZE)) {
 				printk(KERN_ERR "Hole in node, disabling "
 						"region start %lx length %lx\n",
 						start, size);
 				continue;
 			}
-			node_data[numa_domain].node_size += size / PAGE_SIZE;
+			node_data[numa_domain].node_spanned_pages += size / PAGE_SIZE;
 		} else {
 			node_data[numa_domain].node_start_pfn =
 				start / PAGE_SIZE;
-			node_data[numa_domain].node_size = size / PAGE_SIZE;
+			node_data[numa_domain].node_spanned_pages = size / PAGE_SIZE;
 		}
 
 		for (i = start ; i < (start+size); i += MEMORY_INCREMENT)
@@ -202,7 +202,7 @@ void setup_nonnuma(void)
 		map_cpu_to_node(i, 0);
 
 	node_data[0].node_start_pfn = 0;
-	node_data[0].node_size = lmb_end_of_DRAM() / PAGE_SIZE;
+	node_data[0].node_spanned_pages = lmb_end_of_DRAM() / PAGE_SIZE;
 
 	for (i = 0 ; i < lmb_end_of_DRAM(); i += MEMORY_INCREMENT)
 		numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0;
@@ -224,12 +224,12 @@ void __init do_init_bootmem(void)
 		unsigned long bootmem_paddr;
 		unsigned long bootmap_pages;
 
-		if (node_data[nid].node_size == 0)
+		if (node_data[nid].node_spanned_pages == 0)
 			continue;
 
 		start_paddr = node_data[nid].node_start_pfn * PAGE_SIZE;
 		end_paddr = start_paddr + 
-				(node_data[nid].node_size * PAGE_SIZE);
+				(node_data[nid].node_spanned_pages * PAGE_SIZE);
 
 		dbg("node %d\n", nid);
 		dbg("start_paddr = %lx\n", start_paddr);
@@ -311,7 +311,7 @@ void __init paging_init(void)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (node_data[nid].node_size == 0)
+		if (node_data[nid].node_spanned_pages == 0)
 			continue;
 
 		start_pfn = plat_node_bdata[nid].node_boot_start >> PAGE_SHIFT;
diff -pur linux-2.5.73-mm1/arch/x86_64/mm/init.c linux-2.5.73-mm1-work/arch/x86_64/mm/init.c
--- linux-2.5.73-mm1/arch/x86_64/mm/init.c	Thu Jun 26 16:04:24 2003
+++ linux-2.5.73-mm1-work/arch/x86_64/mm/init.c	Thu Jun 26 16:41:37 2003
@@ -64,7 +64,7 @@ void show_mem(void)
 	printk("Free swap:       %6dkB\n",nr_swap_pages<<(PAGE_SHIFT-10));
 
 	for_each_pgdat(pgdat) {
-               for (i = 0; i < pgdat->node_size; ++i) {
+               for (i = 0; i < pgdat->node_spanned_pages; ++i) {
                        page = pgdat->node_mem_map + i;
 		total++;
                        if (PageReserved(page))
diff -pur linux-2.5.73-mm1/arch/x86_64/mm/numa.c linux-2.5.73-mm1-work/arch/x86_64/mm/numa.c
--- linux-2.5.73-mm1/arch/x86_64/mm/numa.c	Thu Jun 26 16:04:24 2003
+++ linux-2.5.73-mm1-work/arch/x86_64/mm/numa.c	Thu Jun 26 16:42:20 2003
@@ -86,7 +86,7 @@ void __init setup_node_bootmem(int nodei
 	memset(NODE_DATA(nodeid), 0, sizeof(pg_data_t));
 	NODE_DATA(nodeid)->bdata = &plat_node_bdata[nodeid];
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
-	NODE_DATA(nodeid)->node_size = end_pfn - start_pfn;
+	NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
 	/* Find a place for the bootmem map */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn); 
diff -pur linux-2.5.73-mm1/include/asm-alpha/mmzone.h linux-2.5.73-mm1-work/include/asm-alpha/mmzone.h
--- linux-2.5.73-mm1/include/asm-alpha/mmzone.h	Thu Jun 26 16:04:30 2003
+++ linux-2.5.73-mm1-work/include/asm-alpha/mmzone.h	Thu Jun 26 16:42:36 2003
@@ -31,7 +31,6 @@ extern pg_data_t node_data[];
 
 #define pa_to_nid(pa)		alpha_pa_to_nid(pa)
 #define NODE_DATA(nid)		(&node_data[(nid)])
-#define node_size(nid)		(NODE_DATA(nid)->node_size)
 
 #define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
@@ -124,7 +123,7 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, 
 #define pfn_to_nid(pfn)		pa_to_nid(((u64)pfn << PAGE_SHIFT))
 #define pfn_valid(pfn)							\
 	(((pfn) - node_start_pfn(pfn_to_nid(pfn))) <			\
-	 node_size(pfn_to_nid(pfn)))					\
+	 node_spanned_pages(pfn_to_nid(pfn)))					\
 
 #define virt_addr_valid(kaddr)	pfn_valid((__pa(kaddr) >> PAGE_SHIFT))
 
diff -pur linux-2.5.73-mm1/include/asm-i386/mmzone.h linux-2.5.73-mm1-work/include/asm-i386/mmzone.h
--- linux-2.5.73-mm1/include/asm-i386/mmzone.h	Thu Jun 26 16:04:31 2003
+++ linux-2.5.73-mm1-work/include/asm-i386/mmzone.h	Thu Jun 26 17:14:56 2003
@@ -32,8 +32,7 @@ extern struct pglist_data *node_data[];
 #define alloc_bootmem_low_pages_node(ignore, x) \
 	__alloc_bootmem_node(NODE_DATA(0), (x), PAGE_SIZE, 0)
 
-#define node_size(nid)		(node_data[nid]->node_size)
-#define node_localnr(pfn, nid)	((pfn) - node_data[nid]->node_start_pfn)
+#define node_localnr(pfn, nid)		((pfn) - node_data[nid]->node_start_pfn)
 
 /*
  * Following are macros that each numa implmentation must define.
@@ -54,7 +53,7 @@ extern struct pglist_data *node_data[];
 #define node_end_pfn(nid)						\
 ({									\
 	pg_data_t *__pgdat = NODE_DATA(nid);				\
-	__pgdat->node_start_pfn + __pgdat->node_size;			\
+	__pgdat->node_start_pfn + __pgdat->node_spanned_pages;		\
 })
 
 #define local_mapnr(kvaddr)						\
diff -pur linux-2.5.73-mm1/include/asm-mips64/mmzone.h linux-2.5.73-mm1-work/include/asm-mips64/mmzone.h
--- linux-2.5.73-mm1/include/asm-mips64/mmzone.h	Thu Jun 26 16:07:24 2003
+++ linux-2.5.73-mm1-work/include/asm-mips64/mmzone.h	Thu Jun 26 16:43:03 2003
@@ -24,7 +24,7 @@ extern plat_pg_data_t *plat_node_data[];
 
 #define PHYSADDR_TO_NID(pa)		NASID_TO_COMPACT_NODEID(NASID_GET(pa))
 #define PLAT_NODE_DATA(n)		(plat_node_data[n])
-#define PLAT_NODE_DATA_SIZE(n)	     (PLAT_NODE_DATA(n)->gendata.node_size)
+#define PLAT_NODE_DATA_SIZE(n)	     (PLAT_NODE_DATA(n)->gendata.node_spanned_pages)
 #define PLAT_NODE_DATA_LOCALNR(p, n) \
 		(((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
 
diff -pur linux-2.5.73-mm1/include/asm-ppc64/mmzone.h linux-2.5.73-mm1-work/include/asm-ppc64/mmzone.h
--- linux-2.5.73-mm1/include/asm-ppc64/mmzone.h	Thu Jun 26 16:04:26 2003
+++ linux-2.5.73-mm1-work/include/asm-ppc64/mmzone.h	Thu Jun 26 16:37:03 2003
@@ -54,7 +54,6 @@ static inline int pa_to_nid(unsigned lon
  */
 #define NODE_DATA(nid)		(&node_data[nid])
 
-#define node_size(nid)		(NODE_DATA(nid)->node_size)
 #define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
 /*
diff -pur linux-2.5.73-mm1/include/asm-x86_64/mmzone.h linux-2.5.73-mm1-work/include/asm-x86_64/mmzone.h
--- linux-2.5.73-mm1/include/asm-x86_64/mmzone.h	Thu Jun 26 16:04:26 2003
+++ linux-2.5.73-mm1-work/include/asm-x86_64/mmzone.h	Thu Jun 26 16:43:18 2003
@@ -40,8 +40,7 @@ static inline __attribute__((pure)) int 
 #define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
-				 NODE_DATA(nid)->node_size)
-#define node_size(nid)		(NODE_DATA(nid)->node_size)
+				 NODE_DATA(nid)->node_spanned_pages)
 
 #define local_mapnr(kvaddr) \
 	( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) )
diff -pur linux-2.5.73-mm1/include/linux/mmzone.h linux-2.5.73-mm1-work/include/linux/mmzone.h
--- linux-2.5.73-mm1/include/linux/mmzone.h	Thu Jun 26 16:04:27 2003
+++ linux-2.5.73-mm1-work/include/linux/mmzone.h	Thu Jun 26 16:37:59 2003
@@ -184,11 +184,16 @@ typedef struct pglist_data {
 	unsigned long *valid_addr_bitmap;
 	struct bootmem_data *bdata;
 	unsigned long node_start_pfn;
-	unsigned long node_size;
+	unsigned long node_present_pages; /* total number of physical pages */
+	unsigned long node_spanned_pages; /* total size of physical page 
+					     range, including holes */
 	int node_id;
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 } pg_data_t;
+
+#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
+#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
 
 extern int numnodes;
 extern struct pglist_data *pgdat_list;
diff -pur linux-2.5.73-mm1/mm/page_alloc.c linux-2.5.73-mm1-work/mm/page_alloc.c
--- linux-2.5.73-mm1/mm/page_alloc.c	Thu Jun 26 16:07:26 2003
+++ linux-2.5.73-mm1-work/mm/page_alloc.c	Thu Jun 26 16:50:45 2003
@@ -903,7 +903,7 @@ void si_meminfo_node(struct sysinfo *val
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 
-	val->totalram = pgdat->node_size;
+	val->totalram = pgdat->node_present_pages;
 	val->freeram = nr_free_pages_pgdat(pgdat);
 	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
 	val->freehigh = pgdat->node_zones[ZONE_HIGHMEM].free_pages;
@@ -1138,12 +1138,13 @@ static void __init calculate_zone_totalp
 
 	for (i = 0; i < MAX_NR_ZONES; i++)
 		totalpages += zones_size[i];
-	pgdat->node_size = totalpages;
+	pgdat->node_spanned_pages = totalpages;
 
 	realtotalpages = totalpages;
 	if (zholes_size)
 		for (i = 0; i < MAX_NR_ZONES; i++)
 			realtotalpages -= zholes_size[i];
+	pgdat->node_present_pages = realtotalpages;
 	printk("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
 }
 
@@ -1349,7 +1350,7 @@ void __init free_area_init_node(int nid,
 	pgdat->node_start_pfn = node_start_pfn;
 	calculate_zone_totalpages(pgdat, zones_size, zholes_size);
 	if (!node_mem_map) {
-		size = (pgdat->node_size + 1) * sizeof(struct page); 
+		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page); 
 		node_mem_map = alloc_bootmem_node(pgdat, size);
 	}
 	pgdat->node_mem_map = node_mem_map;

--=-0X7S2SAZNKi17LAVnQ0Z--

